Replication is the copying of data from one place to another and the maintenance of synchronization among the resulting databases. Replication is distinct from caching. Although both are common strategies for scaling a distributed system, replication is server-side behaviour: the server decides where and when to copy. Caching happens on the client side: a client requests a file and stores a copy locally for future use. Replication addresses a broader set of issues, such as message traffic, turnaround time, and server-to-server interaction, and can serve many purposes; caching does not address these broader concerns and is used only to improve response time.

Replication is also distinct from backup. Backup normally copies file sets to removable media (disks or tapes) and organizes multiple versions of files by time; the copies are not automatically overwritten when the original data is modified. Replication normally creates a second copy that is continuously updated along with the primary data. An application can therefore access the copy directly if the primary data is unavailable or corrupted, which gives very rapid recovery times.

There are many reasons to use replication:

Enable Off-line Access

Increasingly,
organizations need to deploy many applications that must be able to use and manipulate data. Replication enables users to work on a subset of a database while disconnected from the central database server. Later, when a connection is established, users can synchronize their local replicas on demand: they update the central database with all of their changes and receive any changes that were made while they were disconnected. Replication is also expected to create multiple replica environments quickly and to support variables that customize each replica environment for its individual needs.

An extreme case is mass deployment: applications that distribute database infrastructure, data, and front-end applications to a large number of users. Such applications (sales force automation, field service, retail) typically require data to be synchronized periodically between central database systems and a large number of small remote sites, which are often disconnected from the central database. Members of a sales force must be able to complete transactions whether or not they are connected to the central database, so the remote sites must be autonomous. Consider a mobile sales force: potentially hundreds or even thousands of professionals need accurate information about their customers on a laptop, with very little inconvenience to the salesperson. This requires the database administrator to roll out the data and the database infrastructure (tables, indexes, triggers, and so on) to all sites in an efficient and timely manner.

High availability

Businesses
today face a critical need to ensure the availability and continuous operation of their business systems despite potential failures, ranging from disk crashes and CPU failures to the catastrophic loss of computing facilities or communications networks, as well as planned downtime for maintenance. For example, many companies run mission-critical applications: those where failure of execution, or faulty execution, may have catastrophic results. In business environments, information systems managers consider a system mission-critical if its failure could lead to loss of money (e.g. banking and telecom), serious inability to conduct business (e.g. online investment systems or accounting systems), or serious operational chaos (e.g. electronic trading systems or electronic data interchange systems). These applications require data on multiple servers to be synchronized in a continuous, nearly instantaneous manner so that the service provided is available and equivalent at all times. Using replication, the user can maintain a near real-time "warm standby" database to which applications can switch with virtually no downtime if the primary site fails. Replication can also be configured to replicate the entire database, creating a complete mirror.

Backup

While off-site
tape dumps have traditionally satisfied the disaster recovery requirements of batch systems, they are typically inadequate for protecting the information in On-line Transaction Processing systems and e-business. Replication facilities can continuously duplicate critical online system and e-business application information to off-site backup facilities without the high latency inherent in tape backup strategies. Once established, such an environment can be automated to ensure that information is replicated in a timely manner and that the switch to backup systems is accomplished with minimal business interruption.

Load Balance

Since replication can distribute data over multiple regional locations, it spreads most of the work among several servers. Some users access one server while other users access different servers, which reduces the load on every server. Applications can then access various regional servers instead of one central server. This configuration can also reduce network load dramatically.

Data Distribution

Distribution of data involves moving all or a subset of the data to one or more locations. Often, this involves data transformation and renormalization. Subsets of the data can be copied to data marts to give groups of users local access. This allows users to leverage enterprise data with business intelligence tools while maintaining the security and performance of production applications.

Distribution of data is also used to provide data to applications in the same or different environments. This can be as simple as maintaining a copy of the production data on another similar system, or it may require complex data transformation to fit new application requirements. The new application may be a Web application, a purchased package, or an application distributed on multiple laptop computers. The copied data may need to be filtered and/or transformed for the target application.

Distribution of data can also provide application co-existence when migrating from one environment to another. Legacy data can be copied to the new environment for reference by the new applications until the legacy applications are migrated to the new environment.

Replication helps companies find the balance between centralization and decentralization.
It allows data to become a corporate asset that is stored centrally or distributed, allows an organization to locate data where it is needed, allows bi-directional data sharing with a safe approach for replicating remote updates, and allows a corporate overview of distributed operations that is very close to real time, even when distributed business units run on a variety of hardware and DBMS platforms. Data replication is a tool that helps companies put necessary data in the hands of local decision-makers while maintaining firm central control over the data.

Consolidation of data from remote systems

An enterprise may have data on many different distributed systems. Retail companies have data at each store. Manufacturing companies have data at each plant. Insurance companies have data at each branch office or on each salesperson's laptop computer. Data consolidation is needed to migrate data from several database servers to one central database server for centralized data analysis, audit, and decision support. Data consolidation also helps protect data by administering the backups of remote locations centrally, which reduces the hardware cost of decentralization and the risk of data loss. Here is a scenario of this:
Replication is a key solution for data consolidation. It copies changes from each of the distributed sites to a central site for analysis, reporting, and enterprise application processing.

Bidirectional exchange of data

If the data can be updated at multiple locations, then replication must process changes made at any of the sites in a coordinated fashion. One location serves as the master location and distributes changes to the target locations. Changes made at the targets flow to other target sites through the master. Bidirectional replication can be used for mobile applications where the target may be a computer in a branch office or in a delivery truck. Often there are many targets, and they are only occasionally connected to the source system. The connection may be via phone lines, so efficiency is important.

To improve data access efficiency in a Data Grid

A
data grid is a grid computing system that deals with data: the controlled sharing and management of large amounts of distributed data. Many scientific and engineering applications require access to large amounts of distributed data, and replication has emerged as a major technique for improving data access in a data grid. Research shows that even a simple replication strategy can give much better response performance than no replication. The following use cases illustrate the requirements of replication in a data grid:

(1) The LIGO System

The Laser Interferometer Gravitational Wave Observatory (LIGO) collaboration replicates data extensively and stores more than 40 million files across ten locations. The data management requirements for LIGO are shown below:

Fig 1 LIGO Data Management Requirements
As illustrated in Fig 2, the LIGO (LIGO Project, 2004) system deploys the following components:

a. A Local Storage stores data replicas.
b. A GridFTP Server provides efficient file transfer.
c. A fully replicated Metadata Catalog maintains a current copy of all associations between logical file names and attributes.
d. A Local Replica Catalog stores mappings from logical names to physical storage locations. It is also responsible for periodically sending summaries of its state (the existence of newly replicated files) to the Replica Location Index servers at all LIGO locations, for discovery.
e. A Replica Location Index collects state summaries from the Local Replica Catalogs at all 10 LIGO sites.
f. A Scheduling Daemon issues SQL queries to the site's local Metadata Catalog to request sets of logical files and queues these files based on their priorities.
g. A Transfer Daemon periodically checks the queue of requested files. It looks up the Replica Location Index to find the files' locations in the Grid, pulls the requested files into the Local Storage, and finally registers the newly copied files in the Local Replica Catalog.

Fig 2 Architecture of LIGO

(2) The SIMDAT Project

From its distributed architecture (Fig 3), we can see that each of the 5 V-GISC sites has the same components: a Local DB (storing data and their metadata), a Portal (interacting with users), and a Catalogue Node (the main component for data management).

Fig 3 SIMDAT V-GISC distributed architecture

Fig 4 depicts the components of each V-GISC node, including a Virtual Database Service, which uses OGSA-DAI to connect to the local data repository; a Metadata Manager; a Data Manager; a Monitoring Service; a Service Broker; a Management Service; etc.
The data management tasks of the V-GISC nodes are:
i. Data requests are either sent to the local Data Repositories or forwarded to another V-GISC node to access a remote Data Repository.
ii. The node monitors request execution and, once the data is ready, sends it to the user.
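
The two tasks above (serve a request locally or forward it, and track the request until the data is returned) can be illustrated with a minimal Python sketch. This is not SIMDAT code: the `VGiscNode` class, its fields, the one-hop forwarding rule, and the dataset names are all assumptions made for illustration, with an in-memory dictionary standing in for a Data Repository.

```python
from dataclasses import dataclass, field


@dataclass
class VGiscNode:
    """Hypothetical stand-in for one V-GISC node and its local Data Repository."""
    name: str
    repository: dict = field(default_factory=dict)  # dataset id -> data
    peers: list = field(default_factory=list)       # other VGiscNode objects
    log: list = field(default_factory=list)         # simple request monitoring

    def handle_request(self, dataset_id, forwarded=False):
        """Serve a request locally if possible, otherwise forward it to a peer."""
        if dataset_id in self.repository:
            self.log.append((dataset_id, "served locally"))
            return self.repository[dataset_id]
        if not forwarded:  # assumption: forward at most one hop to avoid loops
            for peer in self.peers:
                data = peer.handle_request(dataset_id, forwarded=True)
                if data is not None:
                    self.log.append((dataset_id, f"forwarded to {peer.name}"))
                    return data
        self.log.append((dataset_id, "not found"))
        return None


# Hypothetical two-node setup.
node_a = VGiscNode("node-a", repository={"obs-001": "local observations"})
node_b = VGiscNode("node-b", repository={"fc-042": "remote forecast"})
node_a.peers.append(node_b)

print(node_a.handle_request("obs-001"))  # found in node-a's own repository
print(node_a.handle_request("fc-042"))   # forwarded to node-b
```

The `log` list is only a placeholder for the monitoring task; a real node would track asynchronous request execution rather than a synchronous call.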
Fig 4 Components of the V-GISC node
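
Returning to the LIGO use case (1), the Transfer Daemon's cycle (check the request queue, look up the Replica Location Index, pull the file into Local Storage, register it in the Local Replica Catalog) can be sketched as follows. This is a simplified illustration, not LIGO's implementation: the function name, the in-memory dictionaries standing in for the catalogs and storage, the location string format, and the file names are all hypothetical, and the actual GridFTP transfer is replaced by a dictionary copy.

```python
from collections import deque


def transfer_cycle(request_queue, replica_location_index, sites,
                   local_site, local_storage, local_replica_catalog):
    """Run one pass of a Transfer-Daemon-like loop over the queued requests.

    All arguments are plain in-memory stand-ins (hypothetical):
      request_queue          deque of logical file names
      replica_location_index dict: logical name -> list of site names
      sites                  dict: site name -> {logical name: file contents}
      local_storage          dict: logical name -> file contents
      local_replica_catalog  dict: logical name -> physical location string
    """
    unresolved = []
    while request_queue:
        lfn = request_queue.popleft()
        if lfn in local_replica_catalog:
            continue  # already replicated locally
        # 1. Ask the index which remote sites hold a replica.
        holders = [s for s in replica_location_index.get(lfn, ())
                   if s != local_site and lfn in sites.get(s, {})]
        if not holders:
            unresolved.append(lfn)  # no known replica; retry on a later pass
            continue
        # 2. Pull the file into local storage (a GridFTP transfer in LIGO).
        local_storage[lfn] = sites[holders[0]][lfn]
        # 3. Register the new replica in the local catalog.
        local_replica_catalog[lfn] = f"{local_site}:/storage/{lfn}"
    request_queue.extend(unresolved)


# Hypothetical data: one file is known at site-b, the other is known nowhere.
sites = {"site-b": {"lfn-0001": b"strain data"}}
rli = {"lfn-0001": ["site-b"]}
queue = deque(["lfn-0001", "lfn-9999"])
storage, lrc = {}, {}

transfer_cycle(queue, rli, sites, "site-a", storage, lrc)
print(lrc)          # lfn-0001 is now registered at site-a
print(list(queue))  # lfn-9999 stays queued because no replica is known
```

The sketch leaves the Replica Location Index untouched after a transfer, since in the description above the index learns about new replicas from the periodic summaries sent by each Local Replica Catalog.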