
 

Use Scenarios 


Replication is the copying of data from one place to another and the ongoing synchronization of the resulting copies across different databases.

Replication is distinct from caching. Both are common strategies for scaling a distributed system, but replication is driven by the server side: the server decides where and when to copy data. Caching happens on the client side: a client requests a file and stores a copy locally for future use. Replication addresses a broader set of issues, such as message traffic, turnaround time, and server-to-server interaction, and can serve many purposes. Caching does not address these wider performance concerns; it is used only to improve response time.

Replication is also distinct from backup. A backup normally copies file sets to removable media (disks or tapes) and organizes multiple versions of the files by time; the copies are not automatically overwritten when the original data is modified. Replication normally creates a second copy that is continuously updated along with the primary data. An application can therefore access the replica directly if the primary data is unavailable or corrupted, which gives very rapid recovery times.

There are many reasons to use replication:

Enable Off-line Access

Increasingly, organizations need to deploy many applications that require the ability to use and manipulate data. Replication enables users to work on a subset of a database while disconnected from the central database server. Later, when a connection is established, users can synchronize their local replicas on demand: they update the central database with all of their changes and receive any changes that happened while they were disconnected. The replication system must be able to create multiple replica environments quickly and to use variables to customize each replica environment for its individual needs.

An extreme case is mass deployment: applications that distribute database infrastructure, data, and front-end applications to a large number of users. These applications, such as sales force automation, field service, and retail, typically require data to be periodically synchronized between central database systems and a large number of small remote sites, which are often disconnected from the central database. Members of a sales force must be able to complete transactions regardless of whether they are connected to the central database, so remote sites must be autonomous. Consider a mobile sales force: potentially hundreds (or even thousands) of professionals need accurate information about their customers on a laptop, delivered in a way that causes the salesperson very little inconvenience. This requires the database administrator to roll out data and the database infrastructure (tables, indexes, triggers, and so on) to all sites in an efficient and timely manner.
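The disconnect, work, and synchronize cycle described in this section can be sketched as follows. This is a minimal illustration, not a real replication product: the `Replica` class, the `synchronize` function, the sample customer records, and the "local change wins" conflict rule are all assumptions made for the example.

```python
class Replica:
    """A toy key-value replica that records changes made while offline."""

    def __init__(self, data=None):
        self.data = dict(data or {})   # private copy of the central subset
        self.pending = {}              # local changes not yet sent upstream

    def write(self, key, value):
        self.data[key] = value
        self.pending[key] = value


def synchronize(replica, central, changed_centrally):
    """Push local changes to the central store, then pull central changes.

    Assumed conflict rule: if a key changed on both sides, the local
    change wins. Real systems offer configurable conflict resolution.
    """
    # Push: apply every pending local change to the central database.
    central.update(replica.pending)
    # Pull: fetch keys changed centrally, except those just overwritten.
    for key in changed_centrally:
        if key not in replica.pending:
            replica.data[key] = central[key]
    replica.pending.clear()


central = {"acme": "prospect", "globex": "customer"}
laptop = Replica(central)

# While disconnected, the salesperson updates one record locally ...
laptop.write("acme", "customer")
# ... and headquarters changes another record centrally.
central["globex"] = "inactive"

# On reconnect, both sides converge.
synchronize(laptop, central, changed_centrally={"globex"})
```

After the call, both the central store and the laptop replica hold the salesperson's update and the headquarters update.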

High availability

Businesses today face a critical need to ensure the availability and continuous operation of their business systems in spite of potential failures, ranging from disk crashes and CPU failures to catastrophic losses of their computing facilities or communications networks, as well as planned downtime for maintenance. For example, many companies run mission-critical applications: those where failure of execution, or faulty execution, may have catastrophic results. In business environments, information systems managers would consider a system mission-critical if its failure could lead to loss of money (e.g. banking and telecom), serious inability to conduct business (e.g. online investment systems or accounting systems), or serious operational chaos (e.g. electronic trading systems or electronic data interchange systems). These applications require data on multiple servers to be synchronized in a continuous, nearly instantaneous manner to ensure that the service provided is available and equivalent at all times.

Using replication, the user can maintain a near real-time "warm standby" database to which applications can switch with virtually no downtime if the primary site fails. Replication can be configured to replicate the entire database, thus creating a complete mirror of the database.
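A warm standby with failover might be sketched like this. The `Database` and `ReplicatedPair` classes are hypothetical, and the sketch copies each write to the standby synchronously for simplicity, whereas the near real-time replication described above would typically ship changes asynchronously.

```python
class Database:
    """A toy database: a dict of rows plus an availability flag."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.available = True


class ReplicatedPair:
    """A primary database with a warm standby that mirrors every write."""

    def __init__(self, primary, standby):
        self.primary = primary
        self.standby = standby

    def _active(self):
        # Applications use the primary while it is up, else the standby.
        if self.primary.available:
            return self.primary
        if self.standby.available:
            return self.standby
        raise RuntimeError("no database available")

    def write(self, key, value):
        active = self._active()
        active.rows[key] = value
        # Mirror the write so the standby stays a complete copy.
        if active is self.primary and self.standby.available:
            self.standby.rows[key] = value

    def read(self, key):
        return self._active().rows[key]


pair = ReplicatedPair(Database("primary"), Database("standby"))
pair.write("order-1", "paid")
pair.primary.available = False          # simulate a primary-site failure
status_after_failover = pair.read("order-1")
```

Because the standby already holds the replicated row, the read after the simulated failure still succeeds.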

Backup

While off-site tape dumps have traditionally satisfied the disaster recovery requirements of batch systems, they are typically inadequate for protecting the information in online transaction processing (OLTP) systems and e-business. Replication facilities can continuously duplicate critical online system and e-business application information to off-site backup facilities without the high latency inherent in tape backup strategies. Once established, such an environment can be automated to ensure that information is replicated in a timely manner and that the switch to backup systems is accomplished with minimal business interruption.

Load Balance

Since replication can be used to distribute data over multiple regional locations, it spreads most of the work among several servers. Some users can access one server while other users access different servers, thereby reducing the load at all servers. Then, applications can access various regional servers instead of accessing one central server. This configuration can also reduce network load dramatically.
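As a small illustration of how regional replicas spread load, the sketch below routes each request to the server for the user's region and falls back to the central server otherwise. The server names, regions, and request list are made up for the example.

```python
from collections import Counter

# Hypothetical regional replicas and central server.
REGIONAL_SERVERS = {"east": "db-east", "west": "db-west"}
CENTRAL_SERVER = "db-central"


def pick_server(user_region):
    """Route a user to the regional replica, or to the central server."""
    return REGIONAL_SERVERS.get(user_region, CENTRAL_SERVER)


# Six requests from three regions; only "north" has no regional replica.
requests = ["east", "west", "east", "west", "north", "east"]
load = Counter(pick_server(region) for region in requests)
```

Without the replicas, all six requests would reach `db-central`; with them, the central server handles only the request from the region that has no replica.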

Data Distribution

Distribution of data involves moving all or a subset of the data to one or more locations. Often, this involves data transformation and renormalization. Subsets of the data can be copied to data marts to provide groups of users with local access. This allows users to leverage enterprise data with business intelligence tools, while maintaining the security and performance of production applications.

Distribution of data is also used to provide data to applications in the same or different environments. This can be as simple as maintaining a copy of the production data on another similar system. There may be complex data transformation needed to fit new application requirements. The new application may be a Web application, a purchased package, or an application distributed on multiple laptop computers. The data that is copied may need to be filtered and/or transformed for the target application.

Distribution of data can also be used to provide application co-existence when migrating from one environment to another. Legacy data can be copied to the new environment for reference by the new applications until such time as the legacy applications are migrated to the new environment.

Replication helps companies find the balance between centralization and decentralization. It allows data to become a corporate asset that is stored centrally or distributed, allows an organization to locate data where it is needed, allows bi-directional data sharing with a safe approach for replicating remote updates, and also allows a corporate overview of distributed operations that is very close to real time, even when distributed business units run on a variety of hardware and DBMS platforms. Data replication is a tool that helps companies put necessary data in the hands of local decision-makers and also maintains firm central control over the data.

Consolidation of data from remote systems

An enterprise may have data on many different distributed systems. Retail companies have data at each store. Manufacturing companies have data at each plant. Insurance companies have data at each branch office or on each salesperson’s laptop computer. Data consolidation is needed to migrate data from several database servers to one central database server for centralized data analysis, auditing, and decision support. Data consolidation also helps protect data by administering the backups of remote locations centrally, which reduces the hardware cost of decentralization and the risk of data loss.

Here is a scenario of this:

  • “A company’s global ‘Customer’ table data resides at the headquarter (DB1 in New York) and is distributed across branch offices (DB3 in Los Angeles, DB4 in Dallas and DB5 in San Francisco for example).”

  • “On a daily basis at the headquarter office, customer data is analyzed for the whole company and whenever appropriate, the data is updated to reflect the new customer status/rating/outstanding offers information.”

  • “Every evening, the Los Angeles DB3 and Dallas DB4 branch offices replicate the changes that occurred for the customers who live in their area, only if such changes exist.”  

  • “Assuming that the 2 branch office sites are already synchronized with the headquarter site.”

Replication is a key solution for data consolidation. It copies changes from each of the distributed sites to a central site for analysis, reporting, and enterprise application processing.
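The evening step in the scenario above, where each branch receives only the changes for customers in its area and only if such changes exist, can be sketched as follows. The `Change` record, the function names, and the sample ratings are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Change:
    """One update made at headquarters to a row of the Customer table."""
    customer_id: int
    city: str
    rating: str


def changes_for_branch(change_log, branch_city):
    """Select the changes that concern customers in one branch's area."""
    return [c for c in change_log if c.city == branch_city]


def nightly_replication(change_log, branch_tables):
    """Apply the day's changes to each branch table.

    Returns the branches that actually received updates; branches with
    no relevant changes are skipped entirely.
    """
    updated = []
    for city, table in branch_tables.items():
        relevant = changes_for_branch(change_log, city)
        if not relevant:
            continue  # replicate only if changes exist
        for change in relevant:
            table[change.customer_id] = change.rating
        updated.append(city)
    return updated


change_log = [
    Change(1, "Los Angeles", "gold"),
    Change(2, "Los Angeles", "silver"),
    Change(3, "San Francisco", "gold"),
]
# Only the Los Angeles and Dallas branches replicate in the evening.
branch_tables = {
    "Los Angeles": {1: "silver", 2: "silver"},
    "Dallas": {7: "bronze"},
}
updated = nightly_replication(change_log, branch_tables)
```

Here only the Los Angeles table changes; Dallas has no relevant changes, so nothing is sent to it.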

Bidirectional exchange of data

If the data can be updated at multiple locations, then replication must process changes made at any of the sites in a coordinated fashion. One location serves as the master location and distributes changes to the target locations. Changes made at the targets flow to other target sites through the master.

Bidirectional replication can be used for mobile applications where the target may be a computer in a branch office or in a delivery truck. Often, there are many targets and they are only occasionally connected to the source system. The connection may be via phone lines, so efficiency is important.
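The master-based flow described in this section, where a change made at one target reaches the other targets through the master, might look like the sketch below. The `Site` and `Master` classes and the `update_at` helper are hypothetical, and conflict handling for concurrent updates to the same key is deliberately left out.

```python
class Site:
    """A target location holding its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}


class Master(Site):
    """The master location: receives changes and redistributes them."""

    def __init__(self, name, targets):
        super().__init__(name)
        self.targets = targets

    def receive(self, origin, key, value):
        # Apply the change at the master first ...
        self.data[key] = value
        # ... then forward it to every target except the one it came from.
        for target in self.targets:
            if target is not origin:
                target.data[key] = value


truck = Site("truck")
branch = Site("branch")
master = Master("master", [truck, branch])


def update_at(site, key, value):
    """Make a change at a target site and send it to the master."""
    site.data[key] = value
    master.receive(site, key, value)


update_at(truck, "delivery-17", "done")
```

The change made on the truck ends up at the master and at the branch without the two targets ever talking to each other directly.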

Improve Data Access Efficiency in a Data Grid

A data grid is a grid computing system that deals with data: the controlled sharing and management of large amounts of distributed data. Many scientific and engineering applications require access to large amounts of distributed data, and replication has emerged as a major technique for improving data access in a data grid. Studies show that even a simple replication strategy can give much better response performance than no replication. The following use cases illustrate the requirements of replication in a data grid:

(1) The LIGO System

The Laser Interferometer Gravitational Wave Observatory (LIGO) collaboration replicates data extensively and stores more than 40 million files across ten locations. The data management requirements for LIGO are as follows:

| Type of Data | Generator | File Size | Frequency / Data Volume |
|---|---|---|---|
| Raw data | LIGO detectors in Livingston, Louisiana; Hanford, Washington; and Germany | 1 to 100 MB | Approximately a terabyte per day. Each detector produced a file every 16 seconds containing measurement data. Approx. 2/3 of the 40 million LIGO files. |
| New or derived data | Scientists at each of the 10 sites | Highly variable | Approx. 1/3 of the 40 million LIGO files; the proportion is rapidly increasing. |
| Metadata attributes | When data is published | | |

 Fig 1 LIGO Data Management Requirements

As illustrated in Fig 2, the LIGO system (LIGO Project, 2004) deploys the following components:

a.   A Local Storage, which stores data replicas.

b.   A GridFTP Server for efficient file transfer.

c.   A fully replicated Metadata Catalog, which maintains a current copy of all associations between logical file names and attributes.

d.   A Local Replica Catalog, which stores mappings from logical names to physical storage locations. It is also responsible for periodically sending summaries of its state (the existence of newly replicated files) to the Replica Location Index servers at all LIGO locations, where they are used for discovery.

e.   A Replica Location Index, which collects state summaries from the Local Replica Catalogs at all 10 LIGO sites.

f.   A Scheduling Daemon, which issues SQL queries to the site’s local Metadata Catalog to request sets of logical files and queues these files based on their priorities.

g.   A Transfer Daemon, which periodically checks the queue of requested files. It looks up the Replica Location Index to find the files' locations in the Grid, pulls the requested files into the Local Storage, and finally registers each newly copied file in the Local Replica Catalog.


Fig 2 Architecture of LIGO
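The interaction between the Scheduling Daemon, the Transfer Daemon, the Replica Location Index, and the Local Replica Catalog can be sketched with in-memory stand-ins. Nothing here is a real LIGO or Globus API: the dictionaries, file names, site names, and paths are hypothetical, the SQL query against the Metadata Catalog is replaced by a ready-made request list, and the GridFTP transfer is replaced by a dictionary assignment.

```python
import heapq

# Replica Location Index: logical file name -> sites holding a replica.
rli = {"a.dat": ["site_a"], "b.dat": ["site_b"]}
# Remote storage: site -> {logical file name -> physical path}.
site_storage = {
    "site_a": {"a.dat": "/data/a.dat"},
    "site_b": {"b.dat": "/data/b.dat"},
}
LOCAL_SITE = "local"
local_storage = {}          # physical path -> file contents (stand-in)
local_replica_catalog = {}  # logical file name -> local physical path


def schedule(requests):
    """Scheduling Daemon: queue (priority, logical_name) pairs.

    A lower number means a higher priority.
    """
    queue = list(requests)
    heapq.heapify(queue)
    return queue


def transfer_pending(queue):
    """Transfer Daemon: fetch queued files in priority order."""
    transferred = []
    while queue:
        _, name = heapq.heappop(queue)
        if name in local_replica_catalog:
            continue                      # already replicated locally
        sources = rli.get(name, [])
        if not sources:
            continue                      # no known replica in the grid
        source = sources[0]
        remote_path = site_storage[source][name]
        local_path = f"/local/{name}"
        # Stand-in for the GridFTP transfer into Local Storage.
        local_storage[local_path] = f"copy of {source}:{remote_path}"
        # Register the new replica in the Local Replica Catalog.
        local_replica_catalog[name] = local_path
        # Stand-in for the periodic summary sent to the index servers.
        rli[name].append(LOCAL_SITE)
        transferred.append(name)
    return transferred


queue = schedule([(2, "b.dat"), (1, "a.dat")])
done = transfer_pending(queue)
```

The higher-priority file is transferred first, and after the run the index also lists the local site as a location for both files.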

(2) The SIMDAT Project
SIMDAT aims to implement and enhance data grid technology to support industrial and large-scale products and services.

From its distributed architecture (Fig 3), we can see that each of the 5 V-GISC sites has the same components: a Local DB (storing data and their metadata), a Portal (interacting with users), and a Catalogue Node (the main component for data management).


Fig 3 SIMDAT V-GISC distributed architecture

Fig 4 depicts the components of each V-GISC node, including a Virtual Database Service, which uses OGSA-DAI to connect to the local data repository; a Metadata Manager; a Data Manager; a Monitoring Service; a Service Broker; a Management Service; etc.

 

The data management tasks of the V-GISC nodes are:

  • Harvest metadata: regularly harvest metadata updates from the Data Repositories under its responsibility.
  • Ingest the received metadata into its local catalogue.
  • Maintain the catalogue and synchronize it with the other nodes using the Data Communication Layer (DCL). When an update is received, the catalogue is updated and synchronization messages are sent to the other nodes.
  • Serve clients by answering requests received from the associated Portal or from other nodes:
          i. Data requests are either sent to the local Data Repositories or forwarded to another V-GISC node to access a remote Data Repository.
          ii. The node monitors request execution and, once the data is ready, sends it to the user.
  • Ingest data coming from external sources: the other GISC.

Fig 4 Components of the V-GISC node
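The catalogue synchronization and request forwarding tasks listed above can be sketched with a toy node class. This is not the SIMDAT implementation: the `VGiscNode` class, its method names, and the sample dataset are assumptions, and direct method calls between nodes stand in for the Data Communication Layer.

```python
class VGiscNode:
    """A toy V-GISC node with a local repository and a shared catalogue."""

    def __init__(self, name):
        self.name = name
        self.repository = {}  # local Data Repository: dataset id -> data
        self.catalogue = {}   # dataset id -> name of the node holding it
        self.peers = []       # the other V-GISC nodes

    def publish(self, dataset_id, data):
        """Ingest a local dataset and synchronize the catalogue entry."""
        self.repository[dataset_id] = data
        self.catalogue[dataset_id] = self.name
        for peer in self.peers:
            peer.receive_sync(dataset_id, self.name)

    def receive_sync(self, dataset_id, owner):
        """Apply a synchronization message from another node."""
        self.catalogue[dataset_id] = owner

    def request(self, dataset_id):
        """Serve locally if possible, otherwise forward to the owner."""
        owner = self.catalogue.get(dataset_id)
        if owner is None:
            raise KeyError(dataset_id)
        if owner == self.name:
            return self.repository[dataset_id]
        remote = next(p for p in self.peers if p.name == owner)
        return remote.request(dataset_id)


node_a = VGiscNode("a")
node_b = VGiscNode("b")
node_a.peers = [node_b]
node_b.peers = [node_a]

node_a.publish("dataset-1", [1.5, 2.5])
result = node_b.request("dataset-1")   # forwarded from node b to node a
```

Node b never stores the data itself; it only knows from its synchronized catalogue that node a holds the dataset, and forwards the request there.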