Mar 18

Simple Approaches for achieving Sync between two Databases of an application.

Introduction

Since the title of the article is quite vague, let me try to make it more specific. The main purpose of this article is to put down few of the simple approaches which can be chosen to build an offline version of an application. The article does not cover all areas for creating an offline application. Main idea here is to explain few of the approaches I came across or worked on, to sync two databases, where the tables between the two DBs are same and data needs to be transferred. The use case for such applications is when people want to create an offline version of their applications, which can run without internet for a while. Later they want to push and pull change to and from the server respectively. There are already off the shelf tools available to achieve DB sync, but mostly the issue with them is on customization effort. Therefore I wanted to put forward few of the simple things one can achieve programmatically with little effort to achieve the basic purpose. Remember cost of simplicity is mostly the complex scenario’s handling. Here also the approaches listed can help you achieve sync but may not work for every boundary case. Hence based on your requirements you might need to customize the given approaches further during your specific implementation.

The approaches listed down are neither exhaustive nor by any standards the best possible ways. But they can help achieve the purpose with least number of changes being made to the existing DB and will work in a practical scenario.

Main Issues of Concern

When one thinks about creating a copy of their DB, they need to answer the following questions

1. How much structural change can be made in tables to achieve the functionality of synchronization

2. What is expected time constraint for the DB sync process

3. Is there a possibility of conflicts to be taken care of and in such cases, what is the most acceptable policy

4. What existing properties of the table can be used, to identify candidate records for the sync process

5. What is the typical amount of data which needs to be synced and in what order data has to be sent.

6. How are the error situations to be handled? The errors can be both application errors as well as network errors.

7. What is the approach of primary key generation being followed and how is it to be managed when records are transferred from one DB to another

Approach 1: Log SQLs as and when application performs a Create or Update operation on DB

This is a very crude and simple approach to achieve this functionality. Such an approach can work if someone is not worried about primary key generation issues, managing conflicts etc.

Main Idea

The idea here is to log each create, delete or update SQL into a file or system table which the application fires onto the DB in any of the DBs. The onus of maintaining the order of SQLs with respect to DB constraints rests with the application. The biggest advantage of using this approach is the mere simplicity to sync two databases. The sync process is nothing more than a process of conduction a file transfer and then executing the logged SQL onto the target DB.

Advantages

· Simple approach

· Faster implementation (especially if writing a new simple application)

· Time for sync will be much less as nothing much is being done at that point.

Disadvantages

· The approach does not really do a “sync” but is a plain file transfer and execution

· No way to handle conflicts as no control over the SQLs being executed once they are logged

· The primary key value in two DBs is constrained to be the same and if using any auto generators this may result in duplication of IDs corrupting the DB itself (hence the use of this approach reduces greatly for any big application)

· The application code needs to be aware of the sync approach as it has to log every statement. Further to keep the code clean, it would be necessary to use a common piece of code from where DB interaction takes place. This may become cumbersome and may need creation of a small DB framework which can deal with all the tables of the application.

· For legacy application it’s not possible to change the code and accommodate logging of SQLs if not being done already.

· The order of SQLs in which they are being written onto a file is being controlled by the application instead of DB properties and constraints defined in the DB. Hence chances of committing errors are far more probable.

· You might be writing unnecessary queries for records which may be getting added or delete in between sync process.


Approach 2: Generating Insert and Updates at sync time using information from DB

Main Idea

This approach is a modification of the previous approach which helps in overcoming few of the limitations of Approach 1. The main idea is instead of logging the SQLs during the application, SQL Insert, Delete and Update statements when the sync is to be done. This gives much finer control over what data is to be synced and the way sync should happen. The flip side is that the time for sync increases much more as the process of creation of sync SQL file is to be done once sync is initiated.

Advantages

· Better control over which records need to be synced

· Need not sync unnecessary records.

· Can apply business logic on records which need to be synced by having a configuration file which can define different rules for tables and which all records of the tables need to be syed.

· Can apply the logic of sending smaller units of data between two DBs over the network. This becomes important when data to be sent over the network can result in timeout if the size is too huge.

· When a record is inserted and updated after sync, only one insert query needs to be fired on the other DB. This can be done in this approach.

Disadvantages

· Control over changing the IDs when going from one DB to another is not possible. This can be a big hindrance as people use Sequences for primary keys in the DB and when multiple offline DB trying to sync with a central DB, without achieving change of the ID, the record cannot be inserted.

· Order of SQLs in the file should be done more carefully as done in one shot. Hence one more necessary configuration detail is needed to tell the order in which SQLs need to be written. This problem can be overcome using the JDBC approach to read meta data and making sure you write SQLs in context of defined constraints.

· If same record is modified in both the DBs, conflict resolution in this approach becomes more cumbersome.

· Tracking Deletion is very hard. Since SQLs are getting generated when sync is done, the records which are deleted can only be tracked by keeping some information about them.


Approach 3: Using XML (a more generic format) to send the sync data

This is a more generic approach which can help in overcoming most of the limitations of the previous two approaches. The idea here is to not rely on writing SQLs on one side and firing them directly on the other side. Instead we should use a more generic format to send data over to the other server like XML (or may be something more efficient and small). Once candidate records for sync are identified, the records should be serialized in the chosen format and sent over to the other side. Then other side reads the data and fires Inserts, Updates and Deletes as the case may be.

The disadvantage with this approach is obviously the complexity and the processing involved for achieving the sync. Further for performing operations like conflict resolutions for IDs etc. this approach makes more sense as you can change the IDs and send the new set of IDs to the client DB to update.

Advantages

· A very generic solution

· Gives control over modifying IDs etc during sync.

· No need to send data for column which are calculated as the processing is anyways being done on the target DB

· The data can be collected in the file in a much more logical structure with related records being represented in a hierarchical way.

Disadvantages

· Implementation is much more complex

· The time for sync will go up significantly.

· Even after having generic representation to deal with conflicts and generation of news IDs, there has to be much more cross talk between the two DB machines.


Other Important Aspects and thoughts

The data sync can work only if tables metadata is same on both sides. In a real world application, there needs to be a mechanism to ensure that the metadata and DDL related aspects in the two DBs is already in sync. As JDBC allows DDL statements, one can even extend the previous approaches to deal with DDL changes and executing them before starting the data sync.

When data is to be sent over a network, one needs worry about aspects like request timeout. In such cases, it makes sense to send data in manageable amount of units, which can run by themselves on the other DB. Still ensuring data consistency is the prime concern and hence techniques like distributed transactions or even control tables (to achieve roll backs) may need to be applied on these units.

Typically any sync process needs to have a policy to deal with conflicts. There can be lot of simple as well complex options to deal with this issue

· Server always wins. If server changed a record ever, that takes precedence.

· The latest client doing a sync always wins. In this case the client which does the last sync overrides the changes made by the other clients (may or may not the server changes though)

· One can try to merge the records using some heuristics of which column should be updated.

· If required, there should be a small rule engine or levels of check to ensure that the conflict resolution reflects real requirements and in exception cases, it exists gracefully without corrupting the data.

The thoughts here are definitely abstract but already the article seems going too long. In following articles I will try and post some more code samples on few of the approaches and explain other aspects related to them.

One Response to “Approaches for Creating an offline DB for an application”

  1. Rakesh says:

    Really good article depicting all the possible approaches in a descriptive manner. What ever approach is taken one common thing is, onus of executing the changes should always be on the receiver. Receiver should not be blindly trying to merge things sent by the sender. Receiver should use some kind of heuristics to identify if the record is eligible for an insert or an update.

Leave a Reply

preload preload preload