Sharing related information among isolated systems has become increasingly important to organizations, as it allows them to improve the quality and availability of data. There are many situations where it is useful to have a data set that is available and consistent in more than one directory server. That’s why knowing the common methods for performing SQL Server data synchronization is important.
Data availability and consistency can be achieved by data replication and data synchronization processes. Data replication is the process of creating one or more redundant copies of a database for the purpose of fault tolerance or accessibility improvement. Data synchronization is the process of establishing data consistency between two or more databases, and the subsequent continuous updates to maintain said consistency.
In many organizations, performing data synchronization across diverse systems is both desirable and challenging. We can find many use cases where we need to perform data synchronization:
- Database migration
- Regular synchronization between information systems
- Importing data from one information system into another
- Moving data sets between different stages or environments
- Importing data from a non-database source
There is no unique way or unanimously agreed method for data synchronization. This task differs from case to case, and even data synchronizations that should be simple at first glance can be complicated, due to the complexity of data structures. In real scenarios, data synchronization consists of many complex tasks, which can take a long time to perform. When a new requirement comes up, database specialists usually have to reimplement the whole synchronization process. Since there are no standard ways of doing this, besides replication, the implementations of data synchronization are rarely optimal. This results in difficult maintenance and higher expenses. Implementation and maintenance of data synchronization is such a time consuming process, it can be a full-time job by itself.
We can implement architecture for data synchronization tasks manually, possibly using Microsoft Sync Framework, or we can benefit from already created solutions within tools for managing Microsoft SQL Server. We will try to describe the most common methods and tools that can be used to solve data synchronization on Microsoft SQL Server databases and try to give some recommendations.
Based on the structure of the source and destination (e.g., databases, tables) we can differentiate use cases when structures are similar or different.
This is very often the case when we use data in various stages of the software development lifecycle. For example, the data structure in the testing and production environments is very similar. The common requirement is to compare data between the testing and production database and import data from the production into the testing database.
If the structures are different, synchronization is more complicated. This is also a more frequently recurring task. A common case is importing from one database into another. The most common case is when a piece of software needs to import data from another piece of software which is maintained by another company. Usually, imports need to run automatically on a scheduled basis.
The method used depends on personal preferences and complexity of the problem you need to solve.
Regardless of how similar the structures are, we can choose four different ways for solving data synchronization:
- Synchronization using manually created SQL scripts
- Synchronization using the data compare method (can be used only when source and target have similar structure)
- Synchronization using automatically generated SQL scripts — need commercial product
The most straightforward and tedious solution is to manually write SQL scripts for synchronization.
- Can be performed by free and open source (FOSS) tools.
- If the table has indexes, it is very fast.
- The SQL script can be saved into a stored procedure, or run periodically as a job for SQL Server.
- Can be used as an automatic import, even on continuously changed data.
- Creating such a SQL script is quite tedious, because three scripts are usually needed for each table:
- You can only synchronize data that is available via SQL queries, so you can’t import from sources like CSV and XML files.
- It is hard to maintain — when database structure is changed, it is necessary to modify two or three scripts (
UPDATE, and sometimes also
We’ll do synchronization between the table
Source, with columns
Value, and the table
Target, with the same columns.
If tables have the same primary key, and the target table doesn’t have an auto-incrementing (identity) primary key, you can execute the following synchronization script.
INSERT INTO Target (ID, Value)
SELECT ID, Value FROM Source
WHERE NOT EXISTS (SELECT * FROM Target WHERE Target.ID = Source.ID);
SET Value = Source.Value
FROM Target INNER JOIN Source ON Target.ID = Source.ID
DELETE FROM Target
WHERE NOT EXISTS (SELECT * FROM Source WHERE Target.ID = Source.ID)
In this method, we can use a tool to compare between source and target data. The comparison process generates SQL scripts that apply the differences from the source database into the target database.
There are a number of programs for data comparison and synchronization. These programs mostly use the same approach. The user selects the source and the target database, but other alternatives could be a DB backup, a folder with SQL scripts, or even a connection to a source control system.
Below are the most popular tools that use the data compare approach:
In the first step, the data is read, or just checksums of larger data from the source and from the target are read. Then the comparison process is executed.
These tools also offer additional settings for synchronizing.
We need to set up the following configuration options that are necessary for data synchronization:
By default, the primary key or a
UNIQUE constraint is used. If there is no primary key, you can choose a combination of columns. The Sync key is used to pair rows of the source with rows of the target.
By default, tables are paired by name. You can change this, and pair them according to your own needs. In the dbForge Data Compare software, you can choose SQL query to be the source or destination.
After confirming, the tool compares source and target data. The whole process consists of downloading all of the source and target data and comparing them based on specified criteria. By default, values from equally named tables and columns are compared. All tools support mapping column and table names. Also, there is the possibility to exclude
IDENTITY (autoincrement) columns or to do some transformations before comparing values (round float types, ignore character case, treat
NULL as an empty string, etc.) Data download is optimized. If the data volume is large, only checksums are downloaded. This optimization is helpful in most cases, but time requirements for performing operations increase with the volume of data.
In the next step, there is a SQL script with generated migrations. This script can be saved or run directly. To be safe, we can even make a database backup before running this script. ApexSQL Data Diff tool can create an executable program which runs the script on a selected database. This script contains data which needs to be changed, not the logic how to change it. This means that the script cannot be run automatically to provide a recurring import. That is the biggest disadvantage of this approach.
- Advanced knowledge of SQL is not required, and can be done via GUI.
- You have the ability to visually check differences between databases before synchronization.
- It’s an advanced feature of commercial products.
- Performance decreases when transferring enormous volumes of data.
- Generated SQL script contains only differences, and thus cannot be reused for automatically synchronizing future data.
Below you can see the typical UI of these tools.
ApexSQL Data Diff
RedGate SQL Compare
Change list in dbForge Data Compare
This method is very similar to the data comparison method. The only difference compared to the previous method is that there is no data comparison, and the generated SQL script does not contain data differences, but synchronization logic. The generated script can be easily saved into a stored procedure, and can be run periodically (e.g., each night). This method is useful for automatic imports between databases. The performance of this method is much better than the data compare method.
Synchronization by automatically generated SQL is only provided by SQL Database Studio.
SQL Database Studio provides a similar interface to the data compare method. We need to select the source and target (databases or tables). Then we need to set up options (sync keys, pairing and mapping). There is a graphical query builder feature for setting up all parameters.
- Advanced knowledge of SQL is not required.
- You can set up everything in a GUI pretty quickly.
- The resulting SQL script can be saved into a stored procedure.
- Can be used as automatic import — as a job for SQL Server.
- It’s an advanced feature of commercial products.
- Differences can’t be checked manually before synchronization, because the whole process is executed in one step.
Two databases (A and B), each containing one table with 2,000,000 rows. The tables are in two different databases on same SQL Server. This test covers two extreme cases: 1) The source table contains all 2,000,000 rows and the target table is empty. Synchronization needs to provide many
INSERTS. 2) The source and target tables contain 2,000,000 rows. The difference is only in one row. Synchronization needs to provide only one
RedGate Data Compare needs 3 steps:
- Generate script
- Run script on target database
ApexSQL Data Diff needs 2 steps:
- Generate script and run script in one step
SQL Database Studio performs the whole synchronization in one step. Below are synchronization times, in seconds. In the column labeled “individual steps” are durations of the synchronization steps listed above.
Case A. many INSERTsCase A. many INSERTs (individual steps)Case B. UPDATE one rowCase B. UPDATE one row (individual steps)SQL Database Studio47 5 RedGate Data Compare31713+92+2122322+0+1ApexSQL Data Diff18818+1702625+
Lower is better.
The same test, but databases are on different SQL servers, which are not connected over a linked server.
Case A. many INSERTsCase A. many INSERTs (individual steps)Case B. UPDATE one rowCase B. UPDATE one row (individual steps)SQL Database Studio78 44 RedGate Data Compare28817+82+1792524+0+1ApexSQL Data Diff20318+1852524+1dbForge Data Compare32611+3151616+0
Lower is better.
From the results, it’s obvious that RedGate and Apex do not care if databases are on the same SQL server, because the synchronization algorithm is not dependent on SQL Server. SQL Database Studio uses native functions of SQL Server; therefore, the result is better when databases are on the same server.
There are also situations when one wide table has to be synchronized into many small related tables.
This example consists of one wide table SourceData which needs to be synchronized into small tables
City. The scheme is given below.
Data in SourceData could be like the one in the image below.
INSERT INTO Continent (Name)
WHERE (SourceData.Continent IS NOT NULL
AND NOT EXISTS (SELECT * FROM Continent tested WHERE tested.Name =SourceData.Continent ))
GROUP BY SourceData.Continent;
INSERT INTO City (Name, CountryId)
LEFT JOIN Continent ON SourceData.Continent = Continent.Name
LEFT JOIN Country ON SourceData.Country = Country.Name AND Continent.Id = Country.ContinentId
WHERE SourceData.City IS NOT NULL
AND Country.Id IS NOT NULL
AND NOT EXISTS (SELECT * FROM City tested
WHERE tested.Name = SourceData.City AND tested.CountryId = Country.Id)
This script is more complicated. It is because records in the tables
Continent have to be found. This script inserts missing records into
City and fills
DELETE scripts could also be written the same way if needed.
- You do not need any commercial products.
- The SQL script can be saved into stored procedure or run periodically as a job for SQL Server.
- Creating such a SQL script is difficult and complicated (for each table, three scripts —
DELETE—are usually necessary).
- It is very hard to maintain.
This kind of synchronization (wide table into many related tables) cannot be done with the data compare method, because it is focused on different use cases. Since the data compare method produces a SQL script with data to be inserted, it has no straightforward ability to look up references in related tables. For that reason, applications using this method can’t be used (dbForge Data Compare for SQL Server, RedGate SQL Data Compare, Apex SQL Data Diff).
However, SQL Database Studio can help you to create synchronization scripts automatically. In the picture below, there is an element called Editor for Data Synchronization in SQL Database Studio.
Editor looks like the well-known Query builder and works in a very similar way. Each table needs to have a defined synchronization key, but there are also defined relations between tables. In the picture above there is mapping for synchronization as well. In the column list (lower part of image) there are the columns of the table
City (for other tables it is similar).
- Id — This column is not mapped because it is the primary key (automatically generated).
- CountryId — This column is defined as a reference for the table.
- Name — This column is filled from column City in the source table (wide table).
Name are chosen as synchronization keys. The synchronization key is a set of columns which uniquely identify a row in the source and target table. You can not use the primary key
Id as a synchronization key because this is not in the source table.
After the synchronization, this is what the tables look like:
In the example above, there was one wide table as a source. There is also a common scenario when source data is stored in several related tables. Relations in SQL Database Studio are not defined using foreign keys, but column names. This way it is also possible to import from CSV or Excel files (the file is loaded into a temporary table, and synchronization is run from that table). It’s good practice to have unique column names. If this is not possible, you can define aliases to those columns.
- Easy and fast to create
- Easy to maintain
- Can be saved into a stored procedure (the stored procedure is saved with data necessary to open the synchronization in an editor later on)
Data synchronization consists of a sequence of
DELETE commands. There are multiple ways to create sequences of these commands. In this article, we looked at three options for creating synchronization SQL scripts. The first option is to create everything manually. It is feasible (but takes too much time), it requires complex understanding of SQL, and it’s difficult to create and maintain. The second option is to use commercial tools. We looked at the following tools:
- dbForge Data Compare for SQL Server
- RedGate SQL Data Compare
- Apex SQL Data Diff
- SQL Database Studio
The first three tools work very similarly. They compare data, let the user analyze differences, and can synchronize selected differences (even automatically or from command line). They are beneficial for these usage scenarios:
- Databases are out of sync due to various errors.
- You need to avoid replication while transferring data between environments.
- Data comparison reports in Excel or HTML are needed.
Each tool is loved for one reason or the other: dbForge has great UI and a lot of options, ApexSQL performs better than the rest, and RedGate is the most popular one.
The fourth tool, SQL Database Studio, works a little differently. It generates SQL scripts that contain synchronization logic, not changes. Performance is also great, because all work is done directly on the database server, so no data transfer between the database server and the synchronization tool is needed. This tool is useful for the following use cases:
- Automatic database migrations where databases has different structure
- Import into multiple related tables
- Import from external sources XML, CSV, MS Excel