Sunday, March 31, 2013

Data migration to Azure SQL Database

I've long been a fan of SQL Azure (under its various names), mainly because it has usually provided the easiest first step for migrating existing .NET apps into the cloud.  That said, there have always been hundreds of features it lacks compared to "real" SQL Server, and I've never been particularly satisfied with any of the official means for data migration.

So imagine (if you can!) my excitement last September with the announcement that it had become possible to add Azure SQL DB as a linked server.  As well as allowing queries which join tables across multiple servers, I also had high hopes that it would make migrating data from one server to another much easier.  Between two linked SQL Servers, I've often used the nice and simple SELECT INTO statement to copy both the data and schema of a table in one easy step.
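For reference, this is the sort of one-step copy I mean between two conventional linked servers (the server, database and table names here are hypothetical):

-- Copy both schema and data from a table on a linked server
-- into a brand new local table, in a single statement
SELECT *
INTO dbo.Customers_Copy
FROM otherserver.SourceDb.dbo.Customers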

On my veritable roller coaster of emotions that day, imagine now the devastation caused by the following response from Azure:

Msg 40510, Level 16, State 1, Line 1
Statement 'SELECT INTO' is not supported in this version of SQL Server.

...due, it seems, to Azure SQL DB insisting that every table has a clustered index, while SELECT INTO always creates its destination table as a heap :-(

After my initial disappointment, I figured that I could still move data from one place to another using INSERT INTO ... SELECT, provided the schema was already in place on the destination.  After lots of playing, I finally came up with the procedure defined below for copying an entire database from a classic SQL Server into a linked Azure database - whilst also maintaining referential integrity.  Various other restrictions encountered along the way mean that it gets a bit ugly in places (especially due to all the dynamic SQL), but it seems to work pretty well and I've found it especially useful in scenarios where I've needed to re-run the data push several times over during development.
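The core building block is just the following (the Customers table is hypothetical; 'azure' and 'remotedatabase' are the linked server and database set up in (2) below):

-- The schema already exists on Azure, so we push rows only
INSERT INTO azure.remotedatabase.dbo.Customers (Name, Email)
SELECT Name, Email
FROM dbo.Customers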


1. Build the schema on Azure

The apps I'm writing at the moment are based on Entity Framework (code-first), so I must admit that building the actual schema of databases is something I get for free these days.  If you're in a similar situation to me, just plug the connection string which the Azure portal provides for your server into your Web.config, and fire up your app to have Entity Framework build the empty database.

Otherwise, you could use Management Studio's "Script Database as" menu item to give you the SQL to create the schema.


2. Define the linked server

EXEC sp_addlinkedserver
 @server = 'azure',
 @srvproduct = '',
 @provider = 'SQLNCLI',
 @datasrc = 'serverref.database.windows.net',
 @catalog = 'remotedatabase'
GO

EXEC sp_addlinkedsrvlogin
 @rmtsrvname = 'azure',
 @useself = 'FALSE',
 @rmtuser = 'username@serverref', 
 @rmtpassword = 'password'
GO
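A quick sanity check that the link is working (again, the table name is hypothetical):

-- Any remote table will do here; we just want to prove that
-- four-part naming resolves through the new linked server
SELECT TOP 5 *
FROM azure.remotedatabase.dbo.Customers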

3. Ready the Azure schema for migration

You'll need two new columns on each Azure table to facilitate the data migration: DataImport_OriginalId (to hold the original primary key value of each transferred row - needed to keep foreign keys valid across tables), and DataImport_IsClone for some of the jiggery-pokery that takes place in (5).  I'm assuming throughout that each table has a single-column primary key which is an int identity.

You'll also need to disable FK constraint checking on the Azure tables in readiness for the data to be bulk loaded (because it's quite likely that referenced rows won't yet exist at the time of loading).

The script below generates the SQL to add the columns to each table and disable the constraints; run its output against the Azure database.  If you're brave, replace the PRINT commands with EXEC(@sql), but I prefer to view the SQL it generates and then copy and paste to run it manually.

DECLARE @tableName nvarchar(max)
DECLARE @sql nvarchar(max)

-- Loop over every base table (views would break the ALTER statements)
DECLARE tableCursor CURSOR FOR  
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'

OPEN tableCursor
FETCH NEXT FROM tableCursor INTO @tableName

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Add the two migration helper columns
    SET @sql = 'ALTER TABLE ' + @tableName 
    + ' ADD DataImport_OriginalId int, DataImport_IsClone bit'
    PRINT @sql
        
    -- Disable all FK constraints ready for the bulk load
    SET @sql = 'ALTER TABLE ' + @tableName 
    + ' NOCHECK CONSTRAINT ALL'
    PRINT @sql
    
    FETCH NEXT FROM tableCursor INTO @tableName
END   

CLOSE tableCursor 
DEALLOCATE tableCursor
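For a hypothetical Customers table, the generated output looks like this:

ALTER TABLE Customers ADD DataImport_OriginalId int, DataImport_IsClone bit
ALTER TABLE Customers NOCHECK CONSTRAINT ALL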

4. Push the data

The script below generates the SQL to copy the data from your originating server over to Azure, also populating the new columns we added in (3).  Both this script and its output run on the originating server, since the generated INSERT statements reference the Azure tables through the linked server.

DECLARE @tableName nvarchar(max)
DECLARE @identityColumn nvarchar(max)
DECLARE @otherColumns nvarchar(max)
DECLARE @sql nvarchar(max)

DECLARE tableCursor CURSOR FOR  
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
 AND TABLE_NAME NOT IN ('EdmMetadata')

OPEN tableCursor
FETCH NEXT FROM tableCursor INTO @tableName

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Find the table's identity column (assumed: exactly one per table)
    SELECT @identityColumn = COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @tableName
     AND COLUMNPROPERTY(OBJECT_ID(TABLE_NAME), 
      COLUMN_NAME, 'IsIdentity') = 1
    
    -- Build a comma-separated list of every other column
    SET @otherColumns = NULL
    SELECT @otherColumns = COALESCE(@otherColumns + ',','') 
    + ('[' + COLUMN_NAME + ']')
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @tableName
     AND COLUMN_NAME <> @identityColumn
    ORDER BY ORDINAL_POSITION
    
    -- Push the rows, stashing the original PK in DataImport_OriginalId
    SET @sql = 'INSERT INTO azure.remotedatabase.dbo.' + @tableName 
    + ' ([DataImport_OriginalId],[DataImport_IsClone],' + @otherColumns 
    + ') SELECT [' + @identityColumn + '],0,' + @otherColumns 
    + ' FROM ' + @tableName
    PRINT @sql
    
    FETCH NEXT FROM tableCursor INTO @tableName
END   

CLOSE tableCursor 
DEALLOCATE tableCursor
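Again for a hypothetical Customers table (identity PK Id, plus columns Name and Email), the generated statement looks like:

INSERT INTO azure.remotedatabase.dbo.Customers ([DataImport_OriginalId],[DataImport_IsClone],[Name],[Email]) SELECT [Id],0,[Name],[Email] FROM Customers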

5. Tidy up!

We now have all of the data sitting in our Azure database, but the primary keys will almost certainly be different from their originals, hence any foreign key values will reference non-existent or incorrect rows.  That's because we can't SET IDENTITY_INSERT ON across successive connections, and with a linked server you have no control over which connection each SQL statement runs on.  Instead, we have the original IDs in that new DataImport_OriginalId column we defined earlier.

Because we can neither modify the newly assigned PKs, nor even reseed identity columns (DBCC isn't supported in Azure SQL DB), the only way I've found to get the rows back with their original PK values is as follows (a worked single-table example follows the list):
  1. Switch on IDENTITY_INSERT
  2. Find a suitably distant PK value from anything that currently exists or will need to exist... lazily, I've used MAX(PK) + MAX(DataImport_OriginalId)
  3. Insert (and then delete) one row to set the identity seed to this value
  4. Switch off IDENTITY_INSERT
  5. Clone all rows, flagging them as cloned using the DataImport_IsClone column
  6. Delete all original rows (where DataImport_IsClone = 0)
  7. Switch on IDENTITY_INSERT
  8. Clone all cloned rows, with PK now explicitly set to DataImport_OriginalId
  9. Switch off IDENTITY_INSERT
  10. Delete all rows from the first clone!
  11. Repeat for all tables; then (and only then) re-enable FK constraint checking.
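To make that concrete, here is the full sequence for a hypothetical Customers table with identity PK Id and one other column Name - exactly what the generator below produces:

SET IDENTITY_INSERT Customers ON
INSERT INTO Customers ([Id],[DataImport_IsClone],[Name]) SELECT TOP 1 (SELECT MAX(DataImport_OriginalId) + MAX(Id) FROM Customers)+1,1,[Name] FROM Customers
SET IDENTITY_INSERT Customers OFF
DELETE FROM Customers WHERE DataImport_IsClone = 1
INSERT INTO Customers ([DataImport_OriginalId],[DataImport_IsClone],[Name]) SELECT [DataImport_OriginalId],1,[Name] FROM Customers WHERE DataImport_OriginalId IS NOT NULL
DELETE FROM Customers WHERE DataImport_IsClone = 0 OR DataImport_IsClone IS NULL
SET IDENTITY_INSERT Customers ON
INSERT INTO Customers ([Id],[DataImport_IsClone],[Name]) SELECT DataImport_OriginalId,0,[Name] FROM Customers
SET IDENTITY_INSERT Customers OFF
DELETE FROM Customers WHERE DataImport_IsClone = 1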

Here comes the humdinger of a query to generate the SQL for all tables.  Remember to run its output directly against the Azure database, in a single session, so that the SET IDENTITY_INSERT statements take effect:

DECLARE @tableName nvarchar(max)
DECLARE @identityColumn nvarchar(max)
DECLARE @otherColumns nvarchar(max)
DECLARE @maxIdentity int
DECLARE @sql nvarchar(max)

DECLARE tableCursor CURSOR FOR  
SELECT TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
 AND TABLE_NAME NOT IN ('EdmMetadata')

OPEN tableCursor
FETCH NEXT FROM tableCursor INTO @tableName

WHILE @@FETCH_STATUS = 0
BEGIN

    -- Find the table's identity column (assumed: exactly one per table)
    SELECT @identityColumn = COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @tableName
     AND COLUMNPROPERTY(OBJECT_ID(TABLE_NAME),
      COLUMN_NAME, 'IsIdentity') = 1

    -- Every other column, excluding our two migration helpers
    SET @otherColumns = NULL
    SELECT @otherColumns = COALESCE(@otherColumns + ',','') 
    + ('[' + COLUMN_NAME + ']')
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = @tableName
     AND COLUMN_NAME <> @identityColumn
     AND COLUMN_NAME NOT LIKE 'DataImport_%'
    ORDER BY ORDINAL_POSITION

    -- Steps 1-3: with IDENTITY_INSERT on, insert one sacrificial row with a
    -- suitably high explicit PK, which pushes the identity seed up past it
    SET @sql = 'SET IDENTITY_INSERT ' + @tableName + ' ON'
    PRINT @sql

    SET @sql = 'INSERT INTO ' + @tableName 
    + ' ([' + @identityColumn + '],[DataImport_IsClone],' + @otherColumns 
    + ') SELECT TOP 1 (SELECT MAX(DataImport_OriginalId) + MAX(' 
    + @identityColumn 
    + ') FROM ' + @tableName + ')+1,1,' + @otherColumns 
    + ' FROM ' + @tableName
    PRINT @sql

    -- Step 4, then delete the sacrificial row
    SET @sql = 'SET IDENTITY_INSERT ' + @tableName + ' OFF'
    PRINT @sql
    
    SET @sql = 'DELETE FROM ' + @tableName + ' WHERE DataImport_IsClone = 1'
    PRINT @sql
    
    -- Steps 5-6: clone every row (clones get new, high PKs),
    -- then delete the originals
    SET @sql = 'INSERT INTO ' + @tableName 
    + ' ([DataImport_OriginalId],[DataImport_IsClone],' + @otherColumns 
    + ') SELECT [DataImport_OriginalId],1,' + @otherColumns 
    + ' FROM ' + @tableName 
    + ' WHERE DataImport_OriginalId IS NOT NULL'
    PRINT @sql
    
    SET @sql = 'DELETE FROM ' + @tableName 
    + ' WHERE DataImport_IsClone = 0 OR DataImport_IsClone IS NULL'
    PRINT @sql

    -- Steps 7-10: re-clone with the PK explicitly set back to the original
    -- value, then delete the intermediate clones
    SET @sql = 'SET IDENTITY_INSERT ' + @tableName + ' ON'
    PRINT @sql

    SET @sql = 'INSERT INTO ' + @tableName 
    + ' ([' + @identityColumn + '],[DataImport_IsClone],' + @otherColumns 
    + ') SELECT DataImport_OriginalId,0,' + @otherColumns 
    + ' FROM ' + @tableName
    PRINT @sql

    SET @sql = 'SET IDENTITY_INSERT ' + @tableName + ' OFF'
    PRINT @sql
    
    SET @sql = 'DELETE FROM ' + @tableName + ' WHERE DataImport_IsClone = 1'
    PRINT @sql
        
    FETCH NEXT FROM tableCursor INTO @tableName

END   

CLOSE tableCursor 

-- Step 11: a second pass over the same tables to re-enable FK constraint checking
OPEN tableCursor
FETCH NEXT FROM tableCursor INTO @tableName

WHILE @@FETCH_STATUS = 0
BEGIN

    SET @sql = 'ALTER TABLE ' + @tableName + ' CHECK CONSTRAINT ALL'
    PRINT @sql

    FETCH NEXT FROM tableCursor INTO @tableName

END   

CLOSE tableCursor 
DEALLOCATE tableCursor

Note that I've deliberately left behind the two columns I created in (3): they have no negative impact on my application, and they allow me to repeat this process as many times as I need by emptying the Azure database and re-running (4) and (5).
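On that note, here's a minimal sketch of generating the DELETEs needed to empty the Azure tables before a re-run - DELETE rather than TRUNCATE, because TRUNCATE isn't allowed on tables referenced by foreign keys, even NOCHECKed ones.  Run it (and its output) against the Azure database, after re-disabling the constraints as in (3):

-- Emits one DELETE per table; order doesn't matter while FK checks are off
SELECT 'DELETE FROM ' + TABLE_NAME
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'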


Improvements?

I find this technique for migrating data into the cloud useful, but it would be great to find a way to remove the need for all the data shuffling that goes on in (5) just to restore the original identities.  Something like the ability to switch on IDENTITY_INSERT for all connections would make things a whole lot nicer.

If you have any thoughts or suggestions to share for making this cleaner, I'd love to hear from you!
