
Saravana Kumar
Vinod Kumar


Archiving Data with SQL Server 2005 

Though hard disk prices have fallen dramatically, archiving data and keeping databases at a manageable size remains a concern for every organization. Keeping obsolete data online reduces the performance of a database server. A well-designed archival strategy minimizes the performance problems that are caused by maintaining excess data. When designing an archival strategy, you should first determine how much data can be archived, then choose an appropriate structure for the archival data, and finally create a plan for archiving the data.

Why Archive?

By archiving data, you can improve query performance, decrease disk space usage, and shorten the maintenance window. From a performance point of view, if a production database contains obsolete data that is never or rarely used, query execution can be time-consuming because queries must scan the obsolete data as well. To improve query performance, you can move the obsolete data from the production database to another server.

Mission-critical databases nowadays often run to tens of terabytes, and the obsolete data in them occupies a large amount of disk space. You can free this disk space and use it for other purposes by archiving the obsolete data. If the obsolete data is stored on an expensive disk system, such as a storage area network (SAN), archiving it can considerably reduce storage costs.

Lastly, performing maintenance operations, such as reindexing or defragmenting, on very large tables with a lot of obsolete data can be time-consuming. You can reduce the time required for these operations by archiving data. Archiving also reduces the time required for database backup and restore operations. I consider this a major factor in capacity planning, when you estimate how many resources your administrative tasks will require. You must also think about strategies for backing up hundreds of gigabytes of data within a short window while still finding time for other maintenance tasks such as reindexing. Though we don't cover that topic here, it is worth thinking along these lines nevertheless.

Is archiving needed?

Well, this might become the million-dollar question for application developers and business users, and the answer MUST come from the client's requirements. The amount of online data the users require depends on your organization's business. For example, in a health-care domain a patient's history records typically need to be online at all times, whereas in a banking scenario a six-month sliding window of data kept online may be enough. A business expert can therefore be of great help in making such decisions. Another consideration is that a business regulation may stipulate the time span for which data must be accessible online, so you need to weigh the regulatory requirements when you decide which data can be stored offline. Non-compliance can be a big problem for a company.

Partitioning your way with SQL 2005

Previous versions of SQL Server did not provide support for creating and maintaining a partitioning scheme, so the partitioning feature was not used extensively. Instead, people resorted to other mechanisms, such as application-centric partitioning and distributed partitioned views. Neither is elegant, because some of the partitioning logic ends up in the middle tier or the data layer, which makes these approaches difficult to implement. With the advances made in SQL Server 2005, it is best to use partitioned tables to structure large databases for archival.

You can place partitioned tables and their indexes in separate filegroups, and you can quickly repartition data by switching partitions into and out of a table. After a partition is switched out to a table, you can move that table and its indexes to the archival destination. By default, an index created on a partitioned table uses the same partition function, so the index is aligned with the table, which simplifies the management of data. You can change the default settings and align the index to another partition function, or choose not to partition the index at all; however, such structures make it more difficult to archive data.
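
To make this concrete, here is a minimal sketch of a monthly partitioned transaction table of the kind described above. The object names (pfMonthRange, psMonthRange, dbo.Transactions) and the filegroups FG1 through FG4 are illustrative assumptions, not objects from any particular schema.

-- Partition function: RANGE RIGHT puts each boundary date into the partition to its right.
CREATE PARTITION FUNCTION pfMonthRange (datetime)
AS RANGE RIGHT FOR VALUES ('20060201', '20060301', '20060401');

-- Partition scheme: one filegroup per partition (four partitions for three boundaries).
CREATE PARTITION SCHEME psMonthRange
AS PARTITION pfMonthRange TO (FG1, FG2, FG3, FG4);

-- Transaction table created on the partition scheme, partitioned by TransactionDate.
CREATE TABLE dbo.Transactions
(
    TransactionID   int      NOT NULL,
    TransactionDate datetime NOT NULL,
    Amount          money    NOT NULL,
    CONSTRAINT PK_Transactions PRIMARY KEY (TransactionID, TransactionDate)
)
ON psMonthRange (TransactionDate);

Because the primary key includes TransactionDate, its index is created on the same partition scheme and stays aligned with the table, which is what makes switching partitions in and out possible later.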

Check this excellent article on MSDN for more specifics: http://msdn2.microsoft.com/en-us/library/ms345146.aspx

Consider a sliding window example, which is the most common requirement, like the bank example from the previous section in which six months of transactions must always be available online. To structure the transaction table for archiving, the database administrator first partitions it by month. Then, the database administrator creates an archive table. Like the transaction table, the archive table is partitioned by month, and it stores data that is older than six months. At the end of each month, the database administrator transfers the oldest month's transaction data to the archive table. This method of table partitioning is typically called the sliding window technique.
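
A matching archive table might look like the following sketch. The archive database is assumed to have its own monthly partition function and scheme, named pfArchiveMonth and psArchiveMonth here purely for illustration, covering the months that have aged out of the six-month window.

-- Archive table in the archive database, partitioned by month like the production table.
CREATE TABLE dbo.TransactionsArchive
(
    TransactionID   int      NOT NULL,
    TransactionDate datetime NOT NULL,
    Amount          money    NOT NULL,
    CONSTRAINT PK_TransactionsArchive PRIMARY KEY (TransactionID, TransactionDate)
)
ON psArchiveMonth (TransactionDate);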

For example, at the end of July 2006, the database administrator wants to archive the transaction data for Jan 2006. First, the database administrator creates an empty partition for Aug 2006 in the transaction table. Then, the database administrator switches the Jan 2006 partition out to a staging table. Next, the database administrator copies the Jan 2006 data from the staging table to a table in the archive database and switches that table in as a new partition of the archive table. Finally, the database administrator removes the now-empty Jan 2006 partition from the transaction table and deletes the staged Jan 2006 data from the production server.
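
On the production side, the monthly cycle sketched above might look roughly like this. The boundary dates, partition number, filegroup name, and the staging table dbo.TransactionsStaging are assumptions for illustration; the staging table must have the same structure as dbo.Transactions and live on the same filegroup as the partition being switched out.

-- 1. Create an empty partition for Aug 2006 in the transaction table.
ALTER PARTITION SCHEME psMonthRange NEXT USED FG_Aug2006;
ALTER PARTITION FUNCTION pfMonthRange() SPLIT RANGE ('20060801');

-- 2. Switch the Jan 2006 partition out to the staging table.
--    (Partition 1 is assumed to hold the Jan 2006 rows; verify with
--    SELECT $PARTITION.pfMonthRange('20060115').)
ALTER TABLE dbo.Transactions SWITCH PARTITION 1 TO dbo.TransactionsStaging;

-- 3. Copy the staged rows to a table in the archive database (SSIS, bcp,
--    or a linked-server query) and switch that table in as a new partition
--    of the archive table there.

-- 4. Remove the now-empty boundary so the transaction table keeps a
--    six-month window, then drop or truncate the staging table.
ALTER PARTITION FUNCTION pfMonthRange() MERGE RANGE ('20060201');

Because the switch operations are metadata-only, they complete almost instantly even on very large tables.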

Normalize or DeNormalize?

Both techniques are available, but each scenario needs subtly different handling. Archiving related data together preserves the historical context of the data and avoids duplication. To do this, you can use normalized tables for structuring archival data. When using normalized tables, you must ensure that the tables can accommodate changes in lookup values or related tables. One way to accomplish this is to add date-range validity to the normalized tables; you can then specify the date ranges for which the lookup values are valid.

If you cannot archive all related data together, you can use denormalized tables to preserve the historical context of the data. Denormalized tables store actual values rather than references to the current data. In addition to denormalized tables, you can use indexed views to denormalize data. Because denormalized tables persist data physically, you can retrieve data from them more quickly than from indexed views. However, denormalized tables require additional disk space, and unlike an indexed view, a denormalized table requires you to build your own mechanism for keeping its data up to date.
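
As a sketch of the date-range validity idea, assume a hypothetical BranchLookup table referenced by the archived transactions (the BranchID column on the archive table is likewise an assumption): each lookup row carries the period for which its values were valid, and archive queries join on the range that was in effect at the transaction date.

-- Lookup table with date-range validity; a branch renamed in March gets a new row.
CREATE TABLE dbo.BranchLookup
(
    BranchID   int           NOT NULL,
    BranchName nvarchar(100) NOT NULL,
    ValidFrom  datetime      NOT NULL,
    ValidTo    datetime      NULL,   -- NULL means the row is still current
    CONSTRAINT PK_BranchLookup PRIMARY KEY (BranchID, ValidFrom)
);

-- Resolve the branch name that was valid when each archived transaction occurred.
SELECT t.TransactionID, t.TransactionDate, b.BranchName
FROM   dbo.TransactionsArchive AS t
JOIN   dbo.BranchLookup        AS b
       ON  b.BranchID = t.BranchID
       AND t.TransactionDate >= b.ValidFrom
       AND (b.ValidTo IS NULL OR t.TransactionDate < b.ValidTo);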

Is archiving that easy then?

Data archiving is a complex process. It must be planned carefully so that the data transfer from a production database to the archival media does not disrupt normal operations. It is something of an art, and proper planning has to go into each and every operation.

You can move the archival data from the production server to the destination storage format on a scheduled or an ad hoc basis. Moving data on an established schedule allows you to create a process that is easily automated and tested, and therefore causes fewer errors. When you choose the archiving frequency, you need to consider the impact of the data movement: schedule the data movement for when the user load is low, and move small data sets periodically instead of moving one large data set.

If you need to move the data to an archive server, you must consider the type of connection between the production server and the archive server. If there is a direct connection, you can use tools such as SQL Server Integration Services (SSIS) and replication for the data transfer, or you can use queries to transfer data between linked servers. If the operation is staged, you need to store the data in an intermediate format, such as tape or DVD, before you move it to the archive server. For indirect data transfer, you can use tools such as SSIS and the bcp utility.
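
For the direct-connection case, a simple linked-server query is often enough. The following sketch assumes a linked server named ARCHSRV that points at the archive instance (configured to allow remote data access) and reuses the hypothetical staging and archive tables from the earlier examples.

-- Push the switched-out rows to the archive server over a linked server;
-- the target here is a plain table that will later be switched into the
-- partitioned archive table on that server.
INSERT INTO ARCHSRV.ArchiveDB.dbo.TransactionsJan2006
    (TransactionID, TransactionDate, Amount)
SELECT TransactionID, TransactionDate, Amount
FROM   dbo.TransactionsStaging;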

You must ensure that all the storage formats and the network connections involved in the data movement are secure; this is a critical aspect. For example, assume that the archival data is moved from a production server to tape through an intermediate server. In that case, the database administrator must ensure that the two servers and the tape storage location are secure, and that the network connections between them are secure. Data stored on a portable format, such as tape, is more vulnerable to security attacks than data stored on an archive server in a secure data center. To achieve enhanced security, you can use encrypted data transfer and encrypted files.

Where to Archive?

You can store archival data on formats such as tapes, disks, or optical media. These formats differ in characteristics such as cost, accessibility, shelf life, reliability, durability, and security. If you need to archive large volumes of data frequently, the cost of storage is a significant consideration. Tapes are typically less expensive per megabyte of data than disks or optical media. On the other hand, if you need quick access to the archived data, you can use disks or optical media instead of tapes. Data retrieval is faster with disks and optical media because these formats support random access, whereas tapes support sequential access. Parallel tape devices that offer significantly better access times are now available, but these devices are expensive.

Apart from these factors, also keep an eye on reliability and durability, which vary from one format to another. Tapes typically deteriorate more easily than disks or optical media, but they have a longer shelf life. Even assuming the drives are physically secure, encrypting the data or the backups carries its own overhead, particularly if you are using third-party software, so be cognizant of that as well.

Conclusion

We have taken a tour of why we need to archive data and the options currently available for dealing with large databases. Treat these as guidelines rather than words written in stone. We will delve into more specifics on managing large data warehouses in future articles.