Though
HDD prices have fallen well beyond imagination, archiving of data and keeping
the database to a manageable size is a concern for all organizations. Storing
obsolete data online reduces the performance of a database server. A
well-designed archival strategy minimizes the performance problems that are
caused by maintaining excess data. When designing an archival strategy, you
should first determine how much data can be archived, then choose an
appropriate structure for the archival data, and finally create a plan for
archiving the data.
Why
Archive?
By
archiving data, you can improve query performance, decrease the disk space
usage and reduce the maintenance window. From a performance point of view, if a
production database has obsolete data that is never or rarely used, query
execution can be time-consuming. This is because queries also scan the obsolete
data. To improve query performance, you can move the obsolete data from the
production database to another server.
Mission
critical databases now-a-days runs few 10s of terabytes and obsolete data over
there uses a large amount of disk space. You can free this disk space and use
it for other purposes by archiving the obsolete data. If the obsolete data is
stored on an expensive disk system, such as a storage area network (SAN), then
archiving the obsolete data can considerably reduce storage costs.
Lastly,
performing maintenance operations, such as reindexing or defragmenting, on very
large tables with a lot of obsolete data can be time-consuming. You can reduce
the time required for these operations by archiving data. Archiving also
reduces the time required for database backup and restore operations. I do
consider this as a major factor when you are thinking of capacity planning to
how much resources are required for your administrative tasks. You must think
of other strategies of how to backup 100s of GB data in the short window of
time and yet find time to do other maintenance tasks of re-indexing etc. Though
we don’t cover this topic here, would like you to think on these lines
nevertheless.
Is
archiving needed?
Well,
this might become a million dollar question for application developers and the
business users. This MUST come from the client’s requirement. The amount of
online data the users require depends on your organization’s business. For
example, typically in a health-care domain a patient’s history records need to
be online always on the contrary in a banking scenario maybe you need to have a
6 months sliding window available online. So you can see a business expert
might sometimes be of big help to take such decisions. Another consideration
would be a business regulation can stipulate the time span for which the data
must be accessible online. Therefore, you need to consider the effect of the
regulatory requirements when you determine the data that you can store offline.
Non-compliance to this can be a big problem to the companies.
Partitioning
your way with SQL 2005
Previous
versions of SQL Server did not provide support for creating and maintaining a
partitioning scheme. As a result, the partitioning feature was not used
extensively. Instead, people resorted to other mechanism like Application
Centric partitioning and Distributed partitioning views to get this done. Both
of these are not elegant because there is a bit of logic involved either in the
middle tier or the data layer. Hence this was difficult to implement. With the
advances made in SQL Server 2005, it is best to use partitioned tables to
structure large databases for archival.
You
can place partitioned tables and their indexes in separate filegroups. In
addition, you can automatically repartition data among various tables. You can
also switch tables in and out of a partition. After a table is switched out of
a partition, you can move the table and its index to the archival destination.
By default, an index that is created on a partitioned table uses the same
partition function. As a result, the index is aligned with the table. This
simplifies the management of data. You can change the default settings and
align the index to another partition function, or choose not to partition the
index. However, such structures make it more difficult to archive data.
Check
this excellent article on MSDN for more specifics:
http://msdn2.microsoft.com/en-us/library/ms345146.aspx
Just
to take a sliding window example, which is the most common requirement like the
example we took in the previous section of a Bank to keep 6 months of
transactions available online always. First to structure the transaction table
for archiving, the database administrator partitions it by month. Then, the
database administrator creates an archive table. Like the transaction table,
the archive table is partitioned by month. This table stores data that is older
than 6 months. At the end of each month, the database administrator transfers
the oldest month’s sales data to the archive table. This method of table
partitioning is typically called the sliding window technique.
For
example, at the end of July 2006, the database administrator wants to archive
the transaction data for Jan 2006. First, the database administrator creates an
empty partition for Aug 2006 in the transaction table. Then, the database
administrator switches out the partition for Jan 2006. Next, the database
administrator copies the transaction data for Jan 2006 to a table on the
archive database. The database administrator then switches in this table as a
new partition in the archive table. Finally, the database administrator deletes
the data for Jan 2006 from the transaction table.
Normalize
or DeNormalize?
Both
the techniques are there, but a subtle variations need to be taken when
handling each of these scenarios. Archiving related data together preserves the
historical context of the data and avoids duplication. To do this, you can use
normalized tables for structuring archival data. When using normalized tables,
you must ensure that the tables can accommodate changes in lookup values or
related tables. One way to accomplish this is by adding date range validity to
the normalized tables. You can then specify the date ranges for valid lookup
values. If you are unable to archive all related data together, you can use
denormalized tables to preserve the historical context of the data.
Denormalized tables store actual values rather than references to the current
data. In addition to denormalized tables, you can use indexed views to
denormalize data. Because denormalized tables persist data physically, you can
retrieve data from them more quickly than from indexed views. However,
denormalized tables require additional disk space. Moreover, you need to build
architecture of updating the data in the denormalized table unlike in an
indexed view.
Is
archiving that easy then?
Data
archiving is a complex process. It must be planned effectively so that the data
transfer from a production database to the archival media does not disrupt
normal operations. It is an art and proper planning has to go behind each and
every operation done.
You
can move the archival data from the production server to the destination
storage format on a scheduled or on an arbitrary basis. Moving data on an
established schedule allows you to create a process that is easily automated
and tested, therefore causing fewer errors. When you choose the archiving
frequency, you need to consider the impact of the data movement. You must
schedule the data movement when the user load is low. Also, you must move small
data sets periodically instead of moving one large data set.
If
you need to move the data to an archive server, then you must consider the type
of connection between the production server and the archive server. If there is
a direct connection, then you can use tools such as SQL Server Integration
Services (SSIS) and replication for the data transfer. You can also use queries
to transfer data between the linked servers. If there is a staged operation,
then you need to store the data on an intermediate format, such as a tape or a
DVD, before you move the data to the archive server. For indirect data
transfer, you can use tools such as SSIS and the bcp utility.
You
must ensure that all the storage formats and the network connections involved
in the data movement are secure. This is a critical aspect. For example, assume
that the archival data is moved from a production server to a tape through an
intermediate server. In such a case, the database administrator must ensure
that the two servers and the tape storage location are secure, and the network
connections between them are secure. The administrator must also ensure that
the network connections between the location and the two servers, and the
network connections between the two servers are secure. Data stored on a
portable format, such as a tape, is more vulnerable to security attacks than
the data stored on an archive server in a secure data center. To achieve
enhanced security, you can use encrypted data transfer and encrypted files.
Where
to Archive?
You
can store archival data on formats such as tapes, disks, or optical media.
These formats differ in characteristics such as cost, accessibility, shelf
life, reliability, durability, and security. If you need to archive large
volumes of data frequently, the cost of storage is a significant consideration.
Tapes are typically less expensive per megabyte of data than disks or optical
media. On the other side, if you need quick access to the archived data, you
can use disks or optical media instead of tapes. Data retrieval is faster with
disks and optical media because these formats support random access, whereas
tapes support sequential access. Parallel tape devices offering significantly
better access times are now available. However, these devices are expensive.
Apart
from these, also keep an eye on the reliability and durability factor which
might vary from one format to another. Typically, tapes tend to deteriorate
more easily than disks or optical media. But tapes have a longer shelf life
than disks and optical media. Assuming these drives are physically secure,
encrypting the data/backup has its own overhead if you are using third party
softwares etc. But be cognizant of the same.
Conclusion
We
did take a tour to why we need to archive our data and what are the current
options available when dealing with large databases. These can be used as
guidelines rather than words written on stone. We will dwell into more
specifics on managing large data warehouses in our future articles.