Clinical data archiving best practices: How it differs from backup

Healthcare data is similar to other commercial data in how it should be handled across different layers of storage.

The principles for archiving data are the same regardless of the class of data involved. That is, clinical data and financial data are archived the same way. Best practice involves handling the data across several tiers of data storage.

Jon Gaasedelen

In a previous article, three kinds of tiered storage were described: online, offline and near-line. This storage was initially described in terms of the kind of data the storage supported. Online storage supports transactional or volatile data; near-line storage supports less-critical data such as user home directories and installed applications. The third tier of storage, offline storage, is really the only kind of everyday storage that is solely used for backing up and archiving.

There is a distinction between backup files and archive files. Backup files are data that persists in the short term but is expendable in the long term. Archive files are data that persists over the long term and is consistently collected in the short term. To put it a different way: It is the difference between backup files and disaster recovery (DR) files. Backup files help data centers recover from (generally) human error, while DR files help data centers recover from natural or manmade disasters such as fire, accidents or weather-related issues.

The real challenge with backup and DR is maintaining backups while transitioning these backups to archive files with minimal interruptions to operations. Software relieves these challenges by affecting how the operational data is backed up and maintained. This is also where the concept of storage tiers plays a key role.

Tiered storage is about moving data from the highest level and fastest storage (online) to the lowest level and slowest storage (offline) with minimal interruption to operations. When online transaction data needs backing up, this is accomplished by taking a snapshot of the data at a point in time. A data snapshot is a great way to describe this backup event because it accomplishes a full backup in approximately the same amount of time required to take a picture with a camera.

How storage activity is like a family vacation photo

Think about taking a picture of your family at a point in your vacation where they are near the ocean, on a beach, and all their activity has been frozen for that instant in time. Just as you can look at a vacation picture and relive part of your vacation, backup software is smart enough to understand the metadata and can use it to recover files as they were at a point in time.

Best practice suggests that backups are stored so recovery can take place quickly. That means the backup data is kept close to the data being collected. If recovery is necessary, the system managing the stored data then knows where to look and can facilitate a full recovery.

The next tier, near-line storage, has to deal with two kinds of backup files. One kind is the snapshot or metadata backup files from the online storage tier. The second kind deals with the requirements necessary to back up the user home directories and the applications. Similar to online backups, these backups also use snapshots. However, in this case the snapshots are less dynamic so preservation is more persistent, because the data just doesn't change that much.

As this discussion moves into the final storage tier, offline storage, the distinction made between backing up and archiving becomes more relevant. To help visualize, think of an archived file as a final resting state for data as it progresses and changes from supporting recovery data that is highly dynamic, to supporting data that is barely changing. At a certain point, a line is drawn that says this data must be used with other data in order for a full data restoration to be meaningful.

Returning to the analogy of a family vacation, think in terms of two different vacations -- both vacations involve the beach, but one beach is in France and the other is in Florida. In either case the snapshot you take of the family will be similar, but the events leading to the picture are vastly different from a meta-information perspective. That is, the transportation taken to reach Florida versus France and where in Florida or France you're staying are different. The collection of this meta-information is equivalent to the progression that a file takes from being a backup file to being an archive file, and the recovery of that data is affected.

Migrating from backup to archive files

Remember, online data is generated from transactions or queries that happen millions of times a day. This volume of data necessarily implies that a snapshot of the data will be stale and outdated within seconds of the data backup. This is where the progression of backup to archive comes into focus.

Yes, the data in a data snapshot is stale almost immediately after it is written, but in addition to this snapshot, the transactions before and after the snapshot are also collected. Combining these collected transactions with the snapshot enables recovering data as it existed at a particular point in time. So the progression being referenced treats the snapshot as the archive and the transactions as the backup. The same progression applies to near-line storage. However, with near-line storage, no backups are made of individual changes to user files or applications. These individual changes can be thought of as transactions that are not backed up. In this case the progression from backup files to archive files is simply a file copy stored as an archive, and no transaction log is applied. Changes made to user files after that copy was taken are lost and cannot be recovered.
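The combination of a snapshot with collected transactions can be illustrated with a minimal Python sketch. It is not taken from any particular backup product: the `Transaction` class, the `restore_to_point_in_time` function, the integer timestamps and the sample patient records are all hypothetical, and the sketch simplifies matters by replaying only transactions logged after the snapshot.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One logged change (hypothetical structure, for illustration only)."""
    timestamp: int  # logical time of the change, in arbitrary units
    key: str
    value: str


def restore_to_point_in_time(snapshot, snapshot_time, log, target_time):
    """Rebuild the data as of target_time from a snapshot plus logged transactions."""
    if target_time < snapshot_time:
        raise ValueError("target predates the snapshot; an older snapshot is needed")
    state = dict(snapshot)  # work on a copy so the snapshot itself stays untouched
    for tx in sorted(log, key=lambda t: t.timestamp):
        # Replay only the changes made after the snapshot, up to the target time.
        if snapshot_time < tx.timestamp <= target_time:
            state[tx.key] = tx.value
    return state


# Assumed sample data: a snapshot taken at time 100 and a log that spans it.
snapshot = {"patient-1": "admitted"}
log = [
    Transaction(90, "patient-1", "admitted"),     # already reflected in the snapshot
    Transaction(110, "patient-1", "in surgery"),
    Transaction(120, "patient-2", "admitted"),
    Transaction(130, "patient-1", "discharged"),  # after the target time, so skipped
]

recovered = restore_to_point_in_time(snapshot, 100, log, 120)
print(recovered)  # {'patient-1': 'in surgery', 'patient-2': 'admitted'}
```

The snapshot alone would return only the state at time 100; the logged transactions are what make any later point in time recoverable.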

Know when to restore

The best practices for clinical data archiving in healthcare are no different from those for archiving any other kind of commercial data. That is, highly volatile transaction or online data is backed up and archived using a two-stage process of taking snapshots of the data and backing up transactions leading up to the snapshot.

Less-volatile user and application data is backed up by copying appropriate files at a point in time. Restoration of volatile online data involves restoring the data snapshot and then playing back the transaction log to achieve full recovery. Restoration of the less-volatile user and application files is accomplished by just putting users and application files back to where they were originally stored and accepting that some data will be lost.
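For the less-volatile case, a similarly hedged Python sketch shows why a plain file copy loses later changes. The `archive_files` and `restore_files` helpers, the directory layout and the file contents are assumptions made for illustration; only the standard-library `shutil`, `tempfile` and `pathlib` calls are real.

```python
import shutil
import tempfile
from pathlib import Path


def archive_files(source_dir: Path, archive_dir: Path) -> None:
    """Copy the files as they exist right now; no transaction log is kept."""
    shutil.copytree(source_dir, archive_dir)


def restore_files(archive_dir: Path, source_dir: Path) -> None:
    """Put the archived files back where they were originally stored."""
    if source_dir.exists():
        shutil.rmtree(source_dir)  # discard whatever is currently there
    shutil.copytree(archive_dir, source_dir)


with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp) / "home"        # stands in for a user home directory
    archive = Path(tmp) / "archive"  # stands in for the archive location
    home.mkdir()
    notes = home / "notes.txt"

    notes.write_text("version 1")
    archive_files(home, archive)     # point-in-time copy
    notes.write_text("version 2")    # change made after the copy was taken

    restore_files(archive, home)
    restored_text = notes.read_text()

print(restored_text)  # version 1
```

The edit made after the copy ("version 2") cannot be recovered, which is the data loss this kind of restoration accepts.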

Jon Gaasedelen is an independent IT consultant with more than 20 years' experience in information systems infrastructures. He has an undergraduate degree in economics and a master's degree in health informatics, both from the University of Minnesota. Let us know what you think about the story; email [email protected] or contact @SearchHealthIT on Twitter.
