Strategies to maximize uptime for SAN systems

Storage area networks are commonly used to support production operations for all kinds of businesses. Simply put, a SAN system is a collection of hard disks, wired together and connected to multiple servers.

Jon Gaasedelen

Each of these servers may be a standalone unit supporting a single application, or several servers may be clustered so that one takes over if another fails. Regardless of server configuration, the storage area network (SAN) supplies disks for all these types of systems. To reach a state in which a disk failure never interrupts service -- crucial for maintaining uptime in life-and-death situations in healthcare environments -- IT operations centers depend on redundant, purpose-built hardware and software from server to disk.

Server failovers and SAN controllers

Every application used by healthcare workers requires server resources such as internal memory, CPU and disk. Some mechanism at the hardware and/or software level must be used to make these resources redundant. From the server perspective, redundancy looks like multiple physical or virtual servers wired together with a "watcher" program checking the heartbeats of the servers. If one server fails, the software understands what resources the failed server was supplying, and seamlessly shunts those activities to a new server.
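A minimal sketch of such a watcher, assuming a simple timeout rule and a "least-loaded survivor" placement policy. The `Server` and `Watcher` names, the five-second timeout and the workload labels are all hypothetical; real cluster software is considerably more involved.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    last_heartbeat: float                       # time of the most recent heartbeat
    workloads: list = field(default_factory=list)


class Watcher:
    """Hypothetical watcher: a server is treated as failed after `timeout` seconds of silence."""

    def __init__(self, servers, timeout=5.0):
        self.servers = {s.name: s for s in servers}
        self.timeout = timeout

    def heartbeat(self, name, now=None):
        # Record that `name` is still alive.
        self.servers[name].last_heartbeat = time.time() if now is None else now

    def check(self, now=None):
        """Move workloads off silent servers; return the names of servers that were failed over."""
        now = time.time() if now is None else now
        alive = [s for s in self.servers.values() if now - s.last_heartbeat <= self.timeout]
        failed = [s for s in self.servers.values() if now - s.last_heartbeat > self.timeout]
        moved = []
        if not alive:
            return moved                        # nowhere to fail over to
        for dead in failed:
            if dead.workloads:
                moved.append(dead.name)
            while dead.workloads:
                # Assumed policy: hand each workload to the least-loaded survivor.
                target = min(alive, key=lambda s: len(s.workloads))
                target.workloads.append(dead.workloads.pop(0))
        return moved
```

Note that the sketch only moves the workload labels. Where each workload reads and writes its data is deliberately left out, which mirrors the division of labor between server failover and the SAN software.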

In the case above, the server is primarily focused on where the failed application gets its memory and CPU. The failover process does not care about where it's reading and writing to disk. This concern is handled by the hard disk or SAN software.

While this process seems straightforward, it's not. SAN software runs at the operating system level and maintains at least two paths it can take to write data to a disk. So when a server fails, the CPU only knows it must write to disk located "here," and "here" could be anywhere. The SAN software is the final arbiter of that location.

Disk path and disk failovers: Separate animals

Similar to server failover, disk path failover also requires a watcher program. The watcher ensures that the disk path about to be used actually exists and that it can accept the request and write to disk. This requires the SAN software to know everything about the path it will take to get to a hard disk. One thing to remember is that data can potentially travel multiple routes to arrive at its destination.
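The path check can be sketched as follows, assuming a logical disk that is reachable over an ordered list of paths. `Path`, `MultipathDevice`, `probe` and the path labels are hypothetical stand-ins, not the interface of any real multipathing product.

```python
from dataclasses import dataclass


@dataclass
class Path:
    name: str            # hypothetical label for one route from server to disk
    healthy: bool = True


class MultipathDevice:
    """Hypothetical logical disk that can be reached over several physical paths."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.written = []    # (path name, data) pairs, standing in for real I/O

    def probe(self, path):
        # Stand-in for a real path health check; here it only reads a flag.
        return path.healthy

    def write(self, data):
        # Try the paths in order and use the first one that passes the probe.
        for path in self.paths:
            if self.probe(path):
                self.written.append((path.name, data))
                return path.name
        raise IOError("no healthy path to disk")
```

The caller never names a path. It asks the device to write, and the path layer decides which route the data takes.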

The final consideration in SAN setup is where data is written. Remember, SAN systems are simply devices that wire hard disks together to take over for one another. The complication is that the wiring of the disks involves grouping them together and controlling their work in unison with RAID software. The principle behind RAID is to manage a tightly bound group of disks (often five or more in a SAN array, although the minimum depends on the RAID level) so that when a disk fails, its data stays available and a spare disk immediately takes its place.

In the meantime, the disk's failure is broadcast to the data center and a technician. When repairs are complete, all the data is moved back to where it was, and the process is reset for the next failure.
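How a disk group survives a failed member can be illustrated with single-parity XOR, the idea behind parity-based RAID levels such as RAID 5. This is a simplified sketch: it keeps parity on one dedicated disk and handles a single stripe, and the block contents are made up for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, position by position."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def rebuild(stripe, failed_index):
    """Recompute the block at failed_index from all surviving blocks (data and parity)."""
    survivors = [block for i, block in enumerate(stripe) if i != failed_index]
    return xor_blocks(survivors)


data = [b"ABCD", b"EFGH", b"IJKL", b"MNOP"]    # four data disks (made-up contents)
stripe = data + [xor_blocks(data)]             # a fifth disk holds the parity block

lost = 2                                        # pretend disk 2 has failed
spare_block = rebuild(stripe, lost)             # what would be written to the spare disk
```

Because XOR is its own inverse, combining every surviving block, parity included, yields exactly the missing block. The same call also regenerates the parity block if the parity disk is the one that fails.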

Maximizing uptime

SAN best practices focus on maintaining application uptime so that users never experience a server or disk failure. The techniques used for this incorporate three layers of redundancy across hardware domains:

  1. Servers are installed and configured so that each is always aware of its partners' status. If a server fails, processes are in place to cover for that failure. The first thing that a server's operating system does when a partner server fails is let the SAN software know a server has failed. In this case, the paths the SAN software was managing on behalf of the failed server are no longer relevant, and alternate paths should be implemented.
  2. This moves the SAN into the next domain of redundancy, namely providing multiple paths to the stored data. Because data from the failed server is stored apart from the server, the only thing that changes is how the new server will get to the unchanged data.
  3. At this point, the final technique for redundancy comes into play, which is the case of the failed disk itself. Here, the individual disk is failed out and removed from the collective of disks via RAID techniques, and notification of the failure is made to IT.
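A rough way to see why redundancy at all three layers matters is simple availability arithmetic. The sketch below assumes that components fail independently, which real systems only approximate, and uses a made-up figure of 99% availability per component; the function names are illustrative.

```python
def redundant_availability(a, n):
    """Availability of n redundant components, each available a fraction `a` of the time,
    assuming failures are independent: the group is down only if all n are down."""
    return 1 - (1 - a) ** n


def chain_availability(layers):
    """A request needs every layer (server, path, disk), so the availabilities multiply."""
    result = 1.0
    for a in layers:
        result *= a
    return result


# Hypothetical figures: each server, path and disk is available 99% of the time.
single = chain_availability([0.99, 0.99, 0.99])
doubled = chain_availability([redundant_availability(0.99, 2)] * 3)
```

Under these assumptions, a chain of single components is available roughly 97.0% of the time, while doubling every layer raises that to roughly 99.97%. The weakest non-redundant layer dominates the result, which is why the redundancy has to extend from server to disk.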

These best practices might seem straightforward, but they have taken years to develop across many businesses, including healthcare, manufacturing and banking. By conceptually traversing IT infrastructure architecture and building these redundancies, healthcare providers can maximize uptime for the sake of patient safety and regulatory compliance programs that mandate 24-7 access to clinical data.

Jon Gaasedelen is an independent IT consultant with over 20 years' experience in information systems infrastructures. He has an undergraduate degree in economics and a master's degree in health informatics, both from the University of Minnesota. Let us know what you think about the story; email editor@searchhealthit.com or contact @SearchHealthIT on Twitter.

This was first published in July 2013
