HDD, FS, O_SYNC : Throughput vs. Integrity

Today we will spend some time on filesystems, block devices, throughput and data integrity. But first, a few "MYTHBUSTER" statements.

#1: Even the fastest HDD today can do ONLY 650KBps natively.

#2: O_SYNC on a filesystem does NOT guarantee a write to the HDD.

#3: Raw-I/O over BLOCK devices DOES guarantee data-integrity.

Hard to believe, right? Let's analyse these statements one by one...

The fastest SATA hard-disk drives today run at 10,000RPM (compared to regular ones at 5400 or 7200RPM). The faster HDDs have also transitioned to a 4096-byte internal block size (compared to 512 bytes on regular ones).

Details of HDD components

To read/write one particular sector from/to the HDD, the head first needs to be aligned radially in the proper position. Next, one waits as the rotating disk platter positions the desired sector under the disk-head. Only then is one able to read/write the sector.

Unit of data on HDD = 1 sector = 4096 bytes

MAX RPM = 10,000 = 10,000/60 = 166.667 rotations/second

Worst-case rotational latency = 1/166.667 = 0.006s (one full rotation)

Throughput = 4096/0.006 = 682666.667 bytes/second (about 667KiB/s), i.e. roughly 650KBps

The above calculation assumes the worst possible values for both the I/O size (1 sector) and the positioning delay (1 entire rotation per sector; the radial movement of the head is ignored). This condition, though, is quite easily seen in real-life scenarios such as database applications which use the HDD as a raw block device.
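The arithmetic above can be replayed for other spindle speeds with a few lines of C. This is only a sketch of the same worst-case model (one sector per full rotation); the function names are made up for illustration.

```c
#include <stdio.h>

/* Worst-case throughput in bytes/second, assuming every I/O transfers
 * exactly one sector and has to wait one full platter rotation for it. */
double worst_case_throughput(double rpm, double sector_bytes)
{
    double rotations_per_second = rpm / 60.0;
    double seconds_per_rotation = 1.0 / rotations_per_second;
    return sector_bytes / seconds_per_rotation;
}

/* Print the figure for the spindle speeds mentioned in the text. */
void print_throughput_table(void)
{
    const double rpms[] = { 5400.0, 7200.0, 10000.0 };

    for (size_t i = 0; i < sizeof rpms / sizeof rpms[0]; i++)
        printf("%6.0f RPM: %.0f bytes/s\n", rpms[i],
               worst_case_throughput(rpms[i], 4096.0));
}
```

Calling print_throughput_table() from main() prints one line per spindle speed. The model deliberately ignores the radial seek and any caching, just like the calculation above.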

Better speeds in the range of 20-50MBps are commonly obtained by a combination of several strategies like:

Multi-block I/O.
Native Command Queueing.
RAID striping.

Now let's consider a regular I/O request at the HDD level:

HDD Read:

The kernel raises a disk read request to the HDD.
The HDD has a small amount of disk-cache (RAM) which it checks to see if the requested data exists.
If NOT found in the disk-cache then the disk-head is moved to the data location on the platter.
The data is read into disk-cache and a copy is returned to the kernel.

HDD Write:

The kernel raises a disk write request to the HDD.
The HDD has a small amount of cache (RAM) where the data to be written is placed.
An on-board disk controller is in charge of saving the contents of the cache to the platter. It operates on the following set of rules:

[Rule1] Write cache to platter as soon as possible.
[Rule2] Re-order write operations in sequence of platter location.
[Rule3] Bunch several random writes together into one sequence.

"Hasta la vista, baby!"
HUD of the disk-IO firmware running on a T-800 Terminator ;-)

[Rule1] minimises data loss: the cache is volatile, so any power outage means that data in the HDD-cache which is NOT yet written to the HDD-platter is lost.

[Rule2] optimises throughput, as serialising access reduces the time the read-write head spends seeking.

[Rule3] reduces power consumption and disk wear by allowing the disk to stop spinning instead of running all the time. Only when the cache is filled to a certain limit is the disk motor powered on and the cache flushed to the platter, following which the motor is powered down again until the next cache flush.
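To make [Rule2] and [Rule3] a bit more concrete, here is a minimal sketch of what "re-order by platter location" and "bunch writes together" could look like. It is NOT real disk firmware: the struct write_req record and the sort_and_merge() helper are hypothetical, and only requests that are exactly contiguous get merged.

```c
#include <stdlib.h>

/* Hypothetical pending-write record: start sector and length in sectors. */
struct write_req {
    unsigned long sector;
    unsigned long count;
};

/* qsort() comparator: ascending start sector. */
static int by_sector(const void *a, const void *b)
{
    const struct write_req *x = a, *y = b;
    return (x->sector > y->sector) - (x->sector < y->sector);
}

/* Sort the requests by platter location ([Rule2]) and merge requests that
 * are exactly contiguous on disk into one longer request ([Rule3]).
 * The array is compacted in place; the return value is the number of
 * requests that remain. */
size_t sort_and_merge(struct write_req *reqs, size_t n)
{
    if (n == 0)
        return 0;

    qsort(reqs, n, sizeof reqs[0], by_sector);

    size_t out = 0;
    for (size_t i = 1; i < n; i++) {
        if (reqs[out].sector + reqs[out].count == reqs[i].sector)
            reqs[out].count += reqs[i].count;   /* extends the current run */
        else
            reqs[++out] = reqs[i];              /* starts a new run */
    }
    return out + 1;
}
```

Five scattered writes at sectors 900, 100, 108, 500 and 116 (8, 8, 8, 4 and 8 sectors long) collapse into three sequential ones this way, so the head sweeps across the platter once instead of jumping back and forth.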

It's obvious that [Rule1], [Rule2] and [Rule3] work against one another, and the right balance needs to be struck between the three to have data-integrity, high throughput, low power consumption and a longer disk life. Several complex algorithms have been devised to handle this in modern-day HDD controllers.

The problem, though, is that by default performance is sacrificed in favour of the other two. This is just a "default" setting, and the beauty of the Linux kernel being open-source is that one is free to stray from it.

If you are doing mostly sequential raw block-I/O on a SATA HDD, you would be a prime candidate for this patch, which effectively moves the operation-point of the HDD closer to the high-performance region in the map.

Moving on, we now focus on how the use of O_SYNC affects I/O on filesystems as well as the raw block device.

Regular write on a regular filesystem

fd = open("/media/mount1/file", O_WRONLY);

Consider the first case, where one does a regular write (NO O_SYNC flag) on a regular filesystem on a HDD. The data is copied from the APP to the FS, i.e. from the userspace app (RAM) into the kernel filesystem page-cache (RAM), and control returns. During this, one does NOT cross any data-barriers and hence the data is NOT guaranteed to be written to the HDD. This makes the entire process of a regular write on a filesystem extremely fast.
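A regular write can be sketched as follows. The helper names (write_all, save_buffered) are invented for this post, and the O_CREAT | O_TRUNC flags and the 0644 mode are assumptions added so that the example can create its own file.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Keep calling write() until the whole buffer is accepted; write() may
 * legitimately accept fewer bytes than requested. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;            /* interrupted by a signal: retry */
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Regular (buffered) write. A return value of 0 only means that the
 * kernel page-cache accepted the data, NOT that it reached the HDD. */
int save_buffered(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    int rc = write_all(fd, buf, len);
    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Note that nothing in this path waits for the disk, which is exactly why it is fast.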

Synchronous write on a regular filesystem

fd = open("/media/mount1/file", O_WRONLY | O_SYNC);

The second case illustrated above depicts a synchronous write (with the O_SYNC flag) on a regular filesystem on a HDD. The man page for the open() call contains the following note:

O_SYNC The file is opened for synchronous I/O. Any writes on the resulting file descriptor will block the calling process until the data has been physically written to the underlying hardware.

Although using O_SYNC looks like a sure-shot guarantee that data is indeed written to disk, there is a catch in the implementation. Most HDDs contain an on-board cache on the HDD itself. A write command to the disk transfers the data from the kernel filesystem page-cache (RAM) to the HDD-cache, not to the actual mechanical platter of the disk. This transfer is limited only by the bus (SATA, IDE etc.), which is faster than the mechanical platter. When a HDD receives a write command, it copies the data into its internal HDD-cache and returns immediately. The HDD's internal firmware later transfers the data to the disk-platter according to its "3 rules" as discussed previously. Thus a write to the HDD does NOT necessarily imply a write to the disk platter. Hence even synchronous writes on a filesystem do NOT imply 100% data integrity.

Also, a data-barrier exists along this path in the filesystem layer, where the metadata (inode, superblock) info is stored. This helps in identifying any data-integrity problems on future access. Note that maintaining/updating the inode and superblock does NOT guarantee that the data is written to disk. Rather, it makes the last sequence of writes atomic (i.e. either all the writes get committed to disk or none). The inode and superblock info serves as a kind of checksum, as it is updated following the atomic write operation. All this processing means that throughput incurs a slight penalty in this case.
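The penalty is easy to measure for yourself. The sketch below times a burst of 4096-byte writes with and without O_SYNC; timed_writes() is a hypothetical helper, the block size and file flags are assumptions, and the numbers you get will depend entirely on your disk, filesystem and kernel.

```c
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Write `count` blocks of 4096 bytes to `path`, opening it with
 * `extra_flags` (0 for a regular write, O_SYNC for a synchronous one).
 * Returns the elapsed time in seconds, or -1.0 on error. */
double timed_writes(const char *path, int extra_flags, int count)
{
    char block[4096];
    struct timespec t0, t1;

    memset(block, 0xAB, sizeof block);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0644);
    if (fd < 0)
        return -1.0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < count; i++) {
        if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
            close(fd);
            return -1.0;
        }
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    return (double)(t1.tv_sec - t0.tv_sec)
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Comparing timed_writes(path, 0, 1000) against timed_writes(path, O_SYNC, 1000) on the same mount point gives a rough feel for the cost of the synchronous path.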

Synchronous write on a block device

fd = open("/dev/sda", O_WRONLY | O_SYNC);

The third case illustrated above is a synchronous write to the HDD directly via its block-device (e.g. /dev/sda). In this case there is NO data-barrier in the filesystem. O_SYNC is implemented by using the data-barrier present in the HDD, i.e. flushing the disk-cache explicitly to ensure that all the data is indeed transferred to the disk-platter before returning. This incurs the maximum penalty, and hence the throughput is the lowest of all 3 scenarios above.
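For completeness, a raw synchronous write of a single sector might look like the sketch below. The write_sector_sync() helper and the SECTOR_SIZE constant are assumptions made for illustration (4096 bytes, as in the calculation earlier). Be careful: pointing this at a real device such as /dev/sda overwrites whatever lives in that sector, so try it on a scratch device or a plain file first.

```c
#define _POSIX_C_SOURCE 200809L   /* for pwrite() */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

#define SECTOR_SIZE 4096   /* assumed sector size */

/* Overwrite sector number `sector` of `dev_path` with `data`, which must
 * point to SECTOR_SIZE bytes. The file is opened with O_SYNC, so the
 * write is synchronous as described in the text.
 * WARNING: destructive on a real block device.
 * Returns 0 on success, -1 on failure. */
int write_sector_sync(const char *dev_path, off_t sector, const char *data)
{
    int fd = open(dev_path, O_WRONLY | O_SYNC);
    if (fd < 0)
        return -1;

    /* pwrite() writes at an absolute byte offset without seeking first. */
    ssize_t n = pwrite(fd, data, SECTOR_SIZE, sector * SECTOR_SIZE);
    int rc = (n == SECTOR_SIZE) ? 0 : -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

Writing whole sectors at sector-aligned offsets keeps each request matched to the unit the disk actually works in.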

Salient observations:

A data barrier is a module/function across which data integrity is guaranteed, i.e. if the function is called and it returns successfully, then the data is completely written to non-volatile memory (the HDD, in this case). A data barrier introduces an order-of-magnitude change in:

Access-time (++)
Throughput (--)

A data barrier in the lower layers incurs a larger penalty than one in the upper layers. (Penalty : App < FS < Disk.)
O_SYNC on HDD via filesystems does NOT guarantee a successful write to non-volatile disk-platter.
O_SYNC on HDD via block-device directly guarantees data-integrity but offers very low throughput.
The HDD data-barrier(FLUSH_CACHE) is NOT utilised when using regular filesystems to access the HDD.
Disabling HDD data-barrier and raw DIRECT-I/O via the block device provides maximum throughput to a HDD.