5

I've recently learned that enabling disk write caching can significantly improve system performance. However, I'm concerned about the potential risks of data corruption or loss in case of a sudden power failure.

Here's some context about my setup:

Operating System: Windows Server 2012 R2

Disk Type: SATA 3.0 HDD

Purpose: I'm considering enabling write caching on my disk to boost performance. My understanding is that data corruption can occur if power fails before data in the write cache has been committed to the disk, if the operating system crashes, or if an application that accesses the data crashes.

During my research, I found the following statement in this article: "Data corruption occurs without the users awareness when the active disks write cache is enabled and the disk performs a Read Look Ahead (RLA), which is prematurely ended." I could not understand the exact meaning of this statement.

Are there any cases of data corruption or file corruption happening after enabling write caching, even when no data write is taking place at the time of the power failure?

3
  • 2
    Learning material recommendations are off-topic. However, that is the reason why you should only use a RAID controller backed by a battery to solve this common issue, IMHO.
    – djdomi
    Sep 6 at 4:08
  • "I could not understand the exact meaning of this statement." The article stated that an IBM driver had a defect that caused the problem. The IBM software invalidated data in the cache, which effectively trashed the data because the operating system had already reported to the application that it had been saved. The data was not saved to the disk. IBM corrected the defect and released an update.
    – Greg Askew
    Sep 6 at 10:17
  • 1
    Use BTRFS or ZFS as the filesystem; these guarantee successful writes, even without battery backup.
    – paladin
    Sep 6 at 17:48

2 Answers

5

Modern file systems (XFS, ZFS, JFS, ext4, APFS, NTFS, etc.) all use journaling, so yes, you’re going to lose some data (the latest commits, plus whatever is still in the cache and not yet committed; that’s obvious), but no, you won’t experience any data corruption.

Here’s some good reading, with lots of diagrams and detailed explanations, about IBM’s JFS. Everything in the article is relevant to the other journaling file systems as well:

https://www.ibm.com/docs/en/aix/7.2?topic=types-journaled-file-system-jfs

Either way… you have to do backups! The so-called “3-2-1 backup rule” is what you should follow.

https://www.starwindsoftware.com/blog/3-2-1-backup-strategy-why-your-data-always-survives

Hope this helped!

-1

Short version: no, with a modern SATA disk and a journaled filesystem it is not possible to corrupt acknowledged (i.e. synced) writes, even when the disk cache is enabled. On the other hand, unsynced (buffered) writes can be lost or corrupted on power loss. However, the article you linked is about a specific firmware issue and does not describe generic behavior when using disk caching:

"While performing extended disk test exercises, a latent firmware issue was discovered."

Long answer: two kinds of writes can be issued:

  • sync writes, which guarantee persistence (and ordering) by leveraging ATA FLUSHes or FUAs;
  • unsynced (buffered) writes, which can be cached, aggregated and reordered by the disk DRAM cache.
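
To make the distinction concrete, here is a minimal Python sketch of the two kinds of writes as seen from an application. The function names are made up for illustration. It assumes `os.fsync` is how the application requests a sync write; whether that request ends up as an ATA FLUSH or FUA on the wire depends on the OS, the filesystem and the driver, so treat this as an illustration of intent rather than a guarantee about what reaches the platters.

```python
import os


def buffered_write(path: str, payload: bytes) -> None:
    """Unsynced write: returns once the data is handed to the OS.

    The data may still sit in the OS page cache or the disk DRAM cache
    for some time, so a power loss right after this call can lose it.
    """
    with open(path, "wb") as f:
        f.write(payload)


def synced_write(path: str, payload: bytes) -> None:
    """Sync write: asks the OS to push the data to stable storage.

    This is slower because the caller waits for the flush to complete.
    """
    with open(path, "wb") as f:
        f.write(payload)
        f.flush()             # drain Python's userspace buffer into the OS
        os.fsync(f.fileno())  # ask the OS to flush this file to the device
```

Both functions leave the same bytes in the file during normal operation; the difference only shows up if power is lost shortly after the call returns.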

When dealing with HDDs and consumer SSDs, sync writes are very slow: flushing every single write means the full per-IO latency is paid on each write. So, sync writes are generally reserved for the most important IOs: journal commits, databases, email delivery, etc. All other, less important writes (e.g. a user file copy) are issued as cached/buffered writes, and that data can be lost if power is lost at the wrong moment (up to 30-60s after the original write).

Note that ancient PATA and SATA drives lied to the OS, pretending to honor syncs while actually discarding the required flushing behavior. This led to the suggestion of totally disabling the disk DRAM cache (or setting it in read-only mode), so that any written data was really stored on the (durable) disk platters. A disk with its cache disabled effectively treats each write as sync, providing maximum safety guarantees at a great performance cost.

Please note that disabling the disk cache does not mean that buffered writes cannot be lost: if a crash happens before the OS has flushed its buffers, all unsynced data will be lost. For this reason, and considering that modern (post-2008) disks honor ATA FLUSHes or (post-2015) FUAs, the current common advice is to leave the disk cache enabled and to rely on the OS to flush important writes.
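
As an illustration of what "flushing important writes" can look like from the application side, here is a hedged Python sketch of the common write-to-temp-file, flush, then rename pattern. The helper name `durable_replace` is made up for this example, and the directory flush at the end is a POSIX-only assumption (it is skipped on Windows). It is one possible approach, not what any particular application or filesystem actually does.

```python
import os
import tempfile


def durable_replace(path: str, payload: bytes) -> None:
    """Replace `path` with `payload` so that a crash leaves either the
    old content or the new content, not a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays
    # on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # sync write: data flushed before rename
        os.replace(tmp_path, path)  # swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    if os.name == "posix":
        # Assumption: on POSIX systems, also flush the directory entry.
        dir_fd = os.open(directory, os.O_RDONLY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

The cost is one flush per update, which is why this pattern tends to be used for configuration or state files rather than for bulk copies.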

SSDs and HW RAID cards with powerloss protection escape this performance/safety tradeoff by having on-board circuitry to safely cache any writes (even sync ones). Anyway, when using a HW RAID card, how the disk cache is managed is implementation dependent (e.g. PERC controllers disable it for SAS disks, but not for SATA ones).

8
  • 2
    Excellent information. Probably worth noting that this particular system is 10+ years old. The referenced IBM article is from 2010.
    – Greg Askew
    Sep 6 at 13:02
  • 2
    @shodanshok You're confusing (volume?) data corruption with a partial data loss.
    – NISMO1968
    Sep 6 at 14:52
  • @NISMO1968 partial data loss (or even just data reordering) can bring down the entire filesystem if some key data structures are affected. This is the very reason filesystems try very hard not to let a journal commit be only partially successful or be reordered. Hence the importance of write barriers: they ensure the journal, and the filesystem, cannot be corrupted by a powerloss/crash - but only if those barriers are honored by the underlying disks. The only exception is for battery-backed, powerloss-protected disks/arrays, where barriers can be safely disabled/ignored.
    – shodanshok
    Sep 6 at 17:25
  • For a similar question regarding writeback cache and possible data corruption, have a look here.
    – shodanshok
    Sep 6 at 17:34
  • @shodanshok You're talking about FAT16, UFS, and maybe ext3... Bottom line: ancient file systems! Modern ones are all COW (Copy-on-Write) and journaling, writing data first and multiple copies of the metadata next, never performing any in-place data updates, so... what you describe never happens.
    – NISMO1968
    Sep 7 at 10:14
