0

I have an environment running SQL Server on a Windows VMWare platform using a SAN with SSDs set up in RAID 6, and using Veeam for server backups and LiteSpeed for SQL Server backups.

Several times over the past year, the database has slowed to a crawl. When it does, Avg. Disk Queue Length is high, but Disk Bytes/sec is far below what the storage can sustain.

Here's the Performance Monitor on the database server. When this problem happens, the Avg. Disk Queue Length is always in the range of several hundred, and the Disk Bytes/sec stays around 5-15 MB/sec. During normal operation (when this problem isn't happening), Disk Bytes/sec goes as high as 900 MB/sec or so.

[Image: Performance Monitor screenshot from the database server]

In the time since this problem started happening, I have replaced the SAN hardware -- including the switches. But the problem continues on the new hardware.

My theory has been that this isn't a SQL Server problem, because if the problem was that SQL Server was saturating the disk I/O, I should see much higher Disk Bytes/sec. But whenever this problem happens, Disk Bytes/sec is always very low.
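The reasoning above can be made concrete with Little's law (average queue length = arrival rate × average time in the system). The following is a minimal sketch, not a measurement: the 64 KiB average I/O size is an assumption (Performance Monitor's Avg. Disk Bytes/Transfer counter would give the real value), and the function name is hypothetical. The queue length and throughput are picked from the ranges observed during the slowdown.

```python
def implied_latency_seconds(avg_queue_length, bytes_per_sec, io_size_bytes):
    """Estimate average per-I/O latency from Little's law: L = lambda * W."""
    if bytes_per_sec <= 0 or io_size_bytes <= 0:
        raise ValueError("throughput and I/O size must be positive")
    iops = bytes_per_sec / io_size_bytes   # lambda: I/Os completed per second
    return avg_queue_length / iops         # W = L / lambda


ASSUMED_IO_SIZE = 64 * 1024  # assumption: 64 KiB per I/O, not measured

# Values in the range seen during the slowdown:
# queue length of several hundred, throughput around 10 MB/sec.
latency = implied_latency_seconds(
    avg_queue_length=300,
    bytes_per_sec=10 * 1024**2,
    io_size_bytes=ASSUMED_IO_SIZE,
)
print(f"Implied average latency per I/O: {latency:.3f} s")
```

Under these assumptions each I/O would be taking on the order of seconds, which points at per-request latency somewhere in the storage path rather than at throughput saturation. The Avg. Disk sec/Transfer counter should show this directly.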

I thought maybe it was the backup software -- either running on the database server or running on another server that's making use of the same VMWare/SAN -- but neither the server backups nor the SQL Server backups seem to be running while this problem is happening.

My last thought is this is a problem with VMWare, but I've contacted them and so far they haven't been able to help.

Rebooting the database server fixes the problem. Sometimes the problem will happen again within a day, and sometimes the problem doesn't happen again for months. Whenever the problem happens, I'm not aware of anything outside of the normal workload running on the database.

What could be causing this problem where the disk throughput slows to around 1% of what it should be capable of?

2 Answers

2

HDDs become slower the longer their work queue becomes and vice versa - there's a very limited number of IOPS that you can throw at them (roughly 40-200, depending on grade and RPM). Any increase of demand beyond that point decreases their performance further.

Creating an HDD array increases the total number of possible read IOPS across the array, but usually less than simply summing up their individual IOPS. Write IOPS are more complex and depend heavily on the RAID level, caching etc. as well.
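As a rough illustration of how RAID level affects write IOPS, the sketch below uses the common rule-of-thumb write penalties (for example, 6 back-end I/Os per random write on RAID 6). These figures and the function name are assumptions for illustration only. The result is an optimistic upper bound that ignores caching, controller limits, and the less-than-linear scaling mentioned above.

```python
# Rule-of-thumb back-end I/Os per front-end random write (assumed values).
WRITE_PENALTY = {"raid0": 1, "raid1": 2, "raid10": 2, "raid5": 4, "raid6": 6}


def estimated_array_iops(drives, iops_per_drive, read_fraction, level):
    """Upper-bound estimate of front-end IOPS for a mixed read/write load."""
    if not 0.0 <= read_fraction <= 1.0:
        raise ValueError("read_fraction must be between 0 and 1")
    penalty = WRITE_PENALTY[level]
    raw_iops = drives * iops_per_drive          # naive sum over all drives
    write_fraction = 1.0 - read_fraction
    # Each read costs 1 back-end I/O; each write costs `penalty` back-end I/Os.
    return raw_iops / (read_fraction + write_fraction * penalty)


# Hypothetical array: 8 HDDs at 150 IOPS each, 70% reads, RAID 6.
print(round(estimated_array_iops(8, 150, 0.7, "raid6")))
```

With a pure read workload the estimate equals the naive sum of the drives; any write share pulls it down quickly on RAID 6.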

Anything beyond that requires SSDs and appropriate controllers.

4
  • Thank you for the response. The SAN is using SSDs in RAID 6. I'll update my question to specify that. I think it's disk degradation causing this problem, rather than heavy usage causing disk degradation. One reason I think that is that after I reboot the server, I restart the same workloads and it runs without a problem.
    – Ben Rubin
    Oct 26 at 16:32
  • In or outside a RAID, you should always closely monitor your storage to avoid non-obvious degradation problems. See your controller manual for details. Also, closely follow configuration guides for multi-controller or multipathing setups, as not all controllers are happy with n:m multipathing (esp. for iSCSI).
    – Zac67
    Oct 26 at 17:38
  • 1
    "there's a very limited number of IOPS that you can throw at them (roughly 70-200, depending on grade and RPM)" A bit late here, but good luck getting even 70 IOPS out of a consumer-grade 5K rpm SATA drive. Those can be down in the 40-50 IOPS range.
    Nov 17 at 15:55
  • @AndrewHenle Absolutely, and don't even start with SMR drives - I guess on-topic HDDs are at least 7.2k RPM though. ;-)
    – Zac67
    Nov 17 at 17:20
1

Since you're already using SSDs, I'd suggest the issue might be similar to one I've had, where TRIM was not being handled properly by the SSDs. Erasing a data block on an SSD is not instantaneous: preparing a block for reuse can be a slow process. If your free, prepared blocks are exhausted, the array could slow down drastically while new blocks are prepared. Check that your SAN is aware that these are SSDs and that they have background TRIM enabled.
