Uploaded image for project: 'Traffic Server'
  1. Traffic Server
  2. TS-4834

Expose bad disk and disk access failures

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 7.1.0
    • Cache, Metrics
    • None

    Description

      We would like to monitor low-level disk access failures and disks marked by ATS as bad.

      I have a patch that exposes that information through

      proxy.process.cache.disk_error_count 10
      proxy.process.cache.disk_bad_count 5
      

      and the following tests/shows how it would work...

      Start ATS with 2 disks and tail diags.log

      $ cat etc/trafficserver/storage.config
      /dev/sdb
      /dev/sdc
      
      $ tail -f var/log/trafficserver/diags.log
      [Sep  8 12:18:48.149] Server {0x2b5f43db54c0} NOTE: traffic server running
      [Sep  8 12:18:48.198] Server {0x2b5f44654700} NOTE: cache enabled
      

      Check related metrics and observe all 0s

      $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
      proxy.process.cache.disk_error_count 0
      proxy.process.cache.disk_bad_count 0
      proxy.process.cache.read.failure 0
      proxy.process.cache.write.failure 0
      proxy.process.cache.volume_0.read.failure 0
      proxy.process.cache.volume_0.write.failure 0
      proxy.process.http.cache_write_errors 0
      proxy.process.http.cache_read_errors 0
      

      Now using your favorite hard disk failure injection tool inject 10 failures, by setting both disks used by this setup /dev/sdb and /dev/sdc to fail all reads. And shoot 5 requests causing 10 failed reads.

      $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
      
      $ tail -f var/log/trafficserver/diags.log
      [Sep  8 12:19:09.758] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.759] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.764] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [1/10]
      [Sep  8 12:19:09.769] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [2/10]
      [Sep  8 12:19:09.785] Server {0x2aaac0100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.786] Server {0x2aaab4302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.791] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [3/10]
      [Sep  8 12:19:09.796] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [4/10]
      [Sep  8 12:19:09.812] Server {0x2aaab4100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.813] Server {0x2aaacc100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.817] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [5/10]
      [Sep  8 12:19:09.823] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [6/10]
      [Sep  8 12:19:09.843] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.844] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.847] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdb [7/10]
      [Sep  8 12:19:09.854] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [8/10]
      [Sep  8 12:19:09.874] Server {0x2aaacc302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.875] Server {0x2aaad8100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:19:09.880] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdb [9/10]
      [Sep  8 12:19:09.887] Server {0x2b5f44654700} WARNING: too many errors accessing disk /dev/sdb [10/10]: declaring disk bad
      

      We see 5 read failures which triggered 10 actually disk reads and marked the failing disk as a bad disk.

      $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
      proxy.process.cache.disk_error_count 10
      proxy.process.cache.disk_bad_count 1
      proxy.process.cache.read.failure 5
      proxy.process.cache.write.failure 5
      proxy.process.cache.volume_0.read.failure 5
      proxy.process.cache.volume_0.write.failure 5
      proxy.process.http.cache_write_errors 0
      proxy.process.http.cache_read_errors 0
      

      Now shoot 5 requests causing 10 failed reads.

      $ for i in 1 2 3 4 5; do curl -x 127.0.0.1:80 http://example.com/1 -o /dev/null -s; done
      
      $ tail -f var/log/trafficserver/diags.log
      [Sep  8 12:26:02.874] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.875] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.876] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [1/10]
      [Sep  8 12:26:02.885] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [2/10]
      [Sep  8 12:26:02.902] Server {0x2aaaf0302700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.902] Server {0x2aaae4100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.907] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [3/10]
      [Sep  8 12:26:02.914] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [4/10]
      [Sep  8 12:26:02.936] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.936] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.940] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [5/10]
      [Sep  8 12:26:02.946] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [6/10]
      [Sep  8 12:26:02.971] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.971] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:02.976] Server {0x2b5f43db54c0} WARNING: Error accessing Disk /dev/sdc [7/10]
      [Sep  8 12:26:02.982] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [8/10]
      [Sep  8 12:26:03.002] Server {0x2aaafc100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:03.002] Server {0x2aab08100700} WARNING: cache disk operation failed READ -1 0
      [Sep  8 12:26:03.006] Server {0x2b5f44654700} WARNING: Error accessing Disk /dev/sdc [9/10]
      [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: too many errors accessing disk /dev/sdc [10/10]: declaring disk bad
      

      We see 5 more read failures which triggered 10 more actual failed disk reads and marked the second disk as bad.

      $ ./bin/traffic_ctl metric match "proxy.process.cache*.disk.*" "proxy.process.cache.*(read|write).failure" "proxy.process.http.cache_(read|write)_errors"
      proxy.process.cache.disk_error_count 20
      proxy.process.cache.disk_bad_count 2
      proxy.process.cache.read.failure 10
      proxy.process.cache.write.failure 10
      proxy.process.cache.volume_0.read.failure 10
      proxy.process.cache.volume_0.write.failure 10
      proxy.process.http.cache_write_errors 0
      proxy.process.http.cache_read_errors 0
      

      After marking all the (2) disks as bad finally ATS disabled the cache and started going to origin on every requests.

      $ tail -f var/log/trafficserver/diags.log
      [Sep  8 12:26:03.012] Server {0x2b5f43db54c0} WARNING: All storage devices offline, cache disabled
      

      Attachments

        Issue Links

          Activity

            People

              gancho Gancho Tenev
              gancho Gancho Tenev
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 3h 50m
                  3h 50m