Traffic Server / TS-4242

Permanent disk failures are not handled gracefully

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Cache
    • Labels: None

    Description

      I'm simulating a disk failure of 1 sector with the following setup:

      dd if=/dev/zero of=err.img bs=512 count=2097152
      losetup /dev/loop0 err.img
      dmsetup create err0 <<EOF
      0 1024000 linear /dev/loop0 0
      1024000 1 error
      1024001 1073151 linear /dev/loop0 1024001
      EOF
      dmsetup mknodes err0
      

      The commands above create a 1 GiB disk and simulate an I/O error on a single 512-byte sector at the 500 MiB offset.
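
The sector arithmetic behind that table can be reproduced with a small helper. This is only a sketch: `dm_error_table` is a hypothetical name, not part of dmsetup, and it only prints a table in the format used above without touching any device.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table that maps a device
# linearly except for one failing sector. All units are 512-byte sectors.
dm_error_table() {
    et_dev=$1      # backing device, e.g. /dev/loop0
    et_total=$2    # total size of the device in sectors
    et_bad=$3      # index of the sector that should return I/O errors

    et_after=$((et_bad + 1))            # first good sector after the bad one
    et_rest=$((et_total - et_after))    # sectors remaining after the bad one

    # Segment before the bad sector (omitted if the bad sector is sector 0).
    [ "$et_bad" -gt 0 ] && echo "0 $et_bad linear $et_dev 0"
    # The single failing sector.
    echo "$et_bad 1 error"
    # Segment after the bad sector (omitted if the bad sector is the last one).
    [ "$et_rest" -gt 0 ] && echo "$et_after $et_rest linear $et_dev $et_after"
    return 0
}

# 1 GiB is 2097152 sectors; the 500 MiB offset is sector 1024000.
total_sectors=$((1024 * 1024 * 1024 / 512))
bad_sector=$((500 * 1024 * 1024 / 512))
dm_error_table /dev/loop0 "$total_sectors" "$bad_sector"
```

The three segment lengths add up to the full 2097152 sectors, so the mapped device has the same size as the backing image.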

      storage.config:

      /dev/mapper/err0
      

      I have a tool that generates random URLs, stores them, and requests them back with a given probability, so that it both writes to and reads from the disk at a chosen expected hit ratio.
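
The tool itself is not attached, so the following is only a rough stand-in for the behaviour described. Everything in it is an assumption: the proxy address, the `origin.example` host name, the `next_url` and `rand_u32` helpers, and that Traffic Server is configured to proxy and cache those URLs.

```shell
#!/bin/sh
# Hypothetical stand-in for the load tool (the original is not shown).
PROXY=${PROXY:-http://127.0.0.1:8080}   # assumed Traffic Server address
ORIGIN=${ORIGIN:-origin.example}        # assumed origin host name
HIT_PERCENT=${HIT_PERCENT:-50}          # chance of re-requesting a known URL
REQUESTS=${REQUESTS:-0}                 # 0 = define the helpers, send nothing

count=0   # number of distinct URLs generated so far

# Print a random unsigned 32-bit integer read from /dev/urandom.
rand_u32() {
    od -An -N4 -tu4 /dev/urandom | tr -d ' \n'
}

# Set $url to a known URL with HIT_PERCENT probability, else to a new one.
next_url() {
    if [ "$count" -gt 0 ] && [ $(( $(rand_u32) % 100 )) -lt "$HIT_PERCENT" ]; then
        n=$(( $(rand_u32) % count + 1 ))   # re-request: expected cache read
    else
        count=$((count + 1))               # new object: expected cache write
        n=$count
    fi
    url="http://$ORIGIN/obj-$n"
}

i=0
while [ "$i" -lt "$REQUESTS" ]; do
    next_url
    curl -s -o /dev/null -x "$PROXY" "$url"   # fetch through the proxy
    i=$((i + 1))
done
```

Running it with, for example, `REQUESTS=100000 HIT_PERCENT=30` would mix new objects and repeats at roughly that ratio.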

      Once writes reach the 500 MiB mark, Traffic Server keeps logging disk-error warnings. I suspect this is because it keeps trying to write to the bad sector instead of skipping it.

      These are the errors/warnings I'm seeing in the log repeatedly:

      [Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
      [Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
      [Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
      [Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
      ...
      [Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
      [Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
      [Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
      [Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
      ...
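
To see how the warnings are distributed, the log can be summarized by the source tag in angle brackets. A minimal sketch, assuming the log format shown above; `summarize_warnings` is a hypothetical helper, not a Traffic Server tool. The trailing `-1 5` on the AIO lines looks like a return value followed by an errno, and errno 5 is EIO on Linux.

```shell
#!/bin/sh
# Hypothetical helper: count WARNING lines per source tag, i.e. the
# "<File.cc:line (function)>" part of each line. Reads the files given
# as arguments, or standard input when called without arguments.
summarize_warnings() {
    awk '/WARNING:/ && match($0, /<[^>]*>/) {
             n[substr($0, RSTART, RLENGTH)]++   # key on the whole <...> tag
         }
         END { for (k in n) printf "%d %s\n", n[k], k }' "$@" | sort -rn
}
```

It could be called as `summarize_warnings diags.log`, where the log file name is an assumption about the local setup.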
      

      Summary: Traffic Server appears to treat I/O errors as temporary rather than permanent. Is this true? If so, the only options are:
      1. Replace the hard disk.
      2. Use device mapper to skip the bad sector.

      Either option means throwing away a whole disk cache of terabytes because of a single bad sector.
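
For reference, option 2 could look roughly like the following: a table made of two linear segments that leaves the bad sector unmapped. `dm_skip_table` is a hypothetical helper that only prints the table, and the sizes are the ones from the test setup.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table that skips one bad
# sector by joining the good ranges before and after it.
# All units are 512-byte sectors.
dm_skip_table() {
    sk_dev=$1      # backing device, e.g. /dev/loop0
    sk_total=$2    # total size of the backing device in sectors
    sk_bad=$3      # index of the bad sector on the backing device

    sk_after=$((sk_bad + 1))            # first good sector after the bad one
    sk_rest=$((sk_total - sk_after))    # sectors remaining after the bad one

    # Good range before the bad sector, mapped at its original position.
    [ "$sk_bad" -gt 0 ] && echo "0 $sk_bad linear $sk_dev 0"
    # Good range after the bad sector, mapped one sector earlier.
    [ "$sk_rest" -gt 0 ] && echo "$sk_bad $sk_rest linear $sk_dev $sk_after"
    return 0
}

dm_skip_table /dev/loop0 2097152 1024000
```

The mapped device is one sector smaller than the backing device, and every sector after the gap shifts down by one, which is presumably why the existing cache contents would not survive this workaround.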

      If this is what's really happening, is it feasible to skip the bad sector? If so, I could work on a patch.

          People

            Assignee: Unassigned
            Reporter: Luca Bruno (lethalman)