Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Description
I'm simulating a disk failure of 1 sector with the following setup:
dd if=/dev/zero of=err.img bs=512 count=2097152
losetup /dev/loop0 err.img
dmsetup create err0 <<EOF
0 1024000 linear /dev/loop0 0
1024000 1 error
1024001 1073151 linear /dev/loop0 1024001
EOF
dmsetup mknodes err0
The commands above create a 1 GiB disk and simulate an error on a single 512-byte sector at the 500 MiB offset (sector 1024000).
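The table arithmetic above can be sketched as a small helper. This is only an illustration: `make_error_table` is a hypothetical name, and it just prints the table text that would be piped to `dmsetup create`, without touching any device.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table that exposes DEV
# with exactly one failing 512-byte sector.
# Arguments: device path, total size in sectors, index of the bad sector.
make_error_table() {
    dev=$1
    total=$2
    bad=$3
    after=$((bad + 1))
    # Sectors before the bad one map straight through to the device.
    if [ "$bad" -gt 0 ]; then
        echo "0 $bad linear $dev 0"
    fi
    # The "error" target fails every I/O that touches this sector.
    echo "$bad 1 error"
    # The remaining sectors map through at the same offset.
    if [ "$after" -lt "$total" ]; then
        echo "$after $((total - after)) linear $dev $after"
    fi
}

# 1 GiB = 2097152 sectors; 500 MiB / 512 bytes = sector 1024000.
make_error_table /dev/loop0 2097152 $((500 * 1024 * 1024 / 512))
```

The output could be fed to `dmsetup create err0` on stdin in place of the heredoc.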
storage.config:
/dev/mapper/err0
I have a tool that randomly generates URLs, stores them, and requests them back with a certain probability, so that I both write to and read from the disk at a given expected hit ratio.
Once I hit the 500 MiB mark, trafficserver keeps emitting warnings about disk errors. I suspect this is because trafficserver keeps writing to that bad sector instead of skipping it.
These are the errors/warnings I'm seeing in the log repeatedly:
[Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
[Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
[Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
...
[Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
[Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
[Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
...
Summary: trafficserver appears to treat I/O errors as temporary rather than permanent. Is this true? If so, the only ways to recover are to either:
1. Replace the hard disk
2. Use device-mapper to skip the bad sector.
Either way, a whole disk cache of terabytes is thrown away because of a single bad sector.
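For reference, the device-mapper workaround from option 2 could look roughly like the sketch below. It is not something trafficserver provides: `make_skip_table` is a hypothetical helper and `/dev/sdX` is a placeholder device. It joins two `linear` segments so that the bad sector is never mapped. The resulting device is one sector shorter and everything after the bad sector shifts by one, which is presumably why the existing cache contents cannot be kept.

```shell
#!/bin/sh
# Hypothetical helper: print a device-mapper table that maps around one
# bad sector by joining the regions before and after it.
# Arguments: device path, total size in sectors, index of the bad sector.
make_skip_table() {
    dev=$1
    total=$2
    bad=$3
    after=$((bad + 1))
    # Region before the bad sector, mapped 1:1.
    if [ "$bad" -gt 0 ]; then
        echo "0 $bad linear $dev 0"
    fi
    # Region after the bad sector, shifted down by one sector so the
    # mapped device is contiguous and never touches the bad sector.
    if [ "$after" -lt "$total" ]; then
        echo "$bad $((total - after)) linear $dev $after"
    fi
}

# Placeholder device and geometry, reusing the numbers from the test setup.
make_skip_table /dev/sdX 2097152 1024000
```

The printed table would be passed to `dmsetup create` on stdin, and storage.config would then point at the new mapped device instead of the raw disk.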
If this is what's really happening, is it feasible to skip the bad sector? If so, I could work on a patch.