[TS-4242] Permanent disk failures are not handled gracefully - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Cache
Labels: None

Description

I'm simulating a disk failure of 1 sector with the following setup:

dd if=/dev/zero of=err.img bs=512 count=2097152
losetup /dev/loop0 err.img
dmsetup create err0 <<EOF
0 1024000 linear /dev/loop0 0
1024000 1 error
1024001 1073151 linear /dev/loop0 1024001
EOF
dmsetup mknodes err0

The commands above create a 1 GiB disk and simulate an error on a single 512-byte sector at the 500 MiB offset.
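The sector arithmetic behind that table can be double-checked with plain shell arithmetic. This is only a small sketch; the variable names are chosen for illustration and the numbers are copied from the `dd` and `dmsetup` lines above.

```shell
#!/bin/sh
# Sanity-check the dmsetup table, assuming 512-byte sectors throughout.
sector=512
total=2097152                      # count= passed to dd
bad=1024000                        # start sector of the "error" target
tail_start=$((bad + 1))            # first good sector after the bad one
tail_len=$((total - tail_start))   # length of the final linear target

echo "image size:  $((total * sector / 1024 / 1024)) MiB"
echo "bad sector:  byte offset $((bad * sector)), i.e. $((bad * sector / 1024 / 1024)) MiB"
echo "tail target: $tail_start $tail_len"
```

The last line should match the start and length of the third row of the table (`1024001 1073151`), confirming that the three targets cover the whole image with no gap or overlap.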

storage.config:

/dev/mapper/err0

I have a tool that randomly generates URLs, stores them, and requests them back with a certain probability, so that I both write to and read from the disk at a given expected hit ratio.

Once I hit the 500 MiB mark, Traffic Server keeps emitting warnings about disk errors. I suspect this is because it keeps writing to the bad sector instead of skipping it.

These are the errors/warnings I'm seeing in the log repeatedly:

[Feb 29 15:29:33.308] Server {0x2ac3f1cd4700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.309] Server {0x2ac3e56063c0} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1726/100000000]
[Feb 29 15:29:33.320] Server {0x2ac3e56063c0} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 75B41B1A2C85AE637DD6CE368BF783D0
[Feb 29 15:29:33.323] Server {0x2ac3eb480700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 1075CEA6E2E47496BE190DBB448B0B64
...
[Feb 29 15:29:33.284] Server {0x2ac3f28e0700} WARNING: <AIO.cc:410 (cache_op)> cache disk operation failed WRITE -1 5
[Feb 29 15:29:33.287] Server {0x2ac3eb682700} WARNING: <Cache.cc:2089 (handle_disk_failure)> Error accessing Disk /dev/mapper/err0 [1725/100000000]
[Feb 29 15:29:33.289] Server {0x2ac3eb682700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7E3325870F5488955118359E6C4B10F4
[Feb 29 15:29:33.289] Server {0x2ac3eb27e700} WARNING: <CacheRead.cc:1011 (openReadStartHead)> Head : Doc magic does not match for 7AE309F21ABF9B3774C67921018FCA0E
...
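To see how often each warning repeats, the log can be summarized by the function name inside the `<File.cc:line (function)>` tag. The following is a rough sketch that relies only on the line layout visible in the excerpt above; `summarize_warnings` is a hypothetical helper name, not something shipped with Traffic Server. As a side note, the trailing `5` in the `cache_op` lines is presumably an errno value, which on Linux would be `EIO`.

```shell
#!/bin/sh
# Count WARNING lines per source function, reading a log on stdin.
summarize_warnings() {
  awk '/WARNING:/ {
         # find the "(function)>" part of the "<File.cc:line (function)>" tag
         if (match($0, /\([A-Za-z_]+\)>/)) {
           # drop the leading "(" and the trailing ")>"
           fn = substr($0, RSTART + 1, RLENGTH - 3)
           count[fn]++
         }
       }
       END { for (fn in count) print count[fn], fn }' | sort -rn
}
```

It would be used as `summarize_warnings < diags.log`, where the log path depends on the installation. A steadily growing `cache_op` and `handle_disk_failure` count would be consistent with the same sector being retried.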

Summary: Traffic Server appears to treat I/O errors as temporary rather than permanent. Is this true? If so, the only options are:
1. Replace the hard disk.
2. Use device-mapper to skip the bad sector.

Either option means throwing away an entire disk cache, potentially terabytes, because of a single bad sector.
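For reference, the second workaround could look roughly like the sketch below, which prints a device-mapper table made of `linear` targets that map around one bad sector. `skip_table` and the device name `skip0` are hypothetical names used for illustration; the sector numbers are the ones from the test setup above.

```shell
#!/bin/sh
# Print a device-mapper table that maps around one bad sector.
# usage: skip_table DEVICE TOTAL_SECTORS BAD_SECTOR
skip_table() {
  dev=$1
  total=$2
  bad=$3
  # sectors before the bad one are mapped 1:1
  if [ "$bad" -gt 0 ]; then
    echo "0 $bad linear $dev 0"
  fi
  # sectors after the bad one start one sector earlier in the new device
  rest=$((total - bad - 1))
  if [ "$rest" -gt 0 ]; then
    echo "$bad $rest linear $dev $((bad + 1))"
  fi
}

# Table for the simulated disk; pipe it into "dmsetup create skip0" as root.
skip_table /dev/mapper/err0 2097152 1024000
```

The resulting device is one sector shorter, and everything after the bad sector shifts down by one sector, which is why the existing cache contents cannot simply be reused with this approach.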

If this is what's really happening, is it feasible to skip the bad sector? If so, I could work on a patch.

People

Assignee: Unassigned
Reporter: Luca Bruno
Votes: 0
Watchers: 3

Dates

Created: 29/Feb/16 15:31
Updated: 17/Aug/16 17:49
Resolved: 17/Aug/16 17:49