Details
Type: Bug
Status: Resolved
Priority: Critical
Resolution: Cannot Reproduce
Description
Noticed this one on a user's 1.7-ish system.
A number of tablets (~9) were unassigned and reported on the Monitor as having failed to load. Digging into the exception, we could see the tablet load failed due to a FileNotFoundException:
2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
    at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
    at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
    at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
    at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
    at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
    ... 9 more
Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
    at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
    at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
    at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
    at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
    at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
    at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
    at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
    at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
    at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
    ... 11 more
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
Upon further investigation of the recovery directory in HDFS for this WAL, we find the following:
$ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
-rwxr--r--   3 accumulo hdfs       0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
-rwxr--r--   3 accumulo hdfs       0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
drwxr-xr-x   - accumulo hdfs       0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
-rw-r--r--   3 accumulo hdfs 8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
-rw-r--r--   3 accumulo hdfs     642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
drwxr-xr-x   - accumulo hdfs       0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
-rw-r--r--   3 accumulo hdfs 8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
-rw-r--r--   3 accumulo hdfs     524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
drwxr-xr-x   - accumulo hdfs       0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
-rw-r--r--   3 accumulo hdfs 8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
-rw-r--r--   3 accumulo hdfs     584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
drwxr-xr-x   - accumulo hdfs       0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
-rw-r--r--   3 accumulo hdfs 8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
-rw-r--r--   3 accumulo hdfs     630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
drwxr-xr-x   - accumulo hdfs       0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
-rw-r--r--   3 accumulo hdfs 4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
-rw-r--r--   3 accumulo hdfs     408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
The strange thing here is that we have both finished and failed markers in this WAL's recovery directory. Given the timestamps, it appears that TServer1 tried to do recovery, failed for some reason, and then TServer2 came along and successfully completed the LogSort.
However, when the merged read of the sorted files came along, it treated the failed marker as a sorted chunk and tried to open failed/data as a MapFile, which does not exist, hence the FileNotFoundException above.
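One way to picture the failure mode, and a defensive alternative, is a reader that only accepts entries that actually look like sorted chunks. The sketch below is illustrative only: it uses java.nio.file on a local directory as a stand-in for HDFS, and the class and method names (SortedChunkFilter, sortedChunks) are hypothetical, not Accumulo's actual code.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Accumulo. Local-filesystem stand-in for HDFS.
final class SortedChunkFilter {

    /**
     * Returns the children of recoveryDir that look like sorted chunks:
     * directories that contain a regular file named "data".
     * Zero-length marker files such as "failed" and "finished" are skipped.
     */
    static List<Path> sortedChunks(Path recoveryDir) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (DirectoryStream<Path> children = Files.newDirectoryStream(recoveryDir)) {
            for (Path child : children) {
                if (Files.isDirectory(child) && Files.isRegularFile(child.resolve("data"))) {
                    chunks.add(child);
                }
            }
        }
        // Directory listing order is unspecified, so sort for a stable result.
        Collections.sort(chunks);
        return chunks;
    }
}
```

Against a layout like the listing above, this would return only the part-r-* directories and never attempt to open failed/data.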
I think the simple solution would be to whack the recovery directory if it exists before running the LogSorter.
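A rough sketch of that idea follows, again using java.nio.file on a local directory as a stand-in for HDFS. RecoveryDirCleaner and deleteIfExists are hypothetical names; against HDFS the same step would presumably be a recursive delete through Hadoop's FileSystem.delete(path, true) rather than a manual walk. Where exactly this would hook into the LogSorter is an assumption and is not shown.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Accumulo. Local-filesystem stand-in for HDFS.
final class RecoveryDirCleaner {

    /**
     * Removes a stale recovery directory, including any leftover markers and
     * partial chunks, so that a fresh sort starts from an empty location.
     * Returns true if something was deleted, false if the directory was absent.
     */
    static boolean deleteIfExists(Path recoveryDir) throws IOException {
        if (!Files.exists(recoveryDir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(recoveryDir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }
}
```

Calling this before each sort attempt would mean a failed marker left by an earlier attempt cannot coexist with the output of a later, successful one.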
Obligatory: I don't know if the branches in Apache are identical to the fork I'm looking at. Identifying all relevant branches is a necessary step here.