Accumulo / ACCUMULO-4851

WAL recovery directory should be deleted before running LogSorter

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: tserver
    • Labels: None

    Description

      Noticed this one on a user's 1.7-ish system.

      A number of tablets (~9) were unassigned and reported on the Monitor as having failed to load. Digging into the exception, we could see the tablet load failed due to a FileNotFoundException:

      2018-04-09 19:57:08,475 [tserver.TabletServer] WARN : exception trying to assign tablet xk;... /accumulo/tables/xk/t-00pyzd0
      java.lang.RuntimeException: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
              at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:640)
              at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:449)
              at org.apache.accumulo.tserver.TabletServer$AssignmentHandler.run(TabletServer.java:2156)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at org.apache.accumulo.tserver.ActiveAssignmentRunnable.run(ActiveAssignmentRunnable.java:61)
              at org.apache.htrace.wrappers.TraceRunnable.run(TraceRunnable.java:57)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
              at org.apache.accumulo.fate.util.LoggingRunnable.run(LoggingRunnable.java:35)
              at java.lang.Thread.run(Thread.java:748)
      Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
              at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:480)
              at org.apache.accumulo.tserver.TabletServer.recover(TabletServer.java:3012)
              at org.apache.accumulo.tserver.tablet.Tablet.<init>(Tablet.java:590)
              ... 9 more
      Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
              at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1446)
              at org.apache.hadoop.hdfs.DistributedFileSystem$26.doCall(DistributedFileSystem.java:1438)
              at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
              at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1454)
              at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1823)
              at org.apache.hadoop.io.MapFile$Reader.createDataFileReader(MapFile.java:456)
              at org.apache.hadoop.io.MapFile$Reader.open(MapFile.java:429)
              at org.apache.hadoop.io.MapFile$Reader.<init>(MapFile.java:399)
              at org.apache.accumulo.tserver.log.MultiReader.<init>(MultiReader.java:113)
              at org.apache.accumulo.tserver.log.SortedLogRecovery.recover(SortedLogRecovery.java:105)
              at org.apache.accumulo.tserver.log.TabletServerLogger.recover(TabletServerLogger.java:478)
              ... 11 more
      2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : java.io.IOException: java.io.FileNotFoundException: File does not exist: /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed/data
      2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : failed to open tablet xk;... reporting failure to master
      2018-04-09 19:57:08,476 [tserver.TabletServer] WARN : rescheduling tablet load in 600.00 seconds
      

      Upon further investigation of the recovery directory in HDFS for this WAL, we find the following:

      $ hdfs dfs -ls -R /accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/
      -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:12 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/failed
      -rwxr--r--   3 accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/finished
      drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000
      -rw-r--r--   3 accumulo hdfs    8040761 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/data
      -rw-r--r--   3 accumulo hdfs        642 2018-04-06 22:09 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00000/index
      drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001
      -rw-r--r--   3 accumulo hdfs    8540196 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/data
      -rw-r--r--   3 accumulo hdfs        524 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00001/index
      drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002
      -rw-r--r--   3 accumulo hdfs    8150879 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/data
      -rw-r--r--   3 accumulo hdfs        584 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00002/index
      drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003
      -rw-r--r--   3 accumulo hdfs    8438021 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/data
      -rw-r--r--   3 accumulo hdfs        630 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00003/index
      drwxr-xr-x   - accumulo hdfs          0 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004
      -rw-r--r--   3 accumulo hdfs    4956770 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/data
      -rw-r--r--   3 accumulo hdfs        408 2018-04-06 22:10 accumulo/recovery/0421c824-5e48-4bad-917a-b54a34a45849/part-r-00004/index
      

      The strange thing here is that we have both finished and failed markers in this WAL's recovery directory. Given the timestamps, it appears that TServer1 tried to do recovery, failed for some reason, and then TServer2 came along and successfully completed the LogSort.

      However, when the merged read of the sorted files ran, it treated the failed marker as a sorted chunk, tried to open failed/data (the path in the stack trace above), and hit the FileNotFoundException.
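
      To make the failure mode concrete, here is a minimal Python sketch that uses a local directory as a stand-in for the HDFS recovery directory. It is not Accumulo code: the function names are hypothetical, and the "naive" reader assumes the real one skips only the finished marker, which is inferred from the stack trace rather than confirmed against the source.

```python
import os
import tempfile

# Zero-length marker files seen in the listing above.
MARKERS = {"finished", "failed"}

def naive_chunks(recovery_dir):
    """Assumed buggy behaviour: skip only 'finished', treat every other child as a chunk."""
    return sorted(
        os.path.join(recovery_dir, name, "data")
        for name in os.listdir(recovery_dir)
        if name != "finished"
    )

def sorted_chunks(recovery_dir):
    """Keep only real chunk directories, i.e. those that actually hold a 'data' file."""
    chunks = []
    for name in sorted(os.listdir(recovery_dir)):
        data = os.path.join(recovery_dir, name, "data")
        if name not in MARKERS and os.path.isfile(data):
            chunks.append(data)
    return chunks

def make_example_dir():
    """Build a small local imitation of the listing: both markers plus two chunks."""
    root = tempfile.mkdtemp(prefix="recovery-")
    for marker in ("finished", "failed"):
        open(os.path.join(root, marker), "w").close()
    for part in ("part-r-00000", "part-r-00001"):
        os.mkdir(os.path.join(root, part))
        for leaf in ("data", "index"):
            open(os.path.join(root, part, leaf), "w").close()
    return root
```

      With both markers present, naive_chunks() includes a failed/data path that does not exist, which mirrors the exception above, while sorted_chunks() returns only the part-r-* data files.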

      I think the simple solution would be to whack the recovery directory if it exists before running the LogSorter.
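
      A rough Python sketch of that idea, again against a local directory rather than HDFS. Both reset_recovery_dir() and sort_wal() are hypothetical names for illustration, not Accumulo APIs, and the sketch assumes only one server works on a given recovery directory at a time.

```python
import os
import shutil

def reset_recovery_dir(recovery_dir):
    """Remove leftovers from any earlier attempt, then recreate the directory empty."""
    if os.path.exists(recovery_dir):
        shutil.rmtree(recovery_dir)
    os.makedirs(recovery_dir)

def sort_wal(recovery_dir, write_chunks):
    """Hypothetical sort driver: clean first, write sorted chunks, then mark finished."""
    reset_recovery_dir(recovery_dir)
    write_chunks(recovery_dir)  # caller-supplied stand-in for the actual log sort
    open(os.path.join(recovery_dir, "finished"), "w").close()
```

      Because the directory is wiped before each attempt, a stale failed marker from an earlier run cannot end up next to a fresh finished marker.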

      Obligatory caveat: I don't know whether the Apache branches match the fork I'm looking at, so identifying all affected branches is a necessary first step.

          People

            Assignee: Josh Elser (elserj)
            Reporter: Josh Elser (elserj)