Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.4.4
    • Fix Version/s: None
    • Component/s: master, test

      Description

      While running bulk randomwalk against 1.4.5-SNAPSHOT, I got this in verification:

      Caused by: java.lang.Exception: Bad key at r00000 cf:000 [] 1394658887772 false 1
              at org.apache.accumulo.server.test.randomwalk.bulk.Verify.visit(Verify.java:65)
      

      Possible reasons:

      I see in the logs three internal errors from imports that failed due to the masters being restarted. The failure timing is around 5 seconds after the masters restart. Example:

      12 14:10:17,580 [bulk.BulkMinusOne] ERROR: org.apache.accumulo.core.client.AccumuloException: Internal error processing waitForTableOperation
      org.apache.accumulo.core.client.AccumuloException: Internal error processing waitForTableOperation
              at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:290)
              at org.apache.accumulo.core.client.admin.TableOperationsImpl.doTableOperation(TableOperationsImpl.java:258)
              at org.apache.accumulo.core.client.admin.TableOperationsImpl.importDirectory(TableOperationsImpl.java:947)
              at org.apache.accumulo.server.test.randomwalk.bulk.BulkPlusOne.bulkLoadLots(BulkPlusOne.java:99)
              at org.apache.accumulo.server.test.randomwalk.bulk.BulkMinusOne.runLater(BulkMinusOne.java:29)
      ...
      Caused by: org.apache.thrift.TApplicationException: Internal error processing waitForTableOperation
      

      Two BulkMinusOne steps and one BulkPlusOne step failed, which may be why the offending row was left at value 1.

      The TableOperationsImpl.waitForTableOperation method does not catch TApplicationException, so the imports fail.
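      For what it's worth, the fix would presumably be to treat TApplicationException as a transient failure and retry the table operation rather than letting the exception escape. A stand-alone sketch of that shape (not Accumulo code; TransientOpException here is just a stand-in for the Thrift exception, so the sketch compiles on its own):

```java
import java.util.concurrent.Callable;

// Minimal sketch (not Accumulo code): retry a table operation when a
// transient failure surfaces, instead of letting it propagate and fail
// the bulk import outright. TransientOpException stands in for
// org.apache.thrift.TApplicationException to keep the sketch self-contained.
public class RetrySketch {
    static class TransientOpException extends Exception {
        TransientOpException(String msg) { super(msg); }
    }

    static <T> T withRetries(Callable<T> op, int maxAttempts) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (TransientOpException e) {
                last = e;  // transient: remember the failure and retry
            }
        }
        throw last;  // exhausted retries; surface the last failure
    }
}
```

      A real version would likely also back off between attempts, and distinguish retryable Thrift errors from genuine operation failures.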

      I see lots of previous work on this sort of error in ACCUMULO-334 and ACCUMULO-2110. If anyone has troubleshooting tips I'd be happy to hear them.


          Activity

          Bill Havanki added a comment -

          There are 121 markers for row r00000, each with value 1. Marker 00018 is missing. That is one of the BulkMinusOne steps that failed. The other two correspond to markers 00019 and 00022, which are somehow present in the table.

          Bill Havanki added a comment -

          There is a bulk import failure in the master shortly after the restart.

          2014-03-12 14:10:17,330 [thrift.MasterClientService$Processor] ERROR: Internal error processing waitForTableOperation
          java.lang.RuntimeException: java.io.IOException: Filesystem closed
          ...
          Caused by: java.io.IOException: Filesystem closed
          ...
                  at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:766)
                  at org.apache.accumulo.server.trace.TraceFileSystem.create(TraceFileSystem.java:104)
                  at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:582)
                  at org.apache.accumulo.server.master.tableOps.LoadFiles.call(BulkImport.java:457)
                  at org.apache.accumulo.server.master.tableOps.TraceRepo.call(TraceRepo.java:65)
                  at org.apache.accumulo.server.fate.Fate$TransactionRunner.run(Fate.java:65)
                  ... 2 more
          

          The line in LoadFiles causing the problem:

          FSDataOutputStream failFile = fs.create(new Path(errorDir, "failures.txt"), true);
          

          Somehow the filesystem reference generated at the start of the action is closed before the action is done. However, the exception is thrown after tablet servers are asked to do bulk imports, and none of them indicate any trouble performing the bulk import, so I wonder why that marker 18 didn't show up. I don't know enough about this mechanism to hazard a good guess, but it could be that this error is not what caused the problem.

          Still, it's my best lead. There are exactly two other waitForTableOperation failures right after this one, but they fail due to interruption of some sort, before the tablet servers are asked to import. Maybe the master is able to try again successfully for these.
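          For context on the "Filesystem closed" part: Hadoop's FileSystem.get() hands out a cached instance that is shared across the JVM, so a close() from any other holder (a shutdown hook, another action finishing up) invalidates the reference LoadFiles is still using. A self-contained analogy of that failure mode (plain Java, not Hadoop code):

```java
import java.util.HashMap;
import java.util.Map;

// Stand-alone analogy (not Hadoop code) for how a cached, shared handle can
// be closed out from under a second holder. Hadoop's FileSystem.get()
// similarly returns the same cached instance to every caller for a given
// (scheme, authority, user), so one holder's close() breaks all of them.
public class SharedHandleSketch {
    static class Handle {
        private boolean open = true;
        void close() { open = false; }
        void write() {
            if (!open) throw new IllegalStateException("Filesystem closed");
        }
    }

    private static final Map<String, Handle> cache = new HashMap<>();

    static Handle get(String key) {
        // Every caller asking for the same key shares one instance.
        return cache.computeIfAbsent(key, k -> new Handle());
    }
}
```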

          Eric Newton added a comment - edited

          The file system is probably being closed by the really annoying shutdown hook.

          You may need to trace through the METADATA table updates by decoding write-ahead logs to find out what happened. It can take days.
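          If it is the FileSystem cache's shutdown hook doing the closing, one mitigation to try (an assumption on my part, untested against this failure) is disabling the automatic close in core-site.xml; the other option would be for the master-side code to use FileSystem.newInstance() so it isn't sharing the cached handle:

```xml
<!-- core-site.xml sketch: fs.automatic.close (default true) controls the
     FileSystem cache's JVM shutdown hook. Setting it false means cached
     filesystems are not closed automatically at shutdown, and the process
     must close them explicitly. -->
<property>
  <name>fs.automatic.close</name>
  <value>false</value>
</property>
```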

          Bill Havanki added a comment -

          How could a shutdown hook be doing something when the master is just coming up?

          I doubt I'm capable of doing that metadata trace stuff - I still barely understand any details about the METADATA table, and never mind decoding WALs.

          Mike Drob added a comment -

          Development against the 1.4 branch is EOL; please re-open if you see this with a newer version.


            People

            • Assignee: Unassigned
            • Reporter: Bill Havanki
            • Votes: 0
            • Watchers: 3
