Accumulo / ACCUMULO-1998

Encrypted WALogs seem to be excessively buffering

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.6.0
    • Component/s: None
    • Labels: None

      Description

      The reproduction steps around this are a little bit fuzzy, but basically we ran a moderate workload against a 1.6.0 server. Encryption happened to be turned on, but that doesn't seem to be germane to the problem. After doing a moderate amount of work, Accumulo is refusing to start up, spewing this error over and over to the log:

      2013-12-10 10:23:02,529 [tserver.TabletServer] WARN : exception while doing multi-scan 
      java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.115:9000/accumulo/tables/!0/table_info/A000042x.rf
      	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1125)
      	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
      	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
      	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
      	at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Failed to open hdfs://10.10.1.115:9000/accumulo/tables/!0/table_info/A000042x.rf
      	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:333)
      	at org.apache.accumulo.tserver.FileManager.access$500(FileManager.java:58)
      	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:478)
      	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFileRefs(FileManager.java:466)
      	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:486)
      	at org.apache.accumulo.tserver.Tablet$ScanDataSource.createIterator(Tablet.java:2027)
      	at org.apache.accumulo.tserver.Tablet$ScanDataSource.iterator(Tablet.java:1989)
      	at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:163)
      	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1565)
      	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1672)
      	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1114)
      	... 6 more
      Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/tables/!0/table_info/A000042x.rf
      	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
      	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
      	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
      	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
      	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
      	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:436)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:256)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:143)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:212)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:367)
      	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:143)
      	at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:825)
      	at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
      	at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(FileOperations.java:119)
      	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:314)
      	... 16 more
      

      Here are some other pieces of context:

      HDFS contents:

      ubuntu@ip-10-10-1-115:/data0/logs/accumulo$ hadoop fs -lsr /accumulo/tables/
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:32 /accumulo/tables/!0
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 01:06 /accumulo/tables/!0/default_tablet
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 10:49 /accumulo/tables/!0/table_info
      -rw-r--r--   5 accumulo hadoop       1698 2013-12-10 00:34 /accumulo/tables/!0/table_info/F0000000.rf
      -rw-r--r--   5 accumulo hadoop      43524 2013-12-10 01:53 /accumulo/tables/!0/table_info/F000062q.rf
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:32 /accumulo/tables/+r
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 10:45 /accumulo/tables/+r/root_tablet
      -rw-r--r--   5 accumulo hadoop       2070 2013-12-10 10:45 /accumulo/tables/+r/root_tablet/A0000738.rf
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:33 /accumulo/tables/1
      drwxr-xr-x   - accumulo hadoop          0 2013-12-10 00:33 /accumulo/tables/1/default_tablet
      

      ZooKeeper entries:

      [zk: localhost:2181(CONNECTED) 6] get /accumulo/371cfa3e-fe96-4a50-92e9-da7572589ffa/root_tablet/dir 
      hdfs://10.10.1.115:9000/accumulo/tables/+r/root_tablet
      cZxid = 0x1b
      ctime = Tue Dec 10 00:32:56 EST 2013
      mZxid = 0x1b
      mtime = Tue Dec 10 00:32:56 EST 2013
      pZxid = 0x1b
      cversion = 0
      dataVersion = 0
      aclVersion = 0
      ephemeralOwner = 0x0
      dataLength = 54
      numChildren = 0
      

      I'm going to preserve the state of this machine in HDFS for a while but not forever, so if there are other pieces of context people need, let me know.

        Issue Links

          Activity

          Keith Turner made changes -
          Link This issue is related to ACCUMULO-2668 [ ACCUMULO-2668 ]
          ASF subversion and git services added a comment -

          Commit 8cfdc1ffc4413f0712f7ecac1ba05f682e96d249 in branch refs/heads/master from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8cfdc1f ]

          ACCUMULO-1998 fix warning about an unclosed resource.

          ASF subversion and git services added a comment -

          Commit 8cfdc1ffc4413f0712f7ecac1ba05f682e96d249 in branch refs/heads/1.6.0-SNAPSHOT from Eric Newton
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=8cfdc1f ]

          ACCUMULO-1998 fix warning about an unclosed resource.

          ASF subversion and git services added a comment -

          Commit cac60938f08a1ce40bbbc0e89c2d01ee8350fd1a in branch refs/heads/master from John Vines
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cac6093 ]

          ACCUMULO-1998 Defense against concurrent write() and close() by relying on DfsLogger output stream interaction

          ASF subversion and git services added a comment -

          Commit cac60938f08a1ce40bbbc0e89c2d01ee8350fd1a in branch refs/heads/1.6.0-SNAPSHOT from John Vines
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=cac6093 ]

          ACCUMULO-1998 Defense against concurrent write() and close() by relying on DfsLogger output stream interaction

          John Vines made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          ASF subversion and git services added a comment -

          Commit 9ba06ff2d515d982b47b4f121a14bb5a98a024f0 in branch refs/heads/1.6.0-SNAPSHOT from John Vines
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=9ba06ff ]

          ACCUMULO-1998 Resolving corner cases around blocked output stream behavior and migration

          John Vines made changes -
          Resolution Fixed [ 1 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          John Vines added a comment -

          Seems there are some corner cases involving failed flushes that are causing buffer overflow exceptions in the block writer. Working on fixing these.

          John Vines made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          John Vines added a comment -

          Read my commit for an explanation of how this was resolved.

          ASF subversion and git services added a comment -

          Commit 443cba7a7a3838b547880f1c49f2a9e0128692cd in branch refs/heads/master from John Vines
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=443cba7 ]

          ACCUMULO-1998

          All encrypted walog events are now individually blocked on disk. This leads to an additional maxBlockSize parameter (mostly to handle OOM from mismatched crypto). Additionally, because of this behavior, as well as PKCS5 behavior, I have turned off all padding on the default crypto configs, and padding should not be used as it can cause data loss in walogs. I have hammered 5 instances on and off every minute for 22 hours and counting with no related issues, so I deem it a fix.
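
          As a rough illustration of the record layout this commit message describes, here is a minimal Java sketch of writing and reading length-prefixed ("blocked") walog events with a maxBlockSize guard. The class and method names are assumptions for illustration only, not the actual Accumulo implementation.

          import java.io.DataInputStream;
          import java.io.DataOutputStream;
          import java.io.IOException;

          // Sketch only: each encrypted walog event is written as a 4-byte length
          // prefix followed by its ciphertext; readers refuse blocks larger than
          // maxBlockSize so a garbage length (e.g. from mismatched crypto) cannot
          // trigger an enormous allocation and OOM.
          public final class BlockedRecordSketch {

            static void writeBlock(DataOutputStream out, byte[] ciphertext) throws IOException {
              out.writeInt(ciphertext.length); // 4-byte length prefix per event
              out.write(ciphertext);
            }

            static byte[] readBlock(DataInputStream in, int maxBlockSize) throws IOException {
              int len = in.readInt();
              if (len < 0 || len > maxBlockSize) {
                throw new IOException("Refusing block of size " + len + " (max " + maxBlockSize + ")");
              }
              byte[] block = new byte[len];
              in.readFully(block);
              return block;
            }
          }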

          ASF subversion and git services added a comment -

          Commit 443cba7a7a3838b547880f1c49f2a9e0128692cd in branch refs/heads/1.6.0-SNAPSHOT from John Vines
          [ https://git-wip-us.apache.org/repos/asf?p=accumulo.git;h=443cba7 ]

          ACCUMULO-1998

          All encrypted walog events are now individually blocked on disk. This leads to an additional maxBlockSize parameter (mostly to handle OOM from mismatched crypto). Additionally, because of this behavior, as well as PKCS5 behavior, I have turned off all padding on the default crypto configs, and padding should not be used as it can cause data loss in walogs. I have hammered 5 instances on and off every minute for 22 hours and counting with no related issues, so I deem it a fix.

          John Vines made changes -
          Attachment 0002-ACCUMULO-1998.patch [ 12621394 ]
          John Vines added a comment -

          One more go at it: simplified the block input stream to deal with a reading issue. I've been slamming Accumulo up and down and have yet to see any issues pop up again.

          John Vines made changes -
          Attachment 0001-ACCUMULO-1998.patch [ 12621377 ]
          John Vines added a comment -

          Revised version. We now expect a max block size. I also refactored the buffered input stream to not put everything in memory; rather, it will keep at most maxBlockSize. Additionally, there is a check for a partially written cipher block.
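
          As a hedged sketch of the partial-block check mentioned above (names are hypothetical, not the actual Accumulo code): if a tserver dies mid-write, the last block in a walog may be truncated, and the reader should treat that as end-of-log rather than failing.

          import java.io.DataInputStream;
          import java.io.EOFException;
          import java.io.IOException;

          // Sketch only: returns the next block, or null when the log ends,
          // including the case where the final block was only partially written.
          public final class PartialBlockCheckSketch {

            static byte[] readBlockOrNull(DataInputStream in, int maxBlockSize) throws IOException {
              int len;
              try {
                len = in.readInt();
              } catch (EOFException e) {
                return null; // clean end of log
              }
              if (len < 0 || len > maxBlockSize) {
                throw new IOException("Corrupt or oversized block length: " + len);
              }
              byte[] block = new byte[len];
              try {
                in.readFully(block);
              } catch (EOFException e) {
                return null; // partially written cipher block at the tail of a crashed log
              }
              return block;
            }
          }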

          John Vines added a comment -

          This patch still isn't sufficient. testKeyEncryptionAndCheckThatFileCannotBeReadWithoutKEK will sometimes fail with an OOM exception when the encrypted bytes come to an outlandish size. Additionally, it breaks handling for partial writes.

          Josh Elser added a comment - edited

          should have no impact on unencrypted performance, as it will just introduce a noop

          I figured as much.

          I have not actually done performance testing though as this was to provide correct behavior, not adjust performance characteristics.

          Ok, I was mostly curious as to whether or not this affected things when using the default crypto module. Would be interested to hear when you get there.

          John Vines added a comment -

          Yes.

          This technique adds 4 bytes written per flush (or when a 1k buffer fills). This only occurs when using the default crypto module and should have no impact on unencrypted performance, as it will just introduce a noop. However, for the walogs, this will be called for every walog entry. I have not actually done performance testing, though, as this was to provide correct behavior, not adjust performance characteristics.

          Sent from my phone, please pardon the typos and brevity.

          Josh Elser added a comment -

          So you pad the cipher stream to get it to flush to ensure that you don't lose data (or have old data reintroduced)? Have you noticed any sort of performance change in this? I assume the flush()'s in a real system are far enough between one another that it's not terrible? That, and if you don't have super small buffers?

          John Vines made changes -
          Assignee John Vines [ vines ]
          John Vines made changes -
          John Vines added a comment -

          This patch is more comprehensive than the others, as it actually deals with the buffering in the cipher stream. The issue is that this buffer does not clear on flush(). So the solution I've incorporated is manually padding on flush(). This allows flush() semantics to make sure the entire stream is clear. Then, I had the DfsLogger explicitly flush after every write() to ensure the crypto stream is clear. I have another stream in there which catches this flush and doesn't propagate it to ensure functionality doesn't get weird. I'm gonna push this in a few days if I hear no objections.
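
          A minimal sketch of the flush-swallowing wrapper idea described above; the class name and structure are illustrative assumptions, not the actual Accumulo code.

          import java.io.FilterOutputStream;
          import java.io.IOException;
          import java.io.OutputStream;

          // Sketch only: swallows flush() so that flushes forced on the crypto
          // stream are not propagated to the underlying FSDataOutputStream; the
          // DfsLogger decides when the real stream is flushed/synced.
          public class NonFlushingOutputStream extends FilterOutputStream {

            public NonFlushingOutputStream(OutputStream out) {
              super(out);
            }

            @Override
            public void write(byte[] b, int off, int len) throws IOException {
              out.write(b, off, len); // delegate bulk writes directly
            }

            @Override
            public void flush() {
              // intentionally a no-op
            }
          }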

          John Vines added a comment -

          This still leaves the possibility of the cipher block being unwritten when the write returns to the client. I have another idea...

          John Vines made changes -
          John Vines added a comment -

          Fourth rendition - just need to close the encryptingLogFile and everything else will flow through.

          John Vines made changes -
          John Vines added a comment -

          Third rendition. Buffer needs to be closed when the logFile is being closed as well.

          John Vines made changes -
          John Vines added a comment -

          This second patch wraps the output stream provided to the crypto stuff to make sure flush() calls never go into the FSDataOutputStream. This should mitigate concerns.

          John Vines made changes -
          John Vines added a comment -

          So this solution seems too simple. I'm worried about the repercussions of flushing before syncing, because the flush() will clear out the buffered crypto stream, but then it will keep flushing. Depending on the underlying filesystem, it may do nothing or it may keep flushing.

          John Vines made changes -
          Summary Exception while attempting to bring up 1.6.0 TabletServer after moderate amount of work Encrypted WALogs seem to be excessively buffering
          Priority Critical [ 2 ] Blocker [ 1 ]
          John Vines added a comment -

          Encrypted WALOGs. Those are the issue. They seem to buffer more than non-encrypted ones.

          Steps to validate this issue-
          Start with non-encrypted Accumulo. Run start-here.sh. Wait for metadata tablets to come online. Kill hard with stop-here.sh (hit control+c once). Then run bin/accumulo org.apache.accumulo.tserver.logger.LogReader /accumulo/wal/localhost+9997/<TheOnlyLogFileHere>. You will see tens of lines of output for root tablet writes, maybe metadata tablet writes as well.
          Blow away non-encrypted accumulo and replace site.xml with a crypto enabled one. Run start-here.sh. Wait for metadata tablets to come online. Kill hard with stop-here.sh. Run the same command against the new walog file. Observe how empty it is.

          The originally reported error manifests itself in this situation-
          1. File gets written, leading to a metadata/root table update
          2. File gets MajC away, leading to a meta/root table removal
          3. File gets deleted by gc
          4. Tserver hosting tablet goes down hard
          5. walogs replay, but only event 1 is present, not event 2
          6. Tablet now thinks it is referencing its old file

          Eric Newton added a comment -

          Michael Allen they are probably not related, since this is failing after a tablet is assigned. ACCUMULO-2006 fails to assign tablets, so no errors occur at the tserver.

          Michael Allen added a comment -

          Keith Turner I reproduced, kind of. Unfortunately it's amazingly inconsistent. I have only one tserver running, and it is not restarted during the course of testing. When I see it again I'll do some of the digging you suggest.

          Also I just saw ACCUMULO-2006 come through, so I am wondering if it's related.

          Keith Turner added a comment -

          Michael Allen were you able to reproduce this yesterday? How many tservers did you have? Were tservers restarted during this test? Some other things useful to look at would be when and where the root tablet was assigned. Also, you could look in the root tablet's walogs and try to find the mutations related to these files.

          Michael Allen added a comment -

          Oh, another hint! After reading over those log snippets, I noticed that the Hadoop logs only mention A00005ym.rf_tmp, and not A00005ym.rf. It looks like perhaps the renaming did not occur correctly?

          Or, perhaps this is nothing. Here's similar output for the file that did exist (at least for a time):

          [root@ip-10-10-1-155 hadoop]# grep A00002gy.rf *
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:31:56,219 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/tables/!0/table_info/A00002gy.rf_tmp. blk_-8014878588999673973_1314
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:31:56,246 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on  /accumulo/tables/!0/table_info/A00002gy.rf_tmp from client DFSClient_NONMAPREDUCE_-1463880682_12
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:31:56,246 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /accumulo/tables/!0/table_info/A00002gy.rf_tmp is closed by DFSClient_NONMAPREDUCE_-1463880682_12
          [root@ip-10-10-1-155 hadoop]# 
          
          Michael Allen added a comment -

          Here's a more precise breakdown. The exception looks like this now:

          2013-12-11 15:46:30,871 [thrift.ProcessFunction] ERROR: Internal error processing startMultiScan
          java.lang.RuntimeException: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1429)
          	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.startMultiScan(TabletServer.java:1381)
          	at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
          	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          	at java.lang.reflect.Method.invoke(Method.java:597)
          	at org.apache.accumulo.trace.instrument.thrift.TraceWrap$1.invoke(TraceWrap.java:63)
          	at com.sun.proxy.$Proxy10.startMultiScan(Unknown Source)
          	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2252)
          	at org.apache.accumulo.core.tabletserver.thrift.TabletClientService$Processor$startMultiScan.getResult(TabletClientService.java:2236)
          	at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
          	at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
          	at org.apache.accumulo.server.util.TServerUtils$TimedProcessor.process(TServerUtils.java:171)
          	at org.apache.thrift.server.AbstractNonblockingServer$FrameBuffer.invoke(AbstractNonblockingServer.java:478)
          	at org.apache.accumulo.server.util.TServerUtils$THsHaServer$Invocation.run(TServerUtils.java:231)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
          	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
          	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
          	at org.apache.accumulo.core.util.LoggingRunnable.run(LoggingRunnable.java:34)
          	at java.lang.Thread.run(Thread.java:662)
          Caused by: java.util.concurrent.ExecutionException: java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          	at org.apache.accumulo.tserver.TabletServer$ScanTask.get(TabletServer.java:754)
          	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler.continueMultiScan(TabletServer.java:1416)
          	... 18 more
          Caused by: java.lang.RuntimeException: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1125)
          	at org.apache.accumulo.trace.instrument.TraceRunnable.run(TraceRunnable.java:47)
          	... 5 more
          Caused by: java.io.IOException: Failed to open hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:333)
          	at org.apache.accumulo.tserver.FileManager.access$500(FileManager.java:58)
          	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:478)
          	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFileRefs(FileManager.java:466)
          	at org.apache.accumulo.tserver.FileManager$ScanFileManager.openFiles(FileManager.java:486)
          	at org.apache.accumulo.tserver.Tablet$ScanDataSource.createIterator(Tablet.java:2027)
          	at org.apache.accumulo.tserver.Tablet$ScanDataSource.iterator(Tablet.java:1989)
          	at org.apache.accumulo.core.iterators.system.SourceSwitchingIterator.seek(SourceSwitchingIterator.java:163)
          	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1565)
          	at org.apache.accumulo.tserver.Tablet.lookup(Tablet.java:1672)
          	at org.apache.accumulo.tserver.TabletServer$ThriftClientHandler$LookupTask.run(TabletServer.java:1114)
          	... 6 more
          Caused by: java.io.FileNotFoundException: File does not exist: /accumulo/tables/!0/table_info/A00002gy.rf
          	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
          	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
          	at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
          	at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
          	at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
          	at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:436)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBCFile(CachableBlockFile.java:256)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.access$000(CachableBlockFile.java:143)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader$MetaBlockLoader.get(CachableBlockFile.java:212)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getBlock(CachableBlockFile.java:313)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:367)
          	at org.apache.accumulo.core.file.blockfile.impl.CachableBlockFile$Reader.getMetaBlock(CachableBlockFile.java:143)
          	at org.apache.accumulo.core.file.rfile.RFile$Reader.<init>(RFile.java:825)
          	at org.apache.accumulo.core.file.rfile.RFileOperations.openReader(RFileOperations.java:79)
          	at org.apache.accumulo.core.file.DispatchingFileFactory.openReader(FileOperations.java:119)
          	at org.apache.accumulo.tserver.FileManager.reserveReaders(FileManager.java:314)
          	... 16 more
          

          Scanning the root tablet shows the following entries:

          root@accumulo accumulo.root> scan
          !0;~ file:hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf []    8652,1763
          !0;~ file:hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F000063o.rf []    66848,35747
          !0;~ last:142e00c604d0053 []    10.10.1.155:9997
          !0;~ loc:142e00c604d0063 []    10.10.1.155:9997
          !0;~ srv:compact []    7
          !0;~ srv:dir []    hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info
          !0;~ srv:flush []    122
          !0;~ srv:lock []    tservers/10.10.1.155:9997/zlock-0000000000$142e00c604d0063
          !0;~ srv:time []    L39138
          !0;~ ~tab:~pr []    \x00
          !0< file:hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/A00002gx.rf []    1518,3
          !0< last:142e00c604d000d []    10.10.1.155:9997
          !0< loc:142e00c604d0063 []    10.10.1.155:9997
          !0< srv:compact []    6
          !0< srv:dir []    hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet
          !0< srv:flush []    122
          !0< srv:lock []    tservers/10.10.1.155:9997/zlock-0000000000$142e00c604d0063
          !0< srv:time []    L12
          !0< ~tab:~pr []    \x01~
          root@accumulo accumulo.root> 
          

          So we looked for where this missing file (A00002gy.rf) got created. Back up in the tserver logs is this:

          2013-12-11 05:31:55,701 [tserver.Tablet] DEBUG: Starting MajC !0;~< (NORMAL) [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00002gv.rf, hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00000fi.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf_tmp  []
          2013-12-11 05:31:55,937 [tserver.TabletServer] DEBUG: ScanSess tid 10.10.1.155:58190 !0 0 entries in 0.10 secs, nbTimes = [98 98 98.00 1]
          2013-12-11 05:31:56,208 [util.TServerUtils] INFO : Decreasing server thread pool size on TabletServer to 20
          2013-12-11 05:31:56,246 [tserver.TabletServer] DEBUG: MultiScanSess 10.10.1.155:60638 2 entries in 0.01 secs (lookup_time:0.00 secs tablets:1 ranges:1)
          2013-12-11 05:31:56,333 [tserver.Compactor] DEBUG: Compaction !0;~< 20,974 read | 1,763 written | 42,979 entries/sec |  0.488 secs
          2013-12-11 05:31:56,345 [tserver.Tablet] DEBUG: MajC finish lock 0.00 secs
          2013-12-11 05:31:56,345 [tserver.Tablet] TABLET_HIST: !0;~< MajC [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00002gv.rf, hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00000fi.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          

          Then it got major compacted away here:

          2013-12-11 05:36:56,907 [tserver.Tablet] DEBUG: Major compaction plan: [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf, hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] propogate deletes : false
          2013-12-11 05:36:56,907 [tserver.Tablet] DEBUG: MajC initiate lock 0.00 secs, wait 0.00 secs
          2013-12-11 05:36:56,908 [tserver.Tablet] DEBUG: Starting MajC !0;~< (NORMAL) [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf, hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00005ym.rf_tmp  []
          2013-12-11 05:36:57,080 [tserver.Compactor] DEBUG: Compaction !0;~< 37,692 read | 0 written | 269,228 entries/sec |  0.140 secs
          2013-12-11 05:36:57,085 [tserver.Tablet] DEBUG: MajC finish lock 0.00 secs
          2013-12-11 05:36:57,085 [tserver.Tablet] TABLET_HIST: !0;~< MajC [hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf, hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf] --> hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00005ym.rf
          

          The garbage collector decided to collect it:

          2013-12-11 05:41:56,870 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/A00002gx.rf
          2013-12-11 05:41:56,870 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/default_tablet/F00005yk.rf
          2013-12-11 05:41:56,874 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/F00005yj.rf
          2013-12-11 05:41:56,883 [gc.SimpleGarbageCollector] DEBUG: Deleting hdfs://10.10.1.155:9000/accumulo/tables/!0/table_info/A00002gy.rf
          2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data file candidates for deletion: 4
          2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data file candidates still in use: 0
          2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of successfully deleted data files: 4
          2013-12-11 05:41:56,956 [gc.SimpleGarbageCollector] INFO : Number of data files delete failures: 0
          2013-12-11 05:41:56,957 [gc.SimpleGarbageCollector] INFO : Collect cycle took 0.29 seconds
          

          So what happened with A00005ym.rf, which supposedly replaced A00002gy.rf?

          The namenode seems to think it was created and closed. But no other Hadoop log mentions it:

          [root@ip-10-10-1-155 hadoop]# pwd
          /data0/logs/hadoop
          
          [root@ip-10-10-1-155 hadoop]# grep A00005ym.rf *
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,061 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* allocateBlock: /accumulo/tables/!0/table_info/A00005ym.rf_tmp. blk_-5468581853400342461_1318
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,068 INFO org.apache.hadoop.hdfs.StateChange: Removing lease on  /accumulo/tables/!0/table_info/A00005ym.rf_tmp from client DFSClient_NONMAPREDUCE_-1463880682_12
          hadoop-hadoop-namenode-ip-10-10-1-155.log:2013-12-11 05:36:57,068 INFO org.apache.hadoop.hdfs.StateChange: DIR* completeFile: /accumulo/tables/!0/table_info/A00005ym.rf_tmp is closed by DFSClient_NONMAPREDUCE_-1463880682_12
          
          [root@ip-10-10-1-155 hadoop]#
          

          So somewhere between the major compaction finishing and the root tablet's information being updated, something went wrong. There are no other obvious exceptions or errors in the logs.

          John Vines added a comment -

          I totally spaced and mixed up the metadata table (!0) with the root table (+r)...

          John Vines added a comment -

          It is concerning that both +r and !0 have files in them. What is going on here?

          John Vines made changes -
          Link This issue relates to ACCUMULO-1481 [ ACCUMULO-1481 ]
          John Vines made changes -
          Field Original Value New Value
          Fix Version/s 1.6.0 [ 12322468 ]
          Affects Version/s 1.6.0 [ 12322468 ]
          Priority Major [ 3 ] Critical [ 2 ]
          Michael Allen created issue -

            People

            • Assignee: John Vines
            • Reporter: Michael Allen
            • Votes: 0
            • Watchers: 6
