Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5.0
    • Component/s: master, tserver
    • Labels: None
    • Environment: hadoop-1.0.4, accumulo-1.5-SNAPSHOT svn version 1470047

    Description

      We were testing an Accumulo snapshot with continuous ingest when the power went out. When it came back, we noticed some corrupt blocks in HDFS, mostly around the WAL. I wasn't certain whether that was just an artifact of how the sync blocks can turn out, so I went ahead and started Accumulo to see if it could handle it. What I got wasn't what I expected.

      The monitor reports 0 errors. It just sits with 5 tservers available and no tablets online. The master appears to have attempted assignment and is now waiting for the walog to close, which never happens:

      2013-04-30 10:38:23,648 [master.EventCoordinator] INFO : There are now 5 tablet servers
      2013-04-30 10:38:23,719 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/fa545e93-5eba-46b4-9266-dbd60cb56943]
      2013-04-30 10:38:23,720 [state.ZooTabletStateStore] DEBUG: root tablet logSet [172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462]
      2013-04-30 10:38:23,725 [state.ZooTabletStateStore] DEBUG: Returning root tablet state: !0;!0<<@(null,172.16.102.202:9997[33e57eff04c0001],172.16.102.202:9997[33e57eff04c0001])
      2013-04-30 10:38:23,740 [master.Master] INFO : Loaded class : org.apache.accumulo.server.master.recovery.HadoopLogCloser
      2013-04-30 10:38:23,741 [recovery.RecoveryManager] INFO : Starting recovery of ed30bd24-b348-4344-8614-a2d79f933462 (in : 10s) created for 172.16.102.202+9997, tablet !0;!0<< holds a reference
      2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet]: scan time 0.04 seconds
      2013-04-30 10:38:23,751 [master.Master] DEBUG: [Root Tablet] sleeping for 60.00 seconds
      2013-04-30 10:38:23,823 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
      2013-04-30 10:38:23,838 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.21 seconds
      2013-04-30 10:38:23,841 [master.Master] DEBUG: not balancing because there are unhosted tablets
      2013-04-30 10:38:23,852 [master.Master] DEBUG: Finished gathering information from 5 servers in 0.01 seconds
      2013-04-30 10:38:23,852 [master.Master] DEBUG: not balancing because there are unhosted tablets
      2013-04-30 10:38:23,861 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
      2013-04-30 10:38:23,874 [impl.ThriftScanner] DEBUG: Error getting transport to 172.16.102.202:9997 : NotServingTabletException(extent:TKeyExtent(table:21 30, endRow:21 30 3C, prevEndRow:null))

      That exception repeats endlessly, along with a periodic:

      2013-04-30 10:38:34,756 [recovery.HadoopLogCloser] INFO : Waiting for file to be closed /accumulo/wal/172.16.102.202+9997/ed30bd24-b348-4344-8614-a2d79f933462

      The tserver in question, though, seems to have no idea that it's supposed to be recovering the root tablet:

      2013-04-30 10:38:22,432 [tabletserver.TabletServer] DEBUG: org.apache.accumulo.server.tabletserver.TabletServer$ThriftClientHandler created
      2013-04-30 10:38:22,544 [metrics.MetricsConfiguration] DEBUG: Loading config file: /cloud/accumulo/apache-accumulo-1.5.0-SNAPSHOT_1470047/conf/accumulo-metrics.xml
      2013-04-30 10:38:22,549 [metrics.MetricsConfiguration] DEBUG: Metrics collection enabled=false
      2013-04-30 10:38:22,551 [tabletserver.TabletServer] INFO : port = 9997
      2013-04-30 10:38:22,621 [tabletserver.TabletServer] DEBUG: Obtained tablet server lock /accumulo/242078a7-dd19-4d08-8952-f5109f6f7962/tservers/172.16.102.202:9997/zlock-0000000000
      2013-04-30 10:38:23,266 [tabletserver.TabletServer] DEBUG: gc ParNew=0.00(+0.00) secs ConcurrentMarkSweep=0.00(+0.00) secs freemem=8,486,794,504(+45,036,880) totalmem=8,536,260,608
      2013-04-30 10:38:23,947 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.07 secs (lookup_time:0.00 secs tablets:1 ranges:1)
      2013-04-30 10:38:23,986 [tabletserver.TabletServer] DEBUG: MultiScanSess 172.16.102.200:50034 0 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)

      That last debug message repeats endlessly. The out and err files on the master and on that tserver are empty.
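
      For triage, the "Waiting for file to be closed" messages in the master log can be tallied per WAL file to see which logs never finish closing. The following is a minimal sketch and not part of Accumulo: the function name count_waiting and the log file path passed on the command line are hypothetical, and the regular expression assumes the log layout shown in the excerpts above.

```python
import re
import sys
from collections import Counter

# Assumed layout, taken from the master log excerpt in this report:
# "<date> <time>,<ms> [recovery.HadoopLogCloser] INFO : Waiting for file to be closed <path>"
WAITING = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3} "
    r"\[recovery\.HadoopLogCloser\] INFO : "
    r"Waiting for file to be closed (?P<path>\S+)"
)


def count_waiting(lines):
    """Map each WAL path to the number of 'waiting to be closed' messages."""
    counts = Counter()
    for line in lines:
        match = WAITING.match(line.strip())
        if match:
            counts[match.group("path")] += 1
    return counts


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python count_waiting.py <path to master debug log>
    with open(sys.argv[1], encoding="utf-8", errors="replace") as log:
        for path, n in count_waiting(log).most_common():
            print(n, path)
```

      A WAL whose count keeps growing across successive runs is one the master is still waiting on, which matches the behavior described here.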

            People

              Assignee: ecn Eric C. Newton
              Reporter: vines John Vines