Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: test
    • Labels:
      None
    • Environment:

      20-node AWS test cluster, running hadoop-2.2.0

      Description

      Run the randomwalk test using the LongClean module to completion.

          Activity

          Eric Newton added a comment -

          Ran another 24-hour run of LongClean module: a few of the walkers stopped due to ACCUMULO-2673 (detected out-of-balance).

          Eric Newton added a comment -

          Good point. I've removed the start/stop from Concurrent and restarted. I left a bunch of old tables on there, so that should help with the balance check, too.

          Keith Turner added a comment -

          Eric Newton, did all walkers freeze or hang? Can you remove the nodes in the graph that are causing the hang and rerun?

          Eric Newton added a comment -

          Ran successfully on RC2 for 24 hours; each client either:

          • Hung, due to ACCUMULO-2341
          • Stopped because it detected an out-of-balance cluster (ACCUMULO-2673)
          • Stopped because of an unexpected tablet state (ACCUMULO-2678)

          Keith Turner added a comment -

          The Concurrent test intentionally uses the same tables.

          Bill Havanki added a comment -

          Does this arise because multiple walkers are running Concurrent at the same time? I've noticed that the "ctt" tables it creates aren't namespaced by the walker host like those of other tests.

          Eric Newton added a comment -

          Looks like there is a possibility for another operation to come in after the locks are closed for online/offline operations. The Concurrent test just needs to accept this case (other tests that do not share state can handle an online/offline conflict). Sadly, the code that does this check throws AccumuloException, which means that the test would have to look at the exception's error string.
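
A minimal sketch of what "looking at the exception error string" could look like in a test node. Everything here is an assumption for illustration: `AccumuloException` is a local stand-in for the real `org.apache.accumulo.core.client.AccumuloException`, the marker text is copied from the messages quoted in this issue, and `runTolerant` and `TableOp` are hypothetical helpers, not part of the randomwalk framework.

```java
public class TableStateConflict {

    /** Local stand-in for the real AccumuloException (assumed to be a checked exception with a message). */
    static class AccumuloException extends Exception {
        AccumuloException(String message) {
            super(message);
        }
    }

    /** Marker text taken from the error messages quoted in this issue. */
    static final String UNEXPECTED_STATE = "Unexpected table state";

    /** Returns true when the message suggests another walker changed the table state first. */
    static boolean isStateConflict(Throwable t) {
        String msg = t.getMessage();
        return msg != null && msg.contains(UNEXPECTED_STATE);
    }

    /** Hypothetical wrapper for a table operation such as offline() or online(). */
    interface TableOp {
        void run() throws AccumuloException;
    }

    /** Runs the operation; returns false on a state conflict and rethrows anything else. */
    static boolean runTolerant(TableOp op) throws AccumuloException {
        try {
            op.run();
            return true;
        } catch (AccumuloException e) {
            if (isStateConflict(e)) {
                return false; // expected when walkers share tables
            }
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        boolean completed = runTolerant(() -> {
            throw new AccumuloException("Unexpected table state 9a ONLINE != OFFLINE");
        });
        System.out.println("completed=" + completed);
    }
}
```

Matching on message text is brittle: if the wording of the message changes, the check silently stops recognizing the conflict.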

          Eric Newton added a comment -

          Also saw "Unexpected table state 9a ONLINE != OFFLINE" in 1.6.0RC2, also in ct.OfflineTable:

          Caused by: java.lang.Exception: Error running node ct.OfflineTable
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:286)
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:255)
                  ... 8 more
          Caused by: org.apache.accumulo.core.client.AccumuloException: Unexpected table state 9a ONLINE != OFFLINE
                  at org.apache.accumulo.core.client.admin.TableOperationsImpl.waitForTableStateTransition(TableOperationsImpl.java:1200)
                  at org.apache.accumulo.core.client.admin.TableOperationsImpl.offline(TableOperationsImpl.java:1334)
                  at org.apache.accumulo.test.randomwalk.concurrent.OfflineTable.visit(OfflineTable.java:43)
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:255)
                  ... 9 more
          
          Eric Newton added a comment -

          Saw "Unexpected table state 2p DELETING != OFFLINE" in 1.6.0RC2.

          Eric Newton added a comment -

          I think the Shard test is failing after the timer attempts to tearDown the test. It's sort of flailing instead of properly quitting.

          Eric Newton added a comment -

          Saw unexpected document failure again, 2x on a 20-node test cluster running 17 walkers.

          Eric Newton added a comment -

          Also:

          java.lang.Exception: Error running node Shard.xml
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:286)
                  at org.apache.accumulo.test.randomwalk.Framework.run(Framework.java:63)
                  at org.apache.accumulo.test.randomwalk.Framework.main(Framework.java:122)
                  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
                  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
                  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
                  at java.lang.reflect.Method.invoke(Method.java:606)
                  at org.apache.accumulo.start.Main$1.run(Main.java:141)
                  at java.lang.Thread.run(Thread.java:744)
          Caused by: java.lang.Exception: Error running node shard.CompactFilter
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:286)
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:255)
                  ... 8 more
          Caused by: java.lang.Exception: Saw unexpected document e900000000000000 doc: [] 1397147994361 false
                  at org.apache.accumulo.test.randomwalk.shard.CompactFilter.visit(CompactFilter.java:87)
                  at org.apache.accumulo.test.randomwalk.Module.visit(Module.java:255)
                  ... 9 more
          
          Eric Newton added a comment -

          Presently (479a36bd9...) dies with:

          • CheckBalance failure (count 47 too far from average 29.4375)
          • map/reduce job timeout (sequential.MapRedVerify has been running for 300.008 seconds.)
          • CopyTable timeout: multitable.CopyTable has been running for 300.001 seconds
          • BulkInsert timeout: shard.BulkInsert has been running for 300.001 seconds
          • ct.OfflineTable fails: Unexpected table state 5k DELETING != OFFLINE

          I suspect MapReduce/YARN is not running.
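
For readers unfamiliar with the CheckBalance failure listed above, here is a rough sketch of a check that compares each server's tablet count to the cluster average. The tolerance rule (a fraction of the average, with a floor of 1) is an assumption chosen for illustration and is not claimed to be the rule the real CheckBalance node applies; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class BalanceCheckSketch {

    /**
     * Returns the servers whose tablet count differs from the average by more
     * than max(1, average * toleranceFraction). The tolerance rule is assumed.
     */
    static List<String> outOfBalance(Map<String, Integer> tabletCounts, double toleranceFraction) {
        List<String> offenders = new ArrayList<>();
        if (tabletCounts.isEmpty()) {
            return offenders; // nothing to compare
        }
        double total = 0;
        for (int count : tabletCounts.values()) {
            total += count;
        }
        double average = total / tabletCounts.size();
        double tolerance = Math.max(1.0, average * toleranceFraction);
        for (Map.Entry<String, Integer> e : tabletCounts.entrySet()) {
            if (Math.abs(e.getValue() - average) > tolerance) {
                offenders.add(e.getKey());
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Hypothetical counts: three servers with 10 tablets and one with 30.
        Map<String, Integer> counts = Map.of("a", 10, "b", 10, "c", 10, "d", 30);
        System.out.println(outOfBalance(counts, 0.5));
    }
}
```

A check of this shape will also fire while the cluster is legitimately rebalancing, for example after tables are created or deleted, so a long-running test would presumably need to tolerate short-lived imbalance.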


            People

            • Assignee:
              Eric Newton
              Reporter:
              Eric Newton
            • Votes:
              0
              Watchers:
              3
