HBase / HBASE-5754

data lost with gora continuous ingest test (goraci)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 0.92.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      10 node test cluster

      Description

      Keith Turner rewrote the Accumulo continuous ingest test using Gora, which has both HBase and Accumulo back-ends.

      I put a billion entries into HBase, and ran the Verify map/reduce job. The verification failed because about 21K entries were missing. The goraci README explains the test, and how it detects missing data.

      I re-ran the test with 100 million entries, and it verified successfully. Both of the times I tested with a billion entries, the verification failed. If I run the verification step twice, the results are consistent, so the problem is probably not in the verify step.
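
      As an illustration of how the verify step can flag missing data (a sketch, not the actual goraci source; the key/value types and the zero-length "defined" marker are assumptions): the map phase reads every row back out of the table and emits a "defined" marker for the row itself plus one reference for the node it points to, and the reducer then classifies each key.

      import java.io.IOException;

      import org.apache.hadoop.io.BytesWritable;
      import org.apache.hadoop.io.NullWritable;
      import org.apache.hadoop.mapreduce.Reducer;

      // Sketch of a Verify-style reducer: every key should be both defined
      // (read back from the table) and referenced (pointed at by another node).
      public class VerifySketchReducer
          extends Reducer<BytesWritable, BytesWritable, BytesWritable, NullWritable> {

        public enum Counts { UNDEFINED, UNREFERENCED, REFERENCED }

        // Assumed convention: a zero-length value marks "this row was read back".
        private static boolean isDefinedMarker(BytesWritable v) {
          return v.getLength() == 0;
        }

        @Override
        protected void reduce(BytesWritable key, Iterable<BytesWritable> values,
            Context context) throws IOException, InterruptedException {
          boolean defined = false;
          long references = 0;
          for (BytesWritable v : values) {
            if (isDefinedMarker(v)) {
              defined = true;     // the node itself exists in the table
            } else {
              references++;       // another node points at this key
            }
          }
          if (defined && references > 0) {
            context.getCounter(Counts.REFERENCED).increment(1);
          } else if (defined) {
            // written but never pointed at; losing it would go undetected
            context.getCounter(Counts.UNREFERENCED).increment(1);
          } else {
            // pointed at but missing from the table: the data-loss signal
            context.getCounter(Counts.UNDEFINED).increment(1);
          }
        }
      }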

      Here are the versions of the various packages:

      package   version
      hadoop    0.20.205.0
      hbase     0.92.1
      gora      http://svn.apache.org/repos/asf/gora/trunk r1311277
      goraci    https://github.com/ericnewton/goraci tagged 2012-04-08

      The change I made to goraci was to configure it for HBase and to allow it to build properly.

        Issue Links

          Activity

          Ian Varley added a comment -

          Per discussion, this functionality is covered by many other tests, and hasn't been reproducible recently. Please reopen if anyone sees this again.

          Ian Varley added a comment -

          OK, thanks Eric Newton. I would move to close unless anyone has reported something similar in the meantime or is actively working on it. Reopen if you disagree.

          Eric Newton added a comment -

          Ian Varley I don't know if it has been fixed. I have not re-tested more recent versions of HBase. Since the test has now been incorporated into HBase without any additional findings, perhaps it has been fixed.

          Ian Varley added a comment -

          Eric Newton, this issue has been dormant for 11 months. Do you think all these issues were resolved based on the other linked JIRAs above (e.g. HBASE-5987, HBASE-6060, etc.)? Or is this still reproducible?

          Nicolas Liochon added a comment -

          @keith: thanks for the info. It will be interesting to see the results. FYI, I've just created HBASE-6401 for a linked HDFS issue.

          Enis Soztutar added a comment -

          Just FYI in case, I've been working on adding "long running ingestion tests while randomly killing of servers", and other types of integration tests over at HBASE-6241, HBASE-6201. Feel free to chime in.

          Keith Turner added a comment -

          In the past we just killed Accumulo processes, the tablet servers, loggers, and master process(es). We were not looking to test HDFS. For the next release, Accumulo has moved its write ahead logging to HDFS. So for the next release we plan to start killing HDFS processes as part of our testing. See ACCUMULO-578, ACCUMULO-623, and ACCUMULO-636.

          Nicolas Liochon added a comment -

          @Eric
          When you kill servers, do you stop the whole machine, just the Accumulo processes, or both HDFS & Accumulo?

          Eric Newton added a comment -

          I will try to reproduce the bug with 0.94 and hadoop 1.0.3 sometime in the next week.

          We normally run this test against Accumulo for 24 hours, both with and without agitation (random killing of servers). I would be concerned if there was apparent data loss, even if it was transient.

          Enis Soztutar added a comment -

          @Lars,
          We have been running this for a while as nightlies, and apart from the reported HBASE-5986, HBASE-6060, and HBASE-6160, we did not run into more issues. All of them can be considered META issues w/o actual data loss. Let's see what Eric would say.

          Lars Hofhansl added a comment -

          Trying to figure out whether this is something to worry about or not.
          Is this a permanent data loss issue, or is it "just" .META. being out of whack temporarily (or at least fixable with hbck)?

          Enis Soztutar added a comment -

          In one of my 0.92.x tests on a 10 node cluster, 250M inserts, I did manage to get the verify to fail:

          12/05/08 11:11:18 INFO mapred.JobClient:   goraci.Verify$Counts
          12/05/08 11:11:18 INFO mapred.JobClient:     UNDEFINED=972506
          12/05/08 11:11:18 INFO mapred.JobClient:     REFERENCED=248051318
          12/05/08 11:11:18 INFO mapred.JobClient:     UNREFERENCED=972506
          12/05/08 11:11:18 INFO mapred.JobClient:   Map-Reduce Framework
          12/05/08 11:11:18 INFO mapred.JobClient:     Map input records=249023824
          

          Notice that map input records is roughly 1M less than 250M, which indicates that the InputFormat did not provide all records in the table. The missing rows all belong to a single region. I reran the test after a couple of hours, and it passed. But the failed test created 244 maps instead of 246, which is the current region count, so I suspect there is something wrong in the split calculation or in the supposed transactional behavior for split/balance operations in the meta table. I am still inspecting the code and the logs, but any pointers are welcome.
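
          A quick way to sanity-check the split-calculation suspicion is to compare the table's region count against the number of maps the Verify job launches. A minimal sketch with the 0.92-era client API (the table name "ci" is an assumption, not taken from the logs above):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.util.Bytes;

          // Prints one start key per region so the region count can be compared
          // with the map-task count of the failed Verify run.
          public class RegionCountCheck {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HTable table = new HTable(conf, "ci");      // assumed table name
              byte[][] startKeys = table.getStartKeys();  // one entry per region
              System.out.println("regions: " + startKeys.length);
              for (byte[] k : startKeys) {
                System.out.println(Bytes.toStringBinary(k));  // empty = first region
              }
              table.close();
            }
          }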

          stack added a comment -

          I ran w/ 10 generators and 10 slots for the verify step and got the below, which prints out only a REFERENCED count.

          Running these recent tests I let it do its natural splitting, so it grew from zero to 260-odd regions; maybe the issue you see, Eric, comes from manual splits triggered from the UI. Let me try that next.

          Thanks lads.

          12/04/13 05:16:23 INFO mapred.JobClient:  map 100% reduce 99%
          12/04/13 05:16:54 INFO mapred.JobClient:  map 100% reduce 100%
          12/04/13 05:16:59 INFO mapred.JobClient: Job complete: job_201204092039_0046
          12/04/13 05:16:59 INFO mapred.JobClient: Counters: 30
          12/04/13 05:16:59 INFO mapred.JobClient:   Job Counters
          12/04/13 05:16:59 INFO mapred.JobClient:     Launched reduce tasks=10
          12/04/13 05:16:59 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=30125694
          12/04/13 05:16:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
          12/04/13 05:16:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
          12/04/13 05:16:59 INFO mapred.JobClient:     Rack-local map tasks=6
          12/04/13 05:16:59 INFO mapred.JobClient:     Launched map tasks=256
          12/04/13 05:16:59 INFO mapred.JobClient:     Data-local map tasks=250
          12/04/13 05:16:59 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=5832198
          12/04/13 05:16:59 INFO mapred.JobClient:   goraci.Verify$Counts
          12/04/13 05:16:59 INFO mapred.JobClient:     REFERENCED=1000000000
          12/04/13 05:16:59 INFO mapred.JobClient:   File Output Format Counters
          12/04/13 05:16:59 INFO mapred.JobClient:     Bytes Written=0
          12/04/13 05:16:59 INFO mapred.JobClient:   FileSystemCounters
          12/04/13 05:16:59 INFO mapred.JobClient:     FILE_BYTES_READ=83022967343
          12/04/13 05:16:59 INFO mapred.JobClient:     HDFS_BYTES_READ=156414
          12/04/13 05:16:59 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=112881560332
          12/04/13 05:16:59 INFO mapred.JobClient:   File Input Format Counters
          12/04/13 05:16:59 INFO mapred.JobClient:     Bytes Read=0
          12/04/13 05:16:59 INFO mapred.JobClient:   Map-Reduce Framework
          12/04/13 05:16:59 INFO mapred.JobClient:     Map output materialized bytes=29992170602
          12/04/13 05:16:59 INFO mapred.JobClient:     Map input records=1000000000
          12/04/13 05:16:59 INFO mapred.JobClient:     Reduce shuffle bytes=29874879887
          12/04/13 05:16:59 INFO mapred.JobClient:     Spilled Records=7527086436
          12/04/13 05:16:59 INFO mapred.JobClient:     Map output bytes=25992155242
          12/04/13 05:16:59 INFO mapred.JobClient:     CPU time spent (ms)=20182570
          12/04/13 05:16:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=99953082368
          12/04/13 05:16:59 INFO mapred.JobClient:     Combine input records=0
          12/04/13 05:16:59 INFO mapred.JobClient:     SPLIT_RAW_BYTES=156414
          12/04/13 05:16:59 INFO mapred.JobClient:     Reduce input records=2000000000
          12/04/13 05:16:59 INFO mapred.JobClient:     Reduce input groups=1000000000
          12/04/13 05:16:59 INFO mapred.JobClient:     Combine output records=0
          12/04/13 05:16:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=91762372608
          12/04/13 05:16:59 INFO mapred.JobClient:     Reduce output records=0
          12/04/13 05:16:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=391126540288
          12/04/13 05:16:59 INFO mapred.JobClient:     Map output records=2000000000
          
          stack added a comment -

          Let me do the same. I did not match generator map tasks to verify reducers. Then let me recreate the split issue Eric describes above. Thanks lads.

          Keith Turner added a comment -

          When Eric ran 10 generators each adding 100M against HBase and no data was lost, he saw 1B referenced, 0 unref, and 0 undef.

          Keith Turner added a comment -

          The counts for the 1B run seem odd to me, but maybe that's just an artifact of how many map tasks you ran for the generator and how much data each task generated. If a map task does not generate a multiple of 25,000,000, then it will leave some nodes unreferenced. It generates a circular linked list every 25M.

          12/04/12 03:54:11 INFO mapred.JobClient:     REFERENCED=564459547
          12/04/12 03:54:11 INFO mapred.JobClient:     UNREFERENCED=1040000000
          

          If you were to run 10 map tasks each generating 100M, then this should generate 1B with all nodes referenced. Minimizing the number of unreferenced nodes is ideal, because the test cannot detect the loss of unreferenced nodes. I should probably add this info to the README.
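
          As a rough illustration of the arithmetic above (simplifying by treating the leftover tail of an incomplete 25M loop as the unreferenced portion; the generator's actual bookkeeping may differ):

          // Estimates referenced vs. unreferenced nodes for a hypothetical per-map count.
          public class UnreferencedEstimate {
            private static final long LOOP = 25000000L;  // a loop closes every 25M nodes

            public static void main(String[] args) {
              long perMap = 110000000L;  // hypothetical: 110M nodes from one map task
              long inClosedLoops = (perMap / LOOP) * LOOP;  // 100M land in closed loops
              long leftover = perMap % LOOP;                // 10M never get linked back
              System.out.println("closed-loop=" + inClosedLoops + " leftover=" + leftover);
              // 10 maps x 100M each is an exact multiple of 25M, so nothing is left
              // over, matching the clean 1B run (1B referenced, 0 unref) noted above.
            }
          }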

          Eric Newton added a comment -

          Yes, that looks like a clean run.

          My run of 1B completed correctly yesterday.

          However, I just did a run of 100M. I suspect split/balancing is causing the data loss. So, during the 100M run, I pushed the split button on the master web page every 5 seconds until I had several hundred regions. Then I left it alone.

          12/04/12 14:17:01 INFO mapred.JobClient:   goraci.Verify$Counts
          12/04/12 14:17:01 INFO mapred.JobClient:     UNDEFINED=37099
          12/04/12 14:17:01 INFO mapred.JobClient:     REFERENCED=89961385
          12/04/12 14:17:01 INFO mapred.JobClient:     UNREFERENCED=10001516
          

          Perhaps if you do this you can replicate the problem.
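
          For anyone trying to reproduce this without clicking through the master UI, a minimal sketch that drives the same agitation from the 0.92-era Java client (the table name "ci" and the iteration count are assumptions):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.client.HBaseAdmin;

          // Repeatedly asks the master to split the table while ingest is running,
          // mimicking pressing the split button every 5 seconds.
          public class SplitAgitator {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);
              for (int i = 0; i < 100; i++) {
                admin.split("ci");      // asynchronous split request for the table
                Thread.sleep(5000);     // same 5-second cadence as the manual test
              }
            }
          }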

          stack added a comment -

          I managed to get things to work after a few detours and figuring out the tool (I'm a little slow). Here is the output of a verify run after uploading 1B rows using the Generator tool (I have a five node cluster):

          12/04/12 03:53:54 INFO mapred.JobClient:  map 100% reduce 99%
          12/04/12 03:54:06 INFO mapred.JobClient:  map 100% reduce 100%
          12/04/12 03:54:11 INFO mapred.JobClient: Job complete: job_201204092039_0040
          12/04/12 03:54:11 INFO mapred.JobClient: Counters: 31
          12/04/12 03:54:11 INFO mapred.JobClient:   Job Counters
          12/04/12 03:54:11 INFO mapred.JobClient:     Launched reduce tasks=103
          12/04/12 03:54:11 INFO mapred.JobClient:     SLOTS_MILLIS_MAPS=43090396
          12/04/12 03:54:11 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
          12/04/12 03:54:11 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
          12/04/12 03:54:11 INFO mapred.JobClient:     Rack-local map tasks=75
          12/04/12 03:54:11 INFO mapred.JobClient:     Launched map tasks=256
          12/04/12 03:54:11 INFO mapred.JobClient:     Data-local map tasks=181
          12/04/12 03:54:11 INFO mapred.JobClient:     SLOTS_MILLIS_REDUCES=8911236
          12/04/12 03:54:11 INFO mapred.JobClient:   goraci.Verify$Counts
          12/04/12 03:54:11 INFO mapred.JobClient:     REFERENCED=564459547
          12/04/12 03:54:11 INFO mapred.JobClient:     UNREFERENCED=1040000000
          12/04/12 03:54:11 INFO mapred.JobClient:   File Output Format Counters
          12/04/12 03:54:11 INFO mapred.JobClient:     Bytes Written=0
          12/04/12 03:54:11 INFO mapred.JobClient:   FileSystemCounters
          12/04/12 03:54:11 INFO mapred.JobClient:     FILE_BYTES_READ=80913119406
          12/04/12 03:54:11 INFO mapred.JobClient:     HDFS_BYTES_READ=156449
          12/04/12 03:54:11 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=107202716633
          12/04/12 03:54:11 INFO mapred.JobClient:   File Input Format Counters
          12/04/12 03:54:11 INFO mapred.JobClient:     Bytes Read=0
          12/04/12 03:54:11 INFO mapred.JobClient:   Map-Reduce Framework
          12/04/12 03:54:11 INFO mapred.JobClient:     Map output materialized bytes=28369514665
          12/04/12 03:54:11 INFO mapred.JobClient:     Map input records=1604459547
          12/04/12 03:54:11 INFO mapred.JobClient:     Reduce shuffle bytes=28259732158
          12/04/12 03:54:11 INFO mapred.JobClient:     Spilled Records=8195463443
          12/04/12 03:54:11 INFO mapred.JobClient:     Map output bytes=24031522877
          12/04/12 03:54:11 INFO mapred.JobClient:     CPU time spent (ms)=20730410
          12/04/12 03:54:11 INFO mapred.JobClient:     Total committed heap usage (bytes)=150411739136
          12/04/12 03:54:11 INFO mapred.JobClient:     Combine input records=0
          12/04/12 03:54:11 INFO mapred.JobClient:     SPLIT_RAW_BYTES=156449
          12/04/12 03:54:11 INFO mapred.JobClient:     Reduce input records=2168919094
          12/04/12 03:54:11 INFO mapred.JobClient:     Reduce input groups=1604459547
          12/04/12 03:54:11 INFO mapred.JobClient:     Combine output records=0
          12/04/12 03:54:11 INFO mapred.JobClient:     Physical memory (bytes) snapshot=144318976000
          12/04/12 03:54:11 INFO mapred.JobClient:     Reduce output records=0
          12/04/12 03:54:11 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=522892115968
          12/04/12 03:54:11 INFO mapred.JobClient:     Map output records=2168919094
          

          Going by the README, it would seem we're basically working. Should I close this issue or would you like me to look at something else? Thanks.

          I like this goraci tool. Will play some more with it.

          stack added a comment -

          @Eric No worries. Let me make sure of your 1B test regardless (I messed up and loaded many billions ... need to start over). Thanks for the heads up.

          Eric Newton added a comment -

          @stack

          Save yourself a mountain of time during the Verify step by taking a look at GORA-117.

          When I pre-split the data, my 100M runs were clean. When I leaned on the split button during ingest, I had problems even at 100M. Could be I just got (un)lucky, but I hope it will reduce your debugging time.
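
          For reference, a minimal sketch of pre-splitting a table from the 0.92-era Java client before running the Generator; the table name "ci", family "meta", the 64-bit key range, and the region count are assumptions, not taken from goraci or GORA-117:

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.hbase.HBaseConfiguration;
          import org.apache.hadoop.hbase.HColumnDescriptor;
          import org.apache.hadoop.hbase.HTableDescriptor;
          import org.apache.hadoop.hbase.client.HBaseAdmin;
          import org.apache.hadoop.hbase.util.Bytes;

          // Creates the table already split into 100 regions across an assumed
          // unsigned 64-bit row-key space, so ingest starts spread over the cluster.
          public class PreSplitTable {
            public static void main(String[] args) throws Exception {
              Configuration conf = HBaseConfiguration.create();
              HBaseAdmin admin = new HBaseAdmin(conf);
              HTableDescriptor desc = new HTableDescriptor("ci");
              desc.addFamily(new HColumnDescriptor("meta"));
              byte[] start = Bytes.toBytes(0L);   // 8 x 0x00
              byte[] end = Bytes.toBytes(-1L);    // 8 x 0xFF
              admin.createTable(desc, start, end, 100);
            }
          }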

          stack added a comment -

          @Keith Thanks for the headsup.

          Keith Turner added a comment -

          You may run into GORA-116, a bug in the gora-hbase store.

          stack added a comment -

          @Enis Gora seems to work just fine. Thanks for the offer.

          Enis Soztutar added a comment -

          @Stack, if you need any help with Gora, I can help with that.

          stack added a comment -

          Let me take this one. I've been playing w/ goraci. I did 25M for my first test and that seemed to work. Let me do the bigger numbers. Thanks Eric.


            People

            • Assignee: stack
            • Reporter: Eric Newton
            • Votes: 0
            • Watchers: 13
