Accumulo / ACCUMULO-3967

bulk import loses records when loading pre-split table

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.4.5, 1.5.3, 1.6.0, 1.6.1, 1.6.2, 1.6.3, 1.7.0
    • Fix Version/s: 1.5.4, 1.6.4, 1.7.1, 1.8.0
    • Component/s: client, tserver
    • Labels:
      None
    • Environment:

      generic hadoop 2.6.0, zookeeper 3.4.6 on redhat 6.7
      7 node cluster

      Description

      I just noticed that some records I'm loading via importDirectory go missing. After a lot of digging around trying to reproduce the problem, I discovered that it occurs most frequently when loading a table that I have just recently added splits to. In the tserver logs I'll see messages like

      20 16:25:36,805 [client.BulkImporter] INFO : Could not assign 1 map files to tablet 1xw;18;17 because : Not Serving Tablet . Will retry ...

      or

      20 16:25:44,826 [tserver.TabletServer] INFO : files [hdfs://xxxx:54310/accumulo/tables/1xw/b-00jnmxe/I00jnmxq.rf] not imported to 1xw;03;02: tablet 1xw;03;02 is closed

      These appear after messages about unloading tablets. It seems that tablets are being redistributed at the same time as the bulk import is occurring.
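To see which tablets are hit, the two message shapes quoted above can be pulled out of a tserver log with a small script. This is a minimal sketch: the regular expressions are modelled only on the two sample lines in this report, and the function name `affected_tablets` is made up for illustration. Other Accumulo versions may word these messages differently.

```python
import re
from collections import defaultdict

# Patterns modelled only on the two log messages quoted in this report.
ASSIGN_FAILED = re.compile(
    r"Could not assign (\d+) map files to tablet (\S+) because : (.+?) \. Will retry"
)
NOT_IMPORTED = re.compile(
    r"files \[(.+?)\] not imported to (\S+): tablet \S+ is closed"
)


def affected_tablets(log_lines):
    """Map each tablet extent to the list of failure reasons logged for it."""
    reasons = defaultdict(list)
    for line in log_lines:
        match = ASSIGN_FAILED.search(line)
        if match:
            # group(2) is the tablet extent, group(3) the reason text
            reasons[match.group(2)].append(match.group(3))
            continue
        match = NOT_IMPORTED.search(line)
        if match:
            reasons[match.group(2)].append("tablet is closed")
    return dict(reasons)
```

Feeding it the lines of a tserver log (for example `affected_tablets(open(path))`, with `path` pointing at the log file) gives a per-tablet summary that can be compared against the tablets unloaded around the same timestamps.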

      Steps to reproduce
      1) I run a mapreduce job that produces random data in rfiles
      2) copy the rfiles to an import directory
      3) create table or deleterows -f
      4) addsplits
      5) importdirectory
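Steps 3 to 5 above can be written out as a command file for the Accumulo shell. The sketch below only builds the text of such a file. The table name, split points, and HDFS paths are placeholders, not the values used in this report. The `importdirectory <directory> <failureDirectory> true|false` argument order follows the 1.x shell usage as I understand it, and the command acts on the current table, which is why `table` is issued first when the table already exists.

```python
def build_repro_script(table, splits, import_dir, failure_dir, recreate=True):
    """Return Accumulo shell commands for steps 3-5 as one newline-joined string.

    All arguments are caller-supplied placeholders; nothing here talks to a cluster.
    """
    commands = []
    if recreate:
        # Step 3, first variant: a fresh table (the shell then switches to it).
        commands.append(f"createtable {table}")
    else:
        # Step 3, second variant: empty an existing table.
        commands.append(f"table {table}")
        commands.append(f"deleterows -f -t {table}")
    # Step 4: pre-split the table.
    commands.append(f"addsplits -t {table} " + " ".join(splits))
    # Step 5: bulk import; the trailing "false" leaves timestamps unset by Accumulo.
    commands.append(f"importdirectory {import_dir} {failure_dir} false")
    return "\n".join(commands) + "\n"
```

Writing the returned string to a file and passing it to the shell as a command file keeps the gap between `addsplits` and `importdirectory` at seconds, which matches the timing under which the loss was seen most often.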

      I have also performed the above completely within the mapreduce job, with similar results. The difference with the mapreduce job is that the time between adding splits and the import directory is minutes rather than seconds.

      My current test creates 1,000,000 records. After importdirectory returns, a count of rows comes back anywhere from ~800,000 to 1,000,000.
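To put the loss in proportion, the shortfall can be computed from the expected and observed row counts. This is a trivial sketch with a made-up helper name, `missing_rows`; the figures in the usage note are the approximate bounds given in this report.

```python
def missing_rows(expected, observed):
    """Return (number of missing rows, fraction of expected rows missing)."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    missing = expected - observed
    return missing, missing / expected
```

With the worst case reported here, `missing_rows(1_000_000, 800_000)` gives 200,000 missing rows, about 20% of the input.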

      With my original workflow, I found that re-importing the same set of rfiles three times would eventually get all rows loaded.

              People

              • Assignee: Josh Elser (elserj)
              • Reporter: Edward Seidl (etseidl)

                  Time Tracking

                  • Estimated: Not Specified
                  • Remaining: 0h
                  • Logged: 1h 40m