Uploaded image for project: 'Accumulo'
  1. Accumulo
  2. ACCUMULO-3602

BatchScanner optimization for AccumuloInputFormat

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 1.6.1, 1.6.2
    • 1.7.0
    • client

    Description

      Currently AccumuloInputFormat produces a split for reach Range specified in the configuration. Some table indexing schemes, for instance z-order geospacial index, produce large number of small ranges resulting in large number of splits. This is specifically a concern when using AccumuloInputFormat as a source for Spark RDD where each Split is mapped to an RDD partition.

      Large number of small RDD partitions leads to poor parallism on read and high overhead on processing. A desirable alternative is to group ranges by tablet into a single split and use BatchScanner to produce the records. Grouping by tablets is useful because it represents Accumulos attempt to distributed stored records and can be influance by the user through table splits.

      The grouping functionality already exists in the internal TabletLocator class.

      Current proposal is to modify AbstractInputFormat such that it generates either RangeInputSplit or MultiRangeInputSplit based on a new setting in InputConfigurator. AccumuloInputFormat would then be able to inspect the type of the split and instantiate an appropriate reader.

      The functinality of TabletLocator should be exposed as a public API in 1.7 as it is useful for optimizations.

      Attachments

        1. ACCUMULO-3602.diff
          102 kB
          Josh Elser

        Issue Links

          Activity

            People

              echeipesh Eugene Cheipesh
              echeipesh Eugene Cheipesh
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 1h 10m
                  1h 10m