Details
-
Improvement
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
1.6.1, 1.6.2
Description
Currently AccumuloInputFormat produces a split for reach Range specified in the configuration. Some table indexing schemes, for instance z-order geospacial index, produce large number of small ranges resulting in large number of splits. This is specifically a concern when using AccumuloInputFormat as a source for Spark RDD where each Split is mapped to an RDD partition.
Large number of small RDD partitions leads to poor parallism on read and high overhead on processing. A desirable alternative is to group ranges by tablet into a single split and use BatchScanner to produce the records. Grouping by tablets is useful because it represents Accumulos attempt to distributed stored records and can be influance by the user through table splits.
The grouping functionality already exists in the internal TabletLocator class.
Current proposal is to modify AbstractInputFormat such that it generates either RangeInputSplit or MultiRangeInputSplit based on a new setting in InputConfigurator. AccumuloInputFormat would then be able to inspect the type of the split and instantiate an appropriate reader.
The functinality of TabletLocator should be exposed as a public API in 1.7 as it is useful for optimizations.
Attachments
Attachments
Issue Links
- is related to
-
ACCUMULO-3710 Scanning with many singleton ranges crashes tserver
- Resolved
- relates to
-
ACCUMULO-2883 Add API method(s) that support fetching currently assigned locations for tablets
- Resolved
- links to
1.
|
RangeInputSplit extends non public class | Resolved | Keith Turner |
|