[MAPREDUCE-885] More efficient SQL queries for DBInputFormat - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.21.0
Component/s: None
Labels: None
Hadoop Flags: Reviewed

Description

DBInputFormat generates InputSplits by counting the available rows in a table, and selecting subsections of the table via the "LIMIT" and "OFFSET" SQL keywords. These are only meaningful in an ordered context, so the query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Actually using multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits. Attempting to use parallelism with these queries is counter-productive.
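
To make the problem concrete, here is a minimal sketch of the kind of per-split queries described above. It is not the actual DBInputFormat code: the function name, its parameters, and the example table are hypothetical and chosen only for illustration. The point is that every split repeats the same ORDER BY over the whole table, and a database typically has to produce and skip the rows before each OFFSET.

```python
def limit_offset_split_queries(table, order_column, row_count, num_splits):
    """Build one paged query per split using LIMIT and OFFSET.

    Hypothetical helper for illustration; not part of Hadoop.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    chunk = row_count // num_splits
    queries = []
    for i in range(num_splits):
        offset = i * chunk
        # The last split absorbs any remainder rows.
        limit = row_count - offset if i == num_splits - 1 else chunk
        queries.append(
            f"SELECT * FROM {table} ORDER BY {order_column} "
            f"LIMIT {limit} OFFSET {offset}"
        )
    return queries


if __name__ == "__main__":
    for query in limit_offset_split_queries("employees", "id", 10, 3):
        print(query)
```

Each generated query differs only in its LIMIT and OFFSET values, so none of them lets the database narrow the scan with an index range.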

A better mechanism is to organize splits based on the data values themselves. Each split can then be expressed as a condition in the WHERE clause, which allows index range scans of the table and better exploits parallelism in the database.
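
As a rough sketch of value-based splitting, the following assumes an integer split column whose minimum and maximum values are already known (for example from a prior MIN/MAX query). The function name, parameters, and example table are hypothetical, and the attached patches may implement this differently.

```python
def range_split_queries(table, split_column, min_value, max_value, num_splits):
    """Build at most num_splits queries that partition [min_value, max_value].

    Hypothetical helper for illustration; assumes an integer split column.
    """
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    if max_value < min_value:
        raise ValueError("max_value must not be less than min_value")
    span = max_value - min_value + 1
    step = -(-span // num_splits)  # ceiling division
    queries = []
    lower = min_value
    while lower <= max_value:
        # Half-open ranges [lower, upper) so that no row lands in two splits.
        upper = min(lower + step, max_value + 1)
        queries.append(
            f"SELECT * FROM {table} "
            f"WHERE {split_column} >= {lower} AND {split_column} < {upper}"
        )
        lower = upper
    return queries


if __name__ == "__main__":
    for query in range_split_queries("employees", "id", 1, 10, 3):
        print(query)
```

Because each query constrains the split column directly, the database can serve it with a range scan on an index over that column, if one exists. Evenly sized value ranges do not guarantee evenly sized splits when the values are skewed.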

Attachments

MAPREDUCE-885.2.patch, 26/Aug/09 01:45, 56 kB, Aaron Kimball
MAPREDUCE-885.3.patch, 27/Aug/09 06:32, 55 kB, Aaron Kimball
MAPREDUCE-885.4.patch, 28/Aug/09 23:08, 54 kB, Aaron Kimball
MAPREDUCE-885.5.patch, 28/Aug/09 23:12, 63 kB, Aaron Kimball
MAPREDUCE-885.6.patch, 09/Sep/09 02:15, 64 kB, Aaron Kimball
MAPREDUCE-885.patch, 18/Aug/09 23:56, 67 kB, Aaron Kimball

Issue Links

is depended upon by: MAPREDUCE-907 Sqoop should use more intelligent splits (Resolved)

People

Assignee: Aaron Kimball
Reporter: Aaron Kimball
Votes: 0
Watchers: 3

Dates

Created: 18/Aug/09 23:54
Updated: 02/May/13 02:29
Resolved: 14/Sep/09 14:21