Hadoop Map/Reduce: MAPREDUCE-885

More efficient SQL queries for DBInputFormat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      DBInputFormat generates InputSplits by counting the rows in a table and selecting subsections of the table with the SQL "LIMIT" and "OFFSET" keywords. These keywords are only meaningful in an ordered context, so each query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Running them from multiple mappers can lead to O(n^2) behavior in the database, where n is the number of splits, so attempting to use parallelism with these queries is counterproductive.
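      To make the cost concrete, here is a minimal sketch of how per-split LIMIT/OFFSET queries could be built. It is not the DBInputFormat source: the class name, the method names, and the "employees" table are hypothetical. The rowsTouched tally assumes the database has to read and discard every row that precedes the offset, which is common but not universal.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the actual DBInputFormat code.
class OffsetSplitQueries {

    /** Builds one ORDER BY ... LIMIT ... OFFSET query per split. */
    static List<String> build(String table, String orderCol, long rowCount, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split absorbs any remainder rows.
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            queries.add("SELECT * FROM " + table + " ORDER BY " + orderCol
                    + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    /**
     * Rows read across all splits, ASSUMING each query must scan past
     * the rows before its offset before returning its own rows.
     */
    static long rowsTouched(long rowCount, int numSplits) {
        long chunk = rowCount / numSplits;
        long total = 0;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            total += offset + limit;
        }
        return total;
    }

    public static void main(String[] args) {
        for (String q : build("employees", "id", 1000, 4)) {
            System.out.println(q);
        }
        System.out.println("rows touched: " + rowsTouched(1000, 4));
    }
}
```

      Under that assumption, 1,000 rows in 4 splits costs 2,500 row reads in total, against 1,000 for a single pass, and the gap widens as more splits are added.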

      A better mechanism is to organize splits by the data values themselves. Each split is then expressed as a range condition in the WHERE clause, which allows index range scans of the table and better exploits parallelism in the database.
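      A value-based split might look like the sketch below. It is an illustration under stated assumptions, not the code in the attached patches: the class and method names are hypothetical, the split column is assumed to be an indexed integer column, and its minimum and maximum values are assumed to have been fetched beforehand (for example with a MIN/MAX query).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code from the attached patches.
class RangeSplitQueries {

    /**
     * Divides the closed interval [min, max] of an integer column into
     * at most numSplits half-open ranges, one WHERE clause per range.
     * Assumes max - min is small enough not to overflow a long.
     */
    static List<String> build(String table, String col, long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so the ranges cover the whole interval.
        long step = Math.max(1, (span + numSplits - 1) / numSplits);
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1); // exclusive upper bound
            queries.add("SELECT * FROM " + table
                    + " WHERE " + col + " >= " + lo
                    + " AND " + col + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : build("employees", "id", 1, 1000, 4)) {
            System.out.println(q);
        }
    }
}
```

      Each query here names its own range, so no split depends on rows that belong to another split. One trade-off of this approach is that evenly sized value ranges only yield evenly sized splits when the column values are roughly uniformly distributed.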

      1. MAPREDUCE-885.patch
        67 kB
        Aaron Kimball
      2. MAPREDUCE-885.6.patch
        64 kB
        Aaron Kimball
      3. MAPREDUCE-885.5.patch
        63 kB
        Aaron Kimball
      4. MAPREDUCE-885.4.patch
        54 kB
        Aaron Kimball
      5. MAPREDUCE-885.3.patch
        55 kB
        Aaron Kimball
      6. MAPREDUCE-885.2.patch
        56 kB
        Aaron Kimball


            People

            • Assignee: Aaron Kimball
            • Reporter: Aaron Kimball
            • Votes: 0
            • Watchers: 3
