  1. Hadoop Map/Reduce
  2. MAPREDUCE-885

More efficient SQL queries for DBInputFormat

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      DBInputFormat generates InputSplits by counting the rows in a table and selecting subsections of the table with the SQL "LIMIT" and "OFFSET" keywords. These keywords are only meaningful in an ordered context, so each query also includes an "ORDER BY" clause on an index column. The resulting queries are often inefficient and require full table scans. Running multiple mappers with these queries can lead to O(n^2) behavior in the database, where n is the number of splits, so attempting to use parallelism with these queries is counter-productive.

      A better mechanism is to organize splits by the data values themselves. Each split can then be expressed as a range condition in the WHERE clause, which allows index range scans of the table and better exploits parallelism in the database.
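
      To make the contrast concrete, here is a minimal sketch of the two query shapes. It is not taken from the attached patches. The class name SplitQuerySketch, the table "employees", and the column "id" are hypothetical, and the sketch assumes a numeric split column whose minimum and maximum values are already known (for example from a prior MIN/MAX query).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the DBInputFormat implementation.
public class SplitQuerySketch {

    // Shape of the queries described above: one LIMIT/OFFSET window per
    // split, ordered by an index column.
    public static List<String> limitOffsetQueries(
            String table, String orderCol, long rowCount, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split also takes any remainder rows.
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            queries.add("SELECT * FROM " + table
                    + " ORDER BY " + orderCol
                    + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    // Value-based alternative: each split is a half-open range
    // [lo, hi) over the split column, expressed in the WHERE clause.
    // Assumes min <= max and numSplits >= 1.
    public static List<String> rangeQueries(
            String table, String splitCol, long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so the ranges cover every value up to max.
        long step = Math.max(1, (span + numSplits - 1) / numSplits);
        for (long lo = min; lo <= max; lo += step) {
            long hi = lo + step; // exclusive upper bound
            queries.add("SELECT * FROM " + table
                    + " WHERE " + splitCol + " >= " + lo
                    + " AND " + splitCol + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : limitOffsetQueries("employees", "id", 100, 4)) {
            System.out.println(q);
        }
        for (String q : rangeQueries("employees", "id", 1, 100, 4)) {
            System.out.println(q);
        }
    }
}
```

      In the first form the database generally has to order the rows and skip past everything before the OFFSET for each split. In the second form each split touches only its own range, provided the split column is indexed. Evenly sized ranges give evenly sized splits only if the values are roughly uniformly distributed, which is an assumption of this sketch.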

        Attachments

        1. MAPREDUCE-885.patch
          67 kB
          Aaron Kimball
        2. MAPREDUCE-885.2.patch
          56 kB
          Aaron Kimball
        3. MAPREDUCE-885.3.patch
          55 kB
          Aaron Kimball
        4. MAPREDUCE-885.4.patch
          54 kB
          Aaron Kimball
        5. MAPREDUCE-885.5.patch
          63 kB
          Aaron Kimball
        6. MAPREDUCE-885.6.patch
          64 kB
          Aaron Kimball

              People

              • Assignee:
                kimballa Aaron Kimball
                Reporter:
                kimballa Aaron Kimball
