Hadoop Common / HADOOP-5967

Sqoop should only use a single map task

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      The current DBInputFormat implementation uses SELECT ... LIMIT ... OFFSET statements to read from a database table. This results in several queries accessing the same table at the same time. Most database implementations will use a full table scan for each such query, starting at row 1 and scanning down until the OFFSET is reached before emitting data to the client. The upshot is that we see O(n^2) performance in the size of the table when using a large number of mappers, whereas a single mapper would read through the table in O(n) time in the number of rows.
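
      To make the cost argument concrete, here is a rough model, not a measurement. It assumes the table's n rows are split evenly across m mappers and that the database walks from row 1 to each query's OFFSET before emitting that query's rows. Under those assumptions the total work is n(m+1)/2 row visits, which is quadratic once the number of mappers grows with the table. The class and method names are illustrative only.

```java
// Simplified cost model for LIMIT/OFFSET splits (an assumption-laden sketch,
// not taken from DBInputFormat or from any database's query planner).
final class OffsetScanCost {

    /** Rows visited in total when each of `mappers` queries scans from row 1 to its own end. */
    static long rowsScanned(long totalRows, long mappers) {
        long chunk = totalRows / mappers; // assumes totalRows divides evenly by mappers
        long scanned = 0;
        for (long i = 0; i < mappers; i++) {
            long offset = i * chunk;
            scanned += offset + chunk; // skip `offset` rows, then emit `chunk` rows
        }
        return scanned;
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        for (long m : new long[] {1, 10, 100}) {
            System.out.println(m + " mapper(s): " + rowsScanned(n, m) + " rows visited");
        }
    }
}
```

      With one mapper the model visits each row exactly once. With 10 mappers it visits 5.5 times as many rows, even though the same data is returned.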

      This patch sets the number of map tasks to 1 in the MapReduce job that Sqoop launches.
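
      The effect of a single map task on the generated SQL can be sketched as follows. This is a hypothetical illustration of the query shape, not DBInputFormat's actual code and not the contents of the patch. The table name `employees`, the ordering column `id`, and the helper `build` are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one SELECT ... ORDER BY ... LIMIT ... OFFSET query per split.
final class SplitQueries {

    static List<String> build(String table, String orderBy, long totalRows, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = totalRows / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = (long) i * chunk;
            // The last split also takes any remainder rows.
            long limit = (i == numSplits - 1) ? totalRows - offset : chunk;
            queries.add("SELECT * FROM " + table + " ORDER BY " + orderBy
                    + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Four splits: four concurrent queries, three of them with a non-zero OFFSET.
        build("employees", "id", 1000, 4).forEach(System.out::println);
        // One split: a single query with OFFSET 0 that reads the table once.
        build("employees", "id", 1000, 1).forEach(System.out::println);
    }
}
```

      With one split there is no OFFSET to skip past, so the repeated scanning described above does not arise.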

      1. single-mapper.patch
        0.6 kB
        Aaron Kimball

        Activity

        Aaron Kimball added a comment -

        This patch implements this as a one-liner. No new tests because it's trivial. I've verified that it passes existing unit tests, and also that it does indeed use a single mapper on a cluster.

        Scott Carey added a comment -

        Some databases optimize multiple queries doing sequential scans on the same table at the same time by having them 'tag along' with the same sequential scan (Postgres, at least) which avoids the O( N^2 ) issue. But LIMIT ... OFFSET is not guaranteed to return distinct, consistent partitions unless it has an ORDER BY clause and is in the same transaction anyway.

        Aaron Kimball added a comment -

        An ORDER BY clause is included in DBInputFormat's SQL statements that it sends over JDBC. But each mapper (necessarily) runs in a separate transaction, as it's on a separate node.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12409836/single-mapper.patch
        against trunk revision 782083.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 Eclipse classpath. The patch retains Eclipse classpath integrity.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch-vesta.apache.org/472/console

        This message is automatically generated.

        Aaron Kimball added a comment -

        Hudson's test failures are unrelated...

        Tom White added a comment -

        +1

        I've just committed this. Thanks Aaron!


          People

          • Assignee: Aaron Kimball
          • Reporter: Aaron Kimball
          • Votes: 0
          • Watchers: 2
