Hadoop Map/Reduce
  MAPREDUCE-1224

Calling "SELECT t.* from <table> AS t" to get meta information is too expensive for big tables

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.20.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      all platforms, generic jdbc driver

    • Hadoop Flags:
      Reviewed

      Description

      The SqlManager uses the query "SELECT t.* from <table> AS t" to get the table spec. This is too expensive for big tables, and the query is issued twice to generate column names and types. For tables that are big enough to be map-reduced, this is too expensive for Sqoop to be useful.

      Attachments

      1. SqlManager.java
        13 kB
        Spencer Ho
      2. MAPREDUCE-1224.patch
        0.7 kB
        Spencer Ho

        Activity

        Todd Lipcon added a comment -

        Perhaps this could be changed to add "WHERE 1 = 0". Any SQL optimizer should evaluate this very quickly and return an empty result set, allowing metadata to be grabbed without actually doing work. Aaron?
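        (For illustration, the suggestion can be sketched as the following JDBC snippet. This is not the actual Sqoop code; the class and method names, and the table name used in the test, are hypothetical.)

        ```java
        import java.sql.Connection;
        import java.sql.ResultSet;
        import java.sql.ResultSetMetaData;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class MetadataProbe {
            // Build the metadata-only query: the always-false predicate lets the
            // database return column metadata along with an empty result set.
            public static String metadataQuery(String tableName) {
                return "SELECT t.* FROM " + tableName + " AS t WHERE 1 = 0";
            }

            // Read column names and types from the (empty) result set's metadata.
            public static void printColumns(Connection conn, String tableName)
                    throws SQLException {
                try (Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(metadataQuery(tableName))) {
                    ResultSetMetaData md = rs.getMetaData();
                    for (int i = 1; i <= md.getColumnCount(); i++) {
                        System.out.println(md.getColumnName(i) + " : "
                                + md.getColumnTypeName(i));
                    }
                }
            }
        }
        ```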

        Spencer Ho added a comment -

        The original code at lines 66 to 68 of SqlManager was:

        protected String getColNamesQuery(String tableName) {
          return "SELECT t.* FROM " + tableName + " AS t";
        }

        As this method was invoked three times in the code to generate column name and type information, it queries the database three times. For a large table, this makes the whole loading job query the entire table four times.

        The change is to add an always-false WHERE clause that forces the database to return a zero-size result set that still carries the metadata (lines 66 to 69):

        protected String getColNamesQuery(String tableName) {
          // adding where clause to prevent loading a big table
          return "SELECT t.* FROM " + tableName + " AS t WHERE 1=0";
        }

        For one of our large tables, the execution time dropped from 40 minutes to 11 minutes.

        Todd Lipcon added a comment -

        Spencer: would you mind uploading this as a unified diff (patch) file?

        Spencer Ho added a comment -

        Here is the patch file. Please review.
        Thanks for the reminder.

        Todd Lipcon added a comment -

        Marking patch available to trigger Hudson

        Jeff Hammerbacher added a comment -

        Should we try using actual JDBC metadata calls first? See, e.g., http://blog.codebeach.com/2008/12/database-metadata-with-jdbc.html
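        (The metadata calls Jeff mentions look roughly like the sketch below. This is not Sqoop code; the class name and the null catalog/schema arguments are assumptions, and getColumns behavior varies by driver.)

        ```java
        import java.sql.Connection;
        import java.sql.DatabaseMetaData;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        public class JdbcCatalogProbe {
            // Format one column entry; kept as a separate pure method.
            public static String describe(String name, String typeName) {
                return name + " : " + typeName;
            }

            // List a table's columns via DatabaseMetaData instead of running a query.
            public static void listColumns(Connection conn, String tableName)
                    throws SQLException {
                DatabaseMetaData dbmd = conn.getMetaData();
                // null catalog/schema means "match any"; "%" matches all columns.
                try (ResultSet cols = dbmd.getColumns(null, null, tableName, "%")) {
                    while (cols.next()) {
                        System.out.println(describe(cols.getString("COLUMN_NAME"),
                                                    cols.getString("TYPE_NAME")));
                    }
                }
            }
        }
        ```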

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425689/MAPREDUCE-1224.patch
        against trunk revision 882790.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no new tests are needed for this patch.
        Also please list what manual steps were performed to verify this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        -1 contrib tests. The patch failed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-h6.grid.sp2.yahoo.net/256/console

        This message is automatically generated.

        Aaron Kimball added a comment -

        @Jeff Sqoop is already using the ResultSetMetaData associated with the query, rather than trying to read the DatabaseMetaData directly. Especially when we eventually support arbitrary user-supplied queries, this will be necessary. It can also be tricky to set all the parameters for a DatabaseMetaData correctly in a generic way. But to get at ResultSetMetaData (which definitely includes the proper typing information), a query must be submitted.

        @Spencer This is a good catch and improvement! What database are you testing against? This patch passes unit tests against HSQLDB, PostgreSQL, and Oracle, so +1 from me.

        For PostgreSQL and MySQL, Sqoop uses connection.setFetchSize() to specify a row-buffered (rather than table-buffered) result, so it returns fast. But unfortunately, setFetchSize() is, like everything else in JDBC, poorly specified, so there isn't a good way to do this generically. This is a good way to ensure that the query returns quickly even if the database does not respect a row-buffered connection.
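        (The driver-specific fetch hints Aaron describes look something like this sketch. The class name and the concrete fetch-size value are illustrative; the Integer.MIN_VALUE streaming convention is specific to MySQL Connector/J, and PostgreSQL only honors fetchSize when autoCommit is disabled.)

        ```java
        import java.sql.Connection;
        import java.sql.SQLException;
        import java.sql.Statement;

        public class FetchHints {
            // Pick an illustrative fetch size by driver: MySQL Connector/J treats
            // Integer.MIN_VALUE as its row-streaming signal (driver-specific).
            public static int fetchSizeFor(String jdbcUrl) {
                return jdbcUrl.startsWith("jdbc:mysql:") ? Integer.MIN_VALUE : 1000;
            }

            // Create a statement configured for row-buffered (cursor-based) reads.
            public static Statement rowBufferedStatement(Connection conn, String jdbcUrl)
                    throws SQLException {
                // PostgreSQL requires autoCommit off for cursor-based fetching.
                conn.setAutoCommit(false);
                Statement stmt = conn.createStatement();
                stmt.setFetchSize(fetchSizeFor(jdbcUrl));
                return stmt;
            }
        }
        ```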

        Spencer Ho added a comment -

        @Aaron,
        The particular case that triggered this patch submission involves Microsoft SQL Server. For MySQL, I am using direct mode, which works in most cases.

        Aaron Kimball added a comment -

        Good to know that this works with SQL Server as well. Thanks for the patch.

        Tom White added a comment -

        I've just committed this. Thanks Spencer!

        Hudson added a comment -

        Integrated in Hadoop-Mapreduce-trunk #196 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Mapreduce-trunk/196/)


          People

          • Assignee: Spencer Ho
          • Reporter: Spencer Ho
          • Votes: 0
          • Watchers: 4

          Dates

          • Created:
          • Updated:
          • Resolved:
