Hadoop Map/Reduce
MAPREDUCE-716

org.apache.hadoop.mapred.lib.db.DBInputFormat not working with Oracle

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0
    • Component/s: None
    • Labels:
      None
    • Environment:

      Java 1.6, Hadoop 0.19.0, Linux, Oracle

    • Hadoop Flags:
      Reviewed

      Description

      org.apache.hadoop.mapred.lib.db.DBInputFormat does not work with Oracle.

      The out-of-the-box implementation works with MySQL and HSQLDB, but NOT with Oracle. The reason is that DBInputFormat builds its paging queries with MySQL/HSQLDB-specific constructs such as "LIMIT" and "OFFSET", which Oracle does not support.

      FIX:
      Build database-vendor-specific query logic keyed on the database product name (which we can get from the JDBC connection metadata).

      I have already implemented this for Oracle and am ready to check in the code.
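The vendor dispatch described in the fix can be sketched as follows. This is an illustrative helper, not the committed patch: the class and method names (DbDialect, pagedQuery) are hypothetical. It keys the paging syntax on the product name reported by the JDBC connection metadata (Connection.getMetaData().getDatabaseProductName()):

```java
public class DbDialect {
  // Hypothetical helper: wrap a base SELECT in vendor-appropriate paging.
  // MySQL and HSQLDB accept LIMIT/OFFSET; Oracle (pre-12c) has neither,
  // so nested ROWNUM subqueries are used instead.
  public static String pagedQuery(String baseQuery, String productName,
                                  long offset, long limit) {
    if (productName.toUpperCase().startsWith("ORACLE")) {
      // Wrap the base query twice: the inner wrapper caps ROWNUM at
      // offset + limit, the outer wrapper skips the first `offset` rows.
      return "SELECT * FROM (SELECT a.*, ROWNUM dbif_rno FROM ("
          + baseQuery + ") a WHERE ROWNUM <= " + (offset + limit)
          + ") WHERE dbif_rno > " + offset;
    }
    return baseQuery + " LIMIT " + limit + " OFFSET " + offset;
  }
}
```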

      1. HADOOP-5482.20-branch.patch
        5 kB
        Aaron Kimball
      2. HADOOP-5482.patch
        5 kB
        evanand
      3. HADOOP-5482.trunk.patch
        6 kB
        Aaron Kimball
      4. MAPREDUCE-716.2.branch20.patch
        12 kB
        Aaron Kimball
      5. MAPREDUCE-716.2.trunk.patch
        13 kB
        Aaron Kimball
      6. MAPREDUCE-716.3.trunk.patch
        18 kB
        Aaron Kimball
      7. MAPREDUCE-716.4.branch20.patch
        12 kB
        Aaron Kimball
      8. MAPREDUCE-716.4.trunk.patch
        24 kB
        Aaron Kimball

        Issue Links

          Activity

          Aaron Kimball added a comment -

          Warnings fixed in MAPREDUCE-792

          Tsz Wo Nicholas Sze added a comment -

          Since test-patch is broken (see HADOOP-6124), the patch committed introduced some javac warnings without being detected.

              [javac] d:\@sze\hadoop\mapreduce\m1\src\java\org\apache\hadoop\mapreduce\lib\db\DBInputFormat.java:184: warning: [unchecked] unchecked conversion
              [javac] found   : java.lang.Class
              [javac] required: java.lang.Class<T>
              [javac]         return new OracleDBRecordReader<T>(split, inputClass,
              [javac]                                                   ^
              [javac] d:\@sze\hadoop\mapreduce\m1\src\java\org\apache\hadoop\mapreduce\lib\db\DBInputFormat.java:188: warning: [unchecked] unchecked conversion
              [javac] found   : java.lang.Class
              [javac] required: java.lang.Class<T>
              [javac]         return new MySQLDBRecordReader<T>(split, inputClass,
              [javac]                                                  ^
              [javac] d:\@sze\hadoop\mapreduce\m1\src\java\org\apache\hadoop\mapreduce\lib\db\DBInputFormat.java:192: warning: [unchecked] unchecked conversion
              [javac] found   : java.lang.Class
              [javac] required: java.lang.Class<T>
              [javac]         return new DBRecordReader<T>(split, inputClass,
              [javac]                                             ^
          
          Tom White added a comment -

          I've just committed this. Thanks Aaron!

          Aaron Kimball added a comment -

          Thanks for the review Enis.

          The Hudson failures are, of course, unrelated to this patch. There are no new tests because this patch consists of two things: vendor-specific logic, which requires drivers whose licenses are incompatible with Apache (MySQL, Oracle), and a refactoring. Since we can't check in MySQL's or Oracle's JDBC drivers, Hudson can't test those codepaths. I've used Sqoop + MAPREDUCE-716 to import from Oracle and MySQL, so I've confirmed locally that it works. The refactoring doesn't really require a new test; the existing test plan should cover it.

          Enis Soztutar added a comment -

          Ok, after reading the MySQL forum, setting the fetch size to Integer.MIN_VALUE seems to be the only way until the server supports fetch sizes. Yet, I am as surprised as the guy who asked the original question.

          So are you saying that DBRR should be a top-level class? I don't have strong opinions about this. I can pull it up to top level easily enough. I will only do this on the trunk branch, not the 0.20 branch patch.

          Yes, extending an inner static class from a top-level-class seems inelegant to me.

          +1 overall.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12412924/MAPREDUCE-716.4.trunk.patch
          against trunk revision 791909.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/366/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/366/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/366/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/366/console

          This message is automatically generated.

          Aaron Kimball added a comment -

          Patch #4 addresses these issues.

          Aaron Kimball added a comment -
          • Doing setFetchSize(Integer.MIN_VALUE) is the signal to MySQL to send the data row-at-a-time instead of buffering the entire resultset in the client. No other values for the setFetchSize argument are supported (see http://forums.mysql.com/read.php?39,137457). Given the volume of data encountered, it is likely that buffering all results could cause OutOfMemory exceptions as were seen in Sqoop. There are enough bottlenecks elsewhere in HDFS that this is likely to not be the slowest point. Consequently, this is the "correct" setting for result sets which are expected to be large.
          • So are you saying that DBRR should be a top-level class? I don't have strong opinions about this. I can pull it up to top level easily enough. I will only do this on the trunk branch, not the 0.20 branch patch.
          • A reference to the statement object is no longer held onto. I'll reintroduce explicitly tracking the reference to the statement object and close it in the close() method again.
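The streaming idiom from the first bullet looks like this with Connector/J; a minimal sketch (the helper class and method names here are illustrative). MySQL's driver streams rows one at a time only when the statement is TYPE_FORWARD_ONLY, CONCUR_READ_ONLY, and has fetch size Integer.MIN_VALUE; any other fetch size makes it buffer the entire result set client-side:

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class MySqlStreaming {
  // Illustrative helper: create a statement that streams rows from MySQL
  // instead of buffering the whole result set in the client.
  static Statement streamingStatement(Connection conn) throws SQLException {
    Statement st = conn.createStatement(
        ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
    st.setFetchSize(Integer.MIN_VALUE);  // MySQL-specific streaming signal
    return st;
  }
}
```

This requires a live MySQL connection to exercise, so it is shown here only to pin down the exact combination of flags the comment refers to.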
          Enis Soztutar added a comment -

          Patch looks good, but a few things need to be clarified:

          • Will there be a performance impact from setting statement.setFetchSize(1) in MySQL? (Knowing little about the driver, I assume it will fetch the rows one by one, which might be a real bottleneck.)
          • MAPREDUCE-359 changed the protected DBRecordReader to public. Maybe it is better to move DBRR to a new class, so that extending it seems less awkward (extends DBInputFormat.DBRecordReader<T>).
          • Why is the statement not closed in DBRR.close()?
          Aaron Kimball added a comment -

          New version of the patch which includes changes to keep o.a.h.mapred.lib.db.* in compliance. Since various interfaces are now abstract classes, I've added a wrapper which translates the calls from the old RecordReader interface to the old DBRecordReader into the corresponding calls to the new DBRecordReader implementations. This has the unfortunate side-effect of requiring some deprecated methods to be made public again instead of protected in the new-package DBRecordReader. I don't know a better way around this issue though.

          Aaron Kimball added a comment -

          I just realized this patched version won't compile cleanly; the deprecated-API version of DBInputFormat breaks on extending the changed version.

          Aaron Kimball added a comment -

          Enis,

          The DBRecordReader API already contained a getSelectQuery() that was designed for overriding by subclasses. So after looking at things more closely, rather than extend DBIF, it makes more sense to extend DBRR in my mind. Of course, DBRecordReader is not static, so any other extensions would have to be in DBInputFormat.java. I've made DBRecordReader a static class and added protected accessor methods for private fields where they make sense.

          I added src/java/org/apache/hadoop/mapreduce/lib/db/OracleDBRecordReader.java to hold the Oracle-specific logic.

          I also added src/java/org/apache/hadoop/mapreduce/lib/db/MySQLDBRecordReader.java to hold MySQL-specific logic to force it to use unbuffered mode for queries, which prevents Out-of-memory errors (see MAPREDUCE-685 for a related problem in other queries run by Sqoop).

          DBInputFormat itself includes logic in getRecordReader() to determine the particular RR implementation to instantiate. This uses the same metadata as was originally pushed down into getSelectQuery().

          I've again tested this locally against Oracle and MySQL databases I've installed; both work.
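The getRecordReader() dispatch described above can be sketched as a selection keyed on the database product name. The reader class names match those added by the patch; the selection helper itself is illustrative (the real code constructs the reader directly inside DBInputFormat rather than returning a name):

```java
class ReaderDispatch {
  // Map the JDBC product name to the record reader implementation named
  // in the patch; unrecognized vendors fall through to the generic reader.
  static String readerFor(String dbProductName) {
    String product = dbProductName.toUpperCase();
    if (product.startsWith("ORACLE")) {
      return "OracleDBRecordReader";   // ROWNUM-based paging
    } else if (product.startsWith("MYSQL")) {
      return "MySQLDBRecordReader";    // unbuffered (streaming) queries
    }
    return "DBRecordReader";           // generic LIMIT/OFFSET paging
  }
}
```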

          Owen O'Malley added a comment -

          Sorry, I thought the other jira was still open.

          Owen O'Malley added a comment -

          This is a duplicate. There is no need for both to be open.

          Aaron Kimball added a comment -

          +1.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12412687/HADOOP-5482.trunk.patch
          against trunk revision 791739.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no new tests are needed for this patch.
          Also please list what manual steps were performed to verify this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/361/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/361/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/361/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/361/console

          This message is automatically generated.

          Enis Soztutar added a comment -

          How about an OracleDBInputFormat extending DBInputFormat to introduce specific code to Oracle? Code would be cleaner and we can introduce other specific DBInputFormats if needed.

          Aaron Kimball added a comment -

          Moved issue from HADOOP into MAPREDUCE and cycled patch status for Hudson

          Aaron Kimball added a comment -

          Removing "Reviewed" flag since my code hasn't been reviewed. Also downgrading priority from "BLOCKER" to Major.

          Aaron Kimball added a comment -

          Attaching patches for 20-branch and trunk based on the original patch posted here. This rebases the work onto the new API in trunk. Also fixes indentation problems in the initial version of the patch. (The new code was indented at the wrong level.)

          Since this adds a db product-specific field, I also used this to set a flag necessary in MySQL, but that's a trivial difference.

          I have tested this and verified that it works with Oracle and MySQL locally. No new tests because we don't have an Oracle driver that can be checked in to Hadoop.

          Aaron Kimball added a comment -

          This issue was marked "Resolved" but it was never applied to the 20 branch or trunk (for that matter, it never seems to have gone through Hudson).

          I need Oracle support on 20/trunk, so I'm reopening this issue to revisit this patch.

          evanand added a comment -

          Enhanced org.apache.hadoop.mapred.lib.db.DBInputFormat with database-vendor-specific logic for Oracle, based on the database product name.

          evanand added a comment -

          Completed

          evanand added a comment -

          Enhanced org.apache.hadoop.mapred.lib.db.DBInputFormat to support database-vendor-specific logic for Oracle (in addition to the existing databases).

          evanand added a comment -

          Enhanced org.apache.hadoop.mapred.lib.db.DBInputFormat to support database-vendor-specific logic for Oracle (in addition to the existing databases).

          evanand added a comment -

          not able to attach code

          evanand added a comment -

          Enhancing org.apache.hadoop.mapred.lib.db.DBInputFormat with database-vendor-specific logic for Oracle, based on the database product name.


            People

            • Assignee:
              Aaron Kimball
              Reporter:
              evanand
            • Votes: 0
            • Watchers: 5

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated: 24h
                Remaining: 24h
                Logged: Not Specified

                  Development