Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Add support for running MapReduce jobs over data residing in a MySQL table.

      Attachments

      1. mapred_jdbc_v3.patch
        46 kB
        Enis Soztutar
      2. hsqldb.tar.gz
        653 kB
        Enis Soztutar
      3. HADOOP-2536-0.18.2.patch
        45 kB
        Aaron Kimball
      4. database-2.diff
        28 kB
        Fredrik Hedberg
      5. database.diff
        28 kB
        Fredrik Hedberg

          Activity

          Fredrik Hedberg added a comment -

          Initial code. Attached as archive as I didn't want to create a patch before we know where in the source tree we want to put it.

          Fredrik Hedberg added a comment -

          Example. Identity MapReduce from one table to another.

          Edward J. Yoon added a comment -

          Oh, sorry,
          I was just about to watch it.
          (missed the 'assign' button)

          Owen O'Malley added a comment -

          I'm sorry this bug seems to have been forgotten. I'd suggest putting the code into org.apache.hadoop.mapred.lib.jdbc.*

          I'd suggest getting rid of the JDBCMapper and JDBCReducer and moving the initJob into a static method of the JDBCInputFormat and OutputFormat. So have,

          public static void setInput(JobConf job,
                                      String table,
                                      JDBCField keyField,
                                      JDBCField[] fields) { ... }

          and a corresponding setOutput method in JDBCOutputFormat. The preferred style is to have getters and setters rather than public constants of the strings for the configuration.

          You should also use your own property for the table name rather than input/output path, because that might be confusing.
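          The static-setter shape suggested above can be sketched in a few lines. This is an editor's illustration, not the committed API: java.util.Properties stands in for JobConf so the snippet is self-contained, and the class name and property keys are invented.

```java
import java.util.Properties;

// Hypothetical sketch of the suggested static-setter style: callers pass the
// table and fields once, instead of using public configuration-key constants.
public class JDBCInputFormatSketch {

  // Store the input table, key field, and field list in the job configuration.
  public static void setInput(Properties job, String table,
                              String keyField, String[] fields) {
    job.setProperty("mapred.jdbc.input.table", table);
    job.setProperty("mapred.jdbc.input.key.field", keyField);
    job.setProperty("mapred.jdbc.input.fields", String.join(",", fields));
  }

  // Build the SELECT statement from the configured properties; ordering by
  // the key field keeps split boundaries stable across queries.
  public static String getSelectQuery(Properties job) {
    return "SELECT " + job.getProperty("mapred.jdbc.input.fields")
        + " FROM " + job.getProperty("mapred.jdbc.input.table")
        + " ORDER BY " + job.getProperty("mapred.jdbc.input.key.field");
  }
}
```

A corresponding setOutput would mirror this on the JDBCOutputFormat side.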

          Fredrik Hedberg added a comment -

          New version of the JDBC layer for Hadoop. Took care of the issues pointed out by Owen and made some other changes that substantially improved performance.

          Fredrik Hedberg added a comment -

          Updated example. Identity MapReduce from one table to another.

          Owen O'Malley added a comment -

          When patches are ready, you need to submit them to make them "patch available".

          Fredrik Hedberg added a comment -

          OK, just wanted to get your input before doing so.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383367/Driver.java
          against trunk revision 663079.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2572/console

          This message is automatically generated.

          Fredrik Hedberg added a comment -

          Hudson tried to apply the example. Removed the example and resubmitted.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12383366/database.diff
          against trunk revision 663337.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          -1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/console

          This message is automatically generated.

          Fredrik Hedberg added a comment -

          Fixed two out of three FindBugs issues. Last one is rather hard to avoid.

          Also, Hudson complains about the lack of unit-tests. Bar the inclusion of an embedded SQL database, I can't really think of anything non-trivial in this case.

          Comments?

          Show
          Fredrik Hedberg added a comment - Fixed two out of three FindBugs issues. Last one is rather hard to avoid. Also, Hudson complains about the lack of unit-tests. Bar the inclusion of an embedded SQL database, I can't really think of anything non-trivial in this case. Comments?
          Hide
          Doug Cutting added a comment -

          > Bar the inclusion of an embedded SQL database, [ ... ]

          We could add Derby to src/test/lib for this. This would add about 3MB of jar files to Hadoop...

          Tsz Wo Nicholas Sze added a comment -

          > Also, Hudson complains about the lack of unit-tests. Bar the inclusion of an embedded SQL database, I can't really think of anything non-trivial in this case.

          We could implement a MiniDBMS with very limited ability (e.g. use array or java collection to store data in memory), implement a java.sql.Driver and register it in java.sql.DriverManager. Then, use it for testing.
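          Registering such a toy driver is straightforward with the JDK alone. The sketch below is an editor's illustration of the idea: the "jdbc:minidbms:" URL scheme and class name are invented, and the returned Connection is a dynamic proxy rather than a store-backed implementation.

```java
import java.lang.reflect.Proxy;
import java.sql.*;
import java.util.Properties;
import java.util.logging.Logger;

// Minimal in-memory driver sketch: register with DriverManager so tests can
// obtain Connection objects without a real database. A real MiniDBMS would
// back the Connection with an in-memory table store.
public class MiniDriver implements Driver {
  static {
    try {
      DriverManager.registerDriver(new MiniDriver());
    } catch (SQLException e) {
      throw new RuntimeException(e);
    }
  }

  public boolean acceptsURL(String url) {
    return url != null && url.startsWith("jdbc:minidbms:");
  }

  public Connection connect(String url, Properties info) {
    if (!acceptsURL(url)) {
      return null; // DriverManager will try the next registered driver
    }
    // Hand back a dynamic proxy; a real implementation would answer
    // createStatement() and friends from the in-memory store.
    return (Connection) Proxy.newProxyInstance(
        MiniDriver.class.getClassLoader(),
        new Class<?>[] { Connection.class },
        (proxy, method, args) -> {
          throw new UnsupportedOperationException(method.getName());
        });
  }

  public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
    return new DriverPropertyInfo[0];
  }

  public int getMajorVersion() { return 0; }
  public int getMinorVersion() { return 1; }
  public boolean jdbcCompliant() { return false; }
  public Logger getParentLogger() { return Logger.getLogger("minidbms"); }
}
```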

          Doug Cutting added a comment -

          More embedded SQL options are listed at:
          http://java-source.net/open-source/database-engines

          TinySQL looks attractive. Its jar is less than 100kB.

          Fredrik Hedberg added a comment -

          Thanks for the input. I think I'll use HSQLDB instead of TinySQL - despite its larger footprint (600kB), it seems a lot more mature and is apparently widely used in its embedded form.

          Tom White added a comment -

          When we move to Java 6 (HADOOP-2325) we can use the database it comes with (http://java.sun.com/javase/6/webnotes/features.html). Until then we'll need to include one of the ones mentioned above.

          Doug Cutting added a comment -

          > I think I'll use HSQLDB instead of TinySQL [...]

          Good choice, since its license is BSD, not LGPL, which would rule TinySQL out.

          > When we move to Java 6 (HADOOP-2325) we can use the database it comes with [ ... ]

          That would be nice. Perhaps we should make this issue dependent on HADOOP-2235?

          Enis Soztutar added a comment -

          Thanks for the useful patch!
          I think we should iron out a few issues before this goes in:

          1. It has been discussed in several blogs that LIMIT and OFFSET should not be used without an ORDER BY clause, since the query execution plan might opt for different row orderings (http://azimbabu.blogspot.com/2008/03/sqllimit-offset-without-order-by.html). Please note that I am no expert on this subject; any thoughts are welcome.
          2. I guess the key field does not have to be a Text object. Shall we make it more general?
          3. As suggested by your inline comment, inferring the field types from the ResultSetMetaData might be a better solution.
          4. It would be really useful if DatabaseInputFormat and DatabaseOutputFormat included more documentation, and a simple example in their javadocs (or in the mapred tutorial).
          5. We are executing an update request for every record in the RecordWriter, which may not be optimal. Also, the connection should not be in autocommit mode. We should issue the commit in the close function of the RecordWriter, catch exceptions in the write function, and do a rollback should an error occur.
          6. Does ON DUPLICATE KEY UPDATE work only on MySQL? If so, we should either change it or document this in the javadoc for DatabaseOutputFormat.
          7. Why don't we just use Derby, then switch to JavaDB once HADOOP-2235 is in?
          8. The patch has to be changed for the new directory structure. You can use the sed script in HADOOP-2916.
          9. The patch uses tabs in several places; these should be changed to spaces.
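          The batching and commit-in-close behaviour asked for in point 5 can be sketched as below. This is an editor's illustration with invented class and method names; the real patch wires this logic into a Hadoop RecordWriter.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

// Sketch of a batching writer: queue INSERTs with addBatch(), keep
// autocommit off, commit once in close(), and roll back on failure.
public class BatchingDbWriter {
  private final Connection connection;
  private final PreparedStatement statement;

  public BatchingDbWriter(Connection connection, String table,
                          String[] fieldNames) throws SQLException {
    this.connection = connection;
    connection.setAutoCommit(false); // commit explicitly in close()
    this.statement = connection.prepareStatement(
        constructInsertQuery(table, fieldNames));
  }

  // Build "INSERT INTO t (a,b) VALUES (?,?)" from the field list.
  public static String constructInsertQuery(String table, String[] fieldNames) {
    StringBuilder sql = new StringBuilder("INSERT INTO ").append(table)
        .append(" (").append(String.join(",", fieldNames))
        .append(") VALUES (");
    for (int i = 0; i < fieldNames.length; i++) {
      sql.append(i == 0 ? "?" : ",?");
    }
    return sql.append(")").toString();
  }

  // Queue one record instead of executing an update per record.
  public void write(Object[] values) throws SQLException {
    for (int i = 0; i < values.length; i++) {
      statement.setObject(i + 1, values[i]);
    }
    statement.addBatch();
  }

  // Flush the batch and commit once; roll back if anything fails.
  public void close() throws SQLException {
    try {
      statement.executeBatch();
      connection.commit();
    } catch (SQLException e) {
      connection.rollback();
      throw e;
    } finally {
      statement.close();
    }
  }
}
```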

          Ankur added a comment -

          This is a useful piece of functionality. Not sure if we can include this for 0.18 release which is already branched.

          Fredrik Hedberg added a comment -

          Enis,

          Thanks for the comments.

          Those are all very valid points, I couldn't agree more.

          Unfortunately, I don't have the time to take this further at the moment, so if anyone else would like to continue working on this, I'd be happy to answer any questions. Not surprisingly, I think this functionality could be quite useful, so bringing this up to commit quality would be sweet.

          Fredrik Hedberg added a comment -

          FYI: http://www.greenplum.com/resources/MapReduce/
          Enis Soztutar added a comment -

          Since Fredrik said that he cannot continue to work on the patch, I have updated it with some changes.
          The changes include:

          1. package and class names have DB prefix instead of database.
          2. DBInputSplit is now an inner class of DBInputFormat
          3. instead of the type mapping to convert the data types in the library, a new DBWritable interface is introduced. The classes implement DBWritable to convert from/to db tuples.
          4. DBRecordReader emits <LongWritable, T> types where record number is the key and T is of type DBWritable.
          5. DBRecordWriter accepts <K, V> where K implements DBWritable(hence written to db) and V is discarded.
          6. JDBC uses JDBC batch update.
          7. introduced two ways of setting the input query.
          8. improved documentation.
          9. added a sample mapred program reading data from a db and writing the results back to the db. The program calculates the number of pageviews in a synthetically generated access log. The example program uses HSQLDB as an embedded database.
          10. added a test case running the example job in the MiniCluster.
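          The DBWritable contract in change 3 can be illustrated with a small record class. The interface below mirrors the two methods the patch introduces (binding to a PreparedStatement on write, reading from a ResultSet on read); AccessRecord and its column names are invented for this sketch.

```java
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Simplified local stand-in for the DBWritable interface: the record class
// itself maps its fields to and from a database tuple.
interface DBWritable {
  void write(PreparedStatement statement) throws SQLException;
  void readFields(ResultSet resultSet) throws SQLException;
}

// Hypothetical record for one row of an access-log table (url, time).
class AccessRecord implements DBWritable {
  String url;
  long time;

  // Bind this record's fields to the INSERT statement's parameters.
  public void write(PreparedStatement statement) throws SQLException {
    statement.setString(1, url);
    statement.setLong(2, time);
  }

  // Populate this record from the current row of the ResultSet.
  public void readFields(ResultSet resultSet) throws SQLException {
    url = resultSet.getString("url");
    time = resultSet.getLong("time");
  }
}
```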
          Enis Soztutar added a comment -

          Derby does not support LIMIT ... OFFSET clauses, so the patch uses HSQLDB, which has a BSD-like license.
          I have included the jar and license for HSQLDB; the patch will fail without these.

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12389832/hsqldb.tar.gz
          against trunk revision 694459.

          +1 @author. The patch does not contain any @author tags.

          -1 tests included. The patch doesn't appear to include any new or modified tests.
          Please justify why no tests are needed for this patch.

          -1 patch. The patch command could not apply the patch.

          Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3248/console

          This message is automatically generated.

          Fredrik Hedberg added a comment -

          Nice work Enis. I can't test it right now but it looks good

          Enis Soztutar added a comment -

          Manually tested the patch (since Hudson will fail to build due to the hsqldb dependency). The tests and the release audit pass with:
          [exec] +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

          Arun C Murthy added a comment -

          I just committed this. Thanks, Fredrik and Enis!

          Hudson added a comment -

          Integrated in Hadoop-trunk #611 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/611/ )
          Tsz Wo Nicholas Sze added a comment -

          We need to declare hsqldb.jar in eclipse plugin. See HADOOP-4249.

          Otis Gospodnetic added a comment -

          Fredrik or Enis, do you have any usage examples, by any chance?

          Enis Soztutar added a comment -

          Indeed, there is an example checked in with the patch. You can find it at src/examples/org/apache/hadoop/examples/DBCountPageView.java. You can run the example against the local HSQLDB, or configure it to use an external DB.

          Aaron Kimball added a comment -

          The HADOOP-2536-0.18.2.patch file backports this functionality to Hadoop 0.18.2 and 0.18.3.


            People

            • Assignee:
              Fredrik Hedberg
            • Reporter:
              Fredrik Hedberg
            • Votes: 1
            • Watchers: 15
