
Key: HADOOP-2536
Type: New Feature
Status: Closed
Resolution: Fixed
Priority: Minor
Assignee: Fredrik Hedberg
Reporter: Fredrik Hedberg
Votes: 1
Watchers: 15
Hadoop Common

MapReduce for MySQL

Created: 07/Jan/08 11:26 AM   Updated: 08/Jul/09 04:52 PM
Component/s: None
Affects Version/s: 0.19.0
Fix Version/s: 0.19.0

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
database-2.diff (28 kB), 2008-06-04 10:48 PM, Fredrik Hedberg
database.diff (28 kB), 2008-06-04 09:42 AM, Fredrik Hedberg
HADOOP-2536-0.18.2.patch (45 kB), 2009-03-04 12:39 AM, Aaron Kimball
hsqldb.tar.gz (653 kB), 2008-09-10 02:30 PM, Enis Soztutar
mapred_jdbc_v3.patch (46 kB), 2008-09-10 02:25 PM, Enis Soztutar
Issue Links:
Reference
 

Hadoop Flags: Reviewed
Resolution Date: 19/Sep/08 06:52 PM


Description
Add support for running MapReduce jobs over data residing in a MySQL table.

Fredrik Hedberg added a comment - 07/Jan/08 11:28 AM
Initial code. Attached as archive as I didn't want to create a patch before we know where in the source tree we want to put it.

Fredrik Hedberg made changes - 07/Jan/08 11:28 AM
Field Original Value New Value
Attachment hadoop-jdbc.tar.gz [ 12372614 ]
Fredrik Hedberg added a comment - 07/Jan/08 11:32 AM
Example. Identity MapReduce from one table to another.

Fredrik Hedberg made changes - 07/Jan/08 11:32 AM
Attachment Test.java [ 12372615 ]
Fredrik Hedberg made changes - 08/Jan/08 05:38 PM
Affects Version/s 0.16.0 [ 12312740 ]
Edward J. Yoon made changes - 08/Jan/08 11:42 PM
Assignee Edward Yoon [ udanax ]
Edward J. Yoon added a comment - 08/Jan/08 11:44 PM
Oh.. Sorry,
I was just about to watch it.
(missed 'assign button')

Edward J. Yoon made changes - 08/Jan/08 11:44 PM
Assignee Edward Yoon [ udanax ]
Owen O'Malley added a comment - 08/May/08 11:48 PM
I'm sorry this bug seems to have been forgotten. I'd suggest putting the code into org.apache.hadoop.mapred.lib.jdbc.*

I'd suggest getting rid of the JDBCMapper and JDBCReducer and moving the initJob into a static method of the JDBCInputFormat and OutputFormat. So have,

public static void setInput(JobConf job,
                            String table,
                            JDBCField keyField,
                            JDBCField[] fields) { ... }

and a corresponding setOutput method in JDBCOutputFormat. The preferred style is to have getters and setters rather than public constants of the strings for the configuration.

You should also use your own property for the table name rather than input/output path, because that might be confusing.
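For illustration, the shape suggested above might look roughly like the sketch below. It is not code from any attached patch: java.util.Properties stands in for JobConf so the snippet compiles on its own, plain strings stand in for JDBCField, and the class name and property keys are made up for the example.

```java
import java.util.Properties;

/**
 * Sketch only: Properties stands in for Hadoop's JobConf, and every
 * class name, method name and property key below is hypothetical.
 */
final class JdbcInputConfigSketch {
  // Keys are private, so callers go through the setter and getters
  // instead of referring to public string constants.
  private static final String TABLE_KEY = "example.jdbc.input.table";
  private static final String KEY_FIELD_KEY = "example.jdbc.input.key.field";
  private static final String FIELDS_KEY = "example.jdbc.input.fields";

  private JdbcInputConfigSketch() {}

  /** Stores table and field names under dedicated properties, not the input path. */
  static void setInput(Properties job, String table, String keyField,
                       String... fields) {
    job.setProperty(TABLE_KEY, table);
    job.setProperty(KEY_FIELD_KEY, keyField);
    job.setProperty(FIELDS_KEY, String.join(",", fields));
  }

  static String getInputTable(Properties job) {
    return job.getProperty(TABLE_KEY);
  }

  static String getInputKeyField(Properties job) {
    return job.getProperty(KEY_FIELD_KEY);
  }

  static String[] getInputFields(Properties job) {
    String joined = job.getProperty(FIELDS_KEY, "");
    return joined.isEmpty() ? new String[0] : joined.split(",");
  }
}
```

A driver would then call JdbcInputConfigSketch.setInput(job, "access", "url", "url", "referrer") once, and the input format would read the values back through the getters.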


Owen O'Malley made changes - 08/May/08 11:50 PM
Assignee Fredrik Hedberg [ fhedberg ]
Fredrik Hedberg added a comment - 04/Jun/08 09:42 AM
New version of the JDBC layer for Hadoop. Took care of the issues pointed out by Owen and made some other changes that substantially improved performance.

Fredrik Hedberg made changes - 04/Jun/08 09:42 AM
Attachment database.diff [ 12383366 ]
Fredrik Hedberg made changes - 04/Jun/08 09:43 AM
Attachment hadoop-jdbc.tar.gz [ 12372614 ]
Fredrik Hedberg made changes - 04/Jun/08 09:43 AM
Attachment Test.java [ 12372615 ]
Fredrik Hedberg added a comment - 04/Jun/08 09:44 AM
Updated example. Identity MapReduce from one table to another.

Fredrik Hedberg made changes - 04/Jun/08 09:44 AM
Attachment Driver.java [ 12383367 ]
Owen O'Malley added a comment - 04/Jun/08 11:31 AM
When patches are ready, you need to submit them to make them "patch available".

Owen O'Malley made changes - 04/Jun/08 11:31 AM
Status Open [ 1 ] Patch Available [ 10002 ]
Fredrik Hedberg added a comment - 04/Jun/08 11:36 AM
OK, just wanted to get your input before doing so.

Hadoop QA added a comment - 04/Jun/08 04:16 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383367/Driver.java
against trunk revision 663079.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2572/console

This message is automatically generated.


Fredrik Hedberg made changes - 04/Jun/08 04:17 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Fredrik Hedberg made changes - 04/Jun/08 04:17 PM
Attachment Driver.java [ 12383367 ]
Fredrik Hedberg made changes - 04/Jun/08 04:17 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Fredrik Hedberg added a comment - 04/Jun/08 04:18 PM
Hudson tried to apply the example. Removed example and resubmitted.

Hadoop QA added a comment - 04/Jun/08 08:07 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12383366/database.diff
against trunk revision 663337.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

+1 javadoc. The javadoc tool did not generate any warning messages.

+1 javac. The applied patch does not increase the total number of javac compiler warnings.

-1 findbugs. The patch appears to introduce 3 new Findbugs warnings.

+1 release audit. The applied patch does not increase the total number of release audit warnings.

+1 core tests. The patch passed core unit tests.

+1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/artifact/trunk/build/test/checkstyle-errors.html
Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/2575/console

This message is automatically generated.


Fredrik Hedberg added a comment - 04/Jun/08 10:48 PM
Fixed two out of three FindBugs issues. Last one is rather hard to avoid.

Also, Hudson complains about the lack of unit-tests. Bar the inclusion of an embedded SQL database, I can't really think of anything non-trivial in this case.

Comments?


Fredrik Hedberg made changes - 04/Jun/08 10:48 PM
Attachment database-2.diff [ 12383422 ]
Fredrik Hedberg made changes - 04/Jun/08 10:48 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Fredrik Hedberg made changes - 04/Jun/08 10:49 PM
Status Open [ 1 ] Patch Available [ 10002 ]
Doug Cutting added a comment - 04/Jun/08 11:13 PM
> Bar the inclusion of an embedded SQL database, [ ... ]

We could add Derby to src/test/lib for this. This would add about 3MB of jar files to Hadoop...


Tsz Wo (Nicholas), SZE added a comment - 04/Jun/08 11:17 PM
> Also, Hudson complains about the lack of unit-tests. Bar the inclusion of an embedded SQL database, I can't really think of anything non-trivial in this case.

We could implement a MiniDBMS with very limited ability (e.g. use array or java collection to store data in memory), implement a java.sql.Driver and register it in java.sql.DriverManager. Then, use it for testing.
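As a rough illustration of the registration step described above (not code from this issue), a test-only driver could be wired into java.sql.DriverManager as sketched below. The class name and the "jdbc:minidbms:" URL prefix are invented for the example, and the connection is a stub that supports only close() and isClosed(); a usable MiniDBMS would also have to implement statements and result sets.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Properties;
import java.util.logging.Logger;

/** Sketch of a test-only driver; the class name and URL prefix are made up. */
final class MiniDbmsDriverSketch implements Driver {
  static final String URL_PREFIX = "jdbc:minidbms:";

  /** Creates one driver instance and registers it with DriverManager. */
  static MiniDbmsDriverSketch register() throws SQLException {
    MiniDbmsDriverSketch driver = new MiniDbmsDriverSketch();
    DriverManager.registerDriver(driver);
    return driver;
  }

  @Override
  public boolean acceptsURL(String url) {
    return url != null && url.startsWith(URL_PREFIX);
  }

  @Override
  public Connection connect(String url, Properties info) throws SQLException {
    if (!acceptsURL(url)) {
      return null; // JDBC contract: return null so other drivers are tried.
    }
    boolean[] closed = {false};
    // Stub connection built with a dynamic proxy; unsupported calls fail loudly.
    return (Connection) Proxy.newProxyInstance(
        MiniDbmsDriverSketch.class.getClassLoader(),
        new Class<?>[] {Connection.class},
        (proxy, method, args) -> {
          switch (method.getName()) {
            case "close":
              closed[0] = true;
              return null;
            case "isClosed":
              return closed[0];
            case "toString":
              return "MiniDbmsConnection(stub)";
            case "hashCode":
              return System.identityHashCode(proxy);
            case "equals":
              return proxy == args[0];
            default:
              throw new SQLFeatureNotSupportedException(method.getName());
          }
        });
  }

  @Override
  public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
    return new DriverPropertyInfo[0];
  }

  @Override
  public int getMajorVersion() {
    return 0;
  }

  @Override
  public int getMinorVersion() {
    return 1;
  }

  @Override
  public boolean jdbcCompliant() {
    return false;
  }

  @Override
  public Logger getParentLogger() throws SQLFeatureNotSupportedException {
    throw new SQLFeatureNotSupportedException("no java.util.logging support");
  }
}
```

After MiniDbmsDriverSketch.register(), a call such as DriverManager.getConnection("jdbc:minidbms:test") is routed to this driver because acceptsURL matches the prefix.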


Doug Cutting added a comment - 04/Jun/08 11:28 PM
More embedded SQL options are listed at:
http://java-source.net/open-source/database-engines

TinySQL looks attractive. Its jar is less than 100kB.


Fredrik Hedberg made changes - 04/Jun/08 11:38 PM
Status Patch Available [ 10002 ] Open [ 1 ]
Fredrik Hedberg added a comment - 04/Jun/08 11:48 PM
Thanks for the input. I think I'll use HSQLDB instead of TinySQL. Despite its larger footprint (600kB), it seems a lot more mature and is apparently used widely in its embedded form.

Tom White added a comment - 05/Jun/08 08:28 AM
When we move to Java 6 (HADOOP-2325) we can use the database it comes with (http://java.sun.com/javase/6/webnotes/features.html). Until then we'll need to include one of the ones mentioned above.

Doug Cutting added a comment - 05/Jun/08 04:23 PM
> I think I'll use HSQLDB instead of TinySQL [...]

Good choice, since its license is BSD, not LGPL, which would rule TinySQL out.

> When we move to Java 6 (HADOOP-2325) we can use the database it comes with [ ... ]

That would be nice. Perhaps we should make this issue dependent on HADOOP-2235?


Enis Soztutar added a comment - 13/Jun/08 10:48 AM
Thanks for the useful patch!
I think we should iron out a few issues before this issue gets in:

  1. It has been discussed in several blogs that LIMIT and OFFSET should not be used w/o an ORDER BY clause, since the query execution plan might opt for different row orderings (http://azimbabu.blogspot.com/2008/03/sqllimit-offset-without-order-by.html). Please note that I am no expert on this subject; any thoughts are welcome.
  2. I guess the key field does not have to be a Text object. Shall we make it more general?
  3. As suggested by your inline comment, inferring the field types from the ResultSetMetaData might be a better solution.
  4. It would be really useful if DatabaseInputFormat and DatabaseOutputFormat included more documentation, and a simple example in their javadocs (or in the mapred tutorial).
  5. We are executing an update request for every record in the RecordWriter, which may not be optimal. Also, the connection should not be in autocommit mode. We should issue the commit in the close function of RecordWriter, catch exceptions in the write function, and do a rollback should an error occur.
  6. Does ON DUPLICATE KEY UPDATE work only on MySQL? If so, we should either change it or document this in the javadoc for DatabaseOutputFormat.
  7. Why don't we just use Derby, then switch to JavaDB once HADOOP-2235 is in?
  8. The patch has to be changed for the new directory structure. You can use the sed script in HADOOP-2916.
  9. The patch uses tabs in several places; these should be changed to spaces.
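
To make the first point concrete, here is a minimal sketch of how per-split queries could pin the row order before paging with LIMIT and OFFSET. It is an illustration under assumptions rather than the patch's code: the class and method names are hypothetical, the row count is assumed to be known up front, and the table and column names are assumed to be trusted identifiers (they are concatenated into the SQL unescaped).

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: one paged query per split; names here are hypothetical. */
final class SplitQuerySketch {
  private SplitQuerySketch() {}

  /**
   * Builds at most numSplits queries that together cover rowCount rows.
   * The ORDER BY clause gives every query the same row ordering, so the
   * LIMIT/OFFSET windows do not overlap or skip rows.
   */
  static List<String> splitQueries(String table, String orderBy,
                                   long rowCount, int numSplits) {
    if (numSplits < 1) {
      throw new IllegalArgumentException("numSplits must be >= 1");
    }
    List<String> queries = new ArrayList<>();
    long chunk = (rowCount + numSplits - 1) / numSplits; // ceiling division
    for (long offset = 0; offset < rowCount; offset += chunk) {
      long limit = Math.min(chunk, rowCount - offset);
      queries.add("SELECT * FROM " + table
          + " ORDER BY " + orderBy
          + " LIMIT " + limit + " OFFSET " + offset);
    }
    return queries;
  }
}
```

The ordering column should be unique (for example a primary key); otherwise ties can still be returned in a different order from one query to the next.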


Ankur added a comment - 17/Jul/08 12:56 PM
This is a useful piece of functionality. Not sure if we can include this for 0.18 release which is already branched.

Fredrik Hedberg added a comment - 25/Aug/08 11:32 AM
Enis,

Thanks for the comments.

Those are all very valid points, I couldn't agree more.

Unfortunately, I don't have the time to take this further at the moment, so if anyone else would like to continue working on this, I'd be happy to answer any questions. Not surprisingly, I think this functionality could be quite useful, so bringing this up to commit quality would be sweet.


Fredrik Hedberg added a comment - 28/Aug/08 08:48 PM

Enis Soztutar added a comment - 10/Sep/08 02:25 PM
Since Fredrik said that he cannot continue to work on the patch, I have updated it with some changes.
The changes include:
  1. package and class names have DB prefix instead of database.
  2. DBInputSplit is now an inner class of DBInputFormat
  3. instead of the type mapping to convert the data types in the library, a new DBWritable interface is introduced. The classes implement DBWritable to convert from/to db tuples.
  4. DBRecordReader emits <LongWritable, T> types where record number is the key and T is of type DBWritable.
  5. DBRecordWriter accepts <K, V> where K implements DBWritable(hence written to db) and V is discarded.
  6. JDBC uses JDBC batch update.
  7. introduced two ways of setting the input query.
  8. improved documentation.
  9. added a sample mapred program reading data from db and writing the results back to db. The program calculates the number of pageviews in a synthetically generated access log. The example program uses HSQLDB as an embedded database.
  10. added a test case running the example job in the MiniCluster.
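
As a hedged sketch of how items 3 and 6 can fit together (a record that binds itself to a statement, plus a batched write that commits on close and rolls back on failure), the snippet below uses only java.sql interfaces. RowBinderSketch and BatchWriterSketch are hypothetical names for illustration; this is not the DBWritable or DBRecordWriter code from the patch.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

/** Hypothetical stand-in for the DBWritable idea: a record binds its own fields. */
interface RowBinderSketch {
  void write(PreparedStatement statement) throws SQLException;
}

/** Sketch of a batching writer; not the committed DBRecordWriter. */
final class BatchWriterSketch implements AutoCloseable {
  private final Connection connection;
  private final PreparedStatement statement;

  BatchWriterSketch(Connection connection, String insertSql)
      throws SQLException {
    this.connection = connection;
    connection.setAutoCommit(false); // commit once, in close()
    this.statement = connection.prepareStatement(insertSql);
  }

  /** Queues one row instead of executing an update per record. */
  void write(RowBinderSketch row) throws SQLException {
    row.write(statement);
    statement.addBatch();
  }

  /** Sends the batch and commits; rolls back if either step fails. */
  @Override
  public void close() throws SQLException {
    try {
      statement.executeBatch();
      connection.commit();
    } catch (SQLException e) {
      connection.rollback();
      throw e;
    } finally {
      try {
        statement.close();
      } finally {
        connection.close();
      }
    }
  }
}
```

A caller might write rows with writer.write(s -> s.setString(1, url)) and rely on close() to flush everything in one transaction. For very large outputs a real writer would probably flush the batch periodically; that is left out here.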

Enis Soztutar made changes - 10/Sep/08 02:25 PM
Attachment mapred_jdbc_v3.patch [ 12389831 ]
Enis Soztutar added a comment - 10/Sep/08 02:30 PM
Derby does not support LIMIT ... OFFSET clauses, so the patch uses HSQLDB, which has a BSD-like license.
I have included the jar and license for HSQLDB. The patch will fail w/o these.

Enis Soztutar made changes - 10/Sep/08 02:30 PM
Attachment hsqldb.tar.gz [ 12389832 ]
Enis Soztutar made changes - 11/Sep/08 08:32 AM
Fix Version/s 0.19.0 [ 12313211 ]
Affects Version/s 0.19.0 [ 12313211 ]
Status Open [ 1 ] Patch Available [ 10002 ]
Hadoop QA added a comment - 11/Sep/08 08:41 PM
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12389832/hsqldb.tar.gz
against trunk revision 694459.

+1 @author. The patch does not contain any @author tags.

-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no tests are needed for this patch.

-1 patch. The patch command could not apply the patch.

Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3248/console

This message is automatically generated.


Fredrik Hedberg added a comment - 16/Sep/08 12:44 PM
Nice work Enis. I can't test it right now but it looks good

Enis Soztutar added a comment - 17/Sep/08 10:02 AM
Manually tested the patch (since Hudson will fail to build due to the hsqldb dependency). The tests and the release audit pass with:
[exec] +1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] +1 tests included. The patch appears to include 3 new or modified tests.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.

Subversion Commit
Repository: ASF
Revision: #697184
Date: Fri Sep 19 18:51:41 UTC 2008
User: acmurthy
Message: HADOOP-2536. Implement a JDBC based database input and output formats to allow Map-Reduce applications to work with databases. Contributed by Fredrik Hedberg and Enis Soztutar.
Files Changed
ADD /hadoop/core/trunk/src/examples/org/apache/hadoop/examples/DBCountPageView.java
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db/DBWritable.java
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db/package.html
ADD /hadoop/core/trunk/lib/hsqldb-LICENSE.txt
MODIFY /hadoop/core/trunk/src/examples/org/apache/hadoop/examples/ExampleDriver.java
MODIFY /hadoop/core/trunk/CHANGES.txt
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/lib/db/TestDBJob.java
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db/DBConfiguration.java
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db/DBOutputFormat.java
ADD /hadoop/core/trunk/lib/hsqldb.jar
ADD /hadoop/core/trunk/src/test/org/apache/hadoop/mapred/lib/db
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db/DBInputFormat.java
ADD /hadoop/core/trunk/src/mapred/org/apache/hadoop/mapred/lib/db

Arun C Murthy added a comment - 19/Sep/08 06:52 PM
I just committed this. Thanks, Fredrik and Enis!

Arun C Murthy made changes - 19/Sep/08 06:52 PM
Resolution Fixed [ 1 ]
Hadoop Flags [Reviewed]
Status Patch Available [ 10002 ] Resolved [ 5 ]
Hudson added a comment - 22/Sep/08 03:18 PM

Tsz Wo (Nicholas), SZE made changes - 23/Sep/08 07:35 PM
Link This issue is related to HADOOP-4249 [ HADOOP-4249 ]
Tsz Wo (Nicholas), SZE added a comment - 23/Sep/08 07:35 PM
We need to declare hsqldb.jar in eclipse plugin. See HADOOP-4249.

Nigel Daley made changes - 20/Nov/08 11:38 PM
Status Resolved [ 5 ] Closed [ 6 ]
Otis Gospodnetic added a comment - 23/Dec/08 09:54 PM
Fredrik or Enis, do you have any usage examples by any chance?

Enis Soztutar added a comment - 24/Dec/08 07:44 AM
Indeed, there is an example checked-in with the patch. You can find it at src/examples/org/apache/hadoop/examples/DBCountPageView.java. You can use the example to use local hsqldb or you may configure it to use an external DB.

Aaron Kimball added a comment - 04/Mar/09 12:39 AM
The HADOOP-2536-0.18.2.patch file backports this functionality to Hadoop 0.18.2 and 0.18.3.

Aaron Kimball made changes - 04/Mar/09 12:39 AM
Attachment HADOOP-2536-0.18.2.patch [ 12401346 ]
Owen O'Malley made changes - 08/Jul/09 04:52 PM
Component/s mapred [ 12310690 ]