Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
2.5.0, 2.6.0
-
Reviewed
Description
DBInputFormat.createDBRecorder is reusing JDBC connections across instances of DBRecordReader. This is not a good idea. We should be creating separate connection. If performance is a concern, then we should be using connection pooling instead.
I looked at DBOutputFormat.getRecordReader. It actually creates a new Connection object for each DBRecordReader. So can we just change DBInputFormat to create new Connection every time? The connection reuse code was added as part of connection leak bug in MAPREDUCE-1443. Any reason for caching the connection?
We observed this issue in a customer setup where they were reading data from MySQL using Pig. As per customer, the query is returning two records which causes Pig to create two instances of DBRecordReader. These two instances are sharing the database connection instance. The first DBRecordReader runs to extract the first record from MySQL just fine, but then closes the shared connection instance. When the second DBRecordReader runs, it tries to execute a query to retrieve the second record on the closed shared connection instance, which fail. If we set
mapred.map.tasks to 1, the query will be successful.
Attachments
Attachments
- mapreduce-6237.patch
- 7 kB
- Kannan Rajah
- mapreduce-6237.patch
- 6 kB
- Kannan Rajah
- mapreduce-6237.patch
- 6 kB
- Kannan Rajah
Activity
Changed getConnection method to createConnection so that new instances get a new connection.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12695991/mapreduce-6237.patch
against trunk revision 1c09ca2.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 13 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5140//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5140//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5140//console
This message is automatically generated.
I don't see connection pooling being used in other places like DBOutputFormat. Is it OK if we address this as a separate JIRA? Unless you are concerned this particular change will introduce performance regression. Typically, for a DBInputFormat instance, I would expect just a single call to create an instance of RecordReader.
rkannan82 Thank you for the contribution. I prefer to keep the method getConnection since it's public method. Also, TestDbClasses shouldn't be changed.
About the connection pool, I'd like to +1 to do on separate JIRA.
Fixed the patch to retain getConnection method. Reverted the changes to TestDbClasses.
Fixed the patch to retain getConnection method. Reverted the changes to TestDbClasses.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12696317/mapreduce-6237.patch
against trunk revision b73e776.
+1 @author. The patch does not contain any @author tags.
-1 tests included. The patch doesn't appear to include any new or modified tests.
Please justify why no new tests are needed for this patch.
Also please list what manual steps were performed to verify this patch.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 13 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5149//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5149//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5149//console
This message is automatically generated.
rkannan82 the latest change looks an incompatible change - new getConnection() method can return null.
The current semantics of getConnection() is to create new connection to DB and cannot return null. I think we should preserve not only the method name "getConnection()" but also the semantics for backward compatibility though I know the name of getConnection() is confusing. Would you update a patch?
I understand that this causes a change in behavior. But I would like to point out my understanding of how this API is meant to be used. If you still feel that I should bring back the old behavior, I will do so.
- getConnection is a public method, so anyone can access it. I don't know why it was done this way because it breaks encapsulation. Except for the subclasses, no one else should care about accessing the Connection object. In fact, subclasses can already access it without this method because its a protected variable.
- But to avoid breaking subclasses that already use this API to access the connection, we need to keep it. So I updated the patch.
- The connection object is created inside setConf method. So it is guaranteed to be not null unless the caller explicitly called the close method and then tried to access the connection again. This is an improper usage of the API. I doubt if anyone would be using it that way. But if a subclass is doing this, then it is better to fix that.
Thank you for clarification. I know you're correct and my suggestion is not clean fix, but I think this issue is one of critical issues to be fixed for branch-2 and 2.6.1. To commit this to branch-2 and 2.6.1, incompatible changes are not accepted since it's public class. My suggestion is:
- Let's fix the bug here with full backward compatibility about public methods to commit this to branch-2.
- How about fixing the behavior cleanly which targets trunk on separate JIRA?
getConnection maintains same semantics as earlier - creates connection if its NULL.
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12696811/mapreduce-6237.patch
against trunk revision d27439f.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 1 new or modified test files.
-1 javac. The applied patch generated 1154 javac compiler warnings (more than the trunk's current 1149 warnings).
+1 javadoc. There were no new javadoc warning messages.
+1 eclipse:eclipse. The patch built with eclipse:eclipse.
-1 findbugs. The patch appears to introduce 13 new Findbugs (version 2.0.3) warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core.
Test results: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5162//testReport/
Findbugs warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5162//artifact/patchprocess/newPatchFindbugsWarningshadoop-mapreduce-client-core.html
Javac warnings: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5162//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-MAPREDUCE-Build/5162//console
This message is automatically generated.
ozawa Is the patch alright? Anything else I need to do to get this committed?
+1, findbugs look not related to your patch. I'll commit this to branch-2 and trunk shortly.
Committed this to trunk, branch-2, and branch-2.6. Thanks rkannan82 for your contribution and thanks jira.shegalov for your review.
rkannan82, BTW, do you mind creating following JIRA to use thread pool based on Gera's suggestion?
FAILURE: Integrated in Hadoop-Yarn-trunk #833 (See https://builds.apache.org/job/Hadoop-Yarn-trunk/833/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
SUCCESS: Integrated in Hadoop-Yarn-trunk-Java8 #99 (See https://builds.apache.org/job/Hadoop-Yarn-trunk-Java8/99/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
SUCCESS: Integrated in Hadoop-trunk-Commit #7053 (See https://builds.apache.org/job/Hadoop-trunk-Commit/7053/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/CHANGES.txt
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk-Java8 #100 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk-Java8/100/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
FAILURE: Integrated in Hadoop-Hdfs-trunk #2031 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk/2031/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
FAILURE: Integrated in Hadoop-Mapreduce-trunk #2050 (See https://builds.apache.org/job/Hadoop-Mapreduce-trunk/2050/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
SUCCESS: Integrated in Hadoop-Hdfs-trunk-Java8 #97 (See https://builds.apache.org/job/Hadoop-Hdfs-trunk-Java8/97/)
MAPREDUCE-6237. Multiple mappers with DBInputFormat don't work because of reusing conections. Contributed by Kannan Rajah. (ozawa: rev 241336ca2b7cf97d7e0bd84dbe0542b72f304dc9)
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DataDrivenDBInputFormat.java
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/DBInputFormat.java
- hadoop-mapreduce-project/CHANGES.txt
- hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/db/OracleDataDrivenDBInputFormat.java
tomwhite kimballa I will submit a patch in the next day to remove the connection caching and instead create a new instance for each instance of DBRecordReader. Please let me know if you have any concerns with that.