HIVE-25946: select from external table pointing to MySQL returns multiple copies of the same row


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.7
    • Fix Version/s: None
    • Component/s: Hive
    • Labels: None

    Description

      The following table has been created in Hive:

       

      CREATE EXTERNAL TABLE table_with_4_rows
      (
        col1 varchar(100),
        col2 varchar(15),
        col3 TIMESTAMP,    
        col4 TIMESTAMP
      )
      STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
      TBLPROPERTIES (
          "hive.sql.database.type" = "MYSQL",
          "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
          "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
          "hive.sql.dbcp.username" = "<name>",
          "hive.sql.dbcp.password" = "<password>",
          "hive.sql.table" = "TABLE_WITH_4_ROWS",
          "hive.sql.schema" = "schema-name",
          "hive.sql.query" = "select col1, col2, col3, col4 from schema-name.TABLE_WITH_4_ROWS",
          "hive.sql.numPartitions" = "1",
          "hive.sql.dbcp.maxActive" = "1"
      );

       

      The table in MySQL has just 4 rows, and is defined as:

       

      CREATE TABLE `TABLE_WITH_4_ROWS` (
        `col1` varchar(100) NOT NULL DEFAULT '',
        `col2` varchar(15) DEFAULT NULL,
        `col3` datetime DEFAULT NULL,
        `col4` datetime DEFAULT NULL,
        PRIMARY KEY (`col1`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
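
      As a minimal sanity check (the COUNT query is illustrative only), counting directly in MySQL confirms the source table holds just 4 rows:

      -- Run directly in MySQL, outside of Hive
      SELECT COUNT(*) FROM TABLE_WITH_4_ROWS;   -- returns 4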

       

      When the cluster is not 100% busy and has idle containers, running select col1, col2 from table_with_4_rows results in a job that uses 49 mappers and no reducers and returns 187 rows instead of 4 (each original row is duplicated multiple times in the results).

      Running the same select with WHERE col1 = 'specific-value' also uses 49 mappers and, instead of returning 1 row, also returns duplicated data (46 to 48 rows, depending on the value).

      When the cluster is 100% busy and the job needs to reclaim containers from other jobs, the above queries use just 1 mapper and return correct data (4 rows and 1 row, respectively).
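
      To summarize, a minimal sketch of the queries and the row counts observed (the filter value is the same placeholder as above):

      -- Idle cluster: 49 mappers, duplicated output
      SELECT col1, col2 FROM table_with_4_rows;          -- 187 rows returned, 4 expected
      SELECT col1, col2 FROM table_with_4_rows
      WHERE col1 = 'specific-value';                     -- 46 to 48 rows returned, 1 expected

      -- Fully busy cluster: 1 mapper, correct output (4 rows and 1 row)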

      Running ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS does not change the results; moreover, it also works incorrectly, as it sets numRows in the table's metadata to 187 as well.
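
      A minimal sketch of the statistics check (DESCRIBE FORMATTED is assumed here as the way to read numRows back from the table's metadata):

      ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS;
      -- In the "Table Parameters" section of the output, numRows shows 187 instead of 4
      DESCRIBE FORMATTED table_with_4_rows;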

      An ArrayIndexOutOfBoundsException ("Error during condition build") is thrown by jdbc.QueryConditionBuilder during query execution. Here's the output from the log file:

       

      2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
      org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
      Continuing ...
      2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
      2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] |Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
      2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: Found configuration file null
      2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
      2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitLocationProvider: org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
      2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] |exec.SerializationUtilities|: Deserializing MapWork using kryo
      2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
      2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Starting InputInitializer for Input: table_with_4_rows on vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:40,277 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: GenerateConsistentSplitsInHive=true
      2022-02-08 20:43:40,287 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: vertex_1644267138354_0004_1_00 [Map 1] transitioned from NEW to INITIALIZING due to event V_INIT
      2022-02-08 20:43:40,288 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 52428800
      2022-02-08 20:43:40,289 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Processing alias table_with_4_rows
      2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Adding 1 inputs; the first input is hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
      2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.ids = 0,1
      2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.names = col1,col2
      2022-02-08 20:43:40,300 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: Generating splits for dirs: hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
      2022-02-08 20:43:40,642 [ERROR] [InputInitializer {Map 1} #0] |jdbc.QueryConditionBuilder|: Error during condition build
      java.lang.ArrayIndexOutOfBoundsException: 0
          at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
          at org.apache.hive.storage.jdbc.QueryConditionBuilder.createConditionString(QueryConditionBuilder.java:125)
          at org.apache.hive.storage.jdbc.QueryConditionBuilder.buildCondition(QueryConditionBuilder.java:74)
          at org.apache.hive.storage.jdbc.conf.JdbcStorageConfigManager.getQueryToExecute(JdbcStorageConfigManager.java:84)
          at org.apache.hive.storage.jdbc.dao.GenericJdbcDatabaseAccessor.getTotalNumberOfRecords(GenericJdbcDatabaseAccessor.java:98)
          at org.apache.hive.storage.jdbc.JdbcInputFormat.getSplits(JdbcInputFormat.java:70)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
          at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 49
      2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 49. 29 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
      2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.SerializationUtilities|: Deserializing MapWork using kryo
      2022-02-08 20:43:40,714 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
      2022-02-08 20:43:40,720 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: # Src groups for split generation: 2
      2022-02-08 20:43:40,723 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 49 for bucket 1
      2022-02-08 20:43:40,725 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez
      2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired splits: 49 too small.  Desired splitLength: 1972178859 Max splitLength: 1073741824 New desired splits: 90 Total length: 96636764119 Original splits: 49
      2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Using original number of splits: 49 desired splits: 90
      2022-02-08 20:43:40,727 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split count is 49 grouped split count is 49, for bucket: 1
      2022-02-08 20:43:40,729 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of split groups: 49
      

      While testing, I also created another external table pointing to a different MySQL table with 470559 rows. That MySQL table is partitioned. All queries against it always use 1 mapper and I have not seen duplicated rows yet, but COMPUTE STATISTICS also works incorrectly: it sets numRows to 11303242.

People

    • Assignee: Unassigned
    • Reporter: Witold Drabicki (wd22)