HIVE-25946: select from external table pointing to MySQL returns multiple copies of the same row


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.7
    • Fix Version/s: None
    • Component/s: Hive
    • Labels: None

    Description

      The following table has been created in Hive:

       

      CREATE EXTERNAL TABLE table_with_4_rows
      (
        col1 varchar(100),
        col2 varchar(15),
        col3 TIMESTAMP,    
        col4 TIMESTAMP
      )
      STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
      TBLPROPERTIES (
          "hive.sql.database.type" = "MYSQL",
          "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
          "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
          "hive.sql.dbcp.username" = "<name>",
          "hive.sql.dbcp.password" = "<password>",
          "hive.sql.table" = "TABLE_WITH_4_ROWS",
          "hive.sql.schema" = "schema-name",
          "hive.sql.query" = "select col1, col2, col3, col4 from schema-name.TABLE_WITH_4_ROWS",
          "hive.sql.numPartitions" = "1",
          "hive.sql.dbcp.maxActive" = "1"
      );

       

      The table in MySQL has just 4 rows, and is defined as:

       

      CREATE TABLE `TABLE_WITH_4_ROWS` (
        `col1` varchar(100) NOT NULL DEFAULT '',
        `col2` varchar(15) DEFAULT NULL,
        `col3` datetime DEFAULT NULL,
        `col4` datetime DEFAULT NULL,
        PRIMARY KEY (`col1`)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
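
      As a minimal sanity check (the COUNT query is illustrative only), counting directly in MySQL confirms the source table holds just 4 rows:

      -- Run directly in MySQL, outside of Hive
      SELECT COUNT(*) FROM TABLE_WITH_4_ROWS;   -- returns 4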

       

      When the cluster is not 100% busy and has idle containers, running select col1, col2 from table_with_4_rows results in a job that uses 49 mappers and no reducers and returns 187 rows instead of 4 (each original row is duplicated multiple times in the results).

      Running the same select with WHERE col1 = 'specific-value' also uses 49 mappers and, instead of returning 1 row, also returns duplicated data (46 to 48 rows, depending on the value).

      When the cluster is 100% busy and the job needs to reclaim containers from other jobs, the above queries use just 1 mapper and return correct data (4 rows and 1 row, respectively).
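
      To summarize, a minimal sketch of the queries and the row counts observed (the filter value is the same placeholder as above):

      -- Idle cluster: 49 mappers, duplicated output
      SELECT col1, col2 FROM table_with_4_rows;          -- 187 rows returned, 4 expected
      SELECT col1, col2 FROM table_with_4_rows
      WHERE col1 = 'specific-value';                     -- 46 to 48 rows returned, 1 expected

      -- Fully busy cluster: 1 mapper, correct output (4 rows and 1 row)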

      Running ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS does not change the results; moreover, it also works incorrectly, as it sets numRows in the table's metadata to 187 as well.
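
      A minimal sketch of the statistics check (DESCRIBE FORMATTED is assumed here as the way to read numRows back from the table's metadata):

      ANALYZE TABLE table_with_4_rows COMPUTE STATISTICS;
      -- In the "Table Parameters" section of the output, numRows shows 187 instead of 4
      DESCRIBE FORMATTED table_with_4_rows;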

      An ArrayIndexOutOfBoundsException ("Error during condition build") is thrown by jdbc.QueryConditionBuilder during query execution. Here's the output from the log file:

       

      2022-02-08 20:43:39 Running Dag: dag_1644267138354_0004_1
      org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 1; Content is not allowed in prolog.
      Continuing ...
      2022-02-08 20:44:03 Completed Dag: dag_1644267138354_0004_1
      2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Num tasks is -1. Expecting VertexManager/InputInitializers/1-1 split to set #tasks for the vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,898 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Vertex will initialize from input initializer. vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,900 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: Starting 1 inputInitializers for vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:39,921 [INFO] [Dispatcher thread {Central}] |Configuration.deprecation|: mapred.committer.job.setup.cleanup.needed is deprecated. Instead, use mapreduce.job.committer.setup.cleanup.needed
      2022-02-08 20:43:39,998 [INFO] [Dispatcher thread {Central}] |conf.HiveConf|: Found configuration file null
      2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitGenerator using llap affinitized locations: false
      2022-02-08 20:43:40,002 [INFO] [Dispatcher thread {Central}] |tez.HiveSplitGenerator|: SplitLocationProvider: org.apache.hadoop.hive.ql.exec.tez.Utils$1@565d6567
      2022-02-08 20:43:40,115 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,125 [INFO] [Dispatcher thread {Central}] |exec.SerializationUtilities|: Deserializing MapWork using kryo
      2022-02-08 20:43:40,267 [INFO] [Dispatcher thread {Central}] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
      2022-02-08 20:43:40,275 [INFO] [InputInitializer {Map 1} #0] |dag.RootInputInitializerManager|: Starting InputInitializer for Input: table_with_4_rows on vertex vertex_1644267138354_0004_1_00 [Map 1]
      2022-02-08 20:43:40,277 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: GenerateConsistentSplitsInHive=true
      2022-02-08 20:43:40,287 [INFO] [Dispatcher thread {Central}] |impl.VertexImpl|: vertex_1644267138354_0004_1_00 [Map 1] transitioned from NEW to INITIALIZING due to event V_INIT
      2022-02-08 20:43:40,288 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: The preferred split size is 52428800
      2022-02-08 20:43:40,289 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Processing alias table_with_4_rows
      2022-02-08 20:43:40,290 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Adding 1 inputs; the first input is hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
      2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.ids = 0,1
      2022-02-08 20:43:40,299 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: hive.io.file.readcolumn.names = col1,col2
      2022-02-08 20:43:40,300 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: Generating splits for dirs: hdfs://.../user/wdrabicki/hive/database/table_with_4_rows
      2022-02-08 20:43:40,642 [ERROR] [InputInitializer {Map 1} #0] |jdbc.QueryConditionBuilder|: Error during condition build
      java.lang.ArrayIndexOutOfBoundsException: 0
          at java.beans.XMLDecoder.readObject(XMLDecoder.java:250)
          at org.apache.hive.storage.jdbc.QueryConditionBuilder.createConditionString(QueryConditionBuilder.java:125)
          at org.apache.hive.storage.jdbc.QueryConditionBuilder.buildCondition(QueryConditionBuilder.java:74)
          at org.apache.hive.storage.jdbc.conf.JdbcStorageConfigManager.getQueryToExecute(JdbcStorageConfigManager.java:84)
          at org.apache.hive.storage.jdbc.dao.GenericJdbcDatabaseAccessor.getTotalNumberOfRecords(GenericJdbcDatabaseAccessor.java:98)
          at org.apache.hive.storage.jdbc.JdbcInputFormat.getSplits(JdbcInputFormat.java:70)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:442)
          at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:561)
          at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:196)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
          at java.security.AccessController.doPrivileged(Native Method)
          at javax.security.auth.Subject.doAs(Subject.java:422)
          at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1926)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
          at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)
      2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |io.HiveInputFormat|: number of splits 49
      2022-02-08 20:43:40,709 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of input splits: 49. 29 available slots, 1.7 waves. Input format is: org.apache.hadoop.hive.ql.io.HiveInputFormat
      2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: PLAN PATH = hdfs://.../var/tmp/hive-scratch/wdrabicki/07e003af-375b-4bcf-9cb2-6ec15c67e5dd/hive_2022-02-08_20-43-31_574_1039773130311933277-1/wdrabicki/_tez_scratch_dir/dda4a9d6-af45-4a8c-8a48-c1ddea2ef318/map.xml
      2022-02-08 20:43:40,710 [INFO] [InputInitializer {Map 1} #0] |exec.SerializationUtilities|: Deserializing MapWork using kryo
      2022-02-08 20:43:40,714 [INFO] [InputInitializer {Map 1} #0] |exec.Utilities|: Deserialized plan (via RPC) - name: Map 1 size: 3.61KB
      2022-02-08 20:43:40,720 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: # Src groups for split generation: 2
      2022-02-08 20:43:40,723 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Estimated number of tasks: 49 for bucket 1
      2022-02-08 20:43:40,725 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Grouping splits in Tez
      2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Desired splits: 49 too small.  Desired splitLength: 1972178859 Max splitLength: 1073741824 New desired splits: 90 Total length: 96636764119 Original splits: 49
      2022-02-08 20:43:40,726 [INFO] [InputInitializer {Map 1} #0] |grouper.TezSplitGrouper|: Using original number of splits: 49 desired splits: 90
      2022-02-08 20:43:40,727 [INFO] [InputInitializer {Map 1} #0] |tez.SplitGrouper|: Original split count is 49 grouped split count is 49, for bucket: 1
      2022-02-08 20:43:40,729 [INFO] [InputInitializer {Map 1} #0] |tez.HiveSplitGenerator|: Number of split groups: 49
      

      While testing, I also created another external table pointing to a different MySQL table with 470559 rows. That MySQL table is partitioned. All queries against it always use 1 mapper and I have not seen duplicated rows yet, but COMPUTE STATISTICS also works incorrectly: it sets numRows to 11303242.

People

    • Assignee: Unassigned
    • Reporter: Witold Drabicki (wd22)