Uploaded image for project: 'Hive'
  1. Hive
  2. HIVE-25949

predicate pushdown does not seem to work correctly for external tables pointing to MySQL

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Major
    • Resolution: Unresolved
    • 2.3.7
    • None
    • Hive
    • None

    Description

      The following external table has been created:

       

      CREATE EXTERNAL TABLE test_table_ext (
        col1 integer,
        col2 integer,
        col3 bigint,
        col4 decimal(38,15),
      ...
        col13 decimal(38,15),
        partition_column integer
      )
      STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
      TBLPROPERTIES (
          "hive.sql.database.type" = "MYSQL",
          "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
          "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
          "hive.sql.dbcp.username" = "<name>",
          "hive.sql.dbcp.password" = "<password>",
          "hive.sql.table" = "TEST_TABLE",
          "hive.sql.schema" = "<schema-name>",
          "hive.sql.query" = "select col1, col2, col3, col4, ..., col13, partition_column from <schema-name>.TEST_TABLE",
          "hive.sql.dbcp.maxActive" = "1",
          "hive.sql.numPartitions" = "1"
      );
      

       

      Corresponding MySQL table:

       

      CREATE TABLE TEST_TABLE (
        col1 int(11),
        col2 int(11),
        col3 bigint(20),
        col4 decimal(38,15),
      ...
        col13 decimal(38,15),
        partition_column int(11)
      ) ENGINE=InnoDB DEFAULT CHARSET=latin1
      PARTITION BY LIST (partition_column)
      

       

      The MySQL table has 470678 rows and 363 partitions.

      Running ANALYZE TABLE COMPUTE STATISTICS calculates the number of rows in the table as 11303242, which is incorrect.

      The following 2 queries have been tested:

      1) select count from test_table_ext where col2 = 7602

      2) select count from test_table_ext where partition_column = 20220207 col2 = 7602

      hive.optimize.ppd setting is set to true

      Query plan for query #2 is the following:

       

          Plan optimized by CBO.
          Vertex dependency in root stage
          Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
          Stage-0
          Fetch Operator
              limit:-1
              Stage-1
              Reducer 2
              File Output Operator [FS_7]
                  Group By Operator [GBY_5] (rows=1 width=8)
                  Output:["_col0"],aggregations:["count(VALUE._col0)"]
                  <-Map 1 [CUSTOM_SIMPLE_EDGE]
                  PARTITION_ONLY_SHUFFLE [RS_4]
                      Group By Operator [GBY_3] (rows=1 width=8)
                      Output:["_col0"],aggregations:["count()"]
                      Select Operator [SEL_2] (rows=1 width=0)
                          Filter Operator [FIL_8] (rows=1 width=0)
                          predicate:((UDFToDouble(partition_column) = 2.0220207E7) and (UDFToDouble(col2) = 7602.0))
                          TableScan [TS_0] (rows=11303242 width=0)
                              wd@test_table_ext,test_table_ext,Tbl:PARTIAL,Col:NONE,Output:["col2","partition_column"]
      

      Since query #2 is specifying partition value, expected behavior is that it should be noticeably faster that query #1 which has to scan all partitions, but these 2 queries take the same amount of time (150 seconds). This has been tested multiple times.

      When the same queries are executed directly in MySQL, query #2 runs noticeably faster than query #1 (0.01 vs 0.26 seconds).

      Attachments

        Activity

          People

            Unassigned Unassigned
            wd22 Witold Drabicki
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: