[HIVE-25949] predicate pushdown does not seem to work correctly for external tables pointing to MySQL - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.3.7
Fix Version/s: None
Component/s: Hive
Labels:
None
Environment:
Hide

Hive runs on GCP Dataproc, image version is 1.5.56-debian10 (Hive v 2.3.7)

MySQL server version is 5.7.36.

The following jars are used:

add jar gs://d-test-bucket-1/commons-pool-1.6.jar; add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar; add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar; add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar; (identical behavior when using mysql-connector-java-5.1.49)
Show
Hive runs on GCP Dataproc, image version is 1.5.56-debian10 (Hive v 2.3.7 ) MySQL server version is 5.7.36 . The following jars are used: add jar gs: //d-test-bucket-1/commons-pool-1.6.jar; add jar gs: //d-test-bucket-1/hive-jdbc-handler-2.3.7.jar; add jar gs: //d-test-bucket-1/commons-dbcp-1.4.jar; add jar gs: //d-test-bucket-1/mysql-connector-java-8.0.27.jar; (identical behavior when using mysql-connector-java-5.1.49)

Description

The following external table has been created:

CREATE EXTERNAL TABLE test_table_ext (
  col1 integer,
  col2 integer,
  col3 bigint,
  col4 decimal(38,15),
...
  col13 decimal(38,15),
  partition_column integer
)
STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler'
TBLPROPERTIES (
    "hive.sql.database.type" = "MYSQL",
    "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver",
    "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>",
    "hive.sql.dbcp.username" = "<name>",
    "hive.sql.dbcp.password" = "<password>",
    "hive.sql.table" = "TEST_TABLE",
    "hive.sql.schema" = "<schema-name>",
    "hive.sql.query" = "select col1, col2, col3, col4, ..., col13, partition_column from <schema-name>.TEST_TABLE",
    "hive.sql.dbcp.maxActive" = "1",
    "hive.sql.numPartitions" = "1"
);

Corresponding MySQL table:

CREATE TABLE TEST_TABLE (
  col1 int(11),
  col2 int(11),
  col3 bigint(20),
  col4 decimal(38,15),
...
  col13 decimal(38,15),
  partition_column int(11)
) ENGINE=InnoDB DEFAULT CHARSET=latin1
PARTITION BY LIST (partition_column)

The MySQL table has 470678 rows and 363 partitions.

Running ANALYZE TABLE COMPUTE STATISTICS calculates the number of rows in the table as 11303242, which is incorrect.

The following 2 queries have been tested:

1) select count from test_table_ext where col2 = 7602

2) select count from test_table_ext where partition_column = 20220207 col2 = 7602

hive.optimize.ppd setting is set to true

Query plan for query #2 is the following:

    Plan optimized by CBO.
    Vertex dependency in root stage
    Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
    Stage-0
    Fetch Operator
        limit:-1
        Stage-1
        Reducer 2
        File Output Operator [FS_7]
            Group By Operator [GBY_5] (rows=1 width=8)
            Output:["_col0"],aggregations:["count(VALUE._col0)"]
            <-Map 1 [CUSTOM_SIMPLE_EDGE]
            PARTITION_ONLY_SHUFFLE [RS_4]
                Group By Operator [GBY_3] (rows=1 width=8)
                Output:["_col0"],aggregations:["count()"]
                Select Operator [SEL_2] (rows=1 width=0)
                    Filter Operator [FIL_8] (rows=1 width=0)
                    predicate:((UDFToDouble(partition_column) = 2.0220207E7) and (UDFToDouble(col2) = 7602.0))
                    TableScan [TS_0] (rows=11303242 width=0)
                        wd@test_table_ext,test_table_ext,Tbl:PARTIAL,Col:NONE,Output:["col2","partition_column"]

Since query #2 is specifying partition value, expected behavior is that it should be noticeably faster that query #1 which has to scan all partitions, but these 2 queries take the same amount of time (150 seconds). This has been tested multiple times.

When the same queries are executed directly in MySQL, query #2 runs noticeably faster than query #1 (0.01 vs 0.26 seconds).

predicate pushdown does not seem to work correctly for external tables pointing to MySQL

Details

Description

Attachments

Activity

People

Dates