Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
None
Description
Creating a functional index on an existing table fails to process all files and partitions of the table. The col-stats MDT partition ends up having an entry only for subset of files that belong to the table. An example follows.
The following create-table and inserts should create a table with 3 partitions (with each partition having one slice){}
spark.sql( s""" |create table test_table( | id int, | name string, | ts long, | price int |) using hudi | options ( | primaryKey ='id', | type = 'cow', | preCombineField = 'ts', | hoodie.metadata.record.index.enable = 'true', | hoodie.datasource.write.recordkey.field = 'id' | ) | partitioned by(price) | location '$basePath' """.stripMargin) spark.sql(s"insert into test_table (id, name, ts, price) values(1, 'a1', 1000, 10)") spark.sql(s"insert into test_table (id, name, ts, price) values(2, 'a2', 200000, 100)") spark.sql(s"insert into test_table (id, name, ts, price) values(3, 'a3', 2000000000, 1000)")
Now create a functional index (using col stats) on this table. The col-stat in the MDT should have three entries (representing column level stats for 3 files). However, col stats only has one single entry (for one of the file).
var createIndexSql = s"create index idx_datestr on test_table using column_stats(ts) options(func='from_unixtime', format='yyyy-MM-dd')" spark.sql(createIndexSql) spark.sql(s"select key, type, ColumnStatsMetadata from hudi_metadata('test_table') where type = 3").show(false)
As seen below, col-stats has only one entry for one of the file (and is missing statistics for two other files): *{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null,
{1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}*
+------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |key |type|ColumnStatsMetadata | +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |oyTjviKHuhI=/vI1OU7mFjI=Ev9dj4Bf3S0TEjEiWebRSQ==|3 |{32490467-702f-4bb4-81e8-91082da9baf0-0_0-28-66_20240409095623406.parquet, ts, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, {null, null, null, null, null, null, {1970-01-01}, null, null, null, null}, 1, 0, 434874, 869748, false}| +------------------------------------------------+----+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Attachments
Issue Links
- is depended upon by
-
HUDI-7007 Integrate functional index using bloom filter on reader side
- Reopened
- links to