[HUDI-7580] Inserting rows into partitioned table leads to data sanity issues - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Major
Resolution: Invalid
Affects Version/s: 1.0.0-beta1, 0.14.1
Fix Version/s: 1.0.0
Component/s: None
Labels:
- hudi-1.0.0-beta2
- pull-request-available

Epic Link:
Hudi Spark SQL

Description

I came across this behaviour of partitioned tables while debugging an unrelated issue with the functional index. The column ordering appears to get mixed up when records are inserted into a Hudi table, so a subsequent query returns wrong results. An example follows.

The following Scala test reproduces the issue:

  test("Test Create Functional Index") {
    if (HoodieSparkUtils.gteqSpark3_2) {
      withTempDir { tmp =>
        val tableType = "cow"
        val tableName = "rides"
        val basePath = s"${tmp.getCanonicalPath}/$tableName"
        spark.sql("set hoodie.metadata.enable=true")
        spark.sql(
          s"""
             |create table $tableName (
             |  id int,
             |  name string,
             |  price int,
             |  ts long
             |) using hudi
             | options (
             |  primaryKey ='id',
             |  type = '$tableType',
             |  preCombineField = 'ts',
             |  hoodie.metadata.record.index.enable = 'true',
             |  hoodie.datasource.write.recordkey.field = 'id'
             | )
             | partitioned by(price)
             | location '$basePath'
           """.stripMargin)
        spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
        spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")

        spark.sql(s"select id, name, price, ts from $tableName").show(false)
      }
    }
  }

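The query returns the following result. Note that the values of the price and ts columns are swapped: each row shows the inserted ts value under price and the inserted price value under ts.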

+---+----+----------+----+
|id |name|price     |ts  |
+---+----+----------+----+
|3  |a3  |2000000000|1000|
|2  |a2  |200000    |100 |
|1  |a1  |1000      |10  |
+---+----+----------+----+

Having the partition column as the last column in the schema does not cause this problem. If the mixed-up columns are of incompatible datatypes, then the insert fails with an error.
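
The swapped output is consistent with values being bound to columns by position rather than by name. The sketch below is only an illustration of that reading in plain Scala, not Hudi's or Spark's actual write path. It assumes the stored schema places the partition column `price` last (`id, name, ts, price`); the object `PositionalBindingSketch` and its helpers `bindByPosition` and `bindByName` are hypothetical names introduced for this example.

```scala
// Hypothetical illustration only: models positional vs. name-based binding,
// it does not call into Hudi or Spark.
object PositionalBindingSketch {
  // Column order as written in the INSERT statements of the test.
  val insertColumns: Seq[String] = Seq("id", "name", "price", "ts")

  // Assumed stored order: the partition column `price` is moved to the end.
  val tableColumns: Seq[String] = Seq("id", "name", "ts", "price")

  // Bind values to the table's columns purely by position,
  // ignoring the column names given in the INSERT.
  def bindByPosition(values: Seq[Any]): Map[String, Any] =
    tableColumns.zip(values).toMap

  // Bind each value to the column named at the same index in the INSERT.
  def bindByName(values: Seq[Any]): Map[String, Any] =
    insertColumns.zip(values).toMap

  def main(args: Array[String]): Unit = {
    // First row from the test: id = 1, name = 'a1', price = 10, ts = 1000.
    val row: Seq[Any] = Seq(1, "a1", 10, 1000)

    // Positional binding puts 1000 under price and 10 under ts,
    // which matches the row shown in the query result above.
    println(bindByPosition(row))

    // Name-based binding keeps price = 10 and ts = 1000, as intended.
    println(bindByName(row))
  }
}
```

Under this assumption, a table whose partition column is already last has identical insert and stored column orders, so both bindings agree, which fits the observation that such tables are unaffected.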

Issue Links

links to

GitHub Pull Request #11019

People

Assignee: Sagar Sumit

Reporter: Vinaykumar Bhat

Reviewers: Jonathan Vexler

Votes: 0

Watchers: 2

Dates

Created: 09/Apr/24 10:21

Updated: 13/Oct/24 22:56

Resolved: 13/Oct/24 22:56
