Apache Hudi / HUDI-7580

Inserting rows into partitioned table leads to data sanity issues


Details

    Description

      I came across this behaviour of partitioned tables while debugging an unrelated issue with functional-index. When records are inserted into a partitioned Hudi table, the column ordering appears to get mixed up, so a subsequent query returns wrong results. An example follows:

       

      The following Scala test reproduces the issue:

        test("Test Create Functional Index") {
          if (HoodieSparkUtils.gteqSpark3_2) {
            withTempDir { tmp =>
              val tableType = "cow"
              val tableName = "rides"
              val basePath = s"${tmp.getCanonicalPath}/$tableName"
              spark.sql("set hoodie.metadata.enable=true")
              // The partition column (price) is not the last column in the schema.
              spark.sql(
                s"""
                   |create table $tableName (
                   |  id int,
                   |  name string,
                   |  price int,
                   |  ts long
                   |) using hudi
                   | options (
                   |  primaryKey ='id',
                   |  type = '$tableType',
                   |  preCombineField = 'ts',
                   |  hoodie.metadata.record.index.enable = 'true',
                   |  hoodie.datasource.write.recordkey.field = 'id'
                   | )
                   | partitioned by(price)
                   | location '$basePath'
                 """.stripMargin)
              // Insert three rows, naming the columns explicitly in declaration order.
              spark.sql(s"insert into $tableName (id, name, price, ts) values(1, 'a1', 10, 1000)")
              spark.sql(s"insert into $tableName (id, name, price, ts) values(2, 'a2', 100, 200000)")
              spark.sql(s"insert into $tableName (id, name, price, ts) values(3, 'a3', 1000, 2000000000)")

              // Read the rows back; the output below shows price and ts swapped.
              spark.sql(s"select id, name, price, ts from $tableName").show(false)
            }
          }
        }

       

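      The query returns the following result. Note that the price and ts values are swapped: the row with id 1 was inserted with price = 10 and ts = 1000, but it is read back with price = 1000 and ts = 10.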

      +---+----+----------+----+
      |id |name|price     |ts  |
      +---+----+----------+----+
      |3  |a3  |2000000000|1000|
      |2  |a2  |200000    |100 |
      |1  |a1  |1000      |10  |
      +---+----+----------+----+
       

       

      Declaring the partition column as the last column in the schema avoids the problem. If the mixed-up columns have incompatible data types, the insert fails with an error.
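
      The toy model below is one possible explanation of the symptom, not a confirmed root cause. It uses plain Scala collections only (no Spark or Hudi), and it rests on two assumptions that the report does not verify: that the stored schema places partition columns after all data columns, and that the write path pairs the supplied values with the stored columns by position. The object and function names (ColumnOrderSketch, storedOrder, writeByPosition) are hypothetical and exist only for this illustration.

```scala
// Toy model only: plain Scala collections, no Spark or Hudi involved.
object ColumnOrderSketch {
  // Columns in the order they were declared in the CREATE TABLE statement.
  val declared: Seq[String] = Seq("id", "name", "price", "ts")
  val partitionCols: Set[String] = Set("price")

  // ASSUMPTION: the stored schema lists partition columns after all data columns.
  def storedOrder(cols: Seq[String], parts: Set[String]): Seq[String] =
    cols.filterNot(parts.contains) ++ cols.filter(parts.contains)

  // ASSUMPTION: the write path pairs values with stored columns by position,
  // even though the values were supplied in declared order.
  def writeByPosition(values: Seq[Any], stored: Seq[String]): Map[String, Any] =
    stored.zip(values).toMap

  def main(args: Array[String]): Unit = {
    val stored = storedOrder(declared, partitionCols) // id, name, ts, price
    val row = writeByPosition(Seq(1, "a1", 10, 1000), stored)
    // Reading back by name reproduces the swap seen in the report.
    println(declared.map(c => s"$c=${row(c)}").mkString(", "))
    // prints: id=1, name=a1, price=1000, ts=10
  }
}
```

      Under the same assumptions, a schema declared as (id, name, ts, price) already has the partition column last, so the stored order equals the declared order and no swap occurs. That is consistent with the observation above, but it does not prove that this is what happens inside Hudi.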

            People

              codope Sagar Sumit
              vinay.bhat Vinaykumar Bhat
              Jonathan Vexler