Apache Hudi / HUDI-8724 Bug fixes - Phase 1 / HUDI-4818

Using CustomKeyGenerator fails w/ SparkHoodieTableFileIndex

Details

    • Type: Sub-task
    • Status: In Progress
    • Priority: Critical
    • Resolution: Unresolved
    • None
    • 1.0.1
    • None
    • 4

    Description

      Currently, using `CustomKeyGenerator` with the partition-path config `hoodie.datasource.write.partitionpath.field=ts:timestamp` fails with:

      Caused by: java.lang.RuntimeException: Failed to cast value `2022-05-11` to `LongType` for partition column `ts_ms`
      	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$2(Spark3ParsePartitionUtil.scala:72)
      	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:286)
      	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
      	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
      	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
      	at scala.collection.TraversableLike.map(TraversableLike.scala:286)
      	at scala.collection.TraversableLike.map$(TraversableLike.scala:279)
      	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
      	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.$anonfun$parsePartition$1(Spark3ParsePartitionUtil.scala:65)
      	at scala.Option.map(Option.scala:230)
      	at org.apache.spark.sql.execution.datasources.Spark3ParsePartitionUtil.parsePartition(Spark3ParsePartitionUtil.scala:63)
      	at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionPath(SparkHoodieTableFileIndex.scala:274)
      	at org.apache.hudi.SparkHoodieTableFileIndex.parsePartitionColumnValues(SparkHoodieTableFileIndex.scala:258)
      	at org.apache.hudi.BaseHoodieTableFileIndex.lambda$getAllQueryPartitionPaths$3(BaseHoodieTableFileIndex.java:190)
      	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
      	at java.util.ArrayList$ArrayListSpliterator.forEachRemaining(ArrayList.java:1384)
      	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:482)
      	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:472)
      	at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
      	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
      	at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:566)
      	at org.apache.hudi.BaseHoodieTableFileIndex.getAllQueryPartitionPaths(BaseHoodieTableFileIndex.java:193) 

       

      This occurs because `SparkHoodieTableFileIndex` produces an incorrect partition schema at XXX

      where it properly handles only `TimestampBasedKeyGenerator`s, but not other key generators that may change the data type of the partition value relative to the source partition column. In this case, `ts` is a long in the source table schema, but the generated partition value is a string.
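
The mismatch can be illustrated without Spark or Hudi. The sketch below is not Hudi code: the class and method names are hypothetical stand-ins, and the `yyyy-MM-dd` output format and UTC zone are assumptions chosen to match the `2022-05-11` value in the stack trace. It formats a long timestamp into a date-string partition value, then tries to read that value back using the source column's type (long), which fails in the same way the cast to `LongType` does.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

public class PartitionTypeMismatch {

    // Hypothetical stand-in for a key generator that turns a long timestamp
    // (epoch millis) into a date-formatted partition value.
    // The "yyyy-MM-dd" pattern and UTC zone are assumptions for this sketch.
    static String toPartitionValue(long epochMillis) {
        return DateTimeFormatter.ofPattern("yyyy-MM-dd")
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochMilli(epochMillis));
    }

    // Hypothetical stand-in for a file index that assumes the partition value
    // still has the source column's type (long) and parses it accordingly.
    static long parseWithSourceType(String partitionValue) {
        return Long.parseLong(partitionValue);
    }

    // Returns true when the partition value can be read back as a long.
    static boolean readableAsLong(String partitionValue) {
        try {
            parseWithSourceType(partitionValue);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        long ts = 1652227200000L; // 2022-05-11T00:00:00Z in epoch millis
        String partitionValue = toPartitionValue(ts);
        System.out.println("partition value: " + partitionValue);
        System.out.println("readable as long: " + readableAsLong(partitionValue));
    }
}
```

Under these assumptions, the partition value only round-trips if the partition schema declares the column as a string rather than reusing the source column's long type, which is the adjustment the description says is currently made only for `TimestampBasedKeyGenerator`.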

            People

              Assignee: Jonathan Vexler (jonvex)
              Reporter: Alexey Kudinkin (alexey.kudinkin)
              Votes: 0
              Watchers: 6

                Time Tracking

                  Original Estimate: 4h
                  Remaining Estimate: 4h
                  Time Spent: Not Specified