Apache Hudi / HUDI-3041

spark.read fails when drop partition columns is used w/o glob path



    Description

      With 0.9.0, we added support for dropping partition columns after generating the partition path for hoodie records, but it looks like there are some gaps in the end-to-end flow.

      The main issue is with reads that do not give a glob path; if a proper glob path is given, the read works.

      Hudi serializes the partition path fields to table properties and re-uses them when needed. So, at query time, the partition field may no longer be part of the table schema, since Hudi would have removed the field (trimming the schema) completely.
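      The failure mode described above can be sketched as follows. This is an illustrative Python sketch of the lookup logic, not Hudi's actual Scala implementation; the schema contents and the helper name resolve_partition_schema are hypothetical, but the mismatch it demonstrates (partition field names restored from table properties vs. a table schema that no longer contains them) is exactly what the stacktrace below reports.

      ```python
      # Hypothetical table schema after the partition column was dropped on write.
      schema = {
          "_hoodie_commit_time": "string",
          "rowId": "string",
          "preComb": "long",
      }

      # Partition fields as serialized to table properties at write time.
      partition_fields_from_properties = ["partitionId"]

      def resolve_partition_schema(schema, partition_fields):
          # Resolve each serialized partition field against the current schema.
          resolved = []
          for name in partition_fields:
              if name not in schema:
                  # Mirrors the getOrElse branch in HoodieFileIndex that raises
                  # IllegalArgumentException("Cannot find column: ...").
                  raise ValueError(
                      f"Cannot find column: '{name}' in the schema {list(schema)}")
              resolved.append((name, schema[name]))
          return resolved

      try:
          resolve_partition_schema(schema, partition_fields_from_properties)
      except ValueError as e:
          print(e)
      ```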

      Stacktrace for the issue:

      scala> val tripsSnapshotDF = spark.read.format("hudi").load(basePath)
      java.lang.IllegalArgumentException: Cannot find column: 'partitionId' in the schema[StructField(_hoodie_commit_time,StringType,true),StructField(_hoodie_commit_seqno,StringType,true),StructField(_hoodie_record_key,StringType,true),StructField(_hoodie_partition_path,StringType,true),StructField(_hoodie_file_name,StringType,true),StructField(rowId,StringType,true),StructField(preComb,LongType,true),StructField(name,StringType,true),StructField(versionId,StringType,true),StructField(toBeDeletedStr,StringType,true),StructField(intToLong,IntegerType,true),StructField(longToInt,LongType,true)]
        at org.apache.hudi.HoodieFileIndex$$anonfun$5$$anonfun$apply$1.apply(HoodieFileIndex.scala:106)
        at org.apache.hudi.HoodieFileIndex$$anonfun$5$$anonfun$apply$1.apply(HoodieFileIndex.scala:106)
        at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
        at scala.collection.AbstractMap.getOrElse(Map.scala:59)
        at org.apache.hudi.HoodieFileIndex$$anonfun$5.apply(HoodieFileIndex.scala:106)
        at org.apache.hudi.HoodieFileIndex$$anonfun$5.apply(HoodieFileIndex.scala:105)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
        at org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties$lzycompute(HoodieFileIndex.scala:105)
        at org.apache.hudi.HoodieFileIndex._partitionSchemaFromProperties(HoodieFileIndex.scala:99)
        at org.apache.hudi.HoodieFileIndex.getAllQueryPartitionPaths(HoodieFileIndex.scala:348)
        at org.apache.hudi.HoodieFileIndex.loadPartitionPathFiles(HoodieFileIndex.scala:420)
        at org.apache.hudi.HoodieFileIndex.refresh0(HoodieFileIndex.scala:214)
        at org.apache.hudi.HoodieFileIndex.<init>(HoodieFileIndex.scala:149)
        at org.apache.hudi.DefaultSource.getBaseFileOnlyView(DefaultSource.scala:199)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:116)
        at org.apache.hudi.DefaultSource.createRelation(DefaultSource.scala:67)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        ... 63 elided 

      Steps to reproduce: 

      https://gist.github.com/nsivabalan/570a96004e41f84565c99d8994b12d57

       


          People

            Assignee: codope (Sagar Sumit)
            Reporter: shivnarayan (sivabalan narayanan)
