Details
-
Bug
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.0.0
Description
I'm using Spark (3.1.1) to write a dataframe to a partitioned parquet dataset (using delta.io) which is partitioned by a timestamp field.
The relevant Spark code:
// code placeholder ( df.withColumn( "Date", sf.date_trunc( "DAY", sf.from_unixtime( (sf.col("MyEpochField")), ), ), ) .write.format("delta") .mode("append") .partitionBy("Date") .save("...")
This gives a structure like following:
// code placeholder
/tip
/tip/Date=2021-05-04 00%3A00%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00
/tip/Date=2021-05-04 00%3A00%3A00/Time=2021-05-04 07%3A27%3A00/part-00000-8846eb80-a369-43f6-a715-fec9cf1adf95.c000.snappy.parquet
Notice the : character is (url?) encoded because of fs protocol violation.
When i try to open this dataset using delta-rs (https://github.com/delta-io/delta-rs) which uses Arrow below the hood, then an error is raised trying to parse the Date (folder) value.
// code placeholder pyarrow.lib.ArrowInvalid: error parsing '2021-05-03 00%3A00%3A00' as scalar of type timestamp[ns]
It seems this error is raised in ScalarParseImpl => ParseValue => StringConverter<TimestampType>::Convert => ParseTimestampISO8601
The mentioned parse method does support for format:
// code placeholder static inline bool ParseTimestampISO8601(const char* s, size_t length, TimeUnit::type unit, TimestampType::c_type* out) { using seconds_type = std::chrono::duration<TimestampType::c_type>; // We allow the following formats for all units: // - "YYYY-MM-DD" // - "YYYY-MM-DD[ T]hhZ?" // - "YYYY-MM-DD[ T]hh:mmZ?" // - "YYYY-MM-DD[ T]hh:mm:ssZ?" <...>
But may not support (url?) decoding the value upfront?
Questions we have:
- Should Arrow support timestamp fields when used as partitioned field?
- Where to decode?
Some more information from the writing side.
The writing is initiated using FileFormatWriter.write that eventually uses a DynamicPartitionDataWriter (passing in the partitionColumns through the job description).
Here the actual "value" is rendered and concatennated.
// code placeholder /** Expression that given partition columns builds a path string like: col1=val/col2=val/... */ private lazy val partitionPathExpression: Expression = Concat( description.partitionColumns.zipWithIndex.flatMap { case (c, i) => val partitionName = ScalaUDF( ExternalCatalogUtils.getPartitionPathString _, StringType, Seq(Literal(c.name), Cast(c, StringType, Option(description.timeZoneId)))) if (i == 0) Seq(partitionName) else Seq(Literal(Path.SEPARATOR), partitionName) })
Where the encoding is done in:
If i understand correct, then Arrow should provide the equivalent of unescapePathName for fields used as partitioned columns.
Attachments
Issue Links
- is related to
-
ARROW-14737 [C++][Dataset] Support URI-decoding partition keys
- Resolved
- relates to
-
ARROW-11378 [C++][Dataset] Writing partitions with timestamp type give mis-formatted (integers) directory names
- Open
-
ARROW-13813 [C++][Dataset] Support URL encoding of partition field values for the file path
- Open
- links to