Description
Here's a Parquet file that Spark 3.2/master can't read properly.
The file was written by pandas and contains 3650 rows, but Spark 3.2/master returns an empty dataset.
>>> import pandas as pd
>>> len(pd.read_parquet('/path/to/example.parquet'))
3650
>>> spark.read.parquet('/path/to/example.parquet').count()
0
I suspect this is caused by the upgrade to Parquet 1.12.0.
When I reverted the two commits related to Parquet 1.12.0 from branch-3.2:
- https://github.com/apache/spark/commit/e40fce919ab77f5faeb0bbd34dc86c56c04adbaa
- https://github.com/apache/spark/commit/cbffc12f90e45d33e651e38cf886d7ab4bcf96da
it read the data successfully.
We need to add a workaround or revert these commits.
Attachments
Issue Links
- is related to
  - PARQUET-2078 Failed to read parquet file after writing with the same parquet version (Closed)
  - SPARK-34276 Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 and 1.12 (Resolved)
- relates to
  - SPARK-36726 Upgrade Parquet to 1.12.1 (Resolved)