Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version/s: 3.0.0
- Fix Version/s: None
- Component/s: None
Description
I think we ran into a bug in the Spark framework. When a DataFrame is written in Parquet format partitioned by a column and then read back, any partition value equal to "NOW" is interpreted as the SQL NOW() function rather than as a literal string, so the read-back column contains the current timestamp instead of the original value.
Steps to reproduce:
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([['NOW', 1], ['THEN', 2]], schema=['Col1', 'Col2'])
# Write partitioned by Col1, then read the dataset back;
# the 'NOW' partition value comes back as a timestamp.
df.write.parquet('/tmp/my_partitioned_data', mode='overwrite', partitionBy=['Col1'])
df_read_back = spark.read.parquet('/tmp/my_partitioned_data')
"""
In [1]: df.show()
------+
Col1 | Col2 |
------+
NOW | 1 |
THEN | 2 |
------+
In [2]: df_read_back.show()
----------------------+
Col2 | Col1 |
----------------------+
1 | 2021-01-22 10:46:... |
2 | THEN |
----------------------+
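For context, a plausible explanation is Spark's partition-directory type inference: when reading back, Spark tries to cast each partition value to narrower types (integer, then timestamp) before falling back to string, and Spark 3.x's timestamp cast accepts special strings such as "now" and "today". The sketch below is a loose, hypothetical plain-Python imitation of that inference order (the function name and the exact set of special values are assumptions for illustration, not Spark's actual code):

```python
from datetime import datetime

# Assumed set of special datetime strings accepted by Spark 3.x timestamp casts.
SPECIAL_TIMESTAMPS = {"epoch", "now", "today", "yesterday", "tomorrow"}

def infer_partition_value(raw: str):
    """Loosely mimics partition-value inference: int -> timestamp -> string."""
    try:
        return int(raw)
    except ValueError:
        pass
    if raw.strip().lower() in SPECIAL_TIMESTAMPS:
        # Spark would substitute the current timestamp here,
        # losing the literal partition value.
        return datetime.now()
    return raw

print(infer_partition_value("NOW"))   # a timestamp, not the string "NOW"
print(infer_partition_value("THEN"))  # the literal string "THEN"
```

Under this model, "THEN" survives the round trip because it matches neither an integer nor a special timestamp string, which is exactly the asymmetry shown in the `show()` output above.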
Issue Links
- duplicates SPARK-34259: "Reading a partitioned dataset with a partition value of NOW causes the value to be parsed as a timestamp." (In Progress)