Details

Type: Question
Status: Closed
Priority: Minor
Resolution: Incomplete
Affects Version: 2.2.1
Fix Version: None
Flags: Patch
Description
The issue here follows from the design of Spark. If a filter is applied on a column and the same column is then dropped, Spark filters only the first record and then drops the column, because both transformations (filter + drop) are applied to each record as it is read, both falling within the same narrow stage. This results in only a few records being filtered, with the rest neglected.
Here is the sample code:
from pyspark.sql.functions import col

# Keep only the insert records ('I' in column 'op'), then drop that column
inserts_filtered = inserts.toDF().filter(col("op") == 'I')
inserts_without_column_op = inserts_filtered.drop('op')
inserts_without_column_op.repartition("partition_kerys").write.partitionBy("partition_kerys").mode("append").parquet(Path)
The above lines of code write only one record where the column 'op' has the value 'I', neglecting the other records with 'I', because the column was dropped once the first record was filtered.
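To see how the filter and the drop are pipelined within a single narrow stage, the query plan can be inspected; a small diagnostic sketch, reusing the DataFrame from the code above:

# Print the parsed, analyzed, optimized, and physical plans for the
# filter + drop pipeline described above
inserts_without_column_op.explain(True)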
Below are the sample records in CSV that we are trying to convert to Parquet, written with partition keys:
Op,key1,key2,created_at,updated_at,name
I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1
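For completeness, here is a minimal, self-contained sketch that reproduces the steps above against this sample data. The file name sample.csv, the output path /tmp/repro_output, and the use of key1 as the partition column are assumptions for illustration (the partition_kerys column used earlier is not part of the sample schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-drop-repro").getOrCreate()

# Read the sample records shown above (all three rows carry Op = 'I')
records = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Filter on the marker column, then drop it, mirroring the reported steps
filtered = records.filter(col("Op") == 'I')
without_op = filtered.drop("Op")

# Count the rows that survive each step
print("after filter:", filtered.count())
print("after drop:", without_op.count())

# Write out partitioned by key1 (a stand-in for the original partition column)
without_op.repartition("key1").write.partitionBy("key1").mode("append").parquet("/tmp/repro_output")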