Details

Type: Question
Status: Closed
Priority: Minor
Resolution: Incomplete
Affects Version: 2.2.1
Fix Version: None
Flags: Patch
Description
The issue here follows from the design of Spark. If a filter is applied on a column and the same column is then dropped, Spark filters only the first record and then drops the column, because both transformations (filter + drop) are applied to each record as it is read, both falling within the same narrow stage. This results in only a few records being filtered, with the rest neglected.
Here is the sample code:
from pyspark.sql.functions import col

# Keep only the insert records ('I' in column 'op'), then drop that column
inserts_filtered = inserts.toDF().filter(col("op") == 'I')
inserts_without_column_op = inserts_filtered.drop('op')
inserts_without_column_op.repartition("partition_kerys").write.partitionBy("partition_kerys").mode("append").parquet(Path)
The above lines of code write only one record where the column 'op' has the value 'I', neglecting the other records with 'I', because the column was dropped once the first record was filtered.
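To see how the filter and the drop are pipelined within a single narrow stage, the query plan can be inspected; a small diagnostic sketch, reusing the DataFrame from the code above:

# Print the parsed, analyzed, optimized, and physical plans for the
# filter + drop pipeline described above
inserts_without_column_op.explain(True)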
Below are the sample records in CSV that we are trying to convert to Parquet, written with partition keys:
Op,key1,key2,created_at,updated_at,name
I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1
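For completeness, here is a minimal, self-contained sketch that reproduces the steps above against this sample data. The file name sample.csv, the output path /tmp/repro_output, and the use of key1 as the partition column are assumptions for illustration (the partition_kerys column used earlier is not part of the sample schema):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("filter-drop-repro").getOrCreate()

# Read the sample records shown above (all three rows carry Op = 'I')
records = spark.read.csv("sample.csv", header=True, inferSchema=True)

# Filter on the marker column, then drop it, mirroring the reported steps
filtered = records.filter(col("Op") == 'I')
without_op = filtered.drop("Op")

# Count the rows that survive each step
print("after filter:", filtered.count())
print("after drop:", without_op.count())

# Write out partitioned by key1 (a stand-in for the original partition column)
without_op.repartition("key1").write.partitionBy("key1").mode("append").parquet("/tmp/repro_output")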