Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27982

In spark 2.2.1 filter on a particular column follwed by the drop of the same column fail to filter the all the records

    XMLWordPrintableJSON

Details

    • Question
    • Status: Closed
    • Minor
    • Resolution: Incomplete
    • 2.2.1
    • None
    • Spark Core
    • Patch

    Description

      The issue here follows the design of the spark, If the filer is applied on a column followed by the drop of the same of the column,  Then spark filters only the first record then drops the column as all the transformation  filter + drop is applied to a record as it reads because both the transformation falls in Narrow stage.

       

      There by resulting in filtering of only few records neglecting the rest

       

      Here is sample code

      inserts_filtered = inserts.toDF().filter(col("op")=='I')
      inserts_without_column_op = inserts_filtered.drop('op')

      inserts_without_column_op.repartition("partition_kerys").write.partitionBy("partition_kerys").mode("append").parquet(Path)

       

      The above lines of code will only write one record with 'I' (value of the column 'op') filtered neglecting the order records with 'I' (value of the column 'op') as the column was dropped when first record was filtered.

       

       

      Below is the sample record in csv trying to convert to parquet writing with partition keys

       
      Op,key1,key2,created_at,updated_at,name
      I,1,11,2017-02-04 12:34:14.000,2019-02-04 12:34:14.000,xyz3
      I,1,11,2017-02-04 12:34:14.000,2019-01-04 12:34:14.000,xyz2
      I,4,41,2018-02-04 12:01:14.000,2018-02-05 12:01:14.000,xyz1
       
       

      Attachments

        Activity

          People

            Unassigned Unassigned
            karan970 Karan Hebbar K S
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated:
                Original Estimate - 2m
                2m
                Remaining:
                Remaining Estimate - 2m
                2m
                Logged:
                Time Spent - Not Specified
                Not Specified