Spark / SPARK-29621

Querying internal corrupt record column should not be allowed in filter operation


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:

      Description

      As per https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala#L119-L126:
      "Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column"

      However, a query that references only the internal corrupt record column is still allowed when the reference occurs in a filter operation.

      from pyspark.sql.types import StructType, StructField, StringType, IntegerType

      # Schema explicitly includes the internal corrupt record column.
      schema = StructType([
          StructField("_corrupt_record", StringType(), False),
          StructField("Name", StringType(), False),
          StructField("Colour", StringType(), True),
          StructField("Price", IntegerType(), True),
          StructField("Quantity", IntegerType(), True)])
      df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")
      df.filter(df._corrupt_record.isNotNull()).show()  # Allowed, even though only _corrupt_record is referenced
      
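      For contrast, a minimal end-to-end sketch of the behaviour the quoted check does enforce. It assumes a local SparkSession and writes a hypothetical malformed fruit.csv (both are illustrative, not from the report): projecting only `_corrupt_record` raises an AnalysisException, while caching the parsed DataFrame first is the workaround the Spark documentation suggests.

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.types import StructType, StructField, StringType, IntegerType
      from pyspark.sql.utils import AnalysisException

      spark = SparkSession.builder.master("local[1]").appName("corrupt-record-demo").getOrCreate()

      # Hypothetical malformed input: the second row has a non-integer Price.
      with open("fruit.csv", "w") as f:
          f.write("Apple,Red,3,10\nPear,Green,oops,5\n")

      schema = StructType([
          StructField("_corrupt_record", StringType(), True),
          StructField("Name", StringType(), True),
          StructField("Colour", StringType(), True),
          StructField("Price", IntegerType(), True),
          StructField("Quantity", IntegerType(), True)])
      df = spark.read.csv("fruit.csv", schema=schema, mode="PERMISSIVE")

      # Projecting only the corrupt record column is disallowed on the raw file...
      try:
          df.select("_corrupt_record").show()
      except AnalysisException:
          print("select on _corrupt_record alone was blocked")

      # ...but caching the parsed result first makes the same query legal,
      # because the cached plan materializes all columns.
      df.cache()
      df.select("_corrupt_record").show()
      ```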


            People

            • Assignee: Unassigned
            • Reporter: Suchintak Patnaik
            • Votes: 0
            • Watchers: 2
