Spark / SPARK-38523

Failure on referring to the corrupt record from CSV


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The CSV file below has an invalid value in one of its fields:

      0,2013-111_11 12:13:14
      1,1983-08-04 

      where the timestamp 2013-111_11 12:13:14 is incorrect.
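
      The report does not show the schema used; a minimal sketch that matches the
      two-column file could be the following (the column names id and ts are
      assumptions; _corrupt_record is the default name of the corrupt record column):

      import org.apache.spark.sql.types._

      // Assumed schema for the repro: an integer id, a timestamp column, and the
      // corrupt record column declared as a string.
      val schema = new StructType()
        .add("id", IntegerType)
        .add("ts", TimestampType)
        .add("_corrupt_record", StringType)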

      The query fails when it refers to the corrupt record column:

      spark.read.format("csv")
       .option("header", "true")
       .schema(schema)
       .load("csv_corrupt_record.csv")
       .filter($"_corrupt_record".isNotNull) 

      with the exception:

      org.apache.spark.sql.AnalysisException: 
      Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
      referenced columns only include the internal corrupt record column
      (named _corrupt_record by default). For example:
      spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
      and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
      Instead, you can cache or save the parsed results and then send the same query.
      For example, val df = spark.read.schema(schema).csv(file).cache() and then
      df.filter($"_corrupt_record".isNotNull).count().
            
          at org.apache.spark.sql.errors.QueryCompilationErrors$.queryFromRawFilesIncludeCorruptRecordColumnError(QueryCompilationErrors.scala:2047)
          at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:116) 
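
      As the error message suggests, caching the parsed result before referring to the
      corrupt record column works around the check; a sketch using the same schema and
      file as above:

      // Materialize the parsed rows first, then query the corrupt record column.
      val df = spark.read.format("csv")
        .option("header", "true")
        .schema(schema)
        .load("csv_corrupt_record.csv")
        .cache()

      df.filter($"_corrupt_record".isNotNull).count()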


          People

            Assignee: Apache Spark (apachespark)
            Reporter: Max Gekk (maxgekk)
            Votes: 0
            Watchers: 3
