Spark / SPARK-38523

Failure on referring to the corrupt record from CSV


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: SQL
    • Labels: None

    Description

      The CSV file below has an invalid value in one of its fields:

      0,2013-111_11 12:13:14
      1,1983-08-04 

      where the timestamp 2013-111_11 12:13:14 is incorrect.
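
      The report does not show the schema used; a minimal sketch that matches the
      two-column file could be the following (the column names id and ts are
      assumptions; _corrupt_record is the default name of the corrupt record column):

      import org.apache.spark.sql.types._

      // Assumed schema for the repro: an integer id, a timestamp column, and the
      // corrupt record column declared as a string.
      val schema = new StructType()
        .add("id", IntegerType)
        .add("ts", TimestampType)
        .add("_corrupt_record", StringType)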

      The query fails when it refers to the corrupt record column:

      spark.read.format("csv")
       .option("header", "true")
       .schema(schema)
       .load("csv_corrupt_record.csv")
       .filter($"_corrupt_record".isNotNull) 

      with the exception:

      org.apache.spark.sql.AnalysisException: 
      Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the
      referenced columns only include the internal corrupt record column
      (named _corrupt_record by default). For example:
      spark.read.schema(schema).csv(file).filter($"_corrupt_record".isNotNull).count()
      and spark.read.schema(schema).csv(file).select("_corrupt_record").show().
      Instead, you can cache or save the parsed results and then send the same query.
      For example, val df = spark.read.schema(schema).csv(file).cache() and then
      df.filter($"_corrupt_record".isNotNull).count().
            
          at org.apache.spark.sql.errors.QueryCompilationErrors$.queryFromRawFilesIncludeCorruptRecordColumnError(QueryCompilationErrors.scala:2047)
          at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.buildReader(CSVFileFormat.scala:116) 
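
      As the error message suggests, caching the parsed result before referring to the
      corrupt record column works around the check; a sketch using the same schema and
      file as above:

      // Materialize the parsed rows first, then query the corrupt record column.
      val df = spark.read.format("csv")
        .option("header", "true")
        .schema(schema)
        .load("csv_corrupt_record.csv")
        .cache()

      df.filter($"_corrupt_record".isNotNull).count()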


          People

            Assignee: Apache Spark (apachespark)
            Reporter: Max Gekk (maxgekk)
            Votes: 0
            Watchers: 3
