[SPARK-45035] Support ignoreCorruptFiles for multiline CSV - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.5.0
Fix Version/s: 4.0.0
Component/s: SQL
Labels:
- pull-request-available

Description

Today, `ignoreCorruptFiles` is not honored when reading CSV in multiline mode.

spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true").option("multiline", "true").csv("/tmp/sourcepath/").show()

The read throws an exception instead of silently skipping the corrupt file:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
Parser Configuration: CsvParserSettings:
	Auto configuration enabled=true
	Auto-closing enabled=true
	Autodetect column delimiter=false
	Autodetect quotes=false
	Column reordering enabled=true
	Delimiters for detection=null
	Empty value=
	Escape unquoted values=false
	Header extraction enabled=null
	Headers=null
	Ignore leading whitespaces=false
	Ignore leading whitespaces in quotes=false
	Ignore trailing whitespaces=false
	Ignore trailing whitespaces in quotes=false
	Input buffer size=1048576
	Input reading on separate thread=false
	Keep escape sequences=false
	Keep quotes=false
	Length of content displayed on error=1000
	Line separator detection enabled=true
	Maximum number of characters per column=-1
	Maximum number of columns=20480
	Normalize escaped line separators=true
	Null value=
	Number of records to read=all
	Processor=none
	Restricting data in exceptions=false
	RowProcessor error handler=null
	Selected fields=none
	Skip bits as whitespace=true
	Skip empty lines=true
	Unescaped quote handling=STOP_AT_DELIMITER
Format configuration:
	CsvFormat:
		Comment character=#
		Field delimiter=,
		Line separator (normalized)=\n
		Line separator sequence=\n
		Quote character="
		Quote escape character=\
		Quote escape escape character=null
Internal state when error was thrown: line=0, column=0, record=0
	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46...

This happens because multiline parsing uses a different RDD (`BinaryFileRDD`), which does not go through `FileScanRDD`. We could add this support to `BinaryFileRDD`, or reuse `FileScanRDD` for multiline parsing mode.
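
To make the intended behavior concrete, here is a minimal, Spark-free sketch of the "skip the file on a read error" pattern. `IgnoreCorruptSketch`, `readAll`, and `parseFile` are hypothetical names used only for illustration; this is not the code in `FileScanRDD`, `BinaryFileRDD`, or the linked pull request. It assumes that a corrupt file surfaces as a `RuntimeException` or `IOException`, which matches the `TextParsingException` in the trace above.

```scala
import java.io.IOException

// Hypothetical helper, not a Spark API: parse each file independently and,
// when ignoreCorruptFiles is enabled, drop files whose parser throws.
object IgnoreCorruptSketch {
  def readAll[A](
      paths: Seq[String],
      ignoreCorruptFiles: Boolean)(parseFile: String => Seq[A]): Seq[A] = {
    paths.flatMap { path =>
      try {
        // parseFile is eager here, so any read error is raised inside the try.
        parseFile(path)
      } catch {
        // Assumption: corruption shows up as one of these two exception types.
        case e @ (_: RuntimeException | _: IOException) if ignoreCorruptFiles =>
          Console.err.println(s"Skipped corrupt file: $path (${e.getMessage})")
          Seq.empty[A]
      }
    }
  }
}
```

With `ignoreCorruptFiles = false` the guard does not match and the exception propagates, which is the behavior the multiline path currently shows regardless of the setting. A real implementation would also need to handle parsers that return lazy iterators, where the failure can occur after the `try` block has exited.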

Issue Links

links to

GitHub Pull Request #42979

People

Assignee: Jia Fan

Reporter: Yaohua Zhao

Votes: 0

Watchers: 3

Dates

Created: 31/Aug/23 16:42

Updated: 18/Oct/23 06:08

Resolved: 18/Oct/23 06:08