Spark / SPARK-45035

Support ignoreCorruptFiles for multiline CSV


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.5.0
    • Fix Version/s: 4.0.0
    • Component/s: SQL

    Description

      Today, `ignoreCorruptFiles` does not work for CSV in multiline mode.

      spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")
      val testCorruptDF0 = spark.read.option("ignoreCorruptFiles", "true").option("multiline", "true").csv("/tmp/sourcepath/")
      testCorruptDF0.show()

      It throws an exception instead of silently ignoring the corrupt files:

      org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4940.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4940.0 (TID 4031) (10.68.177.106 executor 0): com.univocity.parsers.common.TextParsingException: java.lang.IllegalStateException - Error reading from input
      Parser Configuration: CsvParserSettings:
      	Auto configuration enabled=true
      	Auto-closing enabled=true
      	Autodetect column delimiter=false
      	Autodetect quotes=false
      	Column reordering enabled=true
      	Delimiters for detection=null
      	Empty value=
      	Escape unquoted values=false
      	Header extraction enabled=null
      	Headers=null
      	Ignore leading whitespaces=false
      	Ignore leading whitespaces in quotes=false
      	Ignore trailing whitespaces=false
      	Ignore trailing whitespaces in quotes=false
      	Input buffer size=1048576
      	Input reading on separate thread=false
      	Keep escape sequences=false
      	Keep quotes=false
      	Length of content displayed on error=1000
      	Line separator detection enabled=true
      	Maximum number of characters per column=-1
      	Maximum number of columns=20480
      	Normalize escaped line separators=true
      	Null value=
      	Number of records to read=all
      	Processor=none
      	Restricting data in exceptions=false
      	RowProcessor error handler=null
      	Selected fields=none
      	Skip bits as whitespace=true
      	Skip empty lines=true
      	Unescaped quote handling=STOP_AT_DELIMITERFormat configuration:
      	CsvFormat:
      		Comment character=#
      		Field delimiter=,
      		Line separator (normalized)=\n
      		Line separator sequence=\n
      		Quote character="
      		Quote escape character=\
      		Quote escape escape character=null
      Internal state when error was thrown: line=0, column=0, record=0
      	at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:402)
      	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:277)
      	at com.univocity.parsers.common.AbstractParser.beginParsing(AbstractParser.java:843)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser$$anon$1.<init>(UnivocityParser.scala:463)
      	at org.apache.spark.sql.catalyst.csv.UnivocityParser$.convertStream(UnivocityParser.scala:46... 

      This is because multiline parsing uses a different RDD (`BinaryFileRDD`) that does not go through `FileScanRDD`. We could potentially add this support to `BinaryFileRDD`, or even reuse `FileScanRDD` for multiline parsing mode.
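      To illustrate the behavior being requested: conceptually, `FileScanRDD` wraps each file's read in a try/catch and, when `ignoreCorruptFiles` is set, logs a warning and skips the file rather than failing the task. The sketch below is not Spark's actual code; `parseFile`, `readAll`, and the corrupt-content marker are hypothetical stand-ins used only to demonstrate the skip-on-failure pattern.

      ```scala
      import scala.util.{Try, Success, Failure}

      object IgnoreCorruptFilesSketch {
        // Hypothetical parser: throws on "corrupt" content (a NUL byte here),
        // otherwise returns one row per line.
        def parseFile(content: String): Seq[String] =
          if (content.contains("\u0000")) throw new RuntimeException("Error reading from input")
          else content.split("\n").toSeq

        // Read every file; when ignoreCorruptFiles is true, a file whose
        // parse fails contributes no rows instead of aborting the whole read.
        def readAll(files: Map[String, String], ignoreCorruptFiles: Boolean): Seq[String] =
          files.toSeq.sortBy(_._1).flatMap { case (path, content) =>
            Try(parseFile(content)) match {
              case Success(rows)                      => rows
              case Failure(_) if ignoreCorruptFiles   => Seq.empty // log-and-skip
              case Failure(e)                         => throw e   // current multiline behavior
            }
          }

        def main(args: Array[String]): Unit = {
          val files = Map(
            "a.csv" -> "1,foo\n2,bar",
            "b.csv" -> "corrupt\u0000bytes"
          )
          // With the flag on, only a.csv's rows survive.
          println(readAll(files, ignoreCorruptFiles = true))
        }
      }
      ```

      Under this model, the fix amounts to routing multiline CSV reads through the same per-file guard, whether by teaching `BinaryFileRDD` the same catch-and-skip logic or by switching to `FileScanRDD`.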

      People

        Assignee: Jia Fan (fanjia)
        Reporter: Yaohua Zhao (yaohua)
        Votes: 0
        Watchers: 3

      Dates

        Created:
        Updated:
        Resolved: