
Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      It looks like a new release of the Univocity CSV library has been published: https://github.com/uniVocity/univocity-parsers/releases

      The release contains the following improvements:

      1. Performance improvements for parsing/writing CSV and TSV. CSV writing and parsing got 30-40% faster.

      2. Deprecated the methods setParseUnescapedQuotes and setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings in favor of the new setUnescapedQuoteHandling method, which takes values from the UnescapedQuoteHandling enumeration.

      3. The default behavior of the CSV parser when unescaped quotes are found in the input changed to parsing until a delimiter character is found, i.e. UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be problematic when no closing quote is found, because the parser then accumulates all characters into the same value until the end of the input.
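
      To make the difference in item 3 concrete, here is a minimal, standard-library-only sketch of the two strategies. It is a simplified model written for this issue, not univocity's actual implementation: the class UnescapedQuoteSketch, the nested Mode enum, and parseLine are hypothetical names, the delimiter is fixed to a comma, and details such as dropping the opening quote from the value are choices of the sketch that have not been checked against the library.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: NOT univocity's parser. All names here are hypothetical.
class UnescapedQuoteSketch {

    // Mirrors the two enum constant names mentioned in the release notes.
    enum Mode { STOP_AT_DELIMITER, STOP_AT_CLOSING_QUOTE }

    // Splits one line on ',' and handles '"'-quoted fields.
    static List<String> parseLine(String line, Mode mode) {
        List<String> fields = new ArrayList<>();
        int n = line.length();
        int i = 0;
        while (i <= n) {
            StringBuilder value = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                boolean unescapedSeen = false;
                while (i < n) {
                    char c = line.charAt(i);
                    boolean nextIsQuote = i + 1 < n && line.charAt(i + 1) == '"';
                    boolean atFieldEnd = i + 1 == n || line.charAt(i + 1) == ',';
                    if (c == '"' && nextIsQuote) {
                        value.append('"'); // "" is an escaped quote
                        i += 2;
                    } else if (c == '"' && atFieldEnd) {
                        i++; // proper closing quote
                        break;
                    } else if (c == ',' && unescapedSeen && mode == Mode.STOP_AT_DELIMITER) {
                        break; // give up on the quotes, end the value at the delimiter
                    } else {
                        if (c == '"') {
                            unescapedSeen = true; // quote in the middle of a value
                        }
                        value.append(c); // STOP_AT_CLOSING_QUOTE also swallows ','
                        i++;
                    }
                }
            } else {
                while (i < n && line.charAt(i) != ',') {
                    value.append(line.charAt(i));
                    i++;
                }
            }
            fields.add(value.toString());
            i++; // skip the delimiter, or step past the end of the input
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "\"a\"b,c,d"; // unescaped quote, no closing quote
        System.out.println(parseLine(line, Mode.STOP_AT_DELIMITER));     // [a"b, c, d]
        System.out.println(parseLine(line, Mode.STOP_AT_CLOSING_QUOTE)); // [a"b,c,d]
    }
}
```

      With the input "a"b,c,d the delimiter-based strategy still yields three values, while the closing-quote strategy never finds a closing quote and collapses the rest of the input into a single value, which is the problem described above. In univocity itself the strategy is selected through setUnescapedQuoteHandling on CsvParserSettings, as noted in item 2.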

      For Spark:
      First, Spark uses this library for its CSV data source, so the upgrade will affect its performance.

      Second, Spark uses setParseUnescapedQuotesUntilDelimiter, which is deprecated in this version, apparently because the library now offers more options for handling unescaped quotes. This does not seem directly related to Spark, but we might have to consider using the new setUnescapedQuoteHandling method in the future.

          People

            Assignee: Hyukjin Kwon
            Reporter: Hyukjin Kwon