[SPARK-15148] Upgrade Univocity library from 2.0.2 to 2.1.0 - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: SQL
Labels: None

Description

It looks like a new release of the Univocity CSV library has been published: https://github.com/uniVocity/univocity-parsers/releases.

It contains the following improvements:

1. Performance improvements for parsing/writing CSV and TSV. CSV writing and parsing got 30-40% faster.

2. The methods setParseUnescapedQuotes and setParseUnescapedQuotesUntilDelimiter of class CsvParserSettings are deprecated in favor of the new setUnescapedQuoteHandling method, which takes values from the UnescapedQuoteHandling enumeration.

3. The default behavior of the CSV parser when unescaped quotes are found in the input changed to parsing until a delimiter character is found, i.e. UnescapedQuoteHandling.STOP_AT_DELIMITER. The old default of trying to find a closing quote (i.e. UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE) can be problematic when no closing quote is found, because the parser then accumulates all characters into the same value until the end of the input.
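
The difference between the two strategies in item 3 can be illustrated with a toy reader. This is a simplified sketch written for this ticket, not Univocity's implementation: the class UnescapedQuoteDemo, its Handling enum and the firstValue method are hypothetical names, the delimiter is assumed to be a comma, and escaped quotes are not handled. The exact values Univocity returns for the same input may differ.

```java
class UnescapedQuoteDemo {
    // Hypothetical stand-in for the two strategies named in the release notes.
    enum Handling { STOP_AT_CLOSING_QUOTE, STOP_AT_DELIMITER }

    // A quote closes the value when a delimiter, a newline, or the end of input follows it.
    static boolean isClosingQuote(String in, int i) {
        if (i + 1 >= in.length()) {
            return true;
        }
        char next = in.charAt(i + 1);
        return next == ',' || next == '\n';
    }

    // Reads the first value of `in`, which is assumed to start with an opening quote.
    static String firstValue(String in, Handling handling) {
        StringBuilder value = new StringBuilder();
        boolean sawUnescapedQuote = false;
        for (int i = 1; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c == '"' && isClosingQuote(in, i)) {
                return value.toString();      // proper closing quote: value is complete
            }
            if (c == '"') {
                sawUnescapedQuote = true;     // quote in the middle of the value
            }
            boolean atDelimiter = c == ',' || c == '\n';
            if (sawUnescapedQuote && handling == Handling.STOP_AT_DELIMITER && atDelimiter) {
                return value.toString();      // give up on the quote, stop at the delimiter
            }
            value.append(c);
        }
        return value.toString();              // no closing quote: everything was accumulated
    }

    public static void main(String[] args) {
        String input = "\"a\"b,c\nd,e";       // the quote after `a` is unescaped and never closed
        System.out.println(firstValue(input, Handling.STOP_AT_DELIMITER));
        System.out.println(firstValue(input, Handling.STOP_AT_CLOSING_QUOTE));
    }
}
```

With this input, the STOP_AT_DELIMITER call returns only the text up to the first comma, while the STOP_AT_CLOSING_QUOTE call never finds a closing quote and returns everything after the opening quote, including the second line.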

For Spark, this matters in two ways.
First, Spark uses this library for the CSV data source, so the upgrade should affect its performance.

Second, Spark uses setParseUnescapedQuotesUntilDelimiter, which is deprecated in this version, apparently because the new API offers more options for handling unescaped quotes. This is not directly related to Spark, but we might have to consider using the new method in the future.
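
For reference, a possible migration away from the deprecated setter might look like the fragment below. It is a sketch only and not the change made in the linked pull request: the class QuoteHandlingMigration and the method newParser are hypothetical names, and it assumes univocity-parsers 2.1.0 or later on the classpath. It also assumes that passing true to the deprecated setter corresponds to STOP_AT_DELIMITER.

```java
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;

class QuoteHandlingMigration {
    // Hypothetical helper: builds a parser that stops at the next delimiter
    // when it meets an unescaped quote.
    static CsvParser newParser() {
        CsvParserSettings settings = new CsvParserSettings();
        // Deprecated as of 2.1.0:
        // settings.setParseUnescapedQuotesUntilDelimiter(true);
        // Assumed equivalent using the new enumeration-based setter:
        settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_DELIMITER);
        return new CsvParser(settings);
    }
}
```

Setting the value explicitly, rather than relying on the library default, would keep the behavior stable if the default changes again in a later release.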

Issue Links

links to

[Github] Pull Request #12923 (HyukjinKwon)

People

Assignee: Hyukjin Kwon

Reporter: Hyukjin Kwon

Dates

Created: 05/May/16 06:25

Updated: 12/Dec/22 17:51

Resolved: 05/May/16 18:26