[ARROW-25] [C++] Implement delimited file scanner / CSV reader - ASF JIRA

Details

Type: New Feature
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.11.0
Component/s: C++
Labels:
- csv
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/15374

Description

Like Parquet and binary file formats, text files will be an important data medium for converting to and from in-memory Arrow data.

pandas, one of the gold-standard CSV readers in production use, has some (Apache-compatible) business logic we can learn from here:

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
https://github.com/pydata/pandas/blob/master/pandas/parser.pyx

While the pandas parser is very fast, the Arrow reader should be largely written from scratch to target the Arrow memory layout. We can still reuse certain aspects, such as the tokenizer DFA (which originally came from the Python interpreter's csv module implementation):

https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
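
To make the tokenizer DFA idea concrete, here is a minimal sketch of a character-driven state machine for delimited text. It is not the pandas tokenizer and not the reader that was eventually merged into Arrow. The state names, the `TokenizeCsv` function, and the `Row` alias are hypothetical and chosen for illustration. As a simplification, it collects fields into per-row string vectors rather than Arrow columnar buffers, drops carriage returns outside quoted fields, and turns a blank line into a row with one empty field.

```cpp
#include <string>
#include <vector>

// Hypothetical states for a small CSV tokenizer DFA (illustrative names).
enum class State {
  StartField,     // at the beginning of a field
  InField,        // inside an unquoted field
  InQuotedField,  // inside a quoted field
  QuoteInQuoted   // just saw a quote character inside a quoted field
};

using Row = std::vector<std::string>;

// Split `input` into rows of fields. A doubled quote inside a quoted
// field is unescaped to a single quote character.
std::vector<Row> TokenizeCsv(const std::string& input, char delim = ',',
                             char quote = '"') {
  std::vector<Row> rows;
  Row row;
  std::string field;
  State state = State::StartField;

  auto end_field = [&]() {
    row.push_back(field);
    field.clear();
    state = State::StartField;
  };
  auto end_row = [&]() {
    end_field();
    rows.push_back(row);
    row.clear();
  };

  for (char c : input) {
    switch (state) {
      case State::StartField:
        if (c == quote) state = State::InQuotedField;
        else if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') { field.push_back(c); state = State::InField; }
        break;
      case State::InField:
        if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') field.push_back(c);
        break;
      case State::InQuotedField:
        // Delimiters and newlines are literal data inside quotes.
        if (c == quote) state = State::QuoteInQuoted;
        else field.push_back(c);
        break;
      case State::QuoteInQuoted:
        if (c == quote) { field.push_back(quote); state = State::InQuotedField; }
        else if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') { field.push_back(c); state = State::InField; }
        break;
    }
  }
  // Flush the last row when the input does not end with a newline.
  if (state != State::StartField || !field.empty() || !row.empty()) end_row();
  return rows;
}
```

A reader targeting the Arrow memory layout would presumably have the same transitions append to per-column value and offset buffers instead of building a `std::string` for every field, which is the main reason the description suggests writing the surrounding code from scratch.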

Issue Links

links to

GitHub Pull Request #2576

People

Assignee: Antoine Pitrou

Reporter: Wes McKinney

Votes: 2

Watchers: 8

Dates

Created: 02/Mar/16 19:40

Updated: 11/Jan/23 07:06

Resolved: 01/Oct/18 10:30

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 9h 40m