Apache Arrow / ARROW-25

[C++] Implement delimited file scanner / CSV reader


Details

    Description

      Like Parquet and binary file formats, text files will be an important data medium for converting to and from in-memory Arrow data.

      pandas, one of the gold-standard CSV readers in production use, has some (Apache-compatible) business logic we can learn from here:

      https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.h
      https://github.com/pydata/pandas/blob/master/pandas/parser.pyx

      While the pandas parser is very fast, this reader should be largely written from scratch to target the Arrow memory layout. We can still reuse certain aspects, such as the tokenizer DFA (which originally came from the Python interpreter's csv module implementation):

      https://github.com/pydata/pandas/blob/master/pandas/src/parser/tokenizer.c#L713
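
A rough sketch of what a tokenizer DFA of this kind could look like, to make the idea concrete. This is not the pandas or CPython code: the state names, the `TokenizeCsv` function and the `Row` alias are hypothetical, it assumes C++17, and it collects fields into strings rather than writing into Arrow buffers. Each input character drives exactly one state transition.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// States of a small CSV tokenizer DFA (illustrative names only).
enum class State {
  kStartField,     // at the beginning of a field
  kInField,        // inside an unquoted field
  kInQuotedField,  // inside a quoted field
  kQuoteInQuoted,  // just saw a quote character inside a quoted field
};

using Row = std::vector<std::string>;

// Split `input` into rows of fields, one DFA transition per character.
// Simplifications: '\r' is dropped outside quoted fields, a blank line
// yields a row with one empty field, and an unterminated quoted field is
// flushed as-is at end of input.
std::vector<Row> TokenizeCsv(std::string_view input, char delim = ',',
                             char quote = '"') {
  assert(delim != quote);  // precondition: the two must be distinguishable
  std::vector<Row> rows;
  Row row;
  std::string field;
  State state = State::kStartField;

  auto end_field = [&] {
    row.push_back(std::move(field));
    field.clear();
    state = State::kStartField;
  };
  auto end_row = [&] {
    end_field();
    rows.push_back(std::move(row));
    row.clear();
  };

  for (char c : input) {
    switch (state) {
      case State::kStartField:
        if (c == quote) state = State::kInQuotedField;
        else if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') { field.push_back(c); state = State::kInField; }
        break;
      case State::kInField:
        if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') field.push_back(c);
        break;
      case State::kInQuotedField:
        if (c == quote) state = State::kQuoteInQuoted;
        else field.push_back(c);  // delimiters and newlines are literal here
        break;
      case State::kQuoteInQuoted:
        if (c == quote) {  // doubled quote: an escaped quote character
          field.push_back(quote);
          state = State::kInQuotedField;
        } else if (c == delim) end_field();
        else if (c == '\n') end_row();
        else if (c != '\r') { field.push_back(c); state = State::kInField; }
        break;
    }
  }
  // Flush a final row that is not terminated by a newline.
  if (state != State::kStartField || !field.empty() || !row.empty()) end_row();
  return rows;
}
```

Calling `TokenizeCsv("x,\"y,z\"\n")` would give a single row holding the fields `x` and `y,z`. A reader targeting the Arrow memory layout would presumably keep the same transitions but append parsed values to per-column buffers instead of building a `std::string` per field.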


            People

              apitrou Antoine Pitrou
              wesm Wes McKinney
              Votes: 2
              Watchers: 8

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 9h 40m