Apache Arrow / ARROW-8421

[Rust] [Parquet] Implement parquet writer

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: None
    • Fix Version/s: 5.0.0
    • Component/s: Rust

    Description

      This is the parent story. See subtasks for more information.

      Notes from wesm:

      A couple of initial things to keep in mind:

      • Writes of both nullable (OPTIONAL) and non-nullable (REQUIRED) fields
      • You can optimize the special case where a nullable field's data has no nulls
      • A good amount of code is required to convert the Arrow physical form of various logical types to the equivalent Parquet form; see https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc for details
      • It is worth thinking up front about how dictionary-encoded data is handled on both the Arrow write and Arrow read paths. In parquet-cpp we initially discarded Arrow DictionaryArrays on write (casting, e.g., Dictionary to dense String). Real-world need later forced me to revisit this (quite painfully) so that Arrow dictionaries survive roundtrips to the Parquet format, with better performance and memory use in both reads and writes. You can certainly do a dictionary-to-dense conversion like we did, but you may someday find yourselves doing the same painful refactor that I did to make dictionary writes and reads not only more efficient but also dictionary-order preserving.
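
The first two points can be illustrated with a minimal sketch for a flat (non-nested) column. It assumes the usual Parquet convention that a REQUIRED top-level field has a maximum definition level of 0 (so no levels are stored) and an OPTIONAL one uses 0 for null and 1 for present. The names `DefLevels` and `def_levels` are hypothetical and are not part of the arrow or parquet crates; validity is modeled as a plain `bool` slice rather than an Arrow bitmap.

```rust
/// Hypothetical result type: how definition levels could be produced
/// for a flat column. Not an API of the arrow or parquet crates.
#[derive(Debug, PartialEq)]
enum DefLevels {
    /// REQUIRED field: max definition level is 0, so nothing is stored.
    NotStored,
    /// OPTIONAL field whose data has no nulls: all `len` levels are 1,
    /// so the writer can skip building a per-value level buffer.
    AllDefined(usize),
    /// OPTIONAL field with nulls: one level per slot (0 = null, 1 = present).
    PerValue(Vec<i16>),
}

/// Sketch of the decision a writer might make before encoding a column.
/// `validity` is `None` when the array carries no null information.
fn def_levels(nullable: bool, validity: Option<&[bool]>, len: usize) -> DefLevels {
    if !nullable {
        return DefLevels::NotStored;
    }
    match validity {
        // No validity buffer at all: trivially no nulls.
        None => DefLevels::AllDefined(len),
        // Validity buffer present but every slot is valid: same fast path.
        Some(v) if v.iter().all(|&is_valid| is_valid) => DefLevels::AllDefined(len),
        // At least one null: materialize a level per slot.
        Some(v) => DefLevels::PerValue(v.iter().map(|&is_valid| is_valid as i16).collect()),
    }
}
```

A real implementation would check the array's null count instead of scanning the validity buffer, and nested types would need repetition levels as well; both are left out here.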

      Notes from sunchao:

      I roughly skimmed through the C++ implementation and think that, at a high level, we need to do the following:

      1. Implement a method similar to WriteArrow in column_writer.cc. This can be broken up into smaller pieces: dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and so on.
      2. Implement an Arrow writer in the parquet crate. It needs to offer APIs similar to those in writer.h.
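
One way to picture the per-type breakdown in step 1 is a dispatch from Arrow-style columns to Parquet physical value buffers. The sketch below uses only the standard library; `ArrowColumn`, `ParquetValues`, and `to_parquet_values` are hypothetical stand-ins, not types from the arrow or parquet crates, and this is not how the C++ WriteArrow is structured internally. It relies on the Parquet convention that dates are stored as INT32, millisecond timestamps as INT64, and strings as BYTE_ARRAY.

```rust
/// Hypothetical stand-in for a few Arrow array kinds (non-null values only).
enum ArrowColumn {
    Boolean(Vec<bool>),
    Int32(Vec<i32>),
    /// Days since the UNIX epoch.
    Date32(Vec<i32>),
    /// Milliseconds since the UNIX epoch.
    TimestampMillis(Vec<i64>),
    Utf8(Vec<String>),
}

/// Hypothetical stand-in for buffers of Parquet physical types.
#[derive(Debug, PartialEq)]
enum ParquetValues {
    Boolean(Vec<bool>),
    Int32(Vec<i32>),
    Int64(Vec<i64>),
    ByteArray(Vec<Vec<u8>>),
}

/// Map each Arrow-style column to the Parquet physical type that would hold it.
/// The logical annotation (DATE, TIMESTAMP, STRING) would live in the schema,
/// which this sketch does not model.
fn to_parquet_values(col: ArrowColumn) -> ParquetValues {
    match col {
        ArrowColumn::Boolean(v) => ParquetValues::Boolean(v),
        // Date32 shares the INT32 physical type with plain 32-bit integers.
        ArrowColumn::Int32(v) | ArrowColumn::Date32(v) => ParquetValues::Int32(v),
        ArrowColumn::TimestampMillis(v) => ParquetValues::Int64(v),
        // Strings are written as their UTF-8 bytes.
        ArrowColumn::Utf8(v) => {
            ParquetValues::ByteArray(v.into_iter().map(String::into_bytes).collect())
        }
    }
}
```

Each match arm corresponds to one of the smaller pieces mentioned above; dictionary handling and nulls would add further arms and a separate levels buffer.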

          People

            nevi_me Neville Dipale
            andygrove Andy Grove
            Votes: 1
            Watchers: 8

              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 33h 10m