PARQUET-1972

[C++] Switch to format version 2 as default for writing Parquet

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parquet-cpp
    • Labels: None

      Description

      Related to the thread on the arrow dev mailing list: https://lists.apache.org/thread.html/rf1a377c66990ae5ac0693119d416c93a7e19228d3eaaea8bd90acb17%40%3Cdev.arrow.apache.org%3E

      Currently, when writing Parquet files with Arrow (parquet-cpp), we default to Parquet format version "1.0". In practice, this means that we don't use certain LogicalTypes (e.g. we don't write integer types other than int32/int64, and we don't write nanosecond timestamps).

      I think it would be nice to enable nanosecond timestamps by default, but I also have no idea how widely this is already supported by other readers.

      To be clear, this is not about enabling data page version 2 by default; in Arrow, that is governed by a separate option.

      While checking this, I made an overview of which types were introduced in which Parquet format version, in case someone wants to see the details: https://nbviewer.jupyter.org/gist/jorisvandenbossche/3cc9942eaffb53564df65395e5656702

            People

            • Assignee: Unassigned
            • Reporter: Joris Van den Bossche (jorisvandenbossche)