Apache Arrow / ARROW-16119 [Python] Deprecate the legacy ParquetDataset custom python-based implementation

ARROW-16240
[Python] Support row_group_size/chunk_size keyword in pq.write_to_dataset with use_legacy_dataset=False



    Description

      The legacy implementation of pq.write_to_dataset supports the row_group_size/chunk_size keyword for specifying the row group size of the written Parquet files.

      Now that use_legacy_dataset=False is the default, this keyword no longer works.

      This is because dataset.write_dataset(..) doesn't support the Parquet row_group_size keyword: the ParquetFileWriteOptions class does not expose it.

      On the Parquet side, this is also the only keyword that is passed not to the ParquetWriter constructor (and thus to Parquet's WriterProperties or ArrowWriterProperties) but to the actual write_table call. In C++ this can be seen at https://github.com/apache/arrow/blob/76d064c729f5e2287bf2a2d5e02d1fb192ae5738/cpp/src/parquet/arrow/writer.h#L62-L71

      See discussion: https://github.com/apache/arrow/pull/12811#discussion_r845304218


            People

              Assignee: Alenka Frim
              Reporter: Alenka Frim
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h 10m