  1. Apache Arrow
  2. ARROW-15855

[Python] Add dictionary_pagesize_limit to Parquet writer


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Parquet, Python
    • Labels: None

    Description

      Although the Python Parquet API is a wrapper around the C++ implementation, some tuning knobs are not exposed in Python, for example dictionary_pagesize_limit_. The dictionary page size can easily exceed the limit when one or more of the following hold:

        1. The row_group_size is relatively large, e.g. the default is 64M.
        2. The size per entry is large, e.g. a large string column.
        3. The data is not very repetitive.

      Once the limit is exceeded, the writer falls back to plain encoding, so dictionary encoding is not fully utilized if this parameter cannot be tuned. In C++, however, this parameter can be tuned to an optimal setting.
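
      To make the three conditions above concrete, here is a rough back-of-the-envelope sketch in plain Python. It is not the Arrow implementation: the helper names are hypothetical, and it assumes a string (BYTE_ARRAY) column whose dictionary page stores each distinct value with a 4-byte length prefix, plus a 1 MiB default limit. Both assumptions should be checked against the Arrow version in use.

```python
# Rough estimate only, not the Arrow/Parquet writer logic.
# Assumption: the dictionary page holds each distinct value once,
# PLAIN-encoded as a 4-byte length prefix followed by the UTF-8 bytes.
# Assumption: the default dictionary page size limit is 1 MiB.
ASSUMED_DEFAULT_LIMIT = 1024 * 1024


def estimate_dictionary_bytes(values):
    """Approximate dictionary page size for the distinct strings in `values`."""
    distinct = {v.encode("utf-8") for v in values}
    return sum(4 + len(v) for v in distinct)


def exceeds_limit(values, limit=ASSUMED_DEFAULT_LIMIT):
    """True if the estimated dictionary for one row group outgrows `limit`."""
    return estimate_dictionary_bytes(values) > limit


# Highly repetitive short strings: the dictionary stays tiny.
repetitive = ["red", "green", "blue"] * 100_000

# Large, unique strings (200 bytes each): the dictionary outgrows the limit.
unique_large = [f"{i:08d}" + "x" * 192 for i in range(10_000)]

if __name__ == "__main__":
    print(estimate_dictionary_bytes(repetitive), exceeds_limit(repetitive))
    print(estimate_dictionary_bytes(unique_large), exceeds_limit(unique_large))
```

      With large or mostly unique entries in a big row group, the estimate passes the assumed limit quickly, which is the case where a tunable limit would help.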

       

      Other parameters are also not exposed in Python, for example max_statistics_size.

            People

              Assignee: Unassigned
              Reporter: Xinyu Zeng (xzeng)
              Votes: 0
              Watchers: 6

              Dates

                Created:
                Updated:
                Resolved: