Apache Arrow / ARROW-5089

[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: C++

      Description

      Currently, a workaround is in place for writing dictionary-encoded columns to Parquet.

      The workaround converts the dictionary-encoded array to its plain version before writing. This is painfully slow because the entire array is converted again for every row group.
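
      Conceptually, the cost pattern resembles the sketch below (illustrative Python only; the actual conversion happens inside the C++ writer, and the helper name write_row_groups is hypothetical): the full column is decoded once per row group instead of once per table.

      import pyarrow as pa

      # Illustrative sketch of the cost pattern, not the actual C++ code path:
      # the whole dictionary column is decoded to its plain representation for
      # every row group, even though each row group only needs a small slice.
      def write_row_groups(column: pa.DictionaryArray, chunk_size: int):
          for offset in range(0, len(column), chunk_size):
              plain = column.dictionary_decode()           # O(n) decode, repeated per row group
              row_group = plain.slice(offset, chunk_size)  # only a tiny slice is actually written
              # ... row_group would then be encoded and written to the file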

      The following example is orders of magnitude slower than the non-dictionary-encoded version:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd

      # 200,000-row dictionary-encoded (categorical) column
      df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
      table = pa.Table.from_pandas(df)
      buf = pa.BufferOutputStream()

      # chunk_size=100 yields 2,000 row groups; the full column is decoded
      # from its dictionary representation for every one of them
      pq.write_table(
          table,
          buf,
          chunk_size=100,
      )
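
      For comparison, writing the same data without dictionary encoding, i.e. dropping the .astype("category") cast, is fast even with the same small chunk_size:

      # Same data without dictionary encoding: fast despite chunk_size=100
      df_plain = pd.DataFrame({"col": ["A", "B"] * 100000})
      table_plain = pa.Table.from_pandas(df_plain)
      buf_plain = pa.BufferOutputStream()
      pq.write_table(table_plain, buf_plain, chunk_size=100)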
       


              People

              • Assignee: Wes McKinney (wesm)
              • Reporter: Florian Jetter (fjetter)
              • Votes: 0
              • Watchers: 2
