Apache Arrow / ARROW-5089

[C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.15.0
    • Component/s: C++

      Description

      Currently, a workaround is in place for writing dictionary-encoded columns to Parquet.

      The workaround converts the dictionary-encoded array to its plain version before writing. This is painfully slow because the entire array is converted again for every row group.
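
      Conceptually, the cost pattern resembles the sketch below (illustrative Python only; the actual conversion happens inside the C++ writer, and the helper name write_row_groups is hypothetical): the full column is decoded once per row group instead of once per table.

      import pyarrow as pa

      # Illustrative sketch of the cost pattern, not the actual C++ code path:
      # the whole dictionary column is decoded to its plain representation for
      # every row group, even though each row group only needs a small slice.
      def write_row_groups(column: pa.DictionaryArray, chunk_size: int):
          for offset in range(0, len(column), chunk_size):
              plain = column.dictionary_decode()           # O(n) decode, repeated per row group
              row_group = plain.slice(offset, chunk_size)  # only a tiny slice is actually written
              # ... row_group would then be encoded and written to the file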

      The following example is orders of magnitude slower than the non-dictionary-encoded version:

      import pyarrow as pa
      import pyarrow.parquet as pq
      import pandas as pd

      # 200,000-row dictionary-encoded (categorical) column
      df = pd.DataFrame({"col": ["A", "B"] * 100000}).astype("category")
      table = pa.Table.from_pandas(df)
      buf = pa.BufferOutputStream()

      # chunk_size=100 yields 2,000 row groups; the full column is decoded
      # from its dictionary representation for every one of them
      pq.write_table(
          table,
          buf,
          chunk_size=100,
      )
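
      For comparison, writing the same data without dictionary encoding, i.e. dropping the .astype("category") cast, is fast even with the same small chunk_size:

      # Same data without dictionary encoding: fast despite chunk_size=100
      df_plain = pd.DataFrame({"col": ["A", "B"] * 100000})
      table_plain = pa.Table.from_pandas(df_plain)
      buf_plain = pa.BufferOutputStream()
      pq.write_table(table_plain, buf_plain, chunk_size=100)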
       


              People

              • Assignee: Wes McKinney (wesm)
              • Reporter: Florian Jetter (fjetter)
              • Votes: 0
              • Watchers: 2
