Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
Description
Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or Arrow) categorical, then we can trust that the whole of the column is dictionary-encoded, and load the data directly into a categorical column rather than expanding the labels on load and recategorising afterwards.
If the data does not have the pandas metadata, then no such guarantee holds: we cannot assume that the whole column is dictionary-encoded, nor that the dictionary labels are the same throughout the file. In that case, the current behaviour is fine.
(Please forgive that some of this has already been mentioned elsewhere; it is one of the entries in the list at https://github.com/dask/fastparquet/issues/374 of features that have proven useful in fastparquet.)
Issue Links
- depends upon:
  - ARROW-6152 [C++][Parquet] Write arrow::Array directly into parquet::TypedColumnWriter<T> (Resolved)
- is related to:
  - ARROW-3652 [Python] CategoricalIndex is lost after reading back (Resolved)
  - ARROW-5089 [C++/Python] Writing dictionary encoded columns to parquet is extremely slow when using chunk size (Resolved)
  - PARQUET-800 [C++] Provide public API to access dictionary-encoded indices and values (Resolved)
  - PARQUET-924 [C++] Persist original type metadata from Arrow schemas (Resolved)
  - ARROW-3325 [Python] Support reading Parquet binary/string columns directly as DictionaryArray (Resolved)
  - ARROW-5480 [Python] Pandas categorical type doesn't survive a round-trip through parquet (Resolved)
- relates to:
  - ARROW-3772 [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow DictionaryArray (Resolved)
- supersedes:
  - ARROW-4359 [Python] Column metadata is not saved or loaded in parquet (Closed)