Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-3246

[Python][Parquet] direct reading/writing of pandas categoricals in parquet

    XMLWordPrintableJSON

Details

    Description

      Parquet supports "dictionary encoding" of column data in a manner very similar to the concept of Categoricals in pandas. It is natural to use this encoding for a column which originated as a categorical. Conversely, when loading, if the file metadata says that a given column came from a pandas (or arrow) categorical, then we can trust that the whole of the column is dictionary-encoded and load the data directly into a categorical column, rather than expanding the labels upon load and recategorising later.

      If the data does not have the pandas metadata, then the guarantee cannot hold, and we cannot assume either that the whole column is dictionary encoded or that the labels are the same throughout. In this case, the current behaviour is fine.

       

      (please forgive that some of this has already been mentioned elsewhere; this is one of the entries in the list at https://github.com/dask/fastparquet/issues/374 as a feature that is useful in fastparquet)

      Attachments

        Issue Links

          Activity

            People

              wesm Wes McKinney
              mdurant Martin Durant
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated:
                  Original Estimate - Not Specified
                  Not Specified
                  Remaining:
                  Remaining Estimate - 0h
                  0h
                  Logged:
                  Time Spent - 9h 40m
                  9h 40m