  1. Apache Arrow
  2. ARROW-15246

[Python] Automatic conversion of low-cardinality string array to Dictionary Array

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0.1
    • Fix Version/s: None
    • Component/s: Python
    • Labels: None

    Description

      Users who convert pandas string columns to Arrow arrays may be surprised to see the Arrow arrays use far more memory when the column's cardinality is low. The current workaround is to convert the column to a pandas Categorical first, but it would save some headaches if the conversion could detect, automatically or behind an option, when a Dictionary type is more appropriate than a String type.

      Here's an example of what I'm talking about:

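      # Run in IPython or Jupyter; %memit comes from the memory_profiler extension.
      %load_ext memory_profiler
      import pyarrow as pa
      import pandas as pd
      
      # One million copies of the same 30-character string (cardinality 1).
      x_str = "x" * 30
      df = pd.DataFrame({"col": [x_str] * 1_000_000})
      
      # Object-dtype strings become an Arrow string column.
      %memit tab1 = pa.Table.from_pandas(df)
      # peak memory: 269.44 MiB, increment: 121.62 MiB
      
      # A pandas Categorical becomes an Arrow dictionary column.
      df['col'] = df['col'].astype('category')
      %memit tab2 = pa.Table.from_pandas(df)
      # peak memory: 286.14 MiB, increment: 1.20 MiB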
      

      One drawback of inferring the type automatically: when a sequence of pandas DataFrames is converted, the resulting tables may end up with differing schemas, because the same column can be inferred as String in one DataFrame and as Dictionary in another. For that reason this behavior should probably be opt-in.

          People

            Assignee: Unassigned
            Reporter: Will Jones (willjones127)