Beam / BEAM-12169

DataFrame API: Allow non-deferred column operations on categorical columns

    Details

      Description

      We currently disallow several operations because the set of columns in their output depends on the data (non-deferred columns). However, for some dtypes (categorical, boolean) we can enumerate every value that could be seen at execution time, so we can predict the output columns up front.
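
      As a rough illustration of "enumerating the possible values", the sketch below derives them from a dtype alone. The helper name possible_values is hypothetical and is not part of Beam or pandas; it also ignores missing values, which could appear in addition to the listed ones.

```python
import numpy as np
import pandas as pd


def possible_values(dtype):
    """Hypothetical helper: list every non-null value a column of `dtype`
    can hold, or return None when that set is unknown before execution."""
    if isinstance(dtype, pd.CategoricalDtype):
        # The categories are part of the dtype, so they are known up front.
        return list(dtype.categories)
    if pd.api.types.is_bool_dtype(dtype):
        return [False, True]
    # Any other dtype: the values depend on the data.
    return None


color = pd.CategoricalDtype(["red", "green", "blue"])
print(possible_values(color))              # ['red', 'green', 'blue']
print(possible_values(np.dtype("bool")))   # [False, True]
print(possible_values(np.dtype("int64")))  # None
```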

      Note that we still can't match pandas exactly: pandas typically creates columns only for the values that are actually observed, while we would have to create a column for every possible value.
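
      The difference can be seen in plain pandas. This is only a sketch of the behavior gap, not of how Beam would implement it: the reindex step stands in for "create a column for every possible value", and the example data is made up.

```python
import pandas as pd

sizes = pd.CategoricalDtype(["a", "b", "c"])
s = pd.Series(["a", "b", "a"], dtype=sizes)  # "c" is declared but never occurs

# pandas: one indicator column per value that actually appears in the data.
observed = s.astype(str).str.get_dummies()

# Predictable alternative: one column per declared category, zeros if unseen.
full = observed.reindex(columns=list(sizes.categories), fill_value=0)

print(list(observed.columns))  # ['a', 'b']
print(list(full.columns))      # ['a', 'b', 'c']
```

      The second result has a fixed set of columns regardless of the data, at the cost of an all-zero column that pandas would not produce.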

      We should allow these operations in these special cases.

      Operations in this category:

      • DataFrame.unstack (can work if the unstacked level is a categorical or boolean column)
      • Series.str.get_dummies
      • Series.str.split
      • Series.str.rsplit
      • DataFrame.pivot
      • DataFrame.pivot_table
      • len(GroupBy) and GroupBy.ngroups
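
      For the unstack case, a boolean level might be handled along these lines. Again this is a plain-pandas sketch with made-up data; the fixed column list [False, True] is an assumption about what a deferred implementation would declare, not existing Beam behavior.

```python
import pandas as pd

df = pd.DataFrame({"key": [1, 2], "flag": [True, True], "value": [10, 20]})

# pandas only produces a column for the flag value present in the data (True).
observed = df.set_index(["key", "flag"])["value"].unstack("flag")

# A boolean level has exactly two possible values, so the output columns can
# be declared ahead of time; the unseen one is filled with NaN.
full = observed.reindex(columns=[False, True])

print(observed.shape)  # (2, 1)
print(full.shape)      # (2, 2)
```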

              People

              • Assignee: Unassigned
              • Reporter: Brian Hulette (bhulette)