Beam / BEAM-12169

Allow non-deferred column operations on categorical columns

Details

    Description

      Several operations are currently disallowed because the set of columns in their output depends on the data (non-deferred columns). For some dtypes (categorical, boolean), however, we can easily enumerate every value that can appear at execution time, so we can predict the output columns before execution.

      Note that we still can't implement these operations with exact pandas semantics: pandas typically creates columns only for the values that are actually observed, while we would have to create a column for every possible value.
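The gap between observed and possible values can be illustrated with plain pandas. This is only a sketch under assumptions: the data and the names `dtype`, `observed`, and `fixed` are made up for illustration, and it says nothing about how Beam would implement the operation.

```python
import pandas as pd

# Hypothetical data: the dtype declares three categories, but only two occur.
dtype = pd.CategoricalDtype(categories=["a", "b", "c"])
s = pd.Series(["a", "b", "a"], dtype=dtype)

# The possible values are known from the dtype alone, before looking at data.
possible_values = list(s.dtype.categories)

# pandas only creates columns for the values that actually appear.
observed = s.astype(object).str.get_dummies()

# A data-independent schema needs one column per possible value instead.
fixed = observed.reindex(columns=possible_values, fill_value=0)

print(list(observed.columns))
print(list(fixed.columns))
```

Here `observed` has columns for "a" and "b" only, while `fixed` also carries an all-zero "c" column. That extra column is the deviation from pandas that the note above describes.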

      We should allow these operations in these special cases.

      Operations in this category:

      • DataFrame.unstack, Series.unstack (can work if unstacked level is a categorical or boolean column)
      • Series.str.get_dummies
      • Series.str.split
      • Series.str.rsplit
      • DataFrame.pivot
      • DataFrame.pivot_table
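As a rough sketch of the special-case check for `unstack`, the helper below decides from a dtype alone whether the output columns are enumerable. The helper name `enumerable_values` and the sample data are hypothetical, not part of Beam or pandas; only standard pandas calls are used.

```python
import pandas as pd


def enumerable_values(dtype):
    """Return every value a column of this dtype can hold, or None.

    Hypothetical helper: only categorical and boolean dtypes are enumerable.
    """
    if isinstance(dtype, pd.CategoricalDtype):
        return list(dtype.categories)
    if pd.api.types.is_bool_dtype(dtype):
        return [False, True]
    return None


# Made-up data in which the boolean level only ever takes the value True.
df = pd.DataFrame({
    "key": ["x", "y"],
    "flag": [True, True],
    "val": [1.0, 2.0],
})

columns = enumerable_values(df["flag"].dtype)

# pandas creates a column only for the observed value (True).
wide = df.set_index(["key", "flag"])["val"].unstack("flag")

# Reindexing to the enumerated values gives a data-independent column set;
# the unobserved False column is filled with NaN.
fixed = wide.reindex(columns=columns)
```

For a dtype such as float64 the helper returns `None`, which corresponds to the case where the operation would stay disallowed.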

            People

              Assignee: Unassigned
              Reporter: Brian Hulette (bhulette)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 34h