
Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels: None
    • Release Note:
      A new builtin UDF, CubeDimensions, is added to simplify the process of producing cube-like aggregations.

      CubeDimensions produces a DataBag containing every combination of the argument tuple's members, as in a data cube: each member is either kept or replaced by the "all" marker. For example, (a, b, c) produces the following bag of 2^3 = 8 tuples:

       { (a, b, c), (null, null, null), (a, b, null), (a, null, c),
         (a, null, null), (null, b, c), (null, null, c), (null, b, null) }
       
      The "all" marker is null by default, but can be set to an arbitrary string by invoking a constructor (via a DEFINE). The constructor takes a single argument, the string you want to represent "all".
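
      The expansion described above can be sketched in plain Java. This is only an illustration of the idea, not the code from the attached patches: the class name CubeSketch and the method cube are hypothetical, and plain lists of strings stand in for Pig tuples and bags. Each bit of a mask decides whether a dimension keeps its value or is replaced by the "all" marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only; names here are hypothetical, not Pig's API. */
class CubeSketch {

    /**
     * Returns all 2^n points of the cube over the given dimension values.
     * Bit i of the mask decides whether dimension i keeps its value or is
     * replaced by allMarker (null mirrors the documented default marker).
     */
    static List<List<String>> cube(List<String> dims, String allMarker) {
        int n = dims.size();
        List<List<String>> points = new ArrayList<>(1 << n);
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> point = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                boolean keep = (mask & (1 << i)) != 0;
                point.add(keep ? dims.get(i) : allMarker);
            }
            points.add(point);
        }
        return points;
    }

    public static void main(String[] args) {
        // Default marker: null stands for "all".
        System.out.println(cube(Arrays.asList("a", "b", "c"), null));
        // Custom marker, analogous to passing a string to the constructor.
        System.out.println(cube(Arrays.asList("a", "b", "c"), "ALL"));
    }
}
```

      The output size doubles with every added dimension, so the number of rows fed into the subsequent group grows as 2^n per input record.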

      Usage goes something like this:

      -- measure must be part of the loaded schema to be aggregated later
      events = load '/logs/events' using EventLoader() as (lang, event, app_id, measure);
      -- expand every record into all of its cube points
      cubed = foreach events generate
        FLATTEN(CubeDimensions(lang, event, app_id))
          as (lang, event, app_id),
        measure;
      -- aggregate per cube point
      cube = foreach (group cubed
                      by (lang, event, app_id) parallel $P)
             generate
        flatten(group) as (lang, event, app_id),
        COUNT_STAR(cubed),
        SUM(cubed.measure);
      store cube into 'event_cube';

      Note: doing this with non-algebraic aggregations on large data can result in very slow reducers, because the group in which every dimension is "all" (the grand total) receives one row for every record in your relation.
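
      To make the skew concrete, the sketch below counts how many expanded rows land in each group for a tiny in-memory input. It is an illustration under assumptions, not Pig code: the class CubeSkewSketch and the method groupSizes are hypothetical, and null is used as the "all" marker. The all-null group always ends up with as many rows as there are input records, which is why an aggregation that cannot be partially computed before the reduce phase has to process the whole relation in one reducer.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; names here are hypothetical, not Pig's API. */
class CubeSkewSketch {

    /** Counts how many expanded rows fall into each cube group. */
    static Map<List<String>, Integer> groupSizes(List<List<String>> records) {
        Map<List<String>, Integer> sizes = new LinkedHashMap<>();
        for (List<String> rec : records) {
            int n = rec.size();
            for (int mask = 0; mask < (1 << n); mask++) {
                // Build the group key: keep a value when its bit is set,
                // otherwise use null as the "all" marker.
                String[] key = new String[n];
                for (int i = 0; i < n; i++) {
                    key[i] = (mask & (1 << i)) != 0 ? rec.get(i) : null;
                }
                sizes.merge(Arrays.asList(key), 1, Integer::sum);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("en", "click", "1"),
                Arrays.asList("fr", "click", "2"),
                Arrays.asList("de", "view", "1"));
        // Print each group key with the number of rows it receives.
        groupSizes(records).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```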

    Description

      A prerequisite for a naive cubing implementation:
      a UDF that, given a set of dimensions (a, b, c), generates all the points on the cube:
      (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, null, null), (null, b, null), (null, null, null).

      Attachments

        1. PIG-2168.2.patch
          7 kB
          Dmitriy V. Ryaboy
        2. PIG-2168.patch
          5 kB
          Dmitriy V. Ryaboy

          People

            dvryaboy Dmitriy V. Ryaboy