Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      A new builtin UDF, CubeDimensions, is added to simplify the process of producing cube-like aggregations.

      CubeDimensions produces a DataBag containing every combination of the argument tuple's members, as in a data cube: each member is either kept or replaced by the "all" marker. For example, (a, b, c) produces the following bag of 2^3 = 8 tuples:

       { (a, b, c), (null, null, null), (a, b, null), (a, null, c),
         (a, null, null), (null, b, c), (null, null, c), (null, b, null) }
       
      The "all" marker is null by default, but can be set to an arbitrary string by invoking a constructor (via a DEFINE). The constructor takes a single argument, the string you want to represent "all".
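The enumeration described above can be sketched outside Pig. The following is a minimal Python illustration, not the UDF's actual Java implementation; the function name `cube_dimensions` and the parameter `all_marker` are hypothetical names chosen for this sketch.

```python
from itertools import product

def cube_dimensions(dims, all_marker=None):
    """Return every combination of dims in which each position is
    either kept or replaced by all_marker (2**len(dims) tuples)."""
    results = []
    # Each mask entry says whether to keep (True) or roll up (False)
    # the dimension at that position.
    for mask in product((True, False), repeat=len(dims)):
        results.append(
            tuple(d if keep else all_marker for d, keep in zip(dims, mask))
        )
    return results

# Default marker: None plays the role of Pig's null.
bag = cube_dimensions(("a", "b", "c"))

# Custom marker, analogous to passing a string to the constructor.
bag_all = cube_dimensions(("a", "b", "c"), all_marker="ALL")
```

The order of tuples in the returned list is an artifact of this sketch; the release note does not specify any ordering for the bag.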

      Usage goes something like this:

      -- measure is assumed to be a numeric field of the loaded relation
      events = load '/logs/events' using EventLoader()
               as (lang, event, app_id, measure);
      cubed = foreach events generate
        FLATTEN(CubeDimensions(lang, event, app_id))
          as (lang, event, app_id),
        measure;
      cube = foreach (group cubed
                      by (lang, event, app_id) parallel $P)
             generate
        flatten(group) as (lang, event, app_id),
        COUNT_STAR(cubed),
        SUM(cubed.measure);
      store cube into 'event_cube';

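      Note: doing this with non-algebraic aggregations on large data can result in very slow reducers, because the group in which every dimension is "all", (null, null, null) by default, receives every record in your relation. Algebraic functions such as COUNT_STAR and SUM can be partially aggregated by the combiner before the reduce phase; non-algebraic functions cannot.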
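To make the skew concrete, here is a small Python sketch that counts how many rows land in each cube group. It is an illustration only: the sample records and the helper `cube_dimensions` are hypothetical and do not come from the patch.

```python
from collections import Counter
from itertools import product

def cube_dimensions(dims, all_marker=None):
    """Hypothetical stand-in for the UDF: keep or roll up each position."""
    return [
        tuple(d if keep else all_marker for d, keep in zip(dims, mask))
        for mask in product((True, False), repeat=len(dims))
    ]

# Hypothetical sample records: (lang, event, app_id)
records = [
    ("en", "click", 1),
    ("en", "view", 2),
    ("fr", "click", 1),
    ("de", "view", 3),
]

# Count how many cubed rows fall into each group key.
group_sizes = Counter()
for rec in records:
    for point in cube_dimensions(rec):
        group_sizes[point] += 1

# The fully rolled-up key collects one row per input record.
largest_group, largest_size = group_sizes.most_common(1)[0]
```

In this sample the largest group is the fully rolled-up key and its size equals the number of input records, which is why a reducer that must see that whole bag at once becomes the bottleneck.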

      Description

      A prerequisite for a naive cubing implementation:
      a UDF that, given a set of dimensions (a, b, c), generates all the points on the cube:
      (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, null, null), (null, b, null), (null, null, null).

      1. PIG-2168.patch
        5 kB
        Dmitriy V. Ryaboy
      2. PIG-2168.2.patch
        7 kB
        Dmitriy V. Ryaboy


          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Development