Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      A new builtin UDF, CubeDimensions, is added to simplify the process of producing cube-like aggregations.

      CubeDimensions produces a DataBag containing every combination of the argument tuple's members, as in a data cube, with each omitted member replaced by the "all" marker. For example, (a, b, c) produces the following bag of 2^3 = 8 tuples:

       { (a, b, c), (null, null, null), (a, b, null), (a, null, c),
         (a, null, null), (null, b, c), (null, null, c), (null, b, null) }
       
      The "all" marker is null by default, but can be set to an arbitrary string by invoking a constructor (via a DEFINE). The constructor takes a single argument, the string you want to represent "all".
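
The combination rule above can be illustrated outside of Pig. The following is a minimal Python sketch, not the UDF's actual Java implementation: the function name `cube_dimensions` and the `all_marker` parameter are hypothetical names chosen for illustration. Each position of the input tuple is either kept or replaced by the marker, which gives 2^n tuples for n dimensions.

```python
from itertools import product

def cube_dimensions(dims, all_marker=None):
    """Return every tuple in which each member of dims is either kept
    or replaced by all_marker (hypothetical helper, for illustration only)."""
    # For each dimension there are two choices: the value itself or the marker.
    choices = [(value, all_marker) for value in dims]
    return [tuple(combo) for combo in product(*choices)]

points = cube_dimensions(("a", "b", "c"))
for point in points:
    print(point)
```

Passing `all_marker="*"` mimics the effect of supplying a custom "all" string. Note that this sketch produces duplicate tuples when an input value already equals the marker; it makes no attempt to handle that case.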

      Usage goes something like this:

      events = load '/logs/events' using EventLoader()
                 as (lang, event, app_id, measure);
      -- emit one row per cube point for every input record
      cubed = foreach events generate
        FLATTEN(piggybank.CubeDimensions(lang, event, app_id))
          as (lang, event, app_id),
        measure;
      -- aggregate each cube point
      cube = foreach (group cubed
                      by (lang, event, app_id) parallel $P)
             generate
        flatten(group) as (lang, event, app_id),
        COUNT_STAR(cubed),
        SUM(cubed.measure);
      store cube into 'event_cube';

      Note: doing this with non-algebraic aggregations on large data can result in very slow reducers, because the group in which every dimension is "all", (null, null, null) by default, receives every record in your relation.
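
The skew described in the note can be seen in a small simulation. This is a rough Python sketch of the cube-then-group pipeline, not Pig code: the sample records and the helper names `cube_points` and `cube_aggregate` are made up for illustration. Every input record contributes one row to the all-null group, so that group is always as large as the input relation.

```python
from collections import defaultdict
from itertools import product

def cube_points(dims, all_marker=None):
    # Each dimension is either kept or replaced by the "all" marker.
    return product(*[(d, all_marker) for d in dims])

# Hypothetical records: (lang, event, app_id, measure)
records = [
    ("en", "click", 1, 10),
    ("en", "view", 1, 5),
    ("fr", "click", 2, 7),
]

def cube_aggregate(records):
    """Group measures by cube point and return {point: (count, sum)}."""
    groups = defaultdict(list)
    for *dims, measure in records:
        for point in cube_points(dims):
            groups[point].append(measure)
    return {point: (len(ms), sum(ms)) for point, ms in groups.items()}

cube = cube_aggregate(records)
print(cube[(None, None, None)])  # the group that sees every record
print(cube[("en", None, None)])
```

With an algebraic aggregate such as a count or a sum, partial results can be combined before the reducer, which is why the note singles out non-algebraic aggregations as the problematic case.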

      Description

      A prerequisite for a naive cubing implementation: a UDF that, given a set of dimensions (a, b, c), generates all the points on the cube:
      (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, null, null), (null, b, null), (null, null, null).

      1. PIG-2168.2.patch
        7 kB
        Dmitriy V. Ryaboy
      2. PIG-2168.patch
        5 kB
        Dmitriy V. Ryaboy

        Activity

        Dmitriy V. Ryaboy created issue -
        Dmitriy V. Ryaboy made changes -
        Field Original Value New Value
        Fix Version/s 0.10 [ 12316246 ]
        Dmitriy V. Ryaboy made changes -
        Attachment PIG-2168.patch [ 12486629 ]
        Dmitriy V. Ryaboy made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Dmitriy V. Ryaboy made changes -
        Attachment PIG-2168.2.patch [ 12487868 ]
        Dmitriy V. Ryaboy made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Dmitriy V. Ryaboy made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
