Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      A new builtin UDF, CubeDimensions, is added to simplify the process of producing cube-like aggregations.

      CubeDimensions produces a DataBag containing every combination of the argument tuple's members, as in a data cube: each member is either kept or replaced by the "all" marker. For example, (a, b, c) produces the following bag:

       { (a, b, c), (null, null, null), (a, b, null), (a, null, c),
         (a, null, null), (null, b, c), (null, null, c), (null, b, null) }
       
      The "all" marker is null by default, but can be set to an arbitrary string by invoking a constructor (via a DEFINE). The constructor takes a single argument, the string you want to represent "all".
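
The enumeration described above can be sketched in plain Java. This is a minimal illustration, not the code from the attached patch: the class name `CubeSketch` and the method `cube` are hypothetical, and plain lists of strings stand in for Pig's Tuple and DataBag types. Each bit of a counter decides whether one dimension is kept or replaced by the "all" marker, which yields 2^n rows for n dimensions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not the CubeDimensions source from the patch.
class CubeSketch {
    /** Returns all 2^n variants of dims, replacing every subset of positions with allMarker. */
    static List<List<String>> cube(List<String> dims, String allMarker) {
        int n = dims.size();
        // Exactly 2^n rows are produced, so the list can be sized exactly.
        List<List<String>> rows = new ArrayList<>(1 << n);
        for (int mask = 0; mask < (1 << n); mask++) {
            List<String> row = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                // Bit i set means dimension i is rolled up to "all".
                row.add((mask & (1 << i)) != 0 ? allMarker : dims.get(i));
            }
            rows.add(row);
        }
        return rows;
    }

    public static void main(String[] args) {
        // null plays the role of the default "all" marker.
        for (List<String> row : cube(Arrays.asList("a", "b", "c"), null)) {
            System.out.println(row);
        }
    }
}
```

Passing a string such as "ALL" instead of null as the second argument corresponds to configuring the marker through the constructor.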

      Usage goes something like this:

      -- measure is assumed to be part of the loaded schema
      events = load '/logs/events' using EventLoader() as (lang, event, app_id, measure);
      -- emit one record per cube point for every input record
      cubed = foreach events generate
        FLATTEN(CubeDimensions(lang, event, app_id))
          as (lang, event, app_id),
        measure;
      -- aggregate each cube point
      cube = foreach (group cubed
                      by (lang, event, app_id) parallel $P)
             generate
        flatten(group) as (lang, event, app_id),
        COUNT_STAR(cubed),
        SUM(cubed.measure);
      store cube into 'event_cube';

      Note: doing this with non-algebraic aggregations on large data can result in very slow reducers, since one group (the one in which every dimension is "all") receives every record in your relation.
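
To make the skew concrete, the following sketch counts how many expanded records land in each group. It is an illustration under stated assumptions, not Pig code: the class name `CubeSkewSketch` and the method `groupSizes` are hypothetical, records are plain lists of strings, and null stands for the "all" marker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class; it mimics "expand each record, then group by the key".
class CubeSkewSketch {
    /** Expands each record into its 2^n cube points and counts the records per group key. */
    static Map<List<String>, Integer> groupSizes(List<List<String>> records) {
        Map<List<String>, Integer> sizes = new HashMap<>();
        for (List<String> rec : records) {
            int n = rec.size();
            for (int mask = 0; mask < (1 << n); mask++) {
                List<String> key = new ArrayList<>(n);
                for (int i = 0; i < n; i++) {
                    // null stands for the "all" marker.
                    key.add((mask & (1 << i)) != 0 ? null : rec.get(i));
                }
                sizes.merge(key, 1, Integer::sum);
            }
        }
        return sizes;
    }

    public static void main(String[] args) {
        List<List<String>> records = Arrays.asList(
                Arrays.asList("en", "click", "1"),
                Arrays.asList("en", "view", "2"),
                Arrays.asList("fr", "click", "1"));
        // The all-null key collects one entry per input record.
        groupSizes(records).forEach((key, size) -> System.out.println(key + " -> " + size));
    }
}
```

With the three sample records, the (null, null, null) group has size 3, the full input, while a fully specified group such as (en, click, 1) has size 1. An aggregation that cannot be partially computed before the reduce phase has to process that largest group on a single reducer.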

      Description

      A prerequisite for a naive cubing implementation:
      a UDF that, given a set of dimensions (a, b, c), generates all the points on the cube:
      (a, b, c), (a, b, null), (a, null, c), (null, b, c), (null, null, c), (a, null, null), (null, b, null), (null, null, null).

      1. PIG-2168.patch
        5 kB
        Dmitriy V. Ryaboy
      2. PIG-2168.2.patch
        7 kB
        Dmitriy V. Ryaboy

        Activity

        Thejas M Nair added a comment -

        Patch looks good. A minor comment: Lists.newArrayListWithCapacity would be a better fit for this case than Lists.newArrayListWithExpectedSize. Lists.newArrayListWithExpectedSize adds padding, which is unnecessary here.

        Thejas M Nair added a comment -

        Can you please also add Apache license headers to test/org/apache/pig/test/TestCubeDimensions.java and src/org/apache/pig/builtin/CubeDimensions.java? Everything else in test-patch was successful.

        Dmitriy V. Ryaboy added a comment -

        Attaching a patch that switches to WithCapacity and adds proper Apache headers.

        Thejas M Nair added a comment -

        +1

        Dmitriy V. Ryaboy added a comment -

        Committed to 0.10


          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0
            Watchers:
            2
