Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Release Note:
      Pig 0.11+ includes the following UDFs for operating on Maps:

      1. VALUESET
      2. VALUELIST
      3. KEYSET
      4. INVERSEMAP

      VALUESET

        This UDF takes a Map and returns a Bag containing the unique values
        (the value set). For all values, including duplicates, use
        VALUELIST instead.

        <code>
        grunt> cat data
        [open#apache,1#2,11#2]
        [apache#hadoop,3#4,12#hadoop]
       
        grunt> a = load 'data' as (M:[]);
        grunt> b = foreach a generate VALUESET($0);
        grunt> dump b;
        ({(apache),(2)})
        ({(4),(hadoop)})
       
        </code>

      VALUELIST

       
        This UDF takes a Map and returns a Bag containing the values from the map.
        Note that the output bag contains all values, not just unique ones.
        To obtain unique values from a map, use VALUESET instead.
       
        <code>
        grunt> cat data
        [open#apache,1#2,11#2]
        [apache#hadoop,3#4,12#hadoop]
       
        grunt> a = load 'data' as (M:[]);
        grunt> b = foreach a generate VALUELIST($0);
        grunt> dump b;
        ({(apache),(2),(2)})
        ({(4),(hadoop),(hadoop)})
        </code>

      KEYSET

        This UDF takes a Map and returns a Bag containing the keyset.

        <code>
        grunt> cat data
        [open#apache,1#2,11#2]
        [apache#hadoop,3#4,12#hadoop]
       
        grunt> a = load 'data' as (M:[]);
        grunt> b = foreach a generate KEYSET($0);
        grunt> dump b;
        ({(open),(1),(11)})
        ({(3),(apache),(12)})
        </code>

      INVERSEMAP

        This UDF accepts a Map as input, with values of any primitive data type.
        The UDF swaps keys with values and returns the new inverse Map.
        Since the original values may not be unique, the resulting Map maps each
        String key to a DataBag, where the bag holds all original keys that
        shared that value.

        Note: 1. The UDF accepts a Map with values of any primitive data type.
              2. The UDF returns Map<String, DataBag>.
        <code>
        grunt> cat data
        [open#1,1#2,11#2]
        [apache#2,3#4,12#24]

        grunt> a = load 'data' as (M:[int]);
        grunt> b = foreach a generate INVERSEMAP($0);
        grunt> dump b;
        ([2#{(1),(11)},1#{(open)}])
        ([2#{(apache)},24#{(12)},4#{(3)}])
        </code>
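      The inversion semantics above can be sketched with plain Java collections. This is a hypothetical standalone illustration, not the UDF's implementation; the actual Pig UDF operates on Pig's Map type and returns Map<String, DataBag>, and the class and method names here are assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of INVERSEMAP's semantics: every original key is
// grouped under the string form of its value, so non-unique values
// collect all of their keys into one list.
public class InverseMapSketch {
    public static Map<String, List<String>> inverse(Map<String, Object> input) {
        Map<String, List<String>> result = new HashMap<>();
        for (Map.Entry<String, Object> e : input.entrySet()) {
            String newKey = String.valueOf(e.getValue()); // values become keys
            result.computeIfAbsent(newKey, k -> new ArrayList<>()).add(e.getKey());
        }
        return result;
    }
}
```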
    • Tags:
      udf

      Description

      It would be nice if Pig played better with Maps. To that end, I'd like to add a lot of utility around Maps.

      • TOBAG should take a Map and output {(key, value)}
      • TOMAP should take a Bag in that same form and make a map.
      • KEYSET should return the set of keys.
      • VALUESET should return the set of values.
      • VALUELIST should return the List of values (no deduping).
      • INVERSEMAP would return a Map of each value => the set of keys that map to that value

      This would all be pretty easy. A more substantial piece of work would be to make Pig support non-String keys (this is especially an issue since UDFs and whatnot probably assume that they are all Integers). Not sure if it is worth it.

      I'd love to hear other things that would be useful for people!

      1. PIG-2600.patch
        8 kB
        Prashant Kommireddi
      2. PIG-2600_9.patch
        22 kB
        Prashant Kommireddi
      3. PIG-2600_8.patch
        23 kB
        Prashant Kommireddi
      4. PIG-2600_7.patch
        23 kB
        Prashant Kommireddi
      5. PIG-2600_6.patch
        23 kB
        Prashant Kommireddi
      6. PIG-2600_5.patch
        23 kB
        Prashant Kommireddi
      7. PIG-2600_4.patch
        22 kB
        Prashant Kommireddi
      8. PIG-2600_3.patch
        17 kB
        Prashant Kommireddi
      9. PIG-2600_2.patch
        16 kB
        Prashant Kommireddi

        Activity

        Prashant Kommireddi added a comment -

        Jon, for these UDFs, we could name 1 and 2 MAPTOBAG and BAGTOMAP respectively, since TOBAG, TOMAP exist already. In fact, all UDFs here (except INVERSEMAP) could be prefixed with MAP to avoid confusion.

        Jonathan Coveney added a comment -

        Hmm, I think there is a compelling argument to do so, but at the same time: do we want a bunch of type specific conversion functions where one function can logically apply to multiple? I mean, TOBAG can work on Tuples as well as Maps...theoretically, the Pig type system coupled with the function should make it unambiguous which one is getting applied, and it'd be nice not to have more UDFs than we need.

        Prashant Kommireddi added a comment -

        You are right, we could simply override existing UDFs and not create type-specific ones.

        Prashant Kommireddi added a comment -

        Overload is what I meant, long day!

        Bill Graham added a comment -

        +1 for overloading. People often try to use these with Maps or Bags, realize they can't and then write a MapToBag UDF.

        Do you think INVERSEMAP is something that people need often? I question the general usefulness of that one.

        Russell Jurney added a comment -

        The other thing we need is to parameterize maps with columns.

        Prashant Kommireddi added a comment -

        INVERSEMAP is nice-to-have functionality, not necessarily uber-useful. This could be similar to http://google-collections.googlecode.com/svn/trunk/javadoc/com/google/common/collect/BiMap.html except that it does not impose the restriction that values are unique. Rather, we could think of a strategy to determine which value->key pairs the function returns:

        1. First occurring pair, key pair -> (v1, k1) from (k1,v1)(k2,v1)
        2. Last occurring pair -> (v1, k2) from (k1,v1)(k2,v1)
        3. All keys corresponding to value -> (v1, (k1,k2))

        Input could be tuple of maps or bag of maps.

        INVERSEMAP has been useful to me in a few situations, e.g. to get a field->index map from an index->field map. I see it being useful in a few cases with click-through analysis as well.

        Again, not a super required use-case but good to have.

        Jonathan Coveney added a comment -

        Russell: can you give an example of how you see that looking?

        As far as inversemap... most of the stuff we've talked about thus far is pretty easy to implement. I'm trying to think of if there is anything more ambitious?

        Prashant Kommireddi added a comment -

        Taking a look at TOBAG, the existing UDF already converts a map entry to a bag. Are we looking for additional functionality here?

        Prashant Kommireddi added a comment -

        An initial patch for KEYSET, VALUESET, VALUELIST. I wanted to have more eyes look at it and comment. I have a test case for KEYSET and will be working on tests for others soon.

        Prashant Kommireddi added a comment -

        Working on INVERSEMAP, does it make sense to allow this UDF to be used only when values are Strings too? Since this UDF would involve swapping keys with values, the new keys (originally values) would be required to be chararray.

        Also, if a value is not unique, the UDF would simply return the last key for that value (as per HashMap "put" behavior):

        grunt> cat data;
        [1#2,3#4,4#2]
        [apache#hadoop,doug#hadoop]
        
        grunt> a = load 'data' as (M:[]);
        grunt> b = foreach a generate INVERSEMAP($0);
        grunt> dump b;
        [4#3,2#4]
        [hadoop#doug]
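        The last-key-wins behavior described here follows directly from HashMap.put overwriting existing entries. A minimal, hypothetical illustration with plain Java collections (not the UDF's actual code):

```java
import java.util.HashMap;
import java.util.Map;

// Minimal illustration of "last key wins": inverting by plain put()
// keeps only the most recently inserted key for each value.
public class LastKeyWins {
    public static Map<String, String> naiveInverse(Map<String, String> input) {
        Map<String, String> out = new HashMap<>();
        for (Map.Entry<String, String> e : input.entrySet()) {
            out.put(e.getValue(), e.getKey()); // later keys overwrite earlier ones
        }
        return out;
    }
}
```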
        
        Prashant Kommireddi added a comment -

        Adding a new patch with KEYSET, VALUESET, VALUELIST and INVERSEMAP. Added corresponding test cases to TestBuiltin.java. Also added documentation for all the new classes.

        Jonathan Coveney added a comment -

        Prashant,

        Thanks for working on this! As far as INVERSEMAP, I don't know if I like the semantic that if there are multiple values, you essentially just get a random key. I think you should get a Tuple, Bag, or List of the keys that correspond... the assumption is that values can appear more than once.

        Open to debate though

        Another option on the "INVERSEMAP only works on Map<String,String>" front would be that, if given Map<String, T>, it converts T to a string, which I think isn't a bad idea. Maybe that could be a constructor option, and the default is an error.

        Prashant Kommireddi added a comment -

        Thanks Jon, I agree with both of your statements. The 2nd case Map<String, T> makes sense for primitive types, will upload a new patch soon.

        Prashant Kommireddi added a comment -

        Attaching a revised patch. Here is the new functionality of INVERSEMAP, you would find the documentation in Class docs

         grunt> cat data
         [open#1,1#2,11#2]
         [apache#2,3#4,12#24]
        
        
         grunt> a = load 'data' as (M:[int]);
         grunt> b = foreach a generate INVERSEMAP($0);
        
         grunt> dump b;
         ([2#(1,11),1#(open)])
         ([2#(apache),24#(12),4#(3)])
        
        Jonathan Coveney added a comment -

        Hey Prashant, thanks again for your work on this. Took a look at the patch, looks good, though I'm wondering if VALUELIST and VALUESET shouldn't return bags? I guess it depends how someone plans to use it and you could convert easily from one to the other, but yeah... hmm. I think the big question is how it'd be used. The awkward part about having it as a Tuple is that it'd be impossible to do anything with the elements short of having a UDF to read them, whereas if it's a bag, then we know the type of the keys (whatever it is in the Map), and it seems more workable to me. This idea applies to all of the methods, really.

        Jonathan Coveney added a comment -

        Woops, hit enter a bit early. With INVERSEMAP as well, we know the return type: Map<String,String> (nothing else is really possible).

        For the tests on INVERSEMAP, it's probably worth testing the cases of boolean, int, etc. values.

        Prashant Kommireddi added a comment -

        I don't see a clear advantage of using Tuples over Bags in this case, so I guess I can change it to Bags. One of the use-cases could be to find set intersections or unions between the value sets or lists. For example, each Map could be an array of (User, PageId) entries, and one might want to find intersections between different click-through patterns. I don't use Maps much in my Pig Latin code, to be honest, so others can pitch in on how they would like this to be used.

        Regarding INVERSEMAP, the return type would now be Map<String, DataBag>. We will be returning a bag of the keys a value maps to?

        Prashant Kommireddi added a comment -

        Jon, adding a new patch as per our previous discussion (changing return type to bags). I have also added test cases for various primitive types being Map values.

        Jonathan Coveney added a comment -

        Prashant, thanks for tackling this. Maps have been a second-class citizen for too long. Some comments:

        General:

        • I personally usually do "if (input == null || input.size() == 0)" just so you don't get the array index error, but that's fine. Same issue as in VALUELIST comes up with that as well.

        VALUELIST/SET

        • in some cases maps actually have information on the type of their keys. If this is the case, that information should be used to create the appropriate output schema.
        • I personally like the approach of making TupleFactory and BagFactory a static final member of classes, but that's hardly Pig dogma
        • In the case of a null map that is passed in, there are two options. One is to return null as you do, another is to return an empty bag... the distinction is important because a null value won't get thrown out on a flatten, but a null bag will. I think I'd return a null bag, but would be open either way. It's just something we have to be conscious of.

        KEYSET/LIST

        +        //Inner schema for keys would always be type chararray
        +        return new Schema(new Schema.FieldSchema(null, DataType.BAG)); 
        

        You should build the Schema so that it takes into account the fact that you know it will be chararrays.

        VALUESET

        • You have an interesting comment about how you could iterate over the collection directly.

        For reference, here's the constructor code from HashSet:

        public HashSet(Collection<? extends E> c) {
            map = new HashMap<>(Math.max((int) (c.size()/.75f) + 1, 16));
            addAll(c);
        }
        

        And addAll does what you predict. The point being that you can probably save in the net amount of logic by iterating over the values directly, as you suggest, and checking the HashSet for whether each one already exists. There wouldn't be much of a change compared to what's going on under the covers anyway, as it's not like they do anything special to predict the size it's going to be... you'll just be doing the resizes as you iterate over values manually instead of all at once.
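        The one-pass approach described here can be sketched with plain Java collections (a hypothetical illustration; the names and the List return type are assumptions, since the actual UDF builds a Pig DataBag). Set.add returns false when the element is already present, so a single iteration over the values suffices:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical one-pass value-set extraction: rely on Set.add()
// returning false for duplicates instead of copying all values into
// a HashSet first and then iterating over the set.
public class ValueSetSketch {
    public static List<Object> uniqueValues(Map<String, Object> input) {
        Set<Object> seen = new HashSet<>();
        List<Object> out = new ArrayList<>();
        for (Object v : input.values()) {
            if (seen.add(v)) { // add() is false when v was already present
                out.add(v);
            }
        }
        return out;
    }
}
```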

        INVERSEMAP

        • In doInverse, I don't think you really need the chain of instanceofs. You're really just checking that it's not a DataBag or Tuple, right? You could do something like
          if (!(o instanceof Tuple || o instanceof DataBag))
            newKey = o.toString();
          

          The casting is unnecessary. If it is the proper type, then toString will give you what you want regardless.

        • For updating hashmaps in cases like this, I'm a fan of the following to avoid having to do containsKey and then a get right afterwards
          DataBag bag = (DataBag)inverseMap.get(newKey);
          if (bag==null) {
              //logic
          } else {
              bag.add( ... //more logic)
          }
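
          A runnable illustration of that idiom, using plain Java collections in place of Pig's DataBag (class and method names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of the single-lookup update idiom:
// one get() call replaces the containsKey()/get() pair.
public class MapUpdateSketch {
    public static void addKeyForValue(Map<String, List<String>> inverseMap,
                                      String newKey, String originalKey) {
        List<String> bag = inverseMap.get(newKey); // single lookup
        if (bag == null) {
            bag = new ArrayList<>();
            inverseMap.put(newKey, bag);
        }
        bag.add(originalKey);
    }
}
```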
          

        Thanks again for the legwork on this.

        Prashant Kommireddi added a comment -

        Jon, I made changes addressing most of your concerns. New patch uploaded.

        Jonathan Coveney added a comment -

        Prashant,

        Thanks for your work on this! It's 99.9% of the way there. A more administrative point...you'll need to reupload it with permissions given to apache to use. What follows is fairly nitpicky.

        • In valuelist, you have
          +            }
          +            else {
          

          just bump that else up

        Less nitpicky:

        • You have many cases where you catch (Exception). Try and catch the least general exception possible. Example: In valuelist, you have
          +        } catch (Exception e) {
          +            String msg = "Error while generating " + this.getClass().getSimpleName();
          +            throw new RuntimeException(msg, e);
          +        }
          

        (I'm not even quite sure what exception is being thrown?)

        Further, instead of throwing a RuntimeException, the convention (in exec) is to throw ExecException, since exec throws IOException anyway. The builtin functions have many examples of this. But yeah, in general, you want to catch the most specific exception possible in the most specific portion of code possible... I don't like the convention of wrapping an entire method in a try{}catch(Exception). At least, that's how I like to do it, and I'm reviewing this. So basically: try blocks should be around the most specific piece of code possible, should catch the most specific exception possible, and should throw ExecException if the code allows. Throwing a RuntimeException should be a last resort (and is really just jank around the annoying exceptions required by outputSchema).

        Similarly, in KEYSET you have

        +        } catch(FrontendException e) {
        +            // TODO Auto-generated catch block
        +            e.printStackTrace();
        +        }
        

        which I assume was an oversight left by the IDE (those things will be the death of us Java programmers...)

        • I think we had talked a bit about how to do VALUESET... I still think you could do it in one pass, but I don't think the performance gain will be so massive that you can't do it in two passes. Your comment about frequent resizing isn't really accurate, since if you look at the HashSet constructor, it doesn't do any fancy math...it just adds all of the elements, and resizes dynamically as you add more of them. Thus, you could just add to the HashSet, and if it wasn't previously contained, add it to the Tuple, thus doing it in one pass (since that first pass, dynamic resizing and all, is going to happen anyway). Does that make sense?

        Thanks again for your work on this. It is super close.
        Jon

        Prashant Kommireddi added a comment -

        Thanks for the review Jon.

        1. I agree on exception handling in most cases, and thanks for catching "printStackTrace" in the code; it wasn't my intention to leave it in there. In general, wrapping specific portions of code within try-catch is a good practice, but I prefer not breaking up try blocks into multiple ones when most lines within the method throw the same exception and it's not a lot of code otherwise. In these UDFs, Schema.getField is used more than once, and of course they all throw FrontendException.

        And looking at the builtin UDFs for examples was really not the best idea. Looks like some refactoring is required there.

        2. Regarding resizing of the HashSet, trying to optimize right now might be a bit premature. My comment about frequent resizing would make sense if the number of distinct elements among the Map values were large. A HashSet starts with an internal array of size 16 and starts expanding (creating a new array and copying elements over) once a certain load threshold is met. With the current approach, the HashSet implementation guesses an approximate array size based on the size of the Collection passed to its constructor. "it just adds all of the elements, and resizes dynamically as you add more of them" - this can be a costly operation if you start with 16 slots and the number of distinct values in the map runs into the thousands or millions. "since that first pass, dynamic resizing and all, is going to happen anyway" - the point is that the amount of resizing is not the same in the two cases. Either way, it's too early to be thinking about optimization there.
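
        The two-pass variant under discussion, sketched with plain java.util collections (hypothetical names, not the patch code): the HashSet(Collection) constructor sizes its table from the input up front, whereas incremental adds grow it from the default 16 slots.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TwoPassUnique {
    // Two-pass variant: the HashSet(Collection) constructor sizes its
    // table from values.size() before inserting, so large inputs avoid
    // repeated grow-and-rehash steps; the second pass materializes the
    // result.
    static <T> List<T> uniqueValues(List<T> values) {
        Set<T> unique = new HashSet<>(values); // pass 1: dedup, presized
        return new ArrayList<>(unique);        // pass 2: copy out
    }

    public static void main(String[] args) {
        System.out.println(uniqueValues(List.of(2, 2, 4)).size()); // prints 2
    }
}
```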

        I will upload a patch soon with the changes, thanks for reviewing again.

        Jonathan Coveney added a comment -

        So much refactoring is necessary in so many places...oy haha. As far as how much code to wrap...yeah, that's just style at this point. Not worth nitpicking over. I do have a couple of last comments...

        In VALUELIST (and a couple of other places I think)

        +        try {
        +            //Input must be of type Map. This is verified at compile time
        +            m = (Map<String, Object>)(input.get(0));
        +        } catch (ExecException e) {
        +            throw e;
        +        }
        

        the catch shouldn't be necessary anymore, since exec throws IOException and ExecException is an IOException.
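
        The same rule in plain Java, with FileNotFoundException standing in for ExecException (illustrative names, not the patch code): a subclass of a declared checked exception propagates on its own, so a catch block that only rethrows is dead weight.

```java
import java.io.FileNotFoundException;
import java.io.IOException;

public class RethrowDemo {
    // Declared "throws IOException", so FileNotFoundException (an
    // IOException subclass) propagates without any catch block --
    // the same reason exec() needs no catch for ExecException.
    static String read(boolean missing) throws IOException {
        if (missing) {
            throw new FileNotFoundException("no such input");
        }
        return "ok";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(read(false)); // prints ok
    }
}
```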

        More of a style point, but in KEYSET:

        +        Iterator<String> iter = keySet.iterator();
        +        DataBag bag = BAG_FACTORY.newDefaultBag();
        +        //Add keyset elements to bag
        +        while(iter.hasNext()) {
        +            Tuple t = TUPLE_FACTORY.newTuple(iter.next());
        +            bag.add(t);
        +        }
        

        could be rewritten as

        DataBag bag = BAG_FACTORY.newDefaultBag();
        for (String s : keySet) {
            Tuple t = TUPLE_FACTORY.newTuple(s);
            bag.add(t);
        }
        

        IMHO using the for construct instead of raw iterators is a lot clearer, but it's not a big deal... if there weren't other things I wouldn't have bothered mentioning it, but here it is haha.

        • You can have the doInverse method throw ExecException; then, instead of throwing RuntimeException, throw an ExecException, which you won't have to catch since exec throws IOException. Also, the exception message is a little misleading, as the error isn't that of getting an unknown primitive (primitive implies int etc.), but just a wrong type in general.

        That's it though. Should be good for committing after that.

        Thanks for the work!

        Prashant Kommireddi added a comment -

        Jon, addressed your feedback. Thanks.

        Jonathan Coveney added a comment -

        Man, I feel so nitpicky, though this one isn't so nitpicky: the javadoc formatting isn't correct. As is, when I compile the javadocs, I get this:

        This UDF accepts a Map as input with values of any primitive data type. UDF swaps keys with values and returns the new inverse Map. Note in case original values are non-unique, the resulting Map would contain String Key -> DataBag of values. Here the bag of values is composed of the original keys having the same value. Note: 1. UDF accepts Map with Values of primitive data type 2. UDF returns Map grunt> cat 1data [open#1,1#2,11#2] [apache#2,3#4,12#24] grunt> a = load 'data' as (M:[int]); grunt> b = foreach a generate INVERSEMAP($0); grunt> dump b; ([2#{(1),(11)},apache#{(open)}]) ([hadoop#{(apache),(12)},4#{(3)}])
        

        Other comments:

        • doInverse should return Map<String,DataBag>
        • INVERSEMAP should extend EvalFunc<Map<String,DataBag>>
        • inverseMap should be a Map<String,DataBag>
        
        DataBag bag = (DataBag)inverseMap.get(newKey);
        if (bag == null) {
            DataBag dataBag = BAG_FACTORY.newDefaultBag();
            dataBag.add(TUPLE_FACTORY.newTuple(entry.getKey()));
            inverseMap.put(newKey, dataBag);
        } else {
            bag.add(TUPLE_FACTORY.newTuple(entry.getKey()));
        }
        

        You can just reuse the bag reference instead of making a new dataBag.
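
        A plain-Java sketch of the grouped inverse (HashMap of List standing in for Pig's Map of DataBag; names are hypothetical, not the patch code): computeIfAbsent either creates the list or hands back the existing one, so a single reference is added to in both branches, which is the "reuse the bag reference" idea in one call.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InverseMapSketch {
    // Invert a map, grouping the original keys by shared value.
    // computeIfAbsent creates the list on first sight of a value and
    // returns the existing list afterwards, so we add through one
    // reference in both cases.
    static Map<String, List<String>> inverse(Map<String, ?> m) {
        Map<String, List<String>> out = new HashMap<>();
        for (Map.Entry<String, ?> e : m.entrySet()) {
            String newKey = String.valueOf(e.getValue());
            out.computeIfAbsent(newKey, k -> new ArrayList<>())
               .add(e.getKey());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Integer> m = Map.of("open", 1, "1", 2, "11", 2);
        System.out.println(inverse(m).get("2").size()); // prints 2
    }
}
```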

        That should be the last of the comments...

        Prashant Kommireddi added a comment -

        Jon, here it is.

        Jonathan Coveney added a comment -

        Prashant,

        Looks great. Literally one more comment and then I can commit: VALUESET seems to be in 2-space form instead of 4-space form. My guess is you forgot to change your IDE from the convention you use at work, or something like that. Fix that and then we're good to go.

        Jon

        Prashant Kommireddi added a comment -

        My bad, I used the HBase formatter for making edits on VALUESET. Seems like the conventions are different there. Uploading a new patch.

        Jonathan Coveney added a comment -

        Prashant: it's in. Thanks for your patience! Looking forward to seeing more contributions.

        Prashant Kommireddi added a comment -

        Thanks Jon. Did you forget to change the resolution status?

        Olga Natkovich added a comment -

        Can you please add to the release notes the UDFs that were added, as well as their syntax and usage examples? This is for inclusion in the documentation, thanks!

        Jonathan Coveney added a comment -

        Prashant, do you want to take the honor?

        Prashant Kommireddi added a comment -

        Sure, I can do that. Where exactly should I be adding this?

        Prashant Kommireddi added a comment -

        Btw, syntax and usage examples are all present in the java files for each of the UDFs. Lmk if it needs to go into the release notes (and where) and I can put it there.

        Olga Natkovich added a comment -

        Yes, please put all the information into the release notes. This way it is much easier to create a documentation patch.

        Prashant Kommireddi added a comment -

        Olga, adding release notes. Let me know if you need more info.


          People

          • Assignee: Prashant Kommireddi
          • Reporter: Jonathan Coveney
          • Votes: 0
          • Watchers: 5
