Pig / PIG-2443

[Piggybank] Add UDFs to check if a String is an Integer And if a String is Numeric

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: piggybank
    • Labels:
      None
    • Patch Info:
      Patch Available
    • Hadoop Flags:
      Reviewed
    • Tags:
      Piggybank, IsInt, IsNumeric

      Description

      A UDF that checks whether a String is numeric (or an Integer). Several tools, such as Splunk and AbInitio, have this UDF built in, and companies making an effort to move to Hadoop/Pig could use it.

      Use Case:
      In raw logs, certain filters/conditions are applied based on whether a particular field/value is numeric or not. For example: SPLIT A INTO CATEGORY1 IF IsInt($0), CATEGORY2 IF NOT IsInt($0);

      1. isIntNumeric.patch
        5 kB
        Prashant Kommireddi
      2. isIntNumeric.patch
        6 kB
        Prashant Kommireddi
      3. PIG-2443.patch
        17 kB
        Prashant Kommireddi
      4. PIG-2443.2.patch
        26 kB
        Prashant Kommireddi

        Activity

        Prashant Kommireddi added a comment -

        Daniel, thanks for the commit!
        And Jonathan, thanks for your input.

        Daniel Dai added a comment -

        I see two piggybank test failures: TestDBStorage and TestMultiStorageCompression. These are certainly not related to this patch; I will track them in a separate ticket.

        test-patch:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 15 new or modified tests.
        [exec]
        [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 508 release audit warnings (more than the trunk's current 501 warnings).

        Every new file contains the proper header, so the release audit warning can be ignored.

        Patch committed to trunk.

        Thanks Prashant for contributing!

        Prashant Kommireddi added a comment -

        Adding Apache License and Javadoc comments

        Daniel Dai added a comment -

        Hi, Prashant,
        Thanks for the patch. Please add the Apache License header to every new file you add. Also, can you add javadoc to every UDF you add? (You can provide a link if it is a repetition.)

        Prashant Kommireddi added a comment -

        Added IsFloat, IsDouble, IsLong. Also added test cases for the same.

        Added documentation for IsNumeric

        Jonathan Coveney added a comment -

        I don't think it's too confusing, I would just explicitly state the purpose of the UDF. It is a fair one, and is something that I've done manually before, so I think it makes sense to ask. Just document the purpose, and more importantly, what it doesn't do.

        Prashant Kommireddi added a comment -

        1. IsNumeric does not check for Long/Double range at all. It's simply a check to verify whether a String contains ONLY digits or not. The reason to implement this is to give users the ability to check for numeric"ness", and not necessarily to cast the value to a data type.

        Example: At my previous company we stored item listings as a Numeric value. These Item Listing IDs could go well beyond the range of Long/Double. If I tried to check for numeric"ness" based on a certain data type (long, double), it would fail.
        The reason I implemented this is that I currently use it only to SPLIT based on numeric"ness" in the log files. Once I have determined the SPLIT, I do not cast the field to a particular data type. And the field on which I call IsNumeric can be arbitrary in length.

        2. Good point again. I do not expect a huge gain, but a regex match will in most cases be slightly faster than parseDouble. Just to reiterate, the primary goal of implementing IsNumeric is not performance.

        I think IsNumeric is a nice-to-have UDF. But if it sounds like it would confuse users more than it's worth, we could just stick to IsInt/IsLong etc.
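The distinction between a digits-only check and a range-bound parse can be sketched in plain Java, outside of Pig. This is only an illustration: the class and method names below are hypothetical, the 30-digit ID is made up, and the pattern is precompiled because `String.matches` compiles its regex on every call.

```java
import java.util.regex.Pattern;

final class DigitsOnlyCheck {
    // Compiled once and reused for every input.
    private static final Pattern DIGITS = Pattern.compile("-?\\d+");

    /** True if the string is an optional minus sign followed by one or more digits. */
    static boolean isDigitsOnly(String s) {
        return s != null && DIGITS.matcher(s).matches();
    }

    /** True if Long.parseLong accepts the string, which also requires it to fit in 64 bits. */
    static boolean fitsInLong(String s) {
        try {
            Long.parseLong(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String listingId = "123456789012345678901234567890"; // hypothetical 30-digit ID
        System.out.println(isDigitsOnly(listingId)); // true
        System.out.println(fitsInLong(listingId));   // false
    }
}
```

A type-based check rejects the long ID even though every character is a digit, which is the case the digits-only check is meant to cover.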

        Jonathan Coveney added a comment -

        That sounds fine. In the case of numeric, I think we need to think about when you want it to return true.

        1. Should it only return true for valid Int, Long, Double, or Float values? Your example would return true for 111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111111, though this is way too large to be any of the above!
        2. A Java double can take the form 2.22e308, or what have you. You said that a regex is faster than parseDouble, but how much faster? You can build in all the rules, but eventually you're just reimplementing the logic of parseDouble.
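To make the second point concrete, here is a small plain-Java comparison of a simple decimal regex against `Double.parseDouble`. It is a sketch only: the class name, the method names, and the sample inputs are chosen for illustration, and the regex is one plausible "simple" pattern rather than anything taken from the patch.

```java
final class ParseDoubleVsRegex {
    /** Optional minus sign, digits, optional fractional part. */
    static boolean matchesSimpleRegex(String s) {
        return s.matches("-?\\d+(\\.\\d+)?");
    }

    /** True if Double.parseDouble accepts the string. */
    static boolean parsesAsDouble(String s) {
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] samples = {"1.5", "1e10", ".5", "NaN", "2.22e308", "abc"};
        for (String s : samples) {
            System.out.println(s + " regex=" + matchesSimpleRegex(s)
                    + " parseDouble=" + parsesAsDouble(s));
        }
    }
}
```

Of these samples, only "1.5" passes both checks and "abc" fails both; the rest are accepted by `parseDouble` but not by the regex. Note that "2.22e308" is larger than `Double.MAX_VALUE`, so it parses to positive infinity rather than throwing.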

        Prashant Kommireddi added a comment -

        I agree with you on that; it would be nice to have the UDF return the integer value if the input is an Integer, and a default otherwise. Maybe we can revisit that at a later time.

        As per your feedback, here is what I am going to do. Let me know your thoughts on this:

        1. IsNumeric works for floating points
        2. Override IsInt for Long, Double, Float

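        import java.io.IOException;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class IsNumeric extends EvalFunc<Boolean> {

            @Override
            public Boolean exec(Tuple input) throws IOException {
                // An empty or missing tuple is never numeric.
                if (input == null || input.size() == 0) return false;
                try {
                    String str = (String)input.get(0);
                    if (str == null || str.length() == 0) return false;

                    // Allow a single leading minus sign.
                    if (str.startsWith("-")) str = str.substring(1);

                    // Digits, optionally followed by a decimal point and more digits.
                    return str.matches("\\d+(\\.\\d+)?");

                } catch (ClassCastException e) {
                    // The first field was not a String.
                    warn(e.getMessage(), PigWarning.UDF_WARNING_1);
                    return false;
                }
            }
        }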
        
        Jonathan Coveney added a comment -

        Oh, I totally know what you mean, I'm just saying it would be cool... and given that you're doing, say, Integer.parseInt() and then just throwing away the result, it seems silly that someone would do the split, and then recast the int fields in the relation created by the data for which IsInt is true.

        There is currently no way for a UDF to produce a variable output schema (nor should there be, really). This would be something specific to this use of split.
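The "parse, then throw away the result" concern can be illustrated in plain Java, independent of how Pig handles schemas. The sketch below is hypothetical (the class and method names are not from the patch): it parses once and hands back the value, with `null` standing in for "not an int".

```java
final class ParseOnce {
    /** Returns the parsed value, or null when the input is not a valid int. */
    static Integer toIntOrNull(String s) {
        if (s == null || s.isEmpty()) return null;
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(toIntOrNull("42"));  // 42
        System.out.println(toIntOrNull("4.2")); // null
    }
}
```

A caller that needs both the yes/no answer and the value can test the result for `null` instead of parsing the same string twice.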

        Prashant Kommireddi added a comment -

        1. IsNumeric is not necessarily implemented for speed; rather, it's for a different requirement. That is, for cases when the user does not care whether a value is an Int/Long/Float/Double and simply would like to check if it is numeric. (Though this inherently gives you better performance.)

        2. I had originally thought of IsInt or IsNumeric as a UDF to determine if data is int/numeric, but not to actually make the cast.
        I am curious as to how the UDF could produce a variable output schema.

        Jonathan Coveney added a comment -

        Yeah, that's what I'd do. I wouldn't obsess over speed yet, I'd just implement it and see how fast it is, and then if it's prohibitively slow go from there.

        The more annoying issue is that since we're essentially converting it over, there's going to be two casts when there only needs to be one.

        You'll have IsInt() in the split, and then in the resultant field, you'll have to cast the int one over to an int. It'd be nice if it could take advantage of what is going on and post split, the true values will have the proper schema :int, and the ones that aren't will still be :chararray.

        Prashant Kommireddi added a comment -

        Good point. We could have functions IsFloat/IsDouble/IsLong overriding IsInt. And IsNumeric can be a single UDF that handles all the cases, since we do not have a notion of range checks on this UDF?
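One way to read "IsFloat/IsDouble/IsLong overriding IsInt" is a base class whose parse step is replaced by subclasses. The following is a minimal plain-Java sketch of that idea without the Pig `EvalFunc` plumbing; the names `IntCheck`, `LongCheck`, and `DoubleCheck` are hypothetical and not taken from the patch.

```java
final class TypeCheckDemo {
    public static void main(String[] args) {
        String big = "2147483648"; // one past Integer.MAX_VALUE
        System.out.println(new IntCheck().test(big));      // false
        System.out.println(new LongCheck().test(big));     // true
        System.out.println(new DoubleCheck().test("1.5")); // true
    }
}

class IntCheck {
    /** Hook for subclasses; must throw NumberFormatException on invalid input. */
    protected void parse(String s) {
        Integer.parseInt(s);
    }

    /** Shared null/empty handling and exception-to-boolean conversion. */
    public final boolean test(String s) {
        if (s == null || s.isEmpty()) return false;
        try {
            parse(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}

class LongCheck extends IntCheck {
    @Override
    protected void parse(String s) {
        Long.parseLong(s);
    }
}

class DoubleCheck extends IntCheck {
    @Override
    protected void parse(String s) {
        Double.parseDouble(s);
    }
}
```

Only the parse call differs between the types, so the guard logic lives in one place.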

        Jonathan Coveney added a comment -

        What about floats and doubles? Are we assuming they are not numeric?

        Prashant Kommireddi added a comment -

        Proposal to implement 2 UDFs

        1. IsInt
        2. IsNumeric

        IsInt is used to check whether the String input is an Integer. Note that this function checks for the Integer range -2,147,483,648 to 2,147,483,647.

        Use IsNumeric instead if you would like to check if a String is numeric. Also, IsNumeric performs better, as it's a regex match, compared to IsInt, which makes a call to Integer.parseInt(String input).

        IsInt checks whether making a call to Integer.parseInt results in a NumberFormatException and returns the boolean accordingly.

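        import java.io.IOException;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class IsInt extends EvalFunc<Boolean> {
            @Override
            public Boolean exec(Tuple input) throws IOException {
                // An empty or missing tuple is never an int.
                if (input == null || input.size() == 0) return false;
                try {
                    String str = (String)input.get(0);
                    if (str == null || str.length() == 0) return false;
                    // The parsed value is discarded; only success or failure matters.
                    Integer.parseInt(str);
                } catch (NumberFormatException nfe) {
                    // Not an integer, or outside the int range.
                    return false;
                } catch (ClassCastException e) {
                    warn("Unable to cast input "+input.get(0)+" of class "+
                            input.get(0).getClass()+" to String", PigWarning.UDF_WARNING_1);
                    return false;
                }

                return true;
            }
        }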
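The range behaviour of an `Integer.parseInt`-based check can be seen at the boundaries of the int range. This is a small standalone sketch in plain Java (no Pig classes); the class and helper names are hypothetical.

```java
final class IntRangeDemo {
    /** True if Integer.parseInt accepts the string. */
    static boolean isInt(String s) {
        try {
            Integer.parseInt(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isInt("2147483647"));  // true: Integer.MAX_VALUE
        System.out.println(isInt("2147483648"));  // false: one above the maximum
        System.out.println(isInt("-2147483648")); // true: Integer.MIN_VALUE
        System.out.println(isInt("-2147483649")); // false: one below the minimum
        System.out.println(isInt("12.0"));        // false: not an integer literal
    }
}
```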
        

        IsNumeric makes a Regex match against the Input to check whether all characters are numeric digits.

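        import java.io.IOException;

        import org.apache.pig.EvalFunc;
        import org.apache.pig.PigWarning;
        import org.apache.pig.data.Tuple;

        public class IsNumeric extends EvalFunc<Boolean> {

            @Override
            public Boolean exec(Tuple input) throws IOException {
                // An empty or missing tuple is never numeric.
                if (input == null || input.size() == 0)
                    return false;
                try {
                    String str = (String) input.get(0);
                    if (str == null || str.length() == 0)
                        return false;

                    // Allow a single leading minus sign.
                    if (str.startsWith("-"))
                        str = str.substring(1);

                    // Require at least one digit so that a lone "-" is not accepted.
                    return str.matches("\\d+");

                } catch (ClassCastException e) {
                    warn("Unable to cast input " + input.get(0) + " of class "
                            + input.get(0).getClass() + " to String",
                            PigWarning.UDF_WARNING_1);
                    return false;
                }
            }
        }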
        

        I have added Test cases for both UDFs as well.
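One edge case worth including when testing a sign-stripping, digits-only check is a lone "-": once the sign is removed the remainder is empty, and a pattern built with `*` accepts an empty string while one built with `+` does not. The plain-Java sketch below shows the difference; the class and method names are hypothetical and the code does not use the Pig classes.

```java
final class RegexEdgeCase {
    /** Strips one leading minus sign, then matches the rest against the given regex. */
    static boolean check(String s, String regex) {
        if (s == null || s.isEmpty()) return false;
        if (s.startsWith("-")) s = s.substring(1);
        return s.matches(regex);
    }

    public static void main(String[] args) {
        System.out.println(check("-", "\\d*"));   // true: empty remainder matches \d*
        System.out.println(check("-", "\\d+"));   // false: \d+ needs at least one digit
        System.out.println(check("-12", "\\d*")); // true
        System.out.println(check("-12", "\\d+")); // true
    }
}
```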


          People

          • Assignee:
            Prashant Kommireddi
            Reporter:
            Prashant Kommireddi
          • Votes:
            0
            Watchers:
            4
