Pig
  1. Pig
  2. PIG-1942

script UDF (jython) should utilize the intended output schema to more directly convert Py objects to Pig objects

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.8.0, 0.9.0
    • Fix Version/s: None
    • Component/s: impl
    • Labels:

      Description

      from https://issues.apache.org/jira/browse/PIG-1824

      import re
      @outputSchema("y:bag{t:tuple(word:chararray)}")
      def strsplittobag(content,regex):
              return re.compile(regex).split(content)
      

      does not work because split returns a list of strings. However, the output schema is known, and it would be quite simple to implicitly promote the string element to a tupled element.
      also, a list/array/tuple/set etc. are all equally convertable to bag, and list/array/tuple are equally convertable to Tuple, this conversion can be done in a much less rigid way with the use of the schema.

      this allows much more facile re-use of existing python code and less memory overhead to create intermediate re-converting of object types.
      I have written the code to do this a while back as part of my version of the jython script framework, i'll isolate that and attach.

      1. 1942_with_junit.patch
        65 kB
        Woody Anderson
      2. 1942.patch
        37 kB
        Woody Anderson

        Activity

        Woody Anderson created issue -
        Olga Natkovich made changes -
        Field Original Value New Value
        Fix Version/s 0.10 [ 12316246 ]
        Fix Version/s 0.9.0 [ 12315191 ]
        Hide
        Woody Anderson added a comment -

        I wanted to get this started, as this is a bit of a change.

        often, it seems that people misuse the outputSchema annotation such that the output does not match the specified schema. At least, there was a unit test that did this, and it's possible that a few users in the wild have this issue as well.

        At any rate, this patch includes code in JythonUtils that will coerce jythout object model output into the schema that the function is annotated with.

        It's faster than the existing code and has quite a bit more functionality. It can convert arrays and many more types than previously. It also makes it much easier and faster to convert [1,2,3] to a bag rather than in jython create [(1), (2), (3)].

        Given that this changes the functionality of udfs that use @outputSchema (by coercing schema adherence), we may want to use a different annotation, and allow outputSchema to exist in it's previous form, in that it doesn't actually convert the schema.

        Show
        Woody Anderson added a comment - I wanted to get this started, as this is a bit of a change. often, it seems that people misuse the outputSchema annotation such that the output does not match the specified schema. At least, there was a unit test that did this, and it's possible that a few users in the wild have this issue as well. At any rate, this patch includes code in JythonUtils that will coerce jythout object model output into the schema that the function is annotated with. It's faster than the existing code and has quite a bit more functionality. It can convert arrays and many more types than previously. It also makes it much easier and faster to convert [1,2,3] to a bag rather than in jython create [(1), (2), (3)] . Given that this changes the functionality of udfs that use @outputSchema (by coercing schema adherence), we may want to use a different annotation, and allow outputSchema to exist in it's previous form, in that it doesn't actually convert the schema.
        Woody Anderson made changes -
        Attachment 1942.patch [ 12478038 ]
        Hide
        Woody Anderson added a comment -

        i forgot to svn add my unit test that contains a lot of useful tests and comments.

        it's included in this patch. it has a timing loop at the end that you can enable by adding an annotation etc. or running it directly in eclipse etc. to show the performance difference between the methods.

        Show
        Woody Anderson added a comment - i forgot to svn add my unit test that contains a lot of useful tests and comments. it's included in this patch. it has a timing loop at the end that you can enable by adding an annotation etc. or running it directly in eclipse etc. to show the performance difference between the methods.
        Woody Anderson made changes -
        Attachment 1942_with_junit.patch [ 12478077 ]
        Woody Anderson made changes -
        Assignee Woody Anderson [ woody.anderson@gmail.com ]
        Hide
        Alan Gates added a comment -

        marking as SubmitPatch so this can be reviewed and committed.

        Show
        Alan Gates added a comment - marking as SubmitPatch so this can be reviewed and committed.
        Alan Gates made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Hide
        Thejas M Nair added a comment -

        I think this is a very good idea, it will make it easier to write python udfs.
        The patch is like one that introduces several new API's. Each type conversion behavior introduced here will need to be retained to preserve backward compatibility. I think we should restrict the conversions to the cases where we are sure it makes sense (return either null+warning or error in other cases).

        Review of 1942_with_junit.patch

        • I think when the tuple schema and python udf return value are not compatible (for example when number of fields in schema are less than number of fields in object returned by udf), it should return null + warning. The case where the number of fields in object returned by python udf is fewer than ones in schema, null fields should be appended to new tuple to match schema.
        • In JythonUtils.asBag, I am not sure if the automatic decision made for the type converting using contents of udf output object is worth the increase in complexity and potential for surprises. I think the user should wrap the 'list type' within another list type for the case when the schema represents a bag of tuples. ie do type conversions for only the cases where compatible == true.
        • In JythonUtils.asBag, Why are the null values skipped ? This behavior is not consistent to behavior in other places in pig.
        • Why does the patch do type conversions for non python datatypes ? Are these expected from python udf output ?
        Show
        Thejas M Nair added a comment - I think this is a very good idea, it will make it easier to write python udfs. The patch is like one that introduces several new API's. Each type conversion behavior introduced here will need to be retained to preserve backward compatibility. I think we should restrict the conversions to the cases where we are sure it makes sense (return either null+warning or error in other cases). Review of 1942_with_junit.patch I think when the tuple schema and python udf return value are not compatible (for example when number of fields in schema are less than number of fields in object returned by udf), it should return null + warning. The case where the number of fields in object returned by python udf is fewer than ones in schema, null fields should be appended to new tuple to match schema. In JythonUtils.asBag, I am not sure if the automatic decision made for the type converting using contents of udf output object is worth the increase in complexity and potential for surprises. I think the user should wrap the 'list type' within another list type for the case when the schema represents a bag of tuples. ie do type conversions for only the cases where compatible == true. In JythonUtils.asBag, Why are the null values skipped ? This behavior is not consistent to behavior in other places in pig. Why does the patch do type conversions for non python datatypes ? Are these expected from python udf output ?
        Hide
        Woody Anderson added a comment -

        I think your feedback raises some fair questions, but I have some reasons for disagreeing:

        wrt schema has fewer fields than actual:
        this is a common case for me b/c pig doesn't allow specification of *-tuple i.e. all rows of data will have the same number of (int) elements, but it's not known how many. This is a week area of pig in general imho. If there is only 1 element in the tuple it can be seen to infer some type information for the remaining rows. (at least i think this is how 'tuple(int)' shows up). I think that when there are more than 1 columns in a tuple, then it's not a generic tuple, then i can see an error being appropriate. but for 1-tuples i appreciate the flexibility of using it for the type and writing udfs that accept tuples of arbitrary dimension, even if the args-to-function stuff is to simplistic to apply in this scenario it's easy enough to write useful udfs that utilize tuple dimension flexibility.

        I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code. This may be where reasonable people disagree, but i think null rather than a tuple reflecting the returned data is less expected.

        The whole reporting of 'schema != data' could be improved tho. I am not sure of the best way to reflect that anything "grey/WARN" is happening. It seems liking logging 1 line per encountered edge case is major overkill, and prone to generate huge log output. We could count each WARN scenario and log/counter that information to give a more succinct description of execution behavior that a simple user can fix, and an advanced user can ignore judiciously. Possibly more specific counters and only 1 warn per type per execution.

        Pig schemas are often so.. imprecise, that i think best effort coercion is useful, but i think a fine compromise would be to support only a specific set of conversions that would be a subset of this patch, but perform the others b/c they are mostly intuitive and useful, but a WARN will be generated when executed if we think it's too esoteric. We may draw lines in slightly different places, but i tried to cover a fair number of cases in the test code, which is think is a fairly survey of expected coercions.

        wrt JU.asBag, i think auto-tupling is a must. This is one of the most common mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a very common refrain, and just silly 99.9% of the time. Plus it's a bunch of extra unnecessary objects that one must create, and causes a bit slower execution for simple udfs.
        I'd have to re-read the code again to examine the edge cases. I do recall the disambiguation for embedded bags being a pain to write and describe. Documentation being the remaining concern. That said, i think it does something reasonable and still executes faster than existing rigid code. Also in the code is a decent synopsis of the disambiguations that are intended.

        wrt skipping nulls: can you cite the line number? do you mean skipping null bags? or null element/tuples when creating a bag? This might just be me not understanding something properly. I thought bags didn't have null tuples, just tuples with null elements?

        wrt various types:
        jython is fully capable of returning any jvm type. so that means anything really.
        I decided to cover the collections classes, lang classes, base types, and PY* classes.
        Jython is nice in that many classes implement the collections ifaces, but not always as efficiently as using the python classes directly.
        this is common in python/jython of course. not in udfs as of yet... b/c it wasn't allowed. But i began doing it pretty quickly once it was possible.

        Show
        Woody Anderson added a comment - I think your feedback raises some fair questions, but I have some reasons for disagreeing: wrt schema has fewer fields than actual: this is a common case for me b/c pig doesn't allow specification of *-tuple i.e. all rows of data will have the same number of (int) elements, but it's not known how many. This is a week area of pig in general imho. If there is only 1 element in the tuple it can be seen to infer some type information for the remaining rows. (at least i think this is how 'tuple(int)' shows up). I think that when there are more than 1 columns in a tuple, then it's not a generic tuple, then i can see an error being appropriate. but for 1-tuples i appreciate the flexibility of using it for the type and writing udfs that accept tuples of arbitrary dimension, even if the args-to-function stuff is to simplistic to apply in this scenario it's easy enough to write useful udfs that utilize tuple dimension flexibility. I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code. This may be where reasonable people disagree, but i think null rather than a tuple reflecting the returned data is less expected. The whole reporting of 'schema != data' could be improved tho. I am not sure of the best way to reflect that anything "grey/WARN" is happening. It seems liking logging 1 line per encountered edge case is major overkill, and prone to generate huge log output. We could count each WARN scenario and log/counter that information to give a more succinct description of execution behavior that a simple user can fix, and an advanced user can ignore judiciously. Possibly more specific counters and only 1 warn per type per execution. Pig schemas are often so.. imprecise, that i think best effort coercion is useful, but i think a fine compromise would be to support only a specific set of conversions that would be a subset of this patch, but perform the others b/c they are mostly intuitive and useful, but a WARN will be generated when executed if we think it's too esoteric. We may draw lines in slightly different places, but i tried to cover a fair number of cases in the test code, which is think is a fairly survey of expected coercions. wrt JU.asBag, i think auto-tupling is a must. This is one of the most common mistakes for jython udf devs. "why must i wrap tokens inside of tuples" is a very common refrain, and just silly 99.9% of the time. Plus it's a bunch of extra unnecessary objects that one must create, and causes a bit slower execution for simple udfs. I'd have to re-read the code again to examine the edge cases. I do recall the disambiguation for embedded bags being a pain to write and describe. Documentation being the remaining concern. That said, i think it does something reasonable and still executes faster than existing rigid code. Also in the code is a decent synopsis of the disambiguations that are intended. wrt skipping nulls: can you cite the line number? do you mean skipping null bags? or null element/tuples when creating a bag? This might just be me not understanding something properly. I thought bags didn't have null tuples, just tuples with null elements? wrt various types: jython is fully capable of returning any jvm type. so that means anything really. I decided to cover the collections classes, lang classes, base types, and PY* classes. Jython is nice in that many classes implement the collections ifaces, but not always as efficiently as using the python classes directly. this is common in python/jython of course. not in udfs as of yet... b/c it wasn't allowed. But i began doing it pretty quickly once it was possible.
        Hide
        Thejas M Nair added a comment -

        wrt schema has fewer fields than actual:

        I think pig schema needs to support the feature of specifying partial schemas, or types for schemas with variable number of fields of certain/unspecified type. But I think it is better to have this feature in schema, rather than doing conversions based on a schema that is not compatible. Also, I think it is a good thing to check for schema consistency, so that the user knows when they make a mistake.

        I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code.

        Returning null+WARN is the convention followed in load funcs like PigStorage and in type conversion code. But I see that the situation in type conversion is different because there is no reasonable interpretation if the type conversion fails.
        Does any body else have opinions on this ?

        Re: logging 1 line per warning.
        PigLogger.warn(..) can be used to aggregate the warnings.

        Regarding auto-tupling, I agree that it is useful when -
        1. output schema is a tuple
        2. output schema is a bag of tuples with single fields.
        But if it is a bag of tuples that have multiple fields, I think it is makes sense for the output value to have a list type representing the tuple.

        If output schema is 
        {(int, int)}
        
        I think the output value should look like - ((1,2),(3,4)). I don't see a need to convert (1,2) into ((1,2)).
        

        I also have concern that with auto tupling, python udf users will have an incorrect understanding of pig bags of primitive types. They might not realize that the bags always contain a tuple. How much of a performance difference did you notice while adding adding tuple wrappers for fields in a bag? I am trying to evaluate the option of providing utility libraries that python udfs can use to convert to pig type.

        Regarding skipping nulls in JythonUtils.asBag, it is at line 491. I am not sure about if pig actually works with null tuples in a bag, I need to check that.

                  if (it != null) {
                        while (first == null && it.hasNext()) {
                            first = it.next();
                        }
                  } 
        
        Show
        Thejas M Nair added a comment - wrt schema has fewer fields than actual: I think pig schema needs to support the feature of specifying partial schemas, or types for schemas with variable number of fields of certain/unspecified type. But I think it is better to have this feature in schema, rather than doing conversions based on a schema that is not compatible. Also, I think it is a good thing to check for schema consistency, so that the user knows when they make a mistake. I am against the idea of returning null and WARN (nearly as a rule). I think a reasonable interpretation is always better than NULL (with WARN). I would only advocate for an actual error that forces a user to rectify their code. Returning null+WARN is the convention followed in load funcs like PigStorage and in type conversion code. But I see that the situation in type conversion is different because there is no reasonable interpretation if the type conversion fails. Does any body else have opinions on this ? Re: logging 1 line per warning. PigLogger.warn(..) can be used to aggregate the warnings. Regarding auto-tupling, I agree that it is useful when - 1. output schema is a tuple 2. output schema is a bag of tuples with single fields. But if it is a bag of tuples that have multiple fields, I think it is makes sense for the output value to have a list type representing the tuple. If output schema is {( int , int )} I think the output value should look like - ((1,2),(3,4)). I don't see a need to convert (1,2) into ((1,2)). I also have concern that with auto tupling, python udf users will have an incorrect understanding of pig bags of primitive types. They might not realize that the bags always contain a tuple. How much of a performance difference did you notice while adding adding tuple wrappers for fields in a bag? I am trying to evaluate the option of providing utility libraries that python udfs can use to convert to pig type. Regarding skipping nulls in JythonUtils.asBag, it is at line 491. I am not sure about if pig actually works with null tuples in a bag, I need to check that. if (it != null ) { while (first == null && it.hasNext()) { first = it.next(); } }
        Hide
        Alan Gates added a comment -

        Unlinking from 0.10 as we don't seem to have agreement on how to proceed yet.

        Show
        Alan Gates added a comment - Unlinking from 0.10 as we don't seem to have agreement on how to proceed yet.
        Alan Gates made changes -
        Fix Version/s 0.10 [ 12316246 ]
        Hide
        Alan Gates added a comment -

        Marking open pending response to Thejas' comments.

        Show
        Alan Gates added a comment - Marking open pending response to Thejas' comments.
        Alan Gates made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]

          People

          • Assignee:
            Woody Anderson
            Reporter:
            Woody Anderson
          • Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

            • Created:
              Updated:

              Development