Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.11
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None

      Description

      The current EvalFunc interface (and the associated Algebraic and Accumulator interfaces) has grown unwieldy. In particular, people have noted the following issues:

      1. Writing a UDF requires a lot of boilerplate code.
      2. Since UDFs are always passed a tuple, users are required to manage their own type checking for input.
      3. Declaring schemas for output data is confusing.
      4. Writing a UDF that accepts multiple different parameters (using getArgToFuncMapping) is confusing.
      5. Using Algebraic and Accumulator interfaces often entails duplicating code from the initial implementation.
      6. UDF implementors are exposed to the internals of Pig since they have to know when to return a tuple (Initial, Intermediate) and when not to (exec, Final).
      7. The separation of Initial, Intermediate, and Final into separate classes forces code duplication and makes it hard for UDFs in other languages to use those interfaces.
      8. There is unused code in the current interface that occasionally causes confusion (e.g. isAsynchronous).

      Any change must be done in a way that allows existing UDFs to continue working essentially forever.
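
      For reference, a condensed sketch of the pattern issues 5-7 describe, written against the current interfaces. This is not the exact builtin COUNT, just the same shape:

      import java.io.IOException;
      import org.apache.pig.Algebraic;
      import org.apache.pig.EvalFunc;
      import org.apache.pig.data.DataBag;
      import org.apache.pig.data.Tuple;
      import org.apache.pig.data.TupleFactory;

      public class MyCount extends EvalFunc<Long> implements Algebraic {
          private static final TupleFactory tf = TupleFactory.getInstance();

          @Override
          public Long exec(Tuple input) throws IOException {
              // Issue 2: the UDF must unpack and type-check the tuple itself.
              return ((DataBag) input.get(0)).size();
          }

          // Issues 6 and 7: the phases are separate classes, identified by name.
          public String getInitial() { return Initial.class.getName(); }
          public String getIntermed() { return Intermed.class.getName(); }
          public String getFinal() { return Final.class.getName(); }

          public static class Initial extends EvalFunc<Tuple> {
              public Tuple exec(Tuple input) throws IOException {
                  // Initial and Intermed must wrap their result in a tuple...
                  return tf.newTuple(((DataBag) input.get(0)).size());
              }
          }
          public static class Intermed extends EvalFunc<Tuple> {
              public Tuple exec(Tuple input) throws IOException {
                  return tf.newTuple(sumPartials(input));
              }
          }
          public static class Final extends EvalFunc<Long> {
              public Long exec(Tuple input) throws IOException {
                  // ...while exec and Final must not (issue 6).
                  return sumPartials(input);
              }
          }

          // Issue 5: even the shared summing logic must be factored out by hand.
          private static long sumPartials(Tuple input) throws IOException {
              long total = 0;
              for (Tuple t : (DataBag) input.get(0)) {
                  total += (Long) t.get(0);
              }
              return total;
          }
      }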

      Attachments

      1. examples.patch (9 kB) by Julien Le Dem
      2. PIG-newudf.patch (87 kB) by Alan Gates


          Activity

          Alan Gates added a comment -

          Attached is a first pass at this. This is not complete at all. In particular it does not deal with how these changes would affect the scripting language UDFs, which it needs to.

          This is just to give an idea of how we might approach this. I have tried to address the issues I noted in the description. In particular I have moved to an annotation-based scheme that allows users to have multiple instances of exec and other functions, but also to use a single function for multiple purposes (such as exec and initial). I have also tried to hide the internal wrapping/unwrapping details so developers could more easily share code between functions (e.g. initial and intermediate).

          The idea is that the current EvalFunc class would remain as is, with these methods being placed in a separate package. Any new functionality would only be added to these new classes.
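
          For illustration only, a hypothetical rendering of such a scheme; the annotation names and their definitions below are assumptions, not the contents of PIG-newudf.patch:

          import java.io.IOException;
          import java.lang.annotation.ElementType;
          import java.lang.annotation.Retention;
          import java.lang.annotation.RetentionPolicy;
          import java.lang.annotation.Target;
          import org.apache.pig.data.DataBag;
          import org.apache.pig.data.Tuple;

          // Hypothetical phase markers; the patch may name these differently.
          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Exec {}
          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Initial {}
          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Intermediate {}
          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Final {}

          public class Sum {
              // One plain method serves every phase; Pig would do the tuple
              // wrapping/unwrapping internally (addressing issue 6).
              @Exec @Initial @Intermediate @Final
              public long sum(DataBag values) throws IOException {
                  long total = 0;
                  for (Tuple t : values) {
                      total += ((Number) t.get(0)).longValue();
                  }
                  return total;
              }
          }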

          Julien Le Dem added a comment -

          Hi Alan,
          Thanks for starting this.
          I like the annotation approach. I'm attaching a list of examples (examples.patch) of what I had in mind (sorry, it's been sitting on my computer for a while).
          It is pretty close to what you are proposing.
          In particular, I'm suggesting (a rough sketch of this style follows below):

          • Not to have to extend EvalFunc at all: the udf context is provided through a @Context annotation and is different per call of the UDF (not per FuncSpec). The UDF context also specifies if we are in the frontend or backend and provides methods for optional information passed by the UDF (output schema, ...) and access to the distributed cache (name-spaced by the UDF context).
          • To have @Mapper @Combiner @Reducer instead of @Initial, @Intermediate, @Final.
          • We don't need to define @Accumulate if we allow the UDF to take an Iterator<Tuple> as an input.
          • I was trying to define schema-aware Tuples but did not get there yet.

          Let me know what you think.
          Julien
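
          A rough, hypothetical rendering of the style Julien describes; every name here (the annotations, UdfContext) is an assumption based on the comment above, not the contents of examples.patch:

          import java.lang.annotation.ElementType;
          import java.lang.annotation.Retention;
          import java.lang.annotation.RetentionPolicy;
          import java.lang.annotation.Target;
          import java.util.Iterator;
          import org.apache.pig.data.Tuple;

          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.METHOD) @interface Mapper {}
          @Retention(RetentionPolicy.RUNTIME) @Target(ElementType.PARAMETER) @interface Context {}

          interface UdfContext {
              boolean isFrontend();   // frontend vs. backend
              // output-schema and distributed-cache accessors would live here,
              // name-spaced by the UDF call site
          }

          public class CountValues {
              // No EvalFunc superclass; the context is injected per call of the
              // UDF, and Iterator<Tuple> input subsumes @Accumulate.
              @Mapper
              public long count(@Context UdfContext ctx, Iterator<Tuple> input) {
                  long n = 0;
                  while (input.hasNext()) {
                      input.next();
                      n++;
                  }
                  return n;
              }
          }
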
          Dmitriy V. Ryaboy added a comment -

          Some more thoughts on this:

          • We can easily match through reflection any static methods, and perhaps object methods for classes with no-arg constructors, invoker-style (a minimal sketch follows this list). This would allow us to transparently reuse a ton of existing java code without forcing people to write annotation-laden scaffolding. That means pushing the dynamicInvoker logic deeper into pig (right now it's essentially just a UDF hack).
          • not sure I'm comfortable with mapper / combiner / reducer annotations, or initial/intermediate/final. Ideally, for COUNT, for example, we want to be able to say "COUNT is equivalent to a SUM of COUNTs". As is, you don't allow that to happen – we have to reimplement for count. Can we allow udf authors to return to us method pointers?
          • a lot of the pain we have right now is from not providing proper Context objects. This forces us into a pretty tight space, design-wise. If we make a strict contract about when Contexts are passed in and available, we can add what we need to the context easily (definitely, things like the job conf, counter and logger helpers, exec mode, requested schema, input schema, etc. should be in there). The approach of squirreling things away into the conf on the fe and unrolling it in the first invocation of exec() on the be is error-prone and overly complex.
          • it would be awesome if tuples knew their schema, and you could get their fields by name as well as by index.
          • evalFuncs currently must return 1 row per 1 input row. This leads to hard-to-explain "filter for nulls" and "return a bag, then flatten" patterns (the latter is also potentially very expensive memory-wise). Ideally we could return a plain value, tuple, or bag and have Pig behave as it currently does, or have evalfuncs return Iterator<value/tuple/bag> and have Pig understand that means 0 to many results are coming out of the udf.
          • while we are experimenting with annotations, perhaps we can add something to advanced eval funcs that would let them tell the planner how much data they are producing? It'd be neat to be able to say that ngramming a text will blow up the number of records, while counting words will shrink it.
          • it's currently unclear when it's ok or not ok to reuse tuples. Tuple reuse is huge for efficiency (and a potential source of many bugs, so it's a tradeoff).
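
          A minimal sketch of the invoker idea from the first bullet above; StaticInvoker is a made-up name, not one of the existing dynamic invoker classes:

          import java.lang.reflect.Method;

          public class StaticInvoker {
              private final Method method;

              // Bind an arbitrary existing static method by name, with no
              // annotation scaffolding around it.
              public StaticInvoker(String className, String methodName, Class<?>... argTypes)
                      throws ReflectiveOperationException {
                  method = Class.forName(className).getMethod(methodName, argTypes);
              }

              public Object invoke(Object... args) throws ReflectiveOperationException {
                  return method.invoke(null, args);   // null receiver: static method
              }
          }

          // e.g. reuse java.lang.Math directly, no EvalFunc wrapper:
          //   StaticInvoker sqrt = new StaticInvoker("java.lang.Math", "sqrt", double.class);
          //   Object result = sqrt.invoke(2.0);
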
          Alan Gates added a comment -

          Responses to Julien's comments above:

          Not to have to extend EvalFunc at all: the udf context is provided through a @Context annotation and is different per call of the UDF (not per FuncSpec). The UDF context also specifies if we are in the frontend or backend and provides methods for optional information passed by the UDF (output schema, ...) and access to the distributed cache (name-spaced by the UDF context).

          What's the value of not extending EvalFunc? Your classes still have an init method, which you'll have to find through reflection. It seems like you're routing around Java's class hierarchy here.

          I also don't understand what passing the @Context annotation does. And I'm not clear what "a @Context annotation that is different per call of the UDF (not per FuncSpec)" means.

          I do like expanding the context object to contain more information we need to pass, including inbound schema and backend vs. frontend. I'm less sure about the distributed cache. I think it's much easier to make these explicit methods in the interface rather than hide everything in a config object. One of the most confusing things about the Hadoop interface is that lots of things are just "values" in a config object, and don't show up in the API docs anywhere. I guess if we expanded the config object to have methods like "addFileToCache" and "fetchFileFromCache", then it's fine.
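
          For concreteness, a sketch of the explicit-methods style described here; beyond addFileToCache/fetchFileFromCache, which are named above, the member names are assumptions:

          import java.io.InputStream;
          import java.util.Properties;
          import org.apache.pig.impl.logicalLayer.schema.Schema;

          public interface UDFCallContext {
              boolean isFrontend();            // frontend vs. backend
              Schema getInputSchema();         // inbound schema
              Properties getConfiguration();   // read-only job configuration

              // Explicit cache methods that show up in the API docs, rather
              // than opaque "values" hidden in a config object.
              void addFileToCache(String localPath, String cacheName);
              InputStream fetchFileFromCache(String cacheName);
          }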

          I was trying to define schema-aware Tuples but did not get there yet.

          I think this is separable from what I'm proposing here.

          Alan Gates added a comment -

          Responses to Dmitriy's comments:

          We can easily match through reflection any static methods, and perhaps object methods for classes with no-arg constructors, invoker-style. This would allow us to transparently reuse a ton of existing java code without forcing people to write annotation-laden scaffolding. That means pushing the dynamicInvoker logic deeper into pig (right now it's essentially just a UDF hack).

          I agree we should more deeply integrate the dynamicInvoker logic. We need to make it easier to declare (more along the line of defining a python UDF), we need to make it so you can use methods on objects, and we need a way to pass arguments to the constructors of those objects. But I don't see how that changes this proposal. I see that as a separate track from this.

          not sure I'm comfortable with mapper / combiner / reducer annotations, or initial/intermediate/final. Ideally, for COUNT, for example, we want to be able to say "COUNT is equivalent to a SUM of COUNTs". As is, you don't allow that to happen – we have to reimplement for count. Can we allow udf authors to return to us method pointers?

          Using annotations it seems difficult to return method pointers, since return types have to be a String, Enum, or Class. We could define some pidgin language where we return a string like @Intermediate("org.apache.pig.newudf.SUM.exec"), I suppose, but that seems nasty.

          Are there that many cases where there will be crossover for method implementations? You're right that the proposed annotation method only allows sharing of methods within a particular UDF, not across UDFs. If we think across UDFs will be that common, we could allow the UDF classname to be decoupled from the Pig UDF name, and then in the annotations indicate which UDFs a particular implementation is for. For example:

          @UDFName("SUM", "COUNT")
          public class SUMandCOUNT extends EvalFunc {
          ...
              // Used only as COUNT's initial phase.
              @Initial("COUNT")
              public long countInitial(int val) {
                  return 1;
              }

              // One body shared across phases; the annotation values name
              // the UDFs the method serves.
              @Initial("SUM")
              @Intermediate({"COUNT", "SUM"})
              @Final({"COUNT", "SUM"})
              public long verySharedCode(DataBag vals) {
                  ...
              }
          }


          But I'm not sure this is a frequent enough use case to build the interface around.

          a lot of the pain we have right now is from not providing proper Context objects. This forces us into a pretty tight space, design-wise. If we make a strict contract about when Contexts are passed in and available, we can add what we add to context easily (definitely, things like the job conf, counter and logger helpers, exec mode, requested schema, input schema, etc should be in there. The approach of squirreling things away into the conf on the fe and unrolling it in the first invocation of exec() on the be is error-prone and overly complex).

          Agreed, as with Julien's comment. I want to spend some more time thinking about Context objects, what should be there, and when we should pass them. I'll come back with more proposals there.

          it would be awesome if tuples knew their schema, and you could get their fields by name as well as by index.

          I think this is a separate topic.

          evalFuncs currently must return 1 row per 1 input row. This leads to hard-to-explain "filter for nulls" and "return a bag, then flatten" patterns (the latter is also potentially very expensive memory-wise). Ideally we could return a plain value, tuple, or bag and have Pig behave as it currently does, or have evalfuncs return Iterator<value/tuple/bag> and have Pig understand that means 0 to many results are coming out of the udf.

          In the case where more than 1 value comes out, what does Pig do? Put them in a bag? Auto-flatten them with other elements in the generate? This seems closely related to the OUTER_FLATTEN work Jonathan is proposing.

          while we are experimenting with annotations, perhaps we can add something to advanced eval funcs that would let them tell the planner how much data they are producing? It'd be neat to be able to say that ngramming a text will blow up the number of records, while counting words will shrink it.

          I like the idea, but I think we should wait until we have an optimizer that can make use of these. Otherwise we won't know what we should and shouldn't annotate for.

          it's currently unclear when it's ok or not ok to reuse tuples. Tuple reuse is huge for efficiency (and a potential source of many bugs, so it's a tradeoff).

          I'm not clear how this relates to the current topic.

          Alan Gates added a comment -

          BTW, thanks both Julien and Dmitriy for the feedback.

          Thejas M Nair added a comment -

          There is one problem with the annotation-based approach for output schema: you lose the benefit of having a function! Take the case of the builtin.TOBAG udf, where the output schema is computed based on the input type. To overcome this we can either continue to support getOutputSchema or have annotation support for specifying an equivalent function.

          I like Dmitriy's idea of letting the udf function return an iterator. It is the equivalent of the accumulator interface, but for udf output. Support for that could also be done as a second step (see the sketch below). The output can be treated as a bag if the iterator is an iterator of tuples. In other cases, I think we would need to force the user to use a flatten on the udf. Doing an implicit flatten in the udf is likely to be confusing.
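
          A sketch of that first step, assuming the hypothetical Iterator<Tuple> return style; the adapter name is made up:

          import java.util.Iterator;
          import org.apache.pig.data.BagFactory;
          import org.apache.pig.data.DataBag;
          import org.apache.pig.data.Tuple;

          public final class IteratorAdapter {
              private IteratorAdapter() {}

              // If a UDF returns Iterator<Tuple>, Pig could drain it into a
              // (spillable) bag on the UDF's behalf; other element types would
              // require an explicit flatten, as suggested above.
              public static DataBag toBag(Iterator<Tuple> results) {
                  DataBag bag = BagFactory.getInstance().newDefaultBag();
                  while (results.hasNext()) {
                      bag.add(results.next());
                  }
                  return bag;
              }
          }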

          I wonder if we should first make a decision about supporting a new list type that acts as a list of any type (unlike bag, which is always list of tuples). That would have an impact on what we decide the semantics of udf returning Iterator should be.

          Jonathan Coveney added a comment -

          I think this is a super key thing to do as far as making pig more extensible and usable. I think that the annotation stuff is nice, but it seems geared towards making it easy to write a UDF in as few lines as possible, which I don't necessarily think should be the goal...instead, I think the goal should be a rock-solid EvalFunc base which meets a set of goals we lay out, one of which should be allowing us to extend it to provide nice annotations and whatnot. Here are some worthwhile goals, which relate to what we have above (not in any particular order):

          1. Minimize code duplication. Alan, you asked if this is a thing, and it absolutely is. Right now you have to jump through many hoops in order to reuse code, and even in the builtin stuff, it's quite patchy (just look at Dmitriy's recent commit of -1500 lines of math function code, and the fact that more could be done still). I think the goal should be to make it really easy to build new UDFs using existing functionality in an elegant way. I agree with Dmitriy that using annotations could make that difficult.
          2. Making it MUCH more explicit what is happening where. Look at Julien's example, which checks if it is on the frontend or not...I agree with Alan that things like this could be split out, or at least made much clearer. You could have a frontendInit, frontendFinalize, backendInit, backendFinalize (a rough sketch follows this list). Once again, we could provide a much simpler "SimpleEvalFunc" that doesn't have all this, but I think part of the current problem with pig is that you have to jump through a ton of hoops to do somewhat reasonable things, and we should facilitate those things, because they will ultimately enable more elegant solutions (instead of manually serializing things in crazy places, etc).
          3. Directly relating to the above, it should be dead easy to pass information between the front end and the back end.
          4. Allowing functions to both receive and return iterators seems very usable, and would cut down on the same "get first element, cast to bag, iterate over bag" pattern that's in every script, and give us Accumulative UDFs for free. It would be nice to have one -> many functions where a given row may result in 0 or more results. Alan, your point is a good one, but I think we can present it in such a way that it's clear when a bag is being returned and when it isn't. Flattening of course gives the same result, but I am under the impression that by returning iterators we could be much more efficient about it. Maybe the solution is to have an OUTER_FLATTEN, and be smart about how we generate and flatten the intermediate data. I don't know much about how pig currently deals with that sort of thing.
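
          A rough sketch of the split suggested in point 2; every name here is an assumption:

          public interface UDFLifecycle {
              // Runs once on the client while the plan is built; the natural
              // place to compute state to ship to the backend (point 3).
              void frontendInit();
              void frontendFinalize();

              // Runs in each task, before the first and after the last call.
              void backendInit();
              void backendFinalize();
          }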

          Thejas: I think that a Bag of any type would be neat. Basically, other spillable data structures would be cool, and potentially a PrimitiveBag could see the exact same benefits that Dmitriy's PrimitiveTuples see.

          Jonathan Coveney added a comment -

          As a result of this JIRA: https://issues.apache.org/jira/browse/PIG-2430, I thought I'd mention some stuff Julien and I had chatted about.

          Right now, the whole getArgToFuncMapping thing is pretty rough around the edges. It doesn't handle varArgs, it uses a somewhat cumbersome FuncSpec, etc. It could be a lot more elegant to allow people to either register an EvalFunc, or an EvalFuncFactory. This would decouple creating evalfuncs based on whatever and the evalfuncs themselves. The factory could have any number of parameters and any number of methods, but it would allow people to more easily handle the various cases that come up in that JIRA. We could, of course, provide a lot of convenience methods and could make simpler interfaces to extend on top, but I think part of the difficulty of Pig is that you have your high level EvalFunc, and if you want to do anything beyond that, you're knee deep in the code. It'd be nice to have abstractions that provide a lot more power, and let you write more elegant layers on top to expose to people who aren't real power users.
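
          A sketch of what such a factory might look like; EvalFuncFactory and its method are hypothetical:

          import org.apache.pig.EvalFunc;
          import org.apache.pig.impl.logicalLayer.schema.Schema;

          public interface EvalFuncFactory {
              // Replaces getArgToFuncMapping: inspect the actual input schema
              // (varargs included) and construct an appropriate instance, with
              // whatever parameters the factory itself was built with.
              EvalFunc<?> create(Schema inputSchema);
          }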

          Raghu Angadi added a comment -
          1. +1 for making a context available (current UDFContext is not available for UDFs).
             • use case: I want to be able to write a UDF 'NullIfMissing()' used this way:
               a = load 'input' as (p:(one, two, three), q:int);
               b = foreach a generate NullIfMissing(p);
               describe b;
               {t: (one: bytearray, two: bytearray, three: bytearray)}
               -- NullIfMissing returns:
               -- (null, null, null) if 'p' is null
               -- (x, y, z) if p == (x, y, z)
               -- (x, y, null) if p == (x, y)
           2. making conf available (readonly is sufficient, and probably preferred, since a UDF context can be used to store any state).
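
           One possible implementation of NullIfMissing against the current interface, for illustration; the expected width would ideally come from the input schema via a context object, and passing it as a constructor argument here is just a stand-in:

           import java.io.IOException;
           import org.apache.pig.EvalFunc;
           import org.apache.pig.data.Tuple;
           import org.apache.pig.data.TupleFactory;

           public class NullIfMissing extends EvalFunc<Tuple> {
               private final int width;

               public NullIfMissing(String width) {   // Pig passes UDF constructor args as strings
                   this.width = Integer.parseInt(width);
               }

               @Override
               public Tuple exec(Tuple input) throws IOException {
                   Tuple p = (input == null || input.get(0) == null)
                           ? null : (Tuple) input.get(0);
                   // Always emit 'width' fields, padding the tail with nulls.
                   Tuple out = TupleFactory.getInstance().newTuple(width);
                   for (int i = 0; i < width; i++) {
                       out.set(i, (p != null && i < p.size()) ? p.get(i) : null);
                   }
                   return out;
               }
           }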

            People

            • Assignee: Alan Gates
            • Reporter: Alan Gates
            • Votes: 0
            • Watchers: 13
