Pig / PIG-1718

Cannot directly cast output of UDF

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.7.0
    • Fix Version/s: None
    • Component/s: impl
    • Labels: None
    • Environment: Macbook Pro 6.2, Ubuntu 10.04 AMD64, CDH3 beta 3

      Description

      I'm in the process of writing a suite of UDFs to deal with nested JSON data inside of Pig. In one case, I created a UDF of type EvalFunc<String> and wanted to use it like so:

      RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
      IN = foreach RAW generate id, ExtractString(json, 'count') as count:int;
      

      When I do this, I get the following error:

      ERROR 1022: Type mismatch merging schema prefix. Field Schema: chararray. Other Field Schema: count: int

      I can work around it by adding another projection that contains only a cast (as below), but I'd prefer it if the first form just worked.

      RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
      MID = foreach RAW generate id, ExtractString(json, 'count') as count;
      IN = foreach MID generate id, (int)count;
      

      I'd prefer not to have to write an ExtractInteger that extends EvalFunc<Integer> if I can avoid it. In our case, it gets even more cumbersome because we want something like ExtractStringTuple extends EvalFunc<Tuple>, which returns a tuple of strings without parsing the JSON over and over again:

      RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
      IN = foreach RAW generate id, ExtractStringTuple(json, 'name', 'count', 'mean') as (name, count:int, mean:double);
      

      As indicated, I have tested this with Pig 0.7.0. My apologies if this is already fixed in 0.8 since I was not able to test with a newer version.

        Activity

        Alan Gates added a comment -

        Does

        RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
        IN = foreach RAW generate id, (int)ExtractString(json, 'count') as count;
        

        work? That is the proper syntax.

        Mike Dillon added a comment -

        Yes, that syntax does work, but it's very hard to correlate output type annotations with field names for stuff like this:

        RAW = load 'input.tsv' using PigStorage as ( id: int, json: chararray );
        IN = foreach RAW generate id,
                (tuple(int,double))ExtractStringTuple(json, 'count', 'mean') as info (count, mean);
        

        It seems like enhancing Pig to allow the type annotations to sit right next to the field names for this case would be a big win. Not to mention the duplicate information about the type shape that is implicit in having both "tuple(int,double)" as a cast and "info(count, mean)" as a schema specification.

        Santhosh Srinivasan added a comment -

        This should be fixed as part of the semantics cleanup. The foreach currently accepts type information in the AS clause even though its semantics there are plain aliasing.

        Mike Dillon added a comment -

        Thanks for the update, Santhosh. Is the semantics cleanup targeted for a particular release or milestone? If so, it would be great if this JIRA issue could be included in that milestone, marked as depending on an upstream issue, or closed as a duplicate.

        Santhosh Srinivasan added a comment -

        The semantics cleanup is tracked on a wiki page (http://wiki.apache.org/pig/SemanticsCleanup). I have added this JIRA to the list of items.

        Mike Dillon added a comment -

        The table on that wiki page says that the changes required for this JIRA are backwards incompatible, but that only applies to the first of the two cleanup possibilities (i.e. removing the ability to have types in the AS clause). If the second option is chosen, so that the schema declared by an AS clause acts like an implied cast, then there is no backwards compatibility problem: any script that currently puts types in the AS clause only works if those types exactly match the source types. Anyone who had tried to rely on conversion (as I did) would have gotten an error.

        Incidentally, I hugely prefer the backwards compatible option of allowing an implied coercion in this case.
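
The implied-coercion option discussed above can be sketched in plain Java, outside of Pig. This is only an illustration of the idea, not Pig's implementation: the class ImpliedCastSketch and its methods coerce and applySchema are hypothetical names, the type names mirror the ones used in the AS clauses in this issue, and the sketch throws on malformed input, which may differ from how Pig's own casts behave.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: treat the types in an AS clause as implied casts.
class ImpliedCastSketch {

    // Coerce one string value to the declared type.
    // A null declared type means "no type given", so the value is only aliased.
    static Object coerce(String value, String declaredType) {
        if (value == null || declaredType == null) {
            return value;
        }
        switch (declaredType) {
            case "int":
                return Integer.valueOf(value.trim());
            case "double":
                return Double.valueOf(value.trim());
            case "chararray":
                return value;
            default:
                throw new IllegalArgumentException("unsupported type: " + declaredType);
        }
    }

    // Apply the declared types, position by position, to the strings a UDF returned.
    static List<Object> applySchema(List<String> fields, List<String> declaredTypes) {
        if (fields.size() != declaredTypes.size()) {
            throw new IllegalArgumentException("schema and tuple sizes differ");
        }
        List<Object> out = new ArrayList<>();
        for (int i = 0; i < fields.size(); i++) {
            out.add(coerce(fields.get(i), declaredTypes.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        // Corresponds to: ... as (name, count:int, mean:double)
        List<Object> row = applySchema(
                Arrays.asList("widget", "42", "3.5"),
                Arrays.asList(null, "int", "double"));
        System.out.println(row);
    }
}
```

Under this reading, a script whose declared types already match the source types behaves exactly as before, since coercing a value to its own type changes nothing.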

        Santhosh Srinivasan added a comment -

        I agree that only the first part is backward incompatible.


          People

          • Assignee: Unassigned
          • Reporter: Mike Dillon
          • Votes: 0
          • Watchers: 0
