Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      We have several requests to make input schema available to the UDF for inspection.

      1. PIG-2337.patch
        14 kB
        xuting zhao
      2. PIG-2337-2.patch
        7 kB
        xuting zhao
      3. PIG-2337-3.patch
        8 kB
        xuting zhao
      4. PIG-2337-4.patch
        10 kB
        xuting zhao
      5. PIG-2337-5.patch
        8 kB
        Daniel Dai

          Activity

          Russell Jurney added a comment -

          Thanks, I looked for that, I read the Javadoc and the classes... and couldn't find it

          To answer your question: stupidity mostly.

          Dmitriy V. Ryaboy added a comment -

          Is anything preventing you from calling

          new ResourceSchema(Schema pigSchema)

          ?

          Russell Jurney added a comment -

          I am looking at JsonStorage in the process of writing ToJson as a builtin, and it uses ResourceSchema, which is serializable. The implementation uses features like field.getName(), schema.getFields(), and field.getSchema(). Can you tell me the Schema equivalents?

          Jonathan Coveney added a comment -

          Russell: I'm not sure what you mean? When it comes to UDFs, pretty much everything is in terms of Schemas. Can you give a case where a Schema isn't sufficient, but a ResourceSchema would be?

          Further, converting to a ResourceSchema is pretty easy, but I'd be very curious to know of a use case.

          Russell Jurney added a comment -

          One problem with this patch/commit: what you really need is a ResourceSchema, not a schema

          Daniel Dai added a comment -

          Unit tests pass. Test-patch result:
          [exec] -1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 10 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] -1 release audit. The applied patch generated 468 release audit warnings (more than the trunk's current 461 warnings).

          No new file added, ignore release audit warning.

          Patch committed to trunk. Thanks Xuting!

          Daniel Dai added a comment -

          PIG-2337-5.patch includes some minor changes to the javadoc.

          Daniel Dai added a comment -

          Everything looks right now. Will commit once tests pass.

          xuting zhao added a comment -

          Hi Daniel,

          I have modified jsFunction and changed the function names in EvalFunc to getInputSchema() and setInputSchema(). test-commit and the e2e test on the new test case, UDFContextAuto, have been run successfully.

          Xuting

          Daniel Dai added a comment -

          Patch looks good. jsFunction seems to be a perfect candidate to use this new infrastructure. I would suggest doing that in this patch as well (and we can also reclaim the name they stole).

          xuting zhao added a comment -

          Thanks for the comments.
          The 2337-3.patch is the modified one. The changes include:
          1. moving the store and retrieve actions of signature into UserFuncExpression and POUserFunc.
          2. changing the prefix of the key to: "pig.evalfunc.inputshcmea."
          3. adding javadoc to getEFInputSchema.
          4. I tried to rename autoGetInputSchema and autoSetInputSchema to getInputSchema and setInputSchema. However, jsFunction already has functions with the same names: private Schema getInputSchema() and private Schema setInputSchema(). Using those names causes an error, so I changed them to setEFInputSchema() and getEFInputSchema() instead.

          ant commit-test and e2e test on the new test case: UDFContextAuto have been successfully run.

          Daniel Dai added a comment -

          Thanks for the patch. A couple of comments:
          1. You cannot put logic in EvalFunc.setUDFContextSignature() and EvalFunc.getUDFContextSignature(), because users might override them. So don't save the signature in EvalFunc. You can do the serialization in UserFuncExpression.getFieldSchema (you are almost there, except you need to use the signature in UserFuncExpression). You can do the deserialization in POUserFunc.instantiateFunc.
          2. The prefix for the key, "pig.evalfunc.signature", would be better as "pig.evalfunc.inputschema".
          3. The method name "autoGetInputSchema" would be better as "getInputSchema", and "autoSetInputSchema" as "setInputSchema".
          4. getInputSchema is user facing, so it should have some javadoc.
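
Point 1 rests on a general Java hazard: framework logic placed inside an overridable method is skipped when a subclass overrides that method without calling super. The block below is a minimal sketch of that hazard only. All class names are hypothetical and merely mimic the shape of EvalFunc; this is not Pig code.

```java
// Hypothetical classes; they only mimic the shape of the problem, not Pig's EvalFunc.
class FragileFunc {
    // Framework bookkeeping kept inside the UDF base class.
    protected String savedSignature;

    // Framework logic lives in an overridable method, which is the risky part.
    public void setUDFContextSignature(String signature) {
        this.savedSignature = signature;
    }
}

class UserFunc extends FragileFunc {
    String mine;

    @Override
    public void setUDFContextSignature(String signature) {
        // The user overrides the hook and does not call super,
        // so the base-class bookkeeping never runs.
        this.mine = signature;
    }
}

final class OverrideHazardDemo {
    // Sets the signature, then returns what the framework would later read back.
    static String frameworkView(FragileFunc f, String signature) {
        f.setUDFContextSignature(signature);
        return f.savedSignature;
    }

    public static void main(String[] args) {
        System.out.println(frameworkView(new FragileFunc(), "sig")); // prints "sig"
        System.out.println(frameworkView(new UserFunc(), "sig"));    // prints "null"
    }
}
```

Keeping the signature on the caller's side, as the review proposes with UserFuncExpression and POUserFunc, avoids depending on what a UDF author chooses to override.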

          xuting zhao added a comment -

          1. test-commit has been run successfully with this patch.
          2. An e2e test has been added.

          Daniel Dai added a comment -

          Should be separate. I will try to commit PIG-2338 ASAP. So you can regenerate the patch.

          xuting zhao added a comment -

          Hi Daniel,
          I included the modifications from 2338 in this patch. I am not sure whether I need to separate them or keep them together. Thanks.

          Daniel Dai added a comment -

          You will need to wait for PIG-2338 to be checked in, right?

          xuting zhao added a comment -

          1. test-commit has been run successfully on this patch.
          2. A unit test has been added in TestSchema.java as testAutoSchemaSerialization().

          Daniel Dai added a comment -

          Currently, if users want the input schema for an EvalFunc, they need to handle it themselves:
          1. In the front end, serialize the schema and put it in UDFContext in the method outputSchema.
          2. In the back end, deserialize the schema from UDFContext.

          The sample use case can be found in TestSchema.InputSchemaUDF (https://svn.apache.org/repos/asf/pig/trunk/test/org/apache/pig/test/TestSchema.java)

          This process is quite involved and we should do it automatically. This involves:
          1. In the front end, we serialize the schema and put it in UDFContext for every EvalFunc.
          2. In the back end, we deserialize the schema for every EvalFunc.
          3. Users can call EvalFunc.getSchema() to retrieve the input schema for this EvalFunc.

          To do this, we need a unique signature for each EvalFunc so we can use it as a key to store to and retrieve from UDFContext. This mechanism is not there yet and is tracked in PIG-2338.
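
The automatic flow described above can be modelled without Pig on the classpath. The block below is a minimal, hypothetical sketch, not Pig's implementation. InputSchemaSketch, frontEndStore, backEndLoad, and the "name:type" string encoding are stand-ins invented for illustration. A plain java.util.Properties plays the role of the UDFContext property bag, and the key prefix follows the review suggestion made elsewhere in this thread.

```java
import java.util.Properties;

// Hypothetical stand-in for the mechanism; none of these names are Pig APIs.
final class InputSchemaSketch {
    // Key prefix suggested in review; the signature makes the key unique per UDF instance.
    static final String KEY_PREFIX = "pig.evalfunc.inputschema.";

    // Front end: serialize the input schema and store it under the UDF's signature.
    static void frontEndStore(Properties udfProps, String signature,
                              String[] fieldNames, String[] fieldTypes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fieldNames.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(fieldNames[i]).append(':').append(fieldTypes[i]);
        }
        udfProps.setProperty(KEY_PREFIX + signature, sb.toString());
    }

    // Back end: look the schema up by signature and deserialize it into {name, type} pairs.
    // Returns null when nothing was stored for that signature.
    static String[][] backEndLoad(Properties udfProps, String signature) {
        String encoded = udfProps.getProperty(KEY_PREFIX + signature);
        if (encoded == null) {
            return null;
        }
        if (encoded.isEmpty()) {
            return new String[0][];
        }
        String[] parts = encoded.split(",");
        String[][] fields = new String[parts.length][];
        for (int i = 0; i < parts.length; i++) {
            fields[i] = parts[i].split(":", 2);
        }
        return fields;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // Two UDF instances in one script, each with its own signature and input schema.
        frontEndStore(props, "udf-1", new String[] {"name", "age"}, new String[] {"chararray", "int"});
        frontEndStore(props, "udf-2", new String[] {"score"}, new String[] {"double"});

        for (String[] field : backEndLoad(props, "udf-1")) {
            System.out.println(field[0] + " -> " + field[1]);
        }
    }
}
```

The per-instance signature is what keeps two uses of the same UDF class from overwriting each other's schema, which is why this work depends on PIG-2338.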


            People

            • Assignee:
              xuting zhao
              Reporter:
              Olga Natkovich
            • Votes:
              0
              Watchers:
              3