Pig
  1. Pig
  2. PIG-2153

POProject throws an error with tuples containing a single non-tuple field

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: 0.8.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      When POProject.getNext(tuple) processes a tuple with one field, the field is pulled out. If that field is not a tuple, a cast exception is thrown. This is happening in the folliwing block of code at line 401.

      if(columns.size() == 1) {
      try{
      ret = inpValue.get(columns.get(0));
      ...
      res.result = (Tuple)ret;

      I am seeing this error in a unit test that is loading an array of floats. The LoadFunc is converting the array to bag, and wrapping the bag in a tuple.

      (

      {(3.3),(1.2),(5.6)})

      This results on POProject attempting to cast the bag to a tuple. Looking at the code, it appears that if I wrapped the previous tuple in another tuple, then it would work.

      (({(3.3),(1.2),(5.6)}

      ))

      In this case it would work because POProject would extract the first inner tuple and return it. But this would require the LoadFunc to check for tuples with a single non-tuple field and only wrap those.

      This could be fixed by first checking that the tuple does actually wrap another tuple.

      if(columns.size() == 1 && inpValue.getType(0) == DataType.TUPLE) {...

      I don't know the original intent of this code well enough to say this is the appropriate fix or not. Hoping someone with more Pig experience can help here. Right now this is preventing the unit tests in AvroStorage from working. I can change the unit test, but I think in this case the unit test is catching a real bug.

        Activity

        Hide
        Dmitriy V. Ryaboy added a comment -

        Linking to 1890.

        Ken, thanks for digging into this. Can you "svn blame" the code in this area and see who the most appropriate reviewer is?

        Show
        Dmitriy V. Ryaboy added a comment - Linking to 1890. Ken, thanks for digging into this. Can you "svn blame" the code in this area and see who the most appropriate reviewer is?
        Hide
        Ken Goodhope added a comment -

        I ran unit tests with the change I recommend in the description. Good news is several tests that failed before now work and are listed below.
        org.apache.pig.test.TestBestFitCast
        org.apache.pig.test.TestDataBagAccess
        org.apache.pig.test.TestGrunt
        org.apache.pig.test.TestImplicitSplit
        org.apache.pig.test.TestMapSideCogroup
        org.apache.pig.test.TestPigRunner
        org.apache.pig.test.TestPigSplit
        org.apache.pig.test.TestScriptUDF

        The bad news is several tests that were working now fail.
        org.apache.pig.test.TestBuiltin
        org.apache.pig.test.TestCollectedGroup
        org.apache.pig.test.TestCombiner
        org.apache.pig.test.TestCommit
        org.apache.pig.test.TestEvalPipeline2
        org.apache.pig.test.TestEvalPipelineLocal
        org.apache.pig.test.TestFRJoin2
        org.apache.pig.test.TestFilter
        org.apache.pig.test.TestForEach
        org.apache.pig.test.TestForEachNestedPlanLocal
        org.apache.pig.test.TestJoin
        org.apache.pig.test.TestJoinSmoke
        org.apache.pig.test.TestLimitAdjuster
        org.apache.pig.test.TestLocalRearrange
        org.apache.pig.test.TestNativeMapReduce
        org.apache.pig.test.TestNewPlanImplicitSplit
        org.apache.pig.test.TestProject
        org.apache.pig.test.TestStore
        org.apache.pig.test.TestStoreInstances
        org.apache.pig.test.TestUnionOnSchema

        Obviously, there are more tests that break than get fixed.

        Show
        Ken Goodhope added a comment - I ran unit tests with the change I recommend in the description. Good news is several tests that failed before now work and are listed below. org.apache.pig.test.TestBestFitCast org.apache.pig.test.TestDataBagAccess org.apache.pig.test.TestGrunt org.apache.pig.test.TestImplicitSplit org.apache.pig.test.TestMapSideCogroup org.apache.pig.test.TestPigRunner org.apache.pig.test.TestPigSplit org.apache.pig.test.TestScriptUDF The bad news is several tests that were working now fail. org.apache.pig.test.TestBuiltin org.apache.pig.test.TestCollectedGroup org.apache.pig.test.TestCombiner org.apache.pig.test.TestCommit org.apache.pig.test.TestEvalPipeline2 org.apache.pig.test.TestEvalPipelineLocal org.apache.pig.test.TestFRJoin2 org.apache.pig.test.TestFilter org.apache.pig.test.TestForEach org.apache.pig.test.TestForEachNestedPlanLocal org.apache.pig.test.TestJoin org.apache.pig.test.TestJoinSmoke org.apache.pig.test.TestLimitAdjuster org.apache.pig.test.TestLocalRearrange org.apache.pig.test.TestNativeMapReduce org.apache.pig.test.TestNewPlanImplicitSplit org.apache.pig.test.TestProject org.apache.pig.test.TestStore org.apache.pig.test.TestStoreInstances org.apache.pig.test.TestUnionOnSchema Obviously, there are more tests that break than get fixed.
        Hide
        Ken Goodhope added a comment -

        It looks like the last time this code was touched it was for PIG-1369 by Pradeep Kamath.

        Show
        Ken Goodhope added a comment - It looks like the last time this code was touched it was for PIG-1369 by Pradeep Kamath.
        Hide
        Ken Goodhope added a comment -

        I am the first to admit this is ugly, and if someone has a better idea I would be thrilled. I am currently running unit tests with this possible fix.

        if(columns.size() == 1 && ((!overloaded && inpValue.getType(0) == DataType.TUPLE) || (overloaded && inpValue.getType(0) == DataType.BAG))) {
        ...

        My current thinking is the reason the previous fix broke so many unit tests is single element tuples containing a databag are acceptable if overloaded is set. I will post the results of the tests when complete.

        This might fix the issue in ElephantBird, but I haven't had time to investigate that.

        Show
        Ken Goodhope added a comment - I am the first to admit this is ugly, and if someone has a better idea I would be thrilled. I am currently running unit tests with this possible fix. if(columns.size() == 1 && ((!overloaded && inpValue.getType(0) == DataType.TUPLE) || (overloaded && inpValue.getType(0) == DataType.BAG))) { ... My current thinking is the reason the previous fix broke so many unit tests is single element tuples containing a databag are acceptable if overloaded is set. I will post the results of the tests when complete. This might fix the issue in ElephantBird, but I haven't had time to investigate that.
        Hide
        Pradeep Kamath added a comment -

        I don't have full context and given that I have not actively looked at Pig code in quite a while, my comments should be taken with a grain of salt. I am assuming POProject.getNext(Tuple) is being called because the schema (of load?) says that a tuple field should be projected. If that is indeed the case, then shouldn't the LoadFunc be returning a Tuple (with the bag in it)? The outer tuple that the LoadFunc returns simply represents a record and does not count - the types of the fields inside the outer tuple are the ones that matter in the schema and if the schema says there is one field of type Tuple, then POProject would except a type Tuple - so am wondering if the cast is correct as it is.

        Again, I have been out of touch with Pig for a good 8 months now - so my thinking above could be completely wrong - hopefully the more active Pig committers can confirm/refute my hypothesis.

        Show
        Pradeep Kamath added a comment - I don't have full context and given that I have not actively looked at Pig code in quite a while, my comments should be taken with a grain of salt. I am assuming POProject.getNext(Tuple) is being called because the schema (of load?) says that a tuple field should be projected. If that is indeed the case, then shouldn't the LoadFunc be returning a Tuple (with the bag in it)? The outer tuple that the LoadFunc returns simply represents a record and does not count - the types of the fields inside the outer tuple are the ones that matter in the schema and if the schema says there is one field of type Tuple, then POProject would except a type Tuple - so am wondering if the cast is correct as it is. Again, I have been out of touch with Pig for a good 8 months now - so my thinking above could be completely wrong - hopefully the more active Pig committers can confirm/refute my hypothesis.
        Hide
        Ken Goodhope added a comment -

        Thanks Pradeep, that is actually very helpful. If I understand you correctly, the outer tuple isn't part of the schema returned by LoadFunc.getSchema(). Is it possible that the result of LoadFunc.getNext used to be wrapped in an implicit tuple, and that is no longer happening?

        The results of the unit tests with the fix I suggested in my last comment showed 11 tests now working that were broke before, and 11 tests now breaking that used to work. This makes me wonder if some of the tests have been written with the expectation there is an implicit wrapping tuple, and some have been written with expectation that there is no implicit wrapper. Am I missing something?

        Here are the test results.

        Test that were broke and now work.
        > [junit] Test org.apache.pig.test.TestBestFitCast
        > [junit] Test org.apache.pig.test.TestCounters
        > [junit] Test org.apache.pig.test.TestDataBagAccess
        > [junit] Test org.apache.pig.test.TestEmptyInputDir
        > [junit] Test org.apache.pig.test.TestImplicitSplit
        > [junit] Test org.apache.pig.test.TestInvoker
        > [junit] Test org.apache.pig.test.TestPigRunner
        > [junit] Test org.apache.pig.test.TestPigSplit
        > [junit] Test org.apache.pig.test.TestScriptLanguage
        > [junit] Test org.apache.pig.test.TestScriptUDF
        > [junit] Test org.apache.pig.test.TestSkewedJoin

        Tests that used to work, but break with the fix I tried.
        < [junit] Test org.apache.pig.test.TestCombiner FAILED
        < [junit] Test org.apache.pig.test.TestCommit FAILED
        < [junit] Test org.apache.pig.test.TestEvalPipeline2 FAILED
        < [junit] Test org.apache.pig.test.TestEvalPipelineLocal FAILED
        < [junit] Test org.apache.pig.test.TestForEachNestedPlanLocal FAILED
        < [junit] Test org.apache.pig.test.TestLimitAdjuster FAILED
        < [junit] Test org.apache.pig.test.TestMergeJoinOuter FAILED
        < [junit] Test org.apache.pig.test.TestProject FAILED
        < [junit] Test org.apache.pig.test.TestProjectRange FAILED
        < [junit] Test org.apache.pig.test.TestPruneColumn FAILED
        < [junit] Test org.apache.pig.test.TestUnionOnSchema FAILED

        Show
        Ken Goodhope added a comment - Thanks Pradeep, that is actually very helpful. If I understand you correctly, the outer tuple isn't part of the schema returned by LoadFunc.getSchema(). Is it possible that the result of LoadFunc.getNext used to be wrapped in an implicit tuple, and that is no longer happening? The results of the unit tests with the fix I suggested in my last comment showed 11 tests now working that were broke before, and 11 tests now breaking that used to work. This makes me wonder if some of the tests have been written with the expectation there is an implicit wrapping tuple, and some have been written with expectation that there is no implicit wrapper. Am I missing something? Here are the test results. Test that were broke and now work. > [junit] Test org.apache.pig.test.TestBestFitCast > [junit] Test org.apache.pig.test.TestCounters > [junit] Test org.apache.pig.test.TestDataBagAccess > [junit] Test org.apache.pig.test.TestEmptyInputDir > [junit] Test org.apache.pig.test.TestImplicitSplit > [junit] Test org.apache.pig.test.TestInvoker > [junit] Test org.apache.pig.test.TestPigRunner > [junit] Test org.apache.pig.test.TestPigSplit > [junit] Test org.apache.pig.test.TestScriptLanguage > [junit] Test org.apache.pig.test.TestScriptUDF > [junit] Test org.apache.pig.test.TestSkewedJoin Tests that used to work, but break with the fix I tried. < [junit] Test org.apache.pig.test.TestCombiner FAILED < [junit] Test org.apache.pig.test.TestCommit FAILED < [junit] Test org.apache.pig.test.TestEvalPipeline2 FAILED < [junit] Test org.apache.pig.test.TestEvalPipelineLocal FAILED < [junit] Test org.apache.pig.test.TestForEachNestedPlanLocal FAILED < [junit] Test org.apache.pig.test.TestLimitAdjuster FAILED < [junit] Test org.apache.pig.test.TestMergeJoinOuter FAILED < [junit] Test org.apache.pig.test.TestProject FAILED < [junit] Test org.apache.pig.test.TestProjectRange FAILED < [junit] Test org.apache.pig.test.TestPruneColumn FAILED < [junit] Test org.apache.pig.test.TestUnionOnSchema FAILED
        Hide
        Pradeep Kamath added a comment -

        Based on my (old) knowledge, the tuple returned by LoadFunc (LoadFunc always has to return a tuple) simply stands for the record and the schema deals with the types of the fields inside it. So if the schema is A:

        {t:tuple(i:int,c:char)}

        that means each record contains one field of type tuple which has an int and char). I would think this means the LoadFunc returns an outer tuple (for the record), with a tuple inside (standing for the field) which has int and char subfields. I will let the more active committers comment on whether anything with respect to LoadFunc tuple handling has changed. Hopefully I am not giving wrong information here based my old knowledge, apologies in advance if so.

        Show
        Pradeep Kamath added a comment - Based on my (old) knowledge, the tuple returned by LoadFunc (LoadFunc always has to return a tuple) simply stands for the record and the schema deals with the types of the fields inside it. So if the schema is A: {t:tuple(i:int,c:char)} that means each record contains one field of type tuple which has an int and char). I would think this means the LoadFunc returns an outer tuple (for the record), with a tuple inside (standing for the field) which has int and char subfields. I will let the more active committers comment on whether anything with respect to LoadFunc tuple handling has changed. Hopefully I am not giving wrong information here based my old knowledge, apologies in advance if so.
        Hide
        Pradeep Kamath added a comment -

        Also am wondering if changes (any fix) are needed in the appropriate LoadFunc rather than in POProject (if my initial hypothesis that the cast is valid is true)

        Show
        Pradeep Kamath added a comment - Also am wondering if changes (any fix) are needed in the appropriate LoadFunc rather than in POProject (if my initial hypothesis that the cast is valid is true)
        Hide
        Ken Goodhope added a comment -

        That makes sense, and if it is still the case it would mean the fix needs to occur in the LoadFunc and not POProject. This is also consistent with the original comments by Daniel Dae for PIG-1890. AvroStorage has always included the wrapping tuple as part of the schema. In most cases the outer tuple isn't really a wrapper, but a record with multiple fields and that works fine. Later tonight I will take a look and see what changes I need to make at the LoadFunc level. I am still perplexed why the incorrect behavior used to work. Thanks again Pradeep.

        Show
        Ken Goodhope added a comment - That makes sense, and if it is still the case it would mean the fix needs to occur in the LoadFunc and not POProject. This is also consistent with the original comments by Daniel Dae for PIG-1890 . AvroStorage has always included the wrapping tuple as part of the schema. In most cases the outer tuple isn't really a wrapper, but a record with multiple fields and that works fine. Later tonight I will take a look and see what changes I need to make at the LoadFunc level. I am still perplexed why the incorrect behavior used to work. Thanks again Pradeep.
        Hide
        Ken Goodhope added a comment -

        In my LoadFunc, I modified getSchema to check for a single element wrapping tuple and return the inner ResourceSchema when one is found. This fixed the errors I was getting from POProject.java. The unit tests for my LoadFunc are still breaking, because the output has changed. However I suspect the new output is correct, so after some more investigation I will probably change the unit tests. Why including the wrapping tuple in the schema used to work is still a mystery. Maybe someone currently working on the project can answer that question.

        Show
        Ken Goodhope added a comment - In my LoadFunc, I modified getSchema to check for a single element wrapping tuple and return the inner ResourceSchema when one is found. This fixed the errors I was getting from POProject.java. The unit tests for my LoadFunc are still breaking, because the output has changed. However I suspect the new output is correct, so after some more investigation I will probably change the unit tests. Why including the wrapping tuple in the schema used to work is still a mystery. Maybe someone currently working on the project can answer that question.
        Hide
        Ken Goodhope added a comment -

        The behavior of POProject is correct. LoadFuncs need make sure the pig schema they return does not include the implicit wrapping tuple. The schema should only reflect the contents inside the wrapping tuple.

        I am not 100% sure how this relates to the issue with ElephantBird, but I am reasonably convinced the problem there would lie in either how the schema is built, or possibly how the logical plan is being executed. Regardless I believe this jira can be closed, since POProject is no longer suspect.

        Show
        Ken Goodhope added a comment - The behavior of POProject is correct. LoadFuncs need make sure the pig schema they return does not include the implicit wrapping tuple. The schema should only reflect the contents inside the wrapping tuple. I am not 100% sure how this relates to the issue with ElephantBird, but I am reasonably convinced the problem there would lie in either how the schema is built, or possibly how the logical plan is being executed. Regardless I believe this jira can be closed, since POProject is no longer suspect.

          People

          • Assignee:
            Unassigned
            Reporter:
            Ken Goodhope
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development