Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.13.0
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Under certain conditions PigServer creates a cycle in the logical plan. Consider the following pseudocode:

      A = load from 'A' using F1;
      ...process...
      B = store X into 'B' using F2;
      C = load from 'B' using F3;
      ...process...
      D = store Y into 'A' using F1;
      

      PigServer will, in ordinary cases, notice that an output path is equal to an input path, and, if there's no path from the input to the output, make the input a dependency of the output. However, PigServer orders the loads and stores arbitrarily during that logic. Sometimes, in the code above, C is correctly wired as a dependency of B and, since that creates a path from A to D, A won't be made a dependency of D and we're good. On occasion though, the ordering being arbitrary, A is wired as a dependency of D. That's no good. To be fair, it's not actually a cycle, since when A is wired to D, there's a path between C and B so the cycle won't actually get created. But it's still a broken plan.

      The offending PigServer code: https://github.com/apache/pig/blob/branch-0.11/src/org/apache/pig/PigServer.java#L1678-L1693

      And here's some actual pig code that should reproduce the broken plan. Notice I had to use a store function that wouldn't check the output. If you're just using PigStorage this won't be reproducible since you can't write to the same location you read from in that case.

      A = load '$A' as (line:chararray);
      A = foreach A generate flatten(TOKENIZE(LOWER(line))) as token;
      store A into '$B';
      B = load '$B' as (token:chararray);
      B = filter B by SIZE(token) > 3;
      store B into '$A' using org.apache.pig.piggybank.storage.DBStorage('com.mysql.jdbc.Driver', 'dbc:mysql://localhost/test', 'INSERT INTO foobar (token) VALUES(?)');
      

      As far as a fix goes... I'd love some input. I've got some workarounds in mind for the specific use case that brought this up, but the general problem is more difficult.

      As an aside, there's other issues with the PigServer code referenced above. For example, it should almost certainly be using the full path (after LoadFunc/StoreFunc.relativeToAbsolutePath) no? Try storing to a relative path then loading from the absolute representation of that path in the same script... Also, why isn't it checking the FuncSpec as well as the location? Just trying to open up the discussion.

      1. broken-plan.png
        133 kB
        Jacob Perkins
      2. cycle.diff
        0.8 kB
        Jacob Perkins
      3. multiquery-cycle.diff
        4 kB
        Jacob Perkins
      4. multiquery-cycle.diff
        19 kB
        Jacob Perkins
      5. multiquery-20140509.diff
        16 kB
        Jacob Perkins

        Activity

        Hide
        Jacob Perkins added a comment -

        Here's the lipstick representation of the broken plan using the code provided.

        Show
        Jacob Perkins added a comment - Here's the lipstick representation of the broken plan using the code provided.
        Hide
        Jacob Perkins added a comment -

        I'm not happy with this fix as it relies on location in the script. However, when writing a pig script by hand, a developer most often (in fact I can't think of a reason I wouldn't do it this way) puts the store before the load that relies on it in the script.

        Show
        Jacob Perkins added a comment - I'm not happy with this fix as it relies on location in the script. However, when writing a pig script by hand, a developer most often (in fact I can't think of a reason I wouldn't do it this way) puts the store before the load that relies on it in the script.
        Hide
        Cheolsoo Park added a comment -

        Jacob Perkins, your patch breaks the following tests. The area of code that you're modifying seems to affect multi query optimization. Can you take a look?

        >>> org.apache.pig.test.TestMultiQueryBasic.testMultiQueryWithCoGroup_2 	3.7 sec	1
        >>> org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithCoGroup 	0.13 sec	1
        >>> org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithIntermediateStores 	0.1 sec	1
        >>> org.apache.pig.test.TestMultiQueryCompiler.testStoreOrder 	0.18 sec	1
        >>> org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemoval 	0.13 sec	1
        >>> org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemovalCollapseSplit 	0.14 sec	1
        >>> org.apache.pig.test.TestMultiQueryLocal.testStoreOrder 
        

        Canceling the patch for now.

        Show
        Cheolsoo Park added a comment - Jacob Perkins , your patch breaks the following tests. The area of code that you're modifying seems to affect multi query optimization. Can you take a look? >>> org.apache.pig.test.TestMultiQueryBasic.testMultiQueryWithCoGroup_2 3.7 sec 1 >>> org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithCoGroup 0.13 sec 1 >>> org.apache.pig.test.TestMultiQueryCompiler.testMultiQueryWithIntermediateStores 0.1 sec 1 >>> org.apache.pig.test.TestMultiQueryCompiler.testStoreOrder 0.18 sec 1 >>> org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemoval 0.13 sec 1 >>> org.apache.pig.test.TestMultiQueryCompiler.testUnnecessaryStoreRemovalCollapseSplit 0.14 sec 1 >>> org.apache.pig.test.TestMultiQueryLocal.testStoreOrder Canceling the patch for now.
        Hide
        Jacob Perkins added a comment -

        Without the location in the script/query being a reliable proxy for order there's really no other way I can think of to fix this. Currently, a workaround is to manually place an exec statement in the script to force the plans to be executed in the right order. This is unfortunate because, as far as I can tell, computation isn't shared across exec statements.

        Show
        Jacob Perkins added a comment - Without the location in the script/query being a reliable proxy for order there's really no other way I can think of to fix this. Currently, a workaround is to manually place an exec statement in the script to force the plans to be executed in the right order. This is unfortunate because, as far as I can tell, computation isn't shared across exec statements.
        Hide
        Jacob Perkins added a comment -

        This goes deep. I updated the LogicalPlanBuilder such that, when building a load op, it checks the plan that's been generated "so far" for stores that match the location the load references. If so it links them.

        Some of the multiquery compilation code exploited the <i>bug</i> that plan.getSinks() always returned the stores. You could get away with that before the "postProcess" method on PigServer got called because loads were not yet linked to the stores they depended on. Hence, all the stores were leaves of the plan. However, with this patch that bug exploitation will no longer work. In order to get the loads you'll have to get the operators of the plan explicitly and create a list of stores. Presumably this happens elsewhere but I only ran the multiquery tests with my patch.

        Show
        Jacob Perkins added a comment - This goes deep. I updated the LogicalPlanBuilder such that, when building a load op, it checks the plan that's been generated "so far" for stores that match the location the load references. If so it links them. Some of the multiquery compilation code exploited the <i>bug</i> that plan.getSinks() always returned the stores. You could get away with that before the "postProcess" method on PigServer got called because loads were not yet linked to the stores they depended on. Hence, all the stores were leaves of the plan. However, with this patch that bug exploitation will no longer work. In order to get the loads you'll have to get the operators of the plan explicitly and create a list of stores. Presumably this happens elsewhere but I only ran the multiquery tests with my patch.
        Hide
        Cheolsoo Park added a comment -

        Presumably this happens elsewhere but I only ran the multiquery tests with my patch.

        Here are unit test failures-

        >>> org.apache.pig.test.TestGrunt.testIllustrate 	0.23 sec	1
        >>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobSimple 	1 min 27 sec	1
        >>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobMultiStoreOnPred 	1 min 26 sec	1
        >>> org.apache.pig.test.TestPigStorage.testSchemaConversion2 	6.6 sec	1
        >>> org.apache.pig.test.TestPigStorage.testSchemaConversion 
        
        Show
        Cheolsoo Park added a comment - Presumably this happens elsewhere but I only ran the multiquery tests with my patch. Here are unit test failures- >>> org.apache.pig.test.TestGrunt.testIllustrate 0.23 sec 1 >>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobSimple 1 min 27 sec 1 >>> org.apache.pig.test.TestNativeMapReduce.testNativeMRJobMultiStoreOnPred 1 min 26 sec 1 >>> org.apache.pig.test.TestPigStorage.testSchemaConversion2 6.6 sec 1 >>> org.apache.pig.test.TestPigStorage.testSchemaConversion
        Hide
        Jacob Perkins added a comment -

        Added test, fixed latest round of unit test failures. Should be good to go now.

        Show
        Jacob Perkins added a comment - Added test, fixed latest round of unit test failures. Should be good to go now.
        Hide
        Cheolsoo Park added a comment -

        Jacob Perkins, thank you for the patch. Overall looks good to me. I ran full unit tests and verified there's no failure.

        I have few minor comments as follows-

        1. Can you rebase your patch to trunk? We always commit a patch into trunk and, if necessary, we back-port it into a branch. The fixVersion should be also 0.13.
        2. Can you add a comment that explains what your unit test is for? Possibly, you can describe how it fails in-deterministically without the fix.
        3. In several places, you're extracting a type of operators from operator plan through a loop. Since this is a common thing to do, I am wondering whether you can write a helper function for that. I have two suggestions-
          1. Write a function to PlanHelper.getPhysicalOperators(PhysicalPlan plan, Class<C> opClass) for LogicalPlan, or
          2. Use Guava Iterables.filter(Iterable<?> unfiltered, Class<T> type).
        4. In PigServer#skipStores(), can't you change processedStores to LOStore collection? Currently, countExecutedStores() constructs a LOStore collection and counts it. Then, skipStores() constructs another LOStore collection and skips as many LOStores as processedStores. Why not keep the original LOStore collection constructed by countExecutedStores() and remove its entries from plan in skipStores()? I might be misunderstanding the logic.
        5. In the following code, can't you use FileSpec#equals() instead of String#compareTo()?
          +            String ifile = op.getFileSpec().getFileName();
          +            String ofile = store.getFileSpec().getFileName();
          +            if (ofile.compareTo(ifile) == 0) { // --> if (load.getFileSpec().equals(store.getFileSpec())) {
          +                inputAliases.add( store.getAlias() );
          +            }
          
        6. Please remove trailing whitespaces and tabs.
        Show
        Cheolsoo Park added a comment - Jacob Perkins , thank you for the patch. Overall looks good to me. I ran full unit tests and verified there's no failure. I have few minor comments as follows- Can you rebase your patch to trunk? We always commit a patch into trunk and, if necessary, we back-port it into a branch. The fixVersion should be also 0.13. Can you add a comment that explains what your unit test is for? Possibly, you can describe how it fails in-deterministically without the fix. In several places, you're extracting a type of operators from operator plan through a loop. Since this is a common thing to do, I am wondering whether you can write a helper function for that. I have two suggestions- Write a function to PlanHelper.getPhysicalOperators(PhysicalPlan plan, Class<C> opClass) for LogicalPlan, or Use Guava Iterables.filter(Iterable<?> unfiltered, Class<T> type) . In PigServer#skipStores(), can't you change processedStores to LOStore collection? Currently, countExecutedStores() constructs a LOStore collection and counts it. Then, skipStores() constructs another LOStore collection and skips as many LOStores as processedStores. Why not keep the original LOStore collection constructed by countExecutedStores() and remove its entries from plan in skipStores()? I might be misunderstanding the logic. In the following code, can't you use FileSpec#equals() instead of String#compareTo()? + String ifile = op.getFileSpec().getFileName(); + String ofile = store.getFileSpec().getFileName(); + if (ofile.compareTo(ifile) == 0) { // --> if (load.getFileSpec().equals(store.getFileSpec())) { + inputAliases.add( store.getAlias() ); + } Please remove trailing whitespaces and tabs.
        Hide
        Jacob Perkins added a comment -

        Cheolsoo Park Regarding #5, FileSpec encapsulates both the location uri as well as the FuncSpec. There are several cases where there is a load that depends on a store where the two FuncSpecs could be different. I do think it should actually be the full path resolved via the Load or Store func's 'relativeToAbsolutePath' method, but I'd rather address that with a separate jira.

        I'll submit another patch soon that addresses the other points. Thanks!

        Show
        Jacob Perkins added a comment - Cheolsoo Park Regarding #5, FileSpec encapsulates both the location uri as well as the FuncSpec. There are several cases where there is a load that depends on a store where the two FuncSpecs could be different. I do think it should actually be the full path resolved via the Load or Store func's 'relativeToAbsolutePath' method, but I'd rather address that with a separate jira. I'll submit another patch soon that addresses the other points. Thanks!
        Hide
        Jacob Perkins added a comment -

        Latest patch. Added a utility function for getting operators of a specific type from the logical plan. Builds against trunk.

        Show
        Jacob Perkins added a comment - Latest patch. Added a utility function for getting operators of a specific type from the logical plan. Builds against trunk.
        Hide
        Cheolsoo Park added a comment -

        Great! I'll run unit tests with the new patch and commit it. Thank you Jacob!

        Show
        Cheolsoo Park added a comment - Great! I'll run unit tests with the new patch and commit it. Thank you Jacob!
        Hide
        Cheolsoo Park added a comment -

        All unit tests passed. Committed to trunk.

        I made one minor change to the final patch. I replaced ifile.compareTo(ofile) == 0 with ifile.equals(ofile) for clarity. According to the javadoc, compareTo returns 0 exactly when the equals(Object) method would return true.

        Show
        Cheolsoo Park added a comment - All unit tests passed. Committed to trunk. I made one minor change to the final patch. I replaced ifile.compareTo(ofile) == 0 with ifile.equals(ofile) for clarity. According to the javadoc , compareTo returns 0 exactly when the equals(Object) method would return true .

          People

          • Assignee:
            Jacob Perkins
            Reporter:
            Jacob Perkins
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development