Pig
  1. Pig
  2. PIG-3204

Change script parsing to parse entire script instead of line by line

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.10.1
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently there are a lot of NN calls made to determine if there is a schema file for a path in a LOAD statement. When there is a slow NN(caused by whole bunch of other issues), it takes a lot of time for this and we found the scripts spending anywhere from 5 mins to 40 mins depending upon the script. It seems to be a good place for optimization.

      1. PIG-3204-6.patch
        73 kB
        Rohini Palaniswamy
      2. PIG-3204-5.patch
        73 kB
        Rohini Palaniswamy
      3. PIG-3204-4.patch
        73 kB
        Rohini Palaniswamy
      4. PIG-3204-3.patch
        17 kB
        Rohini Palaniswamy
      5. PIG-3204-2.patch
        8 kB
        Rohini Palaniswamy
      6. PIG-3204-1.patch
        8 kB
        Rohini Palaniswamy

        Issue Links

          Activity

          Hide
          Prashant Kommireddi added a comment -

          Interesting. Rohini Palaniswamy where exactly are these checks being made?

          Show
          Prashant Kommireddi added a comment - Interesting. Rohini Palaniswamy where exactly are these checks being made?
          Hide
          Rohini Palaniswamy added a comment -

          cat log4j.conf

          # ***** Set root logger level to DEBUG and its only appender to A.
          log4j.rootLogger=debug, A
          
          # ***** A is set to be a ConsoleAppender.
          log4j.appender.A=org.apache.log4j.ConsoleAppender
          # ***** A uses PatternLayout.
          log4j.appender.A.layout=org.apache.log4j.PatternLayout
          log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c %x - %m%n
          

          cat simpleload.pig

          A = LOAD '/tmp/data';
          STORE A into '/tmp/out';
          

          pig -log4jconf ~/pig/log4j.conf simpleload.pig

          Doing

          sed -n '/Pig features used in the script/,/getDelegationToken/p' /tmp/debug.log | grep getFileInfo | wc -l
          

          gives 20 getFileInfo calls if /tmp/data is a directory and 35 calls if /tmp/data is a file.

          grep org.apache.pig.builtin.JsonMetadata /tmp/debug.log gives 10 statements of 2013-02-21 22:04:41,096 [main] DEBUG org.apache.pig.builtin.JsonMetadata - Could not find schema file for /tmp/data

          Haven't stepped through the code, but based on the logs seems to be a good candidate for optimization to cut down on the number of FS calls.

          Show
          Rohini Palaniswamy added a comment - cat log4j.conf # ***** Set root logger level to DEBUG and its only appender to A. log4j.rootLogger=debug, A # ***** A is set to be a ConsoleAppender. log4j.appender.A=org.apache.log4j.ConsoleAppender # ***** A uses PatternLayout. log4j.appender.A.layout=org.apache.log4j.PatternLayout log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c %x - %m%n cat simpleload.pig A = LOAD '/tmp/data'; STORE A into '/tmp/out'; pig -log4jconf ~/pig/log4j.conf simpleload.pig Doing sed -n '/Pig features used in the script/,/getDelegationToken/p' /tmp/debug.log | grep getFileInfo | wc -l gives 20 getFileInfo calls if /tmp/data is a directory and 35 calls if /tmp/data is a file. grep org.apache.pig.builtin.JsonMetadata /tmp/debug.log gives 10 statements of 2013-02-21 22:04:41,096 [main] DEBUG org.apache.pig.builtin.JsonMetadata - Could not find schema file for /tmp/data Haven't stepped through the code, but based on the logs seems to be a good candidate for optimization to cut down on the number of FS calls.
          Hide
          Prashant Kommireddi added a comment -

          Good info, thanks.

          Show
          Prashant Kommireddi added a comment - Good info, thanks.
          Hide
          Rohini Palaniswamy added a comment -

          The problem was that when a script is parsed, there was a registerQuery done for each line by GruntParser which did a parse for that line and the previous lines from the script cache. So if there was a LOAD statement followed by 10 other statements, then the LOAD will be parsed 11 times in registerQuery (which made the calls to getSchema). Added an option to skip that parsing if it is in batch mode. executeBatch() will take care of the parsing.

          This will speedup pig script parsing a lot.

          Show
          Rohini Palaniswamy added a comment - The problem was that when a script is parsed, there was a registerQuery done for each line by GruntParser which did a parse for that line and the previous lines from the script cache. So if there was a LOAD statement followed by 10 other statements, then the LOAD will be parsed 11 times in registerQuery (which made the calls to getSchema). Added an option to skip that parsing if it is in batch mode. executeBatch() will take care of the parsing. This will speedup pig script parsing a lot.
          Hide
          Cheolsoo Park added a comment -

          Rohini Palaniswamy, I find this logic a bit confusing:

          if( batchMode ) {
              ...
          }
          if(validateEachStatement){
             validateQuery();
          }
          if (batchMode && !skipParseForBatch) {
             parseQuery();
          }
          if( !batchMode ) { 
             parseQuery();
             ...
          }
          

          Instead, how about returning early if batchMode && skipParseForBatch? That is,

          if( batchMode ) {
              ...
              if (skipParseForBatch) {
                  return;
              }
          }
          if(validateEachStatement){
             validateQuery();
          }
          parseQuery();
          

          I am assuming that there is no need for validateQuery() when skipping parseQuery().

          Does this make sense?

          Show
          Cheolsoo Park added a comment - Rohini Palaniswamy , I find this logic a bit confusing: if ( batchMode ) { ... } if (validateEachStatement){ validateQuery(); } if (batchMode && !skipParseForBatch) { parseQuery(); } if ( !batchMode ) { parseQuery(); ... } Instead, how about returning early if batchMode && skipParseForBatch? That is, if ( batchMode ) { ... if (skipParseForBatch) { return ; } } if (validateEachStatement){ validateQuery(); } parseQuery(); I am assuming that there is no need for validateQuery() when skipping parseQuery(). Does this make sense?
          Hide
          Rohini Palaniswamy added a comment -

          Cheolsoo Park
          You are right. validateQuery is not required. Attaching PIG-3204-2.patch with suggested changes.

          Show
          Rohini Palaniswamy added a comment - Cheolsoo Park You are right. validateQuery is not required. Attaching PIG-3204 -2.patch with suggested changes.
          Hide
          Cheolsoo Park added a comment -

          I have one very minor question. In BoundScript.java, you're not setting skipParseForBatch in registerQuery() while you're in call() and exec(). Is this because this particular method is only called by describe/explain/illustrate, and thus, they don't need this optimization? Or is there another reason why we shouldn't?

          I found it inconsistent and wanted to clarify. Other than that, looks good to me.

          Show
          Cheolsoo Park added a comment - I have one very minor question. In BoundScript.java, you're not setting skipParseForBatch in registerQuery() while you're in call() and exec() . Is this because this particular method is only called by describe/explain/illustrate, and thus, they don't need this optimization? Or is there another reason why we shouldn't? I found it inconsistent and wanted to clarify. Other than that, looks good to me.
          Hide
          Rohini Palaniswamy added a comment -

          Yes. Did not because it was only describe/explain/illustrate. Will check and if there are no problems will add it there too. Actually I have not handled running via PigRunner/Main which is most important. Canceling the patch till I fix that.

          Show
          Rohini Palaniswamy added a comment - Yes. Did not because it was only describe/explain/illustrate. Will check and if there are no problems will add it there too. Actually I have not handled running via PigRunner/Main which is most important. Canceling the patch till I fix that.
          Hide
          Rohini Palaniswamy added a comment -

          Patch with whitespace changes. PIG-3204-3.patch does not have whitespace changes.

          Show
          Rohini Palaniswamy added a comment - Patch with whitespace changes. PIG-3204 -3.patch does not have whitespace changes.
          Hide
          Rohini Palaniswamy added a comment -
          Show
          Rohini Palaniswamy added a comment - Review board link - https://reviews.apache.org/r/13535/
          Hide
          Rohini Palaniswamy added a comment -

          Patch with minor comments in review board addressed

          Show
          Rohini Palaniswamy added a comment - Patch with minor comments in review board addressed
          Hide
          Rohini Palaniswamy added a comment -

          Got a +1 from Cheolsoo in review board. Committed to trunk (0.12). Thanks Cheolsoo.

          Show
          Rohini Palaniswamy added a comment - Got a +1 from Cheolsoo in review board. Committed to trunk (0.12). Thanks Cheolsoo.
          Hide
          Shiwei Guo added a comment -

          Is the executeBatch() call in 'processDescribe' introduced by this patch necessary?
          Can it be replaced by:

            if (mPigServer.isBatchOn()) {
                      mPigServer.parseAndBuild();
                  }
          

          ?

          We have a usage case, which may be quite normal, like this:

          describe aliaseA;
          store aliaseA;
          
          describe aliaseB;
          store aliaseB;
          
          describe aliaseC;
          store aliaseC;
          ...
          

          The 'executeBatch' call in describe make pig have no chance to execute the store operations in parallel and do more optimize.

          Show
          Shiwei Guo added a comment - Is the executeBatch() call in 'processDescribe' introduced by this patch necessary? Can it be replaced by: if (mPigServer.isBatchOn()) { mPigServer.parseAndBuild(); } ? We have a usage case, which may be quite normal, like this: describe aliaseA; store aliaseA; describe aliaseB; store aliaseB; describe aliaseC; store aliaseC; ... The 'executeBatch' call in describe make pig have no chance to execute the store operations in parallel and do more optimize.
          Hide
          Cheolsoo Park added a comment -

          Shiwei Guo, thank you for reporting the issue. I ran into the same problem myself.

          I opened PIG-4106.

          Show
          Cheolsoo Park added a comment - Shiwei Guo , thank you for reporting the issue. I ran into the same problem myself. I opened PIG-4106 .

            People

            • Assignee:
              Rohini Palaniswamy
              Reporter:
              Rohini Palaniswamy
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development