Pig / PIG-1218

Use distributed cache to store samples

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Currently, for skew join and order by, we use a sample that is just written to the DFS (not the distributed cache) and, as a result, gets opened and copied around more than necessary. This hurts query performance and also places unnecessary load on the name node.

      1. PIG-1218_3.patch
        31 kB
        Richard Ding
      2. PIG-1218_2.patch
        28 kB
        Richard Ding
      3. PIG-1218.patch
        34 kB
        Richard Ding

        Activity

        Arun C Murthy added a comment -

        I'd also suggest we increase replication factor for the sample-file in HDFS before adding it to the distributed-cache.

        Richard Ding added a comment -

        This patch uses Hadoop DistributedCache to cache the sample files used by order by and skewed join, as well as the side files used in FR join.

        When an HDFS file is added to the DistributedCache, Pig generates a symlink to the file and, at runtime, uses this symlink to open the file from the local working directory of the task. To avoid symlink collisions, the symlink name is not the file name; it is generated from a combination of the hashcode of the file path and the current timestamp.
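The naming scheme described above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the code from the patch: the `pigsample_` prefix, the helper names, and the example paths are hypothetical, and the exact name format Pig uses is not given in this thread. With Hadoop's DistributedCache API, a URI of this shape (symlink name in the fragment after `#`) would typically be passed to `DistributedCache.addCacheFile` after enabling symlinks with `DistributedCache.createSymlink`; those calls are left out so the sketch runs without Hadoop on the classpath.

```java
import java.net.URI;

public class CacheSymlinkSketch {

    // Hypothetical helper: build the symlink name from the hash code of the
    // full path plus a timestamp instead of the file name, so two cached
    // files that share a base name (e.g. part-00000) get different symlinks.
    public static String symlinkName(String hdfsPath, long timestampMillis) {
        return "pigsample_" + Integer.toHexString(hdfsPath.hashCode())
                + "_" + timestampMillis;
    }

    // The symlink name travels in the URI fragment (the part after '#').
    public static URI cacheUri(String hdfsPath, String symlink) {
        return URI.create(hdfsPath + "#" + symlink);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        // Same base name, different directories (illustrative paths only).
        String a = "/tmp/job1/sample/part-00000";
        String b = "/tmp/job2/sample/part-00000";
        URI ua = cacheUri(a, symlinkName(a, now));
        URI ub = cacheUri(b, symlinkName(b, now));
        System.out.println(ua.getPath() + " -> " + ua.getFragment());
        System.out.println(ub.getPath() + " -> " + ub.getFragment());
    }
}
```

At runtime a task would then open the file by the symlink name relative to its working directory, rather than by the HDFS path.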

        This patch does not change the replication factor of the sample file in HDFS. It is not clear what the right factor would be, and implementing the change in Pig is not trivial.

        Dmitriy V. Ryaboy added a comment -

        Richard, does this have any implications on the size of relations that can be used for FR Joins?

        Richard Ding added a comment -

        There is no hard limit on file size for DistributedCache. The files in the DistributedCache are copied to all nodes before the job starts, so large files will hurt performance because of the time needed to transfer them to every node.
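As a rough illustration of why size matters here, the transfer volume grows with both file size and node count. The sketch below assumes the simplest model, one full copy per node, and uses made-up numbers (a 200 MiB relation, 500 nodes); the class and method names are hypothetical and nothing in it is measured.

```java
public class CacheTransferEstimate {

    // Total bytes shipped if the cached file is copied once to each node.
    // multiplyExact throws ArithmeticException on overflow instead of
    // silently wrapping.
    public static long totalBytes(long fileSizeBytes, int nodes) {
        return Math.multiplyExact(fileSizeBytes, (long) nodes);
    }

    public static void main(String[] args) {
        long oneMiB = 1024L * 1024L;
        // Illustrative numbers only: a 200 MiB replicated relation, 500 nodes.
        long total = totalBytes(200L * oneMiB, 500);
        System.out.println(total / oneMiB + " MiB transferred in total");
    }
}
```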

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12435515/PIG-1218.patch
        against trunk revision 908324.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/200/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/200/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/200/console

        This message is automatically generated.

        Olga Natkovich added a comment -

        It looks like this patch is for trunk. Since we are planning to merge the LSR branch onto trunk next week, it would be better if this patch applied directly to LSR.

        Richard Ding added a comment -

        It makes sense since the merge to the branch isn't trivial. I'll do the merge with this jira.

        Richard Ding added a comment -

        The second patch is for the LSR branch and is ready for review.

        Pradeep Kamath added a comment -

        +1. The patch mostly looks good; a couple of comments:

        • In a couple of places, instead of using a Configuration and JobConf based on PigMapReduce.sJobConf, you should create a new Configuration(false) and a new JobConf(false), so that we get fresh data structures without any properties carried over from the MapReduce-based data structures.
        • Since partitionFile is no longer used in POPartitionRearrange.java, we should remove it.

        You can make these changes and go ahead and commit if it passes the tests.

        Ashutosh Chauhan added a comment -

        On the trunk patch:
        In POFRJoin#setUpHashMap()

        POLoad ld = new POLoad(new OperatorKey("Repl File Loader", 1L),
                            replFile, false);
        

        Should it be the following?

        POLoad ld = new POLoad(new OperatorKey("Repl File Loader", NodeIdGenerator.getGenerator().getNextNodeId("Repl File Loader")),
                            replFile, false);
        

        Also, the following can be moved out of the for loop to avoid multiple connect() calls on pc.

        PigContext pc = new PigContext(ExecType.MAPREDUCE, props);
        pc.connect();
        
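The effect of hoisting the context setup out of the loop can be sketched as follows. `FakeContext` is a hypothetical stand-in for PigContext that only counts `connect()` calls, and the file names are placeholders; the real loading logic in POFRJoin is not reproduced here.

```java
import java.util.Arrays;
import java.util.List;

public class HoistConnectSketch {

    // Hypothetical stand-in for PigContext: it only counts connect() calls.
    static class FakeContext {
        int connects = 0;
        void connect() { connects++; }
    }

    // Before: a context is created and connected once per replicated file.
    public static int connectsInsideLoop(List<String> replFiles) {
        int total = 0;
        for (String file : replFiles) {
            FakeContext pc = new FakeContext();
            pc.connect();
            total += pc.connects;
        }
        return total;
    }

    // After: one context is created and connected, then reused for every file.
    public static int connectsHoisted(List<String> replFiles) {
        FakeContext pc = new FakeContext();
        pc.connect();
        for (String file : replFiles) {
            // Each file would be loaded here using the shared pc.
        }
        return pc.connects;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList("repl-a", "repl-b", "repl-c");
        System.out.println("inside loop: " + connectsInsideLoop(files));
        System.out.println("hoisted:     " + connectsHoisted(files));
    }
}
```

With three replicated files the first variant connects three times and the second once, which is the saving the comment above is after.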

        In JobControlCompiler#setupDistributedCacheForFRJoin()

        new FRJoinDistributedCacheVisitor(mro.reducePlan, pigContext, conf)
                        .visit();
        

        Do we need this? Isn't FR Join a map-side join? If so, a POFRJoin ending up in mro.reducePlan would be a bug, no?

        Richard Ding added a comment -

        Updated the patch to address the comments of Pradeep and Ashutosh.

        Richard Ding added a comment -

        Patch 3 includes all of patch 2 plus distributed cache support for the merge join's index file (PIG-1079).

        Pradeep Kamath added a comment -

        Committed patch PIG-1218_2.patch since the merge join changes need to be re-worked and will be handled in a different patch.

        Thanks Richard!


          People

          • Assignee:
            Richard Ding
            Reporter:
            Olga Natkovich
          • Votes:
            0
            Watchers:
            3
