Pig / PIG-872

use distributed cache for the replicated data set in FR join

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.7.0
    • Component/s: None
    • Labels:
      None

      Description

      Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls.

      Using distributed cache will address the issue and might also give a performance boost, since the file will be copied locally once and then reused by all tasks running on the same machine.

      The basic approach would be to use cacheArchive on the frontend to place the file into the cache; on the backend, the tasks would need to refer to the data using the path from the cache.

      Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.)
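
      The frontend/backend split described above can be sketched roughly as follows. This is only an illustration, not the code from the patch: java.util.Properties stands in for the Hadoop job configuration, the class name, method names, and the pigrepl_ link prefix are hypothetical, and the property name mapred.cache.files is the one reported later in this thread. The sketch assumes the usual Hadoop convention of a comma-separated list of URIs whose fragment (the part after #) names the task-local link.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class ReplicatedCacheSketch {
    // Property name as seen in job.xml; everything else here is hypothetical.
    static final String CACHE_FILES_KEY = "mapred.cache.files";

    /** Frontend: record each replicated file as "<dfs-uri>#<link-name>". */
    static void registerReplicatedFiles(Properties jobConf, List<String> dfsPaths) {
        List<String> entries = new ArrayList<>();
        for (int i = 0; i < dfsPaths.size(); i++) {
            entries.add(dfsPaths.get(i) + "#" + linkName(i));
        }
        jobConf.setProperty(CACHE_FILES_KEY, String.join(",", entries));
    }

    /** Backend: look up the task-local link name of the i-th replicated file. */
    static String localNameFor(Properties jobConf, int index) {
        String[] entries = jobConf.getProperty(CACHE_FILES_KEY, "").split(",");
        return URI.create(entries[index]).getFragment();
    }

    /** Hypothetical naming scheme for the task-local links. */
    static String linkName(int index) {
        return "pigrepl_" + index;
    }
}
```

      With this shape, each task opens the local copy by its link name instead of issuing an open call against DFS, which is the NameNode load the description is concerned about.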

      1. PIG_872.patch.1
        7 kB
        Sriranjan Manjunath

        Activity

        Pradeep Kamath added a comment -

        Distributed cache can be used when the "replicate" input for the fragment-replicate join is a file already present on DFS at query start time. When the replicate input is an intermediate output of the query (say, the output of a filter), the client side will know the filename of this intermediate output. However, we have to check that Hadoop honors the distributed cache property only when launching the job that corresponds to the FR join, so that the file is present on DFS at that time. If that is not the case, the approach may not work for intermediate-output scenarios.

        Milind Bhandarkar added a comment -

        A couple of things:

        First, as Pradeep says, only the Hadoop job that performs the FR join needs to add the replicated dataset to the distributed cache.

        Second, make sure that the replicated dataset has high replication, such as 10 (or the same replication as job.jar). For an already materialized dataset, Pig need not do anything except warn if the replication factor is small (e.g. 3). But if the replicated dataset is being produced as an intermediate output by Pig, it needs to be generated with a high replication factor.
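
        The two cases above could be expressed as a small policy, sketched here under stated assumptions: the class and method names are hypothetical, and the value 10 is simply the example from this comment, not a Pig or Hadoop default.

```java
final class ReplicationPolicySketch {
    // Example value taken from the comment above; not a Pig or Hadoop default.
    static final short SUGGESTED_REPLICATION = 10;

    /**
     * Already materialized dataset: leave it untouched and only report.
     * Returns a warning message, or null when replication is high enough.
     */
    static String warningForExisting(String path, short currentReplication) {
        if (currentReplication >= SUGGESTED_REPLICATION) {
            return null;
        }
        return "Replicated input " + path + " has replication factor "
                + currentReplication + "; consider raising it to "
                + SUGGESTED_REPLICATION + " before the FR join";
    }

    /** Intermediate output written by Pig: the replication factor to request. */
    static short replicationForIntermediate(short clusterDefault) {
        return (short) Math.max(clusterDefault, SUGGESTED_REPLICATION);
    }
}
```

        The asymmetry is deliberate: an existing file belongs to the user, so the sketch only warns, while an intermediate file is created by Pig and can be written with the higher factor from the start.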

        Sriranjan Manjunath added a comment -

        I have verified that the job.xml has mapred.cache.files set to the replicated files.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
        against trunk revision 881008.

        +1 @author. The patch does not contain any @author tags.

        +1 tests included. The patch appears to include 3 new or modified tests.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console

        This message is automatically generated.

        Olga Natkovich added a comment -

        I am reviewing this patch

        Olga Natkovich added a comment -

        Patch looks good. A couple of comments:

        (1) Looks like the newly added tests just have a plain replicated join. I think this is already covered in TestFRJoin.java.
        (2) "JobControlCompiler.java" 943 lines: the comment says that the fragmented input is usually null and that it is the first one. Would it be clearer if we just started with the second file and did not have to check for null every time, since a null in any other position should be treated as an error?

        Ashutosh Chauhan added a comment -

        There already exists MapReduceOper.getFragment(), which returns the index of the fragmented input in replFiles[]. This index should be used to identify the fragmented input instead of a null check. Then, as Olga suggested, any other null is an error.
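
        A minimal sketch of that suggestion, not the patch itself: the class and method names are hypothetical, plain strings stand in for whatever element type replFiles[] actually holds, and fragmentIndex is assumed to be the value returned by MapReduceOper.getFragment().

```java
import java.util.ArrayList;
import java.util.List;

final class ReplFilesSketch {
    /**
     * Collects the replicated inputs from replFiles, skipping the slot at
     * fragmentIndex and treating a null in any other slot as an error.
     */
    static List<String> replicatedInputs(String[] replFiles, int fragmentIndex) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < replFiles.length; i++) {
            if (i == fragmentIndex) {
                continue; // the fragmented input is read as regular map input
            }
            if (replFiles[i] == null) {
                throw new IllegalStateException(
                        "Unexpected null replicated input at position " + i);
            }
            result.add(replFiles[i]);
        }
        return result;
    }
}
```

        Keying on the index rather than on null keeps the position of the fragmented input flexible while still catching a genuinely missing replicated file.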

        Sriranjan Manjunath added a comment -

        Olga, I agree with your first point. I will get rid of the test case.
        To rectify (2), shouldn't mapReduceOper.getReplFiles() return only the replicated files? What's the rationale behind returning a null for the fragmented input? I could change it to what Ashutosh suggested, but it would be cleaner if the fragmented input were not represented by a null.

        Olga Natkovich added a comment -

        I am fine if you want to remove it as long as it does not break any existing functionality. I am not sure why it is present in the list.

        Ashutosh Chauhan added a comment -

        I think the original intent was not to hard-code the fact that the fragmented input should be the first input. I think it's good to have that flexibility (e.g., if we later decide that the ordering of join inputs should be consistent across different join algorithms, and thus the fragmented input should be last, in line with symmetric hash join). This has led to the twisted need to represent the fragmented input as null in replFiles[]. Nonetheless, it could be fixed so that replFiles[] consists of exactly n-1 values with no nulls. However, that would make this patch bigger and is somewhat orthogonal to this issue. So I suggest tracking that in a separate JIRA, if we think it is something that should be fixed.

        Sriranjan Manjunath added a comment -

        Fixed both the issues.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
        against trunk revision 882818.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        -1 core tests. The patch failed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/console

        This message is automatically generated.

        Olga Natkovich added a comment -

        Resubmitting the patch. Looks like we had problems running the tests.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
        against trunk revision 882818.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        +1 javadoc. The javadoc tool did not generate any warning messages.

        +1 javac. The applied patch does not increase the total number of javac compiler warnings.

        +1 findbugs. The patch does not introduce any new Findbugs warnings.

        +1 release audit. The applied patch does not increase the total number of release audit warnings.

        +1 core tests. The patch passed core unit tests.

        +1 contrib tests. The patch passed contrib unit tests.

        Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/testReport/
        Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/console

        This message is automatically generated.

        Olga Natkovich added a comment -

        Patch committed. Thanks Sri for contributing.

        (The patch did not need additional unit tests because it is an improvement of existing functionality and the current tests are sufficient. Also, there is no programmatic way to test this change, and manual testing has been done.)

        Olga Natkovich added a comment -

        Patch committed to the 0.6.0 branch.


          People

          • Assignee: Sriranjan Manjunath
          • Reporter: Olga Natkovich
          • Votes: 1
          • Watchers: 2
