Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
    • Hadoop Flags:
      Reviewed

      Description

      We have received several requests for a UDF to flatten a bag. It seems reasonable to add two built-in UDFs for this:
      1. BagToTuple: {(a),(b),(c)} -> (a,b,c)
      2. BagToString(delimit="_"): {(a),(b),(c)} -> "a_b_c"
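
The two conversions described above can be modeled in a few lines. This is a minimal sketch in plain Python, not Pig's Java implementation: it assumes a bag is represented as a list of tuples, and the function names bag_to_tuple and bag_to_string are illustrative only. Pig bags are unordered, so the fixed list order here is also an assumption of the model.

```python
def bag_to_tuple(bag):
    """Concatenate the fields of every tuple in the bag into one tuple."""
    fields = []
    for tup in bag:
        fields.extend(tup)
    return tuple(fields)


def bag_to_string(bag, delimiter):
    """Join the fields of every tuple in the bag with the delimiter."""
    return delimiter.join(str(field) for field in bag_to_tuple(bag))


bag = [("a",), ("b",), ("c",)]
print(bag_to_tuple(bag))        # ('a', 'b', 'c')
print(bag_to_string(bag, "_"))  # a_b_c
```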

      1. bagtotuplestring.diff
        19 kB
        Hien Luu
      2. PIG-2166.diff
        27 kB
        Hien Luu
      3. PIG-2166-e2e.diff
        34 kB
        Hien Luu
      4. test_harnesss_1338753364
        70 kB
        Hien Luu

        Activity

        Julien Le Dem added a comment -

        We should not say "flatten" in that case. What about saying "join"?

        Daniel Dai added a comment -

        Sounds good.

        Alan Gates added a comment -

        -1 to join. We already use that for another concept. What's wrong with BagToTuple and BagToString?

        Julien Le Dem added a comment -

        Hi Alan, I was merely commenting on the title of the JIRA, not the UDF name.

        Hien Luu added a comment -

        Hi Daniel,

        These two UDFs will flatten only the first level, right? They don't need to recursively flatten nested bags, do they?

        For example, given the input {(a),{(b,c)},(d)}, will the output be (a,{(b,c)},d), or should it be (a,b,c,d)?

        I don't think they should, but just wanted to double check.

        Daniel Dai added a comment -

        Hi Hien, I don't think so either. BagToTuple({(a),{(b,c)},(d)}) should be (a,{(b,c)},d). Otherwise, the result is ambiguous.

        Thejas M Nair added a comment -

        Bags in Pig are expected to contain tuples, so the bag should actually be {(a),({(b,c)}),(d)}, and the output of BagToTuple on it should be the same as what Daniel said: (a,{(b,c)},d).
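
The first-level-only behavior agreed on in the comments above can be illustrated with a small plain-Python model (a sketch, not Pig code; bags are modeled as lists of tuples, and both function names are hypothetical). The recursive variant is included only for contrast. It shows one possible reading of the ambiguity concern: two differently shaped inputs collapse to the same output.

```python
def bag_to_tuple(bag):
    """First-level flattening: each field is copied as-is, nested bags included."""
    fields = []
    for tup in bag:
        fields.extend(tup)
    return tuple(fields)


def flatten_deep(bag):
    """Hypothetical recursive variant, shown for contrast only."""
    fields = []
    for tup in bag:
        for field in tup:
            if isinstance(field, list):  # a nested bag
                fields.extend(flatten_deep(field))
            else:
                fields.append(field)
    return tuple(fields)


nested = [("a",), ([("b", "c")],), ("d",)]  # {(a),({(b,c)}),(d)}
flat = [("a",), ("b", "c"), ("d",)]         # {(a),(b,c),(d)}

print(bag_to_tuple(nested))  # ('a', [('b', 'c')], 'd')
print(flatten_deep(nested) == flatten_deep(flat))  # True
```

With first-level flattening the nested bag survives as a single field, so the original shape can still be told apart from the flat input.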

        Gianmarco De Francisci Morales added a comment -

        I think we need at least 2 delimiters, one for bag elements (which are tuples) and one for tuple elements (which are anything), but I am not sure it is worth supporting nested structures in the tuples.
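
To make the two-delimiter suggestion concrete, here is a hedged sketch in plain Python. It is not part of any patch on this issue: the function name, parameter names, and default delimiter characters are all hypothetical, and a bag is modeled as a list of tuples.

```python
def bag_to_string_two_level(bag, tuple_delim="|", field_delim=","):
    """Join fields inside each tuple, then join the tuples (hypothetical)."""
    return tuple_delim.join(
        field_delim.join(str(field) for field in tup) for tup in bag
    )


bag = [("a", 1), ("b", 2)]
print(bag_to_string_two_level(bag))  # a,1|b,2
```

With a single delimiter the same bag would become a_1_b_2, which no longer shows where one tuple ends and the next begins.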

        Hien Luu added a comment -

        I tried my best to follow the convention for which exceptions to throw in a UDF. Let me know if I missed anything.

        Hien Luu added a comment -

        I tested these UDFs in a Pig script by modifying one of the scripts in the tutorial directory. That was manual testing, so it is not ideal.

        The tests that call the exec method directly are in TestBuiltInBagToTupleOrString.java.

        For these two UDFs, is it necessary to test them in a Pig script?

        Daniel Dai added a comment -

        Hi, Hien,
        The patch looks good. For BagToString, it is better to have a default delimiter, so it does not complain if we don't pass one.

        It also makes sense to add some e2e tests to test/e2e/pig/tests/nightly.conf. You can use the input file studentcomplextab10k, which contains bags. Refer to https://cwiki.apache.org/confluence/display/PIG/HowToTest for how to run e2e tests.

        Hien Luu added a comment -

        I ran into a problem when trying to generate test data using the command "ant -Dharness.old.pig=old_pig -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-deploy" from the https://cwiki.apache.org/confluence/display/PIG/HowToTest page.

        Can't locate IPC/Run.pm in @INC (@INC contains: . . . ./libexec . . ./libexec /Library/Perl/Updates/5.10.0 /System/Library/Perl/

        Then I tried to install IPC::Run perl module and ran into another error.

        On cpan.org, there is this paragraph:
        "OSX comes with Perl pre-installed, in order to build and install your own modules you will need to install the 'developer' package which can be found on your OSX install DVD (you only need the 'unix tools'). Once you have done this you can use all of the tools mentioned above."

        Do you know the 'developer' package it is talking about?

        BTW, I am on Mac OSX.

        Thanks,

        Hien

        Daniel Dai added a comment -

        You need to install the IPC::Run module. I usually download it from http://search.cpan.org/~toddr/IPC-Run-0.91/lib/IPC/Run.pm, then build and install it.

        Thejas M Nair added a comment -

        This might also work for you: cpan install IPC::Run

        Hien Luu added a comment -

        I had to upgrade to Xcode 3.2 to get past the IPC::Run installation error. It was complaining about a missing header file.

        Does it really take 10 hours to complete the e2e tests when running in local mode? Is there a way to run a specific set of tests only?

        Thejas M Nair added a comment -

        Is there a way to run a specific set of tests only?

        Yes. For example, to run test number 1 in the Checkin test group, add the following parameter to the command line: -Dtests.to.run="-t Checkin_1"
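
Putting that parameter together with the ant command quoted earlier in this thread, the full invocation can be assembled as below. This is only a sketch: e2e_command is a hypothetical helper, and the property values (old_pig, hadoop_conf_dir, hadoop_script) are the placeholders used in the thread, not real paths.

```python
import shlex


def e2e_command(test_name, harness_props, target="test-e2e-local"):
    """Build the ant argv that runs one e2e test (hypothetical helper)."""
    argv = ["ant"]
    # Harness settings such as harness.old.pig become -Dkey=value properties.
    argv += [f"-D{key}={value}" for key, value in harness_props.items()]
    argv.append(target)
    # The whole "-t <name>" string is the value of tests.to.run.
    argv.append(f"-Dtests.to.run=-t {test_name}")
    return argv


argv = e2e_command(
    "Checkin_1",
    {
        "harness.old.pig": "old_pig",
        "harness.cluster.conf": "hadoop_conf_dir",
        "harness.cluster.bin": "hadoop_script",
    },
)
# shlex.join quotes the one argument that contains a space.
print(shlex.join(argv))
```

Building the command as a list keeps the space inside the tests.to.run value from being split by the shell.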

        Prashant Kommireddi added a comment -

        You can also run a single unit test if you would like: https://cwiki.apache.org/PIG/howtotest.html#HowToTest-Runningasingleunittest

        Julien Le Dem added a comment -

        Here is an example of how you could test your UDF in a Pig script from a Java unit test:
        http://svn.apache.org/viewvc/pig/trunk/test/org/apache/pig/builtin/mock/TestMockStorage.java?revision=1331070&view=markup

        Hien Luu added a comment -

        Cool. Thanks for the answers and suggestions. Very helpful.

        Hien Luu added a comment -

        Added support for default delimiter in BagToString UDF and more tests using embedded PigServer.
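
As an illustration of what a default delimiter means for callers, here is a minimal plain-Python model (not the patch itself). It assumes "_" as the default, taken from the example in the description, and models a bag as a list of tuples; the names are illustrative.

```python
DEFAULT_DELIMITER = "_"  # assumption: mirrors the example in the description


def bag_to_string(bag, delimiter=DEFAULT_DELIMITER):
    """Join all fields of all tuples; the delimiter argument is optional."""
    fields = [str(field) for tup in bag for field in tup]
    return delimiter.join(fields)


bag = [("a",), ("b",), ("c",)]
print(bag_to_string(bag))       # a_b_c
print(bag_to_string(bag, "-"))  # a-b-c
```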

        Hien Luu added a comment -

        I tried to add more tests to nightly.conf, but I keep getting an error when trying to run them.

        Here is the command I used to run a specific test:

        ant -Dharness.old.pig=/Users/hluu/dev/pig_project/pig/old_pig/pig-0.10.0 -Dharness.cluster.conf=hadoop_conf_dir -Dharness.cluster.bin=hadoop_script test-e2e-local -Dtests.to.run="-t BagToString_1"

        The test log file is attached and the error is on line 412.

        Here is line 412:

        ERROR TestDriver::run at : 470 Failed to run test BagToString_1 <Unable to open file ${PH_BENCHMARK_CACHE_PATH}/BagToString_1_benchmark.pig to write pig script, No such file or directory>

        Any ideas? What is PH_BENCHMARK_CACHE_PATH?

        Thanks.

        Daniel Dai added a comment -

        It is the cache directory for benchmark files.

        Hien Luu added a comment -

        How do I set a value for this environment variable?

        I tried to add an environment variable to my .bash_profile like below and still ran into the same issue:

        export PH_BENCHMARK_CACHE_PATH=/Users/hluu/dev/pig_project/pig/cache

        Here is the error in the test harness log file (<pig home>/test/e2e/pig/testdist/out/log/test_harnesss_1339002322):

        sort ./out/pigtest/hluu/hluu.1339002409/Distinct_1.out/out_original
        ERROR TestDriver::run at : 470 Failed to run test Distinct_1 <Unable to open file ${PH_BENCHMARK_CACHE_PATH}/Distinct_1_benchmark.pig to write pig script, No such file or directory>

        This issue is blocking my progress. Please let me know what needs to be done so I can move forward.

        Thanks.

        Daniel Dai added a comment -

        We fixed it on trunk and the 0.10 branch. Please try "svn up" and run again. (You don't need to set PH_BENCHMARK_CACHE_PATH.)

        Hien Luu added a comment -

        Awesome. I was able to make some progress after "svn up".

        Daniel Dai added a comment -

        Hi, Hien,
        Are you able to run e2e tests? Do you need any help?

        Hien Luu added a comment -

        Yes, I am able to run e2e tests. I am hoping to finish adding tests to nightly.conf this weekend.

        Hien Luu added a comment -

        Added 2 test groups to nightly.conf, for a total of 4 tests. They all passed.

        Daniel Dai added a comment -

        +1.

        Patch committed to trunk.

        Thanks Hien!


          People

          • Assignee:
            Hien Luu
            Reporter:
            Daniel Dai
          • Votes:
            0
            Watchers:
            7
