PIG-2541

Automatic record provenance (source tagging) for PigStorage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      We add a new option, -tagsource, to PigStorage. With this flag, the input file name (INPUT_FILE_NAME) is emitted as the first column of the output data, e.g.:

      a = load '1.txt' using PigStorage('\t', '-tagsource');

      Description

      There is a lot of interest in knowing where data comes from when loading from a directory (or a set of directories). One can do it manually (see https://cwiki.apache.org/confluence/display/PIG/FAQ), but it would be more convenient for users if we implemented this in PigStorage with a command-line option (e.g., pig.source.tagging=true/false) to turn it on/off. By default it will be off.
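
      To make the proposal concrete, here is a hedged sketch of what a script using the option might look like (the '-tagsource' flag name and first-column position follow the release note above; file names and aliases are illustrative):

      ```pig
      -- Load a set of files and keep track of which file each record came from.
      -- With '-tagsource', the source file name becomes the first column ($0).
      raw = LOAD 'data/*' USING PigStorage('\t', '-tagsource');

      -- Count records per source file.
      grouped = GROUP raw BY $0;
      counts = FOREACH grouped GENERATE group AS source_file, COUNT(raw) AS n;
      DUMP counts;
      ```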

      1. PIG-2541.patch
        3 kB
        Prashant Kommireddi
      2. PIG-2541.doc.patch
        1 kB
        Daniel Dai
      3. PIG-2541_3.patch
        11 kB
        Prashant Kommireddi
      4. PIG-2541_2.patch
        7 kB
        Prashant Kommireddi

        Activity

        Dmitriy V. Ryaboy added a comment -

        Good call.
        If you are working on implementing this, can I ask that, instead of appending to a tuple, you wrap the normal PigStorage tuple with one that adds the column on demand?
        That will keep tuple memory optimizations that don't deal well with strings working (not to mention that you can keep only one copy of each source tag in RAM, instead of one per tuple).

        Prashant Kommireddi added a comment -

        This would be a nice addition. The option could be provided similarly to the current options (schema, noschema): https://issues.apache.org/jira/browse/PIG-2143

        Prashant Kommireddi added a comment -

        Richard, I started working on this, but I can stop if you would like to take this up.

        Ashutosh Chauhan added a comment -

        Isn't this one of the use cases for Penny? It's in contrib/penny.

        Prashant Kommireddi added a comment -

        Initial patch (still need to add test cases). Regression testing with TestPigStorage passes. Also, I append only the final component of the Path to each tuple; it does not make much sense to append the cluster (HDFS location).

        Dmitriy V. Ryaboy added a comment -

        Prashant,
        Nice start!

        I think this needs more work to handle the case when people use a loaded schema (the reported schema will be wrong).

        Is appending to the end the right thing to do? Putting it in front ensures we know where to expect it. Without a schema, PigStorage can produce variable-length tuples depending on the number of entries it sees in a row, which will make the source tag's ordinal position float around.

        Minor code/style comments: please make static final vars all caps. Also please make sure the indentation is strictly 4 spaces (a few lines look off – tabs?). Just set your IDE settings to enforce this.

        Also the final patch should include the new parameters in Javadoc, not just in populateValidOptions().

        D

        Prashant Kommireddi added a comment -

        Thanks Dmitriy for the review.

        1. I will look at the case of a loaded schema. I wonder how regression passed; maybe I should take a closer look at the test cases.

        2. Great point. I had the exact same thought while coding it but decided against adding the path to the front. The reason is that, with this feature, we would want users not to make changes to existing scripts beyond adding this capability. For any scripts written with positional references, adding the path to the front would involve shifting all those references.

        Maybe add another feature (incorporated into Pig syntax, or a UDF such as GETLASTCOLUMN()) to be able to parse the last column from a Tuple? That actually makes for a nice feature which can be used in cases where the schema is unknown/variable.

        3. Will make the code style and Javadoc changes.

        Prashant Kommireddi added a comment -

        Dmitriy,

        I was thinking some more about storing the schema. In what case would the schema for "source path" need to be registered under the hood? I would assume a user will always need to reference "source path" before using it, in which case storing the schema would occur as it does currently?

        Dmitriy V. Ryaboy added a comment -

        Prashant, I am thinking of the case when a loaded schema is something like (a:int, b:int) but, due to loading with "using PigStorage('\t', '-useSchema -pig.source.tagging=true')", the schema expected by the user is (a:int, b:int, source_tag:chararray). Since the loader doesn't report this modified schema, the user won't be able to access the new field. I suspect the regression wasn't caught because you didn't test both options combined, only separately.

        This should "just work" on the storage side, as opposed to the loader side; I don't think there's a problem there as long as the loader side is fixed.

        Regarding position of the tag – I really think putting it in the beginning is better. As I described above, putting it on the end leads to straight-up unpredictable results in some circumstances; avoiding that situation takes precedence (in my mind) over convenience of modifying existing scripts (which will need to be modified anyway to take advantage of this.. so in for a penny, in for a pound).
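
        To make the positional-shift trade-off concrete, here is a sketch of how an existing script would change if the tag were prepended rather than appended (aliases and types are illustrative):

        ```pig
        -- Without tagging: $0 and $1 are the data columns.
        A = LOAD 'input' USING PigStorage('\t');
        B = FOREACH A GENERATE (int)$0 AS col1, (long)$1 AS col2;

        -- With the tag prepended, every positional reference shifts by one:
        A2 = LOAD 'input' USING PigStorage('\t', '-tagsource');
        B2 = FOREACH A2 GENERATE (chararray)$0 AS source_tag,
                                 (int)$1 AS col1, (long)$2 AS col2;
        ```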

        Prashant Kommireddi added a comment -

        This gets tricky now (if I understand correctly). Consider the following sequence (note that it assumes the current approach of appending source_tag to the end):

        1. A = load 'input' using PigStorage('\t', '-tagsource');    // schema is undefined at this point
        
        2. B = FOREACH A GENERATE (int)$0 as col1, (long)$1 as col2, (chararray)$2 as source_tag; // schema: (col1:int, col2:long, source_tag:chararray)
        
        3. C = GROUP B BY source_tag;
        
        .
        .
        .
        .
        
        n. STORE N INTO 'intermediate' using PigStorage('\t', '-schema');  // Schema, let's say: (source_tag:chararray, cnt:long)
        

        Now, the user wants to read 'intermediate' using schema, and also know the (new) source path(s).

        --Mentioning -schema is not required here, included just for clarity.
        A = load 'intermediate' using PigStorage('\t', '-schema -tagsource');  //Schema: (source_tag:chararray, cnt:long, source_tag:chararray)
        

        There would be a conflict in auto-loading 'source_tag' in the above case. I think decoupling 'schema' from 'tagsource' would be a nice alternative, as the input path is not part of the "real data": it is a derived field that could be treated differently from the actual data contained within the input files. So the user always expects the right schema for the first n-1 columns, with the nth column being the source_tag, for which the schema does not really need to be auto-loaded? Similar to how it would work if one had extended PigStorage to implement source tagging.

        I completely agree with you on the pain of appending source_tag to the end; it's less predictable than putting it at the start. However, things would get complicated maintenance-wise when users want to switch between using and not using source tagging. It would be great to minimize reference-repositioning changes for production jobs (error-prone, and it might result in a large number of script changes if fields are not referenced via an alias).

        Lastly, I am leaning towards the 'append' approach but am fine with either one. We just need to make sure this is an easy feature to use and adopt.

        Richard Ding added a comment -

        Prashant,

        Thanks for working on this.

        Here are a couple of things we need:

        1. Make sure that the source tag is in the loader's output schema
        2. Avoid data bloat by not simply appending the paths

        -Richard

        Prashant Kommireddi added a comment -

        Hi Richard,

        What do you mean by "Avoid data bloat by not simply appending the paths" ?

        Dmitriy, what do you think about my comments above?

        Dmitriy V. Ryaboy added a comment -

        So the user always expects the right schema for first n-1 columns, with nth column being the source_tag for which schema does not really need to be auto-loaded?

        I don't think it works that way. You can't access a field that's not declared in a non-null schema.

        Prashant Kommireddi added a comment -

        In that case, I will plan on adding the source tag at the start. Auto-loading 'source_tag' when a schema is needed can be done; however, we need to think about the case when there might be a conflict between the auto-loaded 'source_tag' and a 'source_tag' already present in the schema.

        One way is to look for source_tag in the schema and, if present, append an id to the auto-loaded source_tag, something like 'source_tag_001'. However, this is not ideal performance-wise and will cause confusion.

        Daniel Dai added a comment -

        Here are my thoughts:
        1. As Dmitriy said, we need to include this column in getSchema.
        2. Hive names this column INPUT_FILE_NAME; we can follow that convention, no need to invent something new.
        3. Usually users don't care about the input file name for intermediate files, so name conflicts should be rare. I would rather fail when this happens to keep the patch simpler (I believe the parser already does the name-conflict check).
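
        Following Daniel's naming suggestion, and assuming the tag column is reported through getSchema as discussed in point 1, a script could then reference it by name rather than by position (a sketch; aliases are illustrative):

        ```pig
        A = LOAD 'logs' USING PigStorage('\t', '-tagsource');
        -- With INPUT_FILE_NAME included in the loader's schema,
        -- it can be referenced by name:
        B = GROUP A BY INPUT_FILE_NAME;
        C = FOREACH B GENERATE group AS input_file, COUNT(A) AS records;
        ```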

        Prashant Kommireddi added a comment -

        Thanks Daniel. I have a patch ready as per the above discussion.

        Prashant Kommireddi added a comment -

        Can any of the reviewers please take a look at the patch when possible?

        Daniel Dai added a comment -

        I will take a look today or tomorrow.

        Daniel Dai added a comment -

        Looks good. We do need test cases.

        Prashant Kommireddi added a comment -

        Thanks Daniel, I will put up a patch with test cases soon.

        Prashant Kommireddi added a comment -

        Daniel, I have attached test cases on this patch.

        Daniel Dai added a comment -

        Unit tests pass. test-patch:
        [exec] -1 overall.
        [exec]
        [exec] +1 @author. The patch does not contain any @author tags.
        [exec]
        [exec] +1 tests included. The patch appears to include 3 new or modified tests.
        [exec]
        [exec] -1 javadoc. The javadoc tool appears to have generated 1 warning messages.
        [exec]
        [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
        [exec]
        [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
        [exec]
        [exec] -1 release audit. The applied patch generated 533 release audit warnings (more than the trunk's current 530 warnings).

        The javadoc and release audit warnings are unrelated.

        Patch committed to trunk, thanks Prashant!

        Prashant Kommireddi added a comment -

        Daniel, can we include this in the 0.10 release? I am trying to avoid a custom build just for this patch if it can be part of 0.10. This is a really useful feature, and it would be great to have it out for users sooner rather than later.

        Daniel Dai added a comment -

        Committed to the 0.10 branch as requested by Prashant. Also committed the documentation update.

        Prashant Kommireddi added a comment -

        Thanks Daniel!


          People

          • Assignee:
            Prashant Kommireddi
            Reporter:
            Richard Ding
          • Votes:
            0
            Watchers:
            3
