PIG-2541

Automatic record provenance (source tagging) for PigStorage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Adds a new option, -tagsource, to PigStorage. With this flag, the input file name (INPUT_FILE_NAME) is returned as the first column of each record the loader produces. For example:

      a = load '1.txt' using PigStorage('\t', '-tagsource');

      Description

      There is a lot of interest in knowing where data comes from when loading from a directory (or a set of directories). This can be done manually (see https://cwiki.apache.org/confluence/display/PIG/FAQ), but it would be more convenient for users if PigStorage implemented it, with a command-line option (e.g., pig.source.tagging=true/false) to turn it on or off. It would be off by default.
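
      To make the request concrete, here is a minimal Python sketch of what source tagging means for delimited text files. It is not Pig code and not the patch attached to this issue. The function name load_with_source_tag, the demo helper, and the choice to tag with the file's base name rather than its full path are all assumptions made for illustration.

```python
import os
import tempfile


def load_with_source_tag(paths, delimiter="\t"):
    """Yield one list of fields per input line, with the source file name first.

    Hypothetical helper: it mimics the idea of source tagging, not PigStorage itself.
    """
    for path in paths:
        # Assumption: tag with the base name only, not the full path.
        source = os.path.basename(path)
        with open(path, encoding="utf-8") as handle:
            for line in handle:
                fields = line.rstrip("\n").split(delimiter)
                yield [source] + fields


def demo():
    """Write two small tab-delimited files and load them with source tags."""
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, body in [("1.txt", "a\t1\nb\t2\n"), ("2.txt", "c\t3\n")]:
            path = os.path.join(tmp, name)
            with open(path, "w", encoding="utf-8") as handle:
                handle.write(body)
            paths.append(path)
        return list(load_with_source_tag(paths))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

      Every record carries its origin as an ordinary first column, so downstream steps can filter or group by source file without any extra bookkeeping.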

      1. PIG-2541.patch
        3 kB
        Prashant Kommireddi
      2. PIG-2541_2.patch
        7 kB
        Prashant Kommireddi
      3. PIG-2541_3.patch
        11 kB
        Prashant Kommireddi
      4. PIG-2541.doc.patch
        1 kB
        Daniel Dai

        Activity

        Richard Ding created issue -
        Prashant Kommireddi made changes -
        Field Original Value New Value
        Attachment PIG-2541.patch [ 12514946 ]
        Prashant Kommireddi made changes -
        Assignee Prashant Kommireddi [ prkommireddi ]
        Prashant Kommireddi made changes -
        Attachment PIG-2541_2.patch [ 12515799 ]
        Prashant Kommireddi made changes -
        Attachment PIG-2541_3.patch [ 12516248 ]
        Daniel Dai made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Hadoop Flags Reviewed [ 10343 ]
        Release Note We add a new option -tagsource to PigStorage. With this flag, we can get the INPUT_FILE_NAME as the first column of the output data. eg:

        a = load '1.txt' using PigStorage('\t', '-tagsource');
        Fix Version/s 0.11 [ 12318878 ]
        Resolution Fixed [ 1 ]
        Daniel Dai made changes -
        Attachment PIG-2541.doc.patch [ 12521916 ]
        Daniel Dai made changes -
        Fix Version/s 0.10.0 [ 12316246 ]
        Daniel Dai made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Prashant Kommireddi
            Reporter:
            Richard Ding
          • Votes:
            0
            Watchers:
            3
