PIG-2541: Automatic record provenance (source tagging) for PigStorage

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.1
    • Fix Version/s: 0.10.0, 0.11
    • Component/s: impl
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      We add a new option, -tagsource, to PigStorage. With this flag set, the INPUT_FILE_NAME is returned as the first column of the output data, e.g.:

      a = load '1.txt' using PigStorage('\t', '-tagsource');

      Description

      There is a lot of interest in knowing where the data comes from when loading from a directory (or a set of directories). This can be done manually (see https://cwiki.apache.org/confluence/display/PIG/FAQ), but it would be more convenient for users if PigStorage implemented it directly, with a command-line option (e.g., pig.source.tagging=true/false) to turn it on or off. It would be off by default.
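As an illustration of the directory use case described above, the following is a minimal sketch that uses the -tagsource option from the release note rather than the pig.source.tagging property proposed here. The directory name 'logs' and the aliases raw, by_file, and counts are hypothetical, and the first field is referenced by position ($0) so the script makes no assumption about the rest of the schema.

```pig
-- Hypothetical input: a directory 'logs' holding several tab-separated files.
-- With '-tagsource', PigStorage prepends the source file name as field $0.
raw = LOAD 'logs' USING PigStorage('\t', '-tagsource');

-- Group records by the file they were read from.
by_file = GROUP raw BY $0;

-- Emit one row per source file with its record count.
counts = FOREACH by_file GENERATE group AS source, COUNT(raw) AS n;

DUMP counts;
```

Referencing the tag by position keeps the sketch independent of the data layout; in a real script the remaining fields would typically be named with an AS clause or a schema file.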

        Attachments

        1. PIG-2541.patch (3 kB, Prashant Kommireddi)
        2. PIG-2541_2.patch (7 kB, Prashant Kommireddi)
        3. PIG-2541_3.patch (11 kB, Prashant Kommireddi)
        4. PIG-2541.doc.patch (1 kB, Daniel Dai)

            People

            • Assignee: Prashant Kommireddi (prkommireddi)
            • Reporter: Richard Ding (rding)
            • Votes: 0
            • Watchers: 3
