Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-5231

PigStorage with -schema may produce inconsistent outputs with more fields

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      When multiple directories are passed to PigStorage(',','-schema'), pig will

      No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.

      For two directories input with schema
      file1: (f1:chararray, f2:int) and
      file2: (f1:chararray, f2:int, f3:int)

      Pig will pick the first schema from file1 and only allow f1, f2 access.
      However, output would still contain 3 fields for tuples from file2. This later leads to complete corrupt outputs due to shifted fields resulting in incorrect references.
      (This may also happen when input itself contains the delimiter.)

      If file2 schema is picked, this is already handled by filling the missing fields with null. (PIG-3100)

        Attachments

        1. pig-5231-v01.patch
          3 kB
          Koji Noguchi

          Activity

            People

            • Assignee:
              knoguchi Koji Noguchi
              Reporter:
              knoguchi Koji Noguchi
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: