Pig / PIG-2143

Make PigStorage optionally store schema; improve docs.

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Documentation has been updated to reflect reality.

      An optional second constructor argument is provided that allows one to customize advanced behaviors. A list of available options is below:

      -schema Stores the schema of the relation using a hidden JSON file.
      -noschema Ignores a stored schema during loading.

      Schemas
      If -schema is specified, a hidden ".pig_schema" file is created in the output directory when storing data. It is used by PigStorage (with or without -schema) during loading to determine the field names and types of the data without the need for a user to explicitly provide the schema in an as clause, unless -noschema is specified. No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.
      In addition, using -schema drops a ".pig_headers" file in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data).
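
      As a minimal sketch of the behavior described above, the script below stores a relation with -schema and then loads it back with and without the stored schema. The paths ('input.txt', 'out') and the field names are hypothetical and chosen only for illustration.

      ```pig
      -- Hypothetical input path and fields, for illustration only.
      A = LOAD 'input.txt' USING PigStorage(',') AS (name:chararray, age:int);

      -- The second constructor argument carries the options; -schema asks
      -- PigStorage to write the hidden schema file alongside the data.
      STORE A INTO 'out' USING PigStorage(',', '-schema');

      -- No AS clause: the stored schema, if one is found, supplies names and types.
      B = LOAD 'out' USING PigStorage(',');
      DESCRIBE B;

      -- -noschema ignores any stored schema, so fields are untyped and unnamed
      -- unless an AS clause is given.
      C = LOAD 'out' USING PigStorage(',', '-noschema');
      DESCRIBE C;
      ```

      The STORE and the subsequent LOADs would normally run as separate scripts, since the output directory must exist before it can be read.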

      Note that regardless of whether or not you store the schema, you always need to specify the correct delimiter to read your data. If you store using delimiter "#" and then load using the default delimiter, your data will not be parsed correctly.
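
      The delimiter caveat can be sketched as follows, again with a hypothetical relation A and output path 'out_hash'. The stored schema records field names and types, not the delimiter, so the delimiter has to be repeated on load.

      ```pig
      -- Store with '#' as the field delimiter and keep the schema.
      STORE A INTO 'out_hash' USING PigStorage('#', '-schema');

      -- Matching delimiter: lines are split on '#' as intended.
      GOOD = LOAD 'out_hash' USING PigStorage('#');

      -- Default delimiter (tab): lines are not split on '#',
      -- so the data will not be parsed correctly.
      BAD = LOAD 'out_hash' USING PigStorage();
      ```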

      Description

      I'd like to propose that we allow for a greater degree of customization in PigStorage.

      An incomplete list of features that we might want to add:

      • flag to tell it to overwrite existing output if it exists
      • flag to tell it to compress output using gzip|bzip|lzo (currently this can be achieved by setting the directory name to end in .gz or .bz2, which is a bit awkward)
      • flag to tell it to store the schema and header (perhaps by merging in PigStorageSchema work?)
      1. PIG-2143.2.diff
        59 kB
        Dmitriy V. Ryaboy
      2. PIG-2143.3.patch
        73 kB
        Dmitriy V. Ryaboy
      3. PIG-2143.4.patch
        73 kB
        Dmitriy V. Ryaboy
      4. PIG-2143.5.patch
        73 kB
        Dmitriy V. Ryaboy
      5. PIG-2143.diff
        35 kB
        Dmitriy V. Ryaboy

            People

            • Assignee:
              Dmitriy V. Ryaboy
              Reporter:
              Dmitriy V. Ryaboy
            • Votes:
              0
              Watchers:
              3
