Pig
  1. Pig
  2. PIG-2143

Make PigStorage optionally store schema; improve docs.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.10.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      Hide
      Documentation has been updated to reflect reality.

      An optional second constructor argument is provided that allows one to customize advanced behaviors. A list of available options is below:

      -schema Stores the schema of the relation using a hidden JSON file.
      -noschema Ignores a stored schema during loading.

      Schemas
      If -schema is specified, a hidden ".pig_schema" file is created in the output directory when storing data. It is used by PigStorage (with or without -schema) during loading to determine the field names and types of the data without the need for a user to explicitly provide the schema in an as clause, unless -noschema is specified. No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used.
      In addition, using -schema drops a ".pig_headers" file in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data).

      Note that regardless of whether or not you store the schema, you always need to specify the correct delimiter to read your data. If you store reading delimiter "#" and then load using the default delimiter, your data will not be parsed correctly.

      Show
      Documentation has been updated to reflect reality. An optional second constructor argument is provided that allows one to customize advanced behaviors. A list of available options is below: -schema Stores the schema of the relation using a hidden JSON file. -noschema Ignores a stored schema during loading. Schemas If -schema is specified, a hidden ".pig_schema" file is created in the output directory when storing data. It is used by PigStorage (with or without -schema) during loading to determine the field names and types of the data without the need for a user to explicitly provide the schema in an as clause, unless -noschema is specified. No attempt to merge conflicting schemas is made during loading. The first schema encountered during a file system scan is used. In addition, using -schema drops a ".pig_headers" file in the output directory. This file simply lists the delimited aliases. This is intended to make export to tools that can read files with header lines easier (just cat the header to your data). Note that regardless of whether or not you store the schema, you always need to specify the correct delimiter to read your data. If you store reading delimiter "#" and then load using the default delimiter, your data will not be parsed correctly.

      Description

      I'd like to propose that we allow for a greater degree of customization in PigStorage.

      An incomplete list features that we might want to add:

      • flag to tell it to overwrite existing output if it exists
      • flag to tell it to compress output using gzip|bzip|lzo (currently this can be achieved by setting the directory name to end in .gz or .bz2, which is a bit awkward)
      • flag to tell it to store the schema and header (perhaps by merging in PigStorageSchema work?)
      1. PIG-2143.diff
        35 kB
        Dmitriy V. Ryaboy
      2. PIG-2143.5.patch
        73 kB
        Dmitriy V. Ryaboy
      3. PIG-2143.4.patch
        73 kB
        Dmitriy V. Ryaboy
      4. PIG-2143.3.patch
        73 kB
        Dmitriy V. Ryaboy
      5. PIG-2143.2.diff
        59 kB
        Dmitriy V. Ryaboy

        Issue Links

          Activity

          Hide
          Dmitriy V. Ryaboy added a comment -

          In order to make this kind of thing work as more and more options are added, I suggest we add a single optional second argument, which specifies all the flags / options, similar to how we did it in HBaseStorage.

          So the invocation would be something like this:

          store mystuff into 'myplace' using PigStorage('|', '-compress=lzo -storeSchema -overwrite');
          
          Show
          Dmitriy V. Ryaboy added a comment - In order to make this kind of thing work as more and more options are added, I suggest we add a single optional second argument, which specifies all the flags / options, similar to how we did it in HBaseStorage. So the invocation would be something like this: store mystuff into 'myplace' using PigStorage('|', '-compress=lzo -storeSchema -overwrite');
          Hide
          Dmitriy V. Ryaboy added a comment -

          Attached patch adds the code for treating an optional second parameter as a list of arguments.
          It also moves PigStorageSchema code into PigStorage, controlled by a '-schema' flag, and deprecates PigStorageSchema (no plans to remove it completely yet).

          I plan to add the compression and overwrite flags as well in this ticket, consider this an early draft.

          Show
          Dmitriy V. Ryaboy added a comment - Attached patch adds the code for treating an optional second parameter as a list of arguments. It also moves PigStorageSchema code into PigStorage, controlled by a '-schema' flag, and deprecates PigStorageSchema (no plans to remove it completely yet). I plan to add the compression and overwrite flags as well in this ticket, consider this an early draft.
          Hide
          Dmitriy V. Ryaboy added a comment -

          I did some work on the compression thing, and had a second thought. Let's keep these changes isolated from each other. This code is already pretty baked (in PigStorageSchema) so it's probably better to make the other changes separately. Marking this as ready for review; will open separate tickets for the other improvements.

          Show
          Dmitriy V. Ryaboy added a comment - I did some work on the compression thing, and had a second thought. Let's keep these changes isolated from each other. This code is already pretty baked (in PigStorageSchema) so it's probably better to make the other changes separately. Marking this as ready for review; will open separate tickets for the other improvements.
          Hide
          Raghu Angadi added a comment -

          +1 looks good.

          • We could remove most of PigStorageSchema.java by invoking super(delimiter, "-schema");
            • this will also let us remove JsonMetadata.java and tests for PigStorageSchema.
          Show
          Raghu Angadi added a comment - +1 looks good. We could remove most of PigStorageSchema.java by invoking super(delimiter, "-schema"); this will also let us remove JsonMetadata.java and tests for PigStorageSchema.
          Hide
          Thejas M Nair added a comment -

          In PigStorage.applySchema, the schema is being deserialised in every call. It needs to be done only the first time.

          Some minor things -

          formatting of a curly braces in the functions - PigStorage(String delimiter, String options) and applySchema(Tuple tup) are off .

          I ran test-patch, and there was a new javadoc warning -

          > [javac] /tmp/trunk/src/org/apache/pig/builtin/JsonMetadata.java:94: warning: [deprecation] fullPath(java.lang.String,org.apache.pig.backend.datastorage.DataStorage) in org.apache.pig.impl.io.FileLocalizer has been deprecated
          > [javac] String fullPath = FileLocalizer.fullPath(path, storage);

          There is also a javac warning from JsonMetadata.java, but that can be addressed separately (different jira) since JsonMetadata.java is just copied from existing file.
          /tmp/trunk/src/org/apache/pig/builtin/JsonMetadata.java:191: warning - Tag @see: can't find getStatistics(String, Configuration) in org.apache.pig.LoadMetadata

          Other thoughts (we can also track these in different jira's)

          • If PigStorage without parameters is used (ie same as specifying no load func) on data that has been stored with '-schema'. Should the schema and delim from metadata be used ?
          • PigStorage is used with delim on file stored with '-schema', should it throw error if the delim in metadata file is different ? or warn and just use the delim specified in metadata file ?
          Show
          Thejas M Nair added a comment - In PigStorage.applySchema, the schema is being deserialised in every call. It needs to be done only the first time. Some minor things - formatting of a curly braces in the functions - PigStorage(String delimiter, String options) and applySchema(Tuple tup) are off . I ran test-patch, and there was a new javadoc warning - > [javac] /tmp/trunk/src/org/apache/pig/builtin/JsonMetadata.java:94: warning: [deprecation] fullPath(java.lang.String,org.apache.pig.backend.datastorage.DataStorage) in org.apache.pig.impl.io.FileLocalizer has been deprecated > [javac] String fullPath = FileLocalizer.fullPath(path, storage); There is also a javac warning from JsonMetadata.java, but that can be addressed separately (different jira) since JsonMetadata.java is just copied from existing file. /tmp/trunk/src/org/apache/pig/builtin/JsonMetadata.java:191: warning - Tag @see: can't find getStatistics(String, Configuration) in org.apache.pig.LoadMetadata Other thoughts (we can also track these in different jira's) If PigStorage without parameters is used (ie same as specifying no load func) on data that has been stored with '-schema'. Should the schema and delim from metadata be used ? PigStorage is used with delim on file stored with '-schema', should it throw error if the delim in metadata file is different ? or warn and just use the delim specified in metadata file ?
          Hide
          Dmitriy V. Ryaboy added a comment -

          Thanks for the reviews.

          Uploading a patch that fixes the repeated deserialization (nice catch!), adjusts whitespace, and makes the piggybank stuff shallow deprecated proxies for the builtins.

          I am not sure if loading the schema when it was created but isn't being requested is a good idea.. can see arguments both ways.

          I do think we should allow loading with a different delimiter than that set in the schema.

          Show
          Dmitriy V. Ryaboy added a comment - Thanks for the reviews. Uploading a patch that fixes the repeated deserialization (nice catch!), adjusts whitespace, and makes the piggybank stuff shallow deprecated proxies for the builtins. I am not sure if loading the schema when it was created but isn't being requested is a good idea.. can see arguments both ways. I do think we should allow loading with a different delimiter than that set in the schema.
          Hide
          Dmitriy V. Ryaboy added a comment -

          resubmitting patch

          Show
          Dmitriy V. Ryaboy added a comment - resubmitting patch
          Hide
          Raghu Angadi added a comment -

          PigStorageSchema is not setting "-schema" argument.

          > I am not sure if loading the schema when it was created but isn't being requested is a good idea..

          +1 for using the existing schema. May be we could add '-ignoreSchema' flag.

          > PigStorage is used with delim on file stored with '-schema', should it throw error if the delim in metadata file is different ? or warn and just use the delim specified in metadata file ?

          +1 for throwing an error. If we have -ignoreSchema, it will let users handle some rare cases.

          Show
          Raghu Angadi added a comment - PigStorageSchema is not setting "-schema" argument. > I am not sure if loading the schema when it was created but isn't being requested is a good idea.. +1 for using the existing schema. May be we could add '-ignoreSchema' flag. > PigStorage is used with delim on file stored with '-schema', should it throw error if the delim in metadata file is different ? or warn and just use the delim specified in metadata file ? +1 for throwing an error. If we have -ignoreSchema, it will let users handle some rare cases.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Canceling patch to incorporate the feedback.

          Show
          Dmitriy V. Ryaboy added a comment - Canceling patch to incorporate the feedback.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Submitting a new version of the patch.

          • added a bunch of javadocs
          • added "-noschema" to turn off loading of schema
          • removed the compression code accidentally introduced in the previous version – saving that for another ticket
          • did NOT fix the delimiter issue. That's because we currently don't store the delimiter when storing the schema! We just use it to delimit the header file. There is no clean way to store this kind of information, we need to think about the proper design for this.
          Show
          Dmitriy V. Ryaboy added a comment - Submitting a new version of the patch. added a bunch of javadocs added "-noschema" to turn off loading of schema removed the compression code accidentally introduced in the previous version – saving that for another ticket did NOT fix the delimiter issue. That's because we currently don't store the delimiter when storing the schema! We just use it to delimit the header file. There is no clean way to store this kind of information, we need to think about the proper design for this.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Recreated the patch, this time using git --no-prefix so it'll apply for svn users.

          Show
          Dmitriy V. Ryaboy added a comment - Recreated the patch, this time using git --no-prefix so it'll apply for svn users.
          Hide
          Raghu Angadi added a comment -

          Thanks for the detailed javadoc. noticed the compression changes in the prev patch.

          PigStorageSchema class does not set "-schema" option. Is that correct?

          didn't know .pig_schema didn't store the delimiter.

          Show
          Raghu Angadi added a comment - Thanks for the detailed javadoc. noticed the compression changes in the prev patch. PigStorageSchema class does not set "-schema" option. Is that correct? didn't know .pig_schema didn't store the delimiter.
          Hide
          Thejas M Nair added a comment -

          Thanks for adding the comprehensive documentation and fixing the incorrect old one!

          Review of PIG-2143.4.patch -

          • In PigStorage.getSchema(..) , it should check for (!dontLoadSchema) for deciding if the schema file should be read. (instead of (storeschema) ).
          • A test case where pig loads schema with the default constructor will be useful. One of the new test cases in the patch can be modified for this. I think we need one for the -noschema as well.
          • In javadoc for constructor PigStorage(String delimiter, String options), the line about "-Dprop=value" can be removed as its not used right now.
          • A nitpick - In the PigStorage class javadoc, I think 'An optional second constructor' is a bit misleading. There are 3 constructors including default one, and all 3 constructors are 'optional' . Maybe calling it 'Another constructor' is better.
          Show
          Thejas M Nair added a comment - Thanks for adding the comprehensive documentation and fixing the incorrect old one! Review of PIG-2143 .4.patch - In PigStorage.getSchema(..) , it should check for (!dontLoadSchema) for deciding if the schema file should be read. (instead of (storeschema) ). A test case where pig loads schema with the default constructor will be useful. One of the new test cases in the patch can be modified for this. I think we need one for the -noschema as well. In javadoc for constructor PigStorage(String delimiter, String options), the line about "-Dprop=value" can be removed as its not used right now. A nitpick - In the PigStorage class javadoc, I think 'An optional second constructor' is a bit misleading. There are 3 constructors including default one, and all 3 constructors are 'optional' . Maybe calling it 'Another constructor' is better.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Addressed the comments.
          Added tests. They pass.

          Show
          Dmitriy V. Ryaboy added a comment - Addressed the comments. Added tests. They pass.
          Hide
          Thejas M Nair added a comment -

          PIG-2143.5.patch looks great. +1

          Show
          Thejas M Nair added a comment - PIG-2143 .5.patch looks great. +1
          Hide
          Dmitriy V. Ryaboy added a comment -

          Changed the title to reflect what we actually did in this iteration.

          Show
          Dmitriy V. Ryaboy added a comment - Changed the title to reflect what we actually did in this iteration.
          Hide
          Dmitriy V. Ryaboy added a comment -

          Committed. Thanks Raghu and Thejas for reviews.

          Show
          Dmitriy V. Ryaboy added a comment - Committed. Thanks Raghu and Thejas for reviews.
          Hide
          Daniel Dai added a comment -

          Can we put release notes on this? Thanks.

          Show
          Daniel Dai added a comment - Can we put release notes on this? Thanks.
          Hide
          Cheng Guang-Nan added a comment -

          Would like to have the -overwrite option implimented.

          Also it's ugly that while what I want is to overwrite exsting output location and I have to speicify the delimter again.

          store mystuff into 'myplace' using PigStorage('\t', '-overwrite');
          
          Show
          Cheng Guang-Nan added a comment - Would like to have the -overwrite option implimented. Also it's ugly that while what I want is to overwrite exsting output location and I have to speicify the delimter again. store mystuff into 'myplace' using PigStorage('\t', '-overwrite');
          Hide
          Dmitriy V. Ryaboy added a comment -

          Cheng, we narrowed the scope of this ticket to actually get the rewrite in – do you want to open another jira with this feature request?

          Show
          Dmitriy V. Ryaboy added a comment - Cheng, we narrowed the scope of this ticket to actually get the rewrite in – do you want to open another jira with this feature request?
          Hide
          Sylvain Menu added a comment -

          I try to launch the store command 2 times in the same query :

          pigServer.registerQuery("
          STORE data1 INTO 'store1' USING PigStorage('
          t', '-schema');
          STORE data2 INTO 'store2' USING PigStorage('
          t', '-schema');
          ");

          'store1' have his schema but not 'store2', have you got the same problem ?

          Show
          Sylvain Menu added a comment - I try to launch the store command 2 times in the same query : pigServer.registerQuery(" STORE data1 INTO 'store1' USING PigStorage(' t', '-schema'); STORE data2 INTO 'store2' USING PigStorage(' t', '-schema'); "); 'store1' have his schema but not 'store2', have you got the same problem ?

            People

            • Assignee:
              Dmitriy V. Ryaboy
              Reporter:
              Dmitriy V. Ryaboy
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development