PIG-3445

Make Parquet format available out of the box in Pig

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.12.0
    • Component/s: None
    • Labels:
      None

      Description

      We would add the Parquet jar to the Pig packages to make it available out of the box to Pig users.
      On top of that, we could add the parquet.pig package to the list of packages searched for UDFs. (Alternatively, the Parquet jar could contain classes named org.apache.pig.builtin.ParquetLoader and ParquetStorer.)
      This way users can use Parquet simply by typing:
      A = LOAD 'foo' USING ParquetLoader();
      STORE A INTO 'bar' USING ParquetStorer();
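
To illustrate why adding parquet.pig to the UDF search list lets a script say ParquetLoader() without a fully qualified name, here is a minimal sketch of prefix-based class resolution. It is not Pig's actual resolver: the class name UdfImportResolver and its method are hypothetical, and the demo uses JDK packages so that it runs without Pig or Parquet on the classpath.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper that mimics an import-list lookup; not Pig's real code. */
final class UdfImportResolver {
    private final List<String> packagePrefixes;

    UdfImportResolver(List<String> packagePrefixes) {
        this.packagePrefixes = new ArrayList<>(packagePrefixes);
    }

    /** Tries each prefix in order and returns the first class that can be loaded. */
    Optional<Class<?>> resolve(String shortName) {
        ClassLoader loader = UdfImportResolver.class.getClassLoader();
        for (String prefix : packagePrefixes) {
            try {
                // 'false' avoids running static initializers during the lookup.
                Class<?> found = Class.forName(prefix + shortName, false, loader);
                return Optional.<Class<?>>of(found);
            } catch (ClassNotFoundException e) {
                // Not in this package; try the next prefix.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // With Pig and Parquet on the classpath, the list would instead contain
        // prefixes such as "org.apache.pig.builtin." and "parquet.pig.".
        UdfImportResolver resolver =
                new UdfImportResolver(Arrays.asList("java.lang.", "java.util."));
        System.out.println(resolver.resolve("ArrayList"));
        System.out.println(resolver.resolve("NoSuchUdf"));
    }
}
```

Prefixes are tried in list order, so in a scheme like this an earlier package would shadow a later one that contains a class with the same short name.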

      1. PIG-3445-5.patch
        27 kB
        Lorand Bendig
      2. PIG-3445-4.patch
        28 kB
        Lorand Bendig
      3. PIG-3445-3.patch
        28 kB
        Lorand Bendig
      4. PIG-3445-2.patch
        13 kB
        Lorand Bendig
      5. PIG-3445.patch
        4 kB
        Lorand Bendig

        Activity

        Aniket Mokashi added a comment -

        Committed to trunk and Pig-0.12. Thanks Lorand Bendig and Julien Le Dem.

        Lorand Bendig added a comment -

        Patch modified to use parquet-pig-bundle

        Lorand Bendig added a comment -

        I have the modified patch for parquet-pig-bundle, but I'd like to attach it when it becomes visible in maven central,
        just to be sure.

        Daniel Dai added a comment -

        Great, thanks!

        Julien Le Dem added a comment -

        I just released parquet-pig-bundle-1.2.3.
        It should show up in Maven Central overnight.

        Daniel Dai added a comment -

        Hi, Julien Le Dem, I am trying to roll a Pig 0.12.0 RC tomorrow, can we get it done by then?

        Julien Le Dem added a comment -

        We merged the PR for parquet-pig-bundle.
        I'm making a release so that this can be merged into Pig 0.12.

        Julien Le Dem added a comment -


        parquet-format.version should be 1.0.0

        Julien Le Dem added a comment -

        I added a parquet-pig-bundle and the shading of fastutil:
        https://github.com/Parquet/parquet-mr/pull/186
        We can make a new release to simplify things.

        Lorand Bendig added a comment -

        Dmitriy V. Ryaboy Thank you.
        Well, yes, ParquetUtil is a general utility, so I merged it into JarManager.

        Dmitriy V. Ryaboy added a comment -

        That's a great addition, thanks Lorand.

        The code looks really tidy now.

        Looks like ParquetUtil is actually a general utility? Maybe add that functionality to org.apache.pig.impl.util.JarManager or something along those lines?

        Julien Le Dem do we need to publish a new artifact version so fastutil isn't required for dictionary encoding?

        Lorand Bendig added a comment -

        Dmitriy V. Ryaboy Thanks for pointing this out, I was not aware of this class.
        AFAICS there were no wrappers for LoadFunc, so I added them too.

        Dmitriy V. Ryaboy added a comment -

        Lorand Bendig, might it be more succinct to use StoreFuncWrapper?

        Lorand Bendig added a comment -

        This patch attempts to address the wrapper approach.
        Remarks:

        • parquet jars are taken as compile-time dependencies
        • The wrapper loader/storer classes ship the parquet jars to tmpjars from
          the classpath (using PigContext::addJar would probably be better, but how
          can it be retrieved in the LoadFunc?)
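
The second remark can be sketched roughly as follows, assuming the usual Hadoop convention that tmpjars holds a comma-separated list of jar locations. This is not the code from the patch: TmpJarsSketch and its methods are hypothetical names, and the sketch uses only the JDK, so it builds the property value as a string instead of writing it to a real job configuration.

```java
import java.security.CodeSource;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of "find the jar on the classpath, add it to tmpjars". */
final class TmpJarsSketch {

    /** Returns the jar (or class directory) a class was loaded from, if known. */
    static Optional<String> locationOf(Class<?> clazz) {
        CodeSource source = clazz.getProtectionDomain().getCodeSource();
        if (source == null || source.getLocation() == null) {
            return Optional.empty(); // typical for JDK bootstrap classes
        }
        return Optional.of(source.getLocation().toString());
    }

    /** Appends jars to an existing comma-separated list, skipping duplicates. */
    static String appendJars(String existing, Iterable<String> jars) {
        Set<String> merged = new LinkedHashSet<>();
        if (existing != null && !existing.isEmpty()) {
            merged.addAll(Arrays.asList(existing.split(",")));
        }
        for (String jar : jars) {
            merged.add(jar);
        }
        return String.join(",", merged);
    }

    public static void main(String[] args) {
        // A wrapper loader would look up a Parquet class here; this demo uses
        // its own class so that it runs without Parquet on the classpath.
        Set<String> found = new LinkedHashSet<>();
        locationOf(TmpJarsSketch.class).ifPresent(found::add);
        System.out.println(appendJars("file:/already/shipped.jar", found));
    }
}
```

Doing the lookup inside the wrapper means the jars are shipped only for jobs that actually use the loader or storer.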
        Dmitriy V. Ryaboy added a comment -

        Other loaders like csv, avro, json, xml, etc (even RC, though it's in piggybank due to heavy dependencies and lack of support) are all in already so I don't see this as unfair, but as consistent.
        Not packaging the pq jars into pig monojar and instead adding them, the way we add guava et al for hbase, sounds like a good idea.
        Julien Le Dem should we do that by providing a simple wrapper in pig builtins, or by messing with the job conf in parquet's own loader/storer?

        Daniel Dai added a comment -

        Size may be one thing, but still, doing a favor for Parquet sounds unfair to other loaders. Is it possible to push the jar dependency logic into the LoadFunc, so that the jars are shipped to the backend only when the LoadFunc is used?

        Dmitriy V. Ryaboy added a comment -

        The size of the dependency introduced by this is orders of magnitude smaller than the HBase (or Avro) one, since everything comes from a single project (unlike HBase's liberal use of guava, metric, ZK, and everything else under the sun). The total size is less than 1 meg.

        Can we add parquet.pig to udf import list in the same patch?

        Lorand Bendig added a comment -

        Yes, that's definitely a drawback of this patch.
        Is it an option here to utilize pig.additional.jars and udf.import.list?
        If so, I can think of the following:
        pig.properties:

        pig.additional.jars.parquet.column=/path/to/parquet-column.jar
        pig.additional.jars.parquet.common=
        pig.additional.jars.parquet.encoding=
        ...
        
        or: pig.additional.jars.parquet=parquet-column.jar:parquet-common.jar
        
        udf.import.list.parquet=parquet.pig.
        

        At the point where 3rd party jars and import packages are initialized, additional code could take care of these grouped properties. If some checks (which can be defined per group) succeed, such as valid paths, etc., then these props would be merged into pig.additional.jars and udf.import.list.
        The rest is the same as before.
        However, this might be a silly solution that does not address all the issues that can arise; I'm curious whether it can be an option.
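
The merge step described above could look something like the following. It is only a sketch of the idea, not code from any of the attached patches: GroupedPropertyMerger is a hypothetical name, the colon separator is assumed from the example in this comment, and the per-group validity checks are left out.

```java
import java.util.LinkedHashSet;
import java.util.Properties;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: fold "<base>.<group>" properties into "<base>". */
final class GroupedPropertyMerger {

    /** Merges every "<base>.<anything>" value into the plain "<base>" property. */
    static void merge(Properties props, String base) {
        Set<String> values = new LinkedHashSet<>();
        addAll(values, props.getProperty(base));
        // Sorted keys give a deterministic merge order.
        for (String key : new TreeSet<>(props.stringPropertyNames())) {
            if (key.startsWith(base + ".")) {
                addAll(values, props.getProperty(key));
            }
        }
        if (!values.isEmpty()) {
            props.setProperty(base, String.join(":", values));
        }
    }

    /** Splits a colon-separated value and collects its non-empty entries. */
    private static void addAll(Set<String> out, String colonSeparated) {
        if (colonSeparated == null) {
            return;
        }
        for (String part : colonSeparated.split(":")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("pig.additional.jars.parquet",
                "parquet-column.jar:parquet-common.jar");
        props.setProperty("udf.import.list.parquet", "parquet.pig.");
        merge(props, "pig.additional.jars");
        merge(props, "udf.import.list");
        System.out.println(props.getProperty("pig.additional.jars"));
        System.out.println(props.getProperty("udf.import.list"));
    }
}
```

Splitting on a colon would break entries that themselves contain colons, such as URIs, so a real implementation would need a more careful parser.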

        Daniel Dai added a comment -

        This reminds me of a similar ticket for HBase, PIG-3285. I am not sure that packing a bunch of jars for a new loader is a good idea. I am not objecting to the patch, but it seems we need a better solution for that in the future.

        Lorand Bendig added a comment -

        This patch adds the parquet-pig related packages to the pig-withouthadoop and pig-withdependencies jars and parquet.pig is added to the import search path.


          People

          • Assignee:
            Lorand Bendig
            Reporter:
            Julien Le Dem
          • Votes:
            0
            Watchers:
            13
