HIVE-3585

Integrate Trevni as another columnar oriented file format

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 0.10.0
    • Fix Version/s: None
    • Labels:
      None

      Description

      Add the new Avro module Trevni as another columnar format. A new columnar format needs a columnar SerDe; fastutil seems like a good choice. The Shark project uses the fastutil library as its columnar SerDe library, but it seems too large (almost 15 MB) for just a few primitive array collections.

      Attachments

      1. HIVE-3585.3.patch.txt
        77 kB
        Edward Capriolo
      2. HIVE-3585.2.patch.txt
        77 kB
        Sean Busbey
      3. HIVE-3585.1.patch.txt
        55 kB
        Mark Wagner
      4. futurama_episodes.avro
        3 kB
        Mark Wagner

          Activity

          He Yongqiang added a comment -

          Vote for -1.

          I do not see any benefit of adding one that is just a copycat of RCFile.

          Carl Steinbach added a comment -

          @Yongqiang: Trevni aims to offer significant performance improvements over RCFile. Have you read the spec? http://avro.apache.org/docs/current/trevni/spec.html

          He Yongqiang added a comment -

          Yeah, I read some of its docs, but I really did not see a big difference. Some features can be added to RCFile easily. Please point out if you think there is a dramatic difference in some of the designs.

          Russell Jurney added a comment -

          Trevni is an independent file format. It is not RCFile. It already exists. Why is RCFile relevant to this ticket?

          He Yongqiang added a comment -

          Although it is so similar to RCFile, I did not see any reference to RCFile in its docs. I assume that helps avoid confusion for its users. But as part of Hive, if we have two formats that are so similar to each other, the confusion will be passed on to all Hive users.

          Russell Jurney added a comment -

          Can you please explain how supporting additional formats will confuse anyone? This does not seem to follow.

          Russell Jurney added a comment -

          It is worth noting that Hive already supports Avro via HIVE-895. Trevni is part of Avro. Therefore the possibility for confusion seems minimal.

          Looking at the Serde document https://cwiki.apache.org/confluence/display/Hive/SerDe it seems the Serde mechanism was created specifically for adding other storage formats to Hive. Restricting Serdes based on their similarity or dissimilarity to RCFile is not covered in this document and seems unusual.

          Pig and other tools are starting to generate Trevni data. There is demand to work with this data in Hive. RCFile seems completely unrelated to this feature JIRA, which is simply an extension of the existing Avro support in Hive.

          Mark Wagner added a comment -

          I'm taking this over for Jakob. Please add me as a contributor so that I can assign this ticket to myself.

          Jakob Homan added a comment -

          Also, He, I'm assuming your -1 is not intended to be a veto? I don't believe it would hold up technically. Trevni is essentially a variation on Avro. Not letting people read their Trevni-encoded data in Hive just because there's already another columnar format doesn't seem like a good way forward.

          He Yongqiang added a comment -

          @Jakob, you can always implement a reader for customized data in a 3rd-party lib and let Hive load it from there.

          Carl Steinbach added a comment -

          I don't think there's any harm in adding this to Hive. There is already overlap among the existing SerDes that are included with Hive (e.g. Avro and Thrift) and no one seems to care about that.

          He Yongqiang added a comment -

          @Carl, adding code that is not much used is always "no harm", except for a lot of maintenance and documentation pain. You can first go with a contrib folder or a 3rd-party lib and merge into core Hive later if it proves successful.

          Jakob Homan added a comment -

          And we have active users and contributors to this code (myself, Mark, Sean, etc.). There's essentially no chance this will be orphaned on arrival.

          Sean Busbey added a comment -

          +1 to Jakob Homan's sentiments, for what it's worth.

          It's been my experience that the support cost of some code and documentation is proportional to feature use. People rarely notice inaccuracies and bugs in stuff that doesn't get used. Unused SerDes can always get backed out on next breaking change.

          He Yongqiang added a comment -

          @Jakob, awesome to hear you are planning to own its maintenance. No particular intention to complicate your use case here, but I think a 3rd-party lib or contrib folder would be a good start and won't affect your usage. If I remember correctly, we used to do similar things for Pig's Zebra.

          Jakob Homan added a comment -

          But the Avro Serde wasn't added to a contrib package (org.apache.hadoop.hive.serde2.avro.AvroSerDe) - so why should its Trevni variant be? They share a lot of code. If it weren't for some subtle problems with updating partition schemas, I'd probably have just gone ahead and made read/write from trevni a table property of tables using the AvroSerde rather than have a separate TrevniSerde....
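
          For illustration only, that table-property alternative would amount to flipping a single property on an existing AvroSerde table. The property name and table name below are invented for the sketch; no such property exists in Hive today.

          -- Hypothetical: 'avro.output.encoding' is an invented property name and
          -- 'episodes' a placeholder table. This only illustrates toggling Trevni
          -- encoding on an AvroSerde table instead of adding a separate TrevniSerde.
          ALTER TABLE episodes SET TBLPROPERTIES ('avro.output.encoding'='trevni');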

          He Yongqiang added a comment -

          Thanks for reminding me that there is already an Avro SerDe. Have you tried to make the required changes part of the existing Avro SerDe instead of creating a new one?

          Jakob Homan added a comment -

          yeah.

          If it weren't for some subtle problems with updating partition schemas, I'd probably have just gone ahead and made read/write from trevni a table property of tables using the AvroSerde rather than have a separate TrevniSerde....

          He Yongqiang added a comment -

          I did not get why it does not work with partition schema update.

          Russell Jurney added a comment -

          With PIG-3015, Apache Pig will soon be generating many petabytes of Apache Trevni data, as a builtin. There is every reason for Hive to read this data, as a builtin. With Apache HCatalog, a Hive Trevni builtin can be used to load Apache Trevni data in Apache Hadoop MapReduce, Apache Pig, Apache Hive, Shark and eventually Apache Hadoop streaming. Integration between these tools will create much value for many users.

          Please note that Apache Trevni is part of Apache Avro, and was developed completely in the open, under Apache governance: http://avro.apache.org/docs/current/trevni/spec.html

          Russell Jurney added a comment -

          Also note this ticket currently has 3 votes and 15 watchers.

          Edward Capriolo added a comment -

          Question: will Trevni be just another SerDe/InputFormat combination, or will it involve large-scale changes to Hive?

          Edward Capriolo added a comment -

          In our wiki, Hive states:

          https://cwiki.apache.org/confluence/display/Hive/Home

          Hive does not mandate read or written data be in the "Hive format"---there is no such thing. Hive works equally well on Thrift, control delimited, or your specialized data formats. Please see File Format and SerDe in the Developer Guide for details.

          So if the new format does not require large-scale changes to accommodate it, I see no issues and would be +1.
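
          For reference, the existing Avro support already plugs in through exactly this SerDe/file-format mechanism; a declaration along the following lines is how the AvroSerde added in HIVE-895 is typically wired in. Only the schema URL below is an illustrative placeholder.

          -- Existing AvroSerde used as "just another SerDe + InputFormat";
          -- the schema location is an illustrative placeholder path.
          CREATE TABLE episodes_avro
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
            STORED AS
              INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
              OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
            TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/episodes.avsc');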

          Jakob Homan added a comment -

          Will Trevni be just another SerDe/InputFormat combination, or will it involve large-scale changes to Hive?

          Just another SerDe/InputFormat combination. In fact, for Avro-wrapped Trevni, it'll just be AvroSerde + Trevni{I|O}Format. For raw Trevni data it'll just be TrevniSerde + (probably) (?Raw)Trevni{I|O}Format. This is not a big patch and doesn't require any more changes to Hive than AvroSerde did - none at all. As I mentioned above, a veto on this can't be sustained on technical grounds, so I'm happy to re-assure He as to his concerns, but I don't see any reason not to proceed.

          I did not get why it does not work with partition schema update.

          I didn't want to try to mix Avro-style schemas and Trevni-style schemas, but Mark has a way around that.
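
          For illustration, the Avro-wrapped combination described above would look much like the existing Avro table declaration with the container formats swapped out. The Trevni format class names here are placeholders, not necessarily the names used in the attached patch, and the schema URL is an illustrative path.

          -- Sketch under assumed names: the Trevni input/output format classes are
          -- placeholders (not taken from the patch); the schema URL is illustrative.
          CREATE TABLE episodes_trevni
            ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
            STORED AS
              INPUTFORMAT 'org.apache.hadoop.hive.ql.io.trevni.AvroTrevniInputFormat'
              OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.trevni.AvroTrevniOutputFormat'
            TBLPROPERTIES ('avro.schema.url'='hdfs:///schemas/episodes.avsc');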

          Edward Capriolo added a comment -

          On one hand, when I look at Hadoop core there is a LOT of stuff not really core to anything. A good example is DBInputFormat.

          In Hive's case we do have quite a big code base, and at some point we may have so many UDFs and InputFormats that we should draw a line between critical and extra. But we are not there yet. If it is just another SerDe + InputFormat, I am 100% +1.

          Having Hive Avro support gave us a big boost in project exposure and will help keep the ball rolling. Not everyone uses Avro (I/we don't), but I really hate when I read a wiki with something to the effect of "Hive doesn't support X" when it actually does and X is just something easy waiting in the patch queue. Bring it on!

          He Yongqiang added a comment -

          So far I am still not convinced that we should have it as another builtin SerDe in Hive's core codebase. We initially put some new SerDes in contrib or 3rd-party libs; examples include the HBase SerDe and the Zebra SerDe.

          If you can make it work with the existing Avro SerDe, that would also be great.

          Russell Jurney added a comment -

          Pig is adding TrevniStorage as a builtin, and interoperability is desired.

          Edward Capriolo added a comment -

          I do not follow your point. HBaseSerde is not in trunk/contrib, it is its own subfolder, and I do not see any Zebra support in Hive. Maybe you are thinking of some other contrib besides trunk/contrib?

          Namit Jain added a comment -

          Will Trevni work without needing Avro? Is it a stand-alone file format which some users of Hive (not using Avro) can benefit from?

          He Yongqiang added a comment -

          HBaseSerde was first added to contrib and then moved to core later.

          Pig is adding TrevniStorage as a builtin, and interoperability is desired.

          I think interoperability is not a problem no matter where the code resides.

          Sean Busbey added a comment -

          Namit Jain, Trevni defines a columnar format that can be used with different serialization systems. I believe initial efforts across different components are planning to use Avro for serialization.

          Eventually, Trevni support should also work for Thrift and Protobufs.

          Carl Steinbach added a comment -

          HBaseSerde was first added to contrib and then moved to core later.

          And what did this accomplish? Wouldn't it have been better to put it in core to begin with? In fact, can anyone tell me why we shouldn't abolish contrib altogether?

          He Yongqiang added a comment - edited

          contrib is a good place for any project that is not mature. There are so many custom data formats out there that it does not make sense to support all of them in the core Hive code base. contrib is a good place for them to grow.

          From http://incubator.apache.org/hcatalog/docs/r0.4.0/, another good place I can think of is the HCatalog project. But I don't know whether HCatalog itself includes custom data format support or not.

          Carl Steinbach added a comment -

          The only concrete difference between core and contrib that I'm aware of is that the latter doesn't appear on Hive's classpath by default. As such I can only see two advantages to putting code in contrib: 1) it makes it harder for folks to use, and 2) it makes it harder for us to test. Did I miss anything?

          Russell Jurney added a comment -

          He, HCatalog uses Hive Serde. By adding the Trevni builtin for Apache Hive, Apache Hive, Shark, Apache HCatalog and Apache Pig will all get Trevni support. Synergy, baby!

          Apache Trevni is part of an actual Apache top-level project, Apache Avro, so it is nothing like Zebra, which I notice you reported yourself for addition in HIVE-781. Avro and Trevni are specifically designed for Hadoop workloads, and other tools like Pig are including Trevni immediately.

          Russell Jurney added a comment -

          This ticket now has 5 votes, and 22 watchers. Support for a Trevni builtin is overwhelming.

          Russell Jurney added a comment -

          Relates to AVRO-806, see goals/differences from RCFile.

          Namit Jain added a comment -

          The main reason that contrib exists is to add new features/projects which are being tested, may take some time to mature, and are reasonably stand-alone, so that they don't need many changes in existing code. New SerDes/file formats/UDFs are good use cases for it.

          I don't see why testing/development in contrib is so difficult or different compared to development in any other component. This is the reason contrib was added, so new stand-alone components can bake. We can definitely move it out of contrib once it is mature/safe.

          Why is development in contrib such a bad idea?

          Namit Jain added a comment -

          This would be really useful to the community. Also, no one answered my earlier question - can this be used without Avro in Hive?

          Sean Busbey added a comment -

          @Namit My earlier comment was meant to answer this

          Trevni defines a columnar format that can be used with different serialization systems. I believe initial efforts across different components are planning to use Avro for serialization.

          Eventually, Trevni support should also work for Thrift and Protobufs.

          In short, Trevni always requires a serialization format. Current efforts across components are focused on Avro as the first implementation. With that, users will need access to the Avro libraries.

          Jakob Homan added a comment -

          Expanding on what Sean said. Yes, there will be a TrevniSerde that doesn't write Avro but happens to require the Avro libraries. And there will be an extension to the AvroSerde that writes Trevni-encoded Avro.

          This patch is going to share 90% of its small code with the existing AvroSerde that was never shunted into contrib. Why should this variation be? There are active users and developers of this code. Again, I'm not seeing any technical reasons to block progress. Is anyone planning on exercising a -1?

          He Yongqiang added a comment -

          This patch is going to share 90% of its small code with the existing AvroSerde that was never shunted into contrib.

          Then why is it so hard to make it part of the existing AvroSerde?

          I'm not seeing any technical reasons to block progress.

          Technically, there is no issue. Technically I am pretty sure this can be well done.

          Is anyone planning on exercising a -1?

          I have listed two options that I insist on: one is to develop it as part of the existing AvroSerde, the other is to put it in contrib or a 3rd-party lib (maybe GitHub?).

          Mark Wagner added a comment -

          Patch attached. Some notes:

          • This patch implements a SerDe for Avro wrapped Trevni, which allows it to be implemented as a separate file format for the existing AvroSerDe
          • This patch contains contributions from Srinivas Vemuri derived from uncommitted patches to Haivvreo
          • Avro version has been bumped to 1.7.3 to bring in Trevni
          • trevni-avro and trevni-core modules of Avro are now included
          • The test avro_partition_format.q has no corresponding *.q.out file due to HIVE-3953. What's the best way to handle this test? Include it knowing it will fail, wait until HIVE-3953 is resolved to commit, or another option?

          Francois Saint-Jacques added a comment -

          The Avro team suggests using 1.7.4 instead of 1.7.3; see https://issues.apache.org/jira/browse/AVRO-1259.

          Edward Capriolo added a comment -

          +1 with one minor comment. It seems like the SELECTs in the q tests are all 'SELECT *'; those do not really test the map/reduce path. Please do a column pruning operation like select col1, or select distinct(column1).
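
          For illustration, q-file queries along these lines (table and column names are placeholders) would force a map/reduce job rather than the fetch-only path that a bare SELECT * can take:

          -- Placeholder table/column names; each query exercises column pruning
          -- and the map/reduce path, unlike SELECT *.
          SELECT title FROM episodes_trevni;
          SELECT DISTINCT doctor FROM episodes_trevni;
          SELECT doctor, count(*) FROM episodes_trevni GROUP BY doctor;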

          Yin Huai added a comment -

          I am concerned about the performance of Trevni. Please correct me if my explanation is not accurate.

          First, it seems we store a column as multiple compression blocks, and the size of a block is around 64 KiB. When the reader reads data from a stored table, we read one compression block at a time. Because the reader is row-oriented, in the worst case the reader needs to seek to another column after reading every single compression block. As pointed out in AVRO-1208, the performance degradation is significant.

          Second, since we read a compression unit at a time, when the size of a compression unit is smaller than the size of the buffer used in the BufferedInputStream inside BlockReader, we may not be able to read data efficiently. We will use only a portion of the data fetched by the BufferedInputStream, and the rest cannot be used. In my experience the performance of Trevni on HDFS without short-circuit reads is not good. I think this issue may be the reason, but I have not dug very deep.

          Edward Capriolo added a comment -

          Would anyone like to rebase this patch? It no longer applies on trunk due to other Avro changes.

          Sean Busbey added a comment -

          I've got the patch rebased to work on trunk, and I've got additional tests to force the MR path. Unfortunately, even though HIVE-3953 and HIVE-4789 are no longer a problem, I still get test failures. I think it's from the part of the HIVE-4789 patch that was left out due to not having an applicable test case.

          Care for the updated patch w/o applicable .out files now? That way once I get an issue filed and fixed for just that last part of local-only avro+partitioned support someone just has to get test outputs?

          Yin Huai added a comment -

          My impression is that Parquet is the replacement for Trevni, so my question is whether there are any serious users of it.

          Edward Capriolo added a comment -

          Yin Huai, we have no way of judging who is using what. Maybe more people would be using Trevni if we (committers) had focused on getting the Hive support committed. There are 34 watchers, and Avro is here to stay. There seems to be a bit of "politics" around what is part of Hive and what is not. Hive has support for other columnar input formats, so who are we to say what should or should not be in Hive?

          Mark Wagner added a comment -

          As Yin said, Parquet has mostly taken the mindshare from Trevni. Looking at the Avro jira, it does seem that there are a few users of Trevni. Edward Capriolo, if you'd like to include this for them, then that's fine. The only change to the Avro Serde was a bit of refactoring, so it shouldn't be any burden on the "main" Avro Serde. That said, I think it'd be good for HIVE-4732 and HIVE-4734 to go in first. Both of those should be ready to commit shortly and will require a bit more rebasing of this patch. Also, the Avro version should get bumped to 1.7.5.

          Edward Capriolo added a comment -

          Care for the updated patch w/o applicable .out files now? That way once I get an issue filed and fixed for just that last part of local-only avro+partitioned support someone just has to get test outputs?

          I would rather just fix it as part of this issue. We do not have to create issues just for the sake of creating issues. Unless you want to.

          Carl Steinbach added a comment -

          I agree with what Ed said earlier and want to add that as a project we shouldn't put ourselves in the position of picking winners and losers when it comes to battles between competing data serialization formats. As long as a patch like this meets the same code quality standards that we apply to every other patch, I think it should get committed.

          Sean Busbey added a comment -

          Filed HIVE-5302 for the last of the "Avro needs table information when a partition doesn't have it" issues, which is the failure I mentioned here. It touches a bunch of things unrelated to this, so I broke it out.

          Sean Busbey added a comment -

          Updated patch rebased onto trunk. review board

          Depends on HIVE-5302 being applied to allow testing

          • Updated Avro to 1.7.5
          • update AvroRecordReaderBase for changes that have since happened in AvroGenericRecordReader
          • added client positive queries that hit MR, per feedback from Edward Capriolo
          • added .out for avro_partition_format.q

          Personally, I think the performance issues can wait for a follow-up ticket, especially once we can use a version of Trevni that has a solution for AVRO-1208 in place.

          Sean Busbey added a comment -

          And since I forgot to tell the precommit build bot not to try to test it, that's going to fail, since it won't have the futurama_episodes.avro file.

          Edward Capriolo added a comment -

          Sean Busbey, I will commit the Avro data (there is no harm in that). Where is it supposed to go?

          Sean Busbey added a comment -

          It goes in data/files/

          Edward Capriolo added a comment -

          Re-uploaded the patch and hit SUBMIT_PATCH; testing should begin soon.

          Hive QA added a comment -

          Overall: -1 at least one tests failed

          Here are the results of testing the latest attachment:
          https://issues.apache.org/jira/secure/attachment/12603845/HIVE-3585.3.patch.txt

          ERROR: -1 due to 1 failed/errored test(s), 3129 tests executed
          Failed tests:

          org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_avro_partition_format
          

          Test results: https://builds.apache.org/job/PreCommit-HIVE-Build/803/testReport
          Console output: https://builds.apache.org/job/PreCommit-HIVE-Build/803/console

          Messages:

          Executing org.apache.hive.ptest.execution.PrepPhase
          Executing org.apache.hive.ptest.execution.ExecutionPhase
          Executing org.apache.hive.ptest.execution.ReportingPhase
          Tests failed with: TestsFailedException: 1 tests failed
          

          This message is automatically generated.

          Brock Noland added a comment -

          Error message is here http://ec2-174-129-184-35.compute-1.amazonaws.com/logs/PreCommit-HIVE-Build-803/failed/TestCliDriver-multigroupby_singlemr.q-nestedvirtual.q-join_vc.q-and-12-more/hadoop.log

            People

            • Assignee:
              Mark Wagner
            • Reporter:
              alex gemini
            • Votes:
              15
            • Watchers:
              33
