Details

    • Type: Wish
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.19.0
    • Component/s: None
    • Labels: None
    • Environment: N/A
    • Hadoop Flags: Reviewed
    • Release Note:
      Introduced Hive, a data warehouse built on top of Hadoop that enables structuring Hadoop files as tables and partitions and allows users to query this data through a SQL-like language using a command line interface.

      Description

      Hive is a data warehouse built on top of flat files (stored primarily in HDFS). It includes:

      • Data Organization into Tables with logical and hash partitioning
      • A Metastore to store metadata about Tables/Partitions etc.
      • A SQL-like query language over object data stored in Tables
      • DDL commands to define and load external data into tables

      Hive's query language is executed using Hadoop map-reduce as the execution engine. Queries can use either single-stage or multi-stage map-reduce. Hive has a native format for tables, but can handle any data set (for example JSON/Thrift/XML) using an IO library framework.
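
      For illustration, here is a hypothetical sketch of the kind of DDL, load, and query statements Hive supports. The table and column names are invented for this sketch, and the exact syntax of the current dialect is documented in the attached HiveTutorial.pdf rather than here:

          CREATE TABLE page_view (viewTime INT, userid BIGINT, page_url STRING)
          PARTITIONED BY (dt STRING)
          ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
          STORED AS TEXTFILE;

          LOAD DATA INPATH '/data/page_views/2008-06-08'
          INTO TABLE page_view PARTITION (dt = '2008-06-08');

          SELECT pv.page_url, count(1)
          FROM page_view pv
          WHERE pv.dt = '2008-06-08'
          GROUP BY pv.page_url;

      A query like the last one compiles into one or more map-reduce jobs over the files that back the table's partitions.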

      Hive uses ANTLR for query parsing, Apache JEXL for expression evaluation, and may use Apache Derby as an embedded database for the MetaStore. ANTLR has a BSD license and should be compatible with the Apache license.

      We are currently thinking of contributing to the 0.17 branch as a contrib project (since that is the version under which it will get tested internally), but we are looking for advice on the best release path.

      1. ant.log
        60 kB
        Ashish Thusoo
      2. hive.tgz
        7.79 MB
        Ashish Thusoo
      3. hive.tgz
        7.84 MB
        Ashish Thusoo
      4. hive.tgz
        7.84 MB
        Ashish Thusoo
      5. HiveTutorial.pdf
        67 kB
        Ashish Thusoo

        Activity

        Joydeep Sen Sarma created issue -
        Joydeep Sen Sarma added a comment -

        Forgot to mention:

        • we will shortly post the query language spec here as well as any relevant design documents.
        • as soon as possible, we will make a patch available that provides basic functionality so people can start playing around.
        Enis Soztutar added a comment -

        Contrib will be a good place for hive. Once it gains attention, we will move it to incubation or adopt it as a sub-project. However, since 0.17 is out and 0.17.1 will be out soon, I do not think we can maintain it in the 0.17 branch. I guess it will initially be committed to trunk, to be introduced in 0.18.

        Owen O'Malley added a comment -

        We commit new features into trunk and branch the release from the trunk on the feature freeze day. Only bug fixes are down ported to the release branches, not new features. The soonest that Hive could be in a release would be 0.19.

        It does not have to be complete to be checked in, but it does need to compile and pass any unit tests that it defines. It would be fine to make this jira the initial check in and have follow up ones that make it usable.

        Ashish Thusoo added a comment -

        Ok. Apart from the unit tests, there is no other stipulation, right? In all probability we would have actually tested it out in our production environment on an earlier release of hadoop, even though the code could go out with 0.19. When is 0.19 scheduled to be branched?

        Owen O'Malley added a comment -

        0.19 feature freeze is the first Friday of September. HBase was checked into Hadoop's contrib directory for a couple of releases before it was really usable.

        Ashish Thusoo added a comment -

        Tutorial on the capabilities of Hive. This is a PDF of internal documentation and contains query, DML and DDL examples as well as an overview of the system. A formal language spec, architecture documents and roadmaps will follow. This document gives an initial preview of the system and will hopefully seed a lot of interesting discussion and questions around it.

        Ashish Thusoo made changes -
        Attachment HiveTutorial.pdf [ 12385545 ]
        eric baldeschwieler added a comment -

        Hi Folks,

        What follows are some thoughts I have on the general situation in Hadoop of adding big projects like Hive to core/contrib. I don't think this is a scalable way forward and would like to use this submission as an opportunity to discuss the general challenges involved in welcoming new projects into the Hadoop family.

        We've now seen 3 Hadoop projects take different courses with different results:

        1) HBASE - This went into contrib. It sat there for a number of months in active development before becoming a subproject. ADVANTAGES: Good publicity for project. DISADVANTAGES: Since it was very active, it frequently broke the hadoop core build and became a significant fraction of hadoop-dev message traffic. This was somewhat disruptive to core development. IMO this does not scale. If we had several such projects running at once in core/contrib they would drown out the main dev community.

        2) Pig - Pig went directly into the apache incubator and has ambitions to graduate to a Hadoop sub-project. ADVANTAGES: Low overhead to the hadoop community, lots of training for the committers on the Apache way. DISADVANTAGES: Less visible than HBASE, high upfront investment in project setup, review, committer training, approval, ...

        3) ZooKeeper - It was shared by its developers outside of apache under the BSD & then apache licenses, first as a posting on the Yahoo Research website and then as a source forge project. ADVANTAGES: Super low cost to start, fewer restrictions to share code than incubation, ... DISADVANTAGES: Less visible than HBASE.


        From these experiences, I think checking in major projects that build on Hadoop into core contrib is not the most productive way to host them. If they are active, they can be very disruptive during their formation. A project should have its own email lists, tests, branches, etc, independent of Hadoop mainline.

        The main advantage of putting projects in core seems to be to increase their visibility to the Hadoop community. I'd suggest we discuss other mechanisms. Long term I hope to see something like cpan.org emerge for hadoop. But short term we have not IDed an entity to host such a site.

        Absent that, I'd suggest a project like Hive take either the path ZooKeeper or Pig took. As a community we could take some simple steps to address the shortcomings of these approaches. An obvious step would be to invest in a well-linked Wiki section that provides a directory of such projects.

        What do folks think of this? Other thoughts? Suggestions?

        E14

        Owen O'Malley added a comment -

        The straight to subproject path is only available if the code base is from a single organization. Non-Apache projects that want to become Apache projects need to go through the incubator. Getting out of incubator takes a fair amount of effort.

        Another serious advantage for the hbase approach was that the hbase contributors got trained in the way that the Hadoop process and community works. That didn't happen for pig and the training took longer. Hbase had its first release after 2 months and pig hasn't released yet. Also the process and infrastructure overhead was much much lower for creating hbase than pig or zookeeper. It would take an hour to create Hive as a contrib module and a month to create it as a subproject. I agree with the disadvantages though that if the project gets busy, it can start to swamp the hadoop jiras and mailing lists. Certainly, we would have pushed HBase to a subproject much sooner if Hadoop hadn't been a subproject of Lucene at the time.

        If we are going to take Hive in contrib, I think we probably should disengage our process a bit from the current model. In particular, I don't think we should run the contrib unit tests for our patches. The only downside to that is that we should probably promote streaming and data_join into map/reduce, which will take some cleanup.

        Doug Cutting added a comment -

        > Owen: In particular, I don't think we should run the contrib unit tests for our patches.

        Hmm. We might still run them, but not fail a core patch if a contrib test fails. Or perhaps run them as a separate job in Hudson. We still want contrib to build and pass tests, and regular Hudson tests are a good way to achieve this.

        > Eric I'd suggest a project like Hive take either the path ZooKeeper or Pig took.

        As Owen pointed out, the Pig path (incubator) isn't required here, unless Hive wants to be a TLP (as Pig did at the time). The Zookeeper path (new Hadoop subproject) is available. I don't have a strong preference. If Hive is incorporated as a contrib module and it generates too much mailing list traffic on core lists, that's a success disaster that we can remedy by promoting it to a subproject. Or if folks feel confident from the start that it will sustain a subproject and are willing to create the infrastructure for that, that's fine too. As Owen mentioned, a subproject takes more time, to create a JIRA instance, mailing lists, web site, etc, especially if the folks involved are not already familiar with how these things are done at Apache. But it's not that hard.

        Joydeep Sen Sarma added a comment -

        Ideally we would like it to be in contrib for the same reasons that Owen outlined:

        • easy (low setup)
        • hadoop APIs are not frozen yet - so being part of the tree and having regression tests run regularly against hadoop trunk makes it easy for us to respond to API changes. For the same reason, we like Doug's idea of running contrib tests via Hudson as a separate (nightly) job
        • we are not set on being a TLP - just want to get it out there.

        The point about swamping the core-dev list with contrib jiras is well taken. Would it be possible to have a separate email list for contrib projects (at least the high volume ones)? It would benefit the contrib authors as well in not having to parse tons of core hadoop jiras.

        At this point we have also invested a lot of effort in fitting into the contrib source tree model - so the sourceforge model sounds a little daunting (I imagine Zookeeper is more or less independent of Hadoop? - but Hive is totally intertwined with map-reduce/dfs).

        Doug Cutting added a comment -

        > Would it be possible to have a separate email list for contrib projects [ ... ]

        I'd rather keep the rule that separate lists are reserved for separate subprojects.

        Sameer Paranjpye added a comment -

        Another option would be to create a sandbox sub-project. This would serve as an incubator of sorts for entities that wanted to be Hadoop sub-projects, with its own mailing lists and builds. A sandbox project that meets some bar could become a sub-project. I don't know if this is possible or has a precedent in Apache. It would have the advantage of enabling us to mostly decouple administration from Hadoop Core.

        It should be possible to set up Hudson so that sandbox regressions run regularly against the Hadoop trunk.

        Doug Cutting added a comment -

        > Another option would be to create a sandbox sub-project.
        > A sandbox project that meets some bar could become a sub-project.

        What's the bar? Releases need to be approved by the Hadoop PMC, and, more generally, the PMC must monitor all activity in the project. I don't see how a sandbox designation would help oversight.

        We don't need more options. We have well-understood options that serve us well: core, contrib, sub-projects, TLPs, etc. We need to choose one. The Hive folks seem happy with contrib. I'm willing to try that. If it becomes a problem, we'll switch to something else.

        Sameer Paranjpye added a comment -

        > What's the bar? Releases need to be approved by the Hadoop PMC, and, more generally, the PMC must monitor all activity in the project.
        > I don't see how a sandbox designation would help oversight.

        It's for the community to decide what the bar is. It's clearly lower than that required for a TLP. What's the bar for moving something from contrib to a sub-project? It's not well defined, but can be thought about and made concrete. Sandbox projects wouldn't create traffic on the core lists and wouldn't hold up core patches or releases. The patch process can be modified to not consider contrib tests for core patches. But core releases would not be possible without contrib tests passing.

        > We don't need more options. We have well-understood options that serve us well: core, contrib, sub-projects, TLPs, etc.

        We have well understood options, all of whose shortcomings have been discussed in this thread. It may be that no better way exists, but I don't follow the reasoning behind spontaneously rejecting all proposals.

        Doug Cutting added a comment -

        Sameer, I still don't understand what the difference between a sandbox and a sub-project would be. A sandbox would have its own mailing lists, jira, website, and releases, administered by the Hadoop PMC. That's a Hadoop sub-project, no?

        Sameer Paranjpye added a comment -

        Doug,

        Yes, a sandbox is still a Hadoop sub-project. All I'm saying is that it could perhaps be a home for multiple nascent projects. Hive would be one such, and it could become a sub-project in its own right once it gains sufficient critical mass, i.e. reaches a community-determined bar.

        Ian Holsman added a comment -

        guys.. whatever is the easiest way to get this puppy into the open source.

        If Doug thinks it should be in contrib to start off, then stick it there. It isn't a final decision, and there is only a little bit of disruption (judging from the hbase case) if you want to split it out into a sub-project later on.

        Is there an actual code drop available?

        Tim Robertson added a comment -

        I would second that - contrib seems like the easiest way to get it out and tried and tested, and as support grows, split it out into its own project.
        Is it in any publicly accessible SVN now?

        Ashish Thusoo added a comment -

        We do not have it in a public svn yet.

        We are still fixing a few issues, but we are very close to getting this out.

        Reading all the discussions, contrib seems to be a good first step. Are we then good to go with that approach?

        eric baldeschwieler added a comment -

        After having dealt with the issues of HBASE in contrib, I really like the sandbox approach. It addresses the many and repeated challenges we experienced with HBASE.

        The idea I think is that all or at least most contrib projects would go there. We would get them off the primary lists and would have a clearer separation on hudson, mail etc. We could call the sub-project contrib or commons or incubator or whatever to make it clear that it is a place for nascent sub-projects that are not part of the Hadoop core code. It's hard for me to understand why we would check in complete systems built on top of hadoop, like Hive, into core.

        Without some process changes like sandbox I'm against bringing hive into contrib, since it will add overhead to core hadoop work. But I really want us to find a way to encourage this and many more projects that build on Hadoop to share their work with the community.

        The argument that we should put Hive in contrib because it is easier than going to source forge or google code really alarms me! Starting a project on those sites is trivial and requires a lot less commitment than signing up to be a good member of the apache hadoop community. Separating the mailing lists and builds of contrib from core would reduce the impact of such projects on the core hadoop community substantially, but not to zero. You are signing up for a lot more by putting your project here than these other sites!

        See:

        http://incubator.apache.org/learn/theapacheway.html
        http://wiki.apache.org/hadoop/HowToContribute
        eric baldeschwieler added a comment -

        PS If the PMC thinks going straight to a sub-project makes sense, that is fine with me too. But checking big systems under active development into contrib does not work well IMO.

        Edward J. Yoon added a comment -

        Very cool contribution. IMO, +1 for the sandbox approach.

        Doug Cutting added a comment -

        > Yes, a sandbox is still a Hadoop sub-project [ ... ]

        Contrib is our de-facto sandbox today. I am not personally interested in setting up and managing a separate sandbox, nor have I yet heard other volunteers on the PMC.

        I also think we should separate these two issues: how to manage contrib/sandbox long-term, and how to import Hive short-term.

        > Without some process changes like sandbox I'm against bringing hive into contrib, since it will add overhead to core hadoop work.

        If it does, then we could move it to a subproject. As Owen stated, the reason such a move was delayed for HBase was that, as a Lucene subproject, Hadoop could not create sub-sub-projects. But now, as a TLP, we can easily and quickly create sub-projects when they're needed. So I don't see this as a major liability.

        I still think the two viable options are contrib or sub-project. I don't have a strong opinion. It depends on what activity level we expect. If it is to be relatively low-activity, a contrib module is appropriate. If it has enough activity to support separate mailing lists, releases, etc., then a sub-project makes sense. In either case, we'll first make a guess, then adapt if we've made a mistake. A mistake is not fatal or even critical here, but minor.

        dhruba borthakur added a comment -

        My vote is to make Hive a contrib module to start with.

        Joydeep Sen Sarma added a comment -
        • Initially the project will be very active. We expect lots of bug fixes and features for the first few months. So if that is a concern - then a sub-project may be better suited.
        • the concern with sourceforge wasn't so much the act of setting up a project there - but rather the source code organization. we are using hadoop core and other libraries in hadoop/lib liberally and just haven't gone through the exercise of thinking through how we would manage those dependencies if Hive were completely separate.

        can someone post details on how sub-projects are organized in terms of source code organization/checkin rules/branching etc. ?

        also looking for some advice on how we can organize things so that (say medium term) we can do releases independent of hadoop.

        Tim Robertson added a comment -

        "also looking for some advice on how we can organize things so that (say medium term) we can do releases independent of hadoop."

        Maven! It is so simple to change dependency versions and then when Hadoop increments to say 0.18, you change your pom to reflect this and it is all seamless for us all.
        Really this is what maven is great for.

        Doug Cutting added a comment -

        > Initially the project will be very active.

        Sounds like a sub-project might be called for.

        > can someone post details on how sub-projects are organized in terms of source code organization/checkin rules/branching etc. ?

        Look at HBase and Zookeeper for examples. A sub-project has its own trunk, branches and tags in subversion, and releases separately. It has its own mailing lists and jira instance. It has a separate list of committers. All Hadoop subprojects are overseen by the Hadoop PMC. For this to be effective, each subproject should have several PMC members who are active on it, ideally three or more. Creation of a new sub-project requires a vote of the PMC and should be discussed on general@hadoop.apache.org, while creation of a contrib module is generally handled much like any other patch.

        If Hive were a sub-project, it would probably include Hadoop Core jars in its lib/ directory. Hive releases would lag Hadoop Core releases.

        Doug Cutting added a comment -

        > Maven! It is so simple to change dependency versions and then when Hadoop increments to say 0.18, [...]

        Hadoop Core does not currently publish Maven artifacts.

        Tim Robertson added a comment -

        I know... I wish they did. The first thing I did was get it into my local repository so I could use it in a Maven build environment - I see others on the mailing list do this too.
        Would it break any licensing for someone to put it in a public repository? Hive could for example put it up in their own and reference that in their pom.

        Doug Cutting added a comment -

        > I know... I wish they did.

        Please file a separate issue. If you can, attach a patch there so that hadoop's build generates maven artifacts. Then we'd need documentation added to HowToRelease describing how to publish maven artifacts, and to convince those that produce Hadoop releases that it is worth their while to generate and publish them.

        Owen O'Malley added a comment -

        There is already a jira to publish pom files from Hadoop in HADOOP-3305. We need to clean up our reference to the cli-2 stuff first. I've started a patch for that one.

        Joydeep Sen Sarma added a comment -

        synced up with a few folks working on this internally. in a nutshell - the contributors seem to like the idea of making this a contrib project to begin with.

        the sub-project requirements (in terms of PMC involvement) are fairly rigorous and would probably extend the timeline of releasing hive into the hadoop ecosystem. that is our primary concern at this time. as the project matures - it's possible/likely that a sub-project designation is more appropriate.

        to address the concerns about email traffic on core-dev - we had a suggestion. if we can put the 'component' field in the email header (Pete found this useful link: http://www.atlassian.com/software/jira/docs/latest/emailcontent.html) - then client-side mail filtering should be able to isolate hive jira traffic from that of hadoop (or other contrib projects). there have already been suggestions on this thread about not having contrib test failures stop acceptance of patches - and that would probably alleviate the other major concern around slowing core development down. would these address most of the concerns that are motivating the sandbox/sub-project discussion?

        i don't think we will see a lot of traffic on the core-users mailing list (based on the follow-up traffic from Ashish's posting of the hive language tutorial) - but we will just have to see how that turns out.

        Doug Cutting added a comment -

        > the sub-project requirements (in terms of PMC involvement) are fairly rigorous

        Not really, I'd be happy to kibitz on the mailing lists while things get established. Once you've made a release or two then we can perhaps nominate some Hive folks to the PMC. It is best for each subproject to be represented on the PMC by active committers.

        > we can put the 'component' field in the email header

        If the component is specified then it is included in every message body, and folks can filter for it there.

        > there have already been suggestions on this thread with not having contrib test failures stop acceptance of patches

        My preference would not be to treat Hive differently from any other contrib module. If it doesn't fit contrib, then it should be a sub-project. If you think there will be a lot of JIRA traffic that's not of interest to the rest of Hadoop Core then that's a sign that it doesn't belong in Hadoop Core releases and should be a sub-project.

        Ashish Thusoo added a comment -

        I am not very sure how much JIRA traffic this would generate initially. In the long run, if this becomes popular, it will of course generate a lot, but at this time, considering that people would just be curious about it and be experimenting with it, it seems to me that creating a sub-project is an over-optimization. At least I think hive still needs to prove itself before it can be called a sub-project in its own right. There is potential, but a lot will depend on how the community adopts it - both the user community and the developer community.

        Putting Hive in contrib also ensures that we are focused on working within the Hadoop ecosystem, on making sure that Hive development doesn't lag Hadoop development, and that we actively move forward as Hadoop interfaces evolve. It ensures that we do not diverge too much from Hadoop releases.

        Given all that, it seems desirable that we carry on with the contrib model and monitor this closely to see if it earns the right to be a sub-project.

        lengwuqing added a comment -

        Could you guys please tell us: when is the Hive release date?

        lengwuqing made changes -
        Issue Type: New Feature [ 2 ] → Wish [ 5 ]
        Environment: N/A
        Release Note: Could you guys please tell us: when is the Hive release date?
        Tim Robertson added a comment -

        Is there an early release code base anywhere publicly available yet (please)?

        Ashish Thusoo added a comment -

        Hi Ian/LengWuqing,

        Apologies for the delay. We are shooting for this Friday (Aug 15th) to upload a patch here. So please bear with us.

        Thanks,
        Ashish

        Ashish Thusoo added a comment -

        We ran into some issues while porting this to trunk. We are actively working to resolve those issues.

        While we solve the compatibility issues with hadoop trunk, interested users can get a source tar ball and a jar distribution which compiles and works with hadoop 0.17 from the following location

        http://mirror.facebook.com/facebook/hive/hadoop-0.17/

        Please follow the instructions in the README file on how to compile the src tar ball and how to use the jar distribution. Not all the features mentioned in the tutorial on this JIRA have made it into this distribution, but the bulk of them are already there. The README in the jar distribution has a summary of what is working and what is not.

        Feel free to try it out and send us feedback.

        Hive@facebook

        Tenaali added a comment -

        I read in the README about the number of map-reduce jobs hive requires to handle group by. Could you throw more light on how many map-reduce jobs hive runs in order to run a query having aggregate functions or a query having multiple joins, etc.?

        Ashish Thusoo added a comment -

        Hi Tenaali,

        By default we do the group by in 2 stages. In the first stage we generate partial aggregates and then in the second stage we generate the final aggregates. If there is a query of the form

        select t.c1, count(DISTINCT t.c2) from t group by t.c1

        We would first run a map reduce job with the key as c1, c2 to generate the partial aggregates c1, count(DISTINCT c2).
        We would then follow this up with a second stage map reduce with key as c1 on the output of the previous stage and we would generate the final aggregate as c1, sum(partial aggregates).

        In case distinct is absent e.g.

        select t.c1, count(t.c2) from t group by t.c1

        we would run the first map reduce by randomly distributing the rows to the reducers in order to generate partial aggregates c1, count(c2)
        and then we would generate the final aggregates similar to the case mentioned above.

        We only support DISTINCT on one column right now.

        For join, every join is done in a map-reduce job and the result from one stage is fed into the next join etc. The join keys are used as the map keys and the join is done as a cartesian product on the values that arrive for the different tables in the reducer. We do optimizations so that we can join multiple tables in the same map-reduce task (in case the join key for a table is the same, e.g. a.c1 = b.c1 and b.c1 = c.c1).

        We will be doing many more optimizations for both of these and we will be putting out all this information on the wiki very soon.
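
        To make the two stages a bit more concrete, here is a rough, logically equivalent manual rewrite of the DISTINCT query above, using a hypothetical intermediate table. This is only a sketch of what the two map-reduce stages compute, not the plan Hive actually generates (in the real plan the first stage carries partial counts rather than materializing a table), and the exact syntax of the current distribution may differ:

            -- stage 1: key on (c1, c2); each reducer group is one distinct (c1, c2) pair
            FROM t
            INSERT OVERWRITE TABLE tmp_partial
            SELECT t.c1, t.c2
            GROUP BY t.c1, t.c2;

            -- stage 2: key on c1; combine the stage-1 output into count(DISTINCT c2) per c1
            SELECT p.c1, count(1)
            FROM tmp_partial p
            GROUP BY p.c1;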

        Ashish Thusoo added a comment -

        tar gzip file that contains the sources for hive. This contains some jar files as well; as a result we could not submit it through the normal patch method. I talked to Dhruba and he advised that we just upload the tgz file.

        In order to get this into the source tree

        copy hive.tgz to src/contrib

        and then

        tar xvzf hive.tgz

        to get all the sources into src/contrib.

        This compiles and tests with svn revision 688101 for hadoop trunk.

        Ashish Thusoo made changes -
        Attachment hive.tgz [ 12388787 ]
        dhruba borthakur added a comment -

        Would somebody care to review this one? We will probably submit this patch tonight for HadoopQA test run.

        Ashish Thusoo added a comment -

        Adding a .tgz built from hadoop root.

        Ashish Thusoo made changes -
        Attachment hive.tgz [ 12388967 ]
        Ashish Thusoo added a comment -

        Submitting hive.tgz file for hadoop QA. This .tgz file is built from hadoop root. We are submitting a .tgz file as there are some jar files that we use that need to be included in the sources. In order to build, copy the .tgz file to the hadoop root and then tar -xvzf hive.tgz to get the sources in the correct location.

        Ashish Thusoo made changes -
        Affects Version/s 0.19.0 [ 12313211 ]
        Affects Version/s 0.17.0 [ 12312913 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Could you guys please tell us: what time is the Hive realease-date?
         Hive - Data Warehouse built on top of hadoop that enables structuring hadoop files as tables and partitions and allows users to query this data through a SQL like language using a command line interface.
        Owen O'Malley added a comment -

        Ok, three things that I've noticed.
        1. You need to handle the "package" target and put your artifacts into build/hadoop-*/contrib/hive so that they are included in the hadoop tarball.
        2. src/contrib/hive/lib needs a README that lists where each included component comes from. Each jar should also have a *.jar -> *.LICENSE file containing the license that the given jar is distributed under.
        3. Your javadoc should probably be included in the hadoop javadoc, so that it is available on the hadoop website.

        Owen O'Malley made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Ashish Thusoo added a comment -

        sounds good. I will upload another tgz tomorrow with those changes.

        eric baldeschwieler added a comment -

        Do we want hive and all other contribs in the default tarball?

        Why?

        lengwuqing added a comment - - edited

        When I tried the hive system, I found these issues:
        1. NullHiveObject.getFields always returns null, but in some places (such as NaiiveSerializer) the caller uses this return value to call the .getSize() method. //I changed it to "return new ArrayList<SerDeField>();" but I cannot be sure this is OK.
        2. In joinOperator.close(), the l4j object is null. //I added a null check in this function.
        3. Sometimes (under heavy load on hadoop), even when I use the same data and the same Hive-QL, the results are different. When the logic error happens, the output contains xxxx_r_000022_0 and another file with the postfix xxxx_r_000022_1. //This issue is reproducible; I think it is a critical bug for me. I don't know in which cases the results get the different postfixes _0 and _1, but I guess this is a good hint for debugging the issue.

        Could any facebook folks please give me a hand: why does case #3 happen?
        "No one is there", nobody responds. I suggest that FB pay us, and we can make hive better and move it forward faster.

        lengwuqing made changes -
        Affects Version/s 0.19.0 [ 12313211 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.17.2 [ 12313296 ]
        YoungWoo Kim added a comment -

        Hi Ashish,

        First, thank you all hive developers for great contribution.

        I'm testing hive with hadoop(0.19 dev).

        This is what I found:

        1. MySQL as a metastore does not work properly. logs below:
        ERROR JPOX.Datastore (Log4JLogger.java:error(117)) - Error thrown executing CREATE TABLE `SD_PARAMS`
        (
        `STORAGE_DESC_ID_OID` BIGINT NOT NULL,
        `PARAM_KEY` VARCHAR(256) BINARY NOT NULL,
        `PARAM_VALUE` VARCHAR(1024) BINARY NULL,
        PRIMARY KEY (`STORAGE_DESC_ID_OID`,`PARAM_KEY`)
        ) ENGINE=INNODB : Specified key was too long; max key length is 767 bytes
        com.mysql.jdbc.exceptions.MySQLSyntaxErrorException: Specified key was too long; max key length is 767 bytes

        It's not hive's bug but MySQL's own limitation (MySQL 5.0.x, UTF-8 with InnoDB).
        I've changed RDBMS to PostgreSQL. It works fine.

        2. "DESCRIBE TABLE 'table name'" statement does not work but "DESCRIBE 'table name'" statement works.

        3. cli can't handle non-english characters for now.
        hive> select a.* from test a where a.b='김영우';
        Total MapReduce jobs = 1
        Starting Job = job_200808271709_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_200808271709_0004
        Kill Command = /usr/local/hadoop/bin/hadoop job -Dmapred.job.tracker=localhost:54311 -kill job_200808271709_0004
        map = 0%, reduce =0%
        map = 50%, reduce =0%
        map = 100%, reduce =100%
        Ended Job = job_200808271709_0004
        Moving data to: /tmp/hive-hadoop/5847592.10000
        OK
        1 ���
        hive>

        thanks.
        -yw kim

        Joydeep Sen Sarma added a comment -

        @lengwuqing - we will fix problem #3 in the next version of the patch. We didn't cover the interaction with speculative execution very well in the first iteration and will fix it soon.

        @kim - for problem #3 - can you tell us whether the issue is with how the output data is displayed or with the actual results themselves (i.e. the filter wasn't correctly executed)? You can get some idea of how many rows were filtered by clicking on the link to the map-reduce job and looking at the counter named 'FILTERED'.

        We will get back on the other issues as well.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12388967/hive.tgz
        against trunk revision 689733.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3137/console

        This message is automatically generated.

        Ashish Thusoo added a comment -

        @Kim - Thanks for trying this out and for reporting the problems that you encountered. About #2 - we modeled much of our syntax on mysql, which supports DESCRIBE <table name> and not DESCRIBE TABLE <table name>. The tutorial, however, calls these DESCRIBE TABLE, which is incorrect. We will fix that in the tutorial.

        @lengwuqing - About 1 - this is probably the same bug that we discovered in house after we uploaded the patch. The new patch that I will upload today should have a fix for that.

        Ashish Thusoo added a comment -

        @hadoop QA - there are tests in the .tgz, except that the .tgz does not include the changes to hadoop's build.xml that are needed to run the tests by executing ant test from the top level. Should I include the build.xml in the .tgz or should I submit that subsequently?

        The patch is a .tgz because it contains jar files, so the patch command is not going to work.

        Owen O'Malley added a comment -

        The libraries haven't been fixed to include the licenses and a README.

        Owen O'Malley made changes -
        Assignee Ashish Thusoo [ athusoo ]
        Fix Version/s 0.19.0 [ 12313211 ]
        Status Patch Available [ 10002 ] Open [ 1 ]
        Prasad Chakka added a comment -

        yw kim,

        It works fine with mysql version 5.0.44_2_3. Can you check this mysql bug report http://bugs.mysql.com/bug.php?id=28138 and use a corresponding version?

        SD_PARAMS CREATE TABLE `SD_PARAMS` (
        `STORAGE_DESC_ID_OID` bigint(20) NOT NULL,
        `PARAM_KEY` varchar(256) character set latin1 collate latin1_bin NOT NULL,
        `PARAM_VALUE` varchar(1024) character set latin1 collate latin1_bin default NULL,
        PRIMARY KEY (`STORAGE_DESC_ID_OID`,`PARAM_KEY`),
        KEY `SD_PARAMS_N49` (`STORAGE_DESC_ID_OID`),
        CONSTRAINT `SD_PARAMS_FK1` FOREIGN KEY (`STORAGE_DESC_ID_OID`) REFERENCES `STORAGE_DESC` (`STORAGE_DESC_ID`)
        ) ENGINE=InnoDB DEFAULT CHARSET=latin1
        YoungWoo Kim added a comment -

        Hi Prasad,

        Thanks for your answer.

        I'm using MySQL 5.0.51a.

        I found this:

        • 1 byte per character with the 'latin1' charset.
        • 3 bytes per character with the 'utf8' charset in MySQL.

        So hive works with 'CHARSET=latin1' in MySQL.
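        A quick worked check of the numbers above (assuming the 767-byte InnoDB index-key limit from the earlier error, and ignoring any per-column overhead): the VARCHAR(256) key column needs 256 x 3 = 768 bytes under utf8, which already exceeds 767, while under latin1 it needs only 256 x 1 = 256 bytes, which fits.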

        regards,
        -yw kim.

        Ashish Thusoo added a comment -

        Added license information.
        Added the package target to copy hive distribution to hadoop tar ball location.

        I was not able to complete the javadoc integration as there is some generated code which we want to exclude, and it also needs changes to the hadoop build.xml, which is not part of this .tgz. Will submit those changes as a separate diff file later if that is ok.

        Ashish Thusoo made changes -
        Attachment hive.tgz [ 12389137 ]
        Ashish Thusoo added a comment -

        Incorporated Owen's comments, except that I was not able to get the javadoc stuff working. Will submit a separate patch for that later if that is ok.

        Ashish Thusoo made changes -
        Affects Version/s 0.17.2 [ 12313296 ]
        Status Open [ 1 ] Patch Available [ 10002 ]
        Affects Version/s 0.19.0 [ 12313211 ]
        YoungWoo Kim added a comment -

        Joydeep,

        About problem #3:
        It was just a simple test. I loaded a small set of data containing Korean characters and then executed a simple select statement.
        As a result, the returned data was correct but was displayed as broken characters.

        I tried another test using an 'INSERT OVERWRITE LOCAL DIRECTORY ...' statement with the same select statement.
        That works fine: the result text files are correct, with valid characters.

        thanks.
        -yw kim.

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12389137/hive.tgz
        against trunk revision 690093.

        +1 @author. The patch does not contain any @author tags.

        -1 tests included. The patch doesn't appear to include any new or modified tests.
        Please justify why no tests are needed for this patch.

        -1 patch. The patch command could not apply the patch.

        Console output: http://hudson.zones.apache.org/hudson/job/Hadoop-Patch/3140/console

        This message is automatically generated.

        dhruba borthakur made changes -
        Component/s contrib/hive [ 12312455 ]
        qing yan added a comment -

        Hi guys,

        I am playing with the Hive binary right now and ran into a problem: how do I reference an entry in a MAP type?

        According to the example given in the PDF :
        ...
        SELECT pv.userid, pv.properties['page type'];

        but it doesn't work and is conflicting with the source code
        org/apache/hive/ql/parse/SemanticAnalyzer.java
        ...
        if (funcText.equals("[")){
          // "[]" : LSQUARE/INDEX Expression
          assert(children.size() == 2);
          // Only allow constant integer index for now
          if (!(children.get(1) instanceof exprNodeConstantDesc)
              || !(((exprNodeConstantDesc)children.get(1)).getValue() instanceof Integer)) {
            throw new SemanticException(ErrorMsg.INVALID_ARRAYINDEX_CONSTANT.getMsg(expr));
          }

        My question is: is the MAP type supported in the current version, and what is the correct syntax for it?

        Thank you!

        Ashish Thusoo added a comment -

        Hi Qing yan,

        We do not support MAPs in the query layer right now, even though the support to create those tables is there. We are working on that. The code that you refer to is actually meant for lists, hence the restriction to integer values.

        Ashish

        Ashish Thusoo added a comment -

        Sorry, that was a bit incorrect. What I meant is that we have some untested support for MAPs in the query layer, and what you are likely hitting are the results of that. You can try lifting the integer restriction and see what happens. We will try that internally as well.

        Are you creating this table through DDL? If that is the case then it would not work. We do not yet have support for a serde that generically serializes and deserializes maps and lists; we rely on thrift to do that. So you should try it with a thrift table, and for now such tables can only be created programmatically...

        Ashish

        dhruba borthakur added a comment -

        Hi ashish, would it be possible for you to post the output of an "ant clean package" and "ant javadocs" for your workspace?

        Owen O'Malley added a comment -

        please run findbugs too and post the results.

        dhruba borthakur added a comment -

        Hi Owen,

        findbugs is not (yet) switched on for the Hive patch. So, if you run "ant findbugs" from the top level, it won't run findbugs on src/contrib/hive yet. In short, "ant findbugs" on the top level won't show any new findbugs warnings!

        We plan to switch on findbugs (and also fix findbugs warnings) in a separate follow-on JIRA. Hope that is ok with you.

        Ashish Thusoo added a comment -

        The output of running

        ant clean package findbugs -l ant.log

        is attached.

        As Dhruba mentioned, we have not included the top-level build.xml in this patch and we have not enabled findbugs on hive yet, so all of that runs cleanly and should not add additional warnings to the hadoop build.

        Ashish Thusoo made changes -
        Attachment ant.log [ 12389382 ]
        Owen O'Malley added a comment -

        I just committed this. Thanks, guys!

        Owen O'Malley made changes -
        Hadoop Flags [Reviewed]
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Joydeep Sen Sarma added a comment -

        thanks Owen!

        shenzhuxi added a comment - - edited

        I downloaded dist.tgz. When I start Hive, I get:

        "WARN fs.FileSystem: "localhost:9000" is a deprecated filesystem name. Use "hdfs://localhost:9000/" instead.
        java.lang.NoClassDefFoundError: org/apache/hadoop/hive/ql/session/SessionState
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
        at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
        at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
        Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.ql.session.SessionState
        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
        ... 7 more

        I use java-6-sun-1.6.0.06

        Ashish Thusoo added a comment -

        Hi gleader,

        Did you build it from hadoop sources or did you just download the dist.tgz? It seems that the classpath is not properly set or the hive jars are not in your classpath. Can you provide more details on how you set this up?

        Ashish

        shenzhuxi added a comment - - edited

        I downloaded dist.tgz and set export HADOOP=<hadoop-install-dir>/bin/hadoop as described in the README.
        When I run bin/hive, I get the error above.

        However, it works on another computer.

        When I
        hive> CREATE TABLE pokes (foo INT, bar STRING);

        I got this
        FAILED: Error in semantic analysis: javax.jdo.JDOFatalInternalException: Unexpected exception caught.

        NestedThrowables:

        java.lang.reflect.InvocationTargetException

        Prasad Chakka added a comment -

        Can you check the log at /tmp/<user-name>/hive.log?

        Doug Cutting added a comment -

        > We plan to switch on findbugs (and also fix findbugs warnings) in a separate follow-on JIRA.

        Why not address these first? And, if they're to be addressed later, where is the JIRA?

        Ashish Thusoo added a comment -

        I just filed a JIRA for this

        https://issues.apache.org/jira/browse/HADOOP-4072

        This needs changes to the root build.xml which we had not submitted as part of the patch.

        Also there were a bunch of findbugs warnings that we got from antlr generated code when we last ran it. We want to be able to suppress those and cleanup some others before we turn this on.

        Hudson added a comment -

        Integrated in Hadoop-trunk #595 (See http://hudson.zones.apache.org/hudson/job/Hadoop-trunk/595/ )
        lengwuqing added a comment - - edited

        — How to improve Hive —

        1. compiling:
        1. download and unzip hadoop-0.17.2.1.tar.gz
        2. download and unzip facebook-hive.tar.gz
        3. copy hive to ./hadoop-0.17.2.1/src/contrib/hive
        4. export CLASSPATH=.:../../../../hadoop-0.17.2.1/hadoop-0.17.2.1-core.jar:$CLASSPATH
        5. ant -Ddist.dir=hive_dist -Dtarget.dir=hive_target package
        6. cp -rf hive_target ../../../../hadoop-0.17.2.1/contrib/hive
        cp -rf hive_target ../../../../hive

        2. developing & debug
        1. create an Eclipse project
        2. collect all the Hive-related .java files into the src directory under the project.
        3. collect all the necessary third-party .jar files into lib and set up the library settings in the project.
        4. modify and run this command:
        java -classpath .;./lib/antlr-3.0.1.jar;./lib/stringtemplate-3.1b1.jar;./lib/antlr-2.7.7.jar;./lib/antlr-runtime-3.0.1.jar org.antlr.Tool -fo src/org/apache/hadoop/hive/ql/parse/ src/org/apache/hadoop/hive/ql/parse/Hive.g
        5. refresh the project; you now have a complete Hive development environment.

        3. execution
        export HADOOP_HOME=/home/hadoop/setup/hadoop-release
        ./bin/hive -hiveconf hive.root.logger=INFO,console

        4. hivefly
        1. create some data formatted like the two tables below.
        the resume table: the number of records is 1024*1024*100.
        the course table: the number of records is 1024*1024*100*3.
        2. run these scripts and you may find that the Hive system cannot compute the correct result.
        You can debug the hive system in the development environment we built above.

        CREATE TABLE resume(id INT, name STRING, gender STRING, years INT, intro STRING);
        CREATE TABLE course(id INT, name STRING, course STRING, score INT, notes STRING);

        LOAD DATA LOCAL INPATH '/home/hadoop/john/hive/resume.txt' OVERWRITE INTO TABLE resume;
        LOAD DATA LOCAL INPATH '/home/hadoop/john/hive/course.txt' OVERWRITE INTO TABLE course;

        CREATE TABLE test00(name STRING, count INT);
        INSERT OVERWRITE TABLE test00 SELECT t1.name,count(DISTINCT t1.name) FROM course t1 GROUP BY t1.name;

        I suggest that Facebook pay us, and we can work together with FB's engineers to improve this project.

        Prasad Chakka added a comment -

        Hi lengwuqing

        Here are some answers that might make the above steps a little easier...

        1: Compiling
        I think just copying the facebook-hive.tar.gz to src/contrib/hive and then doing 'ant package' from the hadoop-0.17.2.1 directory should create the necessary packages for Hive in build/contrib/hive/dist and also where Hadoop expects them. (Steps 4 & 5 can be done by just running 'ant package'.)

        2: For step 4) just do 'ant build-grammar' from the src/contrib/hive/ql directory; that will generate the antlr grammar code. You might also need to do 'ant gen-test' if you want to see the test code.

        4: Did you mean that Insert Overwrite is not working correctly? Which query is computing incorrect results?

        Thanks,
        Prasad

        lengwuqing added a comment -

        Hi Prasad Chakka:
        The 'Insert' does not work correctly. Based on my test dataset, it produces output, but not the correct result.

        And could you please give me some help: I want to design and implement 'SORT' in hive.

        And later, I want to speed up hive's performance; the basic idea is to improve the 'Sample' and 'Partition' handling in SemanticAnalyzer.java.

        lengwuqing added a comment -

        TITLE: I think the approach of group-by has some issue:

        Hi, Guys:
        I create a table like this:
        CREATE TABLE course(id INT, name STRING, course STRING, score INT, notes STRING);
        And I wanted to try the Group-By like this:
        INSERT OVERWRITE TABLE test00 SELECT t1.name,count(DISTINCT t1.name) FROM course t1 GROUP BY t1.name;

        I found that Hive NOT ONLY cannot compute a correct result, BUT ALSO the time cost is very high. I inserted some diagnostic code into ExecReducer.java and found that all of the records were processed in ONLY one or two reducers. I noticed some comments in Hive saying that the Group-By is based on hashing, and I can confirm that the values in the 'Name' column are different.

        I developed a system named TING (www.sadbit.com), which uses quite different methods to implement a parallel/distributed database. Even though it is not mature yet, I can estimate that, in my hardware environment, processing that dataset should take less than 2 minutes, but Hive takes more than 9 minutes.
        My biggest issue is: all data is processed on 1-2 nodes while reducing, even though the number of reducers is 24.

        Can anyone give me some comments on why?
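        For context on the hash-based group-by mentioned above: map output keys are typically routed to reducers with a 'hash the key, mod the reducer count' rule, so if the keys emitted for a query hash into only a few buckets, most rows land on one or two reducers no matter how many reducers are configured. A minimal stand-alone Java sketch of that routing rule (illustrative only, not Hadoop's or Hive's actual partitioner code; the key values and reducer count are made up):

        import java.util.*;

        public class HashPartitionSketch {
            // The usual default rule: hash the map output key, mod the number of reducers.
            static int partitionFor(String key, int numReducers) {
                return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
            }

            public static void main(String[] args) {
                int numReducers = 24;
                String[] keys = { "alice", "bob", "carol", "dave" };
                // Count how many of these keys would be routed to each reducer.
                Map<Integer, Integer> keysPerReducer = new TreeMap<Integer, Integer>();
                for (String key : keys) {
                    int reducer = partitionFor(key, numReducers);
                    Integer count = keysPerReducer.get(reducer);
                    keysPerReducer.put(reducer, count == null ? 1 : count + 1);
                }
                // Prints a map of reducer number -> number of keys routed there.
                System.out.println(keysPerReducer);
            }
        }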

        Tom White added a comment -

        Hi lengwuqing

        For new features/bug fixes/improvements to Hive please open new Jiras or start discussions on the dev list, rather than reusing this issue (which is now resolved).

        Thanks!

        Tenaali added a comment -

        Hive mailing list ??
        Is there any mailing list where we can discuss hive tips, issues etc ?

        YoungWoo Kim added a comment -

        Hi Tenaali,

        hive-users mailing list, http://publists.facebook.com/mailman/listinfo/hive-users

        • yw kim
        Robert Chansler made changes -
        Release Note  Hive - Data Warehouse built on top of hadoop that enables structuring hadoop files as tables and partitions and allows users to query this data through a SQL like language using a command line interface.
        Introduced Hive Data Warehouse built on top of Hadoop that enables structuring Hadoop files as tables and partitions and allows users to query this data through a SQL like language using a command line interface.
        Nigel Daley made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Owen O'Malley made changes -
        Component/s contrib/hive [ 12312455 ]
        Zheng Shao added a comment -

        The publists.facebook.com link for hive mailing list is deprecated.

        Please see http://hadoop.apache.org/hive/mailing_lists.html for details on the new mailing lists.

        Bhavesh Shah added a comment -

        Hi guys,
        When I enter the query in the Hive CLI I get the following errors:
        $ bin/hive -e "insert overwrite local directory '/tmp/local_out' select a.* from invites a where a.ds='2008-08-15';"

        Hive history file=/tmp/Bhavesh.Shah/hive_job_log_Bhavesh.Shah_201112021007_2120318983.txt
        Total MapReduce jobs = 2
        Launching Job 1 out of 2
        Number of reduce tasks is set to 0 since there's no reduce operator
        Starting Job = job_201112011620_0004, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201112011620_0004
        Kill Command = C:\cygwin\home\Bhavesh.Shah\hadoop-0.20.2\/bin/hadoop job -Dmapred.job.tracker=localhost:9101 -kill job_201112011620_0004
        2011-12-02 10:07:30,777 Stage-1 map = 0%, reduce = 0%
        2011-12-02 10:07:57,796 Stage-1 map = 100%, reduce = 100%
        Ended Job = job_201112011620_0004 with errors
        FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.MapRedTask

        So what is the problem and how do I solve it?
        Please suggest a solution.
        Thanks.


          People

          • Assignee:
            Ashish Thusoo
            Reporter:
            Joydeep Sen Sarma
          • Votes:
            4
            Watchers:
            40

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 1,080h
              1,080h
              Remaining:
              Remaining Estimate - 1,080h
              1,080h
              Logged:
              Time Spent - Not Specified
              Not Specified
