Hive / HIVE-3752

Add a non-sql API in hive to access data.

    Details

    • Type: Improvement
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      We would like to add an input/output format for accessing Hive data in Hadoop directly, without having to use, for example, a transform. Using a transform means running a whole map-reduce step with its own disk accesses and its imposed structure. It also means making Hive the base infrastructure for the entire system being developed, which is not the right fit because we only need a small part of it (access to the data).

      So we propose adding an API-level InputFormat and OutputFormat to Hive that will make it trivially easy to select a table with a partition spec and read from / write to it. We chose this design to make it compatible with Hadoop, so that existing systems that work with Hadoop's IO API will just work out of the box.
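
      As a rough illustration of "select a table with partition spec", the sketch below shows one way such a selection could be carried through a job's key/value configuration. It is not taken from the patch: the class name HiveTableSpec, its methods, and the example.hive.input.* keys are all hypothetical, and a plain java.util.Map stands in for the Hadoop job configuration so the example has no dependencies.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.StringJoiner;

/**
 * Hypothetical sketch: names the Hive table and partition a job wants to read.
 * Nothing here is a real Hive or Hadoop API.
 */
final class HiveTableSpec {
    // Hypothetical configuration keys; a real input format would define its own.
    static final String DB_KEY = "example.hive.input.database";
    static final String TABLE_KEY = "example.hive.input.table";
    static final String PARTITION_COLUMNS_KEY = "example.hive.input.partition.columns";
    static final String PARTITION_VALUE_PREFIX = "example.hive.input.partition.value.";

    private final String database;
    private final String table;
    // Insertion-ordered, because the order of partition columns matters.
    private final Map<String, String> partitionSpec = new LinkedHashMap<>();

    HiveTableSpec(String database, String table) {
        this.database = Objects.requireNonNull(database, "database");
        this.table = Objects.requireNonNull(table, "table");
    }

    /** Adds one partition column/value pair, e.g. ("ds", "2012-11-29"). */
    HiveTableSpec withPartition(String column, String value) {
        partitionSpec.put(column, value);
        return this;
    }

    String qualifiedName() {
        return database + "." + table;
    }

    /** Renders the spec in the col=value/col=value style used for partition directories. */
    String partitionPath() {
        StringJoiner path = new StringJoiner("/");
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            path.add(e.getKey() + "=" + e.getValue());
        }
        return path.toString();
    }

    /** Stores the selection in a key/value map (stand-in for a job configuration). */
    void writeTo(Map<String, String> conf) {
        conf.put(DB_KEY, database);
        conf.put(TABLE_KEY, table);
        // Column order is stored explicitly so it survives an unordered map.
        conf.put(PARTITION_COLUMNS_KEY, String.join(",", partitionSpec.keySet()));
        for (Map.Entry<String, String> e : partitionSpec.entrySet()) {
            conf.put(PARTITION_VALUE_PREFIX + e.getKey(), e.getValue());
        }
    }

    /** Rebuilds the selection on the task side from the same key/value map. */
    static HiveTableSpec readFrom(Map<String, String> conf) {
        HiveTableSpec spec = new HiveTableSpec(conf.get(DB_KEY), conf.get(TABLE_KEY));
        String columns = conf.getOrDefault(PARTITION_COLUMNS_KEY, "");
        if (!columns.isEmpty()) {
            for (String column : columns.split(",")) {
                spec.withPartition(column, conf.get(PARTITION_VALUE_PREFIX + column));
            }
        }
        return spec;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        new HiveTableSpec("default", "edges").withPartition("ds", "2012-11-29").writeTo(conf);
        HiveTableSpec restored = HiveTableSpec.readFrom(conf);
        System.out.println(restored.qualifiedName() + " " + restored.partitionPath());
    }
}
```

      The point of the round trip through the map is that a Hadoop-style input format is typically configured on the client and instantiated again on each task, so whatever identifies the table and partition has to be serializable into the job configuration. The database, table, and partition values used in main are made up for the example.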

      We need this system for the Giraph graph processing system (http://giraph.apache.org/) as running graph jobs which read/write from Hive is a common use case.

      Namit Jain, Avery Ching, Kevin Wilfong, Alessandro Presta

      Input-side (HiveApiInputFormat) review: https://reviews.facebook.net/D7401

        Issue Links

          Activity

          Edward Capriolo added a comment -

          Have you looked at HCatalog's input formats? It sounds very close to what you need.

          Namit Jain added a comment -

          Edward Capriolo, yes we did.

          While HCatalog is a neat project, there are several reasons why a Hive input/output format packaged with Hive is better for Apache Giraph:

          • HCatalog (trunk) unfortunately is not compatible with Hadoop-0.20.
          • HCatalog is much more complex than a simple API for using Hive. We only require a small part of HCatalog's functionality, so a component that provides only that portion will be easier to fix, update, and maintain going forward.
          • Having an input/output format that is part of Hive will guarantee its compatibility with Hive going forward.

          As an aside, HCatalog could also use this new input/output format to interface with Hive, potentially simplifying a portion of its code.

          In a nutshell, HCatalog is overkill for our simple use case, and we want to avoid dependencies on as many systems as possible.
          For a simple use case like ours, enhancing Hive seems like a much simpler option and easier to maintain in the longer term.

          CCing Alan Gates, Carl Steinbach

          Namit Jain added a comment -

          Since we have not heard from anyone, I am assuming this is acceptable.

          Edward Capriolo added a comment -

          I am OK with this. There possibly is some HCatalog overlap, but the lean-and-mean rationale is good and I love the 0.20 support. Let's do it. Let me know if I can help.

          Alan Gates added a comment -

          As a first note, this seems like a good use case of why moving HCat into Hive is a good idea.

          But in general could you explain more why HCat isn't a good solution? If it doesn't work with Hadoop 0.20 we'd certainly be open to patches to fix that. And if it proves useful to push more functionality down into the HCatInput/HCatOutput format that you need that also seems fine. But is there baggage in the current HCatInput and HCatOutput formats that you don't need or want? Why not improve those rather than duplicate functionality?

          Namit Jain added a comment -

          For a simple use case like ours, we don't want to depend on additional layers.
          It is a much simpler change to support this in Hive than to fix a lot more in HCatalog.

          Eventually, if HCat moves into Hive, these two APIs should be merged. But that may take a long time, and it may be much easier for us to have a more lightweight solution in Hive now rather than wait.

          Edward Capriolo added a comment -

          OK now I am going to flip-flop.

          If the description is very close to what HCatalog does, or HCatalog can easily be modified to meet the requirements, we should use HCatalog.

          We do not need to bloat Hive with "not invented here" type code. If something else out there already implements the features, we should integrate it.

          What we need is a more technical description of the proposed features of the suggested non-SQL API.

          Namit Jain added a comment -

          Nitay, can you add the patch with the API?

          Namit Jain added a comment -

          Edward, it is not about "not invented here"; it is entirely about the technical aspects and ease of use/maintainability.
          If something very similar to what we need already exists in Hive, why should we have additional dependencies?

          Nitay Joffe added a comment -

          Okay, I've posted my first patch. This is just the input side of things (HiveApiInputFormat). Review is here: https://reviews.facebook.net/D7401
          It is still missing some basics (it is not yet in the ql/ folder and lacks comments and unit tests), which I am working on. I just wanted to get something up so folks can start looking at it and make suggestions before I go too deep. Let me know your thoughts.

          Namit Jain added a comment -

          Nitay, along with the patch, can you create a document on the Apache Hive cwiki with the proposed API?
          If you don't have wiki permissions, please create an account and send me your ID.
          I will give you the required permissions.

          Namit Jain added a comment -

          Waiting for the documentation.

          Nitay Joffe added a comment -

          Namit Jain, I created an account (name: nitay). Can you give me edit permissions?

          Namit Jain added a comment -

          Nitay Joffe, can you try now?

          Nitay Joffe added a comment -

          Yeah, I see the edit button now, thanks. I'll write up the proposed API and send it your way.

          Nitay Joffe added a comment -

          Namit Jain, here's the initial API proposal: https://cwiki.apache.org/confluence/display/Hive/Hadoop-compatible+Input-Output+Format+for+Hive. Let me know your thoughts.

          Nitay Joffe added a comment -

          I've done this work in a separate library.

          Nitay Joffe added a comment -

          Didn't mean to close. The separate library is here: https://github.com/facebook/hive-io-experimental
          If there is still interest to fold this into Hive I'd be happy to support it going in.


            People

            • Assignee: Nitay Joffe
            • Reporter: Nitay Joffe
            • Votes: 0
            • Watchers: 9
