HBase / HBASE-6800

Build a Document Store on HBase for Better Query Processing

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.95.2
    • Fix Version/s: None
    • Component/s: Coprocessors, Performance
    • Labels:
      None

      Description

      In the last couple of years, more and more people have begun to stream data into HBase in near real time, and to analyze that data directly with high-level queries (e.g., Hive). While HBase already has very effective MapReduce integration and good scanning performance, query processing using MapReduce on HBase still has significant gaps compared to HDFS: ~3x space overhead and 3~5x performance overhead according to our measurements.

      We propose to implement a document store on HBase, which can greatly improve query processing on HBase (by leveraging the relational model and read-mostly access patterns). According to our prototype, it can reduce space usage by up to ~3x and speed up query processing by up to ~1.8x.
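      Roughly speaking, the space saving comes from amortizing HBase's per-cell key overhead: every cell repeats the row key, family, qualifier, and timestamp, so packing a row's narrow fields into one serialized "document" cell stores that key material once. A self-contained toy sketch of the effect (the sizes below are illustrative, not the exact HBase KeyValue layout):

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Toy model: compare one-cell-per-field storage with one packed document
// cell per row. Sizes are simplified stand-ins for HBase's KeyValue layout.
public class PackingSketch {

    // Rough per-cell size: row key + family + qualifier + 8-byte timestamp
    // + 1-byte type tag + value bytes.
    static int cellSize(String rowKey, String family, String qualifier, int valueLen) {
        return rowKey.length() + family.length() + qualifier.length() + 8 + 1 + valueLen;
    }

    // One cell per field: every field pays the full key overhead.
    static int perFieldSize(String rowKey, Map<String, byte[]> fields) {
        int total = 0;
        for (Map.Entry<String, byte[]> e : fields.entrySet()) {
            total += cellSize(rowKey, "f", e.getKey(), e.getValue().length);
        }
        return total;
    }

    // All fields serialized into a single cell; field names move into a
    // shared schema, so only a length prefix plus the value is stored.
    static int documentSize(String rowKey, Map<String, byte[]> fields) {
        int payload = 0;
        for (byte[] v : fields.values()) {
            payload += v.length + 2; // 2-byte length prefix per field
        }
        return cellSize(rowKey, "f", "doc", payload);
    }

    public static void main(String[] args) {
        Map<String, byte[]> fields = new LinkedHashMap<>();
        for (int i = 0; i < 10; i++) {
            fields.put("field" + i, "val".getBytes(StandardCharsets.UTF_8));
        }
        String rowKey = "user#0000012345";
        System.out.println(perFieldSize(rowKey, fields) + " vs " + documentSize(rowKey, fields));
    }
}
```

      With ten 3-byte fields under a 15-byte row key, the per-field layout costs 340 bytes versus 78 bytes packed, i.e. a multiple-x reduction of the same flavor as the measurements above.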

      Attachments

      1. dot-deisgn.pdf (353 kB, Jason Dai)

        Issue Links

          Activity

          Jason Dai added a comment -

          Considering that the relative size of individual field(s) in the document can be small, the cost of update would be comparatively higher than a fully de-normalized schema.

          One option is to take a similar approach to HBaseHUT (https://github.com/sematext/HBaseHUT), which converts read-modify-write (RMW) operations into blind updates and constructs the up-to-date value on the fly.
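          The idea behind that approach can be sketched without HBase at all (a simplified, assumption-laden model, not HBaseHUT's actual API): each update is appended as a standalone partial record, so no read is needed on the write path, and the current document is merged from the pending records at read time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Append-only updates with merge-on-read, in the spirit of HBaseHUT's
// RMW-to-append conversion. All names and structures here are illustrative.
public class AppendOnlyUpdates {
    // key -> list of partial updates, oldest first
    private final Map<String, List<Map<String, String>>> log = new HashMap<>();

    // Blind write: no read-before-update, so individual-field updates stay cheap.
    public void update(String key, Map<String, String> partial) {
        log.computeIfAbsent(key, k -> new ArrayList<>()).add(partial);
    }

    // Read path: fold all partial updates into the up-to-date document.
    public Map<String, String> read(String key) {
        Map<String, String> merged = new HashMap<>();
        for (Map<String, String> partial : log.getOrDefault(key, List.of())) {
            merged.putAll(partial); // later updates win per field
        }
        return merged;
    }
}
```

          In HBaseHUT the same merge can also run offline (e.g. during compaction) so the pending-update chains stay short; that detail is omitted here.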

          Andrew Purtell added a comment -

           If moving your code into Apache is a goal, you could also start the co-processor project in the apache incubator.  You could do that while being consistent with andrew's suggested methodology (not forking HBase, mavenized integration...).

          This is a good suggestion. Panthera isn't so much an enhancement to HBase as a full application on top of it, with wider scope than just HBase – also Hive, plus additional new components. Within the scope of the HBase project alone, API changes, core changes, and (incorporating my earlier comment) utility coprocessors of sufficient generality make a lot of sense, as does addressing the meta issues raised (i.e., should HBase have Eclipse-plugin-like tooling for getting and installing CPs). HBase should be a good platform for your work; let us know what you need.

          Ted Yu made changes -
          Link This issue relates to HBASE-6805 [ HBASE-6805 ]
          Ted Yu added a comment -

          @Jason:
          In the description of this JIRA, you mentioned that DOT serves 'read-mostly access patterns'.

          it can also provide support for update of individual fields

          I want to get your opinion on how the above can be achieved. Considering that the relative size of individual field(s) in the document can be small, the cost of update would be comparatively higher than a fully de-normalized schema.

          Jason Dai added a comment -

          If moving your code into Apache is a goal, you could also start the co-processor project in the apache incubator.

          This sounds like an interesting idea. A potential objective of the project is to provide a full-fledged document store on HBase - in addition to the analytic improvements demonstrated by DOT, it can also provide support for update of individual fields, nested documents, flexible document schema, columnar document storage, etc.

          eric baldeschwieler added a comment -

          If moving your code into Apache is a goal, you could also start the co-processor project in the apache incubator. You could do that while being consistent with andrew's suggested methodology (not forking HBase, mavenized integration...).

          In terms of dealing with the synchronization issue, you can work with orgs / projects that bundle distributions for business users. Apache BigTop being an example of such a group.

          Andrew Purtell added a comment -

          (1) How do the users figure out what co-processor applications are stable, so that they can use in their production deployment?

          This is exactly the motivation for starting all coprocessor-based applications/contributions as external projects. We will have no registry of "approved" or "stable" coprocessor applications. I'd imagine users would expect all such apps in the HBase distribution proper to be in such a state. Beyond that, I don't think the project has the bandwidth to track a number of ideas in development. We can't know in advance what support, interest, or stability any given contribution would have, so starting as an external project establishes this on its own merit. A popular and well-cared-for contribution would eventually be a candidate for inclusion into the HBase source distribution proper. This is my characterization of what has been discussed and the consensus reached by the PMC. If others feel this is in error, or if we should do something differently here, please speak up.

           (2) How do we ensure the co-processor applications continue to be compatible with the changes in the HBase project, and compatible with each other?

          We don't. The onus is on the contributor. If at some point the consensus of the project is to bring a particular contribution into the ASF HBase source distribution, then at that point we must ensure these things... but only for what is in the source distribution.

           (3) How do the users get the co-processor applications? They can no longer get these from the Apache HBase release, and may need to perform manual integrations - not something average business users will do, and the main reason that we put the full HBase source tree out 

          HBase is a mavenized project and your DOT system is a coprocessor application. Barring issues with the CP framework itself, I can see no technical reason why you have to include and maintain a full fork of HBase. Simply depend on the HBase project artifacts, and the complete DOT application can be compiled as a jar to drop on the classpath of an HBase installation. Where the CP framework is insufficient, we can address that. Or, if there is some other technical reason (like a patch to core HBase), please list those reasons so we can look at addressing them.

          As Ted also says, the modularization of HBase means we could accept a mavenized project that depends on HBase core artifacts pretty easily.
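          Concretely, "depend on the HBase project artifacts" would look something like the fragment below in the coprocessor application's pom.xml. The coordinates are illustrative only: pre-modularization releases shipped a single org.apache.hbase:hbase artifact, while the modularized trunk splits it into hbase-client, hbase-server, etc., so the artifactId and version must match the targeted release.

```xml
<!-- Illustrative coordinates; adjust artifactId/version to the targeted
     HBase release. "provided" scope because the jar is dropped onto the
     classpath of an existing HBase installation, which supplies HBase. -->
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase</artifactId>
  <version>0.94.1</version>
  <scope>provided</scope>
</dependency>
```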

          Ted Yu added a comment -

          @Jason:
          You raised some interesting questions.

          I think you may be aware of the modularization effort in trunk. Matt Corgan is submitting his contribution as a separate module.
          This model may be the answer to some of your questions.

          Jason Dai added a comment -

          coprocessor based applications should begin as independent code contributions, perhaps hosted in a GitHub repository

          It would be helpful if only the changes on top of stock HBase code appear here.

          This could work, though I think we need to figure out how to address several implications of the proposal, such as:
          (1) How do the users figure out what co-processor applications are stable, so that they can use in their production deployment?
          (2) How do we ensure the co-processor applications continue to be compatible with the changes in the HBase project, and compatible with each other?
          (3) How do the users get the co-processor applications? They can no longer get these from the Apache HBase release, and may need to perform manual integrations - not something average business users will do, and the main reason that we put the full HBase source tree out (several of our users and customers want to get a prototype of DOT to try it out).

          We would be delighted to work with you on the necessary coprocessor framework extensions. I'd recommend a separate JIRA specifically for this.

          Yes, we do plan to submit the proposal for observers for the filter operations as a separate JIRA (the original plan was to make it a sub task of this JIRA).

          Andrew Purtell added a comment -

          Thank you for your interest in contributing to the HBase project. I have two initial comments/suggestions:

          1) From the attached document, it appears that the existing coprocessor framework was sufficient for the implementation of the DOT system on top, which is great to see. There has been some discussion in the HBase PMC, documented in the archives of the dev@hbase.apache.org mailing list, that coprocessor based applications should begin as independent code contributions, perhaps hosted in a GitHub repository. In your announcement on general@ I see you have sort-of done this already at: https://github.com/intel-hadoop/hbase-0.94-panthera , except this is a full fork of the HBase source tree with all history of individual changes lost (a single commit of a source drop). It would be helpful if only the changes on top of stock HBase code appear here. Otherwise, what you have done is in effect forked the HBase project, which is not conducive to contribution.

          2) From the design document: "The co-processor framework needs to be extended to provide observers for the filter operations, similar to the observers of the data access operations." We would be delighted to work with you on the necessary coprocessor framework extensions. I'd recommend a separate JIRA specifically for this. Let's discuss what Coprocessor API extensions or additions are necessary. Do you have a proposal?
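          To make the discussion concrete, here is a hypothetical sketch of what a filter observer hook could look like. None of these names exist in the HBase coprocessor API; they are assumptions mirroring the pre/post style of the existing data-access observers, modeled here without any HBase dependency.

```java
import java.util.List;

// Hypothetical filter-observer hook: observers are consulted after a
// filter decision and may override it. All names are illustrative.
public class FilterObserverSketch {

    enum Decision { INCLUDE, SKIP }

    // Hypothetical observer interface, in the spirit of RegionObserver's
    // pre/post hooks for data-access operations.
    interface FilterObserver {
        // Called after the filter has decided on a cell; may override.
        Decision postFilterCell(byte[] row, byte[] qualifier, Decision filterDecision);
    }

    // Toy dispatch loop showing the call order through an observer chain.
    static Decision filterWithObservers(byte[] row, byte[] qualifier,
                                        Decision baseDecision,
                                        List<FilterObserver> observers) {
        Decision d = baseDecision;
        for (FilterObserver o : observers) {
            d = o.postFilterCell(row, qualifier, d);
        }
        return d;
    }

    public static void main(String[] args) {
        // Example observer: rescue cells in a hypothetical "doc" qualifier
        // that the base filter skipped, so a document-aware coprocessor
        // could look inside packed documents.
        FilterObserver docAware = (row, qual, dec) ->
            new String(qual).equals("doc") ? Decision.INCLUDE : dec;
        System.out.println(filterWithObservers("r1".getBytes(), "doc".getBytes(),
            Decision.SKIP, List.of(docAware)));
    }
}
```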

          Jason Dai made changes -
          Description: revised (query speedup updated from "up to ~2x" to "up to ~1.8x")
          Jason Dai made changes -
          Attachment: dot-deisgn.pdf [ 12545380 ]
          Jason Dai created issue -

            People

            • Assignee: Unassigned
            • Reporter: Jason Dai
            • Votes: 1
            • Watchers: 30