Beam / BEAM-2955

Create a Cloud Bigtable HBase connector

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-java-gcp
    • Labels:
      None

      Description

      The Cloud Bigtable (CBT) team has maintained a Dataflow connector in a separate repo for a while. Recently, we reworked the Cloud Bigtable client so that it coexists better with the Beam ecosystem, and we also released a Beam connector in our repository that exposes HBase idioms rather than the Protobuf idioms of BigtableIO. More information about the customer experience of the HBase connector can be found here: https://cloud.google.com/bigtable/docs/dataflow-hbase.

      The Beam repo is a much better place to house a Cloud Bigtable HBase connector. There are a couple of ways we can implement this new connector:

      1. The CBT connector depends on artifacts in the io/hbase Maven project. We can create a new connector that extends HBaseIO for the purposes of CBT. We would have to add some features to HBaseIO to make that work (dynamic rebalancing, and a way for HBase and CBT's size estimation models to coexist).
      2. The BigtableIO connector works well, and we can add an adapter layer on top of it. I have a proof of concept of it here: https://github.com/sduskis/cloud-bigtable-client/tree/add_beam/bigtable-dataflow-parent/bigtable-hbase-beam.
      3. We can build a separate CBT HBase connector.
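
      To make option 2 concrete, here is a minimal sketch of what an adapter layer does in principle: it accepts an HBase-style write and translates it into the protobuf-style row mutations that a sink such as BigtableIO consumes. This is not the linked proof of concept. Every class name below (`HPut`, `Cell`, `SetCell`, `RowMutations`) is a hypothetical stand-in so the example runs without Beam, HBase, or Bigtable dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class HBaseStyleAdapterSketch {

  /** Hypothetical stand-in for one column write inside an HBase-style Put. */
  static final class Cell {
    final String family;
    final String qualifier;
    final byte[] value;

    Cell(String family, String qualifier, byte[] value) {
      this.family = family;
      this.qualifier = qualifier;
      this.value = value;
    }
  }

  /** Hypothetical stand-in for an HBase-style Put: a row key plus its cells. */
  static final class HPut {
    final byte[] row;
    final List<Cell> cells = new ArrayList<>();

    HPut(byte[] row) {
      this.row = row;
    }

    HPut addColumn(String family, String qualifier, byte[] value) {
      cells.add(new Cell(family, qualifier, value));
      return this;
    }
  }

  /** Hypothetical stand-in for a protobuf-style "set cell" mutation. */
  static final class SetCell {
    final String familyName;
    final String columnQualifier;
    final byte[] value;

    SetCell(String familyName, String columnQualifier, byte[] value) {
      this.familyName = familyName;
      this.columnQualifier = columnQualifier;
      this.value = value;
    }
  }

  /** Hypothetical stand-in for the (row key, mutations) pair a protobuf-oriented sink accepts. */
  static final class RowMutations {
    final byte[] rowKey;
    final List<SetCell> mutations;

    RowMutations(byte[] rowKey, List<SetCell> mutations) {
      this.rowKey = rowKey;
      this.mutations = mutations;
    }
  }

  /** The adapter itself: one HBase-style Put becomes one row key with one mutation per cell. */
  static RowMutations adapt(HPut put) {
    List<SetCell> mutations = new ArrayList<>();
    for (Cell c : put.cells) {
      mutations.add(new SetCell(c.family, c.qualifier, c.value));
    }
    return new RowMutations(put.row, mutations);
  }

  public static void main(String[] args) {
    HPut put = new HPut("row-1".getBytes(StandardCharsets.UTF_8))
        .addColumn("cf", "name", "beam".getBytes(StandardCharsets.UTF_8))
        .addColumn("cf", "kind", "io".getBytes(StandardCharsets.UTF_8));
    RowMutations out = adapt(put);
    System.out.println(out.mutations.size() + " mutations for row "
        + new String(out.rowKey, StandardCharsets.UTF_8));
  }
}
```

      A real adapter would also need to cover deletes, timestamps, and reads, which this sketch leaves out.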

      I'm happy to do the work. I would appreciate some guidance and discussion about the right approach.

        Activity

        iemejia Ismaël Mejía added a comment -

        This is a great idea, Solomon Duskis. I favor (1), but I can understand if the current BigtableIO maintainers prefer (2). If we go the io/hbase route, I would really like us to share as much code as possible, and of course we can refactor it freely to accommodate the other client.
        About the missing parts: I added Dynamic Work Rebalancing to it recently (not sure if this is what you are referring to). And I agree that size estimation should be separated. I am OOO until the beginning of October, but you can count on me for anything related to this.

        chamikara Chamikara Jayalath added a comment -

        I like approach (1) as well, since it seems like it'll minimize code duplication. Are there any drawbacks (features, performance) of approach (1) compared to approach (2)?

        Assigning this JIRA to you.

        Also ccing some folks interested in I/O.
        Reuven Lax Eugene Kirpichov Jean-Baptiste Onofré

        sduskis Solomon Duskis added a comment -

        It's awesome that you added the Dynamic rebalancing! I'm ok with extending HBaseIO, as long as there aren't any other overriding concerns. I'd like to explore the possibility of templates (ValueProviders) as the configuration of HBaseIO.

        sduskis Solomon Duskis added a comment -

        Chamikara: HBaseIO will have to be extended or wrapped. Cloud Bigtable needs slightly different configuration options, has a different way to calculate estimated sizes, and needs templating. The interface would essentially be the same whether we leverage HBaseIO or BigtableIO. The BigtableIO wrapper that I wrote was 271 lines of code.

        I'll create a PR for the BigtableIO wrapper in the Beam github project, since the code is already written.
        I'll also create a PR for an extension of HBaseIO.

        That way, we can compare the two options.

        chamikara Chamikara Jayalath added a comment -

        I'm fine with option (2) as well. An extra ~300 lines of code shouldn't be much of an issue. I think we should take the option that makes more sense in terms of performance and usability.


          People

          • Assignee:
            sduskis Solomon Duskis
          • Reporter:
            sduskis Solomon Duskis
          • Votes:
            0
          • Watchers:
            3
