Details

    • Type: Sub-task
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: bsp core
    • Labels:
      None

      Description

      Create InputFormats/OutputFormats for HBase

          Activity

          praveen sripati added a comment -

          TableInputFormat from HBase can be ported.

          http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapred/TableInputFormat.html
          http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/mapreduce/TableInputFormat.html

          debarshi basak added a comment -

          Hi Praveen, I was working on integrating and exposing Hama APIs for HBase. What is the procedure for creating a patch? Are there any documents that I can follow?

          Thomas Jungblut added a comment -

          Have a look here: http://wiki.apache.org/hama/HowToContribute

          Edward J. Yoon added a comment -

          By the way, shouldn't we contribute this to HBase?

          praveen sripati added a comment -

          I am OK with either option, but until the Hama API becomes stable we might consider keeping it in Hama only. Either way, we have to add a Hama dependency to HBase or an HBase dependency to Hama.

          Ed, I think your feedback will also apply to other sources/sinks (GraphDB etc.).

          Edward J. Yoon added a comment -

          What concerns me is the compatibility of all our releases (as with HDFS [1]). So, I'm planning to add this to the HBase project [2].

          1. http://wiki.apache.org/hama/CompatibilityTable
          2. http://markmail.org/thread/odckj37jpoqke6sy

          Thomas Jungblut added a comment -

          I think we should add this to our contrib package.

          Edward J. Yoon added a comment -

          Think about the various input formats, e.g., Cassandra, Accumulo, HBase, etc.

          Edward J. Yoon added a comment -

          Most NoSQL solutions support MR. Once the value of BSP computing power has been proven, they will have to support Hama BSP in the future.

          Edward J. Yoon added a comment -

          Praveen,

          Are you going to implement the HBase table input/output formats?

          Thomas Jungblut added a comment -

          You are right, but we must keep track of who has which implementations so that we can notify them when something changes. We should add a wiki page.

          Edward J. Yoon added a comment -

          We should add a Wikipage.

          +1

          praveen sripati added a comment -

          Ed - Debarshi Basak mentioned he was working on it, so I haven't spent time on it. I'm not sure how much progress he has made. If you want to implement it, please go ahead.

          Keith Turner added a comment -

          I am an Accumulo developer and I was looking at ACCUMULO-532. I was thinking about the problem of where this code should be put. It seems like there are three options.

          1. Put hama input/output formats in Hama
          2. Put hama input/output formats in cassandra, accumulo, hbase, ..., etc
          3. Put hama input/output formats on something like github

          I am not sure what the best option is, we are trying to figure that out. Below are some thoughts about this issue.

          Option 3 may not be a blessed apache way of doing business.

          One nice thing about option 3 is that if accumulo-hama has a serious bug, it can release immediately w/o waiting. For options 1 and 2, either hama or accumulo must release to fix a serious hama-accumulo bug. Option 3 may also make it easier to use a newer version of hama w/ an older version of accumulo. If accumulo ships with hama and you have an older version of accumulo, it probably depends on an older version of hama. This may not be an issue if the hama API is really stable.

          With option 3, Accumulo could include a hama jar in contrib w/ a link to github.

          From the perspective of Accumulo, we have to work this same issue out for lots of projects. For example accumulo could ship with connectors for pig, hive, gora, hama, etc. This increases the # of dependencies that accumulo has that users may not need. Currently the accumulo pig adapter is on github.

          Apache Gora seems to be doing option 1. They have gora-core, gora-accumulo, gora-hbase, etc. Each of these is a maven subproject of Gora. One nice thing about this for the gora case is that all of the gora stores can share test code. For example, gora-accumulo extends a test class from gora-core for testing.

          I suppose option 1 is bad because hama is a subsystem like map reduce. For example, gora and pig depend on map reduce, and it's probably OK to make them depend on hbase or accumulo. However, you would not want map reduce or hama to depend on a certain version of accumulo or hbase. If accumulo-1.3 jars were in the map reduce system lib dir, I do not think user accumulo-1.4 jars could override those. I suspect the same is true for hama, which is why option 1 is bad?

          I wrote the gora-accumulo backend and then I wrote goraci (https://github.com/keith-turner/goraci). To make goraci easy to run I wrote a slightly complex script and pom. Maybe I was being a bit OCD, but when I ran goraci against accumulo I only wanted the jars that were needed on the classpath. For example, I did not want hbase jars when running goraci against accumulo and vice versa. So what steps would the user have to go through to have hama read from accumulo w/ options 2 vs 3? It seems like the main diff is adding one extra jar to the classpath. Is there any other burden? It's nice to make things as easy as possible for the user.
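
          A minimal sketch of the classpath point raised in the comment above, added for illustration only. It assumes a hypothetical `RecordSource` interface standing in for a BSP input format; it is not the Hama, HBase, or Accumulo API, and every name in it (`RecordSource`, `InMemorySource`, `ConnectorDemo`) is made up. The idea shown is that job code compiles only against the interface, and the backend is looked up by class name at run time, so only the connector jar that is actually used needs to be on the classpath.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for a BSP input format. This is NOT the Hama API.
interface RecordSource {
    Iterator<String> open(String table);
}

// Hypothetical backend. In a real layout this would live in its own
// connector jar (for example one for HBase and another for Accumulo).
class InMemorySource implements RecordSource {
    @Override
    public Iterator<String> open(String table) {
        return List.of(table + ":row1", table + ":row2").iterator();
    }
}

final class ConnectorDemo {
    private ConnectorDemo() {}

    // Loads the backend by class name, so this code has no compile-time
    // dependency on any particular store.
    static List<String> readAll(String implClassName, String table) {
        try {
            RecordSource source = (RecordSource) Class.forName(implClassName)
                    .getDeclaredConstructor()
                    .newInstance();
            List<String> rows = new ArrayList<>();
            source.open(table).forEachRemaining(rows::add);
            return rows;
        } catch (ReflectiveOperationException e) {
            // Typically means the connector jar is missing from the classpath.
            throw new IllegalStateException("Backend not available: " + implClassName, e);
        }
    }
}
```

          Under these assumptions, switching stores would mean changing the class name passed to `readAll` and swapping one jar, which is roughly the "one extra jar" burden the comment asks about.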

          Edward J. Yoon added a comment -

          The advantage of option 2 (adding BSP input/output formats and an example) is that it shows a best practice for new applications on Accumulo that are based on the BSP computing model.

          Option 3 also looks good to me but not great, considering user accessibility.

          I know your concerns about option 2, but I'll bet that our basic BSP message-passing and input/output interfaces don't change.


            People

            • Assignee:
              praveen sripati
              Reporter:
              praveen sripati
            • Votes:
              0
              Watchers:
              4
