I am an Accumulo developer and I was looking at ACCUMULO-532. I was thinking about the problem of where this code should be put. It seems like there are three options:
- Option 1: Put Hama input/output formats in Hama
- Option 2: Put Hama input/output formats in Cassandra, Accumulo, HBase, etc.
- Option 3: Put Hama input/output formats on something like GitHub
I am not sure what the best option is; we are trying to figure that out. Below are some thoughts about this issue.
Option 3 may not be a blessed Apache way of doing business.
One nice thing about option 3 is that if accumulo-hama has a serious bug, it can release a fix immediately w/o waiting on another project. For options 1 and 2, either Hama or Accumulo must release to fix a serious hama-accumulo bug. Option 3 may also make it easier to use a newer version of Hama w/ an older version of Accumulo. If Accumulo ships with Hama support and you have an older version of Accumulo, it probably depends on an older version of Hama. This may not be an issue if the Hama API is really stable.
With option 3, Accumulo could include a hama jar in contrib w/ a link to github.
From the perspective of Accumulo, we have to work this same issue out for lots of projects. For example, Accumulo could ship with connectors for Pig, Hive, Gora, Hama, etc. This increases the # of dependencies that Accumulo has, many of which users may not need. Currently the Accumulo Pig adapter is on GitHub.
Apache Gora seems to be doing option 1. They have gora-core, gora-accumulo, gora-hbase, etc. Each of these is a Maven subproject of Gora. One nice thing about this for the Gora case is that all of the Gora stores can share test code. For example, gora-accumulo extends a test class from gora-core for testing.
I suppose option 1 is bad because Hama is a subsystem like MapReduce. For example, Gora and Pig depend on MapReduce, and it's probably ok to make them depend on HBase or Accumulo. However, you would not want MapReduce or Hama to depend on a certain version of Accumulo or HBase. If accumulo-1.3 jars were in the MapReduce system lib dir, I do not think user-supplied accumulo-1.4 jars could override those. I suspect the same is true for Hama, which is why option 1 is bad?
I wrote the gora-accumulo backend and then I wrote goraci (https://github.com/keith-turner/goraci). To make goraci easy to run I wrote a slightly complex script and pom. Maybe I was being a bit OCD, but when I ran goraci against Accumulo I wanted only the jars that were needed on the classpath. For example, I did not want HBase jars when running goraci against Accumulo and vice versa. So what steps would the user have to go through to have Hama read from Accumulo w/ option 2 vs option 3? Seems like the main diff is adding one extra jar to the classpath? Is there any other burden? It's nice to make things as easy as possible for the user.