Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:
- HBase is highly scalable and distributed
- HBase is realtime
- Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
- Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, and phrase queries)
- It's easier to build a scalable realtime system on top of a data system that is already architecturally sound, scalable, and realtime, eg, HBase.
- Scaling realtime search will be as simple as scaling HBase.
Phase 1 - Indexing:
- Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
- Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
- Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
- Mirror region splits with indexes (use Lucene's IndexSplitter?)
- When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
- A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We still need to determine the best data structure for efficiently mapping a docid -> row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
- Write unit tests for the above
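To make the mirroring and the docid -> row key mapping concrete, here is a minimal in-memory sketch. It uses neither the HBase nor the Lucene API: `IndexedRegionSketch`, its methods, and the whitespace tokenizer are all hypothetical stand-ins, and an array-like list is assumed for the docid -> row key mapping purely for illustration (the proposal leaves that choice open). An update is modeled as a delete followed by an add, and a deleted docid is left as an empty slot.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical stand-in for a region plus its mirrored index; not HBase or Lucene API. */
class IndexedRegionSketch {
    // Stands in for the HBase region: row key -> text value.
    private final Map<String, String> rows = new TreeMap<>();
    // Inverted index: term -> internal docids containing that term.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // docid -> row key mapping (assumed structure); null marks a deleted document.
    private final List<String> docIdToRowKey = new ArrayList<>();
    // Reverse mapping so an update or delete can find the live docid.
    private final Map<String, Integer> rowKeyToDocId = new HashMap<>();

    /** Add or update a row; the change cascades to the index. */
    void put(String rowKey, String text) {
        delete(rowKey); // an update is a delete followed by an add
        rows.put(rowKey, text);
        int docId = docIdToRowKey.size();
        docIdToRowKey.add(rowKey);
        rowKeyToDocId.put(rowKey, docId);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
    }

    /** Delete a row and remove its docid from every posting list. */
    void delete(String rowKey) {
        Integer docId = rowKeyToDocId.remove(rowKey);
        if (docId == null) {
            return; // row was never indexed
        }
        String oldText = rows.remove(rowKey);
        for (String term : tokenize(oldText)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                ids.remove(docId);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
        docIdToRowKey.set(docId, null);
    }

    /** Return the row keys of documents containing the term, in docid order. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(
                term.toLowerCase(Locale.ROOT), Collections.<Integer>emptySet());
        for (int docId : ids) {
            hits.add(docIdToRowKey.get(docId));
        }
        return hits;
    }

    private static String[] tokenize(String text) {
        return text.toLowerCase(Locale.ROOT).trim().split("\\s+");
    }

    public static void main(String[] args) {
        IndexedRegionSketch region = new IndexedRegionSketch();
        region.put("row1", "hello world");
        region.put("row2", "hello hbase");
        region.put("row1", "goodbye world"); // update cascades to the index
        System.out.println(region.search("hello")); // prints [row2]
    }
}
```

In a real integration the search result would only need to carry row keys, since the row data itself would be fetched from HBase rather than from a Lucene docstore.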
Phase 2 - Queries:
- Enable distributed Lucene queries
- Regions that have Lucene indexes are inherently available and can be searched directly, meaning there's no need for a separate search-related system in ZooKeeper.
- Integrate search with HBase's RPC mechanism
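
As a rough illustration of what a distributed query involves, the sketch below merges per-region hit lists into a global top-k result. It is not the proposed implementation: `DistributedSearchSketch`, `Hit`, and `mergeTopK` are hypothetical names, the RPC fan-out to region servers is left out entirely, and scores are assumed to be directly comparable across regions (which would need care in practice, since each region's index has its own term statistics).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical scatter-gather merge step; not HBase or Lucene API. */
class DistributedSearchSketch {
    /** One scored hit; the row key identifies the document within HBase. */
    static final class Hit {
        final String rowKey;
        final float score;

        Hit(String rowKey, float score) {
            this.rowKey = rowKey;
            this.score = score;
        }
    }

    /** Merge per-region hit lists into the global top k, highest score first. */
    static List<Hit> mergeTopK(List<List<Hit>> perRegionHits, int k) {
        Comparator<Hit> byScore = Comparator.comparingDouble((Hit h) -> h.score);
        // Min-heap on score, so the weakest retained hit is evicted first.
        PriorityQueue<Hit> heap = new PriorityQueue<>(byScore);
        for (List<Hit> regionHits : perRegionHits) {
            for (Hit hit : regionHits) {
                heap.offer(hit);
                if (heap.size() > k) {
                    heap.poll();
                }
            }
        }
        List<Hit> merged = new ArrayList<>(heap);
        merged.sort(byScore.reversed());
        return merged;
    }

    public static void main(String[] args) {
        // Each inner list stands in for the hits returned by one region.
        List<List<Hit>> perRegion = Arrays.asList(
                Arrays.asList(new Hit("row-a", 0.9f), new Hit("row-b", 0.2f)),
                Arrays.asList(new Hit("row-m", 0.5f)),
                Arrays.asList(new Hit("row-x", 0.7f), new Hit("row-y", 0.1f)));
        for (Hit hit : mergeTopK(perRegion, 2)) {
            System.out.println(hit.rowKey + " " + hit.score);
        }
    }
}
```

Because each region only has to return its own best k hits, the merge step stays small no matter how many rows a region holds.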