Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Later
- Affects Version/s: 0.90.0
- Fix Version/s: None
- Component/s: None
- Labels: None
Description
Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:
- HBase is highly scalable and distributed
- HBase is realtime
- Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
- Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
- It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system such as HBase.
- Scaling realtime search will be as simple as scaling HBase.
Phase 1 - Indexing:
- Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
- Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
- Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
- Mirror region splits with indexes (use Lucene's IndexSplitter?)
- When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
- A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to determine the best data structure for efficiently mapping a docid -> row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
- Write unit tests for the above
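The mirroring and docid -> row key mapping described in Phase 1 can be illustrated with a small toy model. This is only a sketch under stated assumptions: it uses plain Java collections rather than Lucene or HBase APIs, the class name `RegionIndexMirror` and its methods are hypothetical, and the array-style `docIdToRowKey` list merely stands in for whichever mechanism (docstore, field cache, column stride fields) is eventually chosen.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical toy model of an index that mirrors one region.
// Not HBase or Lucene code; all names here are illustrative assumptions.
class RegionIndexMirror {
    // docid -> row key. A null entry marks a deleted document.
    private final List<String> docIdToRowKey = new ArrayList<>();
    // row key -> docid of the live document for that row.
    private final Map<String, Integer> rowKeyToDocId = new HashMap<>();
    // term -> sorted docids containing that term (a minimal inverted index).
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();

    // Mirrors a region put: an update is modeled as delete followed by add.
    void put(String rowKey, String text) {
        delete(rowKey);
        int docId = docIdToRowKey.size();
        docIdToRowKey.add(rowKey);
        rowKeyToDocId.put(rowKey, docId);
        for (String term : text.toLowerCase().split("\\s+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Mirrors a region delete: the old docid is tombstoned, not reused.
    void delete(String rowKey) {
        Integer docId = rowKeyToDocId.remove(rowKey);
        if (docId != null) {
            docIdToRowKey.set(docId, null);
        }
    }

    // Returns row keys of live documents containing the term, in docid order.
    // The row data itself would then be fetched from the region by row key.
    List<String> search(String term) {
        List<String> rows = new ArrayList<>();
        for (int docId : postings.getOrDefault(term.toLowerCase(), new TreeSet<>())) {
            String rowKey = docIdToRowKey.get(docId);
            if (rowKey != null) {
                rows.add(rowKey);
            }
        }
        return rows;
    }
}
```

The search path returns only row keys, which reflects the point above that document data stays in HBase and the index needs just an efficient docid -> row key lookup.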
Phase 2 - Queries:
- Enable distributed Lucene queries
- Regions that have Lucene indexes are inherently available and can be searched directly, meaning there's no need for a separate search-related system in ZooKeeper.
- Integrate search with HBase's RPC mechanism
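One way to picture the distributed query step in Phase 2 is a scatter-gather: each indexed region answers the query locally, and the caller merges the per-region hits into a single top-k list. The sketch below shows only the merge, under assumptions: `Hit`, `DistributedSearchSketch`, and `mergeTopK` are hypothetical names, the RPC fan-out is omitted, and scores from independently built indexes are treated as directly comparable, which may not hold in practice because term statistics differ per index.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical result type: a row key plus the score its region assigned.
class Hit {
    final String rowKey;
    final float score;

    Hit(String rowKey, float score) {
        this.rowKey = rowKey;
        this.score = score;
    }
}

// Illustrative merge step for per-region search results (not HBase code).
class DistributedSearchSketch {
    // Combines hits from all regions and keeps the k highest-scoring ones.
    // Ties are broken by row key so the ordering is deterministic.
    static List<Hit> mergeTopK(List<List<Hit>> perRegionHits, int k) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> hits : perRegionHits) {
            all.addAll(hits);
        }
        all.sort(Comparator.comparingDouble((Hit h) -> h.score)
                .reversed()
                .thenComparing(h -> h.rowKey));
        return new ArrayList<>(all.subList(0, Math.min(k, all.size())));
    }
}
```

A full sort keeps the sketch short. With many regions, a bounded priority queue of size k would avoid sorting every hit.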
Attachments
Issue Links
- is blocked by
  - LUCENE-2919 IndexSplitter that divides by primary key term (Closed)
  - LUCENE-3296 Enable passing a config into PKIndexSplitter (Closed)
- relates to
  - HBASE-3257 Coprocessors: Extend server side integration API to include HLog operations (Closed)
  - SOLR-2563 Allow generic pluggable file system implementations (Open)
  - HADOOP-6311 Add support for unix domain sockets to JNI libs (Resolved)
  - HBASE-883 Secondary Indexes (Closed)
  - HBASE-3786 Enhance MasterCoprocessorHost to include notification of balancing of each region (Closed)
  - HBASE-4048 [Coprocessors] Support configuration of coprocessor at load time (Closed)
  - HDFS-347 DFS read performance suboptimal when client co-located on nodes with data (Closed)
  - HDFS-941 Datanode xceiver protocol should allow reuse of a connection (Closed)
  - SOLR-1431 CommComponent abstracted (Closed)
  - SOLR-2565 Prevent IW#close and cut over to IW#commit (Closed)
  - HBASE-3810 Registering a Coprocessor at HTableDescriptor should be less strict (Closed)