[HBASE-3529] Add search to HBase - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Later
Affects Version/s: 0.90.0
Fix Version/s: None
Component/s: None
Labels: None

Description

Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:

HBase is highly scalable and distributed
HBase is realtime
Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
Scaling realtime search will be as simple as scaling HBase.
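
To make the query types mentioned above concrete, here is a minimal sketch of how an inverted index answers AND, OR, and NOT queries through set operations on posting lists. It is a toy model in plain Java, not Lucene's API; the class `ToyInvertedIndex` and its methods are hypothetical names chosen for illustration, and phrase queries (which need term positions) are left out.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, not Lucene's API): term -> row keys containing it.
class ToyInvertedIndex {
    private final Map<String, TreeSet<String>> postings = new HashMap<>();
    private final TreeSet<String> allRows = new TreeSet<>();

    // Index the text of one row under its row key.
    void add(String rowKey, String text) {
        allRows.add(rowKey);
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<String>()).add(rowKey);
            }
        }
    }

    // Row keys whose text contains the term (returned as a copy, safe to modify).
    TreeSet<String> term(String t) {
        TreeSet<String> hits = postings.get(t.toLowerCase());
        if (hits == null) {
            return new TreeSet<String>();
        }
        return new TreeSet<String>(hits);
    }

    // AND: rows present in both result sets.
    static TreeSet<String> and(Set<String> a, Set<String> b) {
        TreeSet<String> result = new TreeSet<String>(a);
        result.retainAll(b);
        return result;
    }

    // OR: rows present in either result set.
    static TreeSet<String> or(Set<String> a, Set<String> b) {
        TreeSet<String> result = new TreeSet<String>(a);
        result.addAll(b);
        return result;
    }

    // NOT: every indexed row that is absent from the given result set.
    TreeSet<String> not(Set<String> a) {
        TreeSet<String> result = new TreeSet<String>(allRows);
        result.removeAll(a);
        return result;
    }
}
```

With rows indexed this way, a query such as "hbase AND NOT lucene" becomes `ToyInvertedIndex.and(idx.term("hbase"), idx.not(idx.term("lucene")))`.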

Phase 1 - Indexing:

Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
Mirror region splits with indexes (use Lucene's IndexSplitter?)
When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to determine the best data structure for efficiently mapping a docid to a row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
Write unit tests for the above
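
The mirroring described in Phase 1 can be sketched roughly as follows. This is a toy model in plain Java, not HBase or Lucene code: `MirroredRegion` is a hypothetical class, a `TreeMap` stands in for the region, and an in-memory list stands in for the docid to row key mapping, whose real data structure the issue leaves open. An update is treated as a delete followed by an add, and HLog recovery, region splits, and HDFS persistence are not modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical toy model: a "region" whose writes cascade into a mirrored index.
class MirroredRegion {
    private final TreeMap<String, String> rows = new TreeMap<>();        // stand-in for region data
    private final Map<String, Set<Integer>> postings = new HashMap<>();  // term -> doc ids
    private final List<String> docToRow = new ArrayList<>();             // doc id -> row key
    private final Map<String, Integer> rowToDoc = new HashMap<>();       // row key -> live doc id

    private static List<String> tokenize(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("\\W+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Add or update a row. An update removes the old document, then indexes a new one.
    void put(String rowKey, String text) {
        delete(rowKey);
        rows.put(rowKey, text);
        int docId = docToRow.size();   // doc ids are append-only, as in a segment
        docToRow.add(rowKey);
        rowToDoc.put(rowKey, docId);
        for (String term : tokenize(text)) {
            postings.computeIfAbsent(term, t -> new TreeSet<Integer>()).add(docId);
        }
    }

    // Delete a row and cascade the delete into the index.
    void delete(String rowKey) {
        String oldText = rows.remove(rowKey);
        Integer docId = rowToDoc.remove(rowKey);
        if (oldText == null || docId == null) {
            return;
        }
        for (String term : tokenize(oldText)) {
            Set<Integer> ids = postings.get(term);
            if (ids != null) {
                ids.remove(docId);
                if (ids.isEmpty()) {
                    postings.remove(term);
                }
            }
        }
    }

    // Search returns row keys only; the row data itself is read from the region.
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), Collections.<Integer>emptySet());
        for (int docId : ids) {
            hits.add(docToRow.get(docId));
        }
        return hits;
    }

    String get(String rowKey) {
        return rows.get(rowKey);
    }
}
```

The point of the sketch is the division of labor: the index holds only terms and doc ids, while `get` serves the stored data from the region, which mirrors the decision not to use the Lucene docstore.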

Phase 2 - Queries:

Enable distributed Lucene queries
Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
Integrate search with HBase's RPC mechanism
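
A distributed query over per-region indexes is typically a scatter-gather: each region returns its own top hits and the caller merges them. The sketch below shows that shape under simplifying assumptions. `ScatterGatherSearch` and `Hit` are hypothetical names, each region is modeled as an in-memory map of row key to text, the score is a plain term count, and the loop over regions stands in for what would be RPC calls to region servers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical scatter-gather sketch: query every region, merge the per-region top-k.
class ScatterGatherSearch {
    // One hit from one region: the row key plus a score (here, term frequency).
    static final class Hit {
        final String rowKey;
        final int score;

        Hit(String rowKey, int score) {
            this.rowKey = rowKey;
            this.score = score;
        }
    }

    // Highest score first; ties broken by row key so results are deterministic.
    static final Comparator<Hit> ORDER =
        Comparator.comparingInt((Hit h) -> -h.score).thenComparing((Hit h) -> h.rowKey);

    // Local search against one region; the map stands in for that region's index.
    static List<Hit> searchRegion(Map<String, String> region, String term, int k) {
        List<Hit> hits = new ArrayList<>();
        for (Map.Entry<String, String> row : region.entrySet()) {
            int frequency = 0;
            for (String t : row.getValue().toLowerCase().split("\\W+")) {
                if (t.equals(term)) {
                    frequency++;
                }
            }
            if (frequency > 0) {
                hits.add(new Hit(row.getKey(), frequency));
            }
        }
        hits.sort(ORDER);
        return new ArrayList<>(hits.subList(0, Math.min(k, hits.size())));
    }

    // Scatter the query to every region, then merge and keep the global top-k row keys.
    static List<String> search(List<Map<String, String>> regions, String term, int k) {
        List<Hit> merged = new ArrayList<>();
        for (Map<String, String> region : regions) {
            merged.addAll(searchRegion(region, term.toLowerCase(), k));
        }
        merged.sort(ORDER);
        List<String> rowKeys = new ArrayList<>();
        for (Hit hit : merged.subList(0, Math.min(k, merged.size()))) {
            rowKeys.add(hit.rowKey);
        }
        return rowKeys;
    }
}
```

Each region only needs to return k hits for the merged top-k to be correct, which keeps the data sent back to the caller bounded by k times the number of regions.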

Attachments

HDFS-APPEND-0.20-LOCAL-FILE.patch (17/Jul/11 03:35, 8 kB, Jason Rutherglen)
HBASE-3529.patch (15/Mar/11 14:09, 41 kB, Jason Rutherglen)

Issue Links

is blocked by

LUCENE-2919 IndexSplitter that divides by primary key term

Resolved

LUCENE-3296 Enable passing a config into PKIndexSplitter

Closed

relates to

HBASE-3257 Coprocessors: Extend server side integration API to include HLog operations

Closed

SOLR-2563 Allow generic pluggable file system implementations

Open

HADOOP-6311 Add support for unix domain sockets to JNI libs

Resolved

HBASE-883 Secondary Indexes

Closed

HBASE-3786 Enhance MasterCoprocessorHost to include notification of balancing of each region

Closed

HBASE-4048 [Coprocessors] Support configuration of coprocessor at load time

Closed

HDFS-347 DFS read performance suboptimal when client co-located on nodes with data

Closed

HDFS-941 Datanode xceiver protocol should allow reuse of a connection

Closed

SOLR-1431 CommComponent abstracted

Closed

SOLR-2565 Prevent IW#close and cut over to IW#commit

Closed

HBASE-3810 Registering a Coprocessor at HTableDescriptor should be less strict

Closed

People

Assignee: Unassigned

Reporter: Jason Rutherglen

Votes: 37

Watchers: 87

Dates

Created: 14/Feb/11 17:15

Updated: 12/Jun/22 17:29

Resolved: 12/Aug/14 19:16