Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Later
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      Using the Apache Lucene library, we can add free-text search to HBase. The advantages of this are:

      • HBase is highly scalable and distributed
      • HBase is realtime
      • Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
      • Lucene offers many types of queries not currently available in HBase (e.g., AND, OR, NOT, phrase)
      • It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system such as HBase.
      • Scaling realtime search will be as simple as scaling HBase.
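
To make the query types named above concrete, here is a minimal sketch of a positional inverted index in plain Python. It is not Lucene and does not use Lucene's API; the `TinyIndex` class and its method names are hypothetical, chosen only to illustrate how AND, OR, NOT, and phrase queries reduce to set operations and position checks over postings.

```python
from collections import defaultdict


class TinyIndex:
    """Toy positional inverted index: term -> {doc_id: [positions]}.

    Hypothetical illustration only; Lucene's real structures differ.
    """

    def __init__(self):
        self.postings = defaultdict(dict)
        self.doc_ids = set()

    def add(self, doc_id, text):
        # Record every position at which each term occurs in the document.
        self.doc_ids.add(doc_id)
        for pos, term in enumerate(text.lower().split()):
            self.postings[term].setdefault(doc_id, []).append(pos)

    def term(self, t):
        # Set of documents containing the term.
        return set(self.postings.get(t.lower(), {}))

    def not_(self, docs):
        # Complement of a result set within the indexed documents.
        return self.doc_ids - docs

    def phrase(self, text):
        # Documents where the terms appear at consecutive positions.
        terms = text.lower().split()
        if not terms:
            return set()
        hits = set()
        for doc_id, starts in self.postings.get(terms[0], {}).items():
            for start in starts:
                if all(start + i in self.postings.get(t, {}).get(doc_id, [])
                       for i, t in enumerate(terms)):
                    hits.add(doc_id)
                    break
        return hits
```

With this sketch, `idx.term("hbase") & idx.term("lucene")` is an AND query, `|` gives OR, and `idx.term("hbase") & idx.not_(idx.term("lucene"))` gives AND NOT. Using the row key as the document ID mirrors the proposal in Phase 1 below.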

      Phase 1 - Indexing:

      • Integrate Lucene into HBase such that an index mirrors a given region. This means cascading adds, updates, and deletes between a Lucene index and an HBase region (and vice versa).
      • Define metadata to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
      • Integrate with the HLog to ensure that index recovery can occur properly (e.g., on region server failure)
      • Mirror region splits with indexes (use Lucene's IndexSplitter?)
      • When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
      • A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We still need to determine the best data structure for efficiently mapping a docid to a row key. It could be a docstore, field cache, column stride fields, or some other mechanism.
      • Write unit tests for the above
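
The mirroring idea in Phase 1 can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not the patch's implementation: the `MirroredRegion` class is hypothetical, a dict stands in for the HBase region, a list stands in for the docid -> row key mapping, and `split` simply re-indexes rows into two new regions rather than splitting index files the way a real index splitter would.

```python
class MirroredRegion:
    """Toy region whose index is updated in lockstep with its rows."""

    def __init__(self):
        self.rows = {}        # row key -> text (stands in for the region)
        self.doc_to_row = []  # docid -> row key; None marks a deleted doc
        self.row_to_doc = {}  # row key -> live docid
        self.postings = {}    # term -> set of live docids

    def put(self, row_key, text):
        # An update is modeled as a delete of the old doc followed by an add.
        if row_key in self.row_to_doc:
            self._delete_doc(row_key)
        self.rows[row_key] = text
        doc_id = len(self.doc_to_row)
        self.doc_to_row.append(row_key)
        self.row_to_doc[row_key] = doc_id
        for term in set(text.lower().split()):
            self.postings.setdefault(term, set()).add(doc_id)

    def delete(self, row_key):
        # Deleting a row cascades to its index entry.
        if row_key in self.row_to_doc:
            self._delete_doc(row_key)
            del self.rows[row_key]

    def _delete_doc(self, row_key):
        # Must run while self.rows still holds the old text for row_key.
        doc_id = self.row_to_doc.pop(row_key)
        self.doc_to_row[doc_id] = None
        for term in set(self.rows[row_key].lower().split()):
            self.postings[term].discard(doc_id)

    def search(self, term):
        # Resolve matching docids back to row keys; row data stays in rows.
        docs = self.postings.get(term.lower(), ())
        return sorted(self.doc_to_row[d] for d in docs)

    def split(self, split_key):
        # Mirror a region split: keys < split_key go low, the rest go high.
        low, high = MirroredRegion(), MirroredRegion()
        for row_key in sorted(self.rows):
            target = low if row_key < split_key else high
            target.put(row_key, self.rows[row_key])
        return low, high
```

The index stores only docids and terms, so a search returns row keys and the caller fetches row data from the region itself, which matches the stated intent of not using the Lucene docstore. Recovery from the HLog is not modeled here.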

      Phase 2 - Queries:

      • Enable distributed Lucene queries
      • Regions that have Lucene indexes are inherently available and may be searched, meaning there's no need for a separate search-related system in ZooKeeper.
      • Integrate search with HBase's RPC mechanism
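
One common way to run a distributed query over per-region indexes is scatter-gather: each region returns its local top-k hits and the caller merges them. The sketch below assumes that pattern; the issue does not specify one. The function names are hypothetical, a thread pool stands in for RPC calls to region servers, and raw term frequency stands in for a real relevance score.

```python
from concurrent.futures import ThreadPoolExecutor


def rank(hits, k):
    # Highest score first; ties broken by row key for a stable order.
    return sorted(hits, key=lambda h: (-h[0], h[1]))[:k]


def search_region(region_docs, term, k):
    """Score one region locally and return its top-k (score, row_key) hits."""
    hits = []
    for row_key, text in region_docs.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, row_key))
    return rank(hits, k)


def distributed_search(regions, term, k):
    """Fan the query out to every region, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=max(1, len(regions))) as pool:
        partials = list(pool.map(lambda r: search_region(r, term, k), regions))
    merged = [hit for part in partials for hit in part]
    return rank(merged, k)
```

Because each region returns at most k hits, the merge step handles at most k times the number of regions, regardless of how many rows each region holds.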

      Attachments

      1. HDFS-APPEND-0.20-LOCAL-FILE.patch (8 kB, Jason Rutherglen)
      2. HBASE-3529.patch (41 kB, Jason Rutherglen)

          Activity

          Andrew Purtell made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Later [ 7 ]
          linwukang made changes -
          Description First word corrected from "sing" to "Using"; text otherwise unchanged.
          linwukang made changes -
          Description First word changed from "Using" to "sing" (leading character dropped); text otherwise unchanged.
          liusheding made changes -
          Description Final word changed from "mechanism" to "mechanis" (trailing character dropped); text otherwise unchanged.
          Eugene Koontz made changes -
          Link This issue relates to HBASE-3810 [ HBASE-3810 ]
          Eugene Koontz made changes -
          Link This issue relates to HBASE-4048 [ HBASE-4048 ]
          Jason Rutherglen made changes -
          Link This issue is related to CASSANDRA-2915 [ CASSANDRA-2915 ]
          Jason Rutherglen made changes -
          Attachment HDFS-APPEND-0.20-LOCAL-FILE.patch [ 12486753 ]
          Eugene Koontz made changes -
          Link This issue relates to HBASE-883 [ HBASE-883 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by LUCENE-3296 [ LUCENE-3296 ]
          Jason Rutherglen made changes -
          Link This issue relates to SOLR-2565 [ SOLR-2565 ]
          Jason Rutherglen made changes -
          Link This issue relates to SOLR-1431 [ SOLR-1431 ]
          Jason Rutherglen made changes -
          Link This issue incorporates LUCENE-2312 [ LUCENE-2312 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by LUCENE-3191 [ LUCENE-3191 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-2004 [ HDFS-2004 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by LUCENE-3191 [ LUCENE-3191 ]
          Jason Rutherglen made changes -
          Link This issue relates to SOLR-2563 [ SOLR-2563 ]
          Jason Rutherglen made changes -
          Link This issue relates to HBASE-3786 [ HBASE-3786 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HBASE-3786 [ HBASE-3786 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-2004 [ HDFS-2004 ]
          Jason Rutherglen made changes -
          Attachment lucene-misc-4.0-SNAPSHOT.jar [ 12473681 ]
          Jason Rutherglen made changes -
          Attachment lucene-core-4.0-SNAPSHOT.jar [ 12473679 ]
          Jason Rutherglen made changes -
          Attachment lucene-analyzers-common-4.0-SNAPSHOT.jar [ 12473680 ]
          Jason Rutherglen made changes -
          Link This issue relates to HDFS-347 [ HDFS-347 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-347 [ HDFS-347 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by SOLR-1431 [ SOLR-1431 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HBASE-3786 [ HBASE-3786 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-941 [ HDFS-941 ]
          Jason Rutherglen made changes -
          Link This issue relates to HDFS-941 [ HDFS-941 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HADOOP-6311 [ HADOOP-6311 ]
          Jason Rutherglen made changes -
          Link This issue relates to HADOOP-6311 [ HADOOP-6311 ]
          Jason Rutherglen made changes -
          Attachment HBASE-3529.patch [ 12473678 ]
          Attachment lucene-core-4.0-SNAPSHOT.jar [ 12473679 ]
          Attachment lucene-analyzers-common-4.0-SNAPSHOT.jar [ 12473680 ]
          Attachment lucene-misc-4.0-SNAPSHOT.jar [ 12473681 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by LUCENE-2919 [ LUCENE-2919 ]
          Jason Rutherglen made changes -
          Link This issue incorporates LUCENE-2312 [ LUCENE-2312 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by SOLR-1431 [ SOLR-1431 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-941 [ HDFS-941 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HDFS-347 [ HDFS-347 ]
          Jason Rutherglen made changes -
          Link This issue is blocked by HADOOP-6311 [ HADOOP-6311 ]
          Jason Rutherglen made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Eugene Koontz made changes -
          Link This issue relates to HBASE-3257 [ HBASE-3257 ]
          Jason Rutherglen created issue -

            People

            • Assignee:
              Unassigned
              Reporter:
              Jason Rutherglen
            • Votes:
              37
              Watchers:
              87

              Dates

              • Created:
                Updated:
                Resolved:
