HBASE-270: [HBase] Build a Lucene index on an HBase table

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This patch provides a Reducer class and other related classes which help to build a Lucene index on an HBase table. The index build part is similar to that of Nutch.

      • Each row is modeled as a Lucene document: the row key is indexed in its untokenized form, and column name-value pairs become Lucene field name-value pairs (a rough sketch follows at the end of this description).
      • IndexConf is used to configure various Lucene parameters, specify whether to optimize an index and which columns to index and/or store, in tokenized or untokenized form, etc.
      • The number of reduce tasks determines the number of indexes (partitions). The resulting index (or indexes) is stored in the output path of the job configuration.
      • The index build process is done in the reduce phase. Users can use the map phase to join rows from different tables or to pre-parse/analyze column content, etc.
      • A JUnit test is added to test building an index on an HBase table with an identity mapper. It also serves as an example of how to use the new classes.
      • BuildTableIndex is provided to help build an index on an HBase table. It should be moved to an examples package if HBase decides to add one.

      This patch requires the inclusion of the Lucene library.
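
      As a rough illustration of the row-to-document mapping described in the first bullet above, the sketch below shows how a row key and its column values could become a Lucene 2.2 Document. It is not code from the patch: the field name "rowkey", the helper method, and the hard-coded store/tokenize choices (which IndexConf actually controls) are assumptions made purely for illustration.

      import java.util.Map;

      import org.apache.lucene.document.Document;
      import org.apache.lucene.document.Field;

      public class RowToDocumentSketch {
        // Hypothetical helper: model one HBase row as a Lucene document.
        static Document toDocument(String rowKey, Map<String, String> columns) {
          Document doc = new Document();
          // The row key is stored and indexed untokenized so it can be matched exactly.
          doc.add(new Field("rowkey", rowKey, Field.Store.YES, Field.Index.UN_TOKENIZED));
          // Each column name/value pair becomes a Lucene field; in the patch the
          // store/index/tokenize decisions come from IndexConf, not from constants.
          for (Map.Entry<String, String> e : columns.entrySet()) {
            doc.add(new Field(e.getKey(), e.getValue(), Field.Store.NO, Field.Index.TOKENIZED));
          }
          return doc;
        }
      }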

      Attachments

      1. build_table_index.take8.patch
        69 kB
        stack
      2. build_table_index.take7.patch
        69 kB
        stack
      3. build_table_index.take6.patch
        39 kB
        stack
      4. build_table_index.take5.patch
        35 kB
        stack
      5. build_table_index.take4.patch
        44 kB
        stack
      6. build_table_index.take3.patch
        39 kB
        Ning Li
      7. build_table_index.take2.again.patch
        37 kB
        Ning Li
      8. build_table_index.take2.patch
        37 kB
        Ning Li
      9. build_table_index.patch
        38 kB
        Ning Li

        Activity

        stack added a comment -

        Committed missing files. Resolving for second time.

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12366722/build_table_index.take8.patch
        against trunk revision r580166.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/840/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/840/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/840/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/840/console

        This message is automatically generated.

        stack added a comment -

        Trying the missing files against hudson.

        stack added a comment -

        This passes all tests locally.

        stack added a comment -

        Patch that includes the files that had not been added.

        stack added a comment -

        I bungled application of this patch. I didn't add the new classes.

        Hudson added a comment -

        Integrated in Hadoop-Nightly #250 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/250/ )

        stack added a comment -

        Committed. Resolving. Thanks for the patch Ning!

        Hadoop QA added a comment -

        +1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12366485/build_table_index.take6.patch
        against trunk revision r578879.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests +1. The patch passed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/816/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/816/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/816/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/816/console

        This message is automatically generated.

        stack added a comment -

        Add debugging to try to figure out why it is failing in a test that should have no relation to the code in this patch.

        stack added a comment -

        Failed in same place, in TestRegionServerAbort.

        stack added a comment -

        Add logging around setup of assertion scanner in TestRegionServerAbort

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12366396/build_table_index.take5.patch
        against trunk revision r578348.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/808/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/808/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/808/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/808/console

        This message is automatically generated.

        stack added a comment -

        Looking more at the TestRegionServerAbort log, it's not the same as HADOOP-1924. From the logs all seems healthy; it's just that the scan that asserts the region redeploys must not be happening. Could add some logging in here... (or a thread dump if we can catch it during the build)

        stack added a comment -

        Retrying. May be lucky or will learn more about possibly hanging.

        stack added a comment -

        Retry. Failure was in org.apache.hadoop.hbase.TestRegionServerAbort. It timed out. The logging profile looks like what we were seeing over in HADOOP-1924. Need to get some thread dumps to confirm.

        Hudson added a comment -

        Integrated in Hadoop-Nightly #247 (See http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Nightly/247/ )

        Hadoop QA added a comment -

        -1 overall. Here are the results of testing the latest attachment
        http://issues.apache.org/jira/secure/attachment/12366396/build_table_index.take5.patch
        against trunk revision r578348.

        @author +1. The patch does not contain any @author tags.

        javadoc +1. The javadoc tool did not generate any warning messages.

        javac +1. The applied patch does not generate any new compiler warnings.

        findbugs +1. The patch does not introduce any new Findbugs warnings.

        core tests +1. The patch passed core unit tests.

        contrib tests -1. The patch failed contrib unit tests.

        Test results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/807/testReport/
        Findbugs warnings: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/807/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
        Checkstyle results: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/807/artifact/trunk/build/test/checkstyle-errors.html
        Console output: http://lucene.zones.apache.org:8080/hudson/job/Hadoop-Patch/807/console

        This message is automatically generated.

        stack added a comment -

        Builds locally and passes all tests. Try hudson (Just committed the lucene jar this patch depends on ahead of this submit).

        stack added a comment -

        Fixes so all tests work. Adds to the AbstractMergeTestBase and TestTableMapReduce tearDown methods a try/catch around the final call to fs.close. The filesystem may already have been shut down by the shutdown of dfsCluster, if it's non-null. (It's now more likely to be non-null since this patch sets the dfs 'cluster' data member when it is passed to one of the MiniHbaseCluster constructors, where before it was left null. The 'set' is needed by the TestTableIndex test.)

        Move mapreduce files into a new test mapred subpackage.
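
        For illustration, a minimal sketch of the tearDown guard described above; this is not the actual AbstractMergeTestBase/TestTableMapReduce code, and the fs field and setUp are stand-ins:

        import java.io.IOException;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;

        public class TearDownGuardSketch {
          private FileSystem fs;

          public void setUp() throws IOException {
            fs = FileSystem.get(new Configuration());
          }

          public void tearDown() {
            try {
              if (fs != null) {
                // The mini DFS cluster shutdown may already have closed this
                // filesystem, so a failure here should not fail the test.
                fs.close();
              }
            } catch (IOException e) {
              // Ignore: closing is best-effort once the cluster is down.
            }
          }
        }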

        stack added a comment -

        I added javadoc and some missing licenses.

        Ning Li added a comment -

        > Yes. Sounds good. Do you want to make a new patch Ning that refactors the BuildTableIndex 'example' so rather than a list of column names it instead takes an index file full of configuration that it then parses and inserts into the job Configuration?

        Sure. A new patch with modified BuildTableIndex is attached.

        stack added a comment -

        Yes. Sounds good. Do you want to make a new patch Ning that refactors the BuildTableIndex 'example' so rather than a list of column names it instead takes an index file full of configuration that it then parses and inserts into the job Configuration?

        Ning Li added a comment -

        Actually, with the current approach, BuildTableIndex can be modified to read an xml file, validate it, and call conf.set("hbase.index.conf", xml_file_content) to set the indexing configuration in a job.
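
        A minimal sketch of that idea follows; it is not the actual BuildTableIndex code, the class name and command-line handling are assumptions, and validation of the XML is omitted:

        import java.io.BufferedReader;
        import java.io.FileReader;
        import java.io.IOException;

        import org.apache.hadoop.mapred.JobConf;

        public class SetIndexConfSketch {
          public static void main(String[] args) throws IOException {
            // Read the index-configuration XML file named on the command line.
            StringBuilder content = new StringBuilder();
            BufferedReader in = new BufferedReader(new FileReader(args[0]));
            try {
              String line;
              while ((line = in.readLine()) != null) {
                content.append(line).append('\n');
              }
            } finally {
              in.close();
            }
            // Stash the XML in the job configuration under the property name
            // discussed above; the rest of the job setup would follow here.
            JobConf job = new JobConf();
            job.set("hbase.index.conf", content.toString());
          }
        }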

        stack added a comment -

        That the configuration is per job rather than per instance is an important distinction. Could an xml file be passed to jobs on the command line?

        Ning Li added a comment -

        > Does not conjuring a substantial amount of XML in a StringBuilder in java code become untenable soon after you have two or three columns Ning? I think folks are going to want to do their XML up in a file that they can pass the MR job (and against which they can run xmllint, etc., to verify well-formedness).

        Yes, that'll be nice.

        > Why not have your IndexConf add a resource named 'hbase-index.xml' as HBaseConfiguration adds hbase-site.xml and hbase-default.xml. Your example pasted above would be the content of one such hbase-index.xml file.
        >
        > Otherwise, can we add the config. to hbase-site.xml? One property would list columns and then per column, you'd add indexing properties with column name as qualifier as in:

        I was thinking, the indexing configuration is specified per job, not per hbase. Applications would want to specify different indexing configurations for different tables, which may have same column names. Different applications may even want to index the same table differently.

        An alternative would be to include the configuration in the application's jar file, since that's what gets distributed out. But that's a bit awkward, since a new jar file has to be generated for each new run. Yet another alternative is to store the configuration file in HDFS or HBase...

        stack added a comment -

        Does not conjuring a substantial amount of XML in a StringBuilder in java code become untenable soon after you have two or three columns Ning? I think folks are going to want to do their XML up in a file that they can pass the MR job (and against which they can run xmllint, etc., to verify well-formedness).

        Why not have your IndexConf add a resource named 'hbase-index.xml' as HBaseConfiguration adds hbase-site.xml and hbase-default.xml. Your example pasted above would be the content of one such hbase-index.xml file.

        Otherwise, can we add the config. to hbase-site.xml? One property would list columns and then per column, you'd add indexing properties with column name as qualifier as in:

        <configuration>
        ...
        <property><name>hbase.column.names</name><value>column1 column2 column3....</value></property>
        <property><name>hbase.column.column1.store</name><value>true</value></property>
        <property><name>hbase.column.column1.index</name><value>true</value></property>
        ...
        etc.
        
        Ning Li added a comment -

        > Is your thinking that folks will build up an XML string in their job config. code rather than edit an hbase-site.xml to add per-column configuration?

        That's right. For example, "conf.set("hbase.index.conf", createIndexConfContent())" sets the index build configuration string in TestTableIndex.java.
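
        For illustration only, and not the actual createIndexConfContent from TestTableIndex.java, such a method might assemble the XML in a StringBuilder using property names like those in the example configuration quoted elsewhere in this issue (the column name "contents:" is a placeholder):

        import org.apache.hadoop.mapred.JobConf;

        public class IndexConfInCodeSketch {
          // Hypothetical stand-in for TestTableIndex.createIndexConfContent().
          static String createIndexConfContent() {
            StringBuilder sb = new StringBuilder();
            sb.append("<configuration>");
            sb.append("<column>");
            sb.append("<property><name>hbase.column.name</name><value>contents:</value></property>");
            sb.append("<property><name>hbase.column.store</name><value>true</value></property>");
            sb.append("<property><name>hbase.column.index</name><value>true</value></property>");
            sb.append("</column>");
            sb.append("<property><name>hbase.index.rowkey.name</name><value>KEY</value></property>");
            sb.append("</configuration>");
            return sb.toString();
          }

          public static void main(String[] args) {
            // Set the whole XML blob as a single job property, as described above.
            JobConf conf = new JobConf();
            conf.set("hbase.index.conf", createIndexConfContent());
          }
        }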

        stack added a comment -

        Thanks for the sample, and this version of the patch applies.

        I tried putting the above xml into a hbase-site.xml property value and it looks like the general parse fails:

        07/09/18 12:35:38 FATAL conf.Configuration: error parsing conf file: java.lang.ClassCastException: com.sun.org.apache.xerces.internal.dom.DeferredElementImpl
        

        Is your thinking that folks will build up an XML string in their job config. code rather than edit an hbase-site.xml to add per-column configuration?

        Ning Li added a comment -

        > Pardon me Ning for being a bit thick but I do not see an example of per column config. in BuildTableIndex. I see parsing of command line and passing of a list of column names to IdentityTableMap but not an example of per-column config. as a property value of an hbase config. Do you mean the XML in TestTableIndex? If so, its not clear how you do config. for columns 2, 3, etc. Perhaps you could provide an example here in the issue

        You are right. I meant the example in TestTableIndex. Here is an example with multiple columns:

        <configuration>
        <column>
        <property><name>hbase.column.name</name><value>column1</value></property>
        <property><name>hbase.column.store</name><value>true</value></property>
        <property><name>hbase.column.index</name><value>true</value></property>
        <property><name>hbase.column.tokenize</name><value>false</value></property>
        <property><name>hbase.column.boost</name><value>3</value></property>
        <property><name>hbase.column.omit.norms</name><value>false</value></property>
        </column>
        <column>
        <property><name>hbase.column.name</name><value>column2</value></property>
        <property><name>hbase.column.store</name><value>false</value></property>
        <property><name>hbase.column.index</name><value>true</value></property>
        <property><name>hbase.column.tokenize</name><value>true</value></property>
        </column>
        <property><name>hbase.index.rowkey.name</name><value>KEY</value></property>
        <property><name>hbase.index.max.buffered.docs</name><value>500</value></property>
        <property><name>hbase.index.max.field.length</name><value>10000</value></property>
        <property><name>hbase.index.merge.factor</name><value>10</value></property>
        <property><name>hbase.index.use.compound.file</name><value>true</value></property>
        <property><name>hbase.index.optimize</name><value>true</value></property>
        </configuration>

        > Take2 seems to be mangled:

        I just tried and it works for me. I rerolled it anyway and here it is.

        stack added a comment -

        > The content of an index configuration is actually a property value in an hbase configuration. You can see an example in BuildTableIndex.java

        Pardon me Ning for being a bit thick but I do not see an example of per column config. in BuildTableIndex. I see parsing of command line and passing of a list of column names to IdentityTableMap but not an example of per-column config. as a property value of an hbase config. Do you mean the XML in TestTableIndex? If so, its not clear how you do config. for columns 2, 3, etc. Perhaps you could provide an example here in the issue

        Take2 seems to be mangled:

        durruti:~/Documents/checkouts/hadoop-trunk stack$ patch -p0 < ~/Desktop/build_table_index.take2.patch 
        (Stripping trailing CRs from patch.)
        patching file src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestTableIndex.java
        patch: **** malformed patch at line 311: Index: src/contrib/hbase/src/java/org/apache/hadoop/hbase/mapred/BuildTableIndex.java
        

        Good on you Ning

        Ning Li added a comment -

        Thanks for the comments!

        > Shouldn't IndexConf extend HBaseConfiguration else you'll not have the hbase settings in the mix (Would IndexConfiguration be a better name than IndexConf).

        The content of an index configuration is actually a property value in an hbase configuration. You can see an example in BuildTableIndex.java

        > You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME. You should fix. Otherwise it won't apply when hudson tries to apply it.

        Done in take2.

        > The way you add the per-column config. into a hadoop configuration is very cute. I'm unclear how multiple columns are done. Should there be a columns element to hold multiple column elements? I'd suggest you add javadoc with an example config. ('cos trying to conjure the xml produced by the code takes a little effort).

        There is an example index configuration in BuildTableIndex.java. Configurations for a column are in a "column" element. I'll add the example to javadoc once we agree on the best way to do index configuration.

        > Ning, have you tried your patch on a distributed cluster? Does your column trick get properly distributed out and your LuceneDocumentWrapper work in the distributed context?
        >
        > Did you use lucene 2.2 or something else?
        > I had a problem compiling:

        Oops. The compilation problem was my mistake (I forgot to remove some unused code). All fixed in take2. Yes, I included Lucene 2.2 in hbase/lib. And yes, I have tested on a distributed cluster. Since the index configuration content is a property in an hbase configuration, it works properly in the distributed environment.

        stack added a comment - edited

        This is a nice looking addition Ning

        Here's a couple of comments:

        Shouldn't IndexConf extend HBaseConfiguration else you'll not have the hbase settings in the mix (Would IndexConfiguration be a better name than IndexConf).

        You made the patch inside $HBASE_HOME/src rather than at $HADOOP_HOME. You should fix. Otherwise it won't apply when hudson tries to apply it.

        The way you add the per-column config. into a hadoop configuration is very cute. I'm unclear how multiple columns are done. Should there be a columns element to hold multiple column elements? I'd suggest you add javadoc with an example config. ('cos trying to conjure the xml produced by the code takes a little effort).

        Ning, have you tried your patch on a distributed cluster? Does your column trick get properly distributed out and your LuceneDocumentWrapper work in the distributed context?

        Did you use lucene 2.2 or something else?

        I had a problem compiling:

            [javac] Compiling 14 source files to /Users/stack/Documents/checkouts/hadoop-trunk/build/contrib/hbase/test
            [javac] /Users/stack/Documents/checkouts/hadoop-trunk/src/contrib/hbase/src/test/org/apache/hadoop/hbase/TestTableIndex.java:255: cannot find symbol
            [javac] symbol  : variable DONE_NAME
            [javac] location: class org.apache.hadoop.hbase.mapred.IndexOutputFormat
            [javac]       if (IndexOutputFormat.DONE_NAME.equals(name)) {
        

          People

          • Assignee: Unassigned
          • Reporter: Ning Li
          • Votes: 0
          • Watchers: 2
