Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.90.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Using the Apache Lucene library we can add freetext search to HBase. The advantages of this are:

      • HBase is highly scalable and distributed
      • HBase is realtime
      • Lucene is a fast inverted index and will soon be realtime (see LUCENE-2312)
      • Lucene offers many types of queries not currently available in HBase (eg, AND, OR, NOT, phrase, etc)
      • It's easier to build scalable realtime systems on top of an already architecturally sound, scalable realtime data system, eg, HBase.
      • Scaling realtime search will be as simple as scaling HBase.

      Phase 1 - Indexing:

      • Integrate Lucene into HBase such that an index mirrors a given region. This means cascading add, update, and deletes between a Lucene index and an HBase region (and vice versa).
      • Define meta-data to mark a region as indexed, and use a Solr schema to allow the user to define the fields and analyzers.
      • Integrate with the HLog to ensure that index recovery can occur properly (eg, on region server failure)
      • Mirror region splits with indexes (use Lucene's IndexSplitter?)
      • When a region is written to HDFS, also write the corresponding Lucene index to HDFS.
      • A row key will be the ID of a given Lucene document. The Lucene docstore will explicitly not be used because the document/row data is stored in HBase. We will need to determine the best data structure for efficiently mapping a docid -> row key; it could be a docstore, field cache, column stride fields, or some other mechanism.
      • Write unit tests for the above

      Phase 2 - Queries:

      • Enable distributed Lucene queries
      • Regions that have Lucene indexes are inherently available and may be searched on, meaning there's no need for a separate search related system in Zookeeper.
      • Integrate search with HBase's RPC mechanism

      Attachments

      1. HBASE-3529.patch (41 kB) by Jason Rutherglen
      2. HDFS-APPEND-0.20-LOCAL-FILE.patch (8 kB) by Jason Rutherglen

        Issue Links

          Activity

          stack added a comment -

          @Jason Sounds excellent. Could you do this up in a coprocessor? http://hbaseblog.com/2010/11/30/hbase-coprocessors/

          Jason Rutherglen added a comment -

          Thanks. Right, the coprocessor is the key to sync'ing HBase and Lucene. That's where I'll probably start.

          Jason Rutherglen added a comment -

          Another issue brought up is the size of a region vs. the size of the Lucene index. If the region is compressed the resultant Lucene index may in fact be a reasonable size. Typically a maximum Lucene index size of 1 - 2 GB is optimal? If the default region size is 256 MB, and the data's been compressed by (what ratio?), then 256 MB could be ideal?

          Jason Rutherglen added a comment -

          I opened LUCENE-2919 to split indexes by the primary key, eg, the HBase keys.

          Dan Harvey added a comment -

          How would you deal with the data types / serialisation, would you assume the cell data is just UTF8 bytes to start with?

          Jason Rutherglen added a comment -

          I added a DocumentTransformer class that looks like this.

          public abstract class DocumentTransformer {
            public abstract Map<Term,Document> transform(Map<byte[], List<KeyValue>> familyMap) throws Exception;
            public abstract Term[] getIDTerms(Map<byte[], List<KeyValue>> familyMap) throws Exception;
          }
          

          The user can then define how they want to transform the underlying data to Lucene documents. I'm trying to find a JSON library to build the unit tests/demo app with.
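          As an illustration (not part of the attached patch), a minimal transformer for the UTF-8 case Dan asked about might look like the sketch below; the class name and the "uid" field name are assumptions of this sketch.

          import java.util.HashMap;
          import java.util.List;
          import java.util.Map;

          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.Term;

          // Hypothetical sketch: treat every cell value as UTF-8 text and key the
          // Lucene document by the row, so an update replaces the prior document.
          public class Utf8DocumentTransformer extends DocumentTransformer {
            public Map<Term,Document> transform(Map<byte[], List<KeyValue>> familyMap) {
              Map<Term,Document> docs = new HashMap<Term,Document>();
              Document doc = new Document();
              String row = null;
              for (List<KeyValue> kvs : familyMap.values()) {
                for (KeyValue kv : kvs) {
                  if (row == null) {
                    row = Bytes.toStringBinary(kv.getRow());
                    // store the row key so hits can be resolved back to HBase
                    doc.add(new Field("uid", row, Field.Store.YES, Field.Index.NOT_ANALYZED));
                  }
                  // index the cell value as analyzed text under the qualifier name
                  doc.add(new Field(Bytes.toString(kv.getQualifier()),
                      Bytes.toString(kv.getValue()), Field.Store.NO, Field.Index.ANALYZED));
                }
              }
              if (row != null) {
                docs.put(new Term("uid", row), doc);
              }
              return docs;
            }

            public Term[] getIDTerms(Map<byte[], List<KeyValue>> familyMap) {
              for (List<KeyValue> kvs : familyMap.values()) {
                for (KeyValue kv : kvs) {
                  // a single Put or Delete carries one row, so one id term suffices
                  return new Term[] { new Term("uid", Bytes.toStringBinary(kv.getRow())) };
                }
              }
              return new Term[0];
            }
          }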

          stack added a comment -

          @Jason HBase ships with jersey-json (1.4). See here for doc: http://jackson.codehaus.org/Tutorial (Should be easy enough updating jersey-json if needed).

          Jason Rutherglen added a comment -

          Stack, thanks for the pointer.

          Jason Rutherglen added a comment -

          One possible issue that's come to mind is the overhead associated with accessing the Lucene index if it's stored in HDFS. Meaning, in Lucene today we have implementations such as NIOFSDirectory, which uses NIO's positional read underneath, and it's made highly concurrent search apps much faster (as before we were sync'ing per byte[1024] read call). I'm curious whether HDFS has effectively implemented something similar to NIOFSDir underneath? I see pread mentioned in HFile, however I think it's referring to the HDFS-specific implementation?

          stack added a comment -

          @Jason Yeah, the pread is for hdfs. It's going to be slow though, because for EVERY pread invocation, HDFS sets up a socket, loads a new block, seeks to the pread location, then returns bytes and closes the socket. This is to be fixed but that's how it currently works.

          Jason Rutherglen added a comment -

          1) It may be more expedient for now to store the index in a dedicated directory, and save it to HDFS periodically. However I'm not sure 'when' the loading into HDFS would occur, eg, if HBase is always writing to HDFS then there's no way to sync with that mechanism. Perhaps it'd need to be based on the iterative index size changes? Ie, if the index has grown by 25% since the last save?

          2) I'd like to design the recovery logic now. It's simple to save the timestamp into Lucene, then on recovery get the max timestamp, and iterate from there over the HRegion for the remaining 'lost' rows/documents. What's the most efficient way to scan over > timestamp key values?

          3) We can create indexes for the entire HRegion or for the individual column families. Perhaps this should be optional? I wonder if there are dis/advantages from a user perspective? If interleaving postings was efficient we could even design a system to enable parts of posting lists to be changed per column family, where duplicate docids would be written to intermediate in-memory indexes, and 'interleaved' during posting iteration.

          Ted Yu added a comment -

          From Scan.java:

            To only retrieve columns within a specific range of version timestamps,
            execute {@link #setTimeRange(long, long) setTimeRange}.
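          For illustration, a recovery scan bounded by the last indexed timestamp could look roughly like this sketch against the 0.90 client API; conf, tableName, lastIndexedTimestamp and indexRow are assumed to exist in the surrounding coprocessor code.

          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.client.ResultScanner;
          import org.apache.hadoop.hbase.client.Scan;

          // sketch: re-index only the rows written after the last Lucene checkpoint
          Scan scan = new Scan();
          scan.setTimeRange(lastIndexedTimestamp + 1, Long.MAX_VALUE);
          HTable table = new HTable(conf, tableName);
          ResultScanner scanner = table.getScanner(scan);
          try {
            for (Result result : scanner) {
              indexRow(result); // hypothetical: convert the Result into Lucene documents
            }
          } finally {
            scanner.close();
          }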

          Jason Rutherglen added a comment -

          I opened LUCENE-2930 to store the last/max term of a field in the Lucene terms dictionary. We can use this to more efficiently know the index's last commit point, and start indexing from there. The alternative is to iterate the entire terms dictionary, which for the unique timestamp, would be the length of the number of documents.
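          For reference, the slow alternative being avoided here looks roughly like the sketch below (the "timestamp" field name is an assumption of the sketch); for a unique-per-document field this walks one term per document.

          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.index.TermEnum;

          // sketch: scan the whole terms dictionary of the "timestamp" field
          long maxTimestamp(IndexReader reader) throws Exception {
            long max = -1;
            TermEnum terms = reader.terms(new Term("timestamp", ""));
            try {
              do {
                Term t = terms.term();
                if (t == null || !"timestamp".equals(t.field())) {
                  break;
                }
                max = Math.max(max, Long.parseLong(t.text()));
              } while (terms.next());
            } finally {
              terms.close();
            }
            return max;
          }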

          Jason Rutherglen added a comment -

          Where is a good 'temp' directory to place the Lucene indexes relative to other local HBase files?

          ryan rawson added a comment -

          There are no local hbase files. You'll have to come up with something yourself, I guess?

          Jason Rutherglen added a comment -

          Maybe something relative to HDFS then?

          Andrew Purtell added a comment -

          I mailed a comment back but it is not showing up fast enough.

          We have internally been discussing the addition of a Coprocessor framework API for reading and writing streams from/to the region data directory in HDFS.

          ryan rawson added a comment -

          It's going to be tricky, since with security some people may choose to run hdfs and hbase as different users. Furthermore most hadoop installs have multiple jbod-style disks, and places like /tmp won't have much room (my /tmp has < 2GB). If you can avoid local files as much as possible, I'd try to do that.

          Jason Rutherglen added a comment -

          We have internally been discussing the addition of a Coprocessor framework API for reading and writing streams from/to the region data directory in HDFS.

          This'd be good, however for Lucene we'll need to directly access the local filesystem for performance reasons, eg, HDFS sounds like it's going to be slower than going direct (at the moment). Because the indexes will be local, we'll need to periodically sync the local index to HDFS. This isn't as difficult as it sounds, because we can save off a Lucene commit point and write the checkpoint's index files to HDFS, while letting other Lucene operations proceed. I'd say we can move to writing directly to HDFS when HBase no longer uses a heap based block store (and instead relies on the system IO cache).
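          A rough sketch of that copy step (not part of the attached patch), assuming the commit being copied is pinned by a deletion policy so the IndexWriter cannot delete its files mid-copy; the paths and method name are illustrative.

          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.lucene.index.IndexCommit;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.store.Directory;

          // sketch: push the files of the latest local Lucene commit point into HDFS
          void copyCommitToHdfs(Directory localDir, String localPath,
              FileSystem hdfs, Path hdfsIndexDir) throws Exception {
            IndexCommit latest = null;
            for (IndexCommit commit : IndexReader.listCommits(localDir)) {
              if (latest == null || commit.getGeneration() > latest.getGeneration()) {
                latest = commit;
              }
            }
            if (latest == null) {
              return; // nothing committed yet
            }
            for (String file : latest.getFileNames()) {
              // overwrites any stale copy of the same segment file in HDFS
              hdfs.copyFromLocalFile(new Path(localPath, file),
                  new Path(hdfsIndexDir, file));
            }
          }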

          Jason Rutherglen added a comment -

          It's going to be tricky, since with security some people may choose to run hdfs and hbase as different users. Furthermore most hadoop installs have multiple jbod-style disks, and places like /tmp won't have much room (my /tmp has < 2GB). If you can avoid local files as much as possible, I'd try to do that.

          Right, /tmp probably isn't the best place. The config and schema per table will be stored in HDFS, which means we can't store per server data there. Maybe there's an HDFS API to introspect as to where it would store files for a local FileSystem?

          Andrew Purtell added a comment -

          Writing the indexes to HDFS is possible after LUCENE-2373? We get direct reads from HDFS via HDFS-347 and the OS block cache can help there?

          Gary Helmling added a comment -

          Yeah, as Ryan mentions, with security, writing to HDFS via a coprocessor extension will be easiest to enable.

          I wonder if that plus HDFS-347 (which allows reading directly from the local FS if the block exists on the local DN) would allow for good enough performance? Of course, HDFS-347 itself is tricky from a security perspective.

          If local disk writes are the only solution, then the best option may be to make the user plan for it and explicitly specify a Lucene index path in the coprocessor configuration.

          Jason Rutherglen added a comment -

          Writing the indexes to HDFS is possible after LUCENE-2373?

          Right, that's implemented in trunk as the append codecs. https://hudson.apache.org/hudson/job/Lucene-trunk/javadoc//contrib-misc/org/apache/lucene/index/codecs/appending/AppendingCodec.html

          We get direct reads from HDFS via HDFS-347 and the OS block cache can help there?

          BlockReaderLocal is sync'd on each method; that's something we outgrew in Lucene a while back (and in its place NIOFSDirectory is most used, with MMap second). We'd likely have a couple of options here: write to HDFS and [probably] slow queries to some extent, or write directly to a local directory and have the mechanical overhead of copying index files in/out of HDFS.

          Jason Rutherglen added a comment -

          Also, I'm curious about how the HLog works, eg, it's archived into HDFS, is there a difference between what's archived and what's live (and would interleaving be necessary?). The reason the HLog needs to be replayed [I think] is deletes need to be executed. If we simply iterate/scan from a given timestamp, we'd get the new rows however we'd miss executing deletes.

          Andrew Purtell added a comment -

          I'm curious about how the HLog works

          See http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html

          Jason Rutherglen added a comment -

          In the RegionObserver/Coprocessor I don't think there are methods to access the log replay (on server restart), is that something that's planned?

          Jason Rutherglen added a comment -

          To answer the previous question there's this issue: HBASE-3257

          And on memstore flush, we'll do a Lucene index commit to ensure that when we replay the HLog, we won't need to access [potentially] out of date HLog entries. We can store the checkpoint meta-data into the Lucene commit, which obviates the need to implement terms dict last term access.
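          Something along these lines, using Lucene's commit user data to carry the checkpoint; the "hlog-seq" key and the source of the sequence id are assumptions of this sketch, not part of the patch.

          import java.util.HashMap;
          import java.util.Map;

          import org.apache.lucene.index.IndexCommit;
          import org.apache.lucene.index.IndexReader;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.store.Directory;

          // sketch: on memstore flush, commit the index with the HLog checkpoint
          void commitCheckpoint(IndexWriter writer, long hlogSequenceId) throws Exception {
            Map<String,String> userData = new HashMap<String,String>();
            userData.put("hlog-seq", Long.toString(hlogSequenceId));
            writer.commit(userData);
          }

          // sketch: on recovery, read the checkpoint back from the existing commits
          long readCheckpoint(Directory dir) throws Exception {
            long last = -1;
            for (IndexCommit commit : IndexReader.listCommits(dir)) {
              String seq = commit.getUserData().get("hlog-seq");
              if (seq != null) {
                last = Math.max(last, Long.parseLong(seq));
              }
            }
            return last;
          }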

          Eugene Koontz added a comment -

          As Jason pointed out on hbase-dev, HBASE-3257 provides HLog-related functionality that could be used by coprocessors to add indexing information.

          Jason Rutherglen added a comment -

          Is https://issues.apache.org/jira/secure/attachment/12470743/HDFS-347-branch-20-append.txt the patch applied to CDH? If so, the readChunk method isn't implemented. Is there a plan to implement that, perhaps with NIO positional read? Implementing readChunk would make storing the indexes in HDFS entirely tenable.

          ryan rawson added a comment -

          HDFS-347 is not in CDH nor in branch-20-append.

          As for a plan to implement it, perhaps you should?

          Jason Rutherglen added a comment -

          HDFS-347 is not in CDH nor in branch-20-append.

          As for a plan to implement it, perhaps you should?

          Really? Ah, I guess I misread this: https://issues.apache.org/jira/browse/HBASE-3186?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12997267#comment-12997267

          Sure, I can give a go at an NIO positional read version, it'll be a good learning experience. Are there any caveats to be aware of?

          ryan rawson added a comment -

          I do not know, the whole thing is pretty green field. There are a few different implementations of HDFS-347, and I haven't actually seen a credible attempt at really getting it into a shipping hadoop yet. The test patches are pretty great, but they are POC and won't actually be shipping (due to hadoop security).

          You can give it a shot, but be warned you might not get much for your troubles in terms of committed code.

          stack added a comment -

          @Jason We could get hdfs-347 applied to branch-0.20-append. Us HBasers are going to talk it up, that folks should apply it to their hadoop since the benefit is so great. CDH will have something like an hdfs-347 but probably not till CDH4 (Todd talks of a version of hdfs-347 but one that will work w/ security – see his patch up in hdfs-237 as opposed to the amended Dhruba patch posted by Ryan). A hdfs-347 probably won't show in apache hadoop till 0.23/0.24 would be my guess.

          Jason Rutherglen added a comment -

          @Stack I didn't see any patches at HDFS-237. I'd be curious to learn what the security issues are, I guess they're articulated in HDFS-347 as solvable by transferring file descriptors, though I'm not sure why the user running the Hadoop Java process should not be accessing certain local files? Also, maybe there are higher level synchronization issues to be aware of (eg, HDFS-1605)? I'm sure much of this can be changed, though it may require a separate call 'path' and classes to avoid any extraneous synchronization. I do like this approach of making core changes to HDFS which'll benefit HBase and this issue, then also streamlines the Lucene integration (ie, there'll be no need for replicating the index back into HDFS from local disk), which'll reduce the aggregate complexity and testing.

          stack added a comment -

          @Jason Pardon me. The HDFS-237 above is a mistype on my part. I meant HDFS-347 (I was about to make jokes about your dyslexia but on review the affliction blew up in my face). The hbase process can access local files as long as it gets the clearance via hdfs.

          do like this approach of making core changes to HDFS which'll benefit HBase....

          +1

          Jason Rutherglen added a comment -

          It looks simple to change HDFS-347 (the HDFS-347-branch-20-append.txt patch) to read using positional reads, I'm sure it's necessary as a block reader is instantiated per DFSInputStream? read(long position, byte[] buffer, int offset, int length) calls getBlockRange which is sync'd. Then the read method calls fetchBlockByteRange which calls BlockReader.newBlockReader, eg, the blockreader is per thread and isn't reused? So the contention would be in getBlockRange? Perhaps there's not an issue, or not much of one, if the HDFS-347-branch-20-append.txt patch (or something like it) is applied (using HADOOP-6311)?

          I guess the go ahead is to write a Lucene Directory that uses HDFS underneath, that gains concurrency by using DFSInputStream.read(long position, ...)? Oh, the other issue would be all the overhead from simply loading a byte[1024] (eg, all the new object creation etc). Hmm... That'll be a problem.

          Jason Rutherglen added a comment -

          I started on the search part, which is nice as it can utilize HBase's Coprocessor RPC mechanism. The design issue is whether we need to store a unique [family, column, row, timestamp] per column/field into Lucene, or perhaps this only needs to be stored per column family? This'll be used on iteration of the results from Lucene, which yields docids; we'll then look up the values in the doc store, call Get for each doc, and add the Result to the search response. I think this is how it should work?
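          For example, roughly (a sketch only; the stored "uid" row-key field and the method shape are assumptions, not settled design):

          import java.util.ArrayList;
          import java.util.List;

          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Query;
          import org.apache.lucene.search.ScoreDoc;
          import org.apache.lucene.search.TopDocs;

          // sketch: resolve Lucene hits back to live HBase rows via Get
          List<Result> search(IndexSearcher searcher, HTable table, Query query, int n)
              throws Exception {
            TopDocs hits = searcher.search(query, n);
            List<Result> results = new ArrayList<Result>(hits.scoreDocs.length);
            for (ScoreDoc sd : hits.scoreDocs) {
              Document doc = searcher.doc(sd.doc);              // doc store lookup
              byte[] row = Bytes.toBytesBinary(doc.get("uid")); // stored row key
              results.add(table.get(new Get(row)));             // fetch current cell data
            }
            return results;
          }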

          stack added a comment -

          You'll have to include row, column family, and qualifier at least if you are to get from lucene back to the latest version of the cell, won't you? If you want to index more than just the current version of a cell, you'll have to include the hbase timestamp in the lucene index.

          If your lucene indices are per column family, you could leave the column family out of the lucene document and it can be picked up from context; that would leave row, qualifier and timestamp in the lucene document.

          Jason Rutherglen added a comment -

          In regards to HDFS-347 and the issues around fast local file access. I started reimplementing HDFS-347, however I realized it'll be fruitless without an efficient [cached] way of finding the local file a given offset corresponds to. Is there a way for the DFSClient to "listen" for changes to the DataNode and then keep a memory resident 'cache' for the purpose of quickly accessing which local file(s) a given positional read + length corresponds to?

          stack added a comment -

          @Jason Which offset are you talking of? The storefile in hbase keeps offsets in a file index. When we ask to read from a position in the hfile, dfsclient does a quick calc to figure which block and then, relatively, the offset into the target block. Are you talking of something more fine grained or something else?

          Jason Rutherglen added a comment -

          Sorry, I thought through the file access a little more. I think we can use the block local reader as is. Because Lucene reads postings sequentially, we don't really need random file access (eg, the offset issue more or less goes away); we simply need to allow seek'ing forward, and most postings will live inside of a single (64 - 128 MB) block. The issue with this system is we may need to maintain an FSInputStream per thread per file, because we probably don't want to open a new FSInputStream per query given the overhead of creating and destroying them? Will this cause issues with the max file descriptors?

          stack added a comment -

          @Jason Currently HBase keeps all files open all the time (Yeah, users have to up their ulimit if they have more than a smidgeon of data in hbase--requirement #4 or #5).

          Jason Rutherglen added a comment -

          Ah, going back to storing the "row, qualifier and timestamp" in a Lucene document/docstore, the issue is that it does require totally random reads. I wonder if there's some efficient way to store row pointers in RAM (compression?) or a Hadoop data structure that can be used? I think that storing this information in the Lucene field cache is going to cause OOMs. It'd be great if we could simply store a long that points to the exact row and column family we'd like to reference, as that could easily be stored in RAM, and would possibly enable faster lookup?

          stack added a comment -

          Are you thinking you could exploit hbase scan somehow? If so, how you think it would work?

          What's a Lucene docid? A long? Or a double? You could toBytes that and that'd be the HBase row (HBase rows are byte arrays). The column family could be one byte – that'd give you 256 maximum column family names. The qualifier probably has to be the Lucene document field name. You could try and keep these short. Timestamp is a long. So that's two longs (docid + ts), one byte for cf, and, say, 8 characters for the field name... that's about 25 bytes or so per Lucene doc. Will that cause you to run out of mem?

          Jason Rutherglen added a comment -

          @Stack Thanks for the analysis. I forgot to mention that each subquery would also require its own FSInputStream, which would be too many file descriptors. The heap required for 25 bytes * 2 mil docs is 50MB, eg, that's too much?

          I think we can go ahead with the positional read, which'd only require an FSInputStream per file, to be shared by all readers of that file (using FileChannel.read(ByteBuffer dst, long position) underneath). Given the number of blocks per Lucene file will be < 10 and the blocks are of a fixed size, we can divide the (offset / blocksize) to efficiently obtain the block index? I think it'll be efficient to translate a file offset into a local block file, eg, I'm not sure why LocatedBlocks.findBlock uses a binary search because I'm not familiar enough with HDFS. Then we'd just need to cache the LocatedBlock(s), instead of looking them up from the DataNode on each small read byte[1024] call.

          In summary:

          • DFSClient.DFSInputStream.getBlockRange looks fast enough for many calls per second
          • locatedBlocks.findBlock uses a binary search for some reason; that could be a bottleneck. Why can't we simply divide the offset by the block size? Oh, OK, that's because block sizes are variable. I guess if the number of blocks is small the binary search will always be fast? Or we can detect when the blocks are of the same size and divide to get the correct block?
          • DFSClient.DFSInputStream.fetchBlockByteRange is a hotspot because it calls chooseDataNode, whose return value [DNAddrPair] can be cached inside of LocatedBlock?
          • Later in fetchBlockByteRange we call DFSClient.createClientDatanodeProtocolProxy() and make a local RPC call, getBlockPathInfo. I think the results of this [BlockPathInfo] can be cached into LocatedBlock as well?
          • Then instead of instantiating a new BlockReader object, we can call FileChannel.read(ByteBuffer b, long pos) directly?
          • With this solution in place we can safely store documents in the docstore without any worries, and in addition use the system that is most efficient in Lucene today, all the while using the fewest file descriptors possible.
          Jason Rutherglen added a comment -

          We'll want to keep a single ConcurrentMergeScheduler per HRegionServer (rather than per HRegion) even though there'll be an IndexWriter per HRegion (eg, the default is to have a CMS per IW, which could potentially generate too many threads). I'm wondering if there's a global attribute space to put the CMS so that it can be reused across HRegions?

          Jason Rutherglen added a comment -

          Here's a really simple first cut at converting HDFS-347 to use NIO positional read. I'm implementing this conservatively as I'm honestly not entirely sure how HDFS works. The TestFileLocalRead passes. I don't know why we're closing the file descriptor after each read, I'm going to start trying to remove that, and cache the FD (and other values) somewhere.

          ryan rawson added a comment -

          can you submit this to the proper jira? This isn't hdfs

          Jason Rutherglen added a comment -

          @Ryan Sure, I just wanted to iterate here a little bit, and then test it out with the HDFSDirectory implementation, before submitting it to HDFS-347.

          stack added a comment -

          @Jason Do you need to hack on hdfs first? Its critical to making the search work on hbase?

          Jason Rutherglen added a comment -

          Do you need to hack on hdfs first? Its critical to making the search work on hbase?

          Yes, HDFS as it is would make queries execute extremely slowly (because of random small reads), also I don't know how to implement the HDFSDirectory (the Lucene interface to the filesystem) without knowing how HDFS works. In this case, we need to use NIO positional read underneath. I think the patch shows NIO pos is doable and hopefully it'll be completed shortly, enough to implement HDFSDirectory and then run a performance comparison of HDFSDirectory vs. NIOFSDirectory. Eg, we'll build identical indexes in both dirs, run the same queries and examine the difference in query speed.
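          As a sketch of what that read path could look like (class name hypothetical; assumes Lucene's BufferedIndexInput and Hadoop's positional read, which does not move the stream's file pointer, so concurrent queries against the same file don't serialize on a seek):

          import java.io.IOException;

          import org.apache.hadoop.fs.FSDataInputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.lucene.store.BufferedIndexInput;

          // Hypothetical sketch of an HDFSDirectory read path: one open stream per
          // IndexInput, positional reads that never move a shared file pointer.
          public class HDFSIndexInput extends BufferedIndexInput {
            private final FSDataInputStream in;
            private final long length;
            private long position;

            public HDFSIndexInput(FileSystem fs, Path path) throws IOException {
              this.in = fs.open(path);
              this.length = fs.getFileStatus(path).getLen();
            }

            protected void readInternal(byte[] b, int offset, int len) throws IOException {
              int read = 0;
              while (read < len) {
                // positional read: no seek, so concurrent readers don't contend
                int n = in.read(position + read, b, offset + read, len - read);
                if (n < 0) {
                  throw new IOException("read past EOF at " + position);
                }
                read += n;
              }
              position += len;
            }

            protected void seekInternal(long pos) {
              this.position = pos;
            }

            public long length() {
              return length;
            }

            public void close() throws IOException {
              in.close();
            }
          }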

          stack added a comment -

          OK.

          Why NIO positional read? How is that different than the pread that is already in the dfsclient api? You don't like going via the Block API? Above you say in parens '...(using FileChannel.read(ByteBuffer dst, long position)...' What if the data is not local, usually it is (> 99% of the time), but is not always; e.g. in time of failure or perhaps after a rebalance. You going to get the FileChannel off the socket (thats the nio bit)?

          You do get the bit that hdfs-347 is a naughty hack as is. A version that respects 'security', where the 'cleared' fd is passed via unix domain sockets, for the dfsclient to use going direct is probably what'll go in sometime soon hopefully.

          You are messing down deep below hbase in dfs. I'm a little worried that you'll do a bunch of custom work that may work for your lucene directory implementation but that it will be so particular, it won't be accepted back into hdfs.

          Ted Yu added a comment -

          In certain deployments, the data node and region server are not on the same machine. The above would pose a performance issue.

          Jason Rutherglen added a comment -

          Why NIO positional read? How is that different than the pread that is already in the dfsclient api

          I think the goal of HDFS-347 is it'll automatically switch between reading over the network and reading locally? So the pread'll do one or the other?

          You going to get the FileChannel off the socket (thats the nio bit)?

          That's just for the local file.

          What if the data is not local, usually it is (> 99% of the time), but is not always; e.g. in time of failure or perhaps after a rebalance.

          If we read off a socket I think there's going to be a serious degradation in performance. I think that's an invariant of search?

          A version that respects 'security', where the 'cleared' fd is passed via unix domain sockets, for the dfsclient to use going direct is probably what'll go in sometime soon hopefully.

          That'll be good! I think this initial version (of HDFS modifications) is simply to get things going, as these other [HDFS] improvements are added we can use them and the DFSInputStream methods used by HDFSDirectory'll be the same?

          You are messing down deep below hbase in dfs. I'm a little worried that you'll do a bunch of custom work that may work for your lucene directory implementation but that it will be so particular, it won't be accepted back into hdfs.

          If we need to pass the FD using Unix domain sockets then the HDFS work won't be useful. If the UDS's enable positional read, then the [Lucene] HDFSDirectory will work well.

          Jason Rutherglen added a comment -

          To get Solr distributed queries working across the searchable HBase cluster, we'll need SOLR-1431 completed. Then in this issue, we'll implement the underlying data transfer protocol using HBase RPC (instead of HTTP).

          Jason Rutherglen added a comment -

          Here's a first cut of this:

          • The default directory for the Lucene index is based on the region encoded name
          • We're ignoring column family names for now, this should be configurable, however given we may integrate the Solr config system, there are no configuration settings as of yet.
          • Concurrent queries need to be tested, however they probably will not work because we need the underlying positional read enabled.
          • We're using the append codec because of HDFS
          • The analyzer is hard coded, we should look at integrating the Solr schema system, however that is [currently] hardwired into the rest of the Solr config system.
          • HBaseIndexSearcher uses the UID in the doc store to load the actual data from HBase
          • There are 2 unit tests, TestLuceneCoprocessor and TestHDFSDirectory
          • If HBase replication is turned on, we need to ensure each region's Lucene index has a unique path
          stack added a comment -

          Patch looks great Jason. Is it working?

          On license, it's 2011, not 2010.

          What you need here?

          +    // sleep here is an ugly hack to allow region transitions to finish
          +    Thread.sleep(5000);
          

          We should add an API that confirms region transitions for you rather than have you wait on a timer that may or may not work (on Hudson, the Apache build server, it is sure to fail though it may pass on all other platforms known to man).

          I love the very notion of an HBaseIndexSearcher.

          FYI, there is Bytes.equals in place of

          +        if (!Arrays.equals(r.getTableDesc().getName(), tableName)) {
          

          .. your choice. Just pointing it out....

          So, you think the package should be o.a.h.h.search? Do you think this all should ship with hbase Jason? By all means push back into hbase changes you need for your implementation but its looking big enough to be its own project? What you reckon?

          Class comment missing from documenttransformer to explain what it does. Its abstract. Should it be an Interface? (Has no functionality).

          Copyright missing from HDFSLockFactory.

          You are making HDFS locks. Would it make more sense doing ephemeral locks in zk since zk is part of your toolkit when up on hbase?

          Whats going on here?

          +        } else if (!fileSystem.isDirectory(new Path(lockDir))) {//lockDir.) {//isDirectory()) {
          

          DefaultDocumentTransformer.java does non-standard license after the imports. You do this in a few places.

          You probably should use Bytes.toStringBinary instead of + String value = new String(kv.getValue()); The former does UTF-8 and it'll make binaries into printables if any present.

          ditto here: + String rowStr = Bytes.toString(row);

          Class doc missing off HBaseIndexSearcher (or do you want to add package doc to boast about this amazing new utility?)

          What is this 'convert' in HIS doing? Cloning?

          Make the below use Logging instead of System.out?

          + System.out.println("createOutput:"+name);
          + return new HDFSIndexOutput(getPath(name));

          Have you done any perf testing on this stuff? Is it going to be fast enough? You hoping for most searches to be in-memory?

          Whats appending codec?

          Jason Rutherglen added a comment -

          Patch looks great Jason. Is it working?

          Stack, thanks for your comments. The test cases pass. They're not very stressful yet.

          FYI, there is Bytes.equals in place of

          It's copied from TestRegionObserverInterface.

          You are making HDFS locks. Would it make more sense doing ephemeral locks in zk since zk is part of your toolkit when up on hbase?

          That's a good idea, however if HBase is enforcing the lock on a region, meaning the region can only exist on one server at a time, then the Lucene index locks are less important.

          Have you done any perf testing on this stuff. Is it going to be fast enough? You hoping for most searches in in-memory.

          We can get the functionality working, restructuring Lucene or Solr as needed, assuming that positional reads in HDFS will be implemented (I have a separate HDFS patch that can be applied); then I'll start to benchmark. The index doesn't need to be in heap space, as the file-local positional reads should rely on the system IO cache.

          Whats appending codec?

          Some Lucene segment files, after being written, seek back to the beginning of the file to write header information; the append codec only writes forward.

          Class comment missing from documenttransformer to explain what it does. Its abstract. Should it be an Interface? (Has no functionality).

          I will change it to an interface.

          So, you think the package should be o.a.h.h.search? Do you think this all should ship with hbase Jason? By all means push back into hbase changes you need for your implementation but its looking big enough to be its own project? What you reckon?

          I think it'll be easier to write the code embedded in HBase, then if it works [well] we can decide?

          What is this 'convert' in HIS doing? Cloning?

          It's loading the actual data from HBase and returning it in a Lucene document. While we can simply return the row + timestamp, loading the doc data is useful if we integrate Solr, because Solr needs a fully fleshed out document to perform for example, highlighting.
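
          Roughly, the 'convert' step amounts to something like the following sketch (names are illustrative, not the patch's actual ones): given a hit's row key, fetch the row from HBase and flesh out a Lucene Document from its KeyValues.

          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.hadoop.hbase.client.Get;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.Result;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;

          // Hypothetical helper: turn a row key returned by a Lucene hit into a full document.
          static Document convert(HTable table, byte[] row) throws java.io.IOException {
            Result result = table.get(new Get(row));
            Document doc = new Document();
            doc.add(new Field("rowkey", Bytes.toStringBinary(row), Field.Store.YES, Field.Index.NOT_ANALYZED));
            for (KeyValue kv : result.raw()) {
              String name = Bytes.toString(kv.getFamily()) + ":" + Bytes.toString(kv.getQualifier());
              doc.add(new Field(name, Bytes.toStringBinary(kv.getValue()), Field.Store.YES, Field.Index.NO));
            }
            return doc;
          }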

          I'll incorporate the rest of the code recommendations. The next patch will [hopefully] have an RPC based search call, implement index splitting (eg, performing the same operation on the index as a region split), and have a test case for WAL based index restoring.

          Todd Lipcon added a comment -

          No matter how great this works, I don't think we should pull it into HBase proper. For me, "contribs" are an anti-pattern for various reasons we've discussed before on the dev list.

          Now that we are a TLP, we could consider hosting subprojects for things like this - i.e. with their own SVN trees and release cycles. But tying this work's release cycle to HBase core has a number of bad effects associated with it.

          Should we discuss on-list if there's disagreement?

          Ted Yu added a comment -

          In DefaultDocumentTransformer, I think we should check whether row has changed:

          +        if (row == null) {
          +          row = kv.getRow();
          

          The following call should be made if row has changed:

          addFields(row, timestamp, doc, lucene);
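
          In other words, a sketch of the loop Ted is describing (edits, addFields and addField are stand-ins for the patch's own names): emit a document whenever the row changes while iterating the sorted KeyValues, and once more at the end.

          // Sketch: group KeyValues by row and call addFields once per row.
          byte[] row = null;
          Document doc = null;
          for (KeyValue kv : edits) {
            if (row == null || !Bytes.equals(row, kv.getRow())) {
              if (doc != null) {
                addFields(row, timestamp, doc, lucene);   // flush the previous row's document
              }
              row = kv.getRow();
              doc = new Document();
            }
            addField(doc, kv);                            // hypothetical per-KeyValue field adder
          }
          if (doc != null) {
            addFields(row, timestamp, doc, lucene);       // flush the last row
          }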
          
          Andrew Purtell added a comment -

          @Todd Hosting subprojects sounds reasonable to me. We want to make a friendly home for cool new work but can also accommodate downstream packagers who don't want any kind of support implied.

          Jason Rutherglen added a comment -

          In DefaultDocumentTransformer, I think we should check whether row has changed

          Is it possible to modify multiple rows per postPut or postWALRestore? Are the KeyValues sorted by row? We probably want to group row modifications together. Also, it seems it's possible to update only a select few columns of a row, so we may need to reload the entire row and index it again?

          stack added a comment -

          @Andrew I'm against taking on src/contribs given past experience where they tended to add friction to major core changes. With hbase up in Apache git, I think it's easier for projects that are not in our src tree to follow along (github makes it easy doc'ing, etc., the related external project). Discussion of the add-on up on hbase is grand (and encouraged I'd say since it lets the rest of the hbase space know of the addition) but no src I'd say. Any changes to core that an external project requires in order to work we should take on too (given good justification).

          Andrew Purtell added a comment -

          @Stack I didn't say contrib, I said sub projects.

          stack added a comment -

          @Andrew Pardon me for my misread but I'd be agin keeping up subprojects too because of the admin load. We don't need it IMO.

          Ted Yu added a comment -

          postWALRestore would pass one WALEdit which is for one row.
          postPut is for one row as well.

          Otis Gospodnetic added a comment -

          Jason, what is the current state of this work? Does it work with the trunk? Is there a list of issues/problems that need to be fixed before this can be called "working"? Thanks!

          Jason Rutherglen added a comment -

          @Otis The next step is to benchmark the query performance which may be degraded due to the random positional read performance of HDFS. For this maybe we should use: http://code.google.com/a/apache-extras.org/p/luceneutil/ Also, the blocking issues should [ideally] be resolved. You can take a look at the Solr one SOLR-1431, and commit it.

          Otis Gospodnetic added a comment -

          Thanks Jason. What's the Solr dependency about? I thought your idea was to go with a pure Lucene-level HBase indexing integration, not Solr. I do see you mention Solr's schema in the initial comments in this issue, but can't find any mentions of Solr in your patch. Could you please clarify the approach? Oh, and if the ML is a better medium, I can move my questions there. Thanks.

          Jason Rutherglen added a comment -

          @Otis We can benchmark using Lucene in conjunction with HDFS-347, of which I have a more streamlined version that'll be available on Github. Implementing Solr for benchmarking would create too much overhead.

          I think we may want to integrate with Solr [in the future] for out of the box distributed queries, facets, and also to make use of the schema. I'll likely open additional Solr related issues when we get there.

          Jason Rutherglen added a comment -

          I placed the HDFS-347 changes in a Github repository located at: https://github.com/jasonrutherglen/HDFS-347-HBASE

          Jason Rutherglen added a comment -

          The HBase search related branch (at this point it's a branch, however it can/should be a coprocessor isolated Jar) is located at: https://github.com/jasonrutherglen/HBase-Search The utility of it is to have all of the integrated Lucene/HDFS changes in one place, which allows easy benchmark and test execution.

          Jason Rutherglen added a comment -

          I'm working on profiling and optimizing the HDFS random access, so that Lucene queries against HDFS perform the same as native file system access using NIOFSDirectory.

          I think one extremely direct approach is to set the max block size to something above the size of any Lucene segment file (at runtime via the DFSClient.create method). This will guarantee that there is only one underlying java.io.File per HDFS file, and so random access will avoid navigating block structures (which require expensive network calls, a binary search, and object creation overhead).
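
          A minimal sketch of what that looks like through the public FileSystem API (the patch goes through DFSClient directly; the path and the 1 GB figure here are purely illustrative):

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FSDataOutputStream;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // One block per Lucene file: pick a block size larger than any segment file will ever be.
          long blockSize = 1024L * 1024L * 1024L;                         // 1 GB, illustrative
          int bufferSize = conf.getInt("io.file.buffer.size", 4096);
          short replication = fs.getDefaultReplication();
          FSDataOutputStream out = fs.create(new Path("/hbase/lucene/_0.frq"),
              true, bufferSize, replication, blockSize);
          // ... write the Lucene file contents forward-only, then close ...
          out.close();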

          Jason Rutherglen added a comment -

          Here are some basic benchmark numbers. The code is more or less pushed to Github. I need to verify it all works from a clean download of the various parts, of which there are three: Lucene, the HDFS-347-modified Hadoop 0.20-append, and HBase with Search.

          The architecture is to write out a single block per Lucene file. In this way we can simply obtain one underlying java.io.File directly from the DFSClient. The file is then MMap'ed using a modified version of the MMapDirectory called HDFSDirectory.
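
          Once the single local block file is known, the MMap itself is plain java.nio; roughly (how the block file's path is obtained is the HDFS-347-style modification, represented here only by the localBlockFile parameter):

          import java.io.File;
          import java.io.RandomAccessFile;
          import java.nio.MappedByteBuffer;
          import java.nio.channels.FileChannel;

          static MappedByteBuffer map(File localBlockFile) throws java.io.IOException {
            RandomAccessFile raf = new RandomAccessFile(localBlockFile, "r");
            try {
              // Map the whole block read-only; HDFSDirectory's IndexInput reads from this buffer,
              // much as Lucene's MMapDirectory does for local files.
              return raf.getChannel().map(FileChannel.MapMode.READ_ONLY, 0, raf.length());
            } finally {
              raf.close();
            }
          }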

          The benchmark shows that storing Lucene indexes into HDFS and reading directly from HDFS is viable (as opposed to copying the files out of HDFS first to the local filesystem).

          Here are times in milliseconds, on the Wiki-EN corpus:

          lucene indexing duration: 50202
          lucene query time #1: 11780
          lucene query time #2: 6211
          lucene query time #3: 6181

          hbase indexing duration: 70681
          hbase query time #1: 8332
          hbase query time #2: 6785
          hbase query time #3: 6621

          As you can see, the indexing is a little bit slower when writing to HDFS. However with the new changes going into Lucene (ie, LUCENE-2324), a pause when flushing (due to HDFS overhead) will not slow down indexing. So expect indexing parity soon.

          The main query times to look at are the #2 and #3, allowing for warmup of the system IO cache in #1. HBase queries are somewhat slower because each new DFSInputStream created must contact the DataNode. We can optimize this however I think for now we're good.

          Here are the queries being run (50 times per round); they are non-trivial.

          "states"
          "unit*"
          "uni*"
          "u*d"
          "un*d"
          "united~0.75"
          "united~0.6"
          "unit~0.7"
          "unit~0.5", // 2
          "doctitle:/.[Uu]nited./"
          "united OR states"
          "united AND states"
          "nebraska AND states"
          "\"united states\""
          "\"united states\"~3"

          Jason Rutherglen added a comment -

          I updated the HBase search branch at Github and created complete instructions for how to execute the benchmark. This should also help with examining the code. The HBASE-SEARCH project contains 10,000 bz2 compressed wiki-en documents which account for 100 MB of the download. The slightly modified Lucene libraries are located in the lib/ directory (so that you do not need to download the entire Lucene branch source).

          https://github.com/jasonrutherglen/HBASE-SEARCH/blob/trunk/BENCHMARK.txt

          The Lucene vs. HBase Search indexing and search times will be located in the file:
          target/surefire-reports/org.apache.hadoop.hbase.search.TestSearchBenchmark-output.txt

          Benchmark Execution Instructions
          
          Create a directory for the HBase Lucene installation.  Then run the following:
          
          git clone git://github.com/jasonrutherglen/HDFS-347-HBASE.git HDFS-347-HBASE
          cd HDFS-347-HBASE
          ant mvn-install
          cd ..
          
          git clone git://github.com/jasonrutherglen/HBASE-SEARCH.git HBASE-SEARCH
          cd HBASE-SEARCH
          cd lib
          ./install-libs.sh
          cd ..
          cd wiki-en
          tar -jxvf 10000.bz2
          cd ..
          mvn test -Dtest=TestSearchBenchmark
          

          Feel free to let me know if there are problems or if you have questions.

          Jason Rutherglen added a comment -

          I think the next round of benchmarking could involve showing that we need to directly access the underlying block file in order to not lose performance when running Lucene on HDFS. This is somewhat as per the comment on HDFS-347:

          https://issues.apache.org/jira/browse/HDFS-347?focusedCommentId=13013719&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13013719

          The next thing we wanted to look at was random I/O. There is a lot
          more overhead on the datanode for this particular use case so this
          could be a place where direct access could really excel

          We can test using HDFS-941 vs. direct block file access using MMap (by obtaining the local file path and the unix domain sockets). I think then we'll show that for the Lucene case, we're on the right track by using direct file access.

          Jason Rutherglen added a comment -

          HDFS-941 doesn't apply to trunk, and we'd need a semi-unique build of HDFSDirectory and the benchmarking code updated to Hadoop trunk (as opposed to Hadoop 0.20-append). Given that Unix Domain Sockets (HADOOP-6311) is for trunk rather than 0.20-append, we may want to wait for a version of HBase that runs on Hadoop trunk (eg, the current direct file access works fine; Unix Domain Sockets is only for security, not speed). Then we can put off benchmarking HDFS-941 as well.

          Jason Rutherglen added a comment -

          I updated the Lucene version to the latest from trunk, which includes the new asynchronous flushing of the RAM buffer. As expected, this has put index creation using HDFS in line with Lucene (because the overhead from the DataNode does not delay further indexing). It also looks like the query times are nearly the same.

          Lucene indexing duration: 57858 ms
          Lucene query time #1: 14208 ms
          Lucene query time #2: 7024 ms
          Lucene query time #3: 6902 ms

          HBase indexing duration: 50631 ms
          HBase query time #1: 8625 ms
          HBase query time #2: 7081 ms
          HBase query time #3: 7139 ms

          stack added a comment -

          Nice!

          Todd Lipcon added a comment -

          Awesome stuff. These query times above are using the hacky (non-secure non-checksummed) implementation of HDFS-347?

          Apologies for my laziness of not looking through the code, but would you provide a one-paragraph description of how a user would interact with this? As I am understanding it, this is:

          • User defines some special property on a column family that they want to be searchable
            • this property would include a solr schema which specifies analyzers and fields
          • User inserts data using normal HBase APIs
          • User can now perform an arbitrary lucene search over the table, resulting in completely up-to-date results? (ie spans both memstore and flushed data)?

          Is that right?

          Jason Rutherglen added a comment -

          Awesome stuff. These query times above are using the hacky (non-secure non-checksummed) implementation of HDFS-347?

          It's hackier than that. It's basically obtaining the java.io.File directly from the FSInputStream. However it's a good baseline to benchmark against things like HADOOP-6311 + HDFS-347. Those need to wait for HBase that works with Hadoop 0.22/trunk anyways?

          User defines some special property on a column family that they want to be searchable, this property would include a solr schema which specifies analyzers and fields

          Currently there's a DocumentTransformer class which needs to be implemented to transform column-family edits into a Lucene document. That could use the Solr schema for example or any other separate system to tokenize the byte[]s into a Document.
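
          Conceptually that hook is small; a hypothetical shape of the interface (the actual patch's signature may differ):

          import java.util.List;
          import org.apache.hadoop.hbase.KeyValue;
          import org.apache.lucene.document.Document;

          /** Turns a batch of column-family edits into Lucene documents (e.g. driven by a Solr schema). */
          public interface DocumentTransformer {
            List<Document> transform(byte[] row, long timestamp, List<KeyValue> edits);
          }

          A Solr-schema-backed implementation would look up each family:qualifier in the schema to choose the analyzer and field options.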

          User can now perform an arbitrary lucene search over the table, resulting in completely up-to-date results? (ie spans both memstore and flushed data)?

          I think for now we need to offer an external commit on the index, as Lucene only has near realtime search (eg, small segments will be written out, which will overwhelm HDFS). LUCENE-2312 will implement realtime search (eg, searching on the RAM buffer as it's being built). The recent LUCENE-3092 could be used in the meantime to build segments in RAM and only flush to HDFS when they consume too much RAM; then we would not need to force the user to 'commit' the index.

          To answer the question, yes, though today the indexing performance will not be as good as when LUCENE-2312 is implemented or the user will need to 'commit' the index to search on the latest data.

          Getting all of Solr to work with this system is fairly doable. Each Solr core would map to a region. Things like replication would be disabled. The config files would be stored in HDFS (instead of the local filesystem). For distributed queries, we need SOLR-1431, and then to implement distributed networking using HBase RPC instead of Solr's HTTP RPC. There are other smaller internal things that'd need to change in Solr. I think HBase RPC is aware of where regions live etc so I don't think we need to worry about putting failover logic into the distributed search code?

          I'm going to post additional benchmarks shortly, eg, for 100,000 and 1 mil documents.

          Jason Rutherglen added a comment -

          In regards to checksums, I think we can verify/checksum the Lucene index
          files only once when HDFSDirectory is created. We cannot checksum per file
          open as the overhead would be too much. I think there'll need to be a hook
          added to run the checksum via the HDFS client?

          The other issue is ensuring data locality as otherwise Lucene queries will
          be unusable due to the inherent random access pattern. I think for this
          we'll need to add something to the NameNode? Perhaps it would be a custom
          placement policy, where if a given file is part of the Lucene index and
          not local, we ask the NameNode to make it local (thereby over replicating
          the file). I think this'll be a separate Jira issue?

          User inserts data using normal HBase APIs

          Yes, even if we [possibly] support Solr, we'd only be implementing a
          subset of the Solr functionality. One of the things that would go unused
          is the ability to update documents using Solr APIs (which we'd turn off),
          instead the data will only be updated via HBase. The Solr query APIs and
          schema would be the main parts of Solr we'd be using. This can be roughly
          defined as making use of the request handlers and search components:
          http://wiki.apache.org/solr/SearchComponent which perhaps should be
          modularized out of Solr anyways.

          User can now perform an arbitrary lucene search over the table,
          resulting in completely up-to-date results? (ie spans both memstore and
          flushed data)?

          Yes, that is correct.

          Jason Rutherglen added a comment -

          I opened HDFS-2004 to implement pinning HDFS files (in this case the Lucene index files) to the local DataNode. I think this is necessary functionality for HBase search because all index files need to be local (we're MMap'ing). I think the common use case is a region server going down; when the new one is brought up, files will likely not be local?

          Jason Rutherglen added a comment -

          In discussing with J-D (thanks!), we can place logic in the Lucene
          Coprocessor preOpen method to find out if any of the blocks of the Lucene
          files in HDFS are not local (by asking the NameNode), then we can:

          1) Rewrite, partially optimize, or fully optimize the index, thereby
          rewriting the index files which causes them to 'go local'.

          2) Extend the default placement policy and balancer to skip 'balancing'
          Lucene files, because we want them to stay local.

          3) Use HDFS-2004 to manually move non-local blocks to the local DataNode.

          Where #3 is more complex and will likely be much more time consuming.
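
          The locality check itself is straightforward with the public API; a sketch of what preOpen could do (localHostName and the index directory path are placeholders):

          import org.apache.hadoop.fs.BlockLocation;
          import org.apache.hadoop.fs.FileStatus;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          static boolean isFullyLocal(FileSystem fs, Path indexDir, String localHostName) throws java.io.IOException {
            for (FileStatus file : fs.listStatus(indexDir)) {
              BlockLocation[] blocks = fs.getFileBlockLocations(file, 0, file.getLen());
              for (BlockLocation block : blocks) {
                boolean local = false;
                for (String host : block.getHosts()) {
                  if (host.equals(localHostName)) { local = true; break; }
                }
                if (!local) return false;   // at least one block lives only on remote DataNodes
              }
            }
            return true;
          }

          If this returns false, option 1 above (rewriting or optimizing the index so new files are written locally) is the cheapest remedy.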

          This functionality is important as it could currently be considered the only
          'blocker' on putting HBase search into a test/production environment.

          Jason Rutherglen added a comment -

          SOLR-1431 is updated to trunk. I'm tempted to start trying to plug in
          Solr. I think the way to do this is to use the HTable.coprocessorExec
          method (for the distributed search), where the Solr shards are of the form
          'shards=start:hexstartkey,end:hexendkey'. Then HBase will take care of the
          rest from an RPC perspective. Eg, forwarding the request to the individual
          HRegion's running the SolrCoprocessor.
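
          With HBase 0.92's coprocessor endpoints that fan-out looks roughly like this; SearchProtocol and SearchResults are hypothetical endpoint types, only coprocessorExec itself is real API.

          import java.io.IOException;
          import java.util.Map;
          import org.apache.hadoop.hbase.client.HTable;
          import org.apache.hadoop.hbase.client.coprocessor.Batch;

          // Hypothetical endpoint implemented by the search coprocessor on each region:
          //   public interface SearchProtocol extends CoprocessorProtocol {
          //     SearchResults search(String query, int topN) throws IOException;
          //   }
          static Map<byte[], SearchResults> distributedSearch(HTable table, byte[] startRow,
              byte[] endRow, final String query) throws Throwable {
            return table.coprocessorExec(SearchProtocol.class, startRow, endRow,
                new Batch.Call<SearchProtocol, SearchResults>() {
                  public SearchResults call(SearchProtocol instance) throws IOException {
                    return instance.search(query, 10);   // top 10 per region, merged client side
                  }
                });
          }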

          I think we'll use a single Solr schema per region, though we can add a
          special delimiter in the field name to indicate that the prefix is the
          column family, then the column name. Something like 'headers:subject' may
          work. The main caveat is that the fields marked stored in fact
          will not be stored into Lucene (because they're in HBase).

          Alex Baranau added a comment -

          There seems to be a bug in the changed HDFS code in git://github.com/jasonrutherglen/HDFS-347-HBASE.

          hdfs-347-hbase/src/hdfs/org/apache/hadoop/hdfs/DFSClient.java
          getFile():1508

          if ((targetAddr.equals(localHost) ||
          targetAddr.getHostName().startsWith("localhost"))) {

          instead, it should be:

          if ((targetAddr.getAddress().equals(localHost) ||
          targetAddr.getHostName().startsWith("localhost"))) {

          This causes TestLuceneCoprocessor to fail in case the machine's host resolves to smth other than localhost ("alexpc" in my case).

          P.S. this was found during HBase hackathon in Berlin, hi from there!

          Alex Baranau added a comment -

          Another problem we faced: looks like there's an issue in the TestLuceneCoprocessor tests' life-cycle or smth else:

          • the testSearchRPC test fails if we run "mvn clean -Dtest=TestLuceneCoprocessor test", other 2 pass (it fails on first assert: expected 20, but found 10)
          • if I add @Ignore to other two tests, i.e. the maven command runs only testSearchRPC, it works well
          Jason Rutherglen added a comment -

          Hi Alex, I have new code I will commit to Github.

          Alex Baranau added a comment -

          Thank you! Berlin is waiting! (kidding, we are going to leave very soon)

          Otis Gospodnetic added a comment -

          A few more comments/questions for Jason:

          • I see PKIndexSplitter usage for splitting the index when a region splits. I see you split the index, open 2 IndexWriters for 2 new Lucene indices, but then either you are not adding documents to them, or I'm not seeing it?
          • Are there issues around distributed search? It looks like it wasn't in your github branch.
          • What will happen when a region changes its location/regionserver for whatever reason? I see HDFS-2004 got -1ed and you said without that search will be slow. Do you have an alternative plan?
          • What is the reason for storing those 2 extra row fields? (the UID one and the other one... I think it's called rowStr or something like that)
          • What about storing the index in HBase itself? (a la Solandra, I suppose) Would this be doable? Would it make things simpler in the sense that any splitting or moving around, etc. may be handled by HBase and we wouldn't have to make sure the Lucene index always mirrors what's in a region and make sure it follows the region wherever it goes? Lars' idea/question, and I hope I didn't misunderstand or misrepresent his ideas.
          Jason Rutherglen added a comment -

          Otis, I think many of your questions have been addressed in this issue, though indeed the comment trail is long at this point.

          Do you have an alternative plan?

          https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13040465&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13040465

          Are there issues around distributed search? It looks like it wasn't in your github branch

          https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

          What about storing the index in HBase itself?

          I think that's a great idea to test, though in a different Jira issue.

          PKIndexSplitter

          That's LUCENE-2919. Given it's not been committed I may need to bring it over into the HBase search source tree.

          Otis Gospodnetic added a comment -

          Re https://issues.apache.org/jira/browse/HBASE-3529?focusedCommentId=13042913&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13042913

          Does that mean that in order to implement distributed search you'll immediately convert this to HBase+Solr instead of HBase+Lucene, so that you don't have to do Lucene-level distributed search? If so, what about NRTness that will be lost until Solr gets NRT search?

          Jason Rutherglen added a comment -

          Does that mean that in order to implement distributed search you'll immediately convert this to HBase+Solr instead of HBase+Lucene

          I think the distributed search capability has been removed from Lucene (I just sent an email to Lucene dev)? We should add it back? Hence the possible Solr integration.

          If so, what about NRTness that will be lost until Solr gets NRT search?

          There's a Solr issue to add this though one wouldn't want to implement NRT without LUCENE-3092 + SOLR-2565.

          Jason Rutherglen added a comment -

          To implement distributed search with sort, we'll need to serialize the field values across the RPC channel. This can be implemented by assuming the sort is by ord which yields BytesRef values, which are easy to sort.
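
          A sketch of that serialization (the surrounding Writable/result classes are hypothetical; only BytesRef itself is Lucene API): each shard's top-N carries the ord-resolved BytesRef sort value so the client can merge by plain byte comparison.

          import java.io.DataInput;
          import java.io.DataOutput;
          import java.io.IOException;
          import org.apache.lucene.util.BytesRef;

          static void writeBytesRef(DataOutput out, BytesRef ref) throws IOException {
            out.writeInt(ref.length);
            out.write(ref.bytes, ref.offset, ref.length);
          }

          static BytesRef readBytesRef(DataInput in) throws IOException {
            int length = in.readInt();
            byte[] bytes = new byte[length];
            in.readFully(bytes);
            return new BytesRef(bytes);
          }

          Merging across regions is then a byte-wise comparison of these values, mirroring how Lucene compares ords within a single index.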

          Jason Rutherglen added a comment -

          With some recent patches committed to Lucene, I can post a patch to HBase trunk that should work fine and will only require the special HDFS-347 modification/build. Perhaps it's possible to Maven in the custom HDFS-347 so that no external libraries need to be manually downloaded.

          Andrew Purtell added a comment -

          Perhaps it's possible to Maven in the custom HDFS-347 so that no external libraries need to be manually downloaded.

          Post 0.92 we plan to modularize the Maven build already for pluggable RPC and security-variant code. We can also conditionally build coprocessors set in their own packages. In this case, something like -D HDFS-347 enables build of it, and pulls down a suitably patched Hadoop core jar?

          Jason Rutherglen added a comment -

          We can also conditionally build coprocessors set in their own packages

          Ok, that sounds interesting. Currently I'm pretending like search will be a part of HBase core. If there is another directory to place it in, eg, a coprocessor or contrib directory, I will place it there.

          In this case, something like -D HDFS-347 enables build of it, and pulls down a suitably patched Hadoop core jar?

          Yeah I have no idea how to post the HDFS-347-LUCENE version to a Maven repo and get that working. I can however probably figure it out.

          I like the idea of posting a patch, putting things on Github seems quite remote, even to me, and I admit to preferring the simplicity of SVN on this currently one man project.

          Andrew Purtell added a comment -

          Currently I'm pretending like search will be a part of HBase core.

          Like security, I think there will be enough interest in this that "core but conditional" makes a lot of sense.

          Jason Rutherglen added a comment -

          What's the best way to set custom attributes on the Coprocessor? Eg, I want to tell the Lucene Coprocessor where to look for a configuration file in HDFS.

          Andrew Purtell added a comment -

          What's the best way to set custom attributes on the Coprocessor? Eg, I want to tell the Lucene Coprocessor where to look for a configuration file in HDFS.

          See HBASE-4048 and HBASE-3810; HBASE-3810 is still pending.
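
          Independent of those JIRAs, a minimal way to get an attribute to the coprocessor today is through the Configuration its environment already exposes. Below is a hedged sketch of that approach; the property name hbase.lucene.schema.path is made up for illustration, while start()/stop(), getConfiguration(), and the FileSystem calls are standard HBase/Hadoop APIs.

          import java.io.IOException;
          import java.io.InputStream;

          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hbase.CoprocessorEnvironment;
          import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;

          public class LuceneIndexObserver extends BaseRegionObserver {

            private volatile InputStream schemaStream;

            @Override
            public void start(CoprocessorEnvironment env) throws IOException {
              Configuration conf = env.getConfiguration();
              // Hypothetical property naming where the Solr-style schema lives in HDFS.
              String schemaPath = conf.get("hbase.lucene.schema.path");
              if (schemaPath != null) {
                FileSystem fs = FileSystem.get(conf);
                schemaStream = fs.open(new Path(schemaPath));
                // ... parse the schema and build the per-field analyzers here ...
              }
            }

            @Override
            public void stop(CoprocessorEnvironment env) throws IOException {
              if (schemaStream != null) {
                schemaStream.close();
              }
            }
          }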

          Jason Rutherglen added a comment -

          I opened a trivial issue, LUCENE-3296, so that a custom IndexWriter config can be passed in.
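
          For reference, building such a config on the HBase side is straightforward in Lucene 3.x. A small sketch follows; the analyzer choice and RAM buffer size are illustrative only, and the component that ultimately receives the config is whatever LUCENE-3296 opens up.

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.IndexWriterConfig;
          import org.apache.lucene.store.Directory;
          import org.apache.lucene.store.RAMDirectory;
          import org.apache.lucene.util.Version;

          public class CustomWriterConfig {
            // Build the config the coprocessor wants to control, then hand it to the writer.
            public static IndexWriter open(Directory dir) throws Exception {
              IndexWriterConfig config = new IndexWriterConfig(
                  Version.LUCENE_33, new StandardAnalyzer(Version.LUCENE_33));
              config.setRAMBufferSizeMB(64.0); // illustrative tuning value
              return new IndexWriter(dir, config);
            }

            public static void main(String[] args) throws Exception {
              IndexWriter writer = open(new RAMDirectory());
              writer.close();
            }
          }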

          Jason Rutherglen added a comment -

          I'm signing up to [1] for the HDFS-347 Maven hosting.

          1. http://nexus.sonatype.org/oss-repository-hosting.html

          Jason Rutherglen added a comment -

          In reviewing the HDFS-347 modification I'd made, the only part of HDFS-347 that's needed is obtaining the local file path from the DataNode via RPC. I will generate a patch that implements this and submit it to HDFS against the append-0.20 branch. It would be nice to have a more generic, introspective metadata-style API for HDFS that would encompass this (so that it's not specific to only local data files).
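
          To make the intent concrete, here is a rough sketch of how a reader could exploit such a lookup. getLocalPathForBlock() is a hypothetical stand-in for whatever RPC the attached patch exposes, and the fast path simplistically assumes a single-block file; only the fallback to an ordinary HDFS stream read is meant literally.

          import java.io.File;
          import java.io.IOException;
          import java.io.InputStream;
          import java.io.RandomAccessFile;

          import org.apache.hadoop.fs.FileSystem;
          import org.apache.hadoop.fs.Path;

          public class LocalBlockReader {

            /** Hypothetical: ask the DataNode (via the patched RPC) where a block lives on local disk. */
            interface LocalPathLookup {
              File getLocalPathForBlock(Path file, long offset) throws IOException;
            }

            private final FileSystem fs;
            private final LocalPathLookup lookup;

            public LocalBlockReader(FileSystem fs, LocalPathLookup lookup) {
              this.fs = fs;
              this.lookup = lookup;
            }

            /** Read length bytes at offset, preferring a direct local read when possible. */
            public byte[] read(Path file, long offset, int length) throws IOException {
              byte[] buf = new byte[length];
              File local = lookup.getLocalPathForBlock(file, offset);
              if (local != null && local.exists()) {
                // Fast path: bypass the DataNode data path and read the block file directly.
                RandomAccessFile raf = new RandomAccessFile(local, "r");
                try {
                  raf.seek(offset); // simplification: assumes the file fits in one block
                  raf.readFully(buf);
                } finally {
                  raf.close();
                }
              } else {
                // Fallback: ordinary HDFS stream read.
                InputStream in = fs.open(file);
                try {
                  long skipped = 0;
                  while (skipped < offset) {
                    skipped += in.skip(offset - skipped);
                  }
                  int read = 0;
                  while (read < length) {
                    int n = in.read(buf, read, length - read);
                    if (n < 0) throw new IOException("EOF before " + length + " bytes");
                    read += n;
                  }
                } finally {
                  in.close();
                }
              }
              return buf;
            }
          }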

          Jason Rutherglen added a comment -

          Here's a clean update to HDFS that enables obtaining the local file a block corresponds to. I need to place this build in a Maven repository for the actual HBase Search patch.

          Eugene Koontz added a comment -

          "What's the best way to set custom attributes on the Coprocessor? Eg, I want to tell the Lucene Coprocessor where to look for a configuration file in HDFS."

          "See HBASE-4048 and HBase-3810. 3810 is still pending."

          Martin Alig added a comment -

          @Jason: Are you still working on this issue?

          linwukang added a comment -

          I think the most difficult part of integrating Solr into HBase is how to maintain consistency between Solr and HBase.
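
          That consistency question is what the coprocessor hooks discussed above are meant to answer: every put/delete that hits the region is mirrored into the index in the same code path. A minimal sketch of that idea follows, assuming the 0.92-era RegionObserver hook signatures; schema mapping, error handling, and the IndexWriter lifecycle are omitted, and the same hooks could forward to Solr instead of a local IndexWriter.

          import java.io.IOException;

          import org.apache.hadoop.hbase.client.Delete;
          import org.apache.hadoop.hbase.client.Put;
          import org.apache.hadoop.hbase.coprocessor.BaseRegionObserver;
          import org.apache.hadoop.hbase.coprocessor.ObserverContext;
          import org.apache.hadoop.hbase.coprocessor.RegionCoprocessorEnvironment;
          import org.apache.hadoop.hbase.regionserver.wal.WALEdit;
          import org.apache.hadoop.hbase.util.Bytes;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.Term;

          public class LuceneSyncObserver extends BaseRegionObserver {

            private IndexWriter writer; // opened in start(), closed in stop() (not shown)

            @Override
            public void postPut(ObserverContext<RegionCoprocessorEnvironment> ctx,
                                Put put, WALEdit edit, boolean writeToWAL) throws IOException {
              String rowKey = Bytes.toString(put.getRow());
              Document doc = new Document();
              // The row key is the Lucene document id; indexed fields would come from the schema.
              doc.add(new Field("rowkey", rowKey, Field.Store.NO, Field.Index.NOT_ANALYZED));
              writer.updateDocument(new Term("rowkey", rowKey), doc); // idempotent add/update
            }

            @Override
            public void postDelete(ObserverContext<RegionCoprocessorEnvironment> ctx,
                                   Delete delete, WALEdit edit, boolean writeToWAL) throws IOException {
              writer.deleteDocuments(new Term("rowkey", Bytes.toString(delete.getRow())));
            }
          }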


            People

            • Assignee:
              Unassigned
            • Reporter:
              Jason Rutherglen
            • Votes:
              36
            • Watchers:
              99
