
Key: LUCENE-1577
Type: Improvement
Status: Open
Priority: Minor
Assignee: Unassigned
Reporter: Jason Rutherglen
Votes: 0
Watchers: 0
Lucene - Java

Benchmark of different in RAM realtime techniques

Created: 27/Mar/09 07:52 PM   Updated: 28/Aug/09 05:24 PM
Component/s: contrib/*
Affects Version/s: 2.4.1
Fix Version/s: 3.1

Time Tracking:
Original Estimate: 168h
Remaining Estimate: 168h
Time Spent: Not Specified

File Attachments:
LUCENE-1577.patch (text file, licensed for inclusion in ASF works), 16 kB, 2009-03-27 08:03 PM, Jason Rutherglen
Issue Links:
Reference
 

Lucene Fields: New


Description
A place to post code that benchmarks the differences in the speed of indexing and searching using different realtime techniques.

Jason Rutherglen added a comment - 27/Mar/09 08:03 PM
This patch benchmarks three techniques for RAM-based realtime indexing, in which a new document is searchable immediately after an update. It performs multiple rounds of indexing, takes the lowest figure for each technique, and reports each as a percentage difference from the lowest of the three. The document source is the English Wikipedia XML used by contrib/benchmark.
  • RealtimeWriter uses InstantiatedIndex
  • LuceneWriter adds documents to an IndexWriter
  • LuceneRealtimeWriter creates a RAMDirectory, opens an IndexWriter, adds a document, then closes the writer.

I found it odd that RealtimeWriter is faster than LuceneWriter, so perhaps the benchmark is incorrect somehow. Otherwise the results look highly promising, in that we can implement realtime search with no impact on existing indexing performance.

Summary of the results:

numRounds:3 docs indexed:50000
lowest of each, percent compared with lowest
RealtimeWriter:7597 dif:0%
LuceneWriter:12940 dif:70%
LuceneRealtimeWriter:25882 dif:241%
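
The "dif" column above can be reproduced from the three reported figures. The following is a minimal sketch, not the code from the patch: the class and method names are hypothetical, and rounding to the nearest whole percent is an assumption inferred from the reported 70% and 241%.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

class PercentOverLowest {
    /**
     * Percent by which value exceeds lowest, rounded to the nearest whole
     * percent (the rounding rule is an assumption, not taken from the patch).
     */
    static long percentOverLowest(long value, long lowest) {
        return Math.round(100.0 * (value - lowest) / lowest);
    }

    public static void main(String[] args) {
        // Lowest figure per technique, as reported in the summary above.
        Map<String, Long> lowestPerTechnique = new LinkedHashMap<>();
        lowestPerTechnique.put("RealtimeWriter", 7597L);
        lowestPerTechnique.put("LuceneWriter", 12940L);
        lowestPerTechnique.put("LuceneRealtimeWriter", 25882L);

        // The baseline is the lowest figure across all techniques.
        long lowest = Collections.min(lowestPerTechnique.values());
        for (Map.Entry<String, Long> e : lowestPerTechnique.entrySet()) {
            long dif = percentOverLowest(e.getValue(), lowest);
            System.out.println(e.getKey() + ":" + e.getValue() + " dif:" + dif + "%");
        }
    }
}
```

With the figures above this gives 0%, 70% and 241%, matching the summary.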


Jason Rutherglen made changes - 27/Mar/09 08:03 PM
Field Original Value New Value
Attachment LUCENE-1577.patch [ 12403837 ]
Michael McCandless added a comment - 28/Mar/09 09:05 AM
Are these tests measuring adding a single doc, then searching on it? What are the numbers you measure in the results (eg 25882 for LuceneRealtimeWriter)?

I think we need a more realistic test for near real-time search, but I'm not sure exactly what that is.

In LUCENE-1516 I've added a benchmark task to periodically open a new near real-time reader from the writer, and then tested it while doing bulk indexing. But that's not a typical test, I think (normally bulk indexing is done up front, and only a "trickle" of updates to docs is then done for near real-time search). Maybe we just need an updateDocument task, which randomly picks a doc (identified by a primary-key "docid" field) and replaces it. contrib/benchmark already has the ability to rate-limit how frequently docs are updated.
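
The update task described above could look roughly like the following. This is a sketch under stated assumptions, not a contrib/benchmark task: a HashMap keyed by docid stands in for the index, the class and method names are hypothetical, and the fixed sleep is only a crude stand-in for the benchmark's own rate limiting. In Lucene itself the replace step would correspond to IndexWriter.updateDocument(Term, Document) with a term on the "docid" field.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

class UpdateTaskSketch {
    /** Stand-in for the index: primary key "docid" -> document body. */
    private final Map<Integer, String> index = new HashMap<>();
    private final Random random;

    UpdateTaskSketch(int numDocs, long seed) {
        random = new Random(seed);
        // Bulk indexing done up front: docids are 0..numDocs-1.
        for (int docid = 0; docid < numDocs; docid++) {
            index.put(docid, "original body " + docid);
        }
    }

    /** Picks a random existing docid and replaces that doc; returns the docid. */
    int updateRandomDoc(String newBody) {
        int docid = random.nextInt(index.size());
        index.put(docid, newBody); // replace by primary key
        return docid;
    }

    String get(int docid) {
        return index.get(docid);
    }

    int size() {
        return index.size();
    }

    /** Applies numUpdates updates at roughly updatesPerSec (the "trickle"). */
    void run(int numUpdates, double updatesPerSec) throws InterruptedException {
        long pauseMillis = (long) (1000.0 / updatesPerSec);
        for (int i = 0; i < numUpdates; i++) {
            updateRandomDoc("updated body " + i);
            Thread.sleep(pauseMillis);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        UpdateTaskSketch task = new UpdateTaskSketch(1000, 42L);
        task.run(10, 100.0);
        System.out.println("docs in index after updates: " + task.size());
    }
}
```

Because each update replaces an existing doc, the doc count stays constant, which keeps the search side of the test comparable across the run.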


Michael McCandless added a comment - 10/Jun/09 08:12 PM
Moving out.

Michael McCandless made changes - 10/Jun/09 08:12 PM
Fix Version/s 2.9 [ 12312682 ]
Michael McCandless made changes - 11/Jun/09 09:32 AM
Fix Version/s 3.1 [ 12314025 ]
Jason Rutherglen added a comment - 11/Aug/09 09:58 PM
We need a benchmark that simply measures the indexing of
1, 5, 10, 100, and 1000 docs + (reopen + query). The first
benchmark can use IW.getReader as is (meaning the newly created
segments are written to disk); the other can use LUCENE-1313
(which stores newly created segments in RAM). This way we can
accurately say which method works best and in what situation.
The use case LUCENE-1313 is designed for is updates of fewer
than 100 documents.

I'll update LUCENE-1313, and give this a try.
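
The shape of such a benchmark can be sketched as below. Everything here is hypothetical: the Technique interface is an illustrative abstraction (one implementation would wrap IW.getReader, another LUCENE-1313), and taking the lowest time over several rounds mirrors the earlier summary rather than any agreed methodology.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BatchReopenBenchmark {
    /** Hypothetical abstraction over one realtime technique under test. */
    interface Technique {
        void index(int numDocs);
        void reopenAndQuery();
    }

    static final int[] BATCH_SIZES = {1, 5, 10, 100, 1000};

    /** Returns, per batch size, the lowest elapsed nanoseconds over numRounds. */
    static Map<Integer, Long> run(Technique technique, int numRounds) {
        Map<Integer, Long> lowestPerBatch = new LinkedHashMap<>();
        for (int batch : BATCH_SIZES) {
            long lowest = Long.MAX_VALUE;
            for (int round = 0; round < numRounds; round++) {
                long start = System.nanoTime();
                technique.index(batch);     // index the batch
                technique.reopenAndQuery(); // then reopen + query
                lowest = Math.min(lowest, System.nanoTime() - start);
            }
            lowestPerBatch.put(batch, lowest);
        }
        return lowestPerBatch;
    }

    public static void main(String[] args) {
        // Dummy technique that only counts docs, to show the harness shape.
        final int[] docCount = {0};
        Technique dummy = new Technique() {
            public void index(int numDocs) {
                docCount[0] += numDocs;
            }

            public void reopenAndQuery() {
                // A real implementation would reopen a reader and run a query.
            }
        };
        for (Map.Entry<Integer, Long> e : run(dummy, 3).entrySet()) {
            System.out.println(e.getKey() + " docs: " + e.getValue() + " ns");
        }
    }
}
```

Timing index and reopen + query together per batch size is what lets the two approaches be compared at the small batch sizes LUCENE-1313 targets.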


Mark Miller added a comment - 11/Aug/09 10:29 PM

normally bulk indexing is done up front, and only a "trickle" of updates to docs is then done for near real-time search

It really depends, though, I think. I would bet that many users who want real time are dealing with a huge number of updates at given times, and that type of thing seems likely to grow. A lot of the time, I think it could be much more than a trickle.

A lot of installations I have seen have certain times when a lot of documents are coming in (certain times, certain days). Social-networking sites likely see a constant stream of updates at most times. Press releases have hotspots for release, newspaper data all comes in at once in the morning, etc.


Jason Rutherglen made changes - 28/Aug/09 05:24 PM
Link This issue is related to LUCENE-1313 [ LUCENE-1313 ]