
Key: LUCENE-710
Type: Improvement
Status: Resolved
Resolution: Fixed
Priority: Minor
Assignee: Michael McCandless
Reporter: Michael McCandless
Votes: 0
Watchers: 2
Lucene - Java

Implement "point in time" searching without relying on filesystem semantics

Created: 14/Nov/06 07:46 PM   Updated: 15/Mar/07 10:31 PM
Component/s: Index
Affects Version/s: 2.1
Fix Version/s: None

Time Tracking:
Not Specified

File Attachments (all licensed for inclusion in ASF works):
  710.review.diff (25 kB), 2007-03-15 10:08 AM, Doron Cohen
  710.review.take2.diff (26 kB), 2007-03-15 12:14 PM, Michael McCandless
  LUCENE-710.patch (162 kB), 2007-03-02 03:52 PM, Michael McCandless
  LUCENE-710.take2.patch (162 kB), 2007-03-07 09:56 AM, Michael McCandless
  LUCENE-710.take3.patch (163 kB), 2007-03-09 09:44 AM, Michael McCandless

Lucene Fields: New
Resolution Date: 13/Mar/07 09:11 AM


Description

This was touched on in a recent discussion on the dev list:

http://www.gossamer-threads.com/lists/lucene/java-dev/41700#41700

and then more recently on the user list:

http://www.gossamer-threads.com/lists/lucene/java-user/42088

Lucene's "point in time" searching currently relies on how the
underlying storage handles deletion of files that are held open for
reading.

This is highly variable across filesystems. For example, UNIX-like
filesystems usually do "delete on last close", and Windows filesystems
typically refuse to delete a file open for reading (so Lucene retries
later). But NFS just removes the file out from under the reader, and
for that reason "point in time" searching doesn't work on NFS
(see LUCENE-673).

With the lockless commits changes (LUCENE-701), it's quite simple to
re-implement "point in time" searching so as to not rely on filesystem
semantics: we can just keep more than the last segments_N file (as
well as all files they reference).
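
To make "keep more than the last segments_N file (as well as all
files they reference)" concrete, here is a minimal sketch of how the
set of removable files could be computed once some commits are chosen
to be kept. The class name RetainedFiles, the method deletable and the
map-based representation are assumptions made for illustration; they
are not Lucene's API and not what the attached patches do.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of Lucene. */
class RetainedFiles {
    /**
     * filesByCommit maps each segments_N file name to the index files it
     * references. Returns every file (including segments_N files) that is
     * not needed by any commit listed in keptCommits.
     */
    static Set<String> deletable(Map<String, Set<String>> filesByCommit,
                                 Set<String> keptCommits) {
        Set<String> live = new HashSet<>();
        Set<String> all = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : filesByCommit.entrySet()) {
            all.add(e.getKey());
            all.addAll(e.getValue());
            if (keptCommits.contains(e.getKey())) {
                // A kept commit pins its segments_N file and everything it references.
                live.add(e.getKey());
                live.addAll(e.getValue());
            }
        }
        all.removeAll(live);
        return all;
    }
}
```

The point of the sketch is that a file shared by an old and a new
commit stays on disk for as long as either commit is kept, so a
reader holding the old commit never depends on how the filesystem
treats deletion of open files.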

This is also in keeping with the design goal of "rely on as little as
possible from the filesystem". E.g., with lockless commits we no
longer re-use filenames (don't rely on filesystem cache being
coherent) and we no longer use file renaming (because on Windows it
can fail). This would be another step of not relying on semantics of
"deleting open files". The less we require from the filesystem, the
more portable Lucene will be!

Where it gets interesting is what "policy" we would then use for
removing segments_N files. The policy now is "remove all but the last
one". I think we would keep this policy as the default. Then you
could imagine other policies:

  • Keep the past N days' worth
  • Keep the last N
  • Keep only those in active use by a reader somewhere (note: tricky
    how to reliably figure this out when readers have crashed, etc.)
  • Keep those "marked" as rollback points by some transaction, or
    marked explicitly as a "snapshot".
  • Or, roll your own: the "policy" would be an interface or abstract
    class and you could make your own implementation.
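
As a rough illustration of the first two policies in the list, the
selection logic might look like the sketch below. CommitInfo and
RetentionRules are hypothetical names, and the rule that the newest
commit is never deleted is an assumption added here so the index
always stays openable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical descriptor: the N of a segments_N file plus its commit time. */
class CommitInfo {
    final long generation;
    final long timestampMillis;

    CommitInfo(long generation, long timestampMillis) {
        this.generation = generation;
        this.timestampMillis = timestampMillis;
    }
}

/** Hypothetical selection rules; commits are expected oldest first. */
class RetentionRules {
    /** "Keep the last N": returns the commits to delete (n is clamped to at least 1). */
    static List<CommitInfo> toDeleteKeepLastN(List<CommitInfo> commits, int n) {
        int cut = Math.max(0, commits.size() - Math.max(1, n));
        return new ArrayList<>(commits.subList(0, cut));
    }

    /** "Keep the past N days' worth": deletes older commits, never the newest one. */
    static List<CommitInfo> toDeleteOlderThan(List<CommitInfo> commits,
                                              long nowMillis, long maxAgeMillis) {
        List<CommitInfo> doomed = new ArrayList<>();
        for (int i = 0; i < commits.size() - 1; i++) {
            CommitInfo c = commits.get(i);
            if (nowMillis - c.timestampMillis > maxAgeMillis) {
                doomed.add(c);
            }
        }
        return doomed;
    }
}
```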

I think for this issue we could just create the framework
(interface/abstract class for "policy" and invoke it from
IndexFileDeleter) and then implement the current policy (delete all
but most recent segments_N) as the default policy.
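
A minimal sketch of what such a framework could look like follows.
All names here (CommitPoint, DeletionPolicy, KeepOnlyLastCommit,
DemoDeleter) are hypothetical and chosen for illustration; this is
not the API in the attached patches, and DemoDeleter is only an
in-memory stand-in for IndexFileDeleter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical handle on one commit (a segments_N file and what it references). */
interface CommitPoint {
    String segmentsFileName();
    void delete(); // asks the deleter to remove this commit
}

/** Hypothetical policy hook, called with all known commits, oldest first. */
interface DeletionPolicy {
    void onCommit(List<? extends CommitPoint> commits);
}

/** The current behaviour expressed as a policy: delete all but the most recent. */
class KeepOnlyLastCommit implements DeletionPolicy {
    public void onCommit(List<? extends CommitPoint> commits) {
        for (int i = 0; i < commits.size() - 1; i++) {
            commits.get(i).delete();
        }
    }
}

/** In-memory stand-in for the deleter, for demonstration only. */
class DemoDeleter {
    private final DeletionPolicy policy;
    private final List<DemoCommit> commits = new ArrayList<>();
    private long generation = 0;

    DemoDeleter(DeletionPolicy policy) {
        this.policy = policy;
    }

    /** Records a new commit, consults the policy, then drops what it deleted. */
    void commit() {
        generation++;
        commits.add(new DemoCommit("segments_" + generation)); // simplified naming
        policy.onCommit(commits);
        commits.removeIf(c -> c.deleted);
    }

    List<String> liveCommits() {
        List<String> names = new ArrayList<>();
        for (DemoCommit c : commits) {
            names.add(c.segmentsFileName());
        }
        return names;
    }

    private static final class DemoCommit implements CommitPoint {
        private final String name;
        private boolean deleted;

        DemoCommit(String name) { this.name = name; }
        public String segmentsFileName() { return name; }
        public void delete() { deleted = true; }
    }
}
```

In this sketch a policy never touches files directly; it only marks
commits, and the deleter decides which files become unreferenced.
Swapping in a different policy (for example a no-op one that keeps
every commit) would not require any change to the deleter.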

In separate issue(s) we could then create the above more interesting
policies.

I think there are some important advantages to doing this:

  • "Point in time" searching would work on NFS (it doesn't now
    because NFS doesn't do "delete on last close"; see LUCENE-673)
    and any other Directory implementations that don't work
    currently.
  • Transactional semantics become a possibility: you can set a
    snapshot, do a bunch of stuff to your index, and then rollback to
    the snapshot at a later time.
  • If a reader crashes or machine gets rebooted, etc, it could choose
    to re-open the snapshot it had previously been using, whereas now
    the reader must always switch to the last commit point.
  • Searchers could search the same snapshot for follow-on actions.
    Meaning, a user does a search, then next page, drill down (Solr),
    drill up, etc. These are each separate trips to the server, and if
    the searcher has been re-opened, the user can get inconsistent
    results (= lost trust). But with this change, one series of search
    interactions could explicitly stay on the snapshot it had started
    with.


Change History:
Michael McCandless made changes - 19/Jan/07 03:09 PM
  Status: Open [ 1 ] -> In Progress [ 3 ]
Michael McCandless made changes - 02/Mar/07 03:52 PM
  Attachment added: LUCENE-710.patch [ 12352438 ]
Michael McCandless made changes - 07/Mar/07 09:56 AM
  Attachment added: LUCENE-710.take2.patch [ 12352818 ]
Michael McCandless made changes - 09/Mar/07 09:44 AM
  Attachment added: LUCENE-710.take3.patch [ 12352965 ]
Michael McCandless made changes - 13/Mar/07 09:11 AM
  Status: In Progress [ 3 ] -> Resolved [ 5 ]
  Resolution: (none) -> Fixed [ 1 ]
Doron Cohen made changes - 15/Mar/07 10:08 AM
  Attachment added: 710.review.diff [ 12353363 ]
Michael McCandless made changes - 15/Mar/07 12:14 PM
  Attachment added: 710.review.take2.diff [ 12353369 ]