Lucene - Core

Further steps towards flexible indexing

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0-ALPHA
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I attached a very rough checkpoint of my current patch, to get early
      feedback. All tests pass, though back-compat tests don't pass due to
      changes to package-private APIs plus certain bugs in tests that
      happened to work (eg calling TermPositions.nextPosition() too many
      times, which the new API asserts against).

      [Aside: I think, when we commit changes to package-private APIs such
      that back-compat tests don't pass, we could go back, make a branch on
      the back-compat tag, commit changes to the tests to use the new
      package-private APIs on that branch, then fix the nightly build to use
      the tip of that branch?]

      There's still plenty to do before this is committable! This is a
      rather large change:

      • Switches to a new, more efficient terms dict format. This still
        uses tii/tis files, but the tii only stores term & long offset
        (not a TermInfo). At seek points, tis encodes term & freq/prox
        offsets absolutely instead of with deltas. Also, tis/tii are
        structured by field, so we don't have to record the field number
        in every term.

        On the first 1 M docs of Wikipedia, the tii file is 36% smaller
        (0.99 MB -> 0.64 MB) and the tis file is 9% smaller (75.5 MB ->
        68.5 MB).

        RAM usage when loading the terms dict index is significantly
        lower, since we only load an array of offsets and an array of
        Strings (no more TermInfo array). It should be faster to init
        too.

        This part is basically done.
      • Introduces a modular reader codec that strongly decouples the
        terms dict from the docs/positions readers. EG there is no more
        TermInfo used when reading the new format.

        There's nice symmetry now between reading & writing in the codec
        chain – the current docs/prox format is captured in:
        FormatPostingsTermsDictWriter/Reader,
        FormatPostingsDocsWriter/Reader (.frq file) and
        FormatPostingsPositionsWriter/Reader (.prx file).

        This part is basically done.

      • Introduces a new "flex" API for iterating through the fields,
        terms, docs and positions:
        FieldProducer -> TermsEnum -> DocsEnum -> PostingsEnum

        This replaces TermEnum/TermDocs/TermPositions. SegmentReader
        emulates the old API on top of the new API to keep back-compat
        (see the sketch just below).
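
      For illustration only, a rough sketch of how a consumer might walk
      the new enum chain. The four type names come from this issue; every
      method name and the minimal interfaces below are assumptions made so
      the example compiles, not the patch's actual signatures:

        import java.io.IOException;

        class FlexApiSketch {
          interface FieldProducer {
            String nextField() throws IOException;
            TermsEnum terms() throws IOException;
          }
          interface TermsEnum {
            String next() throws IOException;
            DocsEnum docs() throws IOException;
          }
          interface DocsEnum {
            int NO_MORE_DOCS = Integer.MAX_VALUE;
            int nextDoc() throws IOException;
            int freq();
            PostingsEnum positions() throws IOException;
          }
          interface PostingsEnum {
            int nextPosition() throws IOException;
          }

          static void walk(FieldProducer fields) throws IOException {
            while (fields.nextField() != null) {                  // fields
              TermsEnum terms = fields.terms();
              while (terms.next() != null) {                      // terms in this field
                DocsEnum docs = terms.docs();
                while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) { // docs for this term
                  PostingsEnum positions = docs.positions();
                  for (int i = 0; i < docs.freq(); i++) {
                    positions.nextPosition();                     // positions within this doc
                  }
                }
              }
            }
          }
        }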

      Next steps:

      • Plug in new codecs (pulsing, pfor) to exercise the modularity /
        fix any hidden assumptions.
      • Expose new API out of IndexReader, deprecate old API but emulate
        old API on top of new one, switch all core/contrib users to the
        new API.
      • Maybe switch to AttributeSources as the base class for TermsEnum,
        DocsEnum, PostingsEnum – this would give readers API flexibility
        (not just index-file-format flexibility). EG if someone wanted
        to store payload at the term-doc level instead of
        term-doc-position level, you could just add a new attribute.
      • Test performance & iterate.
      1. LUCENE-1458_rotate.patch
        4 kB
        Robert Muir
      2. LUCENE-1458_sortorder_bwcompat.patch
        3 kB
        Robert Muir
      3. LUCENE-1458_termenum_bwcompat.patch
        1 kB
        Robert Muir
      4. LUCENE-1458.patch
        883 kB
        Mark Miller
      5. LUCENE-1458.patch
        878 kB
        Mark Miller
      6. LUCENE-1458.patch
        909 kB
        Michael McCandless
      7. LUCENE-1458.patch
        895 kB
        Michael McCandless
      8. LUCENE-1458.patch
        886 kB
        Michael McCandless
      9. LUCENE-1458.patch
        1024 kB
        Mark Miller
      10. LUCENE-1458.patch
        1015 kB
        Mark Miller
      11. LUCENE-1458.patch
        360 kB
        Michael Busch
      12. LUCENE-1458.patch
        370 kB
        Michael McCandless
      13. LUCENE-1458.patch
        263 kB
        Michael McCandless
      14. LUCENE-1458.patch
        188 kB
        Michael McCandless
      15. LUCENE-1458.patch
        167 kB
        Michael McCandless
      16. LUCENE-1458.patch
        116 kB
        Michael McCandless
      17. LUCENE-1458.tar.bz2
        1.93 MB
        Michael McCandless
      18. LUCENE-1458.tar.bz2
        1.94 MB
        Michael McCandless
      19. LUCENE-1458.tar.bz2
        1.84 MB
        Michael McCandless
      20. LUCENE-1458.tar.bz2
        1.83 MB
        Michael McCandless
      21. LUCENE-1458.tar.bz2
        1.82 MB
        Michael McCandless
      22. LUCENE-1458.tar.bz2
        1.83 MB
        Michael McCandless
      23. LUCENE-1458.tar.bz2
        1.80 MB
        Michael McCandless
      24. LUCENE-1458-back-compat.patch
        22 kB
        Michael McCandless
      25. LUCENE-1458-back-compat.patch
        22 kB
        Michael McCandless
      26. LUCENE-1458-back-compat.patch
        16 kB
        Michael McCandless
      27. LUCENE-1458-back-compat.patch
        16 kB
        Michael McCandless
      28. LUCENE-1458-back-compat.patch
        15 kB
        Michael McCandless
      29. LUCENE-1458-back-compat.patch
        15 kB
        Michael McCandless
      30. LUCENE-1458-DocIdSetIterator.patch
        22 kB
        Uwe Schindler
      31. LUCENE-1458-DocIdSetIterator.patch
        21 kB
        Uwe Schindler
      32. LUCENE-1458-MTQ-BW.patch
        2 kB
        Uwe Schindler
      33. LUCENE-1458-NRQ.patch
        12 kB
        Uwe Schindler
      34. UnicodeTestCase.patch
        2 kB
        Robert Muir
      35. UnicodeTestCase.patch
        2 kB
        Robert Muir

        Issue Links

          Activity

          Mark Miller added a comment -

          Hmmm...I think something is missing - FormatPostingsPositionsReader?

          Michael McCandless added a comment -

          Woops, sorry... I was missing a bunch of files. Try this one?

          Marvin Humphrey added a comment -

          The work on streamlining the term dictionary is excellent, but perhaps we can do better still. Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether?

          Say that we break up the index file into fixed-width blocks of 1024 bytes. Most blocks would start with a complete term/pointer pairing, though at the top of each block, we'd need a status byte indicating whether the block contains a continuation from the previous block in order to handle cases where term length exceeds the block size.

          For Lucy/KinoSearch our plan would be to mmap() on the file, but accessing it as a stream would work, too. Seeking around the index term dictionary would involve seeking the stream to multiples of the block size and performing binary search, rather than performing binary search on an array of cached terms. There would be increased processor overhead; my guess is that since the second stage of a term dictionary seek – scanning through the primary term dictionary – involves comparatively more processor power than this, the increased costs would be acceptable.
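
          A minimal sketch of that lookup idea, assuming a toy block layout
          (status byte, 1-byte term length, then UTF-8 bytes at the start of
          each 1024-byte block) and ignoring continuation blocks entirely;
          the names and layout details are invented for illustration, not
          the actual proposal:

            import java.nio.ByteBuffer;
            import java.nio.charset.StandardCharsets;

            class BlockTermDictIndex {
              static final int BLOCK_SIZE = 1024;
              static final byte COMPLETE = 0;   // block starts with a complete term

              private final ByteBuffer buf;     // e.g. FileChannel.map(READ_ONLY, ...) of the dict file

              BlockTermDictIndex(ByteBuffer mapped) { this.buf = mapped; }

              /** Largest block whose first term is <= target; scan forward from there. */
              int seekBlock(String target) {
                byte[] t = target.getBytes(StandardCharsets.UTF_8);
                int lo = 0, hi = buf.capacity() / BLOCK_SIZE - 1;
                while (lo < hi) {
                  int mid = (lo + hi + 1) >>> 1;
                  if (compareFirstTerm(mid, t) > 0) {
                    hi = mid - 1;               // mid's first term is after target
                  } else {
                    lo = mid;                   // mid's first term is <= target
                  }
                }
                return lo;
              }

              /** Compare the first term in a block (raw UTF-8 bytes) against the target bytes. */
              private int compareFirstTerm(int block, byte[] target) {
                int pos = block * BLOCK_SIZE;
                assert buf.get(pos) == COMPLETE;     // sketch assumes no continuation blocks
                int len = buf.get(pos + 1) & 0xFF;   // toy 1-byte length prefix
                for (int i = 0; i < Math.min(len, target.length); i++) {
                  int cmp = (buf.get(pos + 2 + i) & 0xFF) - (target[i] & 0xFF);
                  if (cmp != 0) return cmp;
                }
                return len - target.length;
              }
            }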

          Advantages:

          • Multiple forks can all share the same system buffer, reducing per-process memory footprint.
          • The cost to read in the index term dictionary during IndexReader startup drops to zero.
          • The OS caches for the index term dictionaries can either be allowed to warm naturally, or can be nudged into virtual memory via e.g. "cat /path/to/index/*.tis > /dev/null".
          Michael McCandless added a comment -

          Can we design a format that allows us to rely upon the operating system's virtual memory and avoid caching in process memory altogether?

          Interesting! I've been wondering what you're up to over on KS, Marvin.

          I'm not sure it'll be a win in practice: I'm not sure I'd trust the
          OS's IO cache to "make the right decisions" about what to cache. Plus
          during that binary search the IO system is loading whole pages into
          the IO cache, even though you'll only peek at the first few bytes of
          each.

          We could also explore something in-between, eg it'd be nice to
          genericize MultiLevelSkipListWriter so that it could index arbitrary
          files, then we could use that to index the terms dict. You could
          choose to spend dedicated process RAM on the higher levels of the skip
          tree, and then tentatively trust IO cache for the lower levels.

          I'd like to eventually make the TermsDict index pluggable so one could
          swap in different indexers like this (it's not now).

          Michael McCandless added a comment -

          [Attached patch]

          To test whether the new pluggable codec approach is flexible enough, I
          coded up "pulsing" (described in detail in
          http://citeseer.ist.psu.edu/cutting90optimizations.html), where
          freq/prox info is inlined into the terms dict if the term freq is < N.

          It was wonderfully simple: I just had to create a reader & a writer,
          and then switch the places that read (SegmentReader) and write
          (SegmentMerger, FreqProxTermsWriter) to use the new pulsing codec
          instead of the default one.

          The pulsing codec can "wrap" any other codec, ie, when a term is
          written, if the term's freq is < N, then it's inlined into the terms
          dict with the pulsing writer, else it's fed to the other codec for it
          to do whatever it normally would. The two codecs are strongly
          decoupled, so we can mix & match pulsing with other codecs like pfor.
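
          For illustration, a sketch of that wrapping decision; the class and
          method names below are invented for this example, not the patch's
          actual API:

            // If a term occurs in fewer than N docs, its postings are written
            // inline into the terms dict entry; otherwise they are handed to
            // the wrapped (delegate) codec.
            class PulsingPostingsWriterSketch {
              private final int maxPulsedDocs;       // the "N" threshold
              private final DelegateWriter wrapped;  // e.g. the default frq/prx writer

              PulsingPostingsWriterSketch(int maxPulsedDocs, DelegateWriter wrapped) {
                this.maxPulsedDocs = maxPulsedDocs;
                this.wrapped = wrapped;
              }

              void writeTerm(String term, int[] docs, int[][] positions, TermsDictOut out) {
                if (docs.length < maxPulsedDocs) {
                  // inline: postings live directly in the terms dict entry
                  out.writeInlined(term, docs, positions);
                } else {
                  // delegate: the wrapped codec writes to its own files and we
                  // store whatever pointer/metadata it returns
                  long pointer = wrapped.write(docs, positions);
                  out.writePointer(term, pointer);
                }
              }

              // Minimal interfaces so the sketch is self-contained.
              interface DelegateWriter { long write(int[] docs, int[][] positions); }
              interface TermsDictOut {
                void writeInlined(String term, int[] docs, int[][] positions);
                void writePointer(String term, long filePointer);
              }
            }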

          All tests pass with this pulsing codec.

          As a quick test I indexed the first 1M docs from Wikipedia, with N=2 (ie
          terms that occur only in one document are inlined into the terms
          dict). 5.4M terms get inlined (only 1 doc) and 2.2M terms are not (>
          1 doc). The final size of the index (after optimizing) was a bit
          smaller with pulsing (1120 MB vs 1131 MB).

          Michael Busch added a comment -

          I'll look into this patch soon.

          Just wanted to say: I'm really excited about the progress here, this is cool stuff!
          Great job...

          Marvin Humphrey added a comment -

          > I'm not sure I'd trust the OS's IO cache to "make the right decisions" about what to cache.

          In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard – in which case we won't have to worry about the OS swapping out those pages.

          I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems – if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.

          In other words...

          • I trust the OS to do a decent enough job on underpowered systems.
          • High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.

          More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>.

          > Plus during that binary search the IO system is loading whole pages into
          > the IO cache, even though you'll only peek at the first few bytes of each.

          I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.

          But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that.

          Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

          > We could also explore something in-between, eg it'd be nice to
          > genericize MultiLevelSkipListWriter so that it could index arbitrary
          > files, then we could use that to index the terms dict. You could
          > choose to spend dedicated process RAM on the higher levels of the skip
          > tree, and then tentatively trust IO cache for the lower levels.

          That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks. It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.

          > I'd like to eventually make the TermsDict index pluggable so one could
          > swap in different indexers like this (it's not now).

          If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.

          Michael Busch added a comment -

          We could also explore something in-between, eg it'd be nice to
          genericize MultiLevelSkipListWriter so that it could index arbitrary
          files, then we could use that to index the terms dict.

          Hmm, +1 for generalizing the MultiLevelSkipListWriter/Reader so that we can re-use it for different (custom) posting-list formats easily.
          However, I'm not so sure it's the right approach for a dictionary. A skip list is optimized for skipping forward (as the name says), so it's excellent for posting lists, which are always read from "left to right".
          However, in the term dictionary you do a binary search for the lookup term. So something like a B+Tree would probably work better. Then you can decide similar to the MultiLevelSkipListWriter how many of the upper levels you want to keep in memory and control memory consumption.
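
          As a degenerate, single-level example of "keep the upper levels in
          memory", here is a hypothetical sketch (invented names, not Lucene
          code) that holds every Nth term plus its file offset in RAM,
          binary-searches that, and lets the caller scan the on-disk terms
          dict from the returned offset:

            class InMemoryTermsIndexSketch {
              private final String[] indexTerms;  // every Nth term, sorted; assumed non-empty
              private final long[] tisOffsets;    // terms dict file offset of each indexed term

              InMemoryTermsIndexSketch(String[] indexTerms, long[] tisOffsets) {
                this.indexTerms = indexTerms;
                this.tisOffsets = tisOffsets;
              }

              /** Offset in the terms dict file from which to start scanning for target. */
              long seekOffset(String target) {
                int lo = 0, hi = indexTerms.length - 1, ans = 0;
                while (lo <= hi) {
                  int mid = (lo + hi) >>> 1;
                  if (indexTerms[mid].compareTo(target) <= 0) {
                    ans = mid;            // largest indexed term <= target so far
                    lo = mid + 1;
                  } else {
                    hi = mid - 1;
                  }
                }
                return tisOffsets[ans];
              }
            }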

          Michael McCandless added a comment -

          So something like a B+Tree would probably work better.

          I agree, btree is a better fit, though we don't need insertion & deletion operations since each segment is write once.

          Michael McCandless added a comment -

          In KS and Lucy, at least, we're focused on optimizing for the use case of dedicated search clusters where each box has enough RAM to fit the entire index/shard - in which case we won't have to worry about the OS swapping out those pages.

          I suspect that in many circumstances the term dictionary would be a hot file even if RAM were running short, but I don't think it's important to worry about maxing out performance on such systems - if the term dictionary isn't hot the posting list files are definitely not hot and search-time responsiveness is already compromised.

          In other words...

          • I trust the OS to do a decent enough job on underpowered systems.
          • High-powered systems should strive to avoid swapping entirely. To aid in that endeavor, we minimize per-process RAM consumption by maximizing our use of mmap and treating the system IO cache backing buffers as interprocess shared memory.

          These are the two extremes, but I think most common are all the apps
          in between. Take a large Jira instance, where the app itself is also
          consuming a lot of RAM, doing a lot of its own IO, etc., where perhaps
          searching is done infrequently enough relative to other operations
          that the OS may no longer think the pages you hit for the terms index
          are hot enough to keep around.

          More on designing with modern virtual memory in mind at <http://varnish.projects.linpro.no/wiki/ArchitectNotes>.

          This is a good read, but I find it overly trusting of VM.

          How can the VM system possibly make good decisions about what to swap
          out? It can't know if a page is being used for terms dict index,
          terms dict, norms, stored fields, postings. LRU is not a good policy,
          because some pages (terms index) are far far more costly to miss than
          others.

          From Java we have even more ridiculous problems: sometimes the OS
          swaps out garbage... and then massive swapping takes place when GC
          runs, swapping back in the garbage only to then throw it away. Ugh!

          I think we need to aim for consistency: a given search should not
          suddenly take 10 seconds because the OS decided to swap out a few
          critical structures (like the term index). Unfortunately we can't
          really achieve that today, especially from Java.

          I've seen my desktop OS (Mac OS X 10.5.5, based on FreeBSD) make
          stupid VM decisions: if I run something that does a single-pass
          through many GB of on-disk data (eg re-encoding a video), it then
          swaps out the vast majority of my apps even though I have 6 GB RAM. I
          hit tons (many seconds) of swapping just switching back to my mail
          client. It's infuriating. I've seen Linux do the same thing, but at
          least Linux lets you tune this behavior ("swappiness"); I had to
          disable swapping entirely on my desktop.

          Similarly, when a BG merge is burning through data, or say backup
          kicks off and moves many GB, or the simple act of iterating through a
          big postings list, the OS will gleefully evict my terms index or norms
          in order to populate its IO cache with data it won't need again for a
          very long time.

          I bet the VM system fails to show graceful degradation: if I don't
          have enough RAM to hold my entire index, then walking through postings
          lists will evict my terms index and norms, making all searches slower.

          In the ideal world, an IndexReader would be told how much RAM to use.
          It would spend that RAM wisely, eg first on the terms index, second on
          norms, third maybe on select column-stride fields, etc. It would pin
          these pages so the OS couldn't swap them out (can't do this from
          java... though as a workaround we could use a silly thread). Or, if
          the OS found itself tight on RAM, it would ask the app to free things
          up instead of blindly picking pages to swap out, which does not happen
          today.

          From Java we could try using WeakReference but I fear the
          communication from the OS -> JRE is too weak. IE I'd want my
          WeakReference cleared only when the OS is threatening to swap out my
          data structure.

          > Plus during that binary search the IO system is loading whole pages into
          > the IO cache, even though you'll only peek at the first few bytes of each.

          I'd originally been thinking of mapping only the term dictionary index files. Those are pretty small, and the file itself occupies fewer bytes than the decompressed array of term/pointer pairs. Even better if you have several search app forks and they're all sharing the same memory mapped system IO buffer.

          But hey, we can simplify even further! How about dispensing with the index file? We can just divide the main dictionary file into blocks and binary search on that.

          I'm not convinced this'll be a win in practice. You are now paying an
          even higher overhead cost for each "check" of your binary search,
          especially with something like pulsing which inlines more stuff into
          the terms dict. I agree it's simpler, but I think that's trumped by
          the performance hit.

          In Lucene java, the concurrency model we are aiming for is a single
          JVM sharing a single instance of IndexReader. I do agree, if fork()
          is the basis of your concurrency model then sharing pages becomes
          critical. However, modern OSs implement copy-on-write sharing of VM
          pages after a fork, so that's another good path to sharing?

          Killing off the term dictionary index yields a nice improvement in code and file specification simplicity, and there's no performance penalty for our primary optimization target use case.

          Have you tried any actual tests swapping these approaches in as your
          terms index impl? Tests of fully hot and fully cold ends of the
          spectrum would be interesting, but also tests where a big segment
          merge or a backup is running in the background...

          That doesn't meet the design goals of bringing the cost of opening/warming an IndexReader down to near-zero and sharing backing buffers among multiple forks.

          That's a nice goal. Our biggest cost in Lucene is warming the
          FieldCache, used for sorting, function queries, etc. Column-stride
          fields should go a ways towards improving this.

          It's also very complicated, which of course bothers me more than it bothers you. So I imagine we'll choose different paths.

          I think if we make the pluggable API simple, and capture the
          complexity inside each impl, such that it can be well tested in
          isolation, it's acceptable.

          If we treat the term dictionary as a black box, it has to accept a term and return... a blob, I guess. Whatever calls the lookup needs to know how to handle that blob.

          In my approach here, the blob is opaque to the terms dict reader: it
          simply seeks to the right spot in the tis file, and then asks the
          codec to decode the entry. TermsDictReader is entirely unaware of
          what/how is stored there.

          Marvin Humphrey added a comment -

          > Take a large Jira instance, where the app itself is also
          > consuming a lot of RAM, doing a lot of its own IO, etc., where perhaps
          > searching is done infrequently enough relative to other operations
          > that the OS may no longer think the pages you hit for the terms index
          > are hot enough to keep around.

          Search responsiveness is already compromised in such a situation, because we
          can all but guarantee that the posting list files have already been evicted
          from cache. If the box has enough RAM for the large JIRA instance including
          the Lucene index, search responsiveness won't be a problem. As soon as you
          start running a little short on RAM, though, there's no way to stop infrequent
          searches from being sluggish.

          Nevertheless, the terms index isn't that big in comparison to, say, the size
          of a posting list for a common term, so the cost of re-heating it isn't
          astronomical in the grand scheme of things.

          > Similarly, when a BG merge is burning through data, or say backup kicks off
          > and moves many GB, or the simple act of iterating through a big postings
          > list, the OS will gleefully evict my terms index or norms in order to
          > populate its IO cache with data it won't need again for a very long time.

          When that background merge finishes, the new files will be hot. So, if we
          open a new IndexReader right away and that IndexReader uses mmap() to get at
          the file data, new segments will be responsive right away.

          Even better, any IO caches for old segments used by the previous IndexReader
          may still be warm. All of this without having to decompress a bunch of stream
          data into per-process data structures at IndexReader startup.

          The terms index could indeed get evicted some of the time on busy systems, but
          the point is that the system IO cache usually works in our favor, even under
          load.

          As far as backup daemons blowing up everybody's cache, that's stupid,
          pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>. Such
          apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that the kernel
          knows it can recycle the cache pages as soon as they're cleared.

          >> But hey, we can simplify even further! How about dispensing with the index
          >> file? We can just divide the main dictionary file into blocks and binary
          >> search on that.
          >
          > I'm not convinced this'll be a win in practice. You are now paying an
          > even higher overhead cost for each "check" of your binary search,
          > especially with something like pulsing which inlines more stuff into
          > the terms dict. I agree it's simpler, but I think that's trumped by
          > the performance hit.

          I'm persuaded that we shouldn't do away with the terms index. Even if we're
          operating on a dedicated search box with gobs of RAM, loading entire cache
          pages when we only care about the first few bytes of each is poor use of
          memory bandwidth. And, just in case the cache does get blown, we'd like to
          keep the cost of rewarming down.

          Nathan Kurz and I brainstormed this subject in a phone call this morning, and
          we came up with a three-file lexicon index design:

          • A file which is a solid stack of 64-bit file pointers into the lexicon
            index term data. Term data UTF-8 byte length can be determined by
            subtracting the current pointer from the next one (or the file length at
            the end).
          • A file which contains solid UTF-8 term content. (No string lengths, no
            file pointers, just character data.)
          • A file which is a solid stack of 64-bit file pointers into the primary
            lexicon.

          Since the integers are already expanded and the raw UTF-8 data can be compared
          as-is, those files can be memory-mapped and used as-is for binary search.
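
          A sketch of a lookup against that three-file layout, written in Java
          here rather than Lucy's C, with all names invented for illustration:

            import java.nio.ByteBuffer;
            import java.nio.LongBuffer;
            import java.nio.charset.StandardCharsets;

            class ThreeFileLexiconIndex {
              private final LongBuffer termOffsets;  // offsets into termData, one per indexed term
              private final ByteBuffer termData;     // solid UTF-8 term content
              private final LongBuffer lexPointers;  // offsets into the primary lexicon file

              ThreeFileLexiconIndex(LongBuffer termOffsets, ByteBuffer termData, LongBuffer lexPointers) {
                this.termOffsets = termOffsets;
                this.termData = termData;
                this.lexPointers = lexPointers;
              }

              /** File pointer into the primary lexicon at which to start scanning for target. */
              long seek(String target) {
                byte[] t = target.getBytes(StandardCharsets.UTF_8);
                int lo = 0, hi = termOffsets.limit() - 1, ans = 0;
                while (lo <= hi) {
                  int mid = (lo + hi) >>> 1;
                  if (compareTerm(mid, t) <= 0) { ans = mid; lo = mid + 1; }
                  else                          { hi = mid - 1; }
                }
                return lexPointers.get(ans);
              }

              /** Compare indexed term #i (raw UTF-8 bytes, compared as-is) against target. */
              private int compareTerm(int i, byte[] target) {
                long start = termOffsets.get(i);
                long end = (i + 1 < termOffsets.limit()) ? termOffsets.get(i + 1) : termData.capacity();
                int len = (int) (end - start);
                for (int j = 0; j < Math.min(len, target.length); j++) {
                  int cmp = (termData.get((int) start + j) & 0xFF) - (target[j] & 0xFF);
                  if (cmp != 0) return cmp;
                }
                return len - target.length;
              }
            }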

          > In Lucene java, the concurrency model we are aiming for is a single JVM
          > sharing a single instance of IndexReader.

          When I mentioned this to Nate, he remarked that we're using the OS kernel like
          you're using the JVM.

          We don't keep a single IndexReader around, but we do keep the bulk of its data
          cached so that we can just slap a cheap wrapper around it.

          > I do agree, if fork() is the basis of your concurrency model then sharing
          > pages becomes critical. However, modern OSs implement copy-on-write sharing
          > of VM pages after a fork, so that's another good path to sharing?

          Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to
          be able to launch a cheap search process.

          > Have you tried any actual tests swapping these approaches in as your
          > terms index impl?

          No – changing something like this requires a lot of coding, so it's better to
          do thought experiments first to winnow down the options.

          > Tests of fully hot and fully cold ends of the
          > spectrum would be interesting, but also tests where a big segment
          > merge or a backup is running in the background...

          >> That doesn't meet the design goals of bringing the cost of opening/warming
          >> an IndexReader down to near-zero and sharing backing buffers among
          >> multiple forks.
          >
          > That's a nice goal. Our biggest cost in Lucene is warming the FieldCache, used
          > for sorting, function queries, etc.

          Exactly. It would be nice to add a plug-in indexing component that writes sort
          caches to files that can be memory mapped at IndexReader startup. There would
          be multiple files: both a solid array of 32-bit integers mapping document
          number to sort order, and the field cache values. Such a component would
          allow us to move the time it takes to read in a sort cache from
          IndexReader-startup-time to index-time.

          Hmm, maybe we can conflate this with a column-stride field writer and require
          that sort fields have a fixed width?
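
          A sketch of the memory-mapped sort-cache idea (hypothetical names;
          assumes a fixed-width array of big-endian 32-bit ordinals written at
          index time, one per document):

            import java.io.IOException;
            import java.nio.IntBuffer;
            import java.nio.channels.FileChannel;
            import java.nio.file.Path;
            import java.nio.file.StandardOpenOption;

            class MappedSortCache {
              private final IntBuffer ords;   // ords.get(docID) == sort order of docID

              MappedSortCache(Path file, int maxDoc) throws IOException {
                try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
                  // map the whole doc -> ordinal array; no per-process decoding needed
                  this.ords = ch.map(FileChannel.MapMode.READ_ONLY, 0, 4L * maxDoc).asIntBuffer();
                }
              }

              int sortOrder(int docID) { return ords.get(docID); }
            }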

          > In my approach here, the blob is opaque to the terms dict reader: it
          > simply seeks to the right spot in the tis file, and then asks the
          > codec to decode the entry. TermsDictReader is entirely unaware of
          > what/how is stored there.

          Sounds good. Basically, a hash lookup.

          In KS, the relevant IndexReader methods no longer take a Term object. (In
          fact, there IS no Term object any more – KinoSearch::Index::Term has been
          removed.) Instead, they take a string field and a generic "Obj".

          Lexicon*
          SegReader_lexicon(SegReader *self, const CharBuf *field, Obj *term)
          {
              return (Lexicon*)LexReader_Lexicon(self->lex_reader, field, term);
          }

          I suppose we genericize this by adding a TermsDictReader/LexReader argument to
          the IndexReader constructor? That way, someone can supply a custom subclass
          that knows how to decode custom dictionary files.

          Michael McCandless added a comment -

          OK I created another codec, SepCodec (for lack of a better name) that
          stores doc & frq & skip in 3 separate files (vs 1 for Lucene today),
          as well as positions & payloads in 2 separate files (vs 1 for Lucene
          today).

          The code is still messy – lots of nocommits all over the place. I'm
          still iterating.

          Finally, this gets us one step closer to using PFOR! With this codec,
          the .frq, .doc and .prx are now "pure" streams of ints.

          This codec was more interesting because it adds new files to the file
          format, which required fixing the various interesting places where we
          assume which file extensions belong to a segment.

          In this patch I also created a PostingCodec class, with the 3
          subclasses (so far):

          • DefaultCodec: new terms dict format, but same back-compatible
            prx/frq format
          • PulsingCodec: new terms dict format, but inlines rare terms into
            terms dict
          • SepCodec: new terms dict format, and splits doc/frq/skip into
            3 separate files, and prox/payload into 2 separate files

          By editing the PostingCodec.getCodec method you can switch all tests
          to use each codec; all tests pass using each codec.

          I built the 1M Wikipedia index, using SepCodec. Here's the ls -l:

          -rw-rw-rw-  1 mike  admin    4000004 Nov 20 17:16 _0.fdt
          -rw-rw-rw-  1 mike  admin    8000004 Nov 20 17:16 _0.fdx
          -rw-rw-rw-  1 mike  admin  303526787 Nov 20 17:34 _n.doc
          -rw-rw-rw-  1 mike  admin         33 Nov 20 17:30 _n.fnm
          -rw-rw-rw-  1 mike  admin  220470670 Nov 20 17:34 _n.frq
          -rw-rw-rw-  1 mike  admin    3000004 Nov 20 17:34 _n.nrm
          -rw-rw-rw-  1 mike  admin  651670377 Nov 20 17:34 _n.prx
          -rw-rw-rw-  1 mike  admin          0 Nov 20 17:30 _n.pyl
          -rw-rw-rw-  1 mike  admin   84963104 Nov 20 17:34 _n.skp
          -rw-rw-rw-  1 mike  admin     666999 Nov 20 17:34 _n.tii
          -rw-rw-rw-  1 mike  admin   87551274 Nov 20 17:34 _n.tis
          -rw-rw-rw-  1 mike  admin         20 Nov 20 17:34 segments.gen
          -rw-rw-rw-  1 mike  admin         64 Nov 20 17:34 segments_2
          

          Some initial observations for SepCodec:

          • Merging/optimizing was noticeably slower... I think there's some
            pending inefficiency in my changes, but it could also simply be
            that having to step through 3 (.frq, .doc, .prx) files instead of
            2 (.frq, .prx) for each segment is that much more costly. (With
            payloads it'd be 4 files instead of 2).
          • Net index size is quite a bit larger (1300 MB vs 1139 MB), I think
            because we are not efficiently encoding the frq=1 case anymore.
            PFOR should fix that.
          • Skip data is just about as large as the terms dict, which
            surprises me (I had intuitively expected it to be smaller I
            guess).
          Michael McCandless added a comment -

          Nevertheless, the terms index isn't that big in comparison to, say, the size
          of a posting list for a common term, so the cost of re-heating it isn't
          astronomical in the grand scheme of things.

          Be careful: it's the seeking that kills you (until we switch to SSDs
          at which point perhaps most of this discussion is moot!). Even though
          the terms index net size is low, if re-heating the spots you touch
          incurs 20 separate page misses, you lose.

          Potentially worse than the terms index are norms, if the search hits
          a lot of docs.

          > Take a large Jira instance...

          Search responsiveness is already compromised in such a situation, because we
          can all but guarantee that the posting list files have already been evicted
          from cache. If the box has enough RAM for the large JIRA instance including
          the Lucene index, search responsiveness won't be a problem. As soon as you
          start running a little short on RAM, though, there's no way to stop infrequent
          searches from being sluggish.

          If the term index and norms are pinned (or happen to still be hot), I
          would expect most searches to be OK with this "in the middle" use case
          because the number of seeks you'll hit should be well contained
          (assuming your posting list isn't unduly fragmented by the
          filesystem). Burning through the posting list is a linear scan.
          Queries that simply hit too many docs will always be slow anyways.

          I think at both extremes (way too little RAM and tons of RAM) both
          approaches (pinned in RAM vs mmap'd) should perform the same. It's the
          cases in between where I think letting the VM decide whether critical
          things (terms index, norms) get to stay hot is dangerous.

          The terms index could indeed get evicted some of the time on busy systems, but
          the point is that the system IO cache usually works in our favor, even under
          load.

          I think you're just more trusting of the IO/VM system. I think LRU is
          a poor metric.

          As far as backup daemons blowing up everybody's cache, that's stupid,
          pathological behavior: <http://kerneltrap.org/node/3000#comment-8573>. Such
          apps ought to be calling madvise(ptr, len, MADV_SEQUENTIAL) so that the kernel
          knows it can recycle the cache pages as soon as they're cleared.

          Excellent! If only more people knew about this. And, if only we
          could do this from javaland. EG SegmentMerger should do this for all
          segment data it's reading & writing.

          Nathan Kurz and I brainstormed this subject in a phone call this morning, and
          we came up with a three-file lexicon index design:

          I don't fully understand this approach. Would the index file pointers
          point into the full lexicon's packed utf8 file, or a separate "only
          terms in the index" packed utf8 file?

          We currently materialize individual Strings when we load our index,
          which is bad because of the GC cost, added RAM overhead (& swapping)
          and because for iso8859-1 only terms we are using 2X the space over
          utf8. So I'd love to eventually do something similar (in RAM) for
          Lucene.

          > Have you tried any actual tests swapping these approaches in as your
          > terms index impl?

          No - changing something like this requires a lot of coding, so it's better to
          do thought experiments first to winnow down the options.

          Agreed. But once you've got the mmap-based solution up and running
          it'd be nice to measure net time doing terms lookup / norms reading,
          for a variety of search use cases, and plot that on a histogram.

          When I mentioned this to Nate, he remarked that we're using the OS kernel like
          you're using the JVM.

          True!

          Lucy/KS can't enforce that, and we wouldn't want to. It's very convenient to
          be able to launch a cheap search process.

          It seems like the ability to very quickly launch brand new searchers
          is/has become a strong design goal of Lucy/KS. What's the driver
          here? Is it for near-realtime search? (Which I think may be better
          achieved by having IndexWriter export a reader, rather than using IO
          system as the intermediary).

          If we fix terms index to bulk load arrays (it's not now) then the cost
          of loading norms & terms index on instantiating a reader should be
          fairly well contained, though not as near zero as Lucy/KS will be.

          > That's a nice goal. Our biggest cost in Lucene is warming the
          > FieldCache, used for sorting, function queries, etc.

          Exactly. It would be nice to add a plug-in indexing component that
          writes sort caches to files that can be memory mapped at IndexReader
          startup. There would be multiple files: both a solid array of 32-bit
          integers mapping document number to sort order, and the field cache
          values. Such a component would allow us to move the time it takes to
          read in a sort cache from IndexReader-startup-time to index-time.

          Except I would have IndexReader use its RAM budget to pick & choose
          which of these will be hot, and which would be mmap'd.

          Hmm, maybe we can conflate this with a column-stride field writer
          and require that sort fields have a fixed width?

          Yes I think column-stride fields writer should write the docID -> ord
          part of StringIndex to disk, and MultiRangeQuery in LUCENE-1461 would
          then use it. With enumerated type of fields (far fewer unique terms
          than docs), bit packing will make them compact.
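
          As a rough illustration of that bit-packing point (a sketch only; the
          class name and layout below are made up, not the column-stride file
          format), a per-segment docID -> ord map for an enumerated field can be
          packed at ceil(log2(#unique terms)) bits per document:

            class PackedOrds {
              private final long[] blocks;
              private final int bitsPerValue;

              PackedOrds(int numDocs, int numUniqueTerms) {
                // bits needed to represent ords in [0, numUniqueTerms - 1]
                this.bitsPerValue = Math.max(1, 64 - Long.numberOfLeadingZeros(numUniqueTerms - 1));
                this.blocks = new long[(int) (((long) numDocs * bitsPerValue + 63) / 64)];
              }

              void set(int docID, int ord) {            // assumes each doc is set once
                long bitPos = (long) docID * bitsPerValue;
                int block = (int) (bitPos >>> 6);
                int shift = (int) (bitPos & 63);
                blocks[block] |= ((long) ord) << shift;
                int spill = shift + bitsPerValue - 64;
                if (spill > 0) {                        // value straddles two longs
                  blocks[block + 1] |= ((long) ord) >>> (bitsPerValue - spill);
                }
              }

              int get(int docID) {
                long bitPos = (long) docID * bitsPerValue;
                int block = (int) (bitPos >>> 6);
                int shift = (int) (bitPos & 63);
                long value = blocks[block] >>> shift;
                int spill = shift + bitsPerValue - 64;
                if (spill > 0) {
                  value |= blocks[block + 1] << (bitsPerValue - spill);
                }
                return (int) (value & ((1L << bitsPerValue) - 1));
              }
            }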

          In KS, the relevant IndexReader methods no longer take a Term
          object. (In fact, there IS no Term object any more -
          KinoSearch::Index::Term has been removed.) Instead, they take a
          string field and a generic "Obj".

          But you must at least require these Obj's to know how to compareTo one
          another? Does this mean using per-field custom sort ordering
          (collator) is straightforward for KS?

          I suppose we genericize this by adding a TermsDictReader/LexReader
          argument to the IndexReader constructor? That way, someone can
          supply a custom subclass that knows how to decode custom dictionary
          files.

          Right; that's what let me create the PulsingCodec here.

          The biggest problem with the "load important stuff into RAM" approach,
          of course, is we can't actually pin VM pages from java, which means
          the OS will happily swap out my RAM anyway, at which point of course
          we should have used mmap. Though apparently at least Windows has an
          option to "optimize for services" (= "don't swap out my RAM" I think)
          vs "optimize for applications", and Linux lets you tune swappiness.
          But both are global.

          Marvin Humphrey added a comment -

          > Nathan Kurz and I brainstormed this subject in a phone call this morning, and
          > we came up with a three-file lexicon index design:
          >
          > I don't fully understand this approach. Would the index file pointers
          > point into the full lexicon's packed utf8 file, or a separate "only
          > terms in the index" packed utf8 file?

          Just the index terms (i.e. every 128th term). We're trying to fake up an
          array of strings without having to load anything into process memory. The
          comparison would go something like this:

            /* self->text_lengths, self->char_data, and self->lex_file_ptrs are all
             * memory mapped buffers.
             */
            while (hi >= lo) {
              const i32_t mid        = lo + ((hi - lo) / 2);
              const i64_t offset     = self->text_lengths[mid];
              const i64_t mid_len    = self->text_lengths[mid + 1] - offset;
              char *const mid_text   = self->char_data + offset;
              const i32_t comparison = StrHelp_string_diff(target_text, target_len, 
                                                           mid_text, mid_len);
              if      (comparison < 0) { hi = mid - 1; }
              else if (comparison > 0) { lo = mid + 1; }
              else { 
                result = mid; 
                break;
              }
            }
            offset_into_main_lexicon = self->lex_file_ptrs[result];
            ...
          

          However, perhaps some sort of a B-tree with string prefix compression would be
          better, as per recent suggestions.

          Marvin Humphrey added a comment -

          >> Hmm, maybe we can conflate this with a column-stride field writer
          >> and require that sort fields have a fixed width?
          >
          > Yes I think column-stride fields writer should write the docID -> ord
          > part of StringIndex to disk, and MultiRangeQuery in LUCENE-1461 would
          > then use it. With enumerated type of fields (far fewer unique terms
          > than docs), bit packing will make them compact.

          How do you plan on dealing with the ord values changing as segments get
          added? The addition of a single document triggers the rewriting of the
          entire mapping.

          I was planning on having SortCacheWriter write out the docID -> ord
          mapping, but with the understanding that there was a relatively high cost so
          the module couldn't be core. The idea was to take the cost of iterating over
          the field caches during IndexReader startup, move that to index time, and write
          out a file that could be memory mapped and shared among multiple search apps.

          In theory, if we were to have only per-segment docID -> ord maps, we could
          perform inter-segment collation the same way that it's handled at the
          MultiSearcher level – by comparing the original strings. It wouldn't be that
          expensive in the grand scheme of things, because most of the work would be
          done by comparing ord values within large segments.

          Unfortunately, that won't work because segment boundaries are hidden from
          Scorers.

          >> In KS, the relevant IndexReader methods no longer take a Term
          >> object. (In fact, there IS no Term object any more -
          >> KinoSearch::Index::Term has been removed.) Instead, they take a
          >> string field and a generic "Obj".
          >
          > But you must at least require these Obj's to know how to compareTo one
          > another?

          Yes.

          > Does this mean using per-field custom sort ordering (collator) is
          > straightforward for KS?

          That's one objective. The implementation is incomplete.

          Another objective is to allow non-string term types, e.g. TimeStamp,
          Float... Hmm... how about FixedWidthText?

          Marvin Humphrey added a comment -

          >> I suppose we genericize this by adding a TermsDictReader/LexReader
          >> argument to the IndexReader constructor? That way, someone can
          >> supply a custom subclass that knows how to decode custom dictionary
          >> files.
          >
          > Right; that's what let me create the PulsingCodec here.

          I'm running into an OO design problem because of the SegmentReader/MultiReader
          bifurcation. If IndexReader were an ordinary class, and we expected all of
          its component parts to perform their own collation of data from multiple
          segments, then the API for overriding individual components would be
          straightforward:

            reader = new IndexReader(termsDictReader, postingsReader, fieldsReader);
          

          We can't do that, though, because there's logic in IndexReader.open() which
          guards against race conditions with regards to file deletion and index
          modification, and the initialization of the auxiliary reader components would
          happen outside those guards – possibly resulting in sub-components within an
          IndexReader object reading from different versions of the index.

          Using setters a la reader.setTermsDictReader(termsDictReader) is problematic
          for the same reason.

          Are factory methods the only way to handle adding or replacing components
          within IndexReader?

          KS forces people to subclass Schema to define their index, but up till now
          there hasn't been anything that would affect the complement of major
          sub-components within IndexReader or InvIndexer (=IndexWriter). I suppose
          Schema is the right place to put stuff like this, but it seems a lot more
          elaborate than the factory method which returns the index's default Analyzer.

          Marvin Humphrey added a comment -

          > Be careful: it's the seeking that kills you (until we switch to SSDs
          > at which point perhaps most of this discussion is moot!). Even though
          > the terms index net size is low, if re-heating the spots you touch
          > incurs 20 separate page misses, you lose.

          Perhaps for such situations, we can make it possible to create custom
          HotLexiconReader or HotIndexReader subclasses that slurp term index files and
          what-have-you into process memory. Implementation would be easy, since we can
          just back the InStreams with malloc'd RAM buffers rather than memory mapped
          system buffers.

          Consider the tradeoffs. On the one hand, if we rely on memory mapped buffers,
          busy systems may experience sluggish search after long lapses in a worst case
          scenario. On the other hand, reading a bunch of stuff into process memory
          makes IndexReader a lot heavier, with large indexes imposing consistently
          sluggish startup and a large RAM footprint on each object.

          > It seems like the ability to very quickly launch brand new searchers
          > is/has become a strong design goal of Lucy/KS. What's the driver
          > here? Is it for near-realtime search?

          Near-realtime search is one of the motivations. But lightweight IndexReaders
          are more convenient in all sorts of ways.

          Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever
          indexes get modified underneath a persistent search service. This is
          certainly a problem when you are trying to keep up with real-time insertions,
          but it is also a problem with batch updates or optimization passes.

          With lightweight IndexReaders, you can check whether the index has been
          modified as requests come in, launch a new Searcher if it has, then deal with
          the request after a negligible delay. You have to warm the system io caches
          when the service starts up ("cat /path/to/index/* > /dev/null"), but after
          that, there's no more need for background warming.
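
          A bare-bones sketch of that check-and-relaunch pattern, written here
          with older (3.x-era) Lucene classes purely for illustration; real code
          would also coordinate closing the stale reader with in-flight
          requests:

            import java.io.IOException;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.IndexSearcher;
            import org.apache.lucene.search.Query;
            import org.apache.lucene.search.TopDocs;
            import org.apache.lucene.store.Directory;

            class FreshSearcher {
              private final Directory dir;
              private volatile IndexSearcher current;
              private volatile long version = -1;

              FreshSearcher(Directory dir) { this.dir = dir; }

              private synchronized void maybeReopen() throws IOException {
                long latest = IndexReader.getCurrentVersion(dir);
                if (latest != version) {                  // index changed: launch a new searcher
                  IndexSearcher fresh = new IndexSearcher(IndexReader.open(dir));
                  IndexSearcher stale = current;
                  current = fresh;
                  version = latest;
                  if (stale != null) {
                    stale.getIndexReader().close();       // real code: wait for in-flight searches
                  }
                }
              }

              TopDocs search(Query query, int n) throws IOException {
                maybeReopen();                            // negligible when nothing changed
                return current.search(query, n);
              }
            }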

          Lightweight IndexReaders can also be sprinkled liberally around source code in
          a way that heavy IndexReaders cannot. For instance, each thread in a
          multi-threaded server can have its own Searcher.

          Launching cheap search processes is also important when writing tools akin to
          the Unix command line 'locate' app. The first time you invoke locate it's
          slow, but subsequent invocations are nice and quick. You can only mimic that
          with a lightweight IndexReader.

          And so on... The fact that segment data files are never modified once written
          makes the Lucene/Lucy/KS file format design particularly well suited for
          memory mapping and sharing via the system buffers. In addition to the reasons
          cited above, intuition tells me that this is the right design decision and
          that there will be other opportunities not yet anticipated. I don't see how Lucy
          can deny such advantages to most users for the sake of those few for whom
          term dictionary cache eviction proves to be a problem, especially when we can
          offer those users a remedy.

          > The biggest problem with the "load important stuff into RAM" approach,
          > of course, is we can't actually pin VM pages from java, which means
          > the OS will happily swap out my RAM anyway, at which point of course
          > we should have used mmap.

          We can't realistically pin pages from C, either, at least on Unixen. Modern
          Unixen offer the mlock() command, but it has a crucial limitation – you have to
          run it as root.

          Also, there aren't any madvise() flags that hint to the OS that the mapped
          region should stay hot. The closest thing is MADV_WILLNEED, which
          communicates "this will be needed soon" – not "keep this around".

          Michael McCandless added a comment -

          Just the index terms (i.e. every 128th term). We're trying to fake up an
          array of strings without having to load anything into process memory. The
          comparison would go something like this:

          OK this makes sense. We could do something similar in Lucene. Not
          creating String objects is nice. I wonder in practice how much time
          we are "typically" spending loading the terms index...

          However, perhaps some sort of a B-tree with string prefix compression would be
          better, as per recent suggestions.

          B-tree or FST/trie or ... something.

          Actually: I just realized the terms index need not store all suffixes
          of the terms it stores. Only unique prefixes (ie a simple letter
          trie, not FST). Because, its goal is to simply find the spot in the
          main lexicon file to seek to and then scan from. This makes it even
          smaller!

          Though, if we want to do neat things like respelling, wildcard/prefix
          searching, etc., which reduce to graph-intersection problems, we would
          need the suffix and we would need the entire lexicon (not just every
          128th index term) compiled into the FST.
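
          A small sketch of that "unique prefixes only" idea (illustrative code,
          not part of the patch): for each term chosen for the index, it is
          enough to keep one character beyond its common prefix with the lexicon
          term that precedes it, since the index only has to pick a seek point
          to scan forward from:

            class TermIndexPrefix {
              // Shortest prefix of indexedTerm that still sorts strictly after previousTerm.
              static String indexPrefix(String previousTerm, String indexedTerm) {
                int common = 0;
                int max = Math.min(previousTerm.length(), indexedTerm.length());
                while (common < max && previousTerm.charAt(common) == indexedTerm.charAt(common)) {
                  common++;
                }
                return indexedTerm.substring(0, Math.min(indexedTerm.length(), common + 1));
              }
            }

            // e.g. indexPrefix("absolutely", "absorb") == "absor": it sorts after
            // "absolutely" and at-or-before "absorb", so a binary search over such
            // prefixes still lands at (or just before) the right block of the lexicon.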

          Michael McCandless added a comment -

          How do you plan on dealing with the ord values changing as segments get
          added? The addition of a single document triggers the rewriting of the
          entire mapping.

          ...

          Unfortunately, that won't work because segment boundaries are hidden from
          Scorers.

          This is a big challenge – presenting a merged docID->ord map for a
          MultiSegmentReader is very costly.

          I think, just like we are pushing for column-stride / FieldCache to be
          "per segment" instead of one big merged array, we should move in the
          same direction for searching?

          Ie, if one did all searching with MultiSearcher, it should work well.
          Each segment uses its pre-computed (during indexing) docID->ord
          mapping. Merge-sorting the results from each searcher ought to be low
          cost since you only need to lookup the string values for the top N
          docs (though care must be taken to not incur N seeks for this... eg
          perhaps each reader, on hitting a doc that makes it into the pqueue,
          should then seek&load the String value from column-stride store?). An
          optimized index wouldn't need to read any of the actual string values
          since no results merging is needed.

          For the RangeFilter impl in LUCENE-1461 (which'd use the docID->order
          per segment, using MultiSearcher), string values are never needed.
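
          A rough sketch of the merge step being described (hypothetical names,
          modern Java for brevity): each segment ranks its own hits by its
          per-segment ord map, and string sort keys are only looked up for the
          few docs that survive into each segment's top N, where the
          MultiSearcher-style merge falls back to comparing strings across
          segments:

            import java.util.ArrayList;
            import java.util.Comparator;
            import java.util.List;
            import java.util.PriorityQueue;

            class PerSegmentMergeSketch {
              static class SegmentTopN {
                final int[] topDocs;      // segment-local doc IDs, best first (ranked by ord)
                final String[] sortKeys;  // looked up once per surviving doc, not per comparison
                SegmentTopN(int[] topDocs, String[] sortKeys) {
                  this.topDocs = topDocs;
                  this.sortKeys = sortKeys;
                }
              }

              /** Returns up to n (segmentIndex, segment-local docID) pairs in global sort order. */
              static List<int[]> merge(List<SegmentTopN> segments, int n) {
                PriorityQueue<int[]> pq = new PriorityQueue<>(
                    Comparator.comparing((int[] e) -> segments.get(e[0]).sortKeys[e[1]]));
                for (int s = 0; s < segments.size(); s++) {
                  if (segments.get(s).topDocs.length > 0) pq.add(new int[] {s, 0});
                }
                List<int[]> merged = new ArrayList<>();
                while (!pq.isEmpty() && merged.size() < n) {
                  int[] e = pq.poll();
                  SegmentTopN seg = segments.get(e[0]);
                  merged.add(new int[] {e[0], seg.topDocs[e[1]]});
                  if (e[1] + 1 < seg.topDocs.length) {
                    pq.add(new int[] {e[0], e[1] + 1});  // advance within that segment
                  }
                }
                return merged;
              }
            }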

          > Does this mean using per-field custom sort ordering (collator) is
          > straightforward for KS?

          That's one objective. The implementation is incomplete.

          Another objective is to allow non-string term types, e.g. TimeStamp,
          Float... Hmm... how about FixedWidthText?

          Neat!

          Michael McCandless added a comment -

          If IndexReader were an ordinary class, and we expected all of
          its component parts to perform their own collation of data from multiple
          segments, then the API for overriding individual components would be
          straightforward:

          reader = new IndexReader(termsDictReader, postingsReader, fieldsReader);

          We can't do that, though, because there's logic in IndexReader.open() which
          guards against race conditions with regards to file deletion and index
          modification, and the initialization of the auxiliary reader components would
          happen outside those guards - possibly resulting in sub-components within an
          IndexReader object reading from different versions of the index.

          I think you "just" have to have "index version data" that's
          collectively read/written, atomically, and is then used to init all
          the components. This is what segments_N is in Lucene (and I think
          "Schema" is in KS/Lucy?): it contains all details that all
          sub-components need.

          If init'ing each sub-component is then costly (opening files,
          slurping things in, etc.) it's OK because they are all still loading a
          consistent commit point.

          Michael McCandless added a comment -

          > Be careful: it's the seeking that kills you (until we switch to SSDs
          > at which point perhaps most of this discussion is moot!). Even though
          > the terms index net size is low, if re-heating the spots you touch
          > incurs 20 separate page misses, you lose.

          Perhaps for such situations, we can make it possible to create custom
          HotLexiconReader or HotIndexReader subclasses that slurp term index files and
          what-have-you into process memory. Implementation would be easy, since we can
          just back the InStreams with malloc'd RAM buffers rather than memory mapped
          system buffers.

          Consider the tradeoffs. On the one hand, if we rely on memory mapped buffers,
          busy systems may experience sluggish search after long lapses in a worst case
          scenario. On the other hand, reading a bunch of stuff into process memory
          makes IndexReader a lot heavier, with large indexes imposing consistently
          sluggish startup and a large RAM footprint on each object.

          I think this is a fabulous solution. If you make things so pluggable
          that you can choose to swap in "mmap this thing" vs "slurp in this
          thing" and it's the same interface presented to the consumer, then we
          don't need to resolve this debate now. Put both out in the field and
          gather data...
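
          A minimal sketch of that pluggable choice (nothing Lucene-specific;
          the names are made up): both paths hand the consumer the same
          ByteBuffer view, so "mmap it" vs "slurp it" becomes a policy decision
          rather than an API difference:

            import java.io.IOException;
            import java.io.RandomAccessFile;
            import java.nio.ByteBuffer;
            import java.nio.channels.FileChannel;

            class BufferSource {
              /** mmap: let the OS page the file in and out on demand. */
              static ByteBuffer mmap(String path) throws IOException {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                     FileChannel ch = raf.getChannel()) {
                  return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
                }
              }

              /** slurp: copy the whole file onto the process heap so it lives with the JVM. */
              static ByteBuffer slurp(String path) throws IOException {
                try (RandomAccessFile raf = new RandomAccessFile(path, "r");
                     FileChannel ch = raf.getChannel()) {
                  ByteBuffer buf = ByteBuffer.allocate((int) ch.size());  // sketch: assumes < 2 GB
                  while (buf.hasRemaining() && ch.read(buf) != -1) {
                    // keep reading until the file is fully buffered
                  }
                  buf.flip();
                  return buf;
                }
              }
            }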

          Elaborate pre-warming rituals are necessary with heavy IndexReaders whenever
          indexes get modified underneath a persistent search service. This is
          certainly a problem when you are trying to keep up with real-time insertions,
          but it is also a problem with batch updates or optimization passes.

          With lightweight IndexReaders, you can check whether the index has been
          modified as requests come in, launch a new Searcher if it has, then deal with
          the request after a negligible delay. You have to warm the system io caches
          when the service starts up ("cat /path/to/index/* > /dev/null"), but after
          that, there's no more need for background warming.

          Well ...that cat command can be deadly for a large index, too? You've
          replaced the elaborate pre-warming ritual (= run certain queries that
          you know will populate various caches) with a cat command that doesn't
          distinguish what's important (norms, terms index, certain docID->ord
          maps, certain column-stride-fields, etc.) from what's less important.

          Lightweight IndexReaders can also be sprinkled liberally around source code in
          a way that heavy IndexReaders cannot. For instance, each thread in a
          multi-threaded server can have its own Searcher.

          Launching cheap search processes is also important when writing tools akin to
          the Unix command line 'locate' app. The first time you invoke locate it's
          slow, but subsequent invocations are nice and quick. You can only mimic that
          with a lightweight IndexReader.

          This is indeed nice. I think the two approaches boil down to "pay up
          front & reuse" (Lucene, slurping) vs "pay as you go & discard"
          (KS/Lucy, mmap'ing).

          And so on... The fact that segment data files are never modified once written
          makes the Lucene/Lucy/KS file format design particularly well suited for
          memory mapping and sharing via the system buffers. In addition to the reasons
          cited above, intuition tells me that this is the right design decision and
          that there will be other opportunities not yet anticipated. I don't see how Lucy
          can deny such advantages to most users for the sake of those few for whom
          term dictionary cache eviction proves to be a problem, especially when we can
          offer those users a remedy.

          [BTW the ZFS filesystem gets many of its nice properties for the same
          reason – "write once", at the file block level.]

          Lucene java takes advantage of that 'write once' nature during
          IndexReader.reopen(). If we can finally push FieldCache, norms,
          docID->ord to be per-reader then the reopen of a MultiSearcher should
          be a lot better than it is today.

          > The biggest problem with the "load important stuff into RAM" approach,
          > of course, is we can't actually pin VM pages from java, which means
          > the OS will happily swap out my RAM anyway, at which point of course
          > we should have used mmap.

          We can't realistically pin pages from C, either, at least on Unixen. Modern
          Unixen offer the mlock() command, but it has a crucial limitation - you have to
          run it as root.

          Also, there aren't any madvise() flags that hint to the OS that the mapped
          region should stay hot. The closest thing is MADV_WILLNEED, which
          communicates "this will be needed soon" - not "keep this around".

          Alas.

          The only fallback is gross system-level tunings ("swappiness" on Linux
          and "Adjust for best performance of: Programs/System Cache" on Windows
          Server 2003, at least).

          Or also a silly "keep warm" thread...

          Marvin Humphrey added a comment -

          > I think you "just" have to have "index version data" that's
          > collectively read/written, atomically, and is then used to init all
          > the components. This is what segments_N is in Lucene (and I think
          > "Schema" is in KS/Lucy?): it contains all details that all
          > sub-components need.

          The equivalent to segments_N in KinoSearch is snapshot_N.meta, which is
          encoded as JSON. There's a KinoSearch::Index::Snapshot class that's
          responsible for reading/writing it.

          KinoSearch::Schema is for defining your index: global field properties,
          default Analyzer, etc. It's similar to Solr's schema.xml, but implemented as
          an abstract class that users are required to subclass. Translated to Java,
          the subclassing might look something like this:

            class MySchema extends Schema {
              class URLField extends TextField {
                  boolean analyzed() { return false; }
                  boolean indexed() { return false; }
              }
          
              void initFields() {
                addField("title", "text");
                addField("content", "text");
                addField("url", new URLField());
              }
          
              Analyzer analyzer() {
                return new PolyAnalyzer("en");
              }
            }
          

          I anticipate that Lucy will adopt both Schema and Snapshot in some form, but
          after discussion.

          > If init'ing each sub-component is then costly (opening files,
          > slurping things in, etc.) its OK because they are all still loading a
          > consistent commit point.

          So, something like this prospective Lucy code? (Lucy with Java bindings, that is.)

            MySchema schema = new MySchema();
            Snapshot snapshot = new Snapshot((Schema)schema);
            snapshot.readSnapShot("/path/to/index");
            MyTermsDictReader termsDictReader = new MyTermsDictReader(schema, snapshot);
            IndexReader reader = new IndexReader(schema, snapshot, null, null,
                                                 (TermsDictReader)termsDictReader);
          

          What if index files get deleted out from under that code block? The user will
          have to implement retry logic.

          Michael McCandless added a comment -

          The equivalent to segments_N in KinoSearch is snapshot_N.meta, which is
          encoded as JSON. There's a KinoSearch::Index::Snapshot class that's
          responsible for reading/writing it.

          KinoSearch::Schema is for defining your index: global field properties,
          default Analyzer, etc. It's similar to Solr's schema.xml, but implemented as
          an abstract class that users are required to subclass. Translated to Java,
          the subclassing might look something like this:

          OK got it.

          What if index files get deleted out from under that code block? The
          user will have to implement retry logic.

          I would think this "openReader" method would live inside Lucy/KS, and
          would in fact implement its own retry logic (to load the next snapshot
          and try again). I must be missing some part of the question here...

          Marvin Humphrey added a comment -

          > I would think this "openReader" method would live inside Lucy/KS, and
          > would in fact implement its own retry logic (to load the next snapshot
          > and try again). I must be missing some part of the question here...

          If the retry code lives inside of IndexReader, then the only way to get the
          IndexReader to use e.g. a subclassed TermsDictReader is to subclass
          IndexReader and override a factory method:

            class MyIndexReader extends IndexReader {
              TermsDictReader makeTermsDictReader() {
                return (TermsDictReader) new MyTermsDictReader(invindex, snapshot);
              }
            }
          
            InvIndex invindex = MySchema.open("/path/to/index");
            IndexReader reader = (IndexReader) new MyIndexReader(invindex);
          

          I was hoping to avoid forcing the user to subclass IndexReader, but I think
          the need for retry logic during open() precludes that possibility.

          Michael McCandless added a comment -

          > I was hoping to avoid forcing the user to subclass IndexReader, but I
          > think the need for retry logic during open() precludes that
          > possibility.

          How about the caller provides a codec instance which when asked will
          return a TermsDictReader "matching" the codec that had been used to
          write the index?

          Then open() implements the retry logic, asking the codec to load each
          part of the index?

          That's roughly the approach I'm taking here (on the next iteration of the
          patch, hopefully soon), though I'm only tackling the postings now (not
          yet norms, stored fields, term vectors, or field infos).
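
          To make the shape concrete, here's a rough sketch of that (Directory,
          Snapshot, Codec, TermsDictReader and the IndexReader constructor below are
          all illustrative stand-ins, not the actual patch):

          // Hypothetical: open() owns the retry loop and asks a caller-supplied
          // codec for a reader matching whatever codec wrote the commit point.
          public static IndexReader open(Directory dir, Codec codec) throws IOException {
            IOException lastFailure = null;
            for (int attempt = 0; attempt < 5; attempt++) {
              // Load the newest commit point; everything below reads against it.
              Snapshot snapshot = Snapshot.readLatest(dir);
              try {
                TermsDictReader terms = codec.termsDictReader(dir, snapshot);
                return new IndexReader(dir, snapshot, terms);
              } catch (FileNotFoundException fnfe) {
                // A concurrent writer/merge deleted files out from under us;
                // loop around, re-read the snapshot, and try again.
                lastFailure = fnfe;
              }
            }
            throw lastFailure;
          }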

          Marvin Humphrey added a comment -

          >> We're trying to fake up an array of strings without having to load anything
          >> into process memory.

          > We could do something similar in Lucene. Not creating String objects is
          > nice.

          OK, assume that you slurp all three files. Here's the code from above, ported
          from C to Java.

          // Assumed inputs, all slurped from the three files:
          //   termUTF8bytes            - every term text, concatenated as UTF-8 bytes
          //   textLengths              - cumulative byte offsets into termUTF8bytes
          //                              (numTerms + 1 entries, so entry i+1 minus
          //                               entry i is term i's length)
          //   mainTermDictFilePointers - file offset of each term's entry in the
          //                              main terms dictionary
          int lo = 0, hi = numTerms - 1, result = -1;
          while (hi >= lo) {
            int  mid           = lo + ((hi - lo) / 2);
            long midTextOffset = textLengths[mid];
            long midTextLength = textLengths[mid + 1] - midTextOffset;
            int  comparison    = StringHelper.compareUTF8Bytes(
                                    targetUTF8Bytes, 0, targetLength,
                                    termUTF8bytes, midTextOffset, midTextLength);
            if      (comparison < 0) { hi = mid - 1; }
            else if (comparison > 0) { lo = mid + 1; }
            else {
              result = mid;
              break;
            }
          }
          long offsetIntoMainTermDict = mainTermDictFilePointers[result];
          ...
          

          Other than the slurping, the only significant difference is the need for the
          comparison routine to take a byte[] array and an offset, rather than a char*
          pointer.

          You can also use FileChannels to memory map this stuff, right? (Have to be
          careful on 32-bit systems, though.)
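
          For reference, here's a minimal sketch of memory mapping a term-data file
          with java.nio (the file name is made up; note that a single mapping is
          limited to 2 GB since MappedByteBuffer is int-indexed):

          import java.io.IOException;
          import java.io.RandomAccessFile;
          import java.nio.MappedByteBuffer;
          import java.nio.channels.FileChannel;

          class TermTextMapper {
            // Map the concatenated term texts read-only; the OS pages bytes in on
            // demand, so nothing gets slurped onto the Java heap.
            static MappedByteBuffer map(String path) throws IOException {
              RandomAccessFile raf = new RandomAccessFile(path, "r");
              try {
                FileChannel channel = raf.getChannel();
                return channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
              } finally {
                raf.close();
              }
            }
          }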

          > B-tree or FST/trie or ... something.

          Much to my regret, my tree algorithm vocabulary is limited – I haven't spent
          enough time coding such projects that I can intuit sophisticated solutions.
          So I'll be counting on you, Jason Rutherglen, and Eks Dev to suggest
          appropriate algorithms based on your experience.

          Our segment-based inverted index term dictionary has a few defining
          characteristics.

          First, a lot of tree algorithms are optimized to a greater or lesser extent
          for insertion speed, but we hardly care about that at all. We can spend all
          the cycles we need at index-time balancing nodes within a segment, and once
          the tree is written out, it will never be updated.

          Second, when we are writing out the term dictionary at index-time, the raw
          data will be fed into the writer in sorted order as iterated values, one
          term/term-info pair at a time. Ideally, the writer would be able to serialize
          the tree structure during this single pass, but it could also write a
          temporary file during the terms iteration then write a final file afterwards.
          The main limitation is that the writer will never be able to "see" all
          terms at once as an array.

          Third, at read-time we're going to have one of these trees per segment. We'd
          really like to be able to conflate them somehow. KinoSearch actually
          implements a MultiLexicon class which keeps SegLexicons in a PriorityQueue;
          MultiLexicon_Next() advances the queue to the next unique term. However,
          that's slow, unwieldy, and inflexible. Can we do better?
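
          For reference, that kind of PriorityQueue merge looks roughly like this in
          Java (SegTermIterator is just a hypothetical stand-in for a sorted
          per-segment term enum):

          import java.util.List;
          import java.util.PriorityQueue;

          // Stand-in for a per-segment lexicon positioned on a term.
          interface SegTermIterator {
            String current();   // current term text; only valid after next() returned true
            boolean next();     // advance; false once the segment is exhausted
          }

          class MultiTermIterator {
            // Order the sub-iterators by their current term.
            private final PriorityQueue<SegTermIterator> queue =
                new PriorityQueue<SegTermIterator>(
                    (a, b) -> a.current().compareTo(b.current()));

            MultiTermIterator(List<SegTermIterator> segments) {
              for (SegTermIterator seg : segments) {
                if (seg.next()) {
                  queue.add(seg);   // prime each non-empty segment
                }
              }
            }

            // Next unique term across all segments, or null when everything is exhausted.
            String next() {
              if (queue.isEmpty()) {
                return null;
              }
              String term = queue.peek().current();
              // Pop every segment positioned on this term; re-insert those with terms left.
              while (!queue.isEmpty() && queue.peek().current().equals(term)) {
                SegTermIterator seg = queue.poll();
                if (seg.next()) {
                  queue.add(seg);
                }
              }
              return term;
            }
          }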

          > Actually: I just realized the terms index need not store all suffixes
          > of the terms it stores. Only unique prefixes (ie a simple letter
          > trie, not FST). Because, its goal is to simply find the spot in the
          > main lexicon file to seek to and then scan from. This makes it even
          > smaller!

          It would be ideal if we could separate the keys from the values and put all
          the keys in a single file.

          > Though, if we want to do neat things like respelling, wildcard/prefix
          > searching, etc., which reduce to graph-intersection problems, we would
          > need the suffix and we would need the entire lexicon (not just every
          > 128th index term) compiled into the FST.

          The main purpose of breaking out a separate index structure is to avoid binary
          searching over the large primary file. There's nothing special about the
          extra file – in fact, it's a drawback that it doesn't include all terms. If
          we can jam all the data we need to binary search against into the front of the
          file, but include the data for all terms in an infrequently-accessed tail, we
          win.

          Marvin Humphrey added a comment -

          > How about the caller provides a codec instance which when asked will
          > return a TermsDictReader "matching" the codec that had been used to
          > write the index?

          OK, it makes sense to have the user access these capabilities via a single
          handle at both index-time and search-time. However, for Lucy/KS, the handle
          should definitely be specified via the Schema subclass rather than via
          constructor argument.

          "Codec" isn't really the right name for this, though. "IndexComponent",
          maybe? Lucy would have three main index components by default:
          LexiconComponent, PostingsComponent, StorageComponent.

          // If Lucy's Schema class were implemented in Java instead of C...
          abstract class Schema extends Obj {
            LexiconComponent lexiconComponent() { return new LexiconComponent(); }
            PostingsComponent postingsComponent() { return new PostingsComponent(); }
            StorageComponent storageComponent() { return new StorageComponent(); }
            ...
          }
          

          Auxiliary IndexComponents might include TermVectorsComponent,
          SortCacheComponent, ColumnStrideComponent, RTreeComponent, etc.

          Here's example code for overriding the default LexiconComponent:

          // Implements term dictionary as a hash table with term texts as keys.
          class HashLexiconComponent extends LexiconComponent {
            LexiconReader makeReader(InvIndex invindex, Snapshot snapshot) {
              SegInfos segInfos = snapshot.getSegInfos();
              if (segInfos.size() == 1) { 
                return (LexiconReader) new SegHashLexiconReader(invindex, snapshot);
              }
              else {
                return (LexiconReader) new MultiHashLexiconReader(invindex, snapshot);
              }
            }
          
            LexiconWriter makeWriter(InvIndex invindex, SegInfo segInfo) {
              return (LexiconWriter) new HashLexiconWriter(invindex, segInfo);
            }
          }
          
          // [User code]
          class MySchema extends Schema {
            LexiconComponent lexiconComponent() {
              return (LexiconComponent) new HashLexiconComponent();
            }
            ...
          }
          
          Marvin Humphrey added a comment -

          > I think, just like we are pushing for column-stride / FieldCache to be
          > "per segment" instead of one big merged array, we should move in the
          > same direction for searching?

          Algorithmically speaking, it would definitely help this specific task, and
          that's a BIG FAT PLUS. This, plus memory mapping and writing the DocID -> ord
          map at index-time, allows us to totally eliminate the current cost of loading
          sort caches at IndexReader startup. The question is, how easy is it to
          refactor our search OO hierarchy to support it?

          If our goal is minimal impact to the current model, we worry only about the
          TopFieldDocs search() method. We can hack in per-segment bookending via doc
          number to the hit collection routine, initializing the TopFieldDocCollector
          each segment (either creating a new one or popping all the collected docs).

          But does it make sense to be more aggressive? Should Searchers run hit
          collection against individual segments? Should Scorers only be compiled
          against single segments?

          Maybe so. I implemented pruning (early termination) in KS, and it had to be
          done per segment. This is because you have to sort the documents within a
          segment according to the primary criteria you want to prune on (typically doc
          boost). I've since ripped out that code because it was adding too much
          complexity, but maybe there would have been less complexity if segments were
          closer to the foreground.
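
          To make that concrete, a minimal sketch of per-segment hit collection
          (every name below is hypothetical, deliberately not the existing search
          API):

          // The collector is told each segment's doc-id base, so scorers only ever
          // deal in segment-local doc ids.
          interface SegScorer {
            boolean next();
            int doc();      // segment-local doc id
            float score();
          }

          interface SegCollector {
            void setNextSegment(int docBase);               // called before each segment
            void collect(int segmentLocalDoc, float score); // global id = docBase + doc
          }

          class PerSegmentSearch {
            static void search(SegScorer[] scorers, int[] maxDocs, SegCollector collector) {
              int docBase = 0;
              for (int i = 0; i < scorers.length; i++) {
                collector.setNextSegment(docBase);
                while (scorers[i].next()) {
                  collector.collect(scorers[i].doc(), scorers[i].score());
                }
                docBase += maxDocs[i];    // each segment contributes maxDoc() ids
              }
            }
          }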

          Marvin Humphrey added a comment -

          > Well ...that cat command can be deadly for a large index, too?

          It will be costly for a large index, and it wouldn't be appropriate in all
          cases. The use case I was thinking of was: dedicated server with gobs of RAM.
          The index could either be updated often or not updated at all. Pre-existing
          segments stay warm on such a box, and the writer would leave the latest
          segment hot, so the cat command would only be needed once, at the startup of
          the persistent service.

          Michael McCandless added a comment -

          > OK, assume that you slurp all three files. Here's the code from above, ported
          > from C to Java.

          Looks good!

          > You can also use FileChannels to memory map this stuff, right? (Have to be
          > careful on 32-bit systems, though.)

          Yes.

          > First, a lot of tree algorithms are optimized to a greater or lesser extent
          > for insertion speed, but we hardly care about that at all. We can spend all
          > the cycles we need at index-time balancing nodes within a segment, and once
          > the tree is written out, it will never be updated.

          Right, neither inserts nor deletes matter to us.

          > Second, when we are writing out the term dictionary at index-time, the raw
          > data will be fed into the writer in sorted order as iterated values, one
          > term/term-info pair at a time. Ideally, the writer would be able to serialize
          > the tree structure during this single pass, but it could also write a
          > temporary file during the terms iteration then write a final file afterwards.
          > The main limitation is that the writer will never be able to "see" all
          > terms at once as an array.

          Lucene differs from Lucy/KS in this. For Lucene, when flushing a
          new segment, we can assume you can see all Terms in RAM at once. We
          don't make use of this today (it's a simple iteration that's given to
          the consumer), but we could. In Lucene, when RAM is full, we flush a
          real segment (but KS flushes a "run" which I think is more of a raw
          dump, ie, you don't build lexicon trees during that?).

          However, for both Lucene and Lucy/KS, during merging one cannot assume
          the entire lexicon can be in RAM at once. But then, during merging
          you could in theory merge the trees rather than expanded terms.

          I think for starters at least we should stick with the simple
          shared-prefix-compression we have today.

          > Third, at read-time we're going to have one of these trees per segment. We'd
          > really like to be able to conflate them somehow. KinoSearch actually
          > implements a MultiLexicon class which keeps SegLexicons in a PriorityQueue;
          > MultiLexicon_Next() advances the queue to the next unique term. However,
          > that's slow, unwieldy, and inflexible. Can we do better?

          Continuing the move towards pushing searching closer to the segments
          (ie, using MultiSearcher instead of MultiReader), I think we should
          not try to conflate the terms dict?

          > It would be ideal if we could separate the keys from the values and put all
          > the keys in a single file.

          Why not inline the value with the key? The pointer to the value just
          consumes extra space. I think "value" in this context is the long
          offset into the main terms dict file, which then stores the "real
          [opaque] value" for each term.

          >> Though, if we want to do neat things like respelling, wildcard/prefix
          >> searching, etc., which reduce to graph-intersection problems, we would
          >> need the suffix and we would need the entire lexicon (not just every
          >> 128th index term) compiled into the FST.
          >
          > The main purpose of breaking out a separate index structure is to avoid binary
          > searching over the large primary file. There's nothing special about the
          > extra file - in fact, it's a drawback that it doesn't include all terms. If
          > we can jam all the data we need to binary search against into the front of the
          > file, but include the data for all terms in an infrequently-accessed tail, we
          > win.

          And, if your terms index is held in RAM, another purpose is to minimize
          its net size and its decode cost on loading.

          Michael McCandless added a comment -

          > OK, it makes sense to have the user access these capabilities via a single
          > handle at both index-time and search-time. However, for Lucy/KS, the handle
          > should definitely be specified via the Schema subclass rather than via
          > constructor argument.
          >
          > "Codec" isn't really the right name for this, though. "IndexComponent",
          > maybe? Lucy would have three main index components by default:
          > LexiconComponent, PostingsComponent, StorageComponent.

          Well, maybe both? Ie, each of these IndexComponents could have many
          different codecs to write/read the data to/from the index. So when I
          implement PostingsComponent, when writing a segment I could choose my
          own codec; when reading it, I retrieve the matching codec to decode
          it.
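
          As a sketch of that write-name/read-name round trip (CodecRegistry is
          purely illustrative; the real classes may end up looking quite different):

          import java.util.HashMap;
          import java.util.Map;

          // Codecs register under a name; the name gets recorded per segment at
          // write time and is used to fetch the matching codec again at read time.
          // C is whatever the per-component codec base class turns out to be
          // (PostingsCodec, LexiconCodec, ...).
          class CodecRegistry<C> {
            private final Map<String, C> codecs = new HashMap<String, C>();

            void register(String name, C codec) {
              codecs.put(name, codec);
            }

            // Read path: decode a segment with whichever codec wrote it.
            C forSegment(String nameRecordedInSegment) {
              C codec = codecs.get(nameRecordedInSegment);
              if (codec == null) {
                throw new IllegalStateException("no codec registered as: " + nameRecordedInSegment);
              }
              return codec;
            }
          }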

          Subclassing Schema seems like the right approach.

          Michael McCandless added a comment -

          >> I think, just like we are pushing for column-stride / FieldCache to be
          >> "per segment" instead of one big merged array, we should move in the
          >> same direction for searching?
          >
          > Algorithmically speaking, it would definitely help this specific task, and
          > that's a BIG FAT PLUS. This, plus memory mapping and writing the DocID -> ord
          > map at index-time, allows us to totally eliminate the current cost of loading
          > sort caches at IndexReader startup. The question is, how easy is it to
          > refactor our search OO hierarchy to support it?
          >
          > If our goal is minimal impact to the current model, we worry only about the
          > TopFieldDocs search() method. We can hack in per-segment bookending via doc
          > number to the hit collection routine, re-initializing the TopFieldDocCollector
          > for each segment (either creating a new one or popping all the collected docs).
          >
          > But does it make sense to be more aggressive? Should Searchers run hit
          > collection against individual segments? Should Scorers only be compiled
          > against single segments?
          >
          > Maybe so. I implemented pruning (early termination) in KS, and it had to be
          > done per segment. This is because you have to sort the documents within a
          > segment according to the primary criteria you want to prune on (typically doc
          > boost). I've since ripped out that code because it was adding too much
          > complexity, but maybe there would have been less complexity if segments were
          > closer to the foreground.

          I think the pluses are substantial here. Not having to materialize
          one massive array of norms, of FieldCache/column-stride values, of
          docID->ord values, is very important because these are at least linear
          cost (more for the docID->ord) in # docs in the index. Reopening a
          searcher on a large index is very costly in Lucene now because of
          these materializations.

          We need to think more about the tradeoffs here...

          Michael McCandless added a comment -

          >> Well ...that cat command can be deadly for a large index, too?
          >
          > It will be costly for a large index, and it wouldn't be appropriate in all
          > cases. The use case I was thinking of was: dedicated server with gobs of RAM.
          > The index could either be updated often or not updated at all. Pre-existing
          > segments stay warm on such a box, and the writer would leave the latest
          > segment hot, so the cat command would only be needed once, at the startup of
          > the persistent service.

          Ahh OK. But that cat command is basically just a different, more
          global, implementation of "warming".

          So eg you'd still need to coordinate so that the new searcher isn't
          used until warming finishes, right? In Lucene, since warming is explicit
          and under direct programmatic control, we know when warming is done.
          I guess you could also do a system call to do the cat command,
          blocking cutover to the new searcher until it completes.

          Marvin Humphrey added a comment -

          > Not having to materialize one massive array of norms, of
          > FieldCache/column-stride values, of docID->ord values, is very important
          > because these are at least linear cost (more for the docID->ord) in # docs
          > in the index. Reopening a searcher on a large index is very costly in Lucene
          > now because of these materializations.
          >
          > We need to think more about the tradeoffs here...

          Let's continue the discussion of segment-centric searching on java-dev, since it's
          only tangentially related to flexible indexing.

          Marvin Humphrey added a comment -

          > So eg you'd still need to coordinate so that the new searcher isn't
          > used until warming finishes, right?

          ...

          > I guess you could also do a system call to do the cat command,
          > blocking cutover to the new searcher until it completes.

          Warming is only needed once, at search service startup. The idea is to get
          the whole index into the system IO cache.

          Once all segment data is in the IO cache, we assume that it stays there,
          because this is a beefy dedicated search box with more than enough RAM to fit
          the entire index/shard.

          Say that we add a new segment to the index, either by running an
          index-writing process locally, or via rsync. (Assume for the purposes of
          argument that the local indexing process doesn't require much RAM – which
          is true with KS – and so won't have the side effect of nudging existing
          segments out of IO cache.)

          Now, say that our search service checks at the beginning of each request to
          see if the index has been modified. If it has, it opens a new searcher from
          scratch – which takes almost no time, because we're memory mapping rather
          than slurping.

          while (newRequest()) {
            if (indexHasBeenUpdated()) {
              searcher = new IndexSearcher("/path/to/index");
            }
            ...
          }
          

          After an abrupt cutover to the new searcher, we process the search request.
          Is the new search sluggish in any way? No, because all the segments used
          by the new searcher are "hot". Older segments are hot because they were
          in use by the prior searcher, and the new segment is hot because it was
          just written.

          Therefore, we don't need to worry about managing cutover to a new searcher.
          We can just discard the old one and replace it with the new one.

          Marvin Humphrey added a comment -

          > Well, maybe both? Ie, each of these IndexComponents could have many
          > different codecs to write/read the data to/from the index. So when I
          > implement PostingsComponent, when writing a segment I could choose my
          > own codec; when reading it, I retrieve the matching codec to decode
          > it.

          Yes, both – that sounds good. However, I'm not sure whether you're proposing
          the creation of a class named "Codec", which I think we should avoid unless
          all of our "codecs" can descend from it. So: PostingsCodec, TermsDictCodec
          (or LexiconCodec, for Lucy/KS), and so on would be base classes.

          > Subclassing Schema seems like the right approach.

          Groovy. How are you going to handle it in Lucene? I think you just have to
          require the end user to be consistent about supplying the necessary arguments
          to the IndexReader and IndexWriter constructors.

          How do we handle auxiliary IndexComponents? I've long wanted to implement an
          RTreeComponent for geographic searching, so I'll use that as an example.

          At index-time, I think we just create an array of SegDataWriter objects and
          feed each document to each writer in turn. The SegDataWriter abstract base
          class will define all the necessary abstract methods: addDoc(),
          addSegment(SegReader) (for Lucy/KS), various commands related to merging (for
          Lucene), finish()/close(), and so on. RTreeWriter would simply subclass
          SegDataWriter.
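
          A minimal sketch of that index-time arrangement, with SegDataWriter,
          SegWriter and Doc as hypothetical stand-ins:

          import java.io.IOException;
          import java.util.List;

          class Doc { /* stand-in for the real document type */ }

          // Base class for per-segment data writers (postings, stored fields,
          // R-tree, ...); each one sees every document in turn.
          abstract class SegDataWriter {
            abstract void addDoc(Doc doc) throws IOException;
            abstract void finish() throws IOException;   // flush/close segment files
          }

          class SegWriter {
            private final List<SegDataWriter> writers;

            SegWriter(List<SegDataWriter> writers) {
              this.writers = writers;
            }

            // Feed one document to every registered component writer.
            void addDoc(Doc doc) throws IOException {
              for (SegDataWriter writer : writers) {
                writer.addDoc(doc);
              }
            }

            void finish() throws IOException {
              for (SegDataWriter writer : writers) {
                writer.finish();
              }
            }
          }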

          At search-time, things get a little trickier. Say we hand our Searcher object
          an RTreeRadiusQuery. At some point, the RTreeRadiusQuery will need to be
          compiled to an RTreeRadiusScorer, which will involve accessing an RTreeReader
          which presumably resides within an IndexReader. However, right now,
          IndexReader hides all of its inner readers and provides access through
          specific methods, e.g. IndexReader.document(int docNum), which ultimately
          hands off to FieldsReader internally. This model doesn't scale with the
          addition of arbitrary IndexComponents.

          The only thing I can think of is an IndexReader.getReader(String name) method.
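
          Purely as an illustration of how a scorer might use such a method (all
          names here are hypothetical):

          // Inside a hypothetical RTreeRadiusScorer: fetch the auxiliary reader
          // that the RTreeComponent registered under an agreed-upon name.
          Object aux = indexReader.getReader("rtree");
          if (aux == null) {
            throw new IllegalStateException("index was not written with an RTreeComponent");
          }
          RTreeReader rtreeReader = (RTreeReader) aux;
          // ... iterate doc ids whose points fall within the query radius ...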

          Michael McCandless added a comment -

          > Warming is only needed once, at search service startup.

          Ahh, got it. Lucene must warm for each reopened searcher (though that warming cost will eventually be in proportion to what's changed in the index), but KS/Lucy should be fine doing zero warming except for the very first searcher startup (eg after rebooting the machine).

          Michael McCandless added a comment -

          > So: PostingsCodec, TermsDictCodec (or LexiconCodec, for Lucy/KS), and
          > so on would be base classes.

          Right: separate codec base classes for each component. Back to the
          video analogy: a typical video has an "audio" component and a "video"
          component. AudioCodec would be the base class for all the various
          audio codecs, and likewise for VideoCodec.

          > I think you just have to require the end user to be consistent about
          > supplying the necessary arguments to the IndexReader and IndexWriter
          > constructors.

          Right.

          > How do we handle auxiliary IndexComponents? I've long wanted to implement an
          > RTreeComponent for geographic searching, so I'll use that as an example.

          > At index-time, I think we just create an array of SegDataWriter objects and
          > feed each document to each writer in turn.

          I think that's right. In Lucene we now have an indexing chain
          (package private), so that you can "tap in" at whatever point is
          appropriate – you could handle the whole doc yourself (like
          SegDataWriter); you could be fed one field at a time; you could tap in
          after inversion so you get one token at a time, etc.

          > At search-time, things get a little trickier.
          > ...
          > The only thing I can thing of is an IndexReader.getReader(String
          > name) method.

          I haven't thought enough about how to handle this at search time.
          IR.getReader seems fine, though, you'd need to open each
          IndexComponent up front inside the retry loop, right?

          Marvin Humphrey added a comment -

          > In Lucene we now have an indexing chain
          > (package private), so that you can "tap in" at whatever point is
          > appropriate - you could handle the whole doc yourself (like
          > SegDataWriter); you could be fed one field at a time; you could tap in
          > after inversion so you get one token at a time, etc.

          That's pretty nice. It occurred to me to try something like that, but I got a
          little lost.

          The fact that the Doc object in KS uses the host language's native hashtable
          and string implementations for field data complicates an already complicated
          matter. It's hard to abstract out access to field data so that the KS/Lucy
          core, which knows nothing about the host language, can see it, yet still
          maintain peak performance in the addDoc() loop.

          In any case, I don't anticipate intractable implementation troubles with
          adding IndexComponents at index-time.

          > IR.getReader seems fine, though, you'd need to open each
          > IndexComponent up front inside the retry loop, right?

          Sure, startup's easy. I think we just add Schema.auxiliaryComponents(),
          which returns an array of IndexComponents. The default would be to return
          null or an empty array, but subclasses could override it.
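
          For example, assuming such a hook existed, a Schema subclass might wire in
          the RTreeComponent from above like so (again, hypothetical names):

          class MySchema extends Schema {
            IndexComponent[] auxiliaryComponents() {
              return new IndexComponent[] { new RTreeComponent() };
            }
          }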

          Where we have problems, though, is with remote searching or multi-searching.
          You can't ask a Searchable for its inner IndexReader, because it might not
          have one. That means that you can't "see" information pertaining to a custom
          IndexComponent until you're at the level of the individual machine –
          aggregate information, like docFreq across an entire collection spanning
          multiple indexes, wouldn't be available to searches which use custom
          components.

          The only remedy would be to subclass all your Searchables – the local
          IndexSearcher, the RemoteSearchable that wraps it, and the MultiSearcher that
          aggregates results – to drill down into the correct IndexReader and pass data
          back up the chain. Basically, you'd have to duplicate e.g. the call chain
          that fetches documents.

          Michael McCandless added a comment -

          New patch attached (still plenty more to do...):

          • Updated to current trunk (747391).
          • All tests pass, but back-compat tests don't compile...
          • Switched the new "4d iteration API" (Fields -> Terms -> Docs ->
            Positions) to subclass AttributeSource; this way codecs can add in
            their own attrs.
          • Added PostingsCodecs class, that holds all PostingCodec instances
            your index may make use of, and changed segments_N format to
            record which codec was used per segment. So, an index can have
            mixed codecs (though for a single IndexWriter session, the same
            codec is used when writing new segments).
          • I cutover TermScorer to use the new API; I still need to cutover
            other queries, segment merging, etc.
          Michael McCandless added a comment -

          Clearing fix version.

          Michael Busch added a comment -

          I took Mike's latest patch and updated it to current trunk.
          It applies cleanly and compiles fine.

          Some test cases fail. The problem is in SegmentReader, in termsIndexIsLoaded() and loadTermsIndex(). I'll take a look tomorrow; I need to understand the latest changes we made in the different IndexReaders better (and now it's getting quite late here...)

          Michael McCandless added a comment -

          Thanks for modernizing the patch Michael! I'll get back to this one soon... I'd really love to get PForDelta working as a codec. It's a great test case since it's block-based, ie, very different from the other codecs.

          Michael Busch added a comment -

          > Switches to a new more efficient terms dict format.

          This is nice! Maybe we should break this whole issue into smaller pieces?
          We could start with the dictionary. The changes you made here are really
          cool already.

          We could further separate the actual TermsDictReader from the terms index
          with a clean API (I think you actually put a TODO comment into your patch).
          Then we can have different terms index implementations in the future, e.g.
          one that uses a tree.

          We could also make SegmentReader a bit cleaner: if opened just for merging
          it would not create a terms index reader at all; only if cloned for an
          external reader would we instantiate the terms index lazily. Currently this
          is done by setting the divisor to -1.

          Michael Busch added a comment -

          In the current patch the choice of the Codec is index-wide, right? So I can't specify different codecs for different fields. Please correct me if I'm wrong.

          Michael Busch added a comment -

          Oops, didn't want to steal this from you, Mike. Wanted to hit the "Watch" button instead...

          Michael McCandless added a comment -

          Maybe we should break this whole issue into smaller pieces? We could start with the dictionary. The changes you made here are really cool already.

          Yeah the issue is very large now. I'll think about how to break it
          up.

          I agree: the new default terms dict codec is a good step forward.
          Rather than load a separate TermInfo instance for every indexed term
          (costly in object overhead, and, because we store a Term[] as well, we
          are wasting space storing many duplicate String field pointers in a
          row), we only store the String and the long offset into the index file,
          as two arrays. It's a sizable memory savings for indexes with many
          terms.

          This was a nice side-effect of genericizing things, because the
          TermInfo class had to be made private to the codec since it's storing
          things like proxOffset, freqOffset, etc., which is particular to how
          Lucene's default codec stores postings.

          But, it's somewhat tricky to break out only this change... eg it's
          also coupled with the change to strongly separate field from term
          text, and, to remove TermInfo reliance. Ie, the new terms dict has a
          separate per-field class, and within that per-field class it has the
          String[] termText and long[] index offsets. I guess we could make a
          drop-in class that tries to emulate TermInfosReader/SegmentTermEnum
          even though it separates into per-field, internally.

          We could further separate the actual TermsDictReader from the terms index with a clean API (I think you put actually a TODO comment into your patch).

          Actually the whole terms dict writing/reading is itself pluggable, so
          your codec could provide its own. Ie, Lucene "just" needs a
          FieldsConsumer (for writing) and a FieldsProducer (for reading).

          But it sounds like you're proposing making a strong decoupling of
          terms index from terms dict?

          Then we can have different terms index implementations in the future, e.g. one that uses a tree.

          +1

          Or, an FST. An FST is more compelling than a tree since it also
          compresses suffixes. An FST is simply a tree in the front plus a tree
          in the back (in reverse), where the "output" of a given term's details
          appears in the middle, on an edge that is "unique" to each term, as
          you traverse the graph.

          We could also make SegmentReader a bit cleaner: if opened just for merging it would not create a terms index reader at all; only if cloned for an external reader we would instantiate the terms index lazily. Currently this is done by setting the divisor to -1.

          Right. Somehow we should genericize the "I don't need the terms
          index at all" when opening a SegmentReader. Passing -1 is sort of
          hackish. Though I do prefer passing up front your intentions, rather
          than loading lazily (LUCENE-1609).

          We could eg pass "requirements" when asking the codec for the terms
          dict reader. EG if I don't state that RANDOM_ACCESS is required (and
          only say LINEAR_SCAN) then the codec internally can make itself more
          efficient based on that.

          In the current patch the choice of the Codec is index-wide, right? So I can't specify different codecs for different fields. Please correct me if I'm wrong.

          The Codec is indeed index-wide; however, because field and term
          text are strongly separated, it's completely within a Codec's control
          to return a different reader/writer for different fields. So it ought
          to work fine... eg one could in theory make a "PerFieldCodecWrapper".
          But, I haven't yet tried this with any codecs. It would make a good
          test case though... I'll make a note to write a test case for this.
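
          Just to make the idea concrete, here's a rough standalone sketch of
          how such a wrapper could dispatch by field. The Codec/FieldsConsumer
          shapes below are simplified stand-ins, not the actual API in the
          patch:

            import java.util.Map;

            // Rough sketch only: the interfaces here are simplified stand-ins.
            public class PerFieldCodecWrapperSketch {

              interface FieldsConsumer { }   // stand-in for a per-field postings writer

              interface Codec {
                FieldsConsumer fieldsConsumer(String field);
              }

              // Dispatches to a field-specific Codec, falling back to a default.
              static class PerFieldCodecWrapper implements Codec {
                private final Map<String, Codec> perField;
                private final Codec defaultCodec;

                PerFieldCodecWrapper(Map<String, Codec> perField, Codec defaultCodec) {
                  this.perField = perField;
                  this.defaultCodec = defaultCodec;
                }

                public FieldsConsumer fieldsConsumer(String field) {
                  Codec codec = perField.get(field);
                  if (codec == null) {
                    codec = defaultCodec;
                  }
                  return codec.fieldsConsumer(field);
                }
              }
            }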

          Also, it's fine if an index has used different codecs over time when
          writing, as long as when reading you provide a PostingsCodecs
          instance that's able to [correctly] retrieve those codecs to read those
          segments.

          Michael Busch added a comment -

          But it sounds like you're proposing making a strong decoupling of
          terms index from terms dict?

          Right.

          Right. Somehow we should genericize the "I don't need the terms
          index at all" when opening a SegmentReader. Passing -1 is sort of
          hackish. Though I do prefer passing up front your intentions, rather
          than loading lazily (LUCENE-1609).

          I'm a bit confused. Doesn't the IndexWriter open SegmentReaders
          usually with termsIndexDivisor=-1 for merge, and maybe later with
          a termsIndexDivisor>0 when IndexWriter#getReader() is called?
          That's what I meant with loading lazily.

          I thought that's why it'd be good to separate the terms index from
          the terms dict. For merge we'd open the dict reader only, and then
          if getReader() is called we'd open the terms index reader and give
          its reference to the dict reader.

          I admit that I didn't follow the NRT changes as closely as I should
          have, so I might be missing things here.

          Michael Busch added a comment -

          The Codec is indeed index-wide, however, because the field vs term
          text are strongly separated, it's completely within a Codec's control
          to return a different reader/writer for different fields. So it ought
          to work fine... eg one in theory could make a "PerFieldCodecWrapper".
          But, I haven't yet tried this with any codecs. It would make a good
          test case though... I'll write down to make a test case for this.

          OK I see now. Did you think about possibly extending the field API
          to specify the codec? And then to store the Codec name in the
          fieldinfos (which we'd want to make extensible too, as briefly
          discussed in LUCENE-1597) instead of the dictionary?

          Michael McCandless added a comment -

          I'm a bit confused. Doesn't the IndexWriter open SegmentReaders
          usually with termsIndexDivisor=-1 for merge, and maybe later with
          a termsIndexDivisor>0 when IndexWriter#getReader() is called?
          That's what I meant with loading lazily.

          Right, it does. This is the one case (internal to Lucene, only) where
          loading lazily is still necessary.

          I thought that's why it'd be good to separate the terms index from
          the terms dict. For merge we'd open the dict reader only, and then
          if getReader() is called we'd open the terms index reader and give
          its reference to the dict reader.

          OK got it. I think this makes sense.

          The separation in the current approach is already quite strong, in
          that the terms dict writer/reader maintains its own String[] indexText
          and long[] indexOffset and then "defers" to its child component just
          what is stored in each terms dict entry. So each child can store
          whatever it wants in the terms dict entry (eg the pulsing codec
          inlines low-freq postings).

          If we make pluggable how the indexText/indexOffset is stored/loaded in
          memory/used, then we have a stronger separation/pluggability on the
          index. EG even before FST for the index we should switch to blocks of
          char[] instead of separate Strings, for indexText.
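
          To illustrate what I mean by blocks (just a toy sketch, not how the
          real terms index reader is laid out): pack all index terms into one
          shared char[] plus an offset array, instead of a String object per
          indexed term:

            // Toy sketch: one shared char[] block plus offsets, instead of a
            // String instance per indexed term.
            public class PackedTermsIndexSketch {
              private final char[] block;   // all index terms concatenated
              private final int[] starts;   // starts[i] = start of term i; starts[n] = end

              PackedTermsIndexSketch(String[] indexTerms) {
                starts = new int[indexTerms.length + 1];
                int total = 0;
                for (int i = 0; i < indexTerms.length; i++) {
                  starts[i] = total;
                  total += indexTerms[i].length();
                }
                starts[indexTerms.length] = total;
                block = new char[total];
                for (int i = 0; i < indexTerms.length; i++) {
                  indexTerms[i].getChars(0, indexTerms[i].length(), block, starts[i]);
                }
              }

              String term(int i) {
                return new String(block, starts[i], starts[i + 1] - starts[i]);
              }
            }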

          Michael Busch added a comment -

          EG even before FST for the index we should switch to blocks of
          char[] instead of separate Strings, for indexText.

          I totally agree. I made a similar change (from String objects to
          char[] blocks) on some other code (not Lucene) and the savings
          in memory and garbage collection were tremendous!

          Michael McCandless added a comment -

          I attached a .tar.bz2 of src/* with my current state – too hard to
          keep svn in sync / patchable right now. Changes:

          • Factored out the terms dict index, so it's now "pluggable" (though
            I've only created one impl, so far)
          • Cutover SegmentMerger to flex API
          • Changed terms to be stored in RAM as byte[] (not char[]), when
            reading. These are UTF8 bytes, but in theory eventually we could
            allow generic bytes here (there are not that many places that try
            to decode them as UTF8). I think this is a good step towards
            allowing generic terms. It also saves 50% RAM for simple ascii
            terms w/ the terms index.
          • Changed terms index to use shared byte[] blocks
          • Broke sources out into "codecs" subdir of oal.index. Right now I
            have "preflex" (only provides reader, to read old index format),
            "standard" (new terms dict & index, but otherwise same
            freq/prox/skip/payloads encoding), "pulsing" (inlines low-freq
            terms directly into terms dict) and "sep" (separately stores docs,
            frq, prox, skip, payloads, as a precursor to using pfor to encode
            doc/frq/prox).

          The patch is very rough... core & core-test compile, but most tests
          fail. It's very much still a work in progress...

          Michael McCandless added a comment -

          New patch & src.tar.bz2 attached. All tests, including back-compat, pass.

          There are still zillions of nocommits to resolve.

          Some of the changes:

          • Got all tests to pass.
          • Separated out a non-enum Fields/Terms API.
          • Improved byte[] block allocation in the new terms index so that
            the blocks are shared across fields (important when there are
            zillions of fields each of which has few index terms)
          • Changed TermsEnum.docs() API to accept a new bit set interface
            (currently called Bits) skipDocs. This is towards eventual
            support for random access filters. I also added Bits
            IndexReader.getDeletedDocs(). (A tiny sketch of the interface
            follows this list.)
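
          For reference, the shape of that interface is tiny (a sketch; the
          version in the patch may carry more methods):

            // Random-access "is this doc skipped/deleted?" check, without
            // requiring a full BitSet implementation.
            public interface Bits {
              boolean get(int index);
            }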

          Next step is to get the other codecs (sep, pulsing) to pass all tests,
          then to make a pfor codec! I also need to perf test all of these
          changes...

          Yonik Seeley added a comment -

          Changed terms to be stored in RAM as byte[] (not char[]),

          Yay! This will be important for NumericField too since it uses 7 bits per char and will probably account for the majority of terms in the index in many applications.

          I attached a .tar.bz2 of src/* with my current state - too hard to keep svn in sync / patchable right now.

          Could a git branch make things easier for mega-features like this?

          Michael McCandless added a comment -

          Changed terms to be stored in RAM as byte[] (not char[]),

          Yay! This will be important for NumericField too since it uses 7 bits per char and will probably account for the majority of terms in the index in many applications.

          It's actually byte[] both in how the terms dict index stores the terms
          in RAM (using shared byte[] blocks) and also in how terms are
          represented throughout the flex API. EG TermsEnum API returns
          a TermRef from its next() method. TermRef holds byte[]/offset/length.
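
          In sketch form, a TermRef is just a slice over a (possibly shared)
          byte[]; the real class carries more than this, so treat the code
          below as illustrative only:

            import java.nio.charset.Charset;

            // Simplified sketch of a byte-slice term reference.
            public class TermRefSketch {
              public byte[] bytes;
              public int offset;
              public int length;

              public TermRefSketch(byte[] bytes, int offset, int length) {
                this.bytes = bytes;
                this.offset = offset;
                this.length = length;
              }

              // Only decode to a String when it's actually needed (eg display).
              public String toUTF8String() {
                return new String(bytes, offset, length, Charset.forName("UTF-8"));
              }
            }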

          Could a git branch make things easier for mega-features like this?

          Maybe – though I don't have much experience w/ git. If people are
          interested in working together on this then I think it'd be worth
          exploring?

          Michael McCandless added a comment -

          Attached patch.

          All tests pass with all 3 codecs (standard = just like today's index format; pulsing = terms that occur in only 1 doc are inlined into terms dict; sep = separate files for doc, freq, prx, payload, skip data).

          Jason Rutherglen added a comment -

          Mike,

          Maybe a directed acyclic word graph would work well as an alternative dictionary implementation?

          Michael McCandless added a comment -

          Maybe a directed acyclic word graph would work well as an alternative dictionary implementation?

          I think that'd be great. In particular, an FST (DAG that shares prefix & suffix and "outputs" the per-term data in the middle of the graph) should be a good savings in most normal term distributions.

          Flexible indexing makes the terms dict & terms dict index pluggable, so we are free to experiment with alternative impls. I've only taken some baby steps to improve on the current terms dict index (by switching to shared byte[] blocks, instead of a separate TermInfo / String instance per indexed term).

          Michael McCandless added a comment -

          New patch attached. All tests pass.

          I haven't quite made it to PForDelta yet, but it's very close!

          The sep codec was the first step (uses separate files for doc, frq,
          pos, payload, skip).

          Then, in this patch, the big change was to create new
          IntIndexInput/Output abstract classes, that only expose reading &
          writing ints. I then fixed the sep codec to use this class for doc,
          frq and pos files.

          The trickiest part was abstracting away just what a "file pointer"
          is. In Lucene we assume in many places this is the long file offset,
          but I needed to change this to file-offset plus within-block-offset,
          for int-block based files.

          Once I did that, I created a FixedIntBlockIndexInput/Output, which
          reads & writes the ints in blocks of a specified size. They are
          abstract classes and require a subclass to do the actual encode/decode
          of a given block. To test it I created a simple class that just
          writes multiple vInts. All tests also pass with this newly added
          ("intblock") codec.

          So the next step is to hook up PforDelta...

          John Wang added a comment -

          This is awesome!
          Feel free to take code from Kamikaze for the p4delta stuff.
          The impl in Kamikaze assumes no decompression at load time, e.g. the Docset can be traversed in compressed form.

          Michael McCandless added a comment -

          Feel free to take code from Kamikaze for the p4delta stuff.
          The impl in Kamikaze assumes no decompression at load time, e.g. the Docset can be traversed in compressed form.

          Thanks John. I've been starting with LUCENE-1410 for now, but we can easily swap in any p4 impl, or any other int compression method. All that's needed in the subclass is to implement encodeBlock (in the writer) and decodeBlock (in the reader). The intblock codec takes care of the rest.

          Kamikaze looks like great stuff!

          What variation on p4 is Kamikaze using?

          Keeping the p4 data compressed is interesting... when you implement AND/OR/NOT on p4, do you have a shortcut that traverses the compressed form while applying the operator? Or do you do the full decode and then 2nd pass to apply the operator?

          John Wang added a comment -

          Hi Mike:

          We have been using Kamikaze in our social graph engine in addition to our search system. A person's network can be rather large, and decompressing it in memory for some network operations is not feasible for us, hence we made it a requirement for the DocIdSetIterator to be able to walk the DocIdSet's P4Delta implementation in compressed form.

          We do not decode the p4delta set and make a second pass for boolean set operations; we cannot afford it in either memory cost or latency. The P4Delta set adheres to the DocIdSet/Iterator api, and the And/Or/Not is performed on that level of abstraction using the next() and skipTo() methods.

          -John

          John Wang added a comment -

          Just an FYI: Kamikaze originally started as our sandbox for Lucene contributions until 2.4 was ready. (We needed the DocIdSet/Iterator abstraction that was migrated from Solr.)

          It has three components:

          1) P4Delta
          2) Logical boolean operations on DocIdSet/Iterators (I created a jira ticket and a patch for Lucene a while ago with performance numbers. It is significantly faster than DisjunctionScorer)
          3) An algorithm to determine which DocIdSet implementation to use given some parameters, e.g. minId, maxId, id count, etc. It learns and adjusts from the application behavior if not all parameters are given.

          So please feel free to incorporate anything you see fit or move it to contrib.

          John Wang added a comment -

          Hi Uwe:

          Thanks for the pointer to the isCacheable method. We will definitely incorporate it.

          -John

          Michael McCandless added a comment -

          Attached patch. This includes the pfor impl from LUCENE-1410.

          PforDelta is working! I added another codec (pfordelta). It uses the
          sep codec to separately store freq, doc, pos, and then uses PforDelta
          to encode the ints (as fixed-size blocks).

          However, there are a couple test failures
          (TestIndexWriter.testNegativePositions,
          TestPositionIncrement.testPayloadsPos0) due to PforDelta not properly
          encoding -1 (it's returned as 255). Lucene normally doesn't write
          negative ints, except for the special case of a 0 position increment
          in the initial token(s), in which case due to the bug in LUCENE-1542
          we write a -1 if you've called IndexWriter.setAllowMinus1Position.
          However, that's deprecated and will be removed shortly at which point
          the pfordelta codec will pass all tests.
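
          (As a side note, 255 is exactly what falls out if only the low 8 bits
          of -1 survive the round trip; that's a guess at the mechanism, not
          something verified in the pfor code:)

            // -1 masked down to 8 bits is 255, matching the symptom above.
            public class MinusOneDemo {
              public static void main(String[] args) {
                System.out.println(-1 & 0xFF);   // prints 255
              }
            }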

          Michael McCandless added a comment -

          I wrote up first cut of the toplevel design of this patch, in the wiki: http://wiki.apache.org/lucene-java/FlexibleIndexing.

          John Wang added a comment -

          Hi Mike:

          Truly awesome work!

          Quick question, are codecs per index or per field? From the wiki, it seems to be per index, if so, is it possible to make it per field?

          Thanks

          -John

          Michael McCandless added a comment -

          The codec is per segment. However, we ask the codec for
          Terms/TermsEnum by fields, so it should be simple to make a Codec that
          dispatches to field-specific Codecs.

          Michael McCandless added a comment -

          Attached current patch. All tests pass:

          • Cutover merging to flex API.
          • Cutover FieldCache to flex API. This got tricky, because terms
            are now UTF8 byte[]. First, we have a back-compat issue (I
            changed FieldCache's parsers to take TermRef not String). Second,
            parsing float/double from byte[] is tricky. I just punted and
            made a new String(), and then called parseDouble/parseFloat, which
            is slow (but, NumericFields don't do this – they are easy to
            parse straight from byte[], I think). Net/net this should be
            faster loading the FieldCache now. Also, later we can make a
            String/StringIndex FieldCache variant that keeps things as byte[].
            (See the small byte[] parsing sketch after this list.)
          • Cutover CheckIndex to flex API.
          • Removed the codec-owned extensions from IndexFileNames; added
            methods to query a Codec for all file extensions it may write. As
            part of this there is a minor (I think) runtime change whereby
            Directory.copy or new RAMDirectory(Directory) will now copy all
            files, not just index-related files.
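
          For the simple integer case, parsing straight from the UTF8 bytes is
          easy; the sketch below assumes plain ASCII digits with an optional
          leading minus and does no overflow checking (float/double are the
          genuinely tricky ones):

            // Sketch: parse an int directly from ASCII digit bytes, with no
            // String allocation. Assumes well-formed input; no overflow checks.
            public class ByteIntParser {
              static int parseInt(byte[] bytes, int offset, int length) {
                int i = offset;
                boolean negative = bytes[i] == '-';
                if (negative) {
                  i++;
                }
                int result = 0;
                for (; i < offset + length; i++) {
                  result = result * 10 + (bytes[i] - '0');
                }
                return negative ? -result : result;
              }
            }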

          I'm now working towards getting this committable. While PforDelta
          works, I think we should move its codec over to LUCENE-1410 and get it
          working well, separately, after this is committed.

          Still need to cutover more stuff (queries, AllTermDocs, etc.) to flex
          API, get the ThreadLocal cache carried over, fix a bunch of nocommits,
          remove debugging, do perf testing & fix issues, add some more tests,
          etc.

          Michael McCandless added a comment -

          New patch attached. All tests pass. The changes are mostly cutting
          many things over to the flex API. Still many nocommits to address,
          but I'm getting closer!

          I haven't "svn up"d to all the recent the deprecations removals /
          generics additions. Kinda dreading doing so I think I'll wait
          until all deprecations are gone and then bite the bullet...

          Cutting over all the MultiTermQuery subclasses was nice because all
          the places where we get a TermEnum & iterate, checking if .field() is
          still our field, are now cleaner because with the flex API the
          TermsEnum you get is already only for your requested field.

          Yonik Seeley added a comment -

          Sounding cool! I haven't had time to look at the code too much... but I just wanted to mention two features I've had in the back of my mind for a while that seem to have multiple use cases.

          1) How many terms in a field?

          • If the tii/TermInfos were exposed, this could be estimated.
          • Perhaps this could just be stored in FieldInfos... should be easy to track during indexing?
          • MultiTermQuery could also use this to switch impls

          2) Convert back and forth between a term number and a term.
          Solr has code to do this... stores every 128th term in memory as an index, and uses that to convert back and forth. This is very much like the internals of TermInfos... would be nice to expose some of that.

          John Wang added a comment -

          Hi Yonik:

          These are indeed useful features. LUCENE-1922 addresses 1), perhaps, we can add 2) to the same issue to track?

          Thanks

          -John

          Michael McCandless added a comment -

          1) How many terms in a field?

          Actually I've already added this one (Terms.getUniqueTermCount), but I
          didn't punch it through to IndexReader. I'll do that. The standard
          codec (new "default" codec when writing segments) already records this
          per field, so it's trivial to expose.

          However, some impls may throw UOE (eg a composite IndexReader).

          2) Convert back and forth between a term number and a term.

          I agree this would be useful. I did have ord() in early iterations of
          the TermsEnum API, but it wasn't fully implemented and I stripped it
          when I switched to "just finish it already" mode. We could think
          about adding it back, though you'd also presumably need seek(int ord)
          as well? (And docFreq(String field, int ord) sugar exposed in
          IndexReader?).

          Yonik Seeley added a comment -

          I agree this would be useful. I did have ord() in early iterations of the TermsEnum API, but it wasn't fully implemented and I stripped it when I switched to "just finish it already" mode

          A "complete" implementation seems hard (i.e. across multiple segments also)... but it still seems useful even if it's only at the segment level. So perhaps just on SegmentTermEnum, and uses would have to cast to access?

          Exposing the term index array (i.e. every 128th term) as an expert-subject-to-change warning would let people implement variants themselves at least.

          you'd also presumably need seek(int ord)

          Yep.

          Grant Ingersoll added a comment -

          I haven't followed too closely (even though it is one of my favorite issues) but I figured while Yonik was throwing out ideas, I'd add that one of the obvious use cases for flexible indexing is altering scoring. One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?

          Mark Miller added a comment -

          I haven't "svn up"d to all the recent the deprecations removals / generics additions. Kinda dreading doing so '

          Come on old man, stop clinging to emacs. I've got a meditation technique for that

          Sounds like some annoyance, and I think I made a comment there - and I'm a man of my word... or child of my word - take your pick.

          Updated the patch to trunk. Since you likely have moved on, don't worry - this was good practice - I'll do it again sometime if you'd like. I may have mis-merged something small or something. I went fairly quickly (I think it took like 30 or 40 min - was hoping to do it faster, but eh - sometimes I like to grind).

          I didn't really look at the code, but some stuff I noticed:

          java 6 in pfor Arrays.copy

          skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

          in IndexWriter:
          + // Mark: read twice?
          segmentInfos.read(directory);
          + segmentInfos.read(directory, codecs);

          Core tests pass, but I didn't wait for contrib or back compat.

          Mark Miller added a comment -

          eh - even if you have moved on, if I'm going to put up a patch, might as well do it right - here is another:

          • removed a boatload of unused imports
          • removed DefaultSkipListWriter/Reader - I accidentally put them back in
          • removed an unused field or two (not all)
          • parameterized LegacySegmentMergeQueue.java
          • Fixed the double read I mentioned in previous comment in IndexWriter
          • TermRef defines an equals (that throws UOE) and not hashCode - early stuff I guess but odd since no class extends it. Added a hashCode that throws UOE anyway.
          • fixed bug in TermRangeTermsEnum: lowerTermRef = new TermRef(lowerTermText); to lowerTermRef = new TermRef(this.lowerTermText);
          • Fixed Remote contrib test to work with TermRef for fieldcache parser (since you don't include contrib in the tar)
          • Missed a StringBuffer to StringBuilder in MultiTermQuery.toString
          • had missed removing deprecated IndexReader.open(final Directory directory) and deprecated IndexReader.open(final IndexCommit commit)
          • Parameterized some stuff in ParallelReader that made sense - what the heck
          • added a nocommit or two on unread fields with a comment that made it look like they were/will be used
          • Looks like SegmentTermPositions.java may have been screwy in the last patch - ensured it's now a deleted file - same with TermInfosWriter.java
          • You left getEnum(IndexReader reader) in the MultiTerm queries, but not in PrefixQuery - just checkin'.
          • Missed removing listAll from FileSwitchDirectory - gone
          • cleaned up some white space nothings in the patch
          • I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.
          • Looks like I missed merging in a change to TestIndexWriter.java#assertNoUnreferencedFiles - done
          • doubled checked my merge work

          core and contrib tests pass

          Michael McCandless added a comment -

          Whoa thanks for the sudden sprint Mark!

          Come on old man, stop clinging to emacs

          Hey! I'm not so old. But yeah, I still cling to emacs. Hey, I know
          people who still cling to vi!

          I didn't really look at the code, but some stuff I noticed:

          java 6 in pfor Arrays.copy

          skiplist stuff in codecs still have package of index - not sure what is going on there - changed them

          in IndexWriter:
          + // Mark: read twice?
          segmentInfos.read(directory);
          + segmentInfos.read(directory, codecs);

          Excellent catches! None of these are right.

          (since you don't include contrib in the tar)

          Gak, sorry. I have a bunch of mods there, cutting over to flex API.

          You left getEnum(IndexReader reader) in the MultiTerm queries, but no in PrefixQuery - just checkin'.

          Woops, for back compat I think we need to leave it in (it's a
          protected method), deprecated. I'll put it back if you haven't.

          I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

          Eek, it shouldn't be – indeed it is. When did that happen? We
          should fix this (separately from this issue!).

          Do you have more fixes coming? If so, I'll let you sprint some more; else, I'll merge in, add contrib & back-compat branch, and post new patch! Thanks

          Michael McCandless added a comment -

          One of the common statistics one needs to implement some more advanced scoring approaches is the average document length. Is this patch far enough along that I could take a look at it and think about how one might do this?

          Well, thinking through how you'd do this... likely you'd want to store
          the avg length (in tokens), eg as a single float per field per
          segment, right? The natural place to store this would be in the
          FieldInfos, I think?. Unfortunately, this patch doesn't yet add
          extensibility to FieldInfos.

          And you'd need a small customization to the indexing chain to
          compute this when indexing new docs, which is already doable today
          (though, package private).

          But then on merging segments, you'd need an extension point, which we
          don't have today, to recompute the avg. Hmm: how would you handle
          deleted docs? Would you want to go back to the field length for every
          doc & recompute the average? (Which'd mean you'd need the per-doc,
          per-field lengths, not just the averages.)

          Unfortunately, this patch doesn't yet address things like customizing
          what's stored in FieldInfo or SegmentInfo, nor customizing what
          happens during merging (though it takes us a big step closer to this).
          I think we need both of these to "finish" flexible indexing, but I'm
          thinking at this point that these should really be tackled in followon
          issue(s). This issue is already ridiculously massive.

          Uwe Schindler added a comment -

          I guess TestBackwardsCompatibility.java has been removed from trunk or something? kept it here for now.

          Eek, it shouldn't be - indeed it is. When did that happen? We
          should fix this (separately from this issue!).

My fault, I removed it while removing the backwards tests on Saturday. If we do not remove DateTools/DateField for 3.0 (we may need to leave it in for index compatibility), I will restore these tests, too. It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop).

          I have this on my list when going forward with removing the old TokenStream API.

          Michael McCandless added a comment -

          It's easy with TortoiseSVN and you can also preserve the history (using svn:mergeinfo prop).

          Ahh – can you do this for TestBackwardsCompatibility? I restored it, but, lost all history. Thanks.

          Uwe Schindler added a comment -

          Done. I also did it for the BW branch, but didn't create a tag yet. The next tag creation for the next bigger patch is enough (no need to do it now).

          What I have done: svn copy from the older revision to the same path

          Michael McCandless added a comment -

          What I have done: svn copy from the older revision to the same path

Excellent, thanks! It had a few problems (it was still trying to use deprecated APIs, some of which were gone) – I just committed fixes.

          Yonik Seeley added a comment -

          likely you'd want to store the avg length (in tokens), eg as a single float per field per segment, right?

          I think we might want to store fundamentals instead:

          • total number of tokens indexed for that field in the entire segment
          • total number of documents that contain the field in the entire segment

          Both of these seem really easy to keep track of?
          I also think we'd just ignore deleted docs (i.e. don't change the stats) just as idf does today.

          The natural place to store this would be in the FieldInfos, I think?

          yep.
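
As a rough sketch of how those two fundamentals would combine into an average field length at search time (the class and method names below are just illustrative, not part of the patch):

    // Hypothetical per-field, per-segment counters; the average is derived
    // at search time rather than stored.
    public class FieldLengthStats {

      private long totalTokenCount;  // total tokens indexed for this field in the segment
      private long docsWithField;    // total docs in the segment that contain this field

      // Called once per document that has this field, with its token count.
      public void addDocument(int tokensInDoc) {
        totalTokenCount += tokensInDoc;
        docsWithField++;
      }

      // Average field length in tokens.
      public float averageFieldLength() {
        return docsWithField == 0 ? 0f : (float) totalTokenCount / docsWithField;
      }
    }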

          Michael McCandless added a comment -

          Uber-patch attached: started from Mark's patch (thanks!), added my contrib & back-compat branch changes. All tests pass.

          Also, I removed pfor from this issue. I'll attach the pfor codec to LUCENE-1410.

          Note that I didn't use "svn move" in generating the patch, so that the patch can be applied cleanly. When it [finally] comes time to commit for real, I'll svn move so we preserve history.

          Mark Miller added a comment -

          Hey! I'm not so old But yeah I still cling to emacs.

Can you say both of those things in the same breath? Just how long did it take to get that PhD...

I'd look it up and guesstimate your age, but I think MIT still has my IP blocked from back when I was applying to colleges. So I'm going with the "uses emacs" guesstimate.

          Hey, I know people who still cling to vi!

vi is the only one I can halfway use - I know 3 commands - edit mode, leave edit mode, and save. And every now and then I accidentally delete a whole line. When I make a change that I don't want to save, I have to kill the power.

The patch is in a bit of an unpatchable state. I think I know what editor to blame... Pico!

          Our old friend, the $id is messing up WildcardTermEnum - no problem, I can fix that...

          But also, NumericUtils is unpatched, Codec is missing, along with most of the classes from the codecs packages! This looks like my work

My only conclusion is that you're one of those guys who can write the whole program once without even running it - and then it works perfectly on the first go. That's the only way I can explain those classes in the wrong package previously as well. No bug hunting tonight

          Mark Miller added a comment - - edited

          nope - something else - looking through the patch I see the files I want - a second attempt at patching has gone over better.

          A couple errors still, but stuff I think I can fix so that I can at least look over. False alarm. My patcher wonked out or something. I can resolve the few errors that popped up this time. Sweet.

          edit

Just for reference - not sure what happened the first time - my patch preview looked the same both times (was only complaining about the $id), but completely failed on attempt one and worked on attempt two - the only issue now appears to be that you have half switched deletedDocs to Bits from BitVector - but only halfway, so it's broken in a dozen places. Not sure what you are doing about size() and what not, so I'm just gonna read around.

          edit

          Yes - I found it - BitVector was supposed to implement Bits - which was in the patch ... this patch just did not want to apply. I guess it was right, but Eclipse just did not want it to take ...

          Mark Miller added a comment -

Bah - all this huffing and puffing over the patch and I'm too sick to stay up late anyway.

          Have you started benching at all? I'm seeing like a 40-50% drop in same reader search benches with standard, sep, and pulsing. Like 80% with intblock.

          Michael McCandless added a comment -

          Mark is there anything wrong w/ the patch? Did you get it working?

          Have you started benching at all? I'm seeing like a 40-50% drop in same reader search benches with standard, sep, and pulsing. Like 80% with intblock.

          I haven't but it sounds like you have! I'll get to it soon... but one thing I know is missing is the equivalent of the "terminfo cache" so that when a query 1) looks up docFreq of the term (to compute its weight), and 2) looks up the freq/prox offsets, that 2nd lookup is cached.

          IntBlock is expected to be slow – it naively encodes one int at a time using vInt. Ie, it's just a "test" codec, meant to be the base for real block-based codecs like pfor.

          Mark Miller added a comment -

          Mark is there anything wrong w/ the patch? Did you get it working?

I got it working - it didn't apply cleanly, but perhaps that was just me. It was a weird situation - I get a preview of what's going to happen with complaints, and it only complained about the $id issue in WildcardTermEnum - then half the patch failed. A second attempt and it only complained about that again - but then it missed making BitVector implement Bits - could just be ghosts in my machine. I wouldn't worry about it till someone else complains. In any case, I got it working in my case by just fixing the $id issue and adding implements Bits to BitVector.

          Mark Miller added a comment -

          I haven't but it sounds like you have!

Nothing serious. Just began trying to understand the code a bit more, so started with playing around with the different Codecs. Which led to just quickly trying out the micro bench with each of them.

          Michael McCandless added a comment -

          New patch attached. All tests pass.

          I simplified the TermsEnum.seek API, and added ord to the API. The
          ord is a long, but the standard codec (and, I think, Lucene today)
internally uses an int...

          Yonik Seeley added a comment - - edited

Another for the TermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc.

          I imagine "at or before" semantics would also work (like the current semantics of TermEnum in reverse)

          Mark Miller added a comment - - edited

Okay, I just tried a toy cache with standard - it's not perfect because the tests have a bunch that end up finding one doc short, and I don't turn off the cache for any reason (the old one turns it off when returning the SegmentTermEnum, but I didn't even try to understand that with the new stuff). But that appears to get the majority of the perf back. Went from about 3500 r/s to 7500 - the old is 8400.

          This stuff is so cool by the way.

          edit

whew - emphasis on toy - it's hard to do this right with docsreader

          Michael McCandless added a comment -

it's hard to do this right with docsreader

          I was thinking something along the lines of adding a "captureState" to DocsProducer.Reader, that returns an opaque object, and then adding a corresponding seek that accepts that object. It would chain to the positions reader.

          Then StandardTermsDictReader would hold the thread private cache, using this API.
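
A minimal sketch of what that capture/restore contract could look like (the interface and method names below are assumptions for illustration; the actual patch may shape this differently):

    // Illustrative only: the terms dict caches whatever captureState() returns
    // and hands it back on a cache hit, without knowing what is inside it.
    public interface StateCapturingReader {

      // Snapshot the reader's current position (chaining to the positions
      // reader) into an opaque object.
      Object captureState();

      // Jump back to a previously captured position, skipping the normal seek.
      void restoreState(Object capturedState);
    }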

          Mark Miller added a comment -

Well that's reassuring - I think I was on the right path then. I've got the thread private cache, and I was initially just capturing in's position so I could set it before calling readTerm after pulling from the cache - so I knew I had an issue with the positions reader in there too (the position of it in readTerm) - but didn't see the cleanest path to set and capture that without modifying the reader like you said - but I wasn't even sure I was on the right path, so that's about where I gave up.

          Your comment makes me feel a little less dumb about it all though.

          Michael McCandless added a comment -

No problem. Please post the patch once you have it working! We'll need to implement captureState/seek for the other codecs too. The pulsing case will be interesting since its state will hold the actual postings for the low freq case.

          BTW I think an interesting codec would be one that pre-loads postings into RAM, storing them uncompressed (eg docs/positions as simple int[]) or slightly compressed (stored as packed bits). This should be a massive performance win at the expense of sizable RAM consumption, ie it makes the same tradeoff as contrib/memory and contrib/instantiated.
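
A toy picture of the uncompressed variant of that idea (purely illustrative; a real codec would also hold freqs/positions and plug into the reader chain):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: postings held entirely in RAM as plain int[] per term,
    // trading memory for iteration speed with no decoding step.
    class RamPostingsSketch {

      private final Map<String, int[]> docsByTerm = new HashMap<String, int[]>();

      void addTerm(String term, int[] sortedDocIDs) {
        docsByTerm.put(term, sortedDocIDs);
      }

      int docFreq(String term) {
        int[] docs = docsByTerm.get(term);
        return docs == null ? 0 : docs.length;
      }
    }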

          Michael McCandless added a comment -

Another for the TermsEnum wishlist: the ability to seek to the term before the given term... useful for finding the largest value in a field, etc.
          I imagine "at or before" semantics would also work (like the current semantics of TermEnum in reverse)

          Right now seek(TermRef seekTerm) stops at the earliest term that's >=
          seekTerm.

          It sounds like you're asking for a variant of seek that'd stop at the
          latest term that's <= seekTerm?

          How would you use this to seek to the last term in a field? With the
          flex API, the TermsEnum only works with a single field's terms. So I
          guess we'd need TermRef constants, eg TermRef.FIRST and TermRef.LAST,
          that "act like" -infinity / +infinity.

          Michael McCandless added a comment -

          Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()). Though that'd only work for TermsEnum impls that support ords.

          Yonik Seeley added a comment -

          How would you use this to seek to the last term in a field?

          It's not just last in a field, since one may be looking for last out of any given term range (the highest value of a trie int is not the last value encoded in that field).
          So if you had a trie based field, one would find the highest value via seekAtOrBefore(triecoded(MAXINT))

          Actually, FIRST/LAST could be achieved with seek-by-ord (plus getUniqueTermCount()).

          Ahhh... right, prev could be implemented like so:

int ord = seek(triecoded(MAXINT)).ord
          seek(ord-1)

          Though that'd only work for TermsEnum impls that support ords.

          As long as ord is supported at the segment level, it's doable.
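
Sketched in code, the ord-based "term before" trick looks something like this (the enum methods are hypothetical stand-ins for the flex TermsEnum API, and edge cases such as an empty field are ignored):

    // Illustrative only: seek to the earliest term >= target, then step back
    // one ord to land on the term immediately before it.
    class PrevTermSketch {

      interface OrdTermsEnum {
        long seekCeil(String term);  // position at the earliest term >= term, return its ord
        void seekByOrd(long ord);    // position directly at the given ord
      }

      static void seekToTermBefore(OrdTermsEnum te, String target) {
        long ord = te.seekCeil(target);
        if (ord > 0) {
          te.seekByOrd(ord - 1);
        }
      }
    }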

          Mark Miller added a comment - - edited

          hmm - I think I'm close. Everything passes except for omitTermsTest, LazyProxTest, and for some odd reason the multi term tests. Getting close though.

My main concern at the moment is the state capturing. It seems I have to capture the state before readTerm in next() - but I might not use that state if there are multiple next calls before the hit. So that's a lot of wasted capturing. Have to deal with that somehow.

Doing things more correctly like this, the gain is much less significant. What really worries me is that my hack test was still slower than the old - and that skipped a bunch of necessary work, so it's almost a better-than-best case here - I think you might need more gains elsewhere to get back up to speed.

          edit

          Hmm - still no equivalent of the cached enum for one I guess.
And at the least, since you only cache when the scan is greater than one, you can at least skip one capture there...

          Michael McCandless added a comment -

          It seems I have to capture the state before readTerm in next()

          Wait, how come? It seems like we should only cache if we find exactly the requested term (ie, where we return SeekStatus.FOUND)? So you should only have to capture the state once, there?

          Hmm I wonder whether we should also cache the seek(ord) calls?

          Mark Miller added a comment -

          Hmm - I must have something off then. I've never been into this stuff much before.

          on a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex) - I'm just caching the freq, isIndex, and the positions with a CurrentState object. The captureCurrentState now telescopes down capturing the state of each object

Perhaps I'm off there - because if I do that, it seems I have to capture the state right before the call to readTerm in next() - otherwise readTerm will move everything forward before I can grab it when I actually put the state into the cache - when it's FOUND.

I may be all wet though - no worries - I'm really just playing around trying to learn some of this - the only way I learn is to code.

          Hmm I wonder whether we should also cache the seek(ord) calls?

I was wondering about that, but hadn't even gotten to thinking about it

          Michael Busch added a comment -

I added this cache originally because it seemed the easiest way to improve term lookup performance.

          Now we're adding the burden of implementing such a cache to every codec, right? Maybe instead we should improve the search runtime to not call idf() twice for every term?

          Michael McCandless added a comment -

          on a cache hit, I'm still calling docs.readTerm(entry.freq, entry.isIndex)

          Hmm... I think your cache might be one level too low? I think we want the cache to live in StandardTermsDictReader. Only the seek(TermRef) method interacts with the cache for now (until we maybe add ord as well).

          So, seek first checks if that term is in cache, and if so pulls the opaque state and asks the docsReader to restore to that state. Else, it does the normal seek, but then if the exact term is found, it calls docsReader.captureState and stores it in the cache.

          Make sure the cache lives high enough to be shared by different TermsEnum instances. I think it should probably live in StandardTermsDictReader.FieldReader. There is one instance of that per field.
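
Roughly the flow being described, sketched with stand-in types (a HashMap keyed by term text and plain Object state; the real code would live in StandardTermsDictReader and use the codec's capture/restore hooks):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: check the cache first; on a miss do the normal seek,
    // and cache the captured state only when the exact term was found.
    class CachedSeekSketch {

      private final Map<String, Object> termStateCache = new HashMap<String, Object>();

      boolean seek(String term) {
        Object cached = termStateCache.get(term);
        if (cached != null) {
          restoreDocsReader(cached);        // skip the terms-dict scan entirely
          return true;
        }
        boolean found = normalSeek(term);   // usual terms-dict lookup (stand-in)
        if (found) {
          termStateCache.put(term, captureDocsReaderState());
        }
        return found;
      }

      // Stand-ins for the codec's real seek and capture/restore machinery.
      private boolean normalSeek(String term)  { return term.length() > 0; }
      private Object captureDocsReaderState()  { return new Object(); }
      private void restoreDocsReader(Object s) { /* reposition docs/positions readers */ }
    }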

          Michael McCandless added a comment -

          Now we're adding the burden of implementing such a cache to every codec, right?

          I suspect most codecs will reuse the StandardTermsDictReader, ie, they will usually only change the docs/positions/payloads format. So each codec will only have to implement capture/restoreState.

          Maybe instead we should improve the search runtime to not call idf() twice for every term?

          Oh I didn't realize we call idf() twice per term – we should separately just fix that. Where are we doing that?

          (I thought the two calls were first for idf() and then 2nd when it's time to get the actual TermDocs/Positions to step through).

          Michael Busch added a comment -

          Oh I didn't realize we call idf() twice per term

          Hmm I take that back. I looked in LUCENE-1195 again:

          Currently we have a bottleneck for multi-term queries: the dictionary lookup is being done
          twice for each term. The first time in Similarity.idf(), where searcher.docFreq() is called.
          The second time when the posting list is opened (TermDocs or TermPositions).

          Hmm something's wrong with my memory this morning! Maybe the lack of caffeine

          Mark Miller added a comment -

Ah - okay - that helps. I think the cache itself is currently around the right level (StandardTermsDictReader, and it gets hit pretty hard), but I thought it was funky that I still had to make that read call - I think I see how it should work without that now, by just queuing up the docsReader to where it should be. We will see. Vacation till Tuesday - don't let me stop you from doing it correctly if it's on your timeline. Just playing over here - and I don't have a lot of time to play really.

          Michael McCandless added a comment -

          New patch attached. All tests pass.

          A few small changes (eg sync'd to trunk) but the biggest change is a
          new test case (TestExternalCodecs) that contains two new codecs:

          • RAMOnlyCodec – like instantiated, it writes and reads all
            postings into RAM in dedicated classes
          • PerFieldCodecWrapper – dispatches by field name to different
            codecs (this was asked about a couple times)

          The test indexes one field using the standard codec, and the other
          using the RAMOnlyCodec. It also verifies one can in fact make a
          custom codec external to oal.index.
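
The per-field dispatch idea in sketch form (Codec here is a stand-in interface; the real PerFieldCodecWrapper in the test will differ):

    import java.util.HashMap;
    import java.util.Map;

    // Illustrative only: route each field name to its own codec, falling back
    // to a default codec for unmapped fields.
    class PerFieldDispatchSketch {

      interface Codec { /* write/read hooks would live here in a real codec */ }

      private final Map<String, Codec> perField = new HashMap<String, Codec>();
      private final Codec defaultCodec;

      PerFieldDispatchSketch(Codec defaultCodec) {
        this.defaultCodec = defaultCodec;
      }

      void register(String fieldName, Codec codec) {
        perField.put(fieldName, codec);
      }

      Codec codecFor(String fieldName) {
        Codec codec = perField.get(fieldName);
        return codec != null ? codec : defaultCodec;
      }
    }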

          Mark Miller added a comment -

Okay, after all that poking around in the dark, tonight I decided to actually try turning on the DEBUG stuff you have and figuring out how things actually work. Always too lazy to open the instruction manual till I've wasted plenty of time spinning in circles.

          So I've got it working -

          When it was working like 99% I benched the speed at 6300-6500 r/s with the samerdr bench as compared to 9500-11000 with the trunk version I had checked out.

          But that last 1% meant adding two TermRef clones, and that dropped things to about 5800 or so.

I'm sure I might have a few wasteful instructions and/or there can be a little more eked out, but I think it will still come up short.

I don't see seek(ord) being called using Eclipse (other than in tests), but it may be missing it? So I'm not really sure if it needs to be cached or not - no code to test it with at the moment.

          Michael Busch added a comment -

          Shall we create a flexible-indexing branch and commit this?

          The downside of course is that we'd have to commit patches to trunk and this branch until 3.0 is out. Or we could use svn's new branch merging capabilities, which I haven't tried out yet.

          Michael McCandless added a comment -

          Shall we create a flexible-indexing branch and commit this?

          I think this is a good idea.

          But I haven't played heavily w/ svn & branching. EG if we branch now, and trunk moves fast (which it still is w/ deprecation removals), are we going to have conflicts? Or... is svn good about merging branches?

          Michael McCandless added a comment -

I don't see seek(ord) being called using Eclipse (other than in tests), but it may be missing it?

          Yeah this won't be used yet – we only just added it (and only to the flex API). I guess wait on caching it for now?

I'm sure I might have a few wasteful instructions and/or there can be a little more eked out, but I think it will still come up short.

OK, we've got some work to do. Which queries in particular are slower?

          Mark Miller added a comment -

Haven't gotten that far yet. Still just doing quick standard micro benches of each. I think I've got it around 6500 now - perhaps a little higher.

          I'll post the patch fairly soon - still struggling merging with your latest and trunk.

I think I've got it all except an issue with one of the contribs - must have gotten a little mis-merge. Also, your new external codecs test threw a monkey wrench in - pulsing isn't set up to work with the cache yet - I'm punting on that for now.

          Michael McCandless added a comment -

pulsing isn't set up to work with the cache yet - I'm punting on that for now.

          OK that's fine for now. The cache should gracefully handle codecs that don't implement "captureState" by simply not caching them.

          Mark Miller added a comment -

Here is my patch. I won't say it's 100% polished and done, but I believe it's in initial working order. This is a good check point time for me for various reasons.

          Simple LRU cache for Standard Codec - meant to replace TermInfo cache.

          Merged with latest patch from Mike + to trunk

          Some other little random stuff that I remember:

          PrefixTermsEnum is deprecated - sees itself - fixed

          WildcardQuery should have @see WildcardTermsEnums - fixed

          some stuff in preflex is already deprecated but not all?

          StandardDocsReader - freqStart is always 0 - left it in, but doesn't do anything at the moment

          backcompattests missing termref - fixed

note: currently, the testThreadSafety test in TestIndexReaderReopen appears to have some garbage collection issues with Java 6 - not really seeing them with Java 5 though - will investigate more.

I've got the latest tag updated too - but there appear to be some oddities with it (unrelated to this patch), so leaving it out for now.

          Mark Miller added a comment -

          Latest to trunk - still issues with GC and the reopen thread safety test (unless the test is run in isolation).

          Must be a tweak needed, but I'm not sure what. I'm closing the thread locals when the StandardTermsDictReader is closed - I don't see a way to improve on that yet.

          Mark Miller added a comment - - edited

Whoops - double-checked the wrong index splitter test - the multi-pass one is throwing a null pointer exception for me - don't think it's related to this patch, but I haven't checked.

          edit

Okay, just checked - it is this patch. Looks like perhaps something to do with LegacyFieldsEnum? Something that isn't being hit by core tests at the moment (I didn't run through all the backcompat tests with this yet, since that failed).

          Mark Miller added a comment -

          Looks pretty simple - the field is not getting set with LegacyFieldsEnum.

          Michael McCandless added a comment -

          OK I think I've committed Mark's last patch onto this branch:

          https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458

          and I also branched the 2.9 back-compat branch and committed the last back compat patch:

          https://svn.apache.org/repos/asf/lucene/java/branches/flex_1458_2_9_back_compat_tests

          Mark can you check it out & see if I missed anything?

          Uwe Schindler added a comment -

          By the way, a lot of these PriorityQueues can be generified like in trunk to remove the unneeded casts in lessThan, pop, insert,... everywhere.

          Michael McCandless added a comment -

          I just committed some small improvements to the ThreadLocal cache; all
          tests pass at 512M heap limit again.

          I think the reason why TestIndexReaderReopen was hitting the limit is
          because its testThreadSafety test opens many (344) IndexReaders at
          once, without closing them until the very end, and the standard codec
          is now using more starting RAM per reader because 1) the terms index
          uses a fixed minimal block size for the byte[], and 2) the new terms
          info cache is less RAM efficient.

          I've made some progress to "scale down" better:

          • Don't create a 1024 sized cache when total # terms is less than
            that
          • Cache a single thread-private TermsEnum, to re-use for docFreq
            lookups
          • Reduced what's stored in each cache entry
          • Made StandardDocsReader subclass CacheEntry to store its own
            stuff; saves one extra object per entry.
          Mark Miller added a comment - - edited

          // nocommit – why scanCnt > 1?
          //if (docs.canCaptureState() && scanCnt > 1) {

My mistake - an early mess-up when I was copying from the preflex caching code - I saw it doing this - but it's doing it with the cached enum - I should have been looking below where it doesn't do that. Just a leftover from early on when I was kind of shooting in the dark.

          edit

          I also had messed with it a bit - tried 0 and 2 - neither appeared to affect the micro bench samerdrsearch results. Seemed odd. Adding the cache did help those results, so I'd expect that changing that would affect things more.

          Mark Miller added a comment - - edited
              // nocommit -- not needed?  we don't need to sync since
              // only one thread works with this?
          
              /*
              @Override
              public synchronized Object put(Object key, Object value) {
                // TODO Auto-generated method stub
                return super.put(key, value);
              }
              
              @Override
              public synchronized Object get(Object key) {
                // TODO Auto-generated method stub
                return super.get(key);
              }
              */
          

          Whoops! I'm sorry! I wondered why I didn't have to replace all to get rid of that when I updated - I didn't mean to commit that! That was just part of my experimenting with the RAM blowout issue - was just making sure everything still worked without each thread having its own cache. That means the ThreadResources was out of whack too - I did have it as a member of the SegmentTermsEnum - I'm sorry - totally didn't mean to commit that!

          edit Also the stuff with the threadResourceSet and setting to null - just trying to figure out the mem issue - I did a bunch of debugging things and they all got caught up in a merge. Yuck.

          Mark Miller added a comment -

          // nocommit – wonder if simple double-barrel LRU cache
          // would be better

          Yeah - haven't considered anything about the cache being used - really just took the same cache that was being used to cache terminfos. The only reason I changed to my own impl over SimpleLRUCache was that I wanted to reuse the removed entry.

          // nocommit – we should not init cache w/ full
          // capacity? init it at 0, and only start evicting
          // once #entries is over our max

          Same here - I took the same thing the old cache was doing.
          Do we want to start it at 0 though? Perhaps a little higher? Doesn't it keep rehashing to roughly double the size? That could be a lot of resizing ...
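
For reference, one simple way to get "grow on demand, evict once over the max" behavior using plain java.util (a sketch only, not what the branch does; growth still rehashes, but only as entries actually arrive):

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Access-ordered LinkedHashMap: starts with a small table and evicts the
    // least-recently-used entry once the count exceeds maxEntries.
    class GrowingLruCache<K, V> extends LinkedHashMap<K, V> {

      private final int maxEntries;

      GrowingLruCache(int maxEntries) {
        super(16, 0.75f, true);   // small initial table, access-order for LRU
        this.maxEntries = maxEntries;
      }

      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;
      }
    }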

          Mark Miller added a comment -

Hmm - I'm still getting the heap space issue I think - it's always been somewhat intermittent - usually it happens when you run all the tests, but sometimes not. Same when you run the test class individually - sometimes it happens and sometimes it doesn't.

          Michael McCandless added a comment -

OK, thanks for addressing the new nocommits – you wanna remove them & commit as you find/comment on them? Can be our means of communicating through the branch

          For now, I don't think we need to explore improvements to the TermInfo cache (starting @ smaller size, simplistic double barrel LRU cache) – we can simply mimic trunk for now; such improvements are orthogonal here. Maybe switch those nocommits to TODOs instead?

          Hmm - I'm still getting the heap space issue I think

          Sigh. I think we have more work to do to "scale down" RAM used by IndexReader for a smallish index.

          Michael McCandless added a comment -

          you wanna remove them & commit as you find/comment on them?

          Woops, I see you already did! Thanks.

          Mark Miller added a comment -

          just committed an initial stab at pulsing cache support - could prob use your love again

          Oddly, the reopen test passed no problem and this adds more to the cache - perhaps I was seeing a ghost last night ...

          I'll know before too long.

          Mark Miller added a comment -

          Almost got an initial rough stab at the sep codec cache done - just have to get two more tests to pass involving the payload's state.

          Mark Miller added a comment -

          Hey Mike: you tweaked a couple little things with the standard cache capture state (showing that I'm a cheater, getting stuff to work that I haven't yet fully understood - my specialty) - what worries me is that they look like important little pieces if they are correct, but all tests passed without them. Hopefully we can get some tests in that catch these little off-by errors.

          Mark Miller added a comment -

          Okay, first pass for sep cache support is in - def needs to be trimmed down - it hits the heap issue with reopen every time. I'm using a state object with the Index objects, though, and I'm sure that can be done away with - though I guess a clone is not really much better, and there is no access to their guts at the moment. Works for a first pass though.

          Michael McCandless added a comment -

          you tweaked a couple little things with the standard cache capture state

          Actually I think I just moved things around? EG I made it the StandardTermsDictReader's job to seek the termsIn file, I moved docCount "up", and I made a single cache entry. I think I also removed a few attrs that we didn't need to store... and downgraded skipOffset from long -> int (it's int on trunk).

          Mark Miller added a comment -

          Actually I think I just moved things around? EG I made it the StandardTermsDictReader's job to seek the termsIn file, I moved docCount "up", and I made a single cache entry. I think I also removed a few attrs that we didn't need to store... and downgraded skipOffset from long -> int (it's int on trunk).

          Okay - that makes me feel a little better - I knew there was some unnecessary stuff, just hadn't gone through and figured out what could be stripped yet (there is likely the same thing with the new caches, but I don't think as much).

          The main thing I saw that made me worry, because I didn't think I had it, was:

                    posReader.positions.seekPending = true;
                    posReader.positions.skipOffset = posReader.proxOffset;
          

          But perhaps I was just accomplishing the same thing in a different manner? I'd have to go back and look - I just don't think I knew enough to set either of those correctly - but seeing it helped me figure out what the heck was wrong with the final payloads piece in Sep

          Michael McCandless added a comment -

          Ahh, I just changed your seek to be a lazy seek, in case the caller won't use the positions; though I think setting skipPosCount=0 (which I also added) should have been necessary even with the non-lazy seek. Probably we could get the TestCodecs test to tickle that bug, if we get a DocsEnum, get PositionsEnum, read a few docs but NOT the positions, then seek to a term we had already seeked to (so it uses the cache) then try to read positions. The positions should be wrong because skipPosCount will carry over a non-zero value.

          Michael McCandless added a comment -

          I just committed a fix for a major memory cost during TestIndexReaderReopen.

          The new terms dict index uses fixed byte[] blocks to hold the UTF8 bytes, of size 32 KB currently. But for a tiny segment this is very wasteful. So I fixed it to trim down the last byte[] block to free up the unused space. I think TestIndexReaderReopen should no longer hit OOMs.
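
          As an aside, here is a self-contained sketch of the kind of trimming described above. This is illustrative only, not the committed code; the class name is made up, and the 32 KB block size simply mirrors the description.

              import java.util.ArrayList;
              import java.util.Arrays;
              import java.util.List;

              // Illustrative sketch only: a buffer built from fixed-size byte[] blocks
              // where the last block is shrunk to the bytes actually used, so a tiny
              // segment does not pin a mostly-empty 32 KB block in RAM.
              class ByteBlockBuffer {
                static final int BLOCK_SIZE = 32 * 1024;

                final List<byte[]> blocks = new ArrayList<byte[]>();
                int upto = BLOCK_SIZE;          // write position within the current block

                void writeByte(byte b) {
                  if (upto == BLOCK_SIZE) {
                    blocks.add(new byte[BLOCK_SIZE]);
                    upto = 0;
                  }
                  blocks.get(blocks.size() - 1)[upto++] = b;
                }

                // Called once the buffer is complete (e.g. after loading the terms index):
                // replace the last block with a copy trimmed to the used length.
                void freeze() {
                  if (!blocks.isEmpty() && upto < BLOCK_SIZE) {
                    int last = blocks.size() - 1;
                    blocks.set(last, Arrays.copyOf(blocks.get(last), upto));
                  }
                }
              }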

          Mark Miller added a comment -

          Nice! Sep and Pulsing still need to be trimmed down though - or we consider their bloat acceptable (they still don't pass). Sep especially should be pretty trimmable I think. Pulsing is more of an issue because of the Document caching...

          Michael McCandless added a comment -

          Pulsing is more of an issue because of the Document caching...

          Yeah, we probably need to measure cache size by RAM usage, not sheer count. And, make it settable when you instantiate the codec.
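
          To make the idea concrete, here is a rough, self-contained sketch of what "bound the cache by estimated RAM, settable at instantiation" could look like. It is illustrative only; none of these names exist in the patch.

              import java.util.Iterator;
              import java.util.LinkedHashMap;
              import java.util.Map;

              // Hypothetical sketch: an LRU cache bounded by an estimate of the RAM its
              // entries consume rather than by a raw entry count.  The budget is passed
              // in by whoever instantiates it, mirroring the idea of making the limit
              // settable when you instantiate the codec.
              class RamBoundedCache<K, V> {
                interface SizeEstimator<V> {
                  long bytesUsed(V value);
                }

                private final long maxBytes;
                private final SizeEstimator<V> estimator;
                private final LinkedHashMap<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true);
                private long usedBytes;

                RamBoundedCache(long maxBytes, SizeEstimator<V> estimator) {
                  this.maxBytes = maxBytes;
                  this.estimator = estimator;
                }

                synchronized V get(K key) {
                  return map.get(key);
                }

                synchronized void put(K key, V value) {
                  V old = map.put(key, value);
                  if (old != null) {
                    usedBytes -= estimator.bytesUsed(old);
                  }
                  usedBytes += estimator.bytesUsed(value);
                  // Evict least-recently-used entries until we are back under budget.
                  Iterator<Map.Entry<K, V>> it = map.entrySet().iterator();
                  while (usedBytes > maxBytes && it.hasNext()) {
                    Map.Entry<K, V> eldest = it.next();
                    if (eldest.getKey().equals(key)) {
                      break;    // never evict the entry we just added
                    }
                    usedBytes -= estimator.bytesUsed(eldest.getValue());
                    it.remove();
                  }
                }
              }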

          Sep and Pulsing still need to be trimmed down though

          Are they causing OOMs with TestIndexReaderReopen? (I haven't tried yet).

          Mark Miller added a comment -

          Are they causing OOMs with TestIndexReaderReopen? (I haven't tried yet).

          Yes - they both def need polish too - I just got them working (passing all the tests), but haven't really finished them.

          Michael McCandless added a comment -

          I just committed contrib/benchmark/sortBench.py on the branch, to run
          perf tests comparing trunk to flex.

          You have to apply patches from LUCENE-2042 and LUCENE-2043 (until we
          resync branch).

          First edit the TRUNK_DIR and FLEX_DIR up top, and WIKI_FILE (it
          requires wiki export – all tests run against it), then run with "-run
          XXX" to test performance.

          It first creates the 5M doc index, for trunk and for flex, with
          multiple commit points holding increasing percentages of deletions (0%, 0.1%, 1%,
          10%), and then tests speed of various queries against it.

          I also fixed a bug in the standard codec's terms index reader.

          Michael McCandless added a comment -

          Initial results. Performance is quite catastrophically bad for the MultiTermQueries! Something silly must be up....

          JAVA:
          java version "1.5.0_19"
          Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
          Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)

          OS:
          SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

          Query | Deletes % | Tot hits | QPS old | QPS new | Pct change
          body:[tec TO tet] | 0.0 | body:[tec TO tet] | 3.06 | 0.23 | -92.5%
          body:[tec TO tet] | 0.1 | body:[tec TO tet] | 2.87 | 0.22 | -92.3%
          body:[tec TO tet] | 1.0 | body:[tec TO tet] | 2.85 | 0.22 | -92.3%
          body:[tec TO tet] | 10 | body:[tec TO tet] | 2.83 | 0.23 | -91.9%
          1 | 0.0 | 1 | 22.15 | 23.87 | 7.8%
          1 | 0.1 | 1 | 19.89 | 21.72 | 9.2%
          1 | 1.0 | 1 | 19.47 | 21.55 | 10.7%
          1 | 10 | 1 | 19.82 | 21.13 | 6.6%
          2 | 0.0 | 2 | 23.54 | 25.97 | 10.3%
          2 | 0.1 | 2 | 21.12 | 23.56 | 11.6%
          2 | 1.0 | 2 | 21.37 | 23.27 | 8.9%
          2 | 10 | 2 | 21.55 | 23.10 | 7.2%
          +1 +2 | 0.0 | +1 +2 | 7.13 | 6.97 | -2.2%
          +1 +2 | 0.1 | +1 +2 | 6.40 | 6.77 | 5.8%
          +1 +2 | 1.0 | +1 +2 | 6.41 | 6.64 | 3.6%
          +1 +2 | 10 | +1 +2 | 6.65 | 6.98 | 5.0%
          +1 -2 | 0.0 | +1 -2 | 7.78 | 7.95 | 2.2%
          +1 -2 | 0.1 | +1 -2 | 7.11 | 7.31 | 2.8%
          +1 -2 | 1.0 | +1 -2 | 7.18 | 7.27 | 1.3%
          +1 -2 | 10 | +1 -2 | 7.11 | 7.70 | 8.3%
          1 2 3 -4 | 0.0 | 1 2 3 -4 | 5.03 | 4.91 | -2.4%
          1 2 3 -4 | 0.1 | 1 2 3 -4 | 4.62 | 4.39 | -5.0%
          1 2 3 -4 | 1.0 | 1 2 3 -4 | 4.72 | 4.67 | -1.1%
          1 2 3 -4 | 10 | 1 2 3 -4 | 4.78 | 4.74 | -0.8%
          real* | 0.0 | real* | 28.40 | 0.19 | -99.3%
          real* | 0.1 | real* | 26.23 | 0.20 | -99.2%
          real* | 1.0 | real* | 26.04 | 0.20 | -99.2%
          real* | 10 | real* | 26.83 | 0.20 | -99.3%
          "world economy" | 0.0 | "world economy" | 18.82 | 17.83 | -5.3%
          "world economy" | 0.1 | "world economy" | 18.64 | 17.99 | -3.5%
          "world economy" | 1.0 | "world economy" | 18.97 | 18.35 | -3.3%
          "world economy" | 10 | "world economy" | 19.59 | 18.12 | -7.5%
          Michael McCandless added a comment -

          Committed fixes addressing silly slowness. You also need LUCENE-2044 patch, until we sync up with trunk again, to run sortBench.py.

          Part of the slowness was from MTQ queries incorrectly running the TermsEnum to exhaustion, instead of stopping when they hit their upperTerm. But, another part of the slowness was because sortBench.py was actually incorrectly testing flex branch against a trunk index. This is definitely something we have to test (it's what people will see when they use flex to search existing indexes – flex API emulated on the current index format), so, we'll have to address that slowness as well, but for now I want to test pure flex (flex API on a flex index).
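
          The MTQ fix is easy to picture with a generic sketch (this is not the actual MultiTermQuery code, just the shape of the fix): when enumerating a sorted term source for a range, stop as soon as the current term is past the upper bound instead of scanning the rest of the terms dict.

              import java.util.Arrays;
              import java.util.Iterator;

              // Generic illustration only -- not the MultiTermQuery code itself.
              public class StopAtUpperBound {
                static int countTermsUpTo(Iterator<String> sortedTerms, String upperTerm) {
                  int count = 0;
                  while (sortedTerms.hasNext()) {
                    String term = sortedTerms.next();
                    if (upperTerm != null && term.compareTo(upperTerm) > 0) {
                      break;   // past the upper bound: stop, don't run the enum to exhaustion
                    }
                    count++;
                  }
                  return count;
                }

                public static void main(String[] args) {
                  Iterator<String> terms = Arrays.asList("tea", "tec", "ted", "tet", "zoo").iterator();
                  System.out.println(countTermsUpTo(terms, "tet"));   // prints 4; "zoo" is never visited
                }
              }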

          Michael McCandless added a comment -

          OK new numbers after the above commits:

          JAVA:
          java version "1.5.0_19"
          Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
          Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)

          OS:
          SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

          Query | Deletes % | Tot hits | QPS old | QPS new | Pct change
          body:[tec TO tet] | 0.0 | 1934684 | 3.13 | 3.96 | 26.5%
          body:[tec TO tet] | 0.1 | 1932754 | 2.98 | 3.62 | 21.5%
          body:[tec TO tet] | 1.0 | 1915224 | 2.97 | 3.62 | 21.9%
          body:[tec TO tet] | 10 | 1741255 | 2.96 | 3.61 | 22.0%
          real* | 0.0 | 389378 | 27.80 | 28.73 | 3.3%
          real* | 0.1 | 389005 | 26.74 | 28.93 | 8.2%
          real* | 1.0 | 385434 | 26.61 | 29.04 | 9.1%
          real* | 10 | 350404 | 26.32 | 29.29 | 11.3%
          1 | 0.0 | 1170209 | 21.81 | 22.27 | 2.1%
          1 | 0.1 | 1169068 | 20.41 | 21.47 | 5.2%
          1 | 1.0 | 1158528 | 20.42 | 21.41 | 4.8%
          1 | 10 | 1053269 | 20.52 | 21.39 | 4.2%
          2 | 0.0 | 1088727 | 23.29 | 23.86 | 2.4%
          2 | 0.1 | 1087700 | 21.67 | 22.92 | 5.8%
          2 | 1.0 | 1077788 | 21.77 | 22.80 | 4.7%
          2 | 10 | 980068 | 21.90 | 23.04 | 5.2%
          +1 +2 | 0.0 | 700793 | 7.25 | 6.65 | -8.3%
          +1 +2 | 0.1 | 700137 | 6.58 | 6.33 | -3.8%
          +1 +2 | 1.0 | 693756 | 6.50 | 6.32 | -2.8%
          +1 +2 | 10 | 630953 | 6.73 | 6.37 | -5.3%
          +1 -2 | 0.0 | 469416 | 8.11 | 7.27 | -10.4%
          +1 -2 | 0.1 | 468931 | 7.02 | 6.61 | -5.8%
          +1 -2 | 1.0 | 464772 | 7.27 | 6.75 | -7.2%
          +1 -2 | 10 | 422316 | 7.28 | 6.99 | -4.0%
          1 2 3 -4 | 0.0 | 1104704 | 4.80 | 4.46 | -7.1%
          1 2 3 -4 | 0.1 | 1103583 | 4.74 | 4.40 | -7.2%
          1 2 3 -4 | 1.0 | 1093634 | 4.72 | 4.45 | -5.7%
          1 2 3 -4 | 10 | 994046 | 4.79 | 4.63 | -3.3%
          "world economy" | 0.0 | 985 | 19.43 | 16.79 | -13.6%
          "world economy" | 0.1 | 984 | 18.71 | 16.59 | -11.3%
          "world economy" | 1.0 | 970 | 19.65 | 16.86 | -14.2%
          "world economy" | 10 | 884 | 19.69 | 17.25 | -12.4%

          The term range query & prefix query are now a bit faster; boolean queries are somewhat slower; the phrase query shows the biggest slowdown...

          Mark Miller added a comment -

          I'll merge up when I figure out how -

          merge does not like the restoration of RussianLowerCaseFilter or the move of PatternAnalyzer. Not really sure why not yet. I'll try and play with it tonight.

          Michael McCandless added a comment -

          Yikes! That sounds challenging.

          Mark Miller added a comment -

          Indeed - the merging has been quite challenging - it's a bit unfair really - one of these days we will have to switch: I'll write the flexible indexing stuff, and you start doing the hard tasks.

          I'll commit the merge in a bit when the tests finish - might not get to the back compat branch, if it's needed, till tomorrow night though.

          Mark Miller added a comment -

          I still get OOMs on the reopen test every so often. Many times I don't, then sometimes I do.

          Michael McCandless added a comment -

          I'll write the flexible indexing stuff, and you start doing the hard tasks

          Don't you just have to press one button in your IDE?

          I still get OOMs on the reopen test every so often. Many times I don't, then sometimes I do.

          Hmm... I'll try to dig. This is with the standard codec, or, eg pulsing or intblock?

          Mark Miller added a comment -

          Don't you just have to press one button in your IDE?

          Ouch - that's like claiming all it takes to drive a Porsche Carrera GT is pushing the accelerator.

          Hmm... I'll try to dig. This is with the standard codec, or, eg pulsing or intblock?

          I'm talking standard - sep and pulsing def blow up - they still need some work in that regard - but you have gotten standard pretty darn close - it usually doesn't blow, but sometimes it still seems to (I guess depending on random factors in the test). intblock is still cacheless, so I don't think it ever blows.

          Michael McCandless added a comment -

          I removed all the "if (Codec.DEBUG)" lines in a local checkout and re-ran sortBench.py – looks like flex is pretty close to trunk now (on OpenSolaris, Java 1.5, at least):

          JAVA:
          java version "1.5.0_19"
          Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_19-b02)
          Java HotSpot(TM) Server VM (build 1.5.0_19-b02, mixed mode)

          OS:
          SunOS rhumba 5.11 snv_111b i86pc i386 i86pc Solaris

          Index /x/lucene/wiki.baseline.nd5M already exists...
          Index /x/lucene/wiki.flex.nd5M already exists...

          Query | Deletes % | Tot hits | QPS old | QPS new | Pct change
          body:[tec TO tet] | 0.0 | 1934684 | 2.95 | 4.04 | 36.9%
          body:[tec TO tet] | 0.1 | 1932754 | 2.86 | 3.73 | 30.4%
          body:[tec TO tet] | 1.0 | 1915224 | 2.88 | 3.69 | 28.1%
          body:[tec TO tet] | 10 | 1741255 | 2.86 | 3.74 | 30.8%
          real* | 0.0 | 389378 | 26.85 | 28.74 | 7.0%
          real* | 0.1 | 389005 | 25.83 | 26.96 | 4.4%
          real* | 1.0 | 385434 | 25.55 | 27.15 | 6.3%
          real* | 10 | 350404 | 25.38 | 28.10 | 10.7%
          1 | 0.0 | 1170209 | 21.75 | 21.80 | 0.2%
          1 | 0.1 | 1169068 | 20.39 | 22.02 | 8.0%
          1 | 1.0 | 1158528 | 20.35 | 21.88 | 7.5%
          1 | 10 | 1053269 | 20.48 | 21.96 | 7.2%
          2 | 0.0 | 1088727 | 23.37 | 23.42 | 0.2%
          2 | 0.1 | 1087700 | 21.61 | 23.49 | 8.7%
          2 | 1.0 | 1077788 | 21.85 | 23.46 | 7.4%
          2 | 10 | 980068 | 21.93 | 23.66 | 7.9%
          +1 +2 | 0.0 | 700793 | 7.29 | 7.32 | 0.4%
          +1 +2 | 0.1 | 700137 | 6.58 | 6.70 | 1.8%
          +1 +2 | 1.0 | 693756 | 6.60 | 6.68 | 1.2%
          +1 +2 | 10 | 630953 | 6.73 | 6.92 | 2.8%
          +1 -2 | 0.0 | 469416 | 8.07 | 7.69 | -4.7%
          +1 -2 | 0.1 | 468931 | 7.02 | 7.46 | 6.3%
          +1 -2 | 1.0 | 464772 | 7.31 | 7.12 | -2.6%
          +1 -2 | 10 | 422316 | 7.28 | 7.60 | 4.4%
          1 2 3 -4 | 0.0 | 1104704 | 4.83 | 4.52 | -6.4%
          1 2 3 -4 | 0.1 | 1103583 | 4.73 | 4.48 | -5.3%
          1 2 3 -4 | 1.0 | 1093634 | 4.75 | 4.46 | -6.1%
          1 2 3 -4 | 10 | 994046 | 4.87 | 4.65 | -4.5%
          "world economy" | 0.0 | 985 | 19.50 | 20.11 | 3.1%
          "world economy" | 0.1 | 984 | 18.65 | 19.76 | 6.0%
          "world economy" | 1.0 | 970 | 19.56 | 18.71 | -4.3%
          "world economy" | 10 | 884 | 19.58 | 20.19 | 3.1%
          Mark Miller added a comment -

          I've got a big merge coming - after a recent merge I noticed a bunch of things didn't merge at all - and looking back, I saw a few things that didn't merge properly previously as well. So I'm working on a file-by-file, line-by-line update that should be ready fairly soon.

          Uwe Schindler added a comment -

          If you are merging, you should simply replace the old 2.9 BW branch with the new 3.0 one I recently created for trunk.

          Mark Miller added a comment -

          Simply? What about the part where I have to merge the flexible indexing backward compat changes into the new branch, after first figuring out what those changes are? Okay, it's not that hard, but this backward branch stuff is my least favorite part.

          Mark Miller added a comment -

          Merged up - I've got to say, that was a nasty one. I think things are more in sync than they were, though.

          Michael McCandless added a comment -

          Thanks Mark! Hopefully, once 3.0 is out the door, the merging becomes a little less crazy. I was dreading carrying this through 3.0 and I'm very glad you stepped in

          Michael McCandless added a comment -

          I just committed a nice change on the flex branch: all term data in
          DocumentsWriter's RAM buffer is now stored as UTF8 bytes. Previously
          they were stored as char.

          I think this is a good step forward:

          • Single-byte UTF8 characters (ascii, including terms created by
            NumericField) now take half the RAM, which should lead to faster
            indexing (better RAM efficiency so less frequent flushing)
          • I now use the 0xff byte marker to mark the end of the term, which
            never appears in UTF-8; this should mean 0xffff is allowed again
            (though we shouldn't advertise it)
          • Merging & flushing should be a tad faster since the terms data now
            remains as UTF8 the whole time

          TermsConsumer now takes a TermRef (previously it took a char[] +
          offset), which makes it nicely symmetric with TermsEnum.

          Also I cleaned up the "nocommit not reads" – thanks Mark!
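
          A quick aside on the 0xff marker: valid UTF-8 never produces the bytes 0xFE or 0xFF (lead bytes top out at 0xF4 and continuation bytes at 0xBF), which is what makes 0xff safe as a terminator. A throwaway JDK-only check (not Lucene code) that brute-forces every code point:

              import java.nio.charset.StandardCharsets;

              // Throwaway sanity check, not Lucene code: encoding every Unicode code
              // point to UTF-8 never produces the byte 0xFF, so 0xFF is safe to use as
              // an end-of-term marker within the UTF-8 term bytes.
              public class Utf8MarkerCheck {
                public static void main(String[] args) {
                  for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
                    if (cp >= Character.MIN_SURROGATE && cp <= Character.MAX_SURROGATE) {
                      continue;  // surrogate code points are not encodable on their own
                    }
                    byte[] utf8 = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                    for (byte b : utf8) {
                      if ((b & 0xFF) == 0xFF) {
                        throw new AssertionError("0xFF produced for code point " + cp);
                      }
                    }
                  }
                  System.out.println("0xFF never appears in UTF-8 output");
                }
              }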

          Michael McCandless added a comment -

          I just committed changes to flex branch to make it possible for the
          codec to override how merging happens.

          Basically I refactored SegmentMerger's postings merging code
          (mergeTermInfos, appendPostings) onto Fields/Terms/Docs/PositionsConsumer,
          so that the base class provides a default impl for merging at each
          level but the codec can override if it wants. This should make issues
          like LUCENE-2082 easy for a codec to implement.
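
          Very roughly, the shape of that refactoring looks like the sketch below (hypothetical, simplified names - the real Fields/Terms/Docs/PositionsConsumer classes on the branch are more involved): the consumer base class carries a default merge implementation, and a codec subclass can override it, e.g. with a bulk-copy path.

              import java.io.IOException;

              // Hypothetical, simplified sketch of the pattern described above; the real
              // consumer classes on the branch look different.
              interface TermSource {
                boolean nextTerm() throws IOException;
                String term();
              }

              abstract class TermsConsumerSketch {
                // Default merge: walk the merged term stream and re-encode each term.
                void merge(TermSource source) throws IOException {
                  while (source.nextTerm()) {
                    startTerm(source.term());
                    appendPostings(source);
                    finishTerm();
                  }
                }

                abstract void startTerm(String term) throws IOException;
                abstract void appendPostings(TermSource source) throws IOException;
                abstract void finishTerm() throws IOException;
              }

              // A codec whose on-disk postings can be appended directly could override
              // merge() with a bulk-copy path instead of re-encoding term by term.
              class BulkCopyTermsConsumer extends TermsConsumerSketch {
                @Override
                void merge(TermSource source) throws IOException {
                  // codec-specific bulk copy would go here
                }

                @Override void startTerm(String term) {}
                @Override void appendPostings(TermSource source) {}
                @Override void finishTerm() {}
              }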

          Robert Muir added a comment - - edited

          edit: change supp char to <suppl. char> so erik can index this one too

          Mike, this change to byte[] in TermRef will break backwards compatibility, without some special attention paid to the utf-16 to utf-8 conversion.

          Imagine FuzzyQuery on a string starting with <suppl. char>, with a prefix of 1.
          This will create a prefix of U+D866, which is an unpaired lead surrogate.
          This is perfectly OK though, because we are not going to write it out in UTF-8 form; it is just being used for intermediate processing.
          Before, this would work just fine, because everything was an internal unicode string, so startsWith() would work just fine.

          Now it will no longer work, because it must be down-converted to a UTF-8 byte[].
          Whether you use getBytes() or UnicodeUtil, it will be replaced by U+FFFD, and the same code will not work.
          The standard provides that this kind of processing is OK for internal unicode strings; see Ch. 3, D89.
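
          A tiny, self-contained JDK-only sketch of the problem being described (not Lucene code; the code point below is just an arbitrary supplementary character whose lead surrogate happens to be U+D866):

              import java.nio.charset.StandardCharsets;

              // Self-contained illustration: the one-char "prefix" of a supplementary
              // character is an unpaired lead surrogate.  As a Java String it behaves
              // fine (startsWith works), but it cannot survive conversion to UTF-8
              // bytes, so any byte[]-based startsWith comparison sees a replacement
              // sequence instead of the original code unit.
              public class SurrogatePrefixDemo {
                public static void main(String[] args) {
                  String term = new String(Character.toChars(0x29B05));  // a suppl. CJK char
                  String prefix = term.substring(0, 1);                  // unpaired lead surrogate U+D866

                  System.out.println(term.startsWith(prefix));           // true: fine on Strings

                  byte[] prefixUtf8 = prefix.getBytes(StandardCharsets.UTF_8);
                  String roundTripped = new String(prefixUtf8, StandardCharsets.UTF_8);
                  System.out.println(term.startsWith(roundTripped));     // false: the surrogate was
                                                                         // replaced during the UTF-8
                                                                         // round trip
                }
              }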

          Robert Muir added a comment - - edited

          Here is a workaround you will not like.
          In the impl for FuzzyTermsEnum etc., we must not use TermRef.startsWith in its current state, due to this issue, if the prefix ends with an unpaired surrogate.
          In this case the String must be materialized each time from the TermRef for comparison.

          This is an example where using byte[] will start to make things a bit complicated. It is not really a fault in TermRef; it is due to how the enums are currently implemented -
          they will either need additional checks, or we will need special unicode conversion, so we can use things like TermRef.startsWith safely.

          edit: actually I do now think this is a fault in the TermRef/TermsEnum API. How do I seek to U+D866 in the term dictionary? I can do this with trunk...
          It is not possible with the flex branch, because you cannot represent this in a UTF-8 byte[].

          Robert Muir added a comment -

          test that passes on trunk, fails on branch.

          Michael McCandless added a comment -

          how do i seek to U+D866 in the term dictionary? I can do this with trunk...

          But, that's an unpaired surrogate? Ie, not a valid unicode character?
          It's nice that the current API lets you seek based on an unpaired
          surrogate, but that's not valid use of the API, right?

          I guess if we want we can assert that the incoming TermRef is actually valid
          unicode...

          Robert Muir added a comment -

          Michael, it is a valid unicode String though, this is ok, and such things are supported by the unicode standard.

          Also, perhaps it would help convince you if I instead wrote the code as .terms("𩬅".charAt(0));
          previously, naive treatment of text like this would work correctly; now, with byte[], it cannot.
          I hope you can start to see how many east asian applications will break because of this.

          http://www.unicode.org/notes/tn12/

          Robert Muir added a comment -

          Same test, coded in a slightly different way, to show how this can commonly happen.

          Michael, I urge you to reconsider this. Please read Ch. 2 and 3 of the unicode standard if you want to do this.
          The problem is, this substring is a valid unicode String. It is true it cannot be converted into valid UTF-8, but
          it's perfectly reasonable to use code units for internal processing like this; I am not attempting to write this data into the index or anything!

          I think data from TermRef for merging or writing to IndexWriter is completely different from data being used to search!
          I know you want an elegant encapsulation of both, but I think it's a broken design.

          I don't just make this up to be annoying; I have applications that will break because of this.

          Michael McCandless added a comment -

          perhaps it would help convince you if i instead wrote the code as .terms("𩬅".charAt(0));

          I realize a java String can easily contain an unpaired surrogate (eg,
          your test case) since it operates in code units not code points, but,
          that's not valid unicode, right?

          I mean you can't in general send such a string off to a library that
          works w/ unicode (like Lucene) and expect the behavior to be well
          defined. Yes, it's neat that Lucene allows that today, but I don't
          see that it's "supposed to".

          When we encounter an unpaired surrogate during indexing, we replace it
          w/ the replacement char. Why shouldn't we do the same when
          searching/reading the index?

          What should we do during searching if the unpaired surrogate is inside
          the string (not at the end)? Why should that be different?

          Please read Ch2 and 3 of the unicode standard if you want to do this.

          Doesn't this apply here? In "3.2 Conformance"
          (http://www.unicode.org/versions/Unicode5.0.0/ch03.pdf) is this first
          requirement (C1):

          • A process shall not interpret a high-surrogate code point or a
            low-surrogate code point as an abstract character.

          I hope you can start to see how many east asian applications will break because of this.

          But how would a search application based on an east asian language
          actually create such a term? In what situation would an unpaired
          surrogate find its way down to TermEnum?

          Eg when users enter searches, they enter whole unicode chars (code
          points) at once (not code units / unpaired surrogates)? I realize an
          app could programmatically construct eg a PrefixQuery that has an
          unpaired surrogate... but couldn't they just as easily pair it up
          before sending it to Lucene?

          i have applications that will break because of this.

          OK, can you shed some more light on how/when your apps do this?

          Robert Muir added a comment -

          I realize a java String can easily contain an unpaired surrogate (eg,
          your test case) since it operates in code units not code points, but,
          that's not valid unicode, right?

          It is valid unicode: it is a valid "Unicode String". This is different from a Term stored in the index, which will be stored as UTF-8 and thus purports to be in a valid unicode encoding form.

          However,
          the conformance clauses do not prevent processes from operating on code
          unit sequences that do not purport to be in a Unicode character encoding form.
          For example, for performance reasons a low-level string operation may simply
          operate directly on code units, without interpreting them as characters. See,
          especially, the discussion under D89.

          D89:
          Unicode strings need not contain well-formed code unit sequences under all conditions.
          This is equivalent to saying that a particular Unicode string need not be in a Unicode
          encoding form.
          • For example, it is perfectly reasonable to talk about an operation that takes the
          two Unicode 16-bit strings, <004D D800> and <DF02 004D>, each of which
          contains an ill-formed UTF-16 code unit sequence, and concatenates them to
          form another Unicode string <004D D800 DF02 004D>, which contains a well-formed
          UTF-16 code unit sequence. The first two Unicode strings are not in
          UTF-16, but the resultant Unicode string is.

          But how would a search application based on an east asian language
          actually create such a term? In what situation would an unpaired
          surrogate find its way down to TermEnum?

          I gave an example already, where they use FuzzyQuery with, say, a prefix of one.
          With the current code, even in the flex branch!!! this will create a lead surrogate prefix.
          There is code in the Lucene core that does things like this (which I plan to fix, and also try to preserve back compat!)
          This makes it impossible to preserve back compat.

          There is also probably a lot of non-Lucene east asian code that does similar things.
          For example, someone with data from Hong Kong almost certainly encounters suppl. characters, because
          they are part of Big5-HKSCS. They may not be smart enough to know about this situation, i.e. they might take a string, substring(0, 1), and do a prefix query.
          Right now this will work!

          This is part of the idea that, for most operations (such as prefix), supplementary characters work rather transparently in Java.
          If we do this, upgrading Lucene to support Unicode 4.0 will be significantly more difficult.

          OK, can you shed some more light on how/when your apps do this?

          Yes, see LUCENE-1606. This library uses UTF-16 intervals for transitions, which works fine because, for its matching purposes, this is transparent.
          So there is no need for it to be aware of suppl. characters. If we make this change, I will need to refactor/rewrite a lot of this code, most likely the underlying DFA library itself.
          This is working in production for me, on Chinese text outside of the BMP, with Lucene right now. With this change, it will no longer work, and the enumerator will most likely go into an infinite loop!

          The main difference here is semantics: before, IndexReader.terms() accepted as input any Unicode String. Now it would tighten that restriction to only interchangeable UTF-8 strings. Yet the input being used will not be stored as UTF-8 anywhere, and most certainly will not be interchanged! The paper I sent on UTF-16 mentions problems like this, because it's very reasonable and handy to use code units for processing, since suppl. characters are so rare.

          Robert Muir added a comment -

          Attached is a patch that provides a workaround for the back compat issue.
          In my opinion it does not hurt performance (though you should optimize this):
          when opening a TermEnum with IndexReader.terms(Term), the deprecated API,
          in LegacyTermEnum(Term t), if the term ends with a lead surrogate, tack on \uDC00 to emulate the old behavior.

          With this patch, my testcase passes.

          We might be able to work around these issues in similar ways for better backwards compatibility, while at the same time preserving performance.
          I think we should mention somewhere in the docs that the new API behaves a bit differently though, so people know to fix their code.
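
          Not the committed patch itself - just a minimal sketch of the trailing-lead-surrogate idea in plain Java (the helper name is made up for illustration):

              // Illustrative helper only (name made up), not the committed patch: if the
              // term text ends with an unpaired lead (high) surrogate, append \uDC00, the
              // lowest trail surrogate, so the text becomes a complete surrogate pair that
              // can be converted to UTF-8 and seeks to roughly where the old UTF-16
              // String-based enumeration would have positioned itself.
              static String fixTrailingLeadSurrogate(String termText) {
                if (termText.length() > 0
                    && Character.isHighSurrogate(termText.charAt(termText.length() - 1))) {
                  return termText + '\uDC00';
                }
                return termText;
              }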

          Michael McCandless added a comment -

          if the term ends with a lead surrogate, tack on \uDC00 to emulate the old behavior.

          OK I think this is a good approach, in the "emulate old on flex" layer, and then in the docs for TermRef call out that the incoming String cannot contain unpaired surrogates?

          Can you commit this, along with your test? Thanks!

          Robert Muir added a comment -

          OK I think this is a good approach, in the "emulate old on flex" layer, and then in the docs for TermRef call out that the incoming String cannot contain unpaired surrogates?

          Just so you know, it's not perfect back compat though.
          For perfect back compat I would have to iterate through the string looking for unpaired surrogates... at which point you truncate after it, and tack on \uDC00 if it's a high surrogate.
          If it's an unpaired low surrogate, I am not actually sure what the old API would do. My guess would be to replace with U+F000, but it depends how this was being handled before.

          The joys of UTF-16 vs UTF-8 binary order...

          I didn't do any of this, because in my opinion fixing just the "trailing lead surrogate" case is all we should worry about, especially since the Lucene core itself does this.

          I'll commit the patch and test, we can improve it in the future if you are worried about these corner-corner-corner cases, no problem.
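          A rough sketch of that fuller fix (assumption: illustrative only, this is not what the patch does): scan in code units for the first unpaired surrogate, truncate after it, and pair a trailing high surrogate with \uDC00.

            // Sketch only (not the committed workaround): walk the string in code units,
            // stop at the first unpaired surrogate, truncate after it, and pair a
            // trailing high surrogate with U+DC00.
            static String truncateAtFirstUnpairedSurrogate(String text) {
              for (int i = 0; i < text.length(); i++) {
                char c = text.charAt(i);
                if (Character.isHighSurrogate(c)) {
                  if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++;                                            // well-formed pair: skip the low surrogate
                  } else {
                    return text.substring(0, i + 1) + '\uDC00';     // unpaired high surrogate
                  }
                } else if (Character.isLowSurrogate(c)) {
                  return text.substring(0, i + 1);                  // unpaired low surrogate: old behavior unclear
                }
              }
              return text;
            }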

          Robert Muir added a comment -

          The patch and test are in revision 883485.
          I added some javadocs to TermRef's String constructor as well.

          Robert Muir added a comment -

          Mike, what to do about MultiTermQueries now?
          They still have some problems, especially with regard to doing 'startsWith' on some constant prefix, which might be an unpaired lead surrogate (a Lucene problem).

          I guess we need to specialize this case in their FilteredTermEnum (not TermsEnum), and if they are doing this stupid behavior, return null from getTermsEnum()?
          And force it to the old TermEnum, which has some back compat shims for this case?

          Robert Muir added a comment -

          Also, I am curious in general whether we support any old index formats that might contain unpaired surrogates or \uFFFF in the term text.

          This will be good to know when trying to fix Unicode 4.0 issues, especially if we are doing things like compareTo() or startsWith() on the raw bytes.

          Michael McCandless added a comment -

          LUCENE-510 (fixed in the 2.4 release) cut new indexes over to UTF-8.

          Before 2.4, here's what IndexOutput.writeChars looked like:

            public void writeChars(String s, int start, int length)
                 throws IOException {
              final int end = start + length;
              for (int i = start; i < end; i++) {
                final int code = (int)s.charAt(i);
                if (code >= 0x01 && code <= 0x7F)
                  writeByte((byte)code);                            // 1 byte for ASCII (except NUL)
                else if (((code >= 0x80) && (code <= 0x7FF)) || code == 0) {
                  writeByte((byte)(0xC0 | (code >> 6)));            // 2 bytes; NUL gets the modified-UTF-8 form
                  writeByte((byte)(0x80 | (code & 0x3F)));
                } else {
                  // 3 bytes for everything else, including lone surrogate code units and \uFFFF
                  writeByte((byte)(0xE0 | (code >>> 12)));
                  writeByte((byte)(0x80 | ((code >> 6) & 0x3F)));
                  writeByte((byte)(0x80 | (code & 0x3F)));
                }
              }
            }
          

          which I think can represent unpaired surrogates & \uFFFF just fine?

          Yonik Seeley added a comment -

          In general, I think things like unpaired surrogates should be undefined, giving us more room to optimize.

          Michael McCandless added a comment -

          Also, on the flex branch I believe \uFFFF is no longer "reserved" by Lucene, but we should not advertise that! Terms data is stored in DocumentsWriter as UTF-8 bytes, and I use the 0xff byte (an invalid UTF-8 byte) as the end marker.

          Michael McCandless added a comment -

          the patch and test are in revision 883485.
          I added some javadocs to TermRef where it takes a String constructor as well.

          Thanks Robert!

          Mike, what to do about MultiTermQueries now?
          they still have some problems, especially with regards to doing 'startsWith' some constant prefix, which might be unpaired lead surrogate (lucene problem)

          Maybe open a new issue for this? Or, don't we already have an issue open to fix how various queries handle surrogates? Or I guess we could fix such queries to pair up the surrogate (add \uDC00)?

          Robert Muir added a comment -

          In general, I think things like unpaired surrogates should be undefined, giving us more room to optimize.

          I feel this is not an option when Lucene is the one creating the problem (i.e. our MultiTermQueries, which are unaware of UTF-32 boundaries).

          Robert Muir added a comment -

          Maybe open a new issue for this? Or, don't we already have an issue open to fix how various queries handle surrogates? Or I guess we could fix such queries to pair up the surrogate (add \uDC00)?

          Mike, I have an issue open, for trunk. But it is not such a problem on trunk, because things work "as expected" in UTF-16 space.
          The move to byte[] really creates the problem, because the existing problems in trunk, which happened to work, start to completely fail in UTF-8 space.
          And unfortunately, we can't use the \uDC00 trick for startsWith.

          Michael McCandless added a comment -

          Well, for starters can't we just toString() the TermRef on every compare? Then we're back in UTF16 space.

          It's not as good as flex can be (ie doing the checks in UTF8 space), but it should still be faster than trunk today, so this shouldn't block flex landing, right?

          Robert Muir added a comment -

          This one is more serious.
          The change to byte[] changes the sort order of Lucene (at least of TermEnum).

          Attached is a test that passes on trunk and fails on the branch.
          In trunk, things sort in UTF-16 binary order.
          In the branch, things sort in UTF-8 binary order.
          These are different...
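          A small illustration of the difference (assumption: the strings below are arbitrary examples chosen to show it, not taken from the attached test):

            // Illustration only: U+FF01 is a BMP char above U+E000, U+1F600 is supplementary.
            public class SortOrderDemo {
              public static void main(String[] args) {
                String bmp  = "\uFF01";        // UTF-16: FF01        UTF-8: EF BC 81
                String supp = "\uD83D\uDE00";  // UTF-16: D83D DE00   UTF-8: F0 9F 98 80

                // trunk (UTF-16 code unit order): the supplementary char sorts first
                System.out.println(bmp.compareTo(supp) > 0);   // true

                // branch (UTF-8 binary order = code point order): the BMP char sorts first,
                // since its lead byte 0xEF is less than the supplementary lead byte 0xF0
                byte[] a = bmp.getBytes(java.nio.charset.Charset.forName("UTF-8"));
                byte[] b = supp.getBytes(java.nio.charset.Charset.forName("UTF-8"));
                System.out.println((a[0] & 0xff) < (b[0] & 0xff));  // true
              }
            }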

          Robert Muir added a comment -

          Mike, if it means anything, I prefer the new behavior... real code point order.
          But this is a compat problem, I think.

          Michael McCandless added a comment -

          in trunk, things sort in UTF-16 binary order.
          in branch, things sort in UTF-8 binary order.
          these are different...

          Ugh! In the back of my mind I almost remembered this... I think this
          was one reason why I didn't do this back in LUCENE-843 (I think we had
          discussed this already, then... though maybe I'm suffering from déjà
          vu). I could swear at one point I had that fixup logic implemented in
          a UTF-8/16 comparison method...

          UTF-8 sort order (what the flex branch has switched to) is true Unicode
          code point sort order, while UTF-16 order is not, once there are surrogate
          pairs as well as high (>= U+E000) Unicode chars. Sigh....

          So this is definitely a back compat problem. And, unfortunately, even
          if we like the true codepoint sort order, it's not easy to switch to
          in a back-compat manner because if we write new segments into an old
          index, SegmentMerger will be in big trouble when it tries to merge two
          segments that had sorted the terms differently.

          I would also prefer true codepoint sort order... but we can't break
          back compat.

          Though it would be nice to let the codec control the sort order – eg
          then (I think?) the ICU/CollationKeyFilter workaround wouldn't be
          needed.

          Fortunately the problem is isolated to how we sort the buffered
          postings when it's time to flush a new segment, so I think w/ the
          appropriate fixup logic (eg your comment at
          https://issues.apache.org/jira/browse/LUCENE-1606?focusedCommentId=12781746&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12781746)
          when comparing terms in oal.index.TermsHashPerField.comparePostings
          during that sort, we can get back to UTF-16 sort order.
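          One way that fixup could look (a sketch only, following the UTF-16-in-UTF-8 ordering trick from the paper Robert referenced; this is an assumption about the approach, not necessarily the code that gets committed): compare the UTF-8 bytes, but at the first differing byte shift the 0xEE/0xEF lead bytes (U+E000..U+FFFF) above 0xF0..0xF4 (supplementary characters), which is exactly where UTF-8 and UTF-16 order disagree.

            // Sketch: compare two UTF-8 encoded terms so the result matches UTF-16 code unit order.
            static int compareUTF8BytesAsUTF16(byte[] a, byte[] b) {
              final int len = Math.min(a.length, b.length);
              for (int i = 0; i < len; i++) {
                int aByte = a[i] & 0xff;
                int bByte = b[i] & 0xff;
                if (aByte != bByte) {
                  // fix up only where UTF-8 and UTF-16 order differ:
                  // lead bytes 0xEE/0xEF (U+E000..U+FFFF) must sort above 0xF0..0xF4 (surrogate pairs)
                  if (aByte >= 0xee && bByte >= 0xee) {
                    if ((aByte & 0xfe) == 0xee) aByte += 0x10;
                    if ((bByte & 0xfe) == 0xee) bByte += 0x10;
                  }
                  return aByte - bByte;
                }
              }
              return a.length - b.length;  // shared prefix: the shorter term sorts first
            }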

          Robert Muir added a comment -

          Though it would be nice to let the codec control the sort order - eg
          then (I think?) the ICU/CollationKeyFilter workaround wouldn't be
          needed.

          I like this idea by the way, "flexible sorting". Although I like code point order better than code unit order, I hate binary order in general, to be honest.

          It's nice that we have 'indexable'/fast collation right now, but it's maybe not what users expect either (binary keys encoded into text).

          Michael McCandless added a comment -

          i hate binary order in general to be honest.

          But binary order in this case is code point order.

          Robert Muir added a comment -

          Mike, I guess I mean I'd prefer UCA order, which isn't just the order code points happened to randomly appear on the charts, but is actually designed for sorting and ordering things.

          Michael McCandless added a comment -

          Mike, I guess I mean i'd prefer UCA order, which isn't just the order codepoints happened to randomly appear on charts, but is actually designed for sorting and ordering things

          Ahh, gotchya. Well if we make the sort order pluggable, you could do that...

          Robert Muir added a comment -

          Ahh, gotchya. Well if we make the sort order pluggable, you could do that...

          Yes, then we could consider getting rid of the Collator/Locale-based range queries / sorts and things like that completely... which have performance problems.
          You would have a better way to do it...

          But if you change the sort order, any part of Lucene sensitive to it might break... maybe it's dangerous.

          Maybe if we do it, it needs to be exposed properly so other components can change their behavior.

          Michael McCandless added a comment -

          Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat.

          Robert Muir added a comment - - edited

          Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat.

          Agreed, changing the sort order breaks a lot of things (not just some crazy seeking-around code that I write).

          I.e., if 'ch' is a single character in some collator and sorts after 'b', before 'c' (a completely made-up example, though there are real ones like this),
          then even PrefixQuery itself will fail!

          Edit: a better example is French collation, where the weight of accent marks is applied in reverse order.
          PrefixQuery would make assumptions based on the prefix which are wrong.

          Uwe Schindler added a comment -

          ...not to mention TermRangeQueries and NumericRangeQueries. They rely on String.compareTo, like the current terms dict.

          DM Smith added a comment -

          Yes, this (customizing comparator for termrefs) would definitely be very advanced stuff... you'd have to create your own codec to do it. And we'd default to UTF16 sort order for back compat.

          For those of us working on texts in all different kinds of languages, it should not be very advanced stuff. It should be stock Lucene. A default UCA comparator would be good. And a way to provide a locale sensitive UCA comparator would also be good.

          My use case is that each Lucene index typically has a single language or at least has a dominant language.

          ...not to talk about TermRangeQueries and NumericRangeQueries. They rely on String.compareTo like the current terms dict.

          I think that String.compareTo works correctly on UCA collation keys.

          Robert Muir added a comment -

          I think that String.compareTo works correctly on UCA collation keys.

          No, because UCA collation keys are bytes.
          You are right that byte comparison on these keys works, though.
          But if we change the sort order like this, various components are not looking at keys; instead they are looking at the term text itself.

          I guess what I am saying is that there are a lot of assumptions in Lucene right now (PrefixQuery was my example) that look at term text and assume it is sorted in binary order.

          It should be stock Lucene

          As much as I agree with you that default UCA should be "stock Lucene" (with the capability to use an alternate locale or even a tailored collator), this creates some practical problems, as mentioned above.
          There is also the practical problem that collation in the JDK is poor and we would want ICU for good performance...
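          For illustration (a sketch only; this is not CollationKeyFilter, and the locale is an arbitrary choice), the keys in question are opaque byte arrays:

            // Sketch: a UCA-style order comes from comparing collation-key bytes,
            // not the term text itself. Prefix reasoning on the original term text
            // no longer matches this order.
            static int compareByCollationKey(String a, String b) {
              java.text.Collator collator = java.text.Collator.getInstance(java.util.Locale.FRENCH);
              byte[] ka = collator.getCollationKey(a).toByteArray();
              byte[] kb = collator.getCollationKey(b).toByteArray();
              for (int i = 0; i < Math.min(ka.length, kb.length); i++) {
                int diff = (ka[i] & 0xff) - (kb[i] & 0xff);   // unsigned byte comparison
                if (diff != 0) return diff;
              }
              return ka.length - kb.length;
            }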

          Robert Muir added a comment - - edited

          So this is definitely a back compat problem. And, unfortunately, even
          if we like the true codepoint sort order, it's not easy to switch to
          in a back-compat manner because if we write new segments into an old
          index, SegmentMerger will be in big trouble when it tries to merge two
          segments that had sorted the terms differently.

          Mike, I think it goes well beyond this.
          I think sort order is an exceptional low-level case that can trickle all the way up into the application layer (including user perception itself) and create bugs.
          Does a non-technical user in Hong Kong know how many code units each ideograph they enter takes?
          Should they care? They will just not understand if things come back in a different order.

          I think we are stuck with UTF-16 without a huge effort, which would not be worth it in any case.

          Michael McCandless added a comment -

          OK I finally worked out a solution for the UTF16 sort order problem
          (just committed).

          I added a TermRef.Comparator class, for comparing TermRefs, and I
          removed TermRef.compareTo, and fixed all low-level places in Lucene
          that rely on sort order of terms to use this new API instead.

          I changed the Terms/TermsEnum/TermsConsumer API, adding a
          getTermComparator(), ie, the codec now determines the sort order for
          terms in each field. For the core codecs (standard, pulsing,
          intblock) I default to UTF16 sort order, for back compat, but you
          could easily instantiate it yourself and use a different term sort.

          I changed TestExternalCodecs to test this new capability, by sorting 2
          of its fields in reversed unicode code point order.

          While this means your codec is now completely free to define the
          term sort order per field, in general Lucene queries will not behave
          right if you do this, so it's obviously a very advanced use case.

          I also changed (yet again!) how DocumentsWriter encodes the term
          bytes, to record the length (in bytes) of the term up front, followed by the
          term bytes (vs the trailing 0xff that I had switched to). The length
          is a 1 or 2 byte vInt, i.e. if it's < 128 it's 1 byte, else 2 bytes.
          This approach means the TermRef.Comparator doesn't have to deal with
          0xff's (which was messy).

          I think this also means that, to the flex API, a term is actually
          opaque – it's just a series of bytes. It need not be UTF8 bytes.
          However, all of analysis, and then how TermsHash builds up these
          byte[]s, and what queries do with these bytes, is clearly still very
          much Unicode/UTF8. But one could, in theory (I haven't tested this!)
          separately use the flex API to build up a segment whose terms are
          arbitrary byte[]'s, eg maybe you want to use 4 bytes to encode int
          values, and then interact with those terms at search time
          using the flex API.
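          A minimal sketch of that length-prefixed layout (assumption: names are made up and this is not DocumentsWriter's actual buffer code):

            // Sketch: write the term's byte length as a 1-2 byte vInt before the term
            // bytes, so readers never have to scan for an end marker such as 0xff.
            static int writeTermBytes(byte[] buffer, int pos, byte[] term, int termLen) {
              if (termLen < 128) {
                buffer[pos++] = (byte) termLen;                     // 1-byte length
              } else {
                buffer[pos++] = (byte) (0x80 | (termLen & 0x7f));   // low 7 bits + continuation bit
                buffer[pos++] = (byte) (termLen >>> 7);             // high bits (termLen < 16384 assumed)
              }
              System.arraycopy(term, 0, buffer, pos, termLen);
              return pos + termLen;                                 // next free position in the buffer
            }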

          Uwe Schindler added a comment -

          Hi Mike,

          I looked into your commit, looks good. You are right with your comment in NRQ: it will only work with UTF-8 or UTF-16 order. Ideally NRQ would simply not use string terms at all and would work directly on the byte[], which should then be ordered in binary order.

          Two things:

          • The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in the same package, but that's not supported). So the enum with the nocommit mark can be removed.
          • I changed the logic of the TermEnum in trunk and 3.0 (it no longer works recursively, see LUCENE-2087). We should change this here, too. This also makes the enum simpler (and it looks more like the Automaton one). In trunk/3.0 the methods setEnum() and endEnum() both now throw UOE.

          I will look into these two changes tomorrow and change the code.

          Uwe

          Robert Muir added a comment - - edited

          Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.

          But isn't this what it does already with the TermsEnum API? The TermRef itself is just byte[], and NRQ precomputes all the TermRefs it needs up front; there is no Unicode conversion there.

          Edit: btw Uwe, the comparator is essentially just comparing bytes; the 0xee/0xef "shifting" should never take place with NRQ because those bytes will never occur in a numeric field...

          Uwe Schindler added a comment -

          Robert: I know, which is why I said it works with the UTF-8/UTF-16 comparator. It would not work with a reverse comparator such as Mike uses in the test.

          With "directly on byte[]" I meant that it would not use chars at all and would directly encode the numbers into byte[] using the full 8 bits per byte. The resulting byte[] would never be UTF-8, but if the new TermRef API and also the TokenStreams were able to handle this, it would be fine. Only the terms format would change.

          Robert Muir added a comment -

          With directly on bytes[] I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would be never UTF-8, but if the new TermRef API would be able to handle this and also the TokenStreams, it would be fine. Only the terms format would change.

          Uwe, it looks like you can do this now (with the exception of TokenStreams).

          As a partial solution that does work with TokenStreams, you could use IndexableBinaryStringTools, whose output won't change between any of the Unicode sort orders (it does not encode into any Unicode range where UTF-8/UTF-32 and UTF-16 order differ). With this you could just compare bytes as well, but you still would not have the "full 8 bits per byte".

          Uwe Schindler added a comment - - edited

          A partial solution for you which does work with tokenstreams, you could use indexablebinarystring which won't change between any unicode sort order... (it will not encode in any unicode range where there is a difference between the UTF-8/UTF32 and UTF-16). With this you could just compare bytes also, but you still would not have the "full 8 bits per byte"

          This would not change anything; it would only make the format incompatible. With 7 bits/char, the current UTF-8 coded index is the smallest possible one (even IndexableBinaryStringTools would cost more bytes in the index: if you used 14 of the 16 bits/char, most chars would take 3 bytes in the index because of UTF-8, vs. 2 bytes with the current encoding; only the char[]/String representation would take less space than it does now. See the discussion with Yonik about this and why we have chosen 7 bits/char. Also, en-/decoding is much faster).

          For the TokenStreams: the idea is to create an additional attribute, BinaryTermAttribute, that holds a byte[]. If some TokenStream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte – the new AttributeSource API was created precisely for such customizations (not possible with Token).
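          A sketch of that idea only (assumption: BinaryTermAttribute does not exist yet; the name and methods below are hypothetical):

            // Hypothetical attribute: lets a TokenStream hand the indexer an opaque
            // byte[] term instead of char[] term text.
            public interface BinaryTermAttribute extends org.apache.lucene.util.Attribute {
              void setBytes(byte[] bytes, int offset, int length);  // set the term's raw bytes
              byte[] bytes();
              int offset();
              int length();
            }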

          Robert Muir added a comment -

          Uwe, you are right that the terms would be larger, but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one matters most to NRQ, really.

          Yeah, I agree that encoding directly to byte[] is the way to go though; this would be nice for collation too...

          Uwe Schindler added a comment -

          As the codec is per-field, we could also add an Attribute to TokenStream that holds the codec (the default being Standard). The indexer would just use the codec for the field from the TokenStream. NumericTokenStream would use a NumericCodec (just thinking...) – off to sleep now.

          Uwe Schindler added a comment -

          Uwe you are right that the terms would be larger but they would have a more distinct alphabet (byte range) and might compare faster... I don't know which one is most important to NRQ really.

          The new TermsEnum directly compares the byte[] arrays. Why should they compare faster when encoded by IndexableBinaryStringTools? Fewer bytes are faster to compare (it's essentially one CPU instruction if optimized into a native x86/x64 loop). It might be faster if we needed to decode to char[], but that's not the case (in the flex branch).

          Michael McCandless added a comment -

          I changed the logic in the TermEnum in trunk and 3.0 (it no longer works recursive, see LUCENE-2087). We should change this here, too.

          Mark has been periodically re-syncing changes down from trunk... we should probably just let this change come in through his process (else I think we cause more conflicts).

          The legacy NumericRangeTermEnum can be removed completely and the protected getEnum() should simply throw UOE. NRQ cannot be subclassed and nobody can call this method (maybe only classes in same package, but thats not supported). So the enum with the nocommit mark can be removed

          Ahh excellent. Wanna commit that when you get a chance?

          Ideally NRQ would simply not use string terms at all and work directly on the byte[], which should then be ordered in binary order.

          That'd be great!

          With directly on bytes[] I meant that it could not use chars at all and directly encode the numbers into byte[] with the full 8 bits per byte. The resulting byte[] would be never UTF-8, but if the new TermRef API would be able to handle this and also the TokenStreams, it would be fine. Only the terms format would change.

          Right, this is a change in analysis -> DocumentsWriter – somehow we have to allow a Token to carry a byte[] that is directly indexed as the opaque term. At search time NRQ is all byte[] already (unlike other queries, which are new String()'ing every term on the enum).

          Robert Muir added a comment -

          Why should they compare faster when encoded by IndexableBinaryStringTools?

          Because it compares from left to right, so even if the terms are 10x as long, if they differ 2x as quickly it's better?

          I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" models in Lucene, one that NRQ uses and one that collation uses!?

          Michael McCandless added a comment -

          The idea is to create an additional Attribute: BinaryTermAttribute that holds byte[]. If some tokenstream uses this attribute instead of TermAttribute, the indexer would choose to write the bytes directly to the index. NumericTokenStream could use this attribute and encode the numbers directly to byte[] with 8 bits/byte. - the new AttributeSource API was created just because of such customizations (not possible with Token).

          This sounds like an interesting approach! We'd have to work out some details... e.g. you presumably can't mix char[] terms and byte[] terms in the same field.

          Uwe Schindler added a comment -

          because it compares from left to right, so even if the terms are 10x as long, if they differ 2x as quick its better?

          It would not compare faster, because in UTF-8 encoding only 7 bits are used for encoding the chars. The 8th bit is just a marker (simply put). Whether this marker is always 0 or always 1 makes no difference; in UTF-8 only 7 bits/byte are used for data, and in the 3rd byte of a UTF-8 sequence even more bits go unused!

          I hear what you are saying about ASCII-only encoding, but if NRQ's model is always best, why do we have two separate "encode byte[] into char[]" models in lucene, one that NRQ is using, and one that collation is using!?

          I do not know who made this IndexableBinaryStrings encoding, but it would not work for NRQ at all with current trunk (too complicated during indexing and decoding, because for NRQ we also need to decode such char[] very fast to populate the FieldCache). But as discussed with Yonik (I do not remember the issue), the ASCII-only encoding should always perform better (but it needs more memory in trunk, as char[] is used during indexing – I think that is why it was added). So the difference is not speed, it's memory consumption.

          Robert Muir added a comment -

          It would not compare faster because in UTF-8 encoding, only 7 bits are used for encoding the chars

          Yeah, you are right, I don't think it will be faster on average (I was just posing the question because I don't really know NRQ), and you will waste at minimum 4 bits by using that first bit.

          I am just always trying to improve collation too, so that's why I am bugging you. I guess hopefully we will soon have byte[] and can do it properly, and speed up both.

          Robert Muir added a comment -

          FWIW, here is a patch that uses the algorithm from the Unicode standard for comparing UTF-8 in UTF-16 sort order.
          They claim it is fast because there is no conditional branching... who knows.

          Uwe Schindler added a comment - - edited

          I rewrote the NumericRangeTermsEnum, see revision 885360.

          Changed: simplify and optimize NumericRangeTermEnum:

          • the range split logic only seeks forward (an assert verifies this), so the iterator can be reused (like Automaton)
          • removed the iteration by not using setEnum() [throws UOE], see LUCENE-2087
          • removed the TermEnum, as the class cannot be subclassed – so no BW break!!!; getEnum() throws UOE
          • seek() cannot work for this TermsEnum, so throw UOE (it is not needed for MTQ at the moment)
          Michael McCandless added a comment -

          Thanks Uwe!

          Michael McCandless added a comment -

          fwiw here is a patch to use the algorithm from the unicode std for utf8 in utf16 sort order.
          they claim it is fast because there is no conditional branching... who knows

          We could try to test to see if we see a difference in practice...

          For term text without surrogate content, the branch always goes one way, so the CPU ought to predict it well and it may turn out to be faster using branching.

          With surrogates, the lookup approach is likely faster, since the branch has a good chance of going either way.

          However, the lookup approach adds 256 bytes to the CPU's memory cache, which I'm not thrilled about. We have other places that do the same (NORM_TABLE in Similarity, scoreCache in TermScorer) where I think the time vs. cache-line tradeoff is much more warranted, since they deal with a decent amount of CPU work.

          Or maybe worrying about cache lines from way up in javaland is just silly.

          I guess at this point I'd lean towards keeping the branch based comparator.

          Robert Muir added a comment -

          We could try to test to see if we see a difference in practice...

          It is also very weird to me that the method you are using is the one being used in ICU... if this one is faster, why isn't ICU using it?
          It's also sketchy that the table as described in the Unicode standard doesn't even work as described... so is anyone using it?

          I like your reasoning, let's leave it alone for now... there are other things to work on that will surely help.

          Uwe Schindler added a comment -

          To prevent problems like yesterday's, here is the patch I applied yesterday to the flex branch (for completeness).

          Mark Miller added a comment -

          I'm going to commit the latest merge to trunk in a bit.

          In a recent commit, NumericRangeQuery was changed to throw UnsupportedOperationException from getEnum – I think that's going to be a back compat break? For now I've commented out the back compat test and put a nocommit comment:

            @Override
            // nocommit: I think this needs to be implemented for back compat? When done, 
            // the back compat test for it in TestNumericRangeQuery32 should be uncommented.
            protected FilteredTermEnum getEnum(final IndexReader reader) throws IOException {
              throw new UnsupportedOperationException("not implemented");
            }
          

          I think we need to go back to returning the Enum? But I'm not sure why this change was made, so ...

          Uwe Schindler added a comment - edited

          It is not a break: you cannot extend NumericRangeQuery (it's final), so you can never call that method (it's protected). The only way would be to put a class that calls this method into the same package, but that's illegal and not backed by BW compatibility (the BW test is exactly such a case, just comment it out in BW branch - I added this test for explicit enum testing; we should have this in flex trunk, too).

          (I explained that in the commit and Mike already wrote that in the comment). So please keep the code clean and do not re-add this TE.

          Mark Miller added a comment -

          Mike already wrote that in the comment

          In what comment? Would be helpful to have it in a comment above getEnum.

          just comment it out in BW branch

          That's what I'll do. Did the BW branch pass when you did it? If not, it would be helpful to commit that fix too, or call out the break loudly in this thread - it's difficult to keep up on everything and track all of this down for these merges.

          So please keep the code clean and do not re-add this TE.

          Oh, I had no plans to do it myself - I just commented out the BW compat test and put in the comment you see above.

          Mark Miller added a comment -

          Though I do wonder ... if it's not a break, why do we have the method there throwing UnsupportedOperationException ... why isn't it just removed?

          Uwe Schindler added a comment -

          In what comment? Would be helpful to have it in a comment above getEnum.

          Will do! It's in the log message, not a comment.

          Did the BW branch pass when you did it?

          I think so, at least in my checkout. I think the TermEnum test was added after 3.0?

          Though I do wonder ... if it's not a break, why do we have the method there throwing UnsupportedOperationException ... why isn't it just removed?

          I did not look into the super class, which just returns null. I thought it was abstract.

          Uwe Schindler added a comment -

          Mark: The updated backwards branch does not pass because of this (I had not updated my checkout; the Enum test was added before 3.0). So the test should be commented out there, too (but you said you would do this). Else, I will do it tomorrow - I am tired and would produce too many errors - sorry.

          Uwe Schindler added a comment -

          I updated my commit comment above, so it's clear what I have done (copied from commit log message).

          Mark Miller added a comment -

          Else, I will do it tomorrow - I am tired and would produce too many errors - sorry.

          No problem - I got it now - just wasn't sure. That's why I brought it up.

          It's in the log message, not a comment.

          Yup - that's fine, no big deal. I was just saying it would be easier on me if there were a comment above it - I've got it now though - I'll just remove that method.

          Uwe Schindler added a comment -

          I'll just remove that method.

          In my opinion the super method should throw UOE. If somebody fails to override either getTermsEnum() or getEnum(), he will get a good message describing the problem, not just an NPE. The default impl of getTermsEnum() returning null is fine, because rewrite then delegates to getEnum(). If that also returns null, you get an NPE.

          We had the same problem with Filter.bits() after its deprecation in 2.x - it was not solved very well there. In the 2.9 TS BW layer / DocIdSetIterator BW layer it was done correctly.
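
          For reference, a rough sketch of this delegation idea. The nested interfaces are placeholders standing in for Lucene's IndexReader / TermsEnum / FilteredTermEnum, and this is not the attached patch or the real MultiTermQuery class - just an illustration of how the old and new methods could interact:

          import java.io.IOException;

          // Rough sketch of the proposed delegation (illustrative only).
          abstract class MultiTermQuerySketch {
            interface Reader {}
            interface TermsEnum {}
            interface FilteredTermEnum {}

            /** New flex API. Returning null by default lets rewrite fall back to getEnum(). */
            protected TermsEnum getTermsEnum(Reader reader) throws IOException {
              return null;
            }

            /** Old API, kept for back compat. Throwing a descriptive UOE here means a
             *  subclass that overrides neither method gets a clear error instead of an NPE. */
            @Deprecated
            protected FilteredTermEnum getEnum(Reader reader) throws IOException {
              throw new UnsupportedOperationException(
                  "subclasses must override getTermsEnum() (preferred) or getEnum()");
            }

            /** Rewrite-time selection: prefer the new API, fall back to the old one. */
            final Object enumForRewrite(Reader reader) throws IOException {
              final TermsEnum termsEnum = getTermsEnum(reader);
              return termsEnum != null ? termsEnum : getEnum(reader);
            }
          }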

          Uwe Schindler added a comment -

          This is what I am thinking about for BW and delegation between getEnum() and getTermsEnum().

          Mark Miller added a comment -

          Okay - that sounds like a good idea - I'll leave it for after the merge is done though.

          Mark Miller added a comment -

          I've put the merge on hold for a bit - I will try and come back to it tonight. I've got to figure out why this BW compat test is failing, and I haven't seen an obvious reason yet:

          junit.framework.AssertionFailedError: expected:<> but was:<>
          	at org.apache.lucene.search.TestWildcard.testEmptyTerm(TestWildcard.java:108)
          	at org.apache.lucene.util.LuceneTestCase.runBare(LuceneTestCase.java:208)
          

          Pipe in if you know. It's hard to debug or run this test on its own in Eclipse (because of how BW compat tests work), so it's a slow slog to troubleshoot and I haven't had time yet.

          Michael McCandless added a comment -

          I think that test failure was from my fix of BooleanQuery to take coord into account in equals & hashCode (LUCENE-2092)? I hit exactly that same failure, and it required a fix on the back-compat branch to just pass in "true" to the "new BooleanQuery()" done just before the assert. Does that explain it?
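
          For reference, a hypothetical illustration of that fix (not the committed test code; wq and searcher are assumed to be the wildcard query and searcher from the surrounding test): since coord now participates in equals()/hashCode(), the expected query has to be built with the same coord setting as the query produced by the boolean rewrite, presumably with coord disabled:

            // Hypothetical sketch of the back-compat test fix described above:
            // build the expected query with coord disabled so equals() matches again.
            BooleanQuery expected = new BooleanQuery(true); // true = disable coord
            assertEquals(expected, searcher.rewrite(wq));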

          Michael McCandless added a comment -

          And, thanks for taking over on merging trunk down! I'm especially looking forward to getting the faster unit tests (LUCENE-1844).

          Uwe Schindler added a comment -

          I have seen your change in the tests, too. The test just checks that no clauses are generated. In my opinion, it should not compare to an empty BQ instance, but instead just assert bq.clauses().size()==0.

          Michael McCandless added a comment -

          In my opinion, it should not compare to an empty BQ instance, but instead just assert bq.clauses().size()==0.

          +1, that'd be a good improvement – I'll do that.

          Uwe Schindler added a comment -

          I rewrote it to:

          public void testEmptyTerm() throws IOException {
          	RAMDirectory indexStore = getIndexStore("field", new String[]{"nowildcard", "nowildcardx"});
          	IndexSearcher searcher = new IndexSearcher(indexStore, true);
          
          	MultiTermQuery wq = new WildcardQuery(new Term("field", ""));
          	wq.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
          	assertMatches(searcher, wq, 0);
          	Query q = searcher.rewrite(wq);
          	assertTrue(q instanceof BooleanQuery);
          	assertEquals(0, ((BooleanQuery) q).clauses().size());
          }
          
          Michael McCandless added a comment -

          Looks great – can/did you commit?