Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels: None
    • Lucene Fields: New

      Description

      PrefixCodedTermsReader/Writer (used by all "real" core codecs) already
      supports pluggable terms index impls.

      The only impl we have now is FixedGapTermsIndexReader/Writer, which
      picks every Nth (default 32) term and holds it in efficient packed
      int/byte arrays in RAM. This is already an enormous improvement (RAM
      reduction, init time) over 3.x.

      This patch adds another impl, VariableGapTermsIndexReader/Writer,
      which lets you specify an arbitrary IndexTermSelector to pick which
      terms are indexed, and then uses an FST to hold the indexed terms.
      This is typically even more memory efficient than packed int/byte
      arrays, though it does not support ord(), so it's not quite a fair
      comparison.
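
      For illustration, here is a minimal sketch of a custom selector in the
      spirit of VariableGapTermsIndexWriter.IndexTermSelector. Package
      locations and exact signatures shifted between trunk and the 4.0
      release, and the docFreq heuristic (and class name) below are invented
      for the example, so treat this as a sketch, not the patch's API:

        import org.apache.lucene.codecs.TermStats;  // package moved over time
        import org.apache.lucene.codecs.blockterms.VariableGapTermsIndexWriter;
        import org.apache.lucene.index.FieldInfo;
        import org.apache.lucene.util.BytesRef;

        // Index every Nth term, plus any unusually common term, so that
        // seeks land close to the most expensive postings lists.
        public class EveryNOrHighFreqSelector
            extends VariableGapTermsIndexWriter.IndexTermSelector {

          private final int interval;       // e.g. 32, the fixed-gap default
          private final int docFreqThresh;  // hypothetical extra heuristic
          private int count;

          public EveryNOrHighFreqSelector(int interval, int docFreqThresh) {
            this.interval = interval;
            this.docFreqThresh = docFreqThresh;
          }

          @Override
          public boolean isIndexTerm(BytesRef term, TermStats stats) {
            // Either the gap expired or the term is common enough to index.
            return count++ % interval == 0 || stats.docFreq >= docFreqThresh;
          }

          @Override
          public void newField(FieldInfo fieldInfo) {
            count = 0;  // restart the gap counter for each field
          }
        }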

      I had to relax the terms index plugin api for
      PrefixCodedTermsReader/Writer to not assume that the terms index impl
      supports ord.

      I also did some cleanup of the FST/FSTEnum APIs and impls, and broke
      out separate seekCeil and seekFloor in FSTEnum. Eg we need seekFloor
      when the FST is used as a terms index but seekCeil when it's holding
      all terms in the index (which is how SimpleText uses FSTs).
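
      To make the floor/ceil distinction concrete, here is a standalone
      analogue using java.util.TreeMap rather than the Lucene FSTEnum API
      (the terms and file pointers are invented for the example):

        import java.util.TreeMap;

        public class SeekDemo {
          public static void main(String[] args) {
            // Indexed terms mapped to terms-dict file pointers.
            TreeMap<String, Long> index = new TreeMap<>();
            index.put("apple", 0L);
            index.put("mango", 100L);
            index.put("zoo", 200L);

            // seekFloor semantics: largest indexed term <= target, ie the
            // block a terms index starts scanning forward from.
            System.out.println(index.floorEntry("pear"));   // mango=100

            // seekCeil semantics: smallest term >= target, which is what a
            // dictionary holding all terms (like SimpleText) needs.
            System.out.println(index.ceilingEntry("pear")); // zoo=200
          }
        }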

      1. LUCENE-2843.patch
        193 kB
        Michael McCandless
      2. LUCENE-2843.patch
        205 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Attached patch.

        Still some nocommits but I think it's close... though I still need to
        get indexDivisor working for var gap. Note that this patch changes
        the index format, even for Standard codec (using fixed gap terms
        index).

        A few tests fail because they assume Standard codec supports ord...

        To properly test the alternatives we now have for the terms index
        (including fixed vs variable, and then different index term selectors
        for the variable case), I created a new fun testing codec,
        MockRandomCodec. It randomly pairs up a postings impl with a terms
        index impl. EG it sometimes uses an IndexTermPolicy that randomly
        picks index terms. We may now be able to remove the other Mock*
        codecs...

        Michael McCandless added a comment -

        As a first test, I just made a policy that's identical to the fixed
        gap terms index, ie, it just picks every 32nd term as the index term.
        So this is really a test of the packed int/bytes vs FST.

        On the 10M Wikipedia test index, the resulting terms index files (=
        RAM used by SegmentReader) are ~38% smaller (~52% once optimized – FST
        "scales up" well).

        Here's the query perf vs trunk:

        Query                              QPS base  QPS vargap  Pct diff
        spanFirst(unit, 5)                    17.13       16.75     -2.2%
        "unit state"~3                         5.31        5.20     -2.1%
        spanNear([unit, state], 10, true)      4.59        4.52     -1.4%
        "unit state"                           7.86        7.77     -1.1%
        +nebraska +state                     204.74      202.85     -0.9%
        +unit +state                          11.37       11.30     -0.6%
        doctimesecnum:[10000 TO 60000]         9.74        9.76      0.2%
        unit~1.0                              21.70       21.82      0.6%
        unit*                                 26.18       26.55      1.4%
        state                                 29.29       29.75      1.6%
        uni*                                  15.06       15.32      1.7%
        unit state                            10.73       10.93      1.9%
        unit~2.0                              21.05       21.45      1.9%
        un*d                                  77.10       79.65      3.3%
        u*d                                   26.41       28.81      9.1%
        united~1.0                           102.27      116.88     14.3%
        united~2.0                            25.47       31.18     22.4%

        It's great that for the seek-intensive fuzzy queries, the FST-based
        seeking is substantially faster. For other queries the term seek time
        is in the noise.

        I think we should make this (VariableGapTermsIndex) terms index impl
        the default (for Standard codec).

        Robert Muir added a comment -

        I like this idea; it would be interesting to see what other types of policies we can come up with.

        Just curious, how would the 'let FST decide' work? Do we have some quick way to know that
        selecting term X versus term Y would reduce the number of states/transitions? Isn't the resulting
        FST size also dependent upon the output value (terms dict file pointer)? And if we optimize
        this locally (X versus Y) does it tend to hold globally?

        Michael McCandless added a comment -

        Just curious, how would the 'let FST decide' work?

        The FST builder is able to prune-as-it-builds. EG it prunes a node if the number of unique terms going through it is less than N. Alternatively, it prunes if the node just before it had < N terms coming through. To do this we'd pass all terms to the builder, and specify the prune threshold. So the FST would be "bushy"/deep where term density is high, and shallowish elsewhere.
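
        To make the count-based pruning concrete, here is a standalone sketch
        using plain Java collections rather than the Lucene FST Builder API
        (the terms and threshold are invented for the example):

          import java.util.*;

          public class PrefixPruneSketch {
            // Keep only prefixes that at least minCount terms pass through,
            // so dense regions of the term space keep deep ("bushy") paths
            // and sparse regions collapse to shallow ones.
            public static SortedMap<String, Integer> prunedPrefixes(List<String> sortedTerms, int minCount) {
              Map<String, Integer> counts = new HashMap<>();
              for (String t : sortedTerms) {
                for (int i = 1; i <= t.length(); i++) {
                  counts.merge(t.substring(0, i), 1, Integer::sum);
                }
              }
              SortedMap<String, Integer> kept = new TreeMap<>();
              for (Map.Entry<String, Integer> e : counts.entrySet()) {
                if (e.getValue() >= minCount) {
                  kept.put(e.getKey(), e.getValue());
                }
              }
              return kept;
            }

            public static void main(String[] args) {
              List<String> terms = Arrays.asList("unit", "united", "units", "unity", "zebra");
              // Prints {u=4, un=4, uni=4, unit=4}: the dense "unit*" region
              // survives, while the lone "zebra" prefixes are pruned at minCount=2.
              System.out.println(prunedPrefixes(terms, 2));
            }
          }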

        Isn't the resulting
        FST size also dependent upon the output value (terms dict file pointer)? And if we optimize
        this locally (X versus Y) does it tend to hold globally?

        Yes, very much so – the more stuff you store in the output the bigger the FST. But we only store the long file pointer into the main terms dict for this usage, and the FST is efficient (delta-codes the long values). But, I'm not trying in any way to minimize that net size (in bytes)...

        Michael McCandless added a comment -

        New patch, resolving all nocommits. I think it's ready to commit.

        I cut over StandardCodec to the VariableGapTermsIndex, with 'every 32' as the index term selection policy. We could lower 32 to eg 20, since FST uses so much less RAM than packed ints/bytes, but for now I'm just leaving it at 32 to be safe.

        The "let FST builder pick the indexed terms" turns out not to be very easy to do w/ this API. I put a big comment explaining this in the var gap writer. It's also not clear we'd want to use this approach, since the resulting term index density can then vary substantially.

        Earwin Burrfoot added a comment -

        And we're nearing the day when we keep the whole term dictionary in memory (as Sphinx does, for instance).
        At that point a gazillion term lookup-related hacks (like the lookup cache) become obsolete.
        The term dictionary itself can also be memory-mapped after this, instead of being "read" and "built" from disk, which makes new segment opening near-instantaneous.

        Michael McCandless added a comment -

        In-memory terms dict would be great. I agree it'd fundamentally change how we execute eg the automaton queries (suddenly we can just intersect against the terms dict instead of doing the seek/next thing); FuzzyQuery might be a direct search through the terms dict instead of first building the LevN DFA; respelling similarly...

        But, I suspect we'll always have to support the "on-disk only" option because some apps seem to have an insane number of terms.

        Earwin Burrfoot added a comment -

        As I said, there's already a search server with a strictly in-memory (in the mmap sense; it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage.

        My guess is that's because people with an "insane number of terms" store various crap like unique timestamps as terms. With CSF ("attributes" in Sphinx lingo), and some nice filters that can work over CSF, there's no longer any need to stuff your timestamps in the same place you stuff your texts. That can be reflected in documentation, and then, suddenly, we can drop "on-disk only" support.

        Robert Muir added a comment -

        As I said, there's already a search server with a strictly in-memory (in the mmap sense; it can theoretically be paged out) terms dict AND widespread adoption. Their users somehow manage

        I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok.
        sphinx also requires mysql, which only recently started supporting real utf-8?! (not that 3-byte crap they tried to pass off instead)

        I don't think we should really be looking there for inspiration.

        Michael McCandless added a comment -

        Yes doc values should cut back on these large term dicts.

        But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky – Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs).

        If indeed the terms dict (thanks to FSTs) becomes small enough to "fit" in RAM, then we should load it into RAM (and do away w/ the terms index).

        Yonik Seeley added a comment -

        Their users somehow manage.

        That neglects to count those who are not users because they could not manage with the limitations.

        Anyway, being able to optionally keep the term dict in memory, per-field, if it's below a certain limit (terms/memory or whatever), would be very cool!

        Earwin Burrfoot added a comment -

        I don't like the reasoning that, just because sphinx does it and their 'users manage', that makes it ok.

        I'm in no way advocating it as an all-round better solution. It has its wrinkles just as anything else.
        My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users.
        They have a memory-resident term dictionary, and it works; I have heard no complaints regarding this ever.

        sphinx also requires mysql

        Have you read anything at all? It has an integration ready for the layman user who just wants to stick a fulltext search into their little app, but it is in no way reliant on it.
        Sphinx is a direct alternative to Solr.

        But, I'm not a fan of pure disk-based terms dict. Expecting the OS to make good decisions on what gets swapped out is risky - Lucene is better informed than the OS on which data structures are worth spending RAM on (norms, terms index, field cache, del docs).
        If indeed the terms dict (thanks to FSTs) becomes small enough to "fit" in RAM, then we should load it into RAM (and do away w/ the terms index).

        That's a bit delusional. If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files. I've seen this countless times.
        But then, you have a number of benefits - like sharing the filesystem cache when opening the same file multiple times, offloading things from the Java heap (which is almost always a good thing), and the fastest load-into-memory times possible.

        Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction.

        Robert Muir added a comment -

        Have you read anything at all?

        Nope, haven't looked at their code... I think I stopped at the documentation when I saw how they analyzed text!

        Sorry if I sound offensive at times, but, damn, there's a whole world of simple and efficient code lying ahead in that direction

        So where is the problem?

        You can make your own all-on-disk impl, or all-in-ram impl, and contribute it? And you don't have to implement the terms dict cache,
        that's contained in the implementation?

        My problem is that we shouldn't assume all users can fit all their terms in RAM.

        I think it's great to offer alternative impls that work all in ram, and maybe if termsdict < X, where X is some configurable value,
        even consider using these automatically in standardcodec... but I don't see any benefit of 'forcing' this when we have this
        whole flexible indexing thing!

        Yonik Seeley added a comment -

        My reasoning is merely that an alternative exists, and it is viable. As proven by pretty high-profile users.

        Actually, I sort of agree. I read the "in memory" too fast and didn't realize you were talking about memory mapped.
        There are other parts of sphinx that are kept directly in memory (not memory mapped) and do limit its single-node scalability too much IMO.
        Unfortunately, Java has additional overhead wrt mmap, and you also can't do some stuff that you could do in C. All this means is that trade-offs that made sense for C/C++ solutions may or may not make sense for Java solutions.

        Robert Muir added a comment -

        Unfortunately, Java has additional overhead wrt mmap

        It's not just that: you can't assume mmap even "works" (32-bit platforms, and even some troubles on 64-bit windows).
        Because this is a search engine library, not just a server on 64-bit linux only, we need to support
        other situations, like 32-bit users doing desktop search.

        In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to.

        Earwin Burrfoot added a comment -

        Nope, haven't looked at their code... I think I stopped at the documentation when I saw how they analyzed text!

        All my points are contained within their documentation. No need to look at the code (it's as shady as Lucene's).
        In the same manner, Lucene had crappy analysis for years, until you took hold of the (unicode) police baton.
        So let's not allow color differences between our analyzers to affect our judgement on other parts of ours : )

        In other words, Test2BTerms in src/test should pass on my 32-bit windows machine with whatever we default to.

        I'm questioning whether there is any legitimate, adequate reason to have that many terms.
        I'm agreeing on the mmap+32bit/mmap+windows point for a reasonable amount of terms though :/

        A hybrid solution, with term-dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be.

        Robert Muir added a comment -

        A hybrid solution, with term-dict being loaded completely into memory (either via mmap, or into arrays) on a per-field basis, is probably best in the end, however sad it may be.

        What's the sad part again? Why does it bother you if there is another alternative codec setup or terms dict implementation if you aren't using it?
        Should we also have only RAMDirectory and MMapDirectory, and is it sad that we have NIOFSDirectory?

        Michael McCandless added a comment -

        If a system is forced to swap out, it'll swap your explicitly managed RAM just as readily as memory-mapped files.

        In fact, even if it's not under any real memory pressure the OS will swap out your not-recently-accessed RAM. Net/net this is a good policy, if your metric is total throughput accomplished by all programs.

        But if your metric is latency to search queries, this is an awful policy.

        Fortunately OSs (at least Windows & Linux) give you some tunability here. Unfortunately, the tunable is global and it defaults "badly" for those programs that do make a careful distinction b/w what data structures are best held in RAM and what data is best left on disk.

        If I could I would offer an option to pin these pages, so the OS cannot swap them out, but I don't think we can (easily) do that from javaland (and I think you'd have to be root). Lacking pinning, the best (approximation) we can do is pull these ourselves into RAM.

        Toke Eskildsen added a comment -

        I see that the VariableGapTermsIndexReader/Writer is now the default (or at least an experimental default) in trunk. This means that ord() and consequently seek() are not available. Are you, Michael, planning on adding these later on or are they gone for good?

        If they are gone for good, it does represent a bit of a problem for me, as I use ord() and seek() for a memory-efficient hierarchical faceting system. Not having those in the default reader/writer means that most indexes "out there" will not support accessing terms by ordinals, and that my code won't work on them unless they are rebuilt. Boo hoo for me, but not implementing the interface fully in the default implementation seems wrong. Or maybe the interface should be changed?

        Robert Muir added a comment -

        Or maybe the interface should be changed?

        +1, ord is not an interface; it's an implementation detail specific
        to only certain basic implementations, and it shouldn't be in TermsEnum.

        I would much prefer if this were some attribute, or somehow exposed
        via those implementations' TermStates... as in my opinion it's really
        an implementation detail of TermState, not even of terms.

        Michael McCandless added a comment -

        Toke, the FixedGapTermsIndexWriter/Reader supports ord, but requires more RAM for the terms index and may cause some queries to run slower.

        Can you describe how your faceting system is using ord?

        Robert Muir added a comment -

        Also, if faceting wants to exploit a codec-specific implementation detail,
        it's far more interesting to evaluate things like changing VariableGapTermsIndex's
        FST output to be a pair including max(docFreq) for the block it indexes.

        Then, faceting that wants to get the top-10 terms by docFreq would
        instead walk an FSTEnum, and only go to disk for the top-10 blocks... this
        would actually be a change in complexity order, no?

        Toke Eskildsen added a comment -

        Thank you. I will use the FixedGap version myself, but that only works when I'm the one controlling the index build, right?

        As for the faceting system, the principle is really simple: instead of holding terms (BytesRefs) in memory, I just hold their ordinals. As the terms themselves only need to be resolved when the final faceting result is to be returned, seeking for a few hundred or thousand terms by their ordinals has worked very well so far (no guarantees for old hardware such as spinning disks, though).

        The memory savings over holding BytesRefs in memory of course vary with term lengths. There are some numbers at https://sbdevel.wordpress.com/2010/10/11/hierarchical-faceting/ if someone finds it interesting, and LUCENE-2369 has some measurements of the same principle applied to sorting.
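
        For illustration, a minimal sketch of that late-resolution step,
        assuming an ord-supporting terms index; seekExact(long) is the
        released-4.0 name for the ord-based seek this thread calls seek(long):

          import java.io.IOException;
          import org.apache.lucene.index.TermsEnum;
          import org.apache.lucene.util.BytesRef;

          public class OrdResolution {
            // Facet counters are keyed by ordinal; only the final winners
            // are turned back into terms here.
            static BytesRef[] resolveOrds(TermsEnum te, long[] topOrds) throws IOException {
              BytesRef[] terms = new BytesRef[topOrds.length];
              for (int i = 0; i < topOrds.length; i++) {
                te.seekExact(topOrds[i]);                   // ord-based seek
                terms[i] = BytesRef.deepCopyOf(te.term());  // term() is reused; copy it
              }
              return terms;
            }
          }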

        Toke Eskildsen added a comment -

        Robert, there is already OrdTermState to hold the ord, but the ordinal itself is only interesting if the corresponding term can be seeked from it. Upon further inspection I see that the method TermsIndexReaderBase.supportsOrd() is logically coupled to seek(long ord) and ord(), so support for ordinals does not seem like something one can expect.

        As for the FSTEnum idea, I don't understand how it can work with faceting, where the terms to return are defined by the documents from a search? ...But maybe we should discuss that elsewhere.

        Robert Muir added a comment -

        Robert, there is already OrdTermState to hold the ord, but the ordinal itself is only interesting if the corresponding term can be seeked from it.

        You can seek to any arbitrary TermState (even if it's not holding ord), but it might hold other things you don't care about.

        As for the FSTEnum idea, I don't understand how it can work with faceting, where the terms to return are defined by the documents from a search? ...But maybe we should discuss that elsewhere.

        In the general case, if you are using something like a priority queue to get the top-N terms (even if you are filtering by the documents from a search), this number would mean that once your priority queue is full, you can tell that an entire block of low-freq terms is not competitive to enter the PQ, without going to disk?
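
        A standalone sketch of that block-skipping idea (plain Java, not the
        Lucene API; Block and its on-disk read are stand-ins for the
        hypothetical max(docFreq) FST output):

          import java.util.PriorityQueue;

          public class BlockSkipSketch {
            static class Block {
              final int maxDocFreq;            // would live in the FST output
              Block(int maxDocFreq) { this.maxDocFreq = maxDocFreq; }
              int[] readDocFreqsFromDisk() {   // stand-in for the expensive scan
                return new int[] { maxDocFreq, maxDocFreq / 2 };
              }
            }

            // Min-heap of the current top-N docFreqs (n > 0); once the queue
            // is full, a whole block can be rejected by its max(docFreq)
            // without any disk I/O.
            public static PriorityQueue<Integer> topN(Block[] blocks, int n) {
              PriorityQueue<Integer> pq = new PriorityQueue<>();
              for (Block b : blocks) {
                if (pq.size() == n && b.maxDocFreq <= pq.peek()) {
                  continue;  // entire block is non-competitive: skip the read
                }
                for (int df : b.readDocFreqsFromDisk()) {
                  if (pq.size() < n) {
                    pq.add(df);
                  } else if (df > pq.peek()) {
                    pq.poll();
                    pq.add(df);
                  }
                }
              }
              return pq;
            }
          }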

        Michael McCandless added a comment -

        Thank you. I will use the FixedGap version myself, but that only works when I'm the one controlling the index build, right?

        Right, but this is fair? I mean, it's easy (in Lucene 4.0) to pick the appropriate codec per field. So, if people want to use your faceting package, and you explain that it requires using a certain Codec, that seems OK?

        As for the faceting system, the principle is really simple: instead of holding terms (BytesRefs) in memory, I just hold their ordinals. As the terms themselves only need to be resolved when the final faceting result is to be returned, seeking for a few hundred or thousand terms by their ordinals has worked very well so far (no guarantees for old hardware such as spinning disks, though).

        OK, that makes sense... impressive that seeking up to a few thousand terms is giving you good perf. You could also load DocTermsIndex in FieldCache, but of course then all terms data & ords are RAM resident (and the point of LUCENE-2369 is to have low memory overhead).

        Toke Eskildsen added a comment -

        Michael, I think I'll use a TermsEnum wrapper which keeps every Nth BytesRef or TermState and uses that for ordinal-based random access, as sketched below. That's a simple, configurable RAM/speed tradeoff that is Codec-agnostic. The question is: what is the minimal set of methods a given Codec must implement? Is anything besides ord/seek optional?
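
        A hedged sketch of that sampling wrapper; the class and method names
        are invented, and only TermsEnum.next/seekCeil/term are assumed from
        the real API:

          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;
          import org.apache.lucene.index.TermsEnum;
          import org.apache.lucene.util.BytesRef;

          public class SampledOrdResolver {
            private final int gap;
            private final List<BytesRef> samples = new ArrayList<>();  // samples.get(i) = term #(i*gap)

            // One sequential pass records every Nth term; RAM cost shrinks
            // by the gap factor, lookup cost grows by at most gap-1 next()s.
            public SampledOrdResolver(TermsEnum te, int gap) throws IOException {
              this.gap = gap;
              long ord = 0;
              for (BytesRef term = te.next(); term != null; term = te.next(), ord++) {
                if (ord % gap == 0) {
                  samples.add(BytesRef.deepCopyOf(term));  // the enum reuses its BytesRef, so copy
                }
              }
            }

            // Positions te on the term with the given ordinal and returns it.
            public BytesRef seekOrd(TermsEnum te, long ord) throws IOException {
              te.seekCeil(samples.get((int) (ord / gap)));  // exact hit: sample came from this enum
              BytesRef term = te.term();
              for (long i = ord % gap; i > 0; i--) {
                term = te.next();  // walk at most gap-1 terms to the target
              }
              return term;
            }
          }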


          People

          • Assignee:
            Michael McCandless
          • Reporter:
            Michael McCandless
          • Votes:
            0
          • Watchers:
            0

            Dates

            • Created:
            • Updated:
            • Resolved:
