Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.9
    • Fix Version/s: 2.1
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Linux, Java 1.5, Large Index with 4 million items and some heavily nested boolean queries

      Description

      I just recently ran a load test on the latest code from lucene , which is using a new BooleanScore and noticed the ConjunctionScorer was crunching through objects , especially while sorting as part of the skipTo call. It turns a linked list into an array, sorts the array, then converts the array back to a linked list for further processing by the scoring engines below.

      'm not sure if anyone else is experiencing this as I have a very large index (> 4 million items) and I am issuing some heavily nested queries

      Anyway, I decide to change the link list into an array and use a first and last marker to "simulate" a linked list.

      This scaled much better during my load test as the java gargbage collector was less - umm - virulent

      1. Conjunction20060921.patch
        5 kB
        Paul Elschot
      2. ConjunctionScorer.java
        4 kB
        Paul Elschot
      3. ConjunctionScorer.java
        4 kB
        Abdul Chaudhry

        Activity

        Hide
        abdollar Abdul Chaudhry added a comment -

        Updated a version of the ConjunctionScorer which uses an array rather than a linked list.
        The first(), last() and doc() method calls should probably be in-lined - maybe.

        Show
        abdollar Abdul Chaudhry added a comment - Updated a version of the ConjunctionScorer which uses an array rather than a linked list. The first(), last() and doc() method calls should probably be in-lined - maybe.
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        I like performance, so I'm all for this.
        One thing about the code: it has a few probably
        superfluous casts to (Scorer) from Scorer array elements.

        Regards,
        Paul Elschot

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - I like performance, so I'm all for this. One thing about the code: it has a few probably superfluous casts to (Scorer) from Scorer array elements. Regards, Paul Elschot
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Simplified version, removed most casts, inlined first() and last().
        Passes all tests here.

        Regards,
        Paul Elschot

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Simplified version, removed most casts, inlined first() and last(). Passes all tests here. Regards, Paul Elschot
        Hide
        abdollar Abdul Chaudhry added a comment -

        Thanks Paul, re-ran load tests here, everything looks good and it looks easier to read as well.
        I also checked Arrays.sort, it's a merge sort - guaranteed nlogn.
        This all looks squeaky clean.
        Cheers,
        – Abdul

        Show
        abdollar Abdul Chaudhry added a comment - Thanks Paul, re-ran load tests here, everything looks good and it looks easier to read as well. I also checked Arrays.sort, it's a merge sort - guaranteed nlogn. This all looks squeaky clean. Cheers, – Abdul
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        The score() method can be simplified to use only the 'i' index. There is no need
        to offset the index into the scorers with the 'pos' variable, because all scorers match.

        Regards,
        Paul Elschot

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - The score() method can be simplified to use only the 'i' index. There is no need to offset the index into the scorers with the 'pos' variable, because all scorers match. Regards, Paul Elschot
        Hide
        abdollar Abdul Chaudhry added a comment -

        ok, this makes sense as the scoring engine runs something like this

        while (scorer.next()) {
        int doc = scorer.doc();
        float scorer = scorer.score();
        collector.collect(doc, score);
        }

        That is, next() will have ordered everything, so that by the time we call the scorer.score() method , everything should be in-order.

        Thanks, ill give that a go.

        The impression I have with lucene, and correct me if Im wrong, is that complex queries with many terms and clauses have their bottleneck in terms of performance in the ordering phase, that is scorer.next() requires everything to be in-document order and all the scorer sub-engines must comply. Collection is a moot point as you probably have small numbers of hits. However, on the other end of the scale, for queries with one or two terms that have a very high frequency the bottleneck is really in collection, that is the priority queue in collector.collect(),
        Essentially this is a sorting issue, somewhat masked and manipulated at various stages.
        This looks to me like lucene needs a "Query Plan".

        Show
        abdollar Abdul Chaudhry added a comment - ok, this makes sense as the scoring engine runs something like this while (scorer.next()) { int doc = scorer.doc(); float scorer = scorer.score(); collector.collect(doc, score); } That is, next() will have ordered everything, so that by the time we call the scorer.score() method , everything should be in-order. Thanks, ill give that a go. The impression I have with lucene, and correct me if Im wrong, is that complex queries with many terms and clauses have their bottleneck in terms of performance in the ordering phase, that is scorer.next() requires everything to be in-document order and all the scorer sub-engines must comply. Collection is a moot point as you probably have small numbers of hits. However, on the other end of the scale, for queries with one or two terms that have a very high frequency the bottleneck is really in collection, that is the priority queue in collector.collect(), Essentially this is a sorting issue, somewhat masked and manipulated at various stages. This looks to me like lucene needs a "Query Plan".
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Once the ConjunctionScorer matches the order of the subqueries/subscorers does not matter
        because they all match and a sum score needs to be formed.

        Scorer.next() can only tolerate documents not to be in order at top level, where hit collection is done.
        At lower levels in nested scorers, Lucene works a document at a time, and there Scorer.skipTo(docNr)
        requires that all document numbers are in order. Such skipping is needed for conjunctions.
        Since the score value of a document for a query depends on the score value of the subqueries,
        at some point the association on a single document must be done.
        For conjunctions, skipTo() is used, but for disjunctions this association is done by
        a priority queue in the trunk, and a distribution like method in 1.4.3. This distribution method works
        somewhat loosely on document order, and is therefore incompatible with skipping.

        Regards,
        Paul Elschot

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Once the ConjunctionScorer matches the order of the subqueries/subscorers does not matter because they all match and a sum score needs to be formed. Scorer.next() can only tolerate documents not to be in order at top level, where hit collection is done. At lower levels in nested scorers, Lucene works a document at a time, and there Scorer.skipTo(docNr) requires that all document numbers are in order. Such skipping is needed for conjunctions. Since the score value of a document for a query depends on the score value of the subqueries, at some point the association on a single document must be done. For conjunctions, skipTo() is used, but for disjunctions this association is done by a priority queue in the trunk, and a distribution like method in 1.4.3. This distribution method works somewhat loosely on document order, and is therefore incompatible with skipping. Regards, Paul Elschot
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        It seems like more cruft could be removed if support for incremental adding of subscorers was removed.
        Is there a reason we can't simplify things and just pass Collection<Scorer> in the constructor?

        Show
        yseeley@gmail.com Yonik Seeley added a comment - It seems like more cruft could be removed if support for incremental adding of subscorers was removed. Is there a reason we can't simplify things and just pass Collection<Scorer> in the constructor?
        Hide
        gsingers Grant Ingersoll added a comment -

        Yonik, Paul, do either of you know the status on this one? From the looks of it, it hasn't been implemented. It also has the highest number of votes in JIRA, so I thought I would take a look at it. One downside is it is not in patch form, but it also doesn't look to hard to extract the changes, either.

        One issue I have with these performance issues is that we don't have a reliable benchmarking suite. I am not a lawyer, but might we be able to use something like http://trec.nist.gov/data/reuters/reuters.html to build a sample benchmark suite? This corpus, plus 100 or so queries could work nicely. Of course, we would have to figure out some way for those interested to get their hands on the data. What do others do for benchmarking?

        Show
        gsingers Grant Ingersoll added a comment - Yonik, Paul, do either of you know the status on this one? From the looks of it, it hasn't been implemented. It also has the highest number of votes in JIRA, so I thought I would take a look at it. One downside is it is not in patch form, but it also doesn't look to hard to extract the changes, either. One issue I have with these performance issues is that we don't have a reliable benchmarking suite. I am not a lawyer, but might we be able to use something like http://trec.nist.gov/data/reuters/reuters.html to build a sample benchmark suite? This corpus, plus 100 or so queries could work nicely. Of course, we would have to figure out some way for those interested to get their hands on the data. What do others do for benchmarking?
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        Iirc the orginal performance problem was caused by creation of objects in the tight loop
        doing skipTo() on al the scorers.

        This patch is against current trunk and based on the earlier posted versions of ConjunctionScorer.
        which was based (by the first poster) on an existing ConjunctionScorer with an ASL notice,
        which is why I could grant the licence to ASF.

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - Iirc the orginal performance problem was caused by creation of objects in the tight loop doing skipTo() on al the scorers. This patch is against current trunk and based on the earlier posted versions of ConjunctionScorer. which was based (by the first poster) on an existing ConjunctionScorer with an ASL notice, which is why I could grant the licence to ASF.
        Hide
        paul.elschot@xs4all.nl Paul Elschot added a comment -

        I just overlooked the grant by Abdul to the ASF.

        Show
        paul.elschot@xs4all.nl Paul Elschot added a comment - I just overlooked the grant by Abdul to the ASF.
        Hide
        yseeley@gmail.com Yonik Seeley added a comment -

        Thanks! I just committed this.

        Show
        yseeley@gmail.com Yonik Seeley added a comment - Thanks! I just committed this.
        Hide
        mikemccand Michael McCandless added a comment -

        Closing all issues that were resolved for 2.1.

        Show
        mikemccand Michael McCandless added a comment - Closing all issues that were resolved for 2.1.

          People

          • Assignee:
            yseeley@gmail.com Yonik Seeley
            Reporter:
            abdollar Abdul Chaudhry
          • Votes:
            7 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development