Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1, 4.0-ALPHA
    • Fix Version/s: 3.6
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A tool to sort an index according to a float document weight. Documents with high weight are given low document numbers, which means they will be evaluated first. When using a strategy of "early termination" of queries (see TimeLimitedCollector), such sorting significantly improves the quality of partial results.

      (Originally this tool was created by Doug Cutting in Nutch and used norms as document weights, so the ordering was constrained by the limited resolution of norms. This is a pure Lucene version of the tool, and it uses arbitrary floats from a specified stored field.)
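      The core of such a tool is just a docid permutation derived from the per-document weights. A minimal standalone sketch of that mapping (illustrative code, not the patch itself; all names are hypothetical):

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortSketch {

    // newToOld[newDocId] = oldDocId: the highest-weight document gets doc number 0.
    static int[] newToOld(float[] weights) {
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble((Integer d) -> (double) weights[d]).reversed());
        int[] result = new int[order.length];
        for (int i = 0; i < order.length; i++) result[i] = order[i];
        return result;
    }

    // oldToNew is the inverse permutation; it is what a postings rewrite would need.
    static int[] invert(int[] newToOld) {
        int[] oldToNew = new int[newToOld.length];
        for (int n = 0; n < newToOld.length; n++) oldToNew[newToOld[n]] = n;
        return oldToNew;
    }
}
```

      Both directions of the mapping are needed: newToOld to fetch stored fields for a remapped reader, oldToNew to rewrite docids in postings.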

      1. indexSorter.patch
        15 kB
        Andrzej Bialecki
      2. LUCENE-2482-4.0.patch
        13 kB
        Juan Grande

        Activity

        Andrzej Bialecki added a comment -

        Patch with the tool and a unit test.

        Eks Dev added a comment -

        nice!
        There is also another interesting use case for sorting an index: performance and index size!

        We use a couple of fields with low cardinality (zip code, user group, and the like). Having the index sorted on these makes RLE compression of postings really effective, making it possible to load all values into a couple of megabytes of RAM.
        At the moment we just sort the collection before indexing.

        Would it be possible somehow to use a combination of stored fields and to specify a comparator? Even comparing them as byte[] would do the trick for this business case, as it is only important to keep the same values together; order is irrelevant. Of course, having a decoder to decode byte[] before comparing would be useful (e.g. for composite fields), but it would work in many cases without it.

        This works fine even with a moderate update rate, as you can re-sort periodically. It does not have to be totally sorted; everything works, just slightly more memory is needed for filters.

        With flex, having postings that use RLE compression is quite possible ... this tool could become an "optimizeHard()" tool for some indexes.
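        The effect described here is easy to see with a toy run-length encoder: once the index is sorted on a low-cardinality field, each distinct value collapses into a single run (a standalone sketch under that assumption, not Lucene code):

```java
import java.util.ArrayList;
import java.util.List;

public class RleSketch {

    // Encode a column of per-document values as (value, runLength) pairs.
    static List<int[]> rle(int[] values) {
        List<int[]> runs = new ArrayList<>();
        for (int v : values) {
            int last = runs.size() - 1;
            if (last >= 0 && runs.get(last)[0] == v) {
                runs.get(last)[1]++;          // extend the current run
            } else {
                runs.add(new int[]{v, 1});    // start a new run
            }
        }
        return runs;
    }
}
```

        Sorting {1, 2, 1, 2, 1, 2} into {1, 1, 1, 2, 2, 2} shrinks six runs to two, which is why sorting before indexing keeps the compressed representation tiny.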

        Andrzej Bialecki added a comment -

        Re: combination of fields + a comparator: sure, why not, take a look at the implementation of the DocScore inner class - you can stuff whatever you want there.

        I'm not sure if I follow your use case though ... please remember that this re-sorting is applied exactly the same to all postings, so savings on one list may cause bloat on another list.

        Eks Dev added a comment -

        Re: I'm not sure if I follow your use case though

        Simple case: you have 100 million docs with 2 fields, CITY and TEXT.

        sorting on CITY makes postings look like:
        Orlando: ---------------------------------
        New York: -------------------------------------
        perfectly compressible.

        without really affecting distribution (compressibility) of terms from the TEXT field.

        If CITY remained in unsorted order (e.g. a uniform distribution), you would deal with very large postings for all terms coming from this field.

        Sorting on many fields often helps, e.g. if you have hierarchical compositions like 1 CITY with many ZIP_CODEs... philosophically, sorting always increases compressibility and improves locality of reference... but sure, you need to know what you want.

        Andrzej Bialecki added a comment -

        If there are no objections I'd like to commit this soon.

        Andrzej Bialecki added a comment -

        Committed in rev. 998948.

        Andrzej Bialecki added a comment -

        ..to remember: we need a port of this tool to 4.0

        Koji Sekiguchi added a comment -

        I think this is an interesting tool. I'm wondering if Solr can call it, as Solr does merge indexes.

        Are there any restrictions on this? I haven't looked into it deeply, but for example, I see isPayloadAvailable() always returns false. Does that mean it doesn't support payloads?
        Can it support multiple Sorts on indexed fields, other than the stored float field?

        Michael McCandless added a comment -

        We need to fix this on 4.0 too.

        Robert Muir added a comment -

        I'm not sure if I follow your use case though ... please remember that this re-sorting is applied exactly the same to all postings, so savings on one list may cause bloat on another list.

        Hi Andrzej, I came across this the other day and thought it would be really interesting in the context of some of our newer codecs under development in trunk and the bulkpostings branch.

        I found the results presented there based on index sorting really compelling for codecs like Simple9: a significant reduction in bits/posting, for docids especially, because such codecs can pack a lot of small deltas efficiently.

        The first method reorders the documents in a text collection based on the number of
        distinct terms contained in each document. The idea is that two documents that each
        contain a large number of distinct terms are more likely to share terms than are a
        document with many distinct terms and a document with few distinct terms. Therefore,
        by assigning docids so that documents with many terms are close together, we may
        expect a greater clustering effect than by assigning docids at random.
        
        The second method assumes that the documents have been crawled from the Web (or
        maybe a corporate Intranet). It reassigns docids in lexicographical order of URL. The
        idea here is that two documents from the same Web server (or maybe even from the
        same directory on that server) are more likely to share common terms than two random
        documents from unrelated locations on the Internet.
        

        http://www.ir.uwaterloo.ca/book/06-index-compression.pdf (see page 214: doc id reordering)
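        The bits/posting reduction comes from delta-gap encoding: reordering clusters a term's documents, so the gaps between consecutive docids shrink. A rough way to measure the effect (an illustrative sketch; Simple9 itself packs gaps into 32-bit words, which this does not model):

```java
public class GapSketch {

    // Total bits needed to write each docid gap in plain binary,
    // a crude proxy for how well a packed codec would do.
    static int gapBits(int[] sortedDocIds) {
        int bits = 0, prev = -1;
        for (int id : sortedDocIds) {
            int gap = id - prev;                              // always >= 1
            bits += 32 - Integer.numberOfLeadingZeros(gap);   // bit length of the gap
            prev = id;
        }
        return bits;
    }
}
```

        A term matching the clustered docs {0, 1, 2, 3} costs 4 bits by this measure, versus 22 bits for the spread-out {0, 100, 200, 300}.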

        Juan Grande added a comment -

        Hi! I'm attaching a patch with an implementation of this feature for Lucene 4.0. I'm not sure if the style is right because I can't download the codestyle.xml file for Eclipse.

        Shai Erera added a comment -

        Looks like it needs some more work - moving to 3.2

        Robert Muir added a comment -

        bulk move 3.2 -> 3.3

        Pablo Castellanos added a comment -

        Hi, I wanted to implement some early-termination strategies over my Lucene index, so I started playing with the 4.0 patch, as I need to reorder the index.

        I have found that a lot of functions have changed in the past year, and I had to make some modifications, mainly:

        /*@Override
        public TermFreqVector[] getTermFreqVectors(int docNumber) throws IOException {
          return super.getTermFreqVectors(newToOld[docNumber]);
        }*/

        @Override
        public Fields getTermVectors(int docID) throws IOException {
          return super.getTermVectors(newToOld[docID]);
        }

        /*@Override
        public Document document(int n, FieldSelector fieldSelector)
            throws CorruptIndexException, IOException {
          return super.document(newToOld[n], fieldSelector);
        }*/

        @Override
        public void document(int docID, StoredFieldVisitor visitor)
            throws CorruptIndexException, IOException {
          super.document(newToOld[docID], visitor);
        }
        

        There is also a getDeletedDocs function, and I haven't found any good replacement for it:

            /*@Override
            public Bits getDeletedDocs() {
              final Bits deletedDocs = super.getDeletedDocs();
        
              if (deletedDocs == null)
                return null;
        
              return new Bits() {
                @Override
                public boolean get(int index) {
                  return deletedDocs.get(newToOld[index]);
                }
        
                @Override
                public int length() {
                  return deletedDocs.length();
                }
              };
            }*/
        

        After applying these changes and running the code against my Lucene index, I get some weird results. It seems that the new sorting has worked, but the postings lists that access the documents are still pointing to the old data.

        Imagine that I have 2 documents in my index and I want to sort them by price (so the most expensive item should have a lower docId):

        Document 1

        docId:1, name: iPod, price: 100$

        Document 2

        docId:2, name: iPhone, price: 300$

        I run my modified version of IndexSorter over it and after that I try to query the new index, so if I query for name:iPhone I get:

        docId:2, name: iPod, price: 100$

        That leads me to believe that the documents have been sorted but the new index is using the old posting list.

        So I have two questions: are you planning on updating this code for newer versions of Lucene 4.0, or am I on my own to get it to work? And if so, where should I look for a solution to my problem?

        Thanks in advance for your help.
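        The symptom above is consistent with remapping stored fields and term vectors but not the postings themselves: a sorting tool also has to push every docid in every postings list through the old-to-new permutation and re-sort. In miniature (a standalone sketch, not the patch; names are illustrative):

```java
import java.util.Arrays;

public class PostingsRemap {

    // Rewrite one postings list for the sorted index: map each old docid to
    // its new position, then restore ascending docid order.
    static int[] remap(int[] postings, int[] oldToNew) {
        int[] remapped = new int[postings.length];
        for (int i = 0; i < postings.length; i++) {
            remapped[i] = oldToNew[postings[i]];
        }
        Arrays.sort(remapped);
        return remapped;
    }
}
```

        If this step is skipped, a query for a term still walks the old docids, so it fetches the wrong (now reordered) stored documents, exactly the iPod/iPhone mix-up described above.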

        Robert Muir added a comment -

        This issue is actually fixed in 3.x, but is still open for a 4.0 port.

        I'll open an issue (with fix version of 4.0) for the trunk port.

        Robert Muir added a comment -

        I opened LUCENE-3918 for the trunk port, to eliminate confusion in JIRA: the 3.x work has been done for a while.

        Matthew Willson added a comment -

        Hi all – a few quick questions, if anyone is still watching this.

        • Could this be used to achieve an impact-ordered index, as in e.g. [1], where documents in a given term's postings list are ordered by score contribution or term frequency?
        • Are there any caveats or things one should be aware of when it comes to index sorting in combination with different index merge strategies, and some of the more advanced stuff in Solr for managing distributed indexes?
        • Is anyone aware of any other work along the lines of early-stopping / dynamic-pruning optimisations in Lucene? E.g. MaxScore from [1] (I think Xapian [2] calls it 'operator decay') or accumulator-pruning-based algorithms from [1] (perhaps in combination with impact ordering)? In particular, is there anything in Lucene 4's approach to scoring and indexing which would make these hard in principle?

        Any pointers gratefully received.

        [1] Buettcher, Clarke & Cormack, "Implementing and Evaluating Search Engines", ch. 5, pp. 143-153
        [2] http://xapian.org/docs/matcherdesign.html


          People

          • Assignee:
            Andrzej Bialecki
            Reporter:
            Andrzej Bialecki
          • Votes:
            1
            Watchers:
            5
