Lucene - Core
  LUCENE-252

[PATCH] Problem with Sort logic on tokenized fields

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: core/search
    • Labels:
      None
    • Environment:

      Operating System: other
      Platform: All

      Description

      When you set a SortField to a Text field that gets tokenized,
      FieldCacheImpl uses the indexed terms to do the sort, so the sort order is
      wrong, especially when the field contains more than one word. I think it is
      much more logical to sort by the field's stored string value if the sort field
      is tokenized and stored. This way you'll get the CORRECT sort order.
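
      For illustration, a minimal, self-contained sketch of the symptom, written
      against the Lucene 1.4-era API (Field.Text, Hits, RAMDirectory); the field
      names and sample values are invented:

          import org.apache.lucene.analysis.standard.StandardAnalyzer;
          import org.apache.lucene.document.Document;
          import org.apache.lucene.document.Field;
          import org.apache.lucene.index.IndexWriter;
          import org.apache.lucene.index.Term;
          import org.apache.lucene.search.Hits;
          import org.apache.lucene.search.IndexSearcher;
          import org.apache.lucene.search.Sort;
          import org.apache.lucene.search.SortField;
          import org.apache.lucene.search.TermQuery;
          import org.apache.lucene.store.RAMDirectory;

          public class TokenizedSortDemo {
            public static void main(String[] args) throws Exception {
              RAMDirectory dir = new RAMDirectory();
              IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
              String[] titles = { "zebra apple", "apple zebra" };
              for (int i = 0; i < titles.length; i++) {
                Document doc = new Document();
                doc.add(Field.Keyword("all", "x"));      // constant field so every doc matches
                doc.add(Field.Text("title", titles[i])); // stored + indexed + tokenized
                writer.addDocument(doc);
              }
              writer.close();

              IndexSearcher searcher = new IndexSearcher(dir);
              Sort sort = new Sort(new SortField("title", SortField.STRING));
              Hits hits = searcher.search(new TermQuery(new Term("all", "x")), sort);
              for (int i = 0; i < hits.length(); i++) {
                // FieldCacheImpl keys the sort on one indexed term per document,
                // not on the stored value, so "zebra apple" can come out before
                // "apple zebra" here.
                System.out.println(hits.doc(i).get("title"));
              }
              searcher.close();
            }
          }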

        Activity

        Aviran Mordo added a comment -

        I wrote a fix that uses the stored values when the sort field is tokenized
        and stored, and uses the terms when the sort field is a Keyword.

        Index: FieldCacheImpl.java
        ===================================================================
        RCS file: /home/cvspublic/jakarta-lucene/src/java/org/apache/lucene/search/FieldCacheImpl.java,v
        retrieving revision 1.3
        diff -u -r1.3 FieldCacheImpl.java
        --- FieldCacheImpl.java 21 Jul 2004 19:05:46 -0000 1.3
        +++ FieldCacheImpl.java 28 Jul 2004 17:45:41 -0000
        @@ -25,6 +25,8 @@
         import java.util.Map;
         import java.util.WeakHashMap;
         import java.util.HashMap;
        +import org.apache.lucene.document.Field;
        +import java.util.Arrays;

         /**
          * Expert: The default cache implementation, storing all values in memory.
        @@ -80,6 +82,29 @@
             }
           }

        +  class FieldEntry implements Comparable {
        +    String val;
        +    int ind;
        +    FieldEntry(int ind, String val) {
        +      this.ind = ind;
        +      this.val = val;
        +    }
        +    public String getVal() {
        +      return val;
        +    }
        +    public int getInd() {
        +      return ind;
        +    }
        +    public int compareTo(Object obj) {
        +      return val.compareToIgnoreCase(((FieldEntry) obj).getVal());
        +    }
        +  }
        +
        +
           /** The internal cache. Maps Entry to array of interpreted term values. **/
           final Map cache = new WeakHashMap();
        @@ -240,54 +265,92 @@
             if (ret == null) {
               final int[] retArray = new int[reader.maxDoc()];
               String[] mterms = new String[reader.maxDoc()+1];
        -      if (retArray.length > 0) {
        -        TermDocs termDocs = reader.termDocs();
        -        TermEnum termEnum = reader.terms (new Term (field, ""));
        -        int t = 0;  // current term number
        -
        -        // an entry for documents that have no terms in this field
        -        // should a document with no terms be at top or bottom?
        -        // this puts them at the top - if it is changed, FieldDocSortedHitQueue
        -        // needs to change as well.
        -        mterms[t++] = null;
        -        try {
        -          if (termEnum.term() == null) {
        -            throw new RuntimeException ("no terms in field " + field);
        -          }
        -          do {
        -            Term term = termEnum.term();
        -            if (term.field() != field) break;
        -
        -            // store term text
        -            // we expect that there is at most one term per document
        -            if (t >= mterms.length) throw new RuntimeException ("there are more terms than documents in field \"" + field + "\"");
        -            mterms[t] = term.text();
        -
        -            termDocs.seek (termEnum);
        -            while (termDocs.next()) {
        -              retArray[termDocs.doc()] = t;
        -            }
        -
        -            t++;
        -          } while (termEnum.next());
        -        } finally {
        -          termDocs.close();
        -          termEnum.close();
        -        }
        -        if (t == 0) {
        -          // if there are no terms, make the term array
        -          // have a single null entry
        -          mterms = new String[1];
        -        } else if (t < mterms.length) {
        -          // if there are less terms than documents,
        -          // trim off the dead array space
        -          String[] terms = new String[t];
        -          System.arraycopy (mterms, 0, terms, 0, t);
        -          mterms = terms;
        -        }
        -      }
        +      Field docField = reader.document(0).getField(field);
        +      if (docField.isStored() && docField.isTokenized()) {
        +        // Fill entries
        +        FieldEntry[] entries = new FieldEntry[reader.maxDoc()];
        +        for (int i = 0; i < reader.maxDoc(); i++) {
        +          String fieldValue;
        +          if (!reader.isDeleted(i))
        +            fieldValue = reader.document(i).get(field);
        +          else
        +            fieldValue = "";
        +          entries[i] = new FieldEntry(i, fieldValue);
        +        }
        +        Arrays.sort(entries);
        +        for (int i = 0; i < reader.maxDoc(); i++) {
        +          int ind = entries[i].getInd();
        +          retArray[ind] = i;
        +          mterms[ind] = entries[i].getVal();
        +        }
        +      }
        +      else {
        +        if (retArray.length > 0) {
        +          TermDocs termDocs = reader.termDocs();
        +          TermEnum termEnum = reader.terms(new Term(field, ""));
        +          int t = 0;  // current term number
        +
        +          // an entry for documents that have no terms in this field
        +          // should a document with no terms be at top or bottom?
        +          // this puts them at the top - if it is changed, FieldDocSortedHitQueue
        +          // needs to change as well.
        +          mterms[t++] = null;
        +
        +          try {
        +            if (termEnum.term() == null) {
        +              throw new RuntimeException("no terms in field " + field);
        +            }
        +            do {
        +              Term term = termEnum.term();
        +              if (term.field() != field)
        +                break;
        +
        +              // store term text
        +              // we expect that there is at most one term per document
        +              if (t >= mterms.length)
        +                throw new RuntimeException("there are more terms than documents in field \"" + field + "\"");
        +              mterms[t] = term.text();
        +              termDocs.seek(termEnum);
        +              while (termDocs.next()) {
        +                retArray[termDocs.doc()] = t;
        +              }
        +
        +              t++;
        +            } while (termEnum.next());
        +          } finally {
        +            termDocs.close();
        +            termEnum.close();
        +          }
        +
        +          if (t == 0) {
        +            // if there are no terms, make the term array
        +            // have a single null entry
        +            mterms = new String[1];
        +          } else if (t < mterms.length) {
        +            // if there are less terms than documents,
        +            // trim off the dead array space
        +            String[] terms = new String[t];
        +            System.arraycopy(mterms, 0, terms, 0, t);
        +            mterms = terms;
        +          }
        +        }
        +      }
               StringIndex value = new StringIndex (retArray, mterms);
               store (reader, field, STRING_INDEX, value);
               return value;
        @@ -309,7 +372,7 @@
           // inherit javadocs
           public Object getAuto (IndexReader reader, String field)
             throws IOException {
        -    field = field.intern();
        +    field = field.intern();
             Object ret = lookup (reader, field, SortField.AUTO);
             if (ret == null) {
               TermEnum enumerator = reader.terms (new Term (field, ""));
        Otis Gospodnetic added a comment -

        Aviran, could you attach your diff (instead of pasting it in), please?
        Use the 'Create a new attachment' link above. This will keep the patch nice and
        clean, without any wrapped lines and such. Thanks!

        Aviran Mordo added a comment -

        Created an attachment (id=12275)
        A diff patch file that uses the stored values in case the sorted field is tokenized

        Enis Soztutar added a comment -

        In the project Nutch, we have encountered a subtle bug, which I tracked down and found to be related to unintuitive caching in tokenized fields.

        nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example "url" in nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty.

        The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms. I have ported the previous patch and improved it for the 2.0 branch, and I will write a patch for the trunk.
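
        As a hedged illustration of the failure mode described above (this uses the
        FieldCache.getStrings() call that the merger relies on; the "url" field name
        mirrors the nutch example and the token values in the comments are invented):

            import java.io.IOException;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.search.FieldCache;

            public class DedupKeyExample {
              /** Returns the per-document dedup keys the way a merger would see them. */
              public static String[] dedupKeys(IndexReader reader) throws IOException {
                // For a tokenized field such as "url", each slot ends up holding one
                // indexed token (e.g. "com" or "www") rather than the full stored URL,
                // so unrelated documents can share the same "key" and be dropped as dupes.
                return FieldCache.DEFAULT.getStrings(reader, "url");
              }
            }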

        I am voting for this patch to be committed.

        Enis Soztutar added a comment -

        A bug in FieldCacheImpl_Tokenized_fields_lucene_2.0.patch has been fixed; it caused the newly built cache not to be stored. This patch obsoletes FieldCacheImpl_Tokenized_fields_lucene_2.0.patch.

        Enis Soztutar added a comment -

        This version patches against the current trunk version 2.2-dev. Please see the above comments about the contents of the patch.

        Doug Cutting added a comment -

        I am not convinced that we should encourage folks to use a FieldCache with a tokenized field in this way, expecting the untokenized, stored value. It is considerably slower to populate the cache this way, greatly reducing its utility. Instead we should probably improve the documentation on this point, and perhaps even throw an exception for tokenized fields.

        Hoss Man added a comment -

        FieldCacheImpl.getStringIndex already throws an exception if it detects it's being used on a tokenized field (but I'm not sure it does a perfect job of detecting that ... it seems to make a naive assumption based on the number of terms) ... I'm guessing that check wasn't in the code when this bug was initially filed.

        I agree with Doug: silently using the stored field value if/when a field is tokenized is a bad idea. There is a separate patch available that allows people to be much more explicit about their intentions - see LUCENE-769.

        Enis Soztutar added a comment -

        Well, I have spent half a day tracking down this issue with tokenized field caching, so I absolutely agree on throwing an exception in the getStrings() and getStringIndex() functions of FieldCacheImpl. A snippet would look like:

        Field docField = getField(reader, field);
        if (docField != null && docField.isStored() && docField.isTokenized()) {
          throw new RuntimeException("Caching in Tokenized Fields is not allowed");
        }

        Looking at the timing of cache building, tokenized fields are really slow, as Doug mentioned: for a real 1.5M-document index (from web documents), building the cache on a tokenized field takes 30000 ms on average, while for an untokenized field it takes 1600 ms on average.

        In nutch we have 3 options: the 1st is to disallow deleting duplicates on tokenized fields (due to FieldCache), the 2nd is to index the tokenized field twice (once tokenized and once untokenized), and the 3rd is to use the above patch and warm the cache initially in the index servers.

        I am in favor of the 3rd option and believe that this patch is necessary; it can be included with an explanatory javadoc.
        Another option would be to extend the default FieldCacheImpl to allow tokenized field caching, naming the class along the lines of LUCENE-769's, such as StoredFieldCacheImpl. If that is OK, I can prepare a patch and send it here.

        Yonik Seeley added a comment -

        I wouldn't want to see sorting on a tokenized field disallowed (Lucene's definition of tokenized is that analysis is done on the field value). There are tokenized fields that simply do things like turn a number into a text-sortable value, lowercase, or apply some other sort of normalization that doesn't produce multiple tokens.
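
        To illustrate the point, a hedged sketch of a "tokenized" field whose analyzer emits
        exactly one normalized token per value, so per-document sorting on it remains well
        defined. It assumes KeywordTokenizer and LowerCaseFilter from org.apache.lucene.analysis
        are available (both exist in the 2.x line this thread targets):

            import java.io.Reader;
            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.KeywordTokenizer;
            import org.apache.lucene.analysis.LowerCaseFilter;
            import org.apache.lucene.analysis.TokenStream;

            public class LowercaseSingleTokenAnalyzer extends Analyzer {
              public TokenStream tokenStream(String fieldName, Reader reader) {
                // The whole field value becomes one token, which is then lowercased:
                // the field is "tokenized" in Lucene's sense, yet there is still
                // exactly one term per document.
                return new LowerCaseFilter(new KeywordTokenizer(reader));
              }
            }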

        Hoss Man added a comment -

        Definitely in agreement with Yonik here; erroring out if "docField.isTokenized()" would prevent some perfectly valid use cases ... my point was that the current test of "if (t >= mterms.length)" only triggers an error if there are more total terms in the field than there are documents in the index ... but there can be plenty of situations where a doc has more than one indexed term, yet the total number of indexed terms is less than the number of documents. A better test would be to check whether we have already recorded a term for this doc.

        I have to say: I'm really not understanding how the current behavior is hindering nutch ... my understanding of the nutch model is that the set of fields is very well known – why do you need to rely on FieldCache being smart enough to stop you from trying to sort on a tokenized field? (and what does that have to do with deleting duplicates?)

        if nothing else: if nutch needs to prevent using FieldCache based sorting on tokenized fields, why can't the "if (docField.isTokenized())" logic be done outside of the FieldCacheImpl ... possibly as a way to decide if you want to use the basic sorting or use something like LUCENE-769?

        ...perhaps this is something that should be discussed more on java-dev?

        Enis Soztutar added a comment -

        I also agree with tokenized field caching, which is a use case for nutch. Let me elaborate on the use case. In a nutch deployment, we generate indexes from the web documents, and indeed the set of fields is known a priori. Then the indexes are distributed to several index servers running on hadoop's RPC calls. Then the query is sent to all of the index servers, and the results are collected and merged on the fly. Since the indexes need not be disjoint (since crawling is an adaptive process), the results must be merged without returning a document more than once, so we need a unique key to represent the document. The default nutch codebase uses the site field (the url's hostname), which is untokenized, for such a task, and allows only 1-2 documents from a site in the search results. For obvious performance reasons, the site field is cached in the index servers with FieldCache.getStrings(). The problem arises when we want to show more than one result from a specific site (for example in a site:apache.org query), and when we have the same url indexed in more than one index server. If we use the tokenized url field in the FieldCache, deleting duplicates becomes error prone. Since we use FieldCache.getStrings() rather than FieldCache.getStringIndex(), the problem here is not tokenized field sorting, but tokenized fields not caching correctly; an example is an array like [com, edu, www, youtube, ] from the getStrings() method (for each doc, only a token is returned, rather than the whole url).

        Well, if you are still with me, here is my proposal:

        1. In FieldCacheImpl.java, in both the getStrings and getStringIndex functions, add:

        Field docField = getField(reader, field);
        if (docField != null && docField.isStored() && docField.isTokenized()) {
          throw new RuntimeException("Caching in Tokenized Fields is not allowed");
        }

        2. Subclass FieldCacheImpl as StoredFieldCacheImpl and implement stored field caching there, delegating untokenized fields to the super class (a rough sketch follows below).
        3. Add the implementation to FieldCache.java:

        public static FieldCache DEFAULT = new FieldCacheImpl();
        public static FieldCache STORED_CACHE = new StoredFieldCacheImpl();

        This way lucene internals will not be affected and stored field caching can still be performed.
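
        As a rough sketch of the idea in point 2 (written here as a standalone helper rather
        than a subclass so the example stays self-contained; the class name and caching
        strategy are illustrative, not an existing Lucene API):

            import java.io.IOException;
            import java.util.HashMap;
            import java.util.Map;
            import java.util.WeakHashMap;
            import org.apache.lucene.index.IndexReader;

            public class StoredFieldCache {
              // reader -> (field name -> String[] of stored values, indexed by doc id)
              private final Map cache = new WeakHashMap();

              public synchronized String[] getStrings(IndexReader reader, String field)
                  throws IOException {
                Map byField = (Map) cache.get(reader);
                if (byField == null) {
                  byField = new HashMap();
                  cache.put(reader, byField);
                }
                String[] values = (String[]) byField.get(field);
                if (values == null) {
                  values = new String[reader.maxDoc()];
                  for (int i = 0; i < reader.maxDoc(); i++) {
                    // Deleted documents keep an empty value so array slots stay doc ids.
                    values[i] = reader.isDeleted(i) ? "" : reader.document(i).get(field);
                  }
                  byField.put(field, values);
                }
                return values;
              }
            }

        Untokenized fields would still go through FieldCache.DEFAULT; something like this would
        only be consulted for fields that are stored and tokenized.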

        Hoss Man added a comment -

        I'm afraid I'm still not understanding the issue in nutch; it seems like the root of the problem is...

        > ... We use the tokenized url field in the FieldCache ...

        ...if you know this field is tokenized, don't use it this way. If you want to use it this way, index it a second time untokenized.

        At a more practical level:

        1) the change you propose to getStrings and getStringIndex is not practical because, as we've discussed before, a field being tokenized isn't a guarantee that FieldCache won't work - isTokenized just indicates that an Analyzer was used; it doesn't indicate that any real tokenization took place (the analyzer might have just been used to lowercase the field value before indexing, or strip off leading/trailing whitespace), and that doesn't mean the normal FieldCache can't be used for sorting. The converse is also true: !isTokenized doesn't tell you that it's safe to build the FieldCache - even if no Analyzer is ever used, multiple Field values can be added for the same field, and that is the root cause of the problem: not tokenization, but multiple terms for a given field.

        2) the desired behavior you are requesting in a StoredFieldCacheImpl could be done without making any changes whatsoever to FieldCacheImpl - since nutch knows exactly which fields it's indexing multiple tokens for, it can make the choice between using a StoredFieldCacheImpl or using a FieldCacheImpl. (but as I've said, I really don't think that's the right solution)
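
        To make the "index it a second time untokenized" suggestion concrete, a minimal sketch
        using the Lucene 2.0 Field constants; the field names are invented for illustration:

            import org.apache.lucene.document.Document;
            import org.apache.lucene.document.Field;

            public class DualUrlField {
              /** Builds a document that carries the url both tokenized and untokenized. */
              public static Document makeDoc(String url) {
                Document doc = new Document();
                // Tokenized copy: what gets searched on.
                doc.add(new Field("url", url, Field.Store.YES, Field.Index.TOKENIZED));
                // Untokenized copy: exactly one term per document, safe for
                // FieldCache-based sorting and dedup.
                doc.add(new Field("url_exact", url, Field.Store.NO, Field.Index.UN_TOKENIZED));
                return doc;
              }
            }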

        Enis Soztutar added a comment -

        I should admit that, considering the case Yonik has mentioned, throwing an exception by checking Field.isTokenized() is not suitable. However, the check if (t >= mterms.length) is only in getStringIndex() and not in getStrings(). I think a more robust check than the aforementioned should be included in both the getStrings and getStringIndex functions. A possibility would be to allocate a boolean array (or BitSet) of the same size as retArray, and then use it to avoid multiple terms per document.
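
        A hedged sketch of that per-document check, written against the 2.x TermEnum/TermDocs
        API that getStringIndex uses; the class and method names here are illustrative only:

            import java.io.IOException;
            import java.util.BitSet;
            import org.apache.lucene.index.IndexReader;
            import org.apache.lucene.index.Term;
            import org.apache.lucene.index.TermDocs;
            import org.apache.lucene.index.TermEnum;

            public class MultiTermCheck {
              /** Throws if any document has more than one term in the given field. */
              public static void assertSingleTermPerDoc(IndexReader reader, String field)
                  throws IOException {
                field = field.intern();  // term fields are interned, so == works below
                BitSet seen = new BitSet(reader.maxDoc());
                TermEnum termEnum = reader.terms(new Term(field, ""));
                TermDocs termDocs = reader.termDocs();
                try {
                  do {
                    Term term = termEnum.term();
                    if (term == null || term.field() != field) break;
                    termDocs.seek(termEnum);
                    while (termDocs.next()) {
                      int doc = termDocs.doc();
                      if (seen.get(doc)) {
                        throw new RuntimeException("document " + doc
                            + " has more than one term in field \"" + field + "\"");
                      }
                      seen.set(doc);
                    }
                  } while (termEnum.next());
                } finally {
                  termDocs.close();
                  termEnum.close();
                }
              }
            }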

        > 2) the desired behavior you are requesting in a StoredFieldCacheImpl could be done without making any changes whatsoever to FieldCacheImpl - since nutch knows exactly which fields it's indexing multiple tokens for, it can make the choice between using a StoredFieldCacheImpl or using a FieldCacheImpl.

        From my previous post: In nutch we have 3 options: the 1st is to disallow deleting duplicates on tokenized fields (due to FieldCache), the 2nd is to index the tokenized field twice (once tokenized and once untokenized), the 3rd is to use the above patch and warm the cache initially in the index servers.

        Yes, indexing a field a second time is an option, but considering my use cases with nutch, why would I want to grow my index by indexing the field twice, instead of tolerating 30 seconds of cache building in a web server which will serve the indexes for days or even weeks?

        With a class like StoredFieldCacheImpl we can get the desired behaviour without modifying FieldCacheImpl, and my suggestion in my previous post, without the 1st part, does just this. I could have sent this to nutch but I think it is a lucene issue.

        Any more suggestions?

        Mark Miller added a comment -

        This issue is too old - if a new patch/proposal is brought up we can reopen it.


          People

          • Assignee: Unassigned
          • Reporter: Aviran Mordo
          • Votes: 1
          • Watchers: 1

            Dates

            • Created:
              Updated:
              Resolved:
