Nutch
NUTCH-455

dedup on tokenized fields is faulty

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 0.9.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      (From LUCENE-252)
      Nutch uses several index servers, and the search results from these servers are merged using a dedup field for deleting duplicates. The values from this field are cached by Lucene's FieldCacheImpl. The default is the site field, which is indexed and tokenized. However, for a tokenized field (for example "url" in Nutch), FieldCacheImpl returns an array of terms rather than an array of field values, so dedup'ing becomes faulty. The current FieldCache implementation does not respect tokenized fields and, as described above, caches only terms.

      So in the situation where we are searching using "url" as the dedup field, when a Hit is constructed in IndexSearcher, the dedupValue becomes a token of the URL (such as "www" or "com") rather than the whole URL. This prevents using tokenized fields as the dedup field.
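      As a self-contained illustration (plain Java, not the actual Lucene FieldCache; the toy tokenizer is a stand-in for a URL analyzer), this is the failure mode: when the cached dedup key is a single token of the URL, distinct URLs collapse to the same key and one hit is wrongly dropped.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simulates why deduping on a tokenized field is faulty: the "cache"
// holds one token per document (as FieldCache does with terms), so
// distinct URLs can share the same dedup key.
public class TokenizedDedupDemo {
    static String firstToken(String url) {
        // A URL analyzer splits on punctuation; the cache ends up
        // holding a token such as "www" instead of the whole value.
        return url.split("[./:]+")[1];
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.a.com/page1",
            "http://www.b.com/page2");

        Map<String, String> dedupByToken = new LinkedHashMap<>();
        for (String u : urls) dedupByToken.putIfAbsent(firstToken(u), u);

        Map<String, String> dedupByValue = new LinkedHashMap<>();
        for (String u : urls) dedupByValue.putIfAbsent(u, u);

        // Both URLs share the token "www", so one hit is wrongly dropped.
        System.out.println(dedupByToken.size()); // 1
        System.out.println(dedupByValue.size()); // 2
    }
}
```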

      I have written a patch for Lucene and attached it at http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the aforementioned issue with tokenized-field caching. However, building such a cache for about 1.5M documents takes 20+ seconds. The code in IndexSearcher.translateHits() starts with

      if (dedupField != null)
        dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);

      so the cache is built on the first call to search in IndexSearcher.

      Long story short, I have written a patch against IndexSearcher which, in its constructor, warms up the caches of the wanted fields (configurable). I think we should vote for LUCENE-252, and then commit the above patch with the latest version of Lucene.
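      A hedged sketch of that warm-up idea (the class, field names, and loader function here are stand-ins, not the actual patch or the Nutch/Lucene API): the expensive per-field cache is built once in the constructor, so the first search request no longer pays the 20+ second cost.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical warm-up wrapper: builds the per-field value cache up
// front instead of lazily on the first search.
public class WarmedSearcher {
    private final Map<String, String[]> fieldCache = new HashMap<>();

    // warmFields would come from configuration; loader stands in for
    // FieldCache.DEFAULT.getStrings(reader, field).
    public WarmedSearcher(List<String> warmFields,
                          Function<String, String[]> loader) {
        for (String field : warmFields) {
            fieldCache.put(field, loader.apply(field)); // pay the cost up front
        }
    }

    public String[] cachedValues(String field) {
        return fieldCache.get(field); // already warm at search time
    }
}
```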

        Activity

        Transition: Open → Closed
        Time In Source Status: 1486d 4h 28m
        Execution Times: 1
        Last Executer: Markus Jelsma
        Last Execution Date: 01/Apr/11 15:35
        Julien Nioche made changes -
        Component/s searcher [ 11593 ]
        Markus Jelsma made changes -
        Status Open [ 1 ] Closed [ 6 ]
        Resolution Won't Fix [ 2 ]
        Markus Jelsma added a comment - Bulk close of legacy issues: http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira
        Chris A. Mattmann made changes -
        Fix Version/s 1.1 [ 12313609 ]
        Chris A. Mattmann added a comment - pushing this out per http://bit.ly/c7tBv9
        Andrzej Bialecki added a comment -

        Since LUCENE-252 is still unresolved, and it's not clear which of the proposed solutions should be selected, I'm postponing this issue.

        Andrzej Bialecki made changes -
        Fix Version/s 1.1 [ 12313609 ]
        Fix Version/s 1.0.0 [ 12312443 ]
        Sami Siren made changes -
        Fix Version/s 0.9.0 [ 12312013 ]
        Fix Version/s 1.0.0 [ 12312443 ]
        Enis Soztutar added a comment -

        (from LUCENE-252)

        In Nutch we have three options: the 1st is to disallow deleting duplicates on tokenized fields (due to FieldCache); the 2nd is to index the tokenized field twice (once tokenized and once untokenized); the 3rd is to use LUCENE-252 and the above patch and warm the cache initially in the index servers.

        I am in favor of the 3rd option.
        I think first resolving LUCENE-252 and then proceeding with NUTCH-255 is more sensible.
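        The 2nd option above can be sketched as follows (illustrative only: the field names and the toy document/tokenizer are assumptions; real code would use Lucene's tokenized and untokenized Field variants). The same value is stored under two field names, one analyzed for search and one kept whole for dedup.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy document holding both a tokenized and an untokenized copy of the URL.
public class DualFieldDoc {
    final Map<String, List<String>> fields = new HashMap<>();

    void addUrl(String url) {
        // searchable, tokenized copy
        fields.put("url", Arrays.asList(url.split("[./:]+")));
        // untokenized copy whose cached value is the whole URL
        fields.put("url_exact", Arrays.asList(url));
    }
}
```

        Dedup would then be configured against "url_exact" while queries still hit "url".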

        Doug Cutting added a comment -

        Alternately, we could define it as an error to attempt to dedup by a tokenized field. That's the (undocumented) expectation of FieldCache. Using documents to populate a FieldCache for tokenized fields is very slow. It's better to add an untokenized version and use that, no? If you agree, then the more appropriate fix is to document the restriction and try to check for it at runtime.

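        The restriction Doug describes could be enforced roughly like this (a hedged sketch: the class and the isTokenized map are stand-ins for whatever metadata the real index exposes about how each field was indexed).

```java
import java.util.Map;

// Runtime guard: refuse to dedup on a field that was indexed tokenized.
public class DedupFieldCheck {
    private final Map<String, Boolean> isTokenized;

    public DedupFieldCheck(Map<String, Boolean> isTokenized) {
        this.isTokenized = isTokenized;
    }

    public void requireUntokenized(String dedupField) {
        if (Boolean.TRUE.equals(isTokenized.get(dedupField))) {
            throw new IllegalArgumentException(
                "dedup field '" + dedupField + "' is tokenized; "
                + "index an untokenized copy and dedup on that instead");
        }
    }
}
```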
        Enis Soztutar made changes -
        Field Original Value New Value
        Attachment IndexSearcherCacheWarm.patch [ 12352821 ]
        Enis Soztutar added a comment -

        The patch to IndexSearcher is attached.

        Enis Soztutar created issue -

          People

          • Assignee:
            Unassigned
            Reporter:
            Enis Soztutar
          • Votes:
            0
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
