Lucene - Core / LUCENE-8380

UTF8TaxonomyWriterCache inconsistency

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 7.1
    • Fix Version/s: 7.5
    • Component/s: modules/facet
    • Labels: None
    • Lucene Fields: New

    Description

      I’m facing a problem with taxonomy writer cache inconsistency. At some point UTF8TaxonomyWriterCache starts returning the wrong ord for some facet labels. As a result, wrong ords are written to the documents' facet fields, and searches return wrong counts (undercounts). The bug shows up on different servers with different index contents (we have several separate indexes, each with unique data).
      Unfortunately, I can’t reproduce this behaviour in tests.
      I've dumped a "broken" UTF8TaxonomyWriterCache instance and created an app that loads it and compares it with the real taxonomy. The dumps and the app are attached. To run the demo, extract the archives' contents and execute:

      mvn compile
      mvn exec:java -Dexec.mainClass="me.torobaev.lucene.taxonomy.cache.TaxonomyCacheCheck" -DtaxonomyDir=../taxonomy/ -DcacheDump=../taxonomy-cache.json
      

      As you can see, the labels [frametype, 7] and [modification_id, 682] have the same ord in the cache.
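
      The kind of check described above (two different labels mapped to one ord) can be sketched in plain Java. This is only an illustration, not the attached app: it assumes the cache dump has already been loaded into a label-to-ord map, and the class name OrdinalCollisionCheck, the "dim/value" label strings, and the ordinal values in main are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: not part of Lucene and not the app attached to this issue.
final class OrdinalCollisionCheck {

    /** Groups labels by ordinal and keeps only the ordinals shared by more than one label. */
    static Map<Integer, List<String>> findCollisions(Map<String, Integer> labelToOrd) {
        Map<Integer, List<String>> byOrd = new TreeMap<>();
        for (Map.Entry<String, Integer> e : labelToOrd.entrySet()) {
            byOrd.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // A consistent cache maps every label to its own ordinal, so drop the singletons.
        byOrd.values().removeIf(labels -> labels.size() < 2);
        return byOrd;
    }

    public static void main(String[] args) {
        // Assumed sample data; the ordinal values are made up for illustration.
        Map<String, Integer> cache = new LinkedHashMap<>();
        cache.put("frametype/7", 1001);
        cache.put("modification_id/682", 1001);
        cache.put("frametype/8", 1002);

        findCollisions(cache).forEach((ord, labels) ->
                System.out.println("ord " + ord + " is shared by " + labels));
    }
}
```

      An empty result means no two labels share an ord. It does not prove the cache agrees with the on-disk taxonomy; that still requires comparing each cached ord against the ord the taxonomy itself returns for the same label.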

      Attachments

        1. LUCENE-8380.patch
          5 kB
          Dawid Weiss
        2. lucene-taxonomy-cache-report.tar.gz
          4 kB
          Ruslan Torobaev
        3. taxonomy.tar.gz
          1.28 MB
          Ruslan Torobaev
        4. taxonomy-cache.json.gz
          1.25 MB
          Ruslan Torobaev

          People

            Assignee: Dawid Weiss (dweiss)
            Reporter: Ruslan Torobaev (ruslan.torobaev)
            Votes: 0
            Watchers: 5
