Lucene - Core / LUCENE-10541

What to do about massive terms in our Wikipedia EN LineFileDocs?

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 9.2
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

    Description

      Spinoff from this fun build failure that Dawid Weiss root caused: https://lucene.markmail.org/thread/pculfuazll4oebra

      Thank you and sorry Dawid Weiss!!

      This test failure happened because the test case randomly indexed a chunk of the nightly (many GBs) LineFileDocs Wikipedia file that contained a massive term (exceeding IW's ~32 KB limit), causing IW to throw an IllegalArgumentException and fail the test.

      It's crazy that it took so long for Lucene's randomized tests to discover this too-massive term in Lucene's nightly benchmarks.  It's like searching for Nessie, or SETI.

      We need to prevent such false failures, somehow. There are multiple options:

        • Fix this test to not use LineFileDocs
        • Remove all "massive" terms from all LineFileDocs files (nightly and git)
        • Fix MockTokenizer to trim such ridiculous terms (I think this is the best option?)
        • ...
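      The MockTokenizer option amounts to capping each emitted term at IndexWriter's byte limit (IndexWriter.MAX_TERM_LENGTH, 32766 UTF-8 bytes). A minimal sketch of that trimming idea, in Python rather than Lucene's actual Java implementation, with the cut made only at a whole-character boundary so the trimmed term stays valid UTF-8:

```python
# Hedged sketch: IndexWriter rejects terms whose UTF-8 encoding exceeds
# IndexWriter.MAX_TERM_LENGTH (32766 bytes). A tokenizer-side fix would
# trim oversize terms instead of letting IW throw IllegalArgumentException.
# This is an illustration only, not Lucene's MockTokenizer code.

MAX_TERM_BYTES = 32766  # IndexWriter.MAX_TERM_LENGTH in Lucene


def trim_term(term: str, limit: int = MAX_TERM_BYTES) -> str:
    """Trim a term so its UTF-8 encoding fits in `limit` bytes,
    cutting only at a whole-character boundary."""
    encoded = term.encode("utf-8")
    if len(encoded) <= limit:
        return term
    # Truncate the byte string, then drop any trailing partial character.
    return encoded[:limit].decode("utf-8", errors="ignore")


# A "massive" term like the one lurking in the Wikipedia LineFileDocs:
massive = "x" * 40_000
trimmed = trim_term(massive)
assert len(trimmed.encode("utf-8")) <= MAX_TERM_BYTES
```

      Decoding with errors="ignore" after a byte-level cut silently drops any partial multi-byte character at the boundary, which keeps the result a valid (if shortened) term.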

      Attachments

        Issue Links

          Activity

            People

              Assignee: Unassigned
              Reporter: mikemccand (Michael McCandless)
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 20m
