[LUCENE-3905] BaseTokenStreamTestCase should test analyzers on real-ish content

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      We already have LineFileDocs, which pulls content generated from Europarl or Wikipedia... I think BTSTC (BaseTokenStreamTestCase) should sometimes test the analyzers on that content as well.
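
      A minimal sketch of the idea, assuming the test-framework pieces of that era (LineFileDocs over the europarl/wikipedia line file, and BaseTokenStreamTestCase.checkAnalysisConsistency); exact constructors and signatures varied across versions, and the class and test names here are hypothetical, not the committed patch:

          import org.apache.lucene.analysis.Analyzer;
          import org.apache.lucene.analysis.BaseTokenStreamTestCase;
          import org.apache.lucene.analysis.MockAnalyzer;
          import org.apache.lucene.util.LineFileDocs;

          public class TestAnalyzerOnRealishContent extends BaseTokenStreamTestCase {
            public void testLineFileDocsContent() throws Exception {
              Analyzer a = new MockAnalyzer(random());        // stand-in for the analyzer under test
              LineFileDocs docs = new LineFileDocs(random()); // real-ish lines (europarl/wikipedia)
              try {
                for (int i = 0; i < 100; i++) {
                  String line = docs.nextDoc().get("body");
                  // reuse the same consistency checks BTSTC already runs on random strings:
                  checkAnalysisConsistency(random(), a, random().nextBoolean(), line);
                }
              } finally {
                docs.close();
              }
            }
          }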

      Attachments

      1. LUCENE-3905.patch
         8 kB
         Michael McCandless

        Activity

        mikemccand Michael McCandless added a comment -

        Patch.

        I also fixed an end() offset bug in the ngram tokenizers...

        rcmuir Robert Muir added a comment -

        +1. We know the ngram tokenizers truncate to the first 1024 chars, but that
        doesn't mean they can't implement end() correctly, so that at least
        highlighting on multi-valued fields etc. works.

        rcmuir Robert Muir added a comment -

        Oh, one thing: I think we should blast the filter versions of these the same way?

        E.g. if I have MockTokenizer + (Edge)NGramTokenFilter, are they OK?
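
        "Blasting" a filter the same way would look roughly like this: wrap MockTokenizer plus the filter in a throwaway Analyzer and run it through checkRandomData. A hedged sketch against the 4.x-style Analyzer API; the class name and gram sizes are made up:

            import java.io.Reader;

            import org.apache.lucene.analysis.Analyzer;
            import org.apache.lucene.analysis.BaseTokenStreamTestCase;
            import org.apache.lucene.analysis.MockTokenizer;
            import org.apache.lucene.analysis.Tokenizer;
            import org.apache.lucene.analysis.ngram.NGramTokenFilter;

            public class TestNGramFilterRandom extends BaseTokenStreamTestCase {
              public void testRandomStrings() throws Exception {
                Analyzer a = new Analyzer() {
                  @Override
                  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                    Tokenizer t = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
                    return new TokenStreamComponents(t, new NGramTokenFilter(t, 2, 4)); // min=2, max=4
                  }
                };
                checkRandomData(random(), a, 1000); // random strings, big and small
              }
            }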

        mikemccand Michael McCandless added a comment -

        The ngram filters are unfortunately not OK: they use up tons of RAM when you send random/big tokens through them, because they don't have the same 1024-character limit... I think we should open a new issue for them... in fact, I think repairing them could make a good GSoC project!
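
        The blow-up is simple arithmetic: a single token of length n yields about n grams per gram size, and without a length cap nothing bounds n. A toy illustration with assumed numbers (minGram=1, maxGram=2):

            public class NGramBlowupDemo {
              public static void main(String[] args) {
                int n = 1_000_000;       // one giant token, as random test data can produce
                int minGram = 1, maxGram = 2;
                long grams = 0;
                for (int size = minGram; size <= maxGram; size++) {
                  grams += n - size + 1; // n-size+1 grams of each size
                }
                // ~2 million grams buffered from one token; the tokenizers' 1024-char
                // cap bounds this, but the filters had no such cap at the time
                System.out.println(grams + " grams");
              }
            }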

        rcmuir Robert Muir added a comment -

        I see... well, +1 for this commit, it's an improvement!

        mikemccand Michael McCandless added a comment -

        OK I opened LUCENE-3907 for ngram love...


          People

          • Assignee: Unassigned
          • Reporter: mikemccand Michael McCandless
          • Votes: 0
          • Watchers: 0

