Lucene - Core / LUCENE-3905

BaseTokenStreamTestCase should test analyzers on real-ish content

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      We already have LineFileDocs, that pulls content generated from europarl or wikipedia... I think sometimes BTSTC should test the analyzers on that as well.
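
      A minimal sketch of the idea (illustrative only, not the committed patch): pull "real-ish" lines from the test-framework's LineFileDocs, whose documents carry a "body" field, and run them through BaseTokenStreamTestCase's consistency check.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockAnalyzer;
import org.apache.lucene.util.LineFileDocs;

public class TestAnalyzersOnRealishContent extends BaseTokenStreamTestCase {
  public void testRealishContent() throws Exception {
    Analyzer analyzer = new MockAnalyzer(random());
    LineFileDocs docs = new LineFileDocs(random());
    // Feed europarl/wikipedia-derived lines instead of purely random strings,
    // and check the analyzer produces consistent tokens for each one.
    for (int i = 0; i < 100; i++) {
      String text = docs.nextDoc().get("body");
      checkAnalysisConsistency(random(), analyzer, random().nextBoolean(), text);
    }
    docs.close();
  }
}
```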

      Attachments

      1. LUCENE-3905.patch (8 kB) by Michael McCandless

        Activity

        Michael McCandless added a comment -

        Patch.

        I also fixed an end() offset bug in the ngram tokenizers...

        Robert Muir added a comment -

        +1. We know the ngram tokenizers truncate to the first 1024 chars, but that doesn't mean they can't implement end() correctly, so that at least highlighting on multivalued fields etc. works.
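
        For readers following along, a hedged sketch of the pattern under discussion: even if tokenization stops after 1024 chars, end() can still report a consistent final offset. The charsRead counter and offsetAtt field here are hypothetical names, not code from the patch.

```java
// Assumes the tokenizer declares, as usual:
//   private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
// and counts every char it consumes from the reader in charsRead.
@Override
public void end() throws IOException {
  super.end();
  // Map the raw count through any CharFilters in front of this tokenizer.
  int finalOffset = correctOffset(charsRead);
  offsetAtt.setOffset(finalOffset, finalOffset);
}
```

        A correct final offset is what lets IndexWriter compute sane offsets for the next value of a multivalued field, which in turn keeps highlighting working.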

        Robert Muir added a comment -

        Oh, one thing: I think we should blast the filter versions of these the same way?

        E.g. if I have MockTokenizer + (Edge)NGramFilter, are they OK?
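
        The proposed test might look like this (a sketch against the 4.0-era APIs: MockTokenizer and checkRandomData from the test-framework, plus the NGramTokenFilter(input, minGram, maxGram) constructor; not code from the issue):

```java
import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.BaseTokenStreamTestCase;
import org.apache.lucene.analysis.MockTokenizer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenFilter;

public class TestNGramFilterRandom extends BaseTokenStreamTestCase {
  public void testRandomStrings() throws Exception {
    Analyzer a = new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer tokenizer = new MockTokenizer(reader, MockTokenizer.WHITESPACE, false);
        return new TokenStreamComponents(tokenizer, new NGramTokenFilter(tokenizer, 2, 4));
      }
    };
    // Blast the whole chain with random text, the same way the tokenizers are tested.
    checkRandomData(random(), a, 1000 * RANDOM_MULTIPLIER);
  }
}
```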

        Michael McCandless added a comment -

        The ngram filters are unfortunately not OK: they use up tons of RAM when you send random/big tokens through them, because they don't have the same 1024-character limit... I think we should open a new issue for them... in fact, I think repairing them could make a good GSoC project!
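
        For a rough sense of scale (an illustration, not a measurement from the issue): a token of length n run through an ngram filter with gram sizes min..max emits about (max - min + 1) * n grams, each holding its own character copy, and nothing caps n.

```java
// Back-of-the-envelope only: how many grams one token of length n yields.
class GramMath {
  static long gramCount(long n, int min, int max) {
    long total = 0;
    for (int g = min; g <= max; g++) {
      total += Math.max(0, n - g + 1);
    }
    return total;
  }
  // gramCount(1000000, 1, 3) == 2999997: one huge random token becomes
  // roughly three million grams.
}
```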

        Robert Muir added a comment -

        I see... well, +1 for this commit, it's an improvement!

        Michael McCandless added a comment -

        OK I opened LUCENE-3907 for ngram love...


          People

          • Assignee: Unassigned
          • Reporter: Michael McCandless
          • Votes: 0
          • Watchers: 1
