[LUCENE-6991] WordDelimiterFilter bug - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 4.10.4, 5.3.1
Fix Version/s: None
Component/s: None
Labels: None
Lucene Fields: New

Description

I was preparing an analyzer that contains WordDelimiterFilter and realized it sometimes gives results different than expected.

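I prepared a short test that shows the problem. It does not use the Lucene test framework, but that doesn't matter for showing the bug. The two input strings differ only in the double quote before the referrer URL (http://www.google.com/...): the first input omits it, the second includes it.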

    // First input: the referrer URL has no opening double quote.
    String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
            " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
            "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
            "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
            "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
            "Gecko/20100101 Firefox/20.0\"";

    List<String> tokens1 = new ArrayList<String>();
    List<String> tokens2 = new ArrayList<String>();
    WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
    TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
    tokenStream = new WordDelimiterFilter(tokenStream,
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
        null);
    CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while(tokenStream.incrementToken()) {
      tokens1.add(charAttrib.toString());
      System.out.println(charAttrib.toString());
    }
    tokenStream.end();
    tokenStream.close();

    // Second input: identical, except the referrer URL starts with a double quote.
    urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
        " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
        "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
        "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
        "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
        "Gecko/20100101 Firefox/20.0\"";


    System.out.println("\n\n====\n\n");
    tokenStream = analyzer.tokenStream("test", urlIndexed);
    tokenStream = new WordDelimiterFilter(tokenStream,
            WordDelimiterFilter.GENERATE_WORD_PARTS |
            WordDelimiterFilter.CATENATE_WORDS |
            WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
        null);
    charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while(tokenStream.incrementToken()) {
      tokens2.add(charAttrib.toString());
      System.out.println(charAttrib.toString());
    }
    tokenStream.end();
    tokenStream.close();

    // The two token lists are expected to match; the report is that they do not.
    assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
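
When the joined-string assertion fails, it can be hard to see where the two token lists first diverge. The following is a minimal sketch of a helper that reports that position. The class name `TokenDiff` and the method `firstMismatch` are hypothetical and not part of Lucene, and the sample tokens in `main` are made up for illustration rather than taken from actual filter output.

```java
import java.util.Arrays;
import java.util.List;

public class TokenDiff {
    // Hypothetical helper (not part of Lucene): returns the index of the first
    // position where the two token lists differ, or -1 if they are identical.
    static int firstMismatch(List<String> a, List<String> b) {
        int common = Math.min(a.size(), b.size());
        for (int i = 0; i < common; i++) {
            if (!a.get(i).equals(b.get(i))) {
                return i;
            }
        }
        // Either both lists are equal, or one is a strict prefix of the other.
        return a.size() == b.size() ? -1 : common;
    }

    public static void main(String[] args) {
        // Made-up tokens for illustration only; not actual filter output.
        List<String> tokens1 = Arrays.asList("GET", "products", "key");
        List<String> tokens2 = Arrays.asList("GET", "product", "key");
        System.out.println(firstMismatch(tokens1, tokens2));
    }
}
```

In the test above, calling `firstMismatch(tokens1, tokens2)` before the assertion and printing the tokens around the returned index would point at the token where the two runs begin to differ.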

People

Assignee: Unassigned

Reporter: Pawel Rog

Votes: 0

Watchers: 4

Dates

Created: 25/Jan/16 10:41

Updated: 28/Aug/22 14:48