Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-6991

WordDelimiterFilter bug

Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 4.10.4, 5.3.1
    • None
    • None
    • None
    • New

    Description

      I was preparing analyzer which contains WordDelimiterFilter and I realized it sometimes gives results different then expected.

      I prepared a short test which shows the problem. I haven't used Lucene tests for this but this doesn't matter for showing the bug.

          String urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
                  " 200 3437 http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
                  "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
                  "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
                  "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
                  "Gecko/20100101 Firefox/20.0\"";
      
          List<String> tokens1 = new ArrayList<String>();
          List<String> tokens2 = new ArrayList<String>();
          WhitespaceAnalyzer analyzer = new WhitespaceAnalyzer();
          TokenStream tokenStream = analyzer.tokenStream("test", urlIndexed);
          tokenStream = new WordDelimiterFilter(tokenStream,
                  WordDelimiterFilter.GENERATE_WORD_PARTS |
                  WordDelimiterFilter.CATENATE_WORDS |
                  WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
              null);
          CharTermAttribute charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
          tokenStream.reset();
          while(tokenStream.incrementToken()) {
            tokens1.add(charAttrib.toString());
            System.out.println(charAttrib.toString());
          }
          tokenStream.end();
          tokenStream.close();
      
          urlIndexed = "144.214.37.14 - - [05/Jun/2013:08:39:27 +0000] \"GET /products/key-phrase-extractor/ HTTP/1.1\"" +
              " 200 3437 \"http://www.google.com/url?sa=t&rct=j&q=&esrc=s&" +
              "source=web&cd=15&cad=rja&ved=0CEgQFjAEOAo&url=http%3A%2F%2Fwww.sematext.com%2Fproducts%2Fkey-" +
              "phrase-extractor%2F&ei=TPOuUbaWM-OKiQfGxIGYDw&usg=AFQjCNGwYAFYg_M3EZnp2eEWJzdvRrVPrg&sig2" +
              "=oYitONI2EIZ0CQar7Ej8HA&bvm=bv.47380653,d.aGc\" \"Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:20.0) " +
              "Gecko/20100101 Firefox/20.0\"";
      
      
          System.out.println("\n\n====\n\n");
          tokenStream = analyzer.tokenStream("test", urlIndexed);
          tokenStream = new WordDelimiterFilter(tokenStream,
                  WordDelimiterFilter.GENERATE_WORD_PARTS |
                  WordDelimiterFilter.CATENATE_WORDS |
                  WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
              null);
          charAttrib = tokenStream.addAttribute(CharTermAttribute.class);
          tokenStream.reset();
          while(tokenStream.incrementToken()) {
            tokens2.add(charAttrib.toString());
            System.out.println(charAttrib.toString());
          }
          tokenStream.end();
          tokenStream.close();
      
          assertEquals(Joiner.on(",").join(tokens1), Joiner.on(",").join(tokens2));
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            prog Pawel Rog
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: