Lucene - Core / LUCENE-1683

RegexQuery matches terms the input regex doesn't actually match

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2
    • Fix Version/s: 2.9
    • Component/s: modules/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      I was writing some unit tests for our own wrapper around the Lucene regex classes, and got tripped up by something interesting.

      The regex "cat." will match "cats" but also any term that starts with "cat" followed by one or more characters (e.g. "cathy", "catcher", ...). It is as if there is an implicit .* always added to the end of the regex.

      Here's a unit test for the behaviour I would expect myself:

      @Test
      public void testNecessity() throws Exception {
          File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
          IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
          try {
              Document doc = new Document();
              doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
              writer.addDocument(doc);
          } finally {
              writer.close();
          }

          IndexReader reader = IndexReader.open(dir);
          try {
              TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
              assertEquals("Wrong term", "cats", terms.term().text());
              assertFalse("Should have only been one term", terms.next());
          } finally {
              reader.close();
          }
      }

      This test fails on the term check with terms.term() equal to "cathy".

      Our workaround is to mangle the query like this:

      String fixed = String.format("(?:%s)$", original);
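The effect of this workaround can be checked with plain java.util.regex, without an index. The sketch below assumes the query ends up doing a prefix-style match (Matcher.lookingAt(), as a later comment in this thread identifies); the class and method names are hypothetical and are not part of Lucene.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
class AnchoredRegexWorkaround {

    // Wrap the regex in a non-capturing group and anchor it at the end,
    // using the same format string as the workaround quoted above.
    static String anchor(String original) {
        return String.format("(?:%s)$", original);
    }

    // Prefix match: the pattern must match at the start of the term,
    // but does not have to consume the whole term.
    static boolean prefixMatch(String regex, String term) {
        return Pattern.compile(regex).matcher(term).lookingAt();
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                + ": raw=" + prefixMatch("cat.", term)
                + ", anchored=" + prefixMatch(anchor("cat."), term));
        }
    }
}
```

Under a prefix match, the raw pattern accepts both "cats" and "cathy", while the anchored pattern accepts only "cats". The non-capturing group matters for patterns with alternation: without it, a trailing "$" appended to "cat|dog" would bind only to the "dog" branch.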

        Activity

        Trejkaz added a comment -

        I screwed up the formatting. Fixed version:

            @Test
            public void testNecessity() throws Exception
            {
                File dir = new File(new File(System.getProperty("java.io.tmpdir")), "index");
                IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);
                Document doc = new Document();
                doc.add(new Field("field", "cat cats cathy", Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
                writer.close();
        
                IndexReader reader = IndexReader.open(dir);
        
                TermEnum terms = new RegexQuery(new Term("field", "cat.")).getEnum(reader);
                assertEquals("Wrong term", "cats", terms.term().text());
                assertFalse("Should have only been one term", terms.next());
            }
        
        Michael McCandless added a comment -

        Do you have a proposed fix for this...? Or, why is RegexQuery treating the trailing "." as a ".*" instead?

        Steve Rowe added a comment - edited

        ... why is RegexQuery treating the trailing "." as a ".*" instead?

        JavaUtilRegexCapabilities.match() is implemented as j.u.regex.Matcher.lookingAt(), which is equivalent to adding a trailing ".*", unless you explicitly append a "$" to the pattern.

        By contrast, JakartaRegexpCapabilities.match() is implemented as RE.match(), which does not imply the trailing ".*".
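The claimed equivalence can be illustrated directly against java.util.regex. This is a small sketch with a hypothetical class name; it only compares lookingAt() on "cat." with matches() on "cat." and on "cat." plus an explicit ".*", for the three terms from the unit test.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Lucene.
class LookingAtVersusMatches {

    // True if the pattern matches a prefix of the input.
    static boolean lookingAt(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // True only if the pattern matches the entire input.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String regex = "cat.";
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                + ": lookingAt=" + lookingAt(regex, term)
                + ", matches=" + matches(regex, term)
                + ", matches with trailing .*=" + matches(regex + ".*", term));
        }
    }
}
```

For these terms, lookingAt() agrees with matches() on the ".*"-suffixed pattern, and only plain matches() rejects "cathy". The equivalence is shown here for single-token inputs; since "." does not match line terminators by default, input containing newlines could behave differently.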

        The difference in the two implementations implies this is a kind of bug, especially since the javadoc "contract" on RegexCapabilities.match() just says "@return true if string matches the pattern last passed to compile".

        The fix is to switch JavaUtilRegexCapabilities.match() to use Matcher.matches() instead of lookingAt().
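A minimal sketch of what such a switch could look like follows. The Capabilities interface here is a hypothetical stand-in that mirrors only the compile/match contract quoted above; it is not the Lucene RegexCapabilities interface, and the two implementations are illustrations rather than the committed patch.

```java
import java.util.regex.Pattern;

// Hypothetical sketch; none of these types are the actual Lucene classes.
class FullMatchCapabilitiesSketch {

    // Stand-in for the compile/match contract described in the thread.
    interface Capabilities {
        void compile(String pattern);
        boolean match(String string);
    }

    // Behaviour described as the bug: a prefix match via lookingAt().
    static final class PrefixMatching implements Capabilities {
        private Pattern pattern;
        public void compile(String p) { pattern = Pattern.compile(p); }
        public boolean match(String s) { return pattern.matcher(s).lookingAt(); }
    }

    // Proposed behaviour: the whole string must match, via matches().
    static final class FullMatching implements Capabilities {
        private Pattern pattern;
        public void compile(String p) { pattern = Pattern.compile(p); }
        public boolean match(String s) { return pattern.matcher(s).matches(); }
    }

    // Convenience helper: compile the regex, then test a single term.
    static boolean accepts(Capabilities capabilities, String regex, String term) {
        capabilities.compile(regex);
        return capabilities.match(term);
    }

    public static void main(String[] args) {
        for (String term : new String[] {"cat", "cats", "cathy"}) {
            System.out.println(term
                + ": prefix=" + accepts(new PrefixMatching(), "cat.", term)
                + ", full=" + accepts(new FullMatching(), "cat.", term));
        }
    }
}
```

With the full-match variant, "cat." accepts "cats" and rejects both "cat" and "cathy", which is the outcome the reporter's unit test expects.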

        Michael McCandless added a comment -

        I agree this is a bug – I'll switch to matches shortly.

        Michael McCandless added a comment -

        Thanks Trejkaz!


          People

          • Assignee:
            Michael McCandless
            Reporter:
            Trejkaz
