Bug 4137 - regexp match gets different results on different platforms
Summary: regexp match gets different results on different platforms
Status: CLOSED FIXED
Alias: None
Product: Regexp
Classification: Unclassified
Component: Other
Version: unspecified
Hardware: All All
Importance: P3 normal
Target Milestone: ---
Assignee: Jakarta Notifications Mailing List
URL:
Keywords:
Duplicates: 4183
Depends on:
Blocks: 25985
Reported: 2001-10-12 16:49 UTC by Steven Procter
Modified: 2004-11-16 19:05 UTC (History)

Attachments
Suggested fix for the bug (4.46 KB, patch)
2003-12-22 02:37 UTC, Oleg Sukhodolsky
Suggested fix in unified format. (4.18 KB, patch)
2003-12-30 02:28 UTC, Oleg Sukhodolsky
Additional patch: OP_ANY (.) did only check for \n but should use new method isNewline (1.25 KB, patch)
2004-01-09 11:51 UTC, Hendrik Brummermann

Description Steven Procter 2001-10-12 16:49:49 UTC
The following code, which uses the regexp 1.2 package, gets different results when run on Windows NT and Linux, both running the JavaSoft JDK 1.3.  The problem seems to be with the handling of \n in multiline matches.

See the comment at the beginning of the class for some more detail.

--- BadRE.java ---

import org.apache.regexp.*;

// The following class defines a regular expression and attempts to
//   match a text string with it.  The regular expression is trying to
//   match the literal "window.location.href=" at the beginning of a
//   line, following any number of space characters.
//
// The results are different on Windows NT and Linux.  On linux
//   running sun jdk1.3, it matches.  On Windows NT Workstation running
//   sun jdk1.3, it doesn't match.
//
// If the \n is removed from the beginning of the input string then it
//   matches on windows and linux.
// If bol (beginning-of-line) is changed from "^[ \t]*" to "^[ \t\n]*" then
//   it matches on windows and linux.
// If bol is changed to "^\n[ \t]*" then it matches on windows and linux.
//

public class BadRE {
    public static RE makeRE() throws RESyntaxException {
        String bol = "^[ \t]*";
        String regexp = bol + "window.location.href=";
        RE matchRE = new RE(regexp, RE.MATCH_MULTILINE | RE.MATCH_CASEINDEPENDENT);
        return matchRE;
    }

    public static void test() throws RESyntaxException {
        String input = "\nwindow.location.href=";
        RE re = makeRE();
        if (re.match(input)) {
            System.out.println("match: " + re.getParen(0));
        }
        else {
            System.out.println("no match");
        }
    }

    public static void main(String [] args) throws RESyntaxException {
        test();
    }
}
Comment 1 Jon Stevens 2002-12-13 18:43:36 UTC
*** Bug 4183 has been marked as a duplicate of this bug. ***
Comment 2 Oleg Sukhodolsky 2003-10-10 09:15:22 UTC
I think the problem is that on Linux/Unix line.separator is "\n",
but on Windows it is "\r\n".  Thus on Linux the string being matched is treated as two-line text (and the last line matches the regexp), but on Windows it is treated as
one-line text, and that line doesn't match the regexp.

To correct the test we should use
String input = System.getProperty("line.separator") + "window.location.href=";

So, I would say that the test is incorrect.
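
The hypothesis above can be illustrated with a small sketch. This is not the org.apache.regexp source; the class and method names are made up, and the line-start rule is only an assumption about how a matcher tied to line.separator would behave. It shows why offset 1 of "\nwindow.location.href=" counts as a line start when the separator is "\n" but not when it is "\r\n".

```java
// Illustrative sketch only: NOT the org.apache.regexp implementation.
// LineStartSketch and isLineStart are hypothetical names.
class LineStartSketch {

    // Assumed rule: a position is a line start if it is offset 0 or is
    // immediately preceded by the complete separator string.
    static boolean isLineStart(String text, int pos, String separator) {
        if (pos == 0) {
            return true;
        }
        return pos >= separator.length()
            && text.startsWith(separator, pos - separator.length());
    }

    public static void main(String[] args) {
        String input = "\nwindow.location.href=";
        // Separator as on Linux/Unix: offset 1 starts a new line.
        System.out.println(isLineStart(input, 1, "\n"));
        // Separator as on Windows: a lone LF is not a line break.
        System.out.println(isLineStart(input, 1, "\r\n"));
    }
}
```

Under this assumed rule the same input is two lines with one separator and one line with the other, which is consistent with the platform difference reported.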
Comment 3 Oleg Sukhodolsky 2003-12-22 02:37:29 UTC
Created attachment 9656
Suggested fix for the bug
Comment 4 Vadim Gritsenko 2003-12-22 02:42:31 UTC
Nice patch! I'll test it and apply as soon as I have a bit of time... (ps: your
patches look different... do you/can you use "diff -u"?)
Comment 5 Steven Procter 2003-12-22 07:15:12 UTC
> To correct test we should use
> String input = System.getProperty("line.separator") + "window.location.href=";

The input comes from a remote web server, so the newline sequence used in the
data is not related to the platform that the code is running on.
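
To illustrate that point, here is a minimal sketch using the JDK's own java.util.regex (available from JDK 1.4), not the regexp 1.2 package this bug is about. The class name is made up. It applies the reported pattern to input with several newline sequences and never consults line.separator, so the result does not depend on the platform the code runs on.

```java
import java.util.regex.Pattern;

// Sketch using java.util.regex, NOT org.apache.regexp.
// RemoteInputSketch is a hypothetical name.
class RemoteInputSketch {

    // Same pattern and flags as in the report, expressed for java.util.regex.
    static final Pattern HREF = Pattern.compile(
        "^[ \t]*window.location.href=",
        Pattern.MULTILINE | Pattern.CASE_INSENSITIVE);

    // True if the pattern is found at the start of any line of the input.
    static boolean matches(String input) {
        return HREF.matcher(input).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "\nwindow.location.href=",    // LF, as in the report
            "\r\nwindow.location.href=",  // CRLF
            "\rwindow.location.href=",    // lone CR
        };
        for (String input : inputs) {
            System.out.println(matches(input));
        }
    }
}
```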
Comment 6 Oleg Sukhodolsky 2003-12-30 02:28:46 UTC
Created attachment 9746
Suggested fix in unified format.
Comment 7 Hendrik Brummermann 2004-01-09 11:51:08 UTC
Created attachment 9872
Additional patch: OP_ANY (.) did only check for \n but should use new method isNewline
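
For readers without the attachments, a rough sketch of what a platform-independent newline check might look like. Only the method name isNewline comes from the patch title; the body, the set of characters (taken from the test cases in comment 8), the class name, and the dotMatches helper are assumptions, not the contents of the actual patch.

```java
// Hypothetical sketch: the real patch may differ.
class NewlineSketch {

    // Assumed set of line-break characters: LF, CR, NEL,
    // LINE SEPARATOR and PARAGRAPH SEPARATOR.
    static boolean isNewline(char c) {
        return c == '\n'
            || c == '\r'
            || c == '\u0085'
            || c == '\u2028'
            || c == '\u2029';
    }

    // How an "any character" operator could use the check: '.' accepts
    // every character that is not a newline.
    static boolean dotMatches(char c) {
        return !isNewline(c);
    }

    public static void main(String[] args) {
        System.out.println(dotMatches('a'));
        System.out.println(dotMatches('\n'));
        System.out.println(dotMatches('\r'));
    }
}
```

The point of such a helper is that the decision depends only on the character in the data, never on the line.separator property of the running JVM.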
Comment 8 Vadim Gritsenko 2004-01-30 14:28:38 UTC
Oleg,

I've added the following test case:

        r = new RE("^a.*b$", RE.MATCH_MULTILINE);
        if (!r.match("a\nb")) {
            fail("\"a\\nb\" doesn't match");
        }
        if (!r.match("a\rb")) {
            fail("\"a\\rb\" doesn't match");
        }
        if (!r.match("a\r\nb")) {
            fail("\"a\\r\\nb\" doesn't match");
        }
        if (!r.match("a\u0085b")) {
            fail("\"a\\u0085b\" doesn't match");
        }
        if (!r.match("a\u2028b")) {
            fail("\"a\\u2028b\" doesn't match");
        }
        if (!r.match("a\u2029b")) {
            fail("\"a\\u2029b\" doesn't match");
        }

And two of them fail:
[java] "a\nb" doesn't match
[java] "a\r\nb" doesn't match

Do you have any suggestion as to what's wrong here?


Hendrik,

With your patch and test above, several tests fail.


Vadim
Comment 9 Vadim Gritsenko 2004-01-30 14:43:57 UTC
Oops, I got it wrong. '.' should not match a newline in MULTILINE mode. The
correct test is:
        // Test MATCH_MULTILINE. Test that '.' does not match a newline.
        r = new RE("^a.*b$", RE.MATCH_MULTILINE);
        if (r.match("a\nb")) {
            fail("\"a\\nb\" matches \"^a.*b$\"");
        }
        if (r.match("a\rb")) {
            fail("\"a\\rb\" matches \"^a.*b$\"");
        }
        if (r.match("a\r\nb")) {
            fail("\"a\\r\\nb\" matches \"^a.*b$\"");
        }
        if (r.match("a\u0085b")) {
            fail("\"a\\u0085b\" matches \"^a.*b$\"");
        }
        if (r.match("a\u2028b")) {
            fail("\"a\\u2028b\" matches \"^a.*b$\"");
        }
        if (r.match("a\u2029b")) {
            fail("\"a\\u2029b\" matches \"^a.*b$\"");
        }

And Hendrik's patch works OK.

Vadim
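
As a cross-check of the expected behaviour, the same six inputs can be run through java.util.regex, which documents that '.' does not match line terminators unless DOTALL is set. This is a sketch against the JDK's regex engine, not against org.apache.regexp, and the class name is made up.

```java
import java.util.regex.Pattern;

// Sketch using java.util.regex, NOT org.apache.regexp.
// DotNewlineSketch is a hypothetical name.
class DotNewlineSketch {

    static final Pattern P = Pattern.compile("^a.*b$", Pattern.MULTILINE);

    // True if the pattern is found anywhere in the input.
    static boolean found(String s) {
        return P.matcher(s).find();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a\nb", "a\rb", "a\r\nb", "a\u0085b", "a\u2028b", "a\u2029b",
        };
        for (String s : inputs) {
            // None of these should match: 'a' and 'b' are on different lines.
            if (found(s)) {
                throw new AssertionError("unexpected match");
            }
        }
        // Control case: 'a' and 'b' on the same line do match.
        if (!found("a b")) {
            throw new AssertionError("expected a match");
        }
        System.out.println("ok");
    }
}
```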
Comment 10 Vadim Gritsenko 2004-01-30 14:47:45 UTC
Patches applied, thanks to everybody.

Vadim