Commons Validator
  1. Commons Validator
  2. VALIDATOR-248

Add an option to allow 'localhost' as a valid hostname part in the URL

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.3.1 Release
    • Fix Version/s: 1.4.0 Release
    • Component/s: Routines
    • Labels:
      None

      Description

      Working on Grails we've discovered (http://jira.codehaus.org/browse/GRAILS-1692) that commons-validator's UrlValidator rejects URLs like "http://localhost:8080/tau_gwi_00/clif/cb/19". I looked at commons-validator sources and found that any URL which contains 'localhost' as it's hostname part will be rejected.

      RFC-2396 (http://www.ietf.org/rfc/rfc2396.txt) accepts 'localhost' as a valid hostname (appendix G.3 paragraph 2 says that explicitly).

      So, it would be good to have additional option (UrlValidator.ALLOW_LOCALHOST) which will control UrlValidator behavior on localhost URLs.

        Activity

        Hide
        Sergey Nebolsin added a comment -

        This patch against rev. 592416 adds described UrlValidator.ALLOW_LOCALHOST option.

        Show
        Sergey Nebolsin added a comment - This patch against rev. 592416 adds described UrlValidator.ALLOW_LOCALHOST option.
        Hide
        Ben Speakmon added a comment -

        You're quite right, Sergey. I'll look at this.

        Show
        Ben Speakmon added a comment - You're quite right, Sergey. I'll look at this.
        Hide
        Niall Pemberton added a comment -

        I guess the question is does it need to be an option - or should it always be valid? If the latter then we could just incorporate it into the DomainValidator's validation. Having said that the way Sergey's done his patch we could do both:

        public boolean isValid(String domain) {
        if (isLocalhost(domain))

        { return true; }

        String[] groups = domainRegex.match(domain);
        if (groups != null && groups.length > 0)

        { return isValidTld(groups[0]); }

        else

        { return false; }

        }

        Show
        Niall Pemberton added a comment - I guess the question is does it need to be an option - or should it always be valid? If the latter then we could just incorporate it into the DomainValidator's validation. Having said that the way Sergey's done his patch we could do both: public boolean isValid(String domain) { if (isLocalhost(domain)) { return true; } String[] groups = domainRegex.match(domain); if (groups != null && groups.length > 0) { return isValidTld(groups[0]); } else { return false; } }
        Hide
        Ben Speakmon added a comment -

        The only reason I can think of for making it an option is if you want to make sure something didn't come from localhost. So, if anything, the option should be to turn off localhost matching and to accept it as the default.

        Same patch as before, except:

        if (!DISALLOW_LOCALHOST && isLocalhost(domain))

        { return true; }

        Show
        Ben Speakmon added a comment - The only reason I can think of for making it an option is if you want to make sure something didn't come from localhost. So, if anything, the option should be to turn off localhost matching and to accept it as the default. Same patch as before, except: if (!DISALLOW_LOCALHOST && isLocalhost(domain)) { return true; }
        Hide
        Sergey Nebolsin added a comment -

        Wow, guys, you're really responsive. Thinking further there might be not only 'localhost' but any non-IANA TLD (I'd refer to appendix G.3 paragraph 2 of RFC-2396) again. For example, I'm adding something into my local /etc/hosts. In this general case the option could be ALLOW_NON_IANA_TLDS and it will cover 'localhost' case.

        public boolean isValid(String domain) {
        String[] groups = domainRegex.match(domain);
        if (groups != null && groups.length > 0)

        { return ALLOW_NON_IANA_TLDS || isValidTld(groups[0]); }

        else

        { return false; }

        }

        Show
        Sergey Nebolsin added a comment - Wow, guys, you're really responsive. Thinking further there might be not only 'localhost' but any non-IANA TLD (I'd refer to appendix G.3 paragraph 2 of RFC-2396) again. For example, I'm adding something into my local /etc/hosts. In this general case the option could be ALLOW_NON_IANA_TLDS and it will cover 'localhost' case. public boolean isValid(String domain) { String[] groups = domainRegex.match(domain); if (groups != null && groups.length > 0) { return ALLOW_NON_IANA_TLDS || isValidTld(groups[0]); } else { return false; } }
        Hide
        Paul Benedict added a comment -

        I vote it to be an option, disabled by default, since not recognizing 'localhost' is the present behavior.

        PS: What about expanding the feature to allow/disallow certain domains/TLD?

        Show
        Paul Benedict added a comment - I vote it to be an option, disabled by default, since not recognizing 'localhost' is the present behavior. PS: What about expanding the feature to allow/disallow certain domains/TLD?
        Hide
        Sergey Nebolsin added a comment -

        This new patch against rev. 592416 adds the ability to allow top-level domains which are not defined by IANA (this behavior is controlled by UrlValidator.ALLOW_NON_IANA_TLDS option, which is turned off by default). It recognizes hostnames like 'localhost', 'localhost.localdomain', 'aaa', etc.

        Show
        Sergey Nebolsin added a comment - This new patch against rev. 592416 adds the ability to allow top-level domains which are not defined by IANA (this behavior is controlled by UrlValidator.ALLOW_NON_IANA_TLDS option, which is turned off by default). It recognizes hostnames like 'localhost', 'localhost.localdomain', 'aaa', etc.
        Hide
        Ben Speakmon added a comment -

        Assuming that we're talking about the new UrlValidator/DomainValidator in the routines package, since they're new classes we're not constrained by the behavior of the old implementations in the validator package. The plan is to deprecate the ones in validator, so I doubt we'll be adding new features to them or changing their behavior.

        To be clear, I'm opposed to changing DomainValidator to allow non-IANA TLDs because that obviates the point of domain validation, even if you have a list of unofficial TLDs you want to approve. So the discussion is then about what changes we make to UrlValidator. I think there are three things to consider:

        1) Not supporting localhost is a bug per RFC2396; we should enable it by default and provide an option to turn it off. That's this JIRA issue.
        2) Supporting random machine names, e.g., http://mymachine/testapp/test.jsp. I imagine this would be an option that allows you to specify what machine names are acceptable; I don't want to just allow anything in there, since what's the point of validating if you do? This should be opened as a separate issue (as an enhancement).
        3) Things like file URLs are different since the file:// scheme doesn't have an authority part at all, so how UrlValidator handles authorities would have to be dependent on the scheme. (Maybe we have a FileUrlValidator or something.) This should be opened as a separate issue (as an enhancement).

        Thoughts?

        Show
        Ben Speakmon added a comment - Assuming that we're talking about the new UrlValidator/DomainValidator in the routines package, since they're new classes we're not constrained by the behavior of the old implementations in the validator package. The plan is to deprecate the ones in validator, so I doubt we'll be adding new features to them or changing their behavior. To be clear, I'm opposed to changing DomainValidator to allow non-IANA TLDs because that obviates the point of domain validation, even if you have a list of unofficial TLDs you want to approve. So the discussion is then about what changes we make to UrlValidator. I think there are three things to consider: 1) Not supporting localhost is a bug per RFC2396; we should enable it by default and provide an option to turn it off. That's this JIRA issue. 2) Supporting random machine names, e.g., http://mymachine/testapp/test.jsp . I imagine this would be an option that allows you to specify what machine names are acceptable; I don't want to just allow anything in there, since what's the point of validating if you do? This should be opened as a separate issue (as an enhancement). 3) Things like file URLs are different since the file:// scheme doesn't have an authority part at all, so how UrlValidator handles authorities would have to be dependent on the scheme. (Maybe we have a FileUrlValidator or something.) This should be opened as a separate issue (as an enhancement). Thoughts?
        Hide
        Sergey Nebolsin added a comment -

        From RFC2396, section 3.2.2: "Hostnames take the form described in Section 3 of [RFC1034] and Section 2.1 of [RFC1123]: a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters. The rightmost domain label of a fully qualified domain name will never start with a digit, thus syntactically distinguishing domain names from IPv4 addresses, and may be followed by a single "." if it is necessary to distinguish between the complete domain name and any local domain. To actually be "Uniform" as a resource locator, a URL hostname should be a fully qualified domain name. In practice, however, the host component may be a local domain literal."

        So, if I understand correctly, not supporting names like 'my-machine.my-local-domain', 'my-machine' is againts RFC too, and I cannot see a difference between 'localhost' and 'my-computer' in terms of RFC. However, I think that developer should have a choice. When we're developing internet application, we don't want such local domains in most cases, since they make no sense in internet. But if we're developing some intranet (or even local) application which leverages the power of URLs, we want to allow local domains for sure. So this behavior should be configurable.

        What about domain validation I can imagine two main modes: "strict" - hostname is checked against IANA list, and "soft" which just checks hostname against general RFC rules.

        I like configurable things, so it's great to have ALLOW_NON_IANA_TLDS (there might be a better name) option as well as a possibility to provide a concrete list of "additional" allowed domain. And at the very least I would like to have a possibility to substitute DomainValidator in UrlValidator with my own implementation if I need some special behavior.

        And I agree that we should open new issues for these enhancements when we'll decide about the scope of these new issues.

        Show
        Sergey Nebolsin added a comment - From RFC2396, section 3.2.2: "Hostnames take the form described in Section 3 of [RFC1034] and Section 2.1 of [RFC1123] : a sequence of domain labels separated by ".", each domain label starting and ending with an alphanumeric character and possibly also containing "-" characters. The rightmost domain label of a fully qualified domain name will never start with a digit, thus syntactically distinguishing domain names from IPv4 addresses, and may be followed by a single "." if it is necessary to distinguish between the complete domain name and any local domain. To actually be "Uniform" as a resource locator, a URL hostname should be a fully qualified domain name. In practice, however, the host component may be a local domain literal." So, if I understand correctly, not supporting names like 'my-machine.my-local-domain', 'my-machine' is againts RFC too, and I cannot see a difference between 'localhost' and 'my-computer' in terms of RFC. However, I think that developer should have a choice. When we're developing internet application, we don't want such local domains in most cases, since they make no sense in internet. But if we're developing some intranet (or even local) application which leverages the power of URLs, we want to allow local domains for sure. So this behavior should be configurable. What about domain validation I can imagine two main modes: "strict" - hostname is checked against IANA list, and "soft" which just checks hostname against general RFC rules. I like configurable things, so it's great to have ALLOW_NON_IANA_TLDS (there might be a better name) option as well as a possibility to provide a concrete list of "additional" allowed domain. And at the very least I would like to have a possibility to substitute DomainValidator in UrlValidator with my own implementation if I need some special behavior. And I agree that we should open new issues for these enhancements when we'll decide about the scope of these new issues.
        Hide
        Paul Benedict added a comment -

        What complicates the validation, as Ben pointed out, is the scheme. If this validation was purely HTTP/FTP, then it would be easier to implement my above suggestion Thus for this ticket, supporting Ben's #1 is enough and I do not oppose it. That was the main point of this thread anyhow.

        Show
        Paul Benedict added a comment - What complicates the validation, as Ben pointed out, is the scheme. If this validation was purely HTTP/FTP, then it would be easier to implement my above suggestion Thus for this ticket, supporting Ben's #1 is enough and I do not oppose it. That was the main point of this thread anyhow.
        Hide
        Ben Speakmon added a comment -

        DomainValidator does support labels with dashes in them: "my-machine", "this-domain.org", etc., all work. Those are valid outside of URLs, for example in email addresses.

        The question is how (or if) a URL with unqualified machine names or invalid TLDs should be validated. I can see cases where, as you say, you'd want to allow machine names in URLs, but I also think that shouldn't be allowed by default since you would expect machine name URLs to not validate in security-sensitive contexts such as web form validation.

        So right now it works like this in UrlValidator (omitting nonrelevant parts):

        if authority doesn't validate in DomainValidator {
        if authority doesn't validate in InetAddressValidator

        { false }

        }

        I'm proposing changing it to this:

        if ALLOW_MACHINE_NAMES && authority is in nameslist

        { true // "localhost", "my-machine", etc., specified by user will validate here, "blah" will not }

        else

        { // as above }

        This makes UrlValidator smart enough to handle machine name cases and also lets you do stuff like http://my-machine.rack.colo/app/test, so you can even use illegal TLDs if you really want.

        DomainValidator, on the other hand, should not be changed. It has a very narrowly defined scope: validating IANA-approved TLDs and domain names that use them. Since UrlValidator requires functionality above and beyond that, it makes sense to put that logic in UrlValidator. Since domain names (as opposed to hostnames/authorities in URLs and RFC 2396) are either valid or not, it doesn't make sense to allow DomainValidator to loosen its rules.

        Show
        Ben Speakmon added a comment - DomainValidator does support labels with dashes in them: "my-machine", "this-domain.org", etc., all work. Those are valid outside of URLs, for example in email addresses. The question is how (or if) a URL with unqualified machine names or invalid TLDs should be validated. I can see cases where, as you say, you'd want to allow machine names in URLs, but I also think that shouldn't be allowed by default since you would expect machine name URLs to not validate in security-sensitive contexts such as web form validation. So right now it works like this in UrlValidator (omitting nonrelevant parts): if authority doesn't validate in DomainValidator { if authority doesn't validate in InetAddressValidator { false } } I'm proposing changing it to this: if ALLOW_MACHINE_NAMES && authority is in nameslist { true // "localhost", "my-machine", etc., specified by user will validate here, "blah" will not } else { // as above } This makes UrlValidator smart enough to handle machine name cases and also lets you do stuff like http://my-machine.rack.colo/app/test , so you can even use illegal TLDs if you really want. DomainValidator, on the other hand, should not be changed. It has a very narrowly defined scope: validating IANA-approved TLDs and domain names that use them. Since UrlValidator requires functionality above and beyond that, it makes sense to put that logic in UrlValidator. Since domain names (as opposed to hostnames/authorities in URLs and RFC 2396) are either valid or not, it doesn't make sense to allow DomainValidator to loosen its rules.
        Hide
        Sergey Nebolsin added a comment -

        Ok, I do agree. ALLOW_MACHINE_NAMES with a list of allowed machine names is a good start anyway, and it'll cover most possible cases. Can I also suggest what machine names in this list to be treated as regexps, so I will be able to specify things like "all machines from my local domain" or "all machines with names like xyz-*". I think in this case it will be flexible enough to cover almost all cases without further complication of UrlValidator.

        If such solution is desirable I'll create a separate issue and link this ticket to it, since ALLOW_MACHINE_NAMES feature will perfectly cover 'localhost' case, and there will be no need for ALLOW_LOCALHOST option.

        Show
        Sergey Nebolsin added a comment - Ok, I do agree. ALLOW_MACHINE_NAMES with a list of allowed machine names is a good start anyway, and it'll cover most possible cases. Can I also suggest what machine names in this list to be treated as regexps, so I will be able to specify things like "all machines from my local domain" or "all machines with names like xyz-*". I think in this case it will be flexible enough to cover almost all cases without further complication of UrlValidator. If such solution is desirable I'll create a separate issue and link this ticket to it, since ALLOW_MACHINE_NAMES feature will perfectly cover 'localhost' case, and there will be no need for ALLOW_LOCALHOST option.
        Hide
        Ben Speakmon added a comment -

        So, revised plan:

        • add an option MANUAL_SCHEME_VALIDATION
        • provide a constructor that accepts an array of RegexValidators
        • change isValidScheme() to iterate the RegexValidators if said option is active, falling back to domain name validation if it passes.

        As Sergey says, this removes the need for localhost special-casing.

        Show
        Ben Speakmon added a comment - So, revised plan: add an option MANUAL_SCHEME_VALIDATION provide a constructor that accepts an array of RegexValidators change isValidScheme() to iterate the RegexValidators if said option is active, falling back to domain name validation if it passes. As Sergey says, this removes the need for localhost special-casing.
        Hide
        Sergey Nebolsin added a comment -

        That sounds like a very good plan. Should I create an issue for tracking it?

        Show
        Sergey Nebolsin added a comment - That sounds like a very good plan. Should I create an issue for tracking it?
        Hide
        Ben Speakmon added a comment -

        nah – the conversation happened here, so the fix should go here.

        Show
        Ben Speakmon added a comment - nah – the conversation happened here, so the fix should go here.
        Hide
        Sergey Nebolsin added a comment -

        Ok. One final question - we are talking about MANUAL_DOMAIN_VALIDATION and method isValidAuthority(), not about scheme validation (as in your comment), right?

        Show
        Sergey Nebolsin added a comment - Ok. One final question - we are talking about MANUAL_DOMAIN_VALIDATION and method isValidAuthority(), not about scheme validation (as in your comment), right?
        Hide
        Ben Speakmon added a comment -

        $ svn ci -m "VALIDATOR-248: add MANUAL_AUTHORITY_VALIDATION feature and tests"
        Sending src/main/java/org/apache/commons/validator/routines/UrlValidator.java
        Sending src/test/java/org/apache/commons/validator/routines/UrlValidatorTest.java
        Transmitting file data ..
        Committed revision 594224.

        Show
        Ben Speakmon added a comment - $ svn ci -m " VALIDATOR-248 : add MANUAL_AUTHORITY_VALIDATION feature and tests" Sending src/main/java/org/apache/commons/validator/routines/UrlValidator.java Sending src/test/java/org/apache/commons/validator/routines/UrlValidatorTest.java Transmitting file data .. Committed revision 594224.
        Hide
        Niall Pemberton added a comment -

        I've just been looking at whats implemented for this and I don't really understand what is the point of the MANUAL_AUTHORITY_VALIDATION option. Seems unecessary to have to pass both an array of RegexValidators and set the MANUAL_AUTHORITY_VALIDATION. I would suggest we remove the MANUAL_AUTHORITY_VALIDATION option and assume that if the array of RegexValidators is not null then effectively the option has been set.

        Also I think we should change it from being an array of RegexValidators to a single RegexValidator - you can sepcify an array of regex patterns in RegexValidator.

        Show
        Niall Pemberton added a comment - I've just been looking at whats implemented for this and I don't really understand what is the point of the MANUAL_AUTHORITY_VALIDATION option. Seems unecessary to have to pass both an array of RegexValidators and set the MANUAL_AUTHORITY_VALIDATION. I would suggest we remove the MANUAL_AUTHORITY_VALIDATION option and assume that if the array of RegexValidators is not null then effectively the option has been set. Also I think we should change it from being an array of RegexValidators to a single RegexValidator - you can sepcify an array of regex patterns in RegexValidator.
        Hide
        Niall Pemberton added a comment -

        Attaching a patch with my proposed changes

        Show
        Niall Pemberton added a comment - Attaching a patch with my proposed changes
        Hide
        Ben Speakmon added a comment -

        I forgot about RegexValidator being able to contain multiple patterns, so that's a no-brainer. Your approach is simpler, so I'm +1 on that too.

        Show
        Ben Speakmon added a comment - I forgot about RegexValidator being able to contain multiple patterns, so that's a no-brainer. Your approach is simpler, so I'm +1 on that too.
        Hide
        Sergey Nebolsin added a comment -

        I also wasn't aware about RegexValidator with multiple patterns. So +1 to whole Niall's suggestion from me.

        Show
        Sergey Nebolsin added a comment - I also wasn't aware about RegexValidator with multiple patterns. So +1 to whole Niall's suggestion from me.
        Hide
        Niall Pemberton added a comment -

        Thanks for the prompt feedback - I've applied those changes - so I'll close this again

        http://svn.apache.org/viewvc?view=rev&revision=595020

        Show
        Niall Pemberton added a comment - Thanks for the prompt feedback - I've applied those changes - so I'll close this again http://svn.apache.org/viewvc?view=rev&revision=595020

          People

          • Assignee:
            Ben Speakmon
            Reporter:
            Sergey Nebolsin
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development