Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Invalid
    • Affects Version/s: 1.3
    • Fix Version/s: 1.5
    • Component/s: fetcher
    • Labels:
    • Environment:

      Any

    • Patch Info:
      Patch Available

      Description

      Basic URL normalizer lacks 2 important features

      Encode space in URL into %20 to unbreak httpclient and possibly others who do not expect space inside URL

      Ability to decode %33 encoding in URL. This is important for avoiding duplicates

        Activity

        Hide
        Radim Kolar added a comment -

        Patch against branch-1.4

        Show
        Radim Kolar added a comment - Patch against branch-1.4
        Hide
        Radim Kolar added a comment -

        Updated patch. It also normalizes unprintable % sequences to upper case. Like %1f -> %1F . upper case seems to be way more common.

        Show
        Radim Kolar added a comment - Updated patch. It also normalizes unprintable % sequences to upper case. Like %1f -> %1F . upper case seems to be way more common.
        Hide
        Markus Jelsma added a comment -

        What's the real life use-case for decoding? I've not yet seen servers returning URL's where that range for some reason is URL encoded.

        Show
        Markus Jelsma added a comment - What's the real life use-case for decoding? I've not yet seen servers returning URL's where that range for some reason is URL encoded.
        Hide
        Radim Kolar added a comment -

        1. Some servers sends spaces in URLs
        2. Based on their web framework, if they use URL rewriting some characters are % encoded by it. For example ~ is very often encoded

        Show
        Radim Kolar added a comment - 1. Some servers sends spaces in URLs 2. Based on their web framework, if they use URL rewriting some characters are % encoded by it. For example ~ is very often encoded
        Hide
        Radim Kolar added a comment -

        Actually it might be even better to add capability to encode high bit chars in normalizer using UTF-8

        Show
        Radim Kolar added a comment - Actually it might be even better to add capability to encode high bit chars in normalizer using UTF-8
        Hide
        Lewis John McGibbney added a comment -

        Hi Radim are you happy with this patch as is? Do you intend on adding in your latter comments?

        Show
        Lewis John McGibbney added a comment - Hi Radim are you happy with this patch as is? Do you intend on adding in your latter comments?
        Hide
        Sebastian Nagel added a comment -

        Spaces in URLs are quite frequent and most browsers do the escaping inherentyl.
        So +1 for escaping spaces per default.

        A little problem: The patch replaces spaces by %20 in all positions but in the query part (after ?) a space is usually encoded as '+'.

        Show
        Sebastian Nagel added a comment - Spaces in URLs are quite frequent and most browsers do the escaping inherentyl. So +1 for escaping spaces per default. A little problem: The patch replaces spaces by %20 in all positions but in the query part (after ?) a space is usually encoded as '+'.
        Hide
        Radim Kolar added a comment -

        Patch is good. i will add replace high bit chars with %82 in next one

        Show
        Radim Kolar added a comment - Patch is good. i will add replace high bit chars with %82 in next one
        Hide
        Lewis John McGibbney added a comment -

        Does anyone have comments on this one? Thanks for including the test with this patch Radim. I will test and comment accordingly (hopefully) tonight. Have you considered Sebastians comments?

        Show
        Lewis John McGibbney added a comment - Does anyone have comments on this one? Thanks for including the test with this patch Radim. I will test and comment accordingly (hopefully) tonight. Have you considered Sebastians comments?
        Hide
        Radim Kolar added a comment -

        I did, but due to lack of time to test what current browsers do with spaces in query, i will leave it as it is.

        Show
        Radim Kolar added a comment - I did, but due to lack of time to test what current browsers do with spaces in query, i will leave it as it is.
        Hide
        Radim Kolar added a comment -

        Browsers seems to send spaces in URL encoded like this:
        GET /index.jsp?q=wha%20just%20happened HTTP/1.0

        Show
        Radim Kolar added a comment - Browsers seems to send spaces in URL encoded like this: GET /index.jsp?q=wha%20just%20happened HTTP/1.0
        Hide
        Sebastian Nagel added a comment -

        It depends on the used browser and where the query is entered (in the URL line or in a form field, cf. http://en.wikipedia.org/wiki/URL_encoding#The_application.2Fx-www-form-urlencoded_type).
        From the user perspective it seems to be irrelevant how a query containing spaces is encoded. For both
        http://www.google.com/search?q=nutch+crawler
        http://www.google.com/search?q=nutch%20crawler
        you get the same result page. Servlets are not sensitive to whether `+' or `%20' represents a space (at least, if standard decoding routines are used).

        But despite the question what browsers are doing a URL normalizer should replace the space because:
        1. fetching URLs containing spaces may fail (it will with httpclient)
        2. avoid fetching the same document twice by two variant URLs

        Since the first and more severe problem is fixed I opt for applying the patch "as is". But we should keep point 2 on the radar for a later improvement.
        I think BasicURLNormalizer is the right place for a solution because a position-dependent replacement by `%20' or `+' before resp. after `?' can hardly be expressed by a single regular expression and hence cannot be done by adding a rule to RegexURLNormalizer.

        Show
        Sebastian Nagel added a comment - It depends on the used browser and where the query is entered (in the URL line or in a form field, cf. http://en.wikipedia.org/wiki/URL_encoding#The_application.2Fx-www-form-urlencoded_type ). From the user perspective it seems to be irrelevant how a query containing spaces is encoded. For both http://www.google.com/search?q=nutch+crawler http://www.google.com/search?q=nutch%20crawler you get the same result page. Servlets are not sensitive to whether `+' or `%20' represents a space (at least, if standard decoding routines are used). But despite the question what browsers are doing a URL normalizer should replace the space because: 1. fetching URLs containing spaces may fail (it will with httpclient) 2. avoid fetching the same document twice by two variant URLs Since the first and more severe problem is fixed I opt for applying the patch "as is". But we should keep point 2 on the radar for a later improvement. I think BasicURLNormalizer is the right place for a solution because a position-dependent replacement by `%20' or `+' before resp. after `?' can hardly be expressed by a single regular expression and hence cannot be done by adding a rule to RegexURLNormalizer.
        Hide
        Lewis John McGibbney added a comment -

        I think this makes sense, but yes number two above is important as the problem scales in parallel with respect to crawl job size => big problems w.r.t index maintenance issues.

        Radim, can you please update your patch as it does not apply cleanly to most recent trunk. Once this has been done we can make a decision. Thank you

        Show
        Lewis John McGibbney added a comment - I think this makes sense, but yes number two above is important as the problem scales in parallel with respect to crawl job size => big problems w.r.t index maintenance issues. Radim, can you please update your patch as it does not apply cleanly to most recent trunk. Once this has been done we can make a decision. Thank you
        Hide
        Lewis John McGibbney added a comment -

        Hi Markus, do you have any comments on Radim's latest patch?

        Show
        Lewis John McGibbney added a comment - Hi Markus, do you have any comments on Radim's latest patch?
        Hide
        Markus Jelsma added a comment -

        I am happy with the encoding of space to %20 but i am still not sure if decoding is without problems, especially decoding %2F to / (slash) as both have a different meaning as slash denotes the path. I went through a part of our CrawlDB searching for %2F occurences and got a list. I tried several URL's alternating between %2F and /. There are at least a few URL's that return different content.

        This is correct:
        http://www.detelefoongids.nl/bg-l/18236834-B%2Fmak+Bedrijfsmakelaars/vermelding/

        But this is incorrect:
        http://www.detelefoongids.nl/bg-l/18236834-B/mak+Bedrijfsmakelaars/vermelding/

        This also applies for decoding %23 to # (hash). By decoding this character valid URL's are seen having a URL fragment which is then normalized away. See these examples:

        http://www.noordhollandsdagblad.nl/nieuws/stadstreek/denhelder/article9516101.ece/%23Nieuwedweep%3A-Ambassadeur-van-Den-Helder-zijn-we-allemaal
        http://www.motorstek.nl/gebruiker/fido%20%2377

        In the above examples the encoding of the hash is intented and correct. The content contains a hash but the URL does not contain a fragment but this would: http://www.motorstek.nl/gebruiker/fido%20%2377#part-of-page

        Another example of bad decoding is : http://st-annaland.citysite.nl/nieuws/6276724/2816_Rosenthal+ontvangt+oppositieleider+Syri%26%23235%3B.html.

        I propose not to decode characters that have special meaning in URI's and be very careful in tests. As this is a very crucial issue and the implementation is not sound yet i propose to push this to 1.5 as i'm not very comfortable by introducing a potential big issue just before 1.4 is to be released.

        Show
        Markus Jelsma added a comment - I am happy with the encoding of space to %20 but i am still not sure if decoding is without problems, especially decoding %2F to / (slash) as both have a different meaning as slash denotes the path. I went through a part of our CrawlDB searching for %2F occurences and got a list. I tried several URL's alternating between %2F and /. There are at least a few URL's that return different content. This is correct: http://www.detelefoongids.nl/bg-l/18236834-B%2Fmak+Bedrijfsmakelaars/vermelding/ But this is incorrect: http://www.detelefoongids.nl/bg-l/18236834-B/mak+Bedrijfsmakelaars/vermelding/ This also applies for decoding %23 to # (hash). By decoding this character valid URL's are seen having a URL fragment which is then normalized away. See these examples: http://www.noordhollandsdagblad.nl/nieuws/stadstreek/denhelder/article9516101.ece/%23Nieuwedweep%3A-Ambassadeur-van-Den-Helder-zijn-we-allemaal http://www.motorstek.nl/gebruiker/fido%20%2377 In the above examples the encoding of the hash is intented and correct. The content contains a hash but the URL does not contain a fragment but this would: http://www.motorstek.nl/gebruiker/fido%20%2377#part-of-page Another example of bad decoding is : http://st-annaland.citysite.nl/nieuws/6276724/2816_Rosenthal+ontvangt+oppositieleider+Syri%26%23235%3B.html . I propose not to decode characters that have special meaning in URI's and be very careful in tests. As this is a very crucial issue and the implementation is not sound yet i propose to push this to 1.5 as i'm not very comfortable by introducing a potential big issue just before 1.4 is to be released.
        Hide
        Chris A. Mattmann added a comment -
        • push
        Show
        Chris A. Mattmann added a comment - push
        Hide
        Radim Kolar added a comment -

        Do not decode # and / characters during %XX decoding. Unit tests added.

        Show
        Radim Kolar added a comment - Do not decode # and / characters during %XX decoding. Unit tests added.
        Hide
        Ferdy Galema added a comment - - edited

        +1 for encoding spaces into %20, both in file and query. (Although encoding spaces in the query part into 'plus' is also fine with me.)

        +1 for uppercasing escaped chars. (%1f --> %1F)

        I propose to split this issue into separate issues, as this makes discussing the individual features a lot easier. (Also making it easier to commit changes that have smaller overall impact such as the %20 feature).

        Also, the latest patch still has some flaws. (Wrong path as files are prefixed with 'a/' and some indent issues here and there.)

        Show
        Ferdy Galema added a comment - - edited +1 for encoding spaces into %20, both in file and query. (Although encoding spaces in the query part into 'plus' is also fine with me.) +1 for uppercasing escaped chars. (%1f --> %1F) I propose to split this issue into separate issues, as this makes discussing the individual features a lot easier. (Also making it easier to commit changes that have smaller overall impact such as the %20 feature). Also, the latest patch still has some flaws. (Wrong path as files are prefixed with 'a/' and some indent issues here and there.)
        Hide
        Markus Jelsma added a comment -

        Path prefixes can be overcome by using -p1 with patch. Indenting will be looked after.

        I also believe we should to disable this feature by default as existing crawls are very much affected by this issue and need to be taken care off. One would have to renormalize the crawldb (and linkdb and webgraphdb if in use) which is trivial. However, the keys in existing segments are not updated so indexing the data won't work anymore.

        Show
        Markus Jelsma added a comment - Path prefixes can be overcome by using -p1 with patch. Indenting will be looked after. I also believe we should to disable this feature by default as existing crawls are very much affected by this issue and need to be taken care off. One would have to renormalize the crawldb (and linkdb and webgraphdb if in use) which is trivial. However, the keys in existing segments are not updated so indexing the data won't work anymore.
        Hide
        Radim Kolar added a comment -

        1. patch --ignore-whitespace

        2. only small portion of existing crawled segments will become invalid and data in crawled segments has short lifetime anyway - They are indexed and deleted

        3. decoding %2f into / is good idea because ASP.NET generated redirects comes with / encoded as %2f. (this was initially included but removed due to Markus complain)

        4. No, i am not going to split such trivial patch like this

        Show
        Radim Kolar added a comment - 1. patch --ignore-whitespace 2. only small portion of existing crawled segments will become invalid and data in crawled segments has short lifetime anyway - They are indexed and deleted 3. decoding %2f into / is good idea because ASP.NET generated redirects comes with / encoded as %2f. (this was initially included but removed due to Markus complain) 4. No, i am not going to split such trivial patch like this
        Hide
        Ferdy Galema added a comment -

        @Markus/Radim

        I certainly do not want to nitpick about patches, but I think feedback about unnecessary changes or malformed patches should be given. Of course when applying the patch you could simply ignore or correct them, but in the end higher quality patches benefit all of us. It just makes the process of reviewing/editing/committing a lot easier.

        @Radim

        Do you agree that "better url-normalizer basic" is perhaps overly broad? I can probably think of tens of other improvements that fall under the scope of a better basic urlnormalizer. Discussing / managing them in separate issues is much more efficient than cramming them all into a single one.

        Anyway this is not to undermine the effort of course. Keep up the good work! (And feel free to disagree)

        Cheers!

        Show
        Ferdy Galema added a comment - @Markus/Radim I certainly do not want to nitpick about patches, but I think feedback about unnecessary changes or malformed patches should be given. Of course when applying the patch you could simply ignore or correct them, but in the end higher quality patches benefit all of us. It just makes the process of reviewing/editing/committing a lot easier. @Radim Do you agree that "better url-normalizer basic" is perhaps overly broad? I can probably think of tens of other improvements that fall under the scope of a better basic urlnormalizer. Discussing / managing them in separate issues is much more efficient than cramming them all into a single one. Anyway this is not to undermine the effort of course. Keep up the good work! (And feel free to disagree) Cheers!
        Hide
        Radim Kolar added a comment -

        a/ Please direct your complains about quality of git generated patches to git mailing list. i am not going to generate patches for you manually by running diff -Naur

        b/ if you used something better then SVN (hg,git,bzr) you can cherrypick changes from my branch, create new branches for every subtask and attaching branches to JIRA reports and then you can discuss them separately.

        c/ more efficient is if i dont spend more 10x more time in pointless discussions then on coding

        Show
        Radim Kolar added a comment - a/ Please direct your complains about quality of git generated patches to git mailing list. i am not going to generate patches for you manually by running diff -Naur b/ if you used something better then SVN (hg,git,bzr) you can cherrypick changes from my branch, create new branches for every subtask and attaching branches to JIRA reports and then you can discuss them separately. c/ more efficient is if i dont spend more 10x more time in pointless discussions then on coding
        Hide
        Radim Kolar added a comment -

        Added support for encoding string to UTF-8 and then URL %escaping it.

        Show
        Radim Kolar added a comment - Added support for encoding string to UTF-8 and then URL %escaping it.
        Hide
        Julien Nioche added a comment -

        @Radim

        Sounds like "I am not going to" is your favourite phrase.

        You certainly prefer GIT to SVN, but the fact is that Nutch uses the latter and contributions are expected to be generated with 'svn diff'. By going on with what the rest of the community do (vs imposing your ways to others) you will make it easier for people to discuss, review and commit your contributions.

        http://wiki.apache.org/nutch/HowToContribute >> Creating a patch
        http://www.apache.org/foundation/how-it-works.html >> Philosophy

        Thanks

        Show
        Julien Nioche added a comment - @Radim Sounds like "I am not going to" is your favourite phrase. You certainly prefer GIT to SVN, but the fact is that Nutch uses the latter and contributions are expected to be generated with 'svn diff'. By going on with what the rest of the community do (vs imposing your ways to others) you will make it easier for people to discuss, review and commit your contributions. http://wiki.apache.org/nutch/HowToContribute >> Creating a patch http://www.apache.org/foundation/how-it-works.html >> Philosophy Thanks
        Hide
        Radim Kolar added a comment -

        Attached patch was in improper format.

        Show
        Radim Kolar added a comment - Attached patch was in improper format.
        Hide
        Markus Jelsma added a comment -

        Restored original patch.

        Show
        Markus Jelsma added a comment - Restored original patch.
        Hide
        Radim Kolar added a comment -

        By removing my patch i also withdraw permission which i previously granted to ASF to distribute my copyrighted works.

        Show
        Radim Kolar added a comment - By removing my patch i also withdraw permission which i previously granted to ASF to distribute my copyrighted works.
        Hide
        Ferdy Galema added a comment -

        Radim, that's funny. I don't believe that is possible.

        Also, aside from the fact that you could not deal with the feedback about the patch quality, I simply did not agree with the entire contents of the patch. There are some parts in it which I did not understand yet (or still don't). You failed to deal with this feedback too. So please do not fool yourself that this is simply about the patch quality.

        Show
        Ferdy Galema added a comment - Radim, that's funny. I don't believe that is possible. Also, aside from the fact that you could not deal with the feedback about the patch quality, I simply did not agree with the entire contents of the patch. There are some parts in it which I did not understand yet (or still don't). You failed to deal with this feedback too. So please do not fool yourself that this is simply about the patch quality.
        Hide
        Chris A. Mattmann added a comment -

        Guys: let's change the tone of this issue, OK?

        Radim, thanks for your patch. Sorry that it didn't get applied or that folks tried to engage in feedback/discussion with you on it. I would encourage you to not get discouraged and I appreciate your effort in trying to contribute to the Apache Nutch project.

        The committers are the ones that have to figure out how to maintain things and sometimes we get hung up on yes I'll agree less important issues. I'm going to recommend that everyone just table those at the moment and that we move forward here.

        Here are some concrete next steps:

        1. Ferdy: is it possible to commit a portion of this patch that you do understand? Then we could leave the part that you don't uncommitted. This has 2 immediate goals:

        • gives Radim a good feeling for contributing to the project – he deserves that.
        • gives us the ability to cherry pick what we understand and are willing to maintain

        2. Radim: if you want to help in improving the formatting and other requested issues, great. If you don't then that's fine too. At that point though the maintenance/evolution of the patch will transition more into the Nutch folks and you might not be as involved with it unless you get on board with what the guys have decided are their code formatting and patch generation guidelines.

        Thanks!

        Show
        Chris A. Mattmann added a comment - Guys: let's change the tone of this issue, OK? Radim, thanks for your patch. Sorry that it didn't get applied or that folks tried to engage in feedback/discussion with you on it. I would encourage you to not get discouraged and I appreciate your effort in trying to contribute to the Apache Nutch project. The committers are the ones that have to figure out how to maintain things and sometimes we get hung up on yes I'll agree less important issues. I'm going to recommend that everyone just table those at the moment and that we move forward here. Here are some concrete next steps: 1. Ferdy: is it possible to commit a portion of this patch that you do understand? Then we could leave the part that you don't uncommitted. This has 2 immediate goals: gives Radim a good feeling for contributing to the project – he deserves that. gives us the ability to cherry pick what we understand and are willing to maintain 2. Radim: if you want to help in improving the formatting and other requested issues, great. If you don't then that's fine too. At that point though the maintenance/evolution of the patch will transition more into the Nutch folks and you might not be as involved with it unless you get on board with what the guys have decided are their code formatting and patch generation guidelines. Thanks!
        Hide
        Ferdy Galema added a comment -

        Like I said before, I'm up for converting spaces to %20, so that at least the fetcher will not fail. Also I think convert lowercase escapings to uppercase is a good idea. (Though I'm not complete sure if this is completely interchangable)

        However I cannot commit this using the latest patch, because obviously it is intertwined with other changes. Why is the url decoded before changes spaces into %20 whereafter it is encoded again? Why "Remove % encoding from URL in range 0x20-0x80 exclusive / and # are not decoded" and why "// this pattern tries to find spots like "%34" in the url"?

        I think changes to default filtering/normalizing should be very well thought out, because like Markus said the impact can be potentially very big. In short, I'm not able to commit anything as of now. (But if anyone else is, don't be stopped by me). Just my 2 cents.

        Show
        Ferdy Galema added a comment - Like I said before, I'm up for converting spaces to %20, so that at least the fetcher will not fail. Also I think convert lowercase escapings to uppercase is a good idea. (Though I'm not complete sure if this is completely interchangable) However I cannot commit this using the latest patch, because obviously it is intertwined with other changes. Why is the url decoded before changes spaces into %20 whereafter it is encoded again? Why "Remove % encoding from URL in range 0x20-0x80 exclusive / and # are not decoded" and why "// this pattern tries to find spots like "%34" in the url"? I think changes to default filtering/normalizing should be very well thought out, because like Markus said the impact can be potentially very big. In short, I'm not able to commit anything as of now. (But if anyone else is, don't be stopped by me). Just my 2 cents.
        Hide
        Radim Kolar added a comment -

        If you are so clever and hard working then stop undeleting my patch and write better one yourself. I am licensing my work as Affero GPL v3 from now.

        You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months.

        I reworked major portion of nutch and i dont want to spend years waiting if they and ever and when will be merged. I have hadoop 0.21 api, generator with plugable algorithm, fixed building with maven, database backend switched to cassandra and other stuff. For me is far better to just pull 20 yours patches per month from github and dont waste my time with you in pointless discussions like git vs svn diff format.

        Show
        Radim Kolar added a comment - If you are so clever and hard working then stop undeleting my patch and write better one yourself. I am licensing my work as Affero GPL v3 from now. You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months. I reworked major portion of nutch and i dont want to spend years waiting if they and ever and when will be merged. I have hadoop 0.21 api, generator with plugable algorithm, fixed building with maven, database backend switched to cassandra and other stuff. For me is far better to just pull 20 yours patches per month from github and dont waste my time with you in pointless discussions like git vs svn diff format.
        Hide
        Chris A. Mattmann added a comment -

        You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months.

        This will NOT be tolerated at Apache. Take it off list. Racial slurs, insults, etc. have no place here on the list.

        Show
        Chris A. Mattmann added a comment - You simply need months to discuss trivial code change. Everybody here claims to be smart like TV and hard working like black men but look at your results: mere 13 trivial commits in October. Look at my results i have 2.1 billions files indexed in 4 months. This will NOT be tolerated at Apache. Take it off list. Racial slurs, insults, etc. have no place here on the list.
        Hide
        Radim Kolar added a comment -

        Remove my patch from this ticket. I hold copyrights on it and i have right to decide what to do with my copyrighted work. I don't want it to be included.

        Show
        Radim Kolar added a comment - Remove my patch from this ticket. I hold copyrights on it and i have right to decide what to do with my copyrighted work. I don't want it to be included.
        Hide
        Chris A. Mattmann added a comment -

        I was told the following from an experienced ASF infra person:

        While a contribution under the AL2 cannot
        legally be withdrawn, ASF policy is to abide by the wishes of the
        contributor even when legally we have sufficient rights to ignore them.

        So, we will leave the patch in place for the record, but abide by your wishes
        not to include it.

        Show
        Chris A. Mattmann added a comment - I was told the following from an experienced ASF infra person: While a contribution under the AL2 cannot legally be withdrawn, ASF policy is to abide by the wishes of the contributor even when legally we have sufficient rights to ignore them. So, we will leave the patch in place for the record, but abide by your wishes not to include it.

          People

          • Assignee:
            Unassigned
            Reporter:
            Radim Kolar
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Time Tracking

              Estimated:
              Original Estimate - 4h
              4h
              Remaining:
              Remaining Estimate - 4h
              4h
              Logged:
              Time Spent - Not Specified
              Not Specified

                Development