Nutch
  1. Nutch
  2. NUTCH-559

NTLM, Basic and Digest Authentication schemes for web/proxy server

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels:
      None

      Description

      Added basic, digest and NTLM authentication schemes to protocol-httpclient. The authentication schemes can be configured for proxy server as well as web servers of a domain. HTTP authentication can take place over HTTP/1.0, HTTP/1.1 and HTTPS.

      The authentication guide can be found here: http://wiki.apache.org/nutch/HttpAuthenticationSchemes.

      1. NUTCH-559v0.1.patch
        21 kB
        Susam Pal
      2. NUTCH-559v0.2.patch
        20 kB
        Susam Pal
      3. NUTCH-559v0.3.patch
        27 kB
        Susam Pal
      4. NUTCH-559v0.4.patch
        26 kB
        Susam Pal
      5. NUTCH-559v0.5.patch
        51 kB
        Susam Pal

        Issue Links

          Activity

          Hide
          Susam Pal added a comment -

          I have generated this patch against Nutch trunk. It will add support for authentication schemes, add the related properties in 'conf/nutch-default.xml' and update the 'package.html' for this package. Additionally, the download limit check has been improved.

          To apply and build:-

          patch -p0 < NUTCH-559v0.1.patch
          ant

          Show
          Susam Pal added a comment - I have generated this patch against Nutch trunk. It will add support for authentication schemes, add the related properties in 'conf/nutch-default.xml' and update the 'package.html' for this package. Additionally, the download limit check has been improved. To apply and build:- patch -p0 < NUTCH-559 v0.1.patch ant
          Hide
          Robert Dale added a comment -

          I've tested this patch wrt http basic auth and it works wonderfully. I'd like to see this added sooner rather than later.
          Thanks for the work!

          Show
          Robert Dale added a comment - I've tested this patch wrt http basic auth and it works wonderfully. I'd like to see this added sooner rather than later. Thanks for the work!
          Hide
          Susam Pal added a comment -

          Apart from adding the authentication features, this patch would fix three major issues and one minor issue: NUTCH-481, NUTCH-539, NUTCH-560, NUTCH-561.

          Show
          Susam Pal added a comment - Apart from adding the authentication features, this patch would fix three major issues and one minor issue: NUTCH-481 , NUTCH-539 , NUTCH-560 , NUTCH-561 .
          Hide
          Doğacan Güney added a comment -

          Susam, thanks for turning this into a patch.

          Patch is looking good, but I think it needs to go through a couple of revisions before it is ready. Some comments:

          • Is it possible to extend authentication to work for more than one host? I am thinking something like this:
            <property>
            <name>http.auth</name>
            <value> {host}

            :

            {realm}

            :

            {username}

            :

            {passwd}

            ,...</value>
            </property>

          So that in a single fetch, fetcher can authenticate across different domains.

          • You don't have to call configureClient on every getResponse. I understand that you are doing this for authentication reasons. But you can just move configureClient to where it used to be and add a new method there for authentication.
          • Also, you are removing too much from configureClient (for example, headers). Unless there is a reason that for removing the code, it should stay there.
          • In HttpResponse, you removed all the cookie policy related code. Why?
          • Also in HttpResponse, you can use calculateTryToRead and other response reading code instead of writing your own.
          Show
          Doğacan Güney added a comment - Susam, thanks for turning this into a patch. Patch is looking good, but I think it needs to go through a couple of revisions before it is ready. Some comments: Is it possible to extend authentication to work for more than one host? I am thinking something like this: <property> <name>http.auth</name> <value> {host} : {realm} : {username} : {passwd} ,...</value> </property> So that in a single fetch, fetcher can authenticate across different domains. You don't have to call configureClient on every getResponse. I understand that you are doing this for authentication reasons. But you can just move configureClient to where it used to be and add a new method there for authentication. Also, you are removing too much from configureClient (for example, headers). Unless there is a reason that for removing the code, it should stay there. In HttpResponse, you removed all the cookie policy related code. Why? Also in HttpResponse, you can use calculateTryToRead and other response reading code instead of writing your own.
          Hide
          Susam Pal added a comment -

          Uploading a revised (v0.2) patch which accommodates most of the suggestions by Doğacan. A few points I want to discuss:-

          • Extending the authentication to work for more than one host was in my mind but I found too many possible cases. So I was planning to have a different configuration file where all the authentication rules can be mentioned to override the corresponding 'conf/nutch-site.xml' properties. The different possible cases are:
            • Different credentials for different domain or sub-domains, say, example.com, ad.example.com, example.net, etc.
            • Different credentials for different hosts.
            • Different credentials for different realms.
          • I removed cookie related code earlier because I didn't find it to work (even before merging my work). However, I have brought them back in the revised patch. We can discuss more on this if required.
          • I have restored most of the original response reading code except for 'calculateTryToRead'. This method is not checking for 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http' which is simpler and correct.

          If the idea of having a separate authentication configuration file looks good, I can work on it when I get some free time.

          Show
          Susam Pal added a comment - Uploading a revised (v0.2) patch which accommodates most of the suggestions by Doğacan. A few points I want to discuss:- Extending the authentication to work for more than one host was in my mind but I found too many possible cases. So I was planning to have a different configuration file where all the authentication rules can be mentioned to override the corresponding 'conf/nutch-site.xml' properties. The different possible cases are: Different credentials for different domain or sub-domains, say, example.com, ad.example.com, example.net, etc. Different credentials for different hosts. Different credentials for different realms. I removed cookie related code earlier because I didn't find it to work (even before merging my work). However, I have brought them back in the revised patch. We can discuss more on this if required. I have restored most of the original response reading code except for 'calculateTryToRead'. This method is not checking for 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http' which is simpler and correct. If the idea of having a separate authentication configuration file looks good, I can work on it when I get some free time.
          Hide
          Doğacan Güney added a comment -

          I haven't tested it yet but after a quick review, latest patch looks good to me. However, it would be nice if we can have some unit tests for the new functionality.

          > Extending the authentication to work for more than one host was in my mind but I found too many possible cases. So I was
          > planning to have a different configuration file where all the authentication rules can be mentioned to override the corresponding
          > 'conf/nutch-site.xml' properties. The different possible cases are: [...]

          OK, a different configuration file sounds good (I don't like that we are putting a file in conf/ for a plugin, but we already do that anyway. We should probably prefix the name of the file with plugin's name to make it clear, like: httpclient-auth.txt)

          > I removed cookie related code earlier because I didn't find it to work (even before merging my work). However, I have brought
          > them back in the revised patch. We can discuss more on this if required.

          I think it should work. It doesn't remember cookies across different crawl cycles but it should remember them during a single fetch.

          > I have restored most of the original response reading code except for 'calculateTryToRead'. This method is not checking for
          > 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http' which is simpler
          > and correct.

          OK.

          Show
          Doğacan Güney added a comment - I haven't tested it yet but after a quick review, latest patch looks good to me. However, it would be nice if we can have some unit tests for the new functionality. > Extending the authentication to work for more than one host was in my mind but I found too many possible cases. So I was > planning to have a different configuration file where all the authentication rules can be mentioned to override the corresponding > 'conf/nutch-site.xml' properties. The different possible cases are: [...] OK, a different configuration file sounds good (I don't like that we are putting a file in conf/ for a plugin, but we already do that anyway. We should probably prefix the name of the file with plugin's name to make it clear, like: httpclient-auth.txt) > I removed cookie related code earlier because I didn't find it to work (even before merging my work). However, I have brought > them back in the revised patch. We can discuss more on this if required. I think it should work. It doesn't remember cookies across different crawl cycles but it should remember them during a single fetch. > I have restored most of the original response reading code except for 'calculateTryToRead'. This method is not checking for > 'Content-Length' limit. The content-length limit check present in this patch is similar to that of 'protocol-http' which is simpler > and correct. OK.
          Hide
          Susam Pal added a comment -

          Uploading a revised (v0.3) patch that allows flexible authentication configuration. Different credentials can be configured for different authentication scopes.

          This patch will introduce a new file: 'conf/httpclient-auth.xml.template'. The configuration guidelines are included as comments in the XML file.

          Show
          Susam Pal added a comment - Uploading a revised (v0.3) patch that allows flexible authentication configuration. Different credentials can be configured for different authentication scopes. This patch will introduce a new file: 'conf/httpclient-auth.xml.template'. The configuration guidelines are included as comments in the XML file.
          Hide
          Doğacan Güney added a comment -

          Hi Susam,

          Your last patch looks great!

          I have one minor nit: I think it would be better to move http.auth.* properties to httpclient-auth.xml file so that all authentication configuration is in one place
          (for example, we can add a defaultscope or whatever tag to define defauflt username/password/etc). What do you think?

          Also, I am wondering if we can have some test cases for this patch. Is it possible to start, say, jetty during a junit test and test authentication?

          I am going to test this patch within this week and return with further comments.

          Show
          Doğacan Güney added a comment - Hi Susam, Your last patch looks great! I have one minor nit: I think it would be better to move http.auth.* properties to httpclient-auth.xml file so that all authentication configuration is in one place (for example, we can add a defaultscope or whatever tag to define defauflt username/password/etc). What do you think? Also, I am wondering if we can have some test cases for this patch. Is it possible to start, say, jetty during a junit test and test authentication? I am going to test this patch within this week and return with further comments.
          Hide
          Susam Pal added a comment -

          Uploading a revised (v0.4) patch that has all authentication configuration at one place in httpclient-auth.xml.

          Hi Doğacan,

          Thanks for the suggestion. It really simplified the configuration. Yes, we need some JUnit test cases because there are many different permutations possible in the authentication configuration. We should test as many of them as possible. I'll start working on the test cases when I get some free time again.

          Show
          Susam Pal added a comment - Uploading a revised (v0.4) patch that has all authentication configuration at one place in httpclient-auth.xml. Hi Doğacan, Thanks for the suggestion. It really simplified the configuration. Yes, we need some JUnit test cases because there are many different permutations possible in the authentication configuration. We should test as many of them as possible. I'll start working on the test cases when I get some free time again.
          Hide
          Susam Pal added a comment -

          Uploading a revised (v0.5) patch with some test cases. Added a 'scheme' attribute in <authscope> tag so that different credentials can be specified for different authentication schemes.

          Show
          Susam Pal added a comment - Uploading a revised (v0.5) patch with some test cases. Added a 'scheme' attribute in <authscope> tag so that different credentials can be specified for different authentication schemes.
          Hide
          Les Cheong added a comment -

          Hi Susam.

          Like others, thank you for this valuable patch. It seems to work perfectly so far (with my limited testing, mainly against BASIC authentication schemes), but the configuration aspect is wonderful.

          /LesC

          Show
          Les Cheong added a comment - Hi Susam. Like others, thank you for this valuable patch. It seems to work perfectly so far (with my limited testing, mainly against BASIC authentication schemes), but the configuration aspect is wonderful. /LesC
          Hide
          Doğacan Güney added a comment -

          Susam,

          Sorry for the late review...

          I have finally reviewed your patch and I have to say: Wow! . With the introduction of unit tests, this already-quite-nice patch has become an excellent one (it is both very well written and throughly documented). So, a +1 from me for committing this one.

          Anyway, I guess everyone is on Christmas holiday, so I will wait until after a few days into next year and if there are no objections, commit this patch.

          Show
          Doğacan Güney added a comment - Susam, Sorry for the late review... I have finally reviewed your patch and I have to say: Wow! . With the introduction of unit tests, this already-quite-nice patch has become an excellent one (it is both very well written and throughly documented). So, a +1 from me for committing this one. Anyway, I guess everyone is on Christmas holiday, so I will wait until after a few days into next year and if there are no objections, commit this patch.
          Hide
          Emmanuel Joke added a comment - - edited

          Dogacan, is there any chance that you commit this patch quite soon ?

          Show
          Emmanuel Joke added a comment - - edited Dogacan, is there any chance that you commit this patch quite soon ?
          Hide
          Doğacan Güney added a comment -

          Latest patch committed in rev. 608972 with some minor changes.

          I moved some files around because it seems, doing an "ant test" under src/plugin/protocol-httpclient returned an error before.

          Show
          Doğacan Güney added a comment - Latest patch committed in rev. 608972 with some minor changes. I moved some files around because it seems, doing an "ant test" under src/plugin/protocol-httpclient returned an error before.
          Hide
          Doğacan Güney added a comment -

          Closing this issue, since code is now committed.

          Thanks Susam for the great patch! And a thank you to everyone else who has tested it

          Show
          Doğacan Güney added a comment - Closing this issue, since code is now committed. Thanks Susam for the great patch! And a thank you to everyone else who has tested it
          Hide
          Hudson added a comment -
          Show
          Hudson added a comment - Integrated in Nutch-Nightly #318 (See http://lucene.zones.apache.org:8080/hudson/job/Nutch-Nightly/318/ )

            People

            • Assignee:
              Doğacan Güney
              Reporter:
              Susam Pal
            • Votes:
              7 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development