Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.1, nutchgora
    • Fix Version/s: 1.10
    • Component/s: protocol
    • Patch Info:
      Patch Available

      Description

      I've created a patch against the trunk which adds very rudimentary support for POST-based authentication. It takes a link from nutch-site.xml containing the site to POST to and the respective parameters (username, password, etc.). On every request it then checks whether any cookies have been initialized and, if none have, fetches them from the given link.

      This isn't perfect but Works For Me (TM), as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be the ability to specify one particular cookie whose expiration date is checked. If anyone besides me is interested in this, I'd be glad to put more effort into making it more universally applicable.

      1. NUTCH-827-trunk-v3.patch
        27 kB
        Sebastian Nagel
      2. NUTCH-827-trunkv2.patch
        27 kB
        Lewis John McGibbney
      3. NUTCH-827-trunk.patch
        29 kB
        Lewis John McGibbney
      4. http-client-form-authtication.patch
        17 kB
        jefferyyuan
      5. nutch-http-cookies.patch
        9 kB
        Jasper van Veghel

        Issue Links

          Activity

          Jasper van Veghel created issue -
          Jasper van Veghel made changes -
          Field Original Value New Value
          Attachment nutch-http-cookies.patch [ 12445647 ]
          Ian Piper added a comment -

          Could you possibly say exactly where the username and password need to go? I presently have these in runtime/local/conf/httpclient-auth.xml, but this doesn't seem to work. Also, what URL needs to go in the nutch-site.xml file?

          Markus Jelsma made changes -
          Fix Version/s 1.5 [ 12318246 ]
          Lewis John McGibbney made changes -
          Link This issue is blocked by NUTCH-1086 [ NUTCH-1086 ]
          Markus Jelsma added a comment -

          20120304-push-1.6

          Markus Jelsma made changes -
          Fix Version/s 1.6 [ 12319941 ]
          Fix Version/s 1.5 [ 12318246 ]
          Max Dzyuba added a comment -

          Jasper, if you're still around, can you please answer Ian's question? I have a similar problem figuring out how exactly I can provide the username and password.

          Thank you!

          Jasper van Veghel added a comment -

          Hey guys,

          This has been some time back, but take a look at the patch:

          nutch-default.xml ..

          <name>http.cookie.login.page</name>
          <description>URL of the login page to derive the cookies from. Cookies will be stored upon initialization and re-initialized upon expiration. Any URL request attributes will be [..] POSTed to the page. [..]</description>

          Apologies for the poor grammar in the original. Basically:

          • Whenever protocol-httpclient performs an HTTP request, it will first check if there are cookies stored in the cookie jar.
          • If there are cookies in the cookie jar AND none of the cookies have expired, it will do nothing.
          • If there are no cookies in the cookie jar OR at least one of the cookies has expired, it will ..
          • POST the URL / parameters provided in "http.cookie.login.page" property
          • In the process of which, the cookie jar should get filled with the cookies you need to perform subsequent (authenticated) requests
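
The decision described in the list above can be sketched roughly as follows. This is only an illustration of the logic, not the code from the patch: the `Cookie` record and the `needsLogin` method are hypothetical names, and the patch itself works with the cookie state of Commons HttpClient rather than a plain list.

```java
import java.time.Instant;
import java.util.List;

/** Sketch of the "do we need to log in again?" decision; all names are hypothetical. */
final class CookieJarCheck {

  /** Minimal stand-in for an HTTP cookie; expiry is null for session cookies. */
  record Cookie(String name, String value, Instant expiry) {
    boolean isExpired(Instant now) {
      return expiry != null && !expiry.isAfter(now);
    }
  }

  /** Returns true when the jar is empty or at least one cookie has expired. */
  static boolean needsLogin(List<Cookie> jar, Instant now) {
    return jar.isEmpty() || jar.stream().anyMatch(c -> c.isExpired(now));
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    List<Cookie> fresh = List.of(new Cookie("SESSION", "abc", now.plusSeconds(3600)));
    List<Cookie> stale = List.of(new Cookie("SESSION", "abc", now.minusSeconds(1)));
    System.out.println("empty jar -> " + needsLogin(List.of(), now));
    System.out.println("fresh jar -> " + needsLogin(fresh, now));
    System.out.println("stale jar -> " + needsLogin(stale, now));
  }
}
```

Only when `needsLogin` returns true would the login POST be issued before the actual request.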

          The "http.cookie.login.page" property could contain something like "http://abc/def?username=foo&password=bar". The 'username' and 'password' parameters will then be POSTed to 'http://abc/def', which should result in cookies being added to the cookie jar that is used for each subsequent request.

          This isn't exactly a fool-proof solution (what if other requests generate expired cookies? what if the login fails? etc.), but for the project for which I wrote the patch, it suited our needs. Hope it helps!
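
How a "http.cookie.login.page" value is split into a POST target and form parameters can be illustrated with a small sketch. It assumes the value is a plain URL with an ordinary query string; the class and method names are made up for this example and are not taken from the patch.

```java
import java.net.URI;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Splits a login URL into the POST target and the form parameters (hypothetical helper). */
final class LoginUrlSplitter {

  /** Returns the URL without its query string, i.e. the address to POST to. */
  static String postTarget(String loginPage) {
    int q = loginPage.indexOf('?');
    return q < 0 ? loginPage : loginPage.substring(0, q);
  }

  /** Returns the decoded query parameters in the order they appear in the URL. */
  static Map<String, String> postParameters(String loginPage) {
    Map<String, String> params = new LinkedHashMap<>();
    String query = URI.create(loginPage).getRawQuery();
    if (query == null || query.isEmpty()) {
      return params;
    }
    for (String pair : query.split("&")) {
      int eq = pair.indexOf('=');
      String name = eq < 0 ? pair : pair.substring(0, eq);
      String value = eq < 0 ? "" : pair.substring(eq + 1);
      params.put(URLDecoder.decode(name, StandardCharsets.UTF_8),
          URLDecoder.decode(value, StandardCharsets.UTF_8));
    }
    return params;
  }

  public static void main(String[] args) {
    String loginPage = "http://abc/def?username=foo&password=bar";
    System.out.println("POST to: " + postTarget(loginPage));
    System.out.println("with:    " + postParameters(loginPage));
  }
}
```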

          Max Dzyuba added a comment -

          Hi Jasper,

          Thanks a lot for the explanation! I applied the patch and compiled Nutch just fine, but can't confirm that it is working. Can you point to a website where this patch successfully passed form authentication? I need to verify that it is working for me, but can't at the moment.

          Thanks in advance,
          Max

          Jasper van Veghel added a comment -

          Hi Max,

          I'm sorry, but I don't really use the patch anymore, so I wouldn't be able to tell you. We used it in conjunction with an internal SAP system that we needed to spider, so that's not a public source you could try it against. Why not write up your own quick script which lets you POST some data, sets a cookie, and then returns some specific piece of data only when that cookie is set?

          Good luck!

          Jasper
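
A throwaway test endpoint of the kind suggested above could look roughly like the sketch below. It uses the JDK's built-in `com.sun.net.httpserver.HttpServer`; the paths `/login` and `/secret`, the credentials, the cookie value and the port are all made up for illustration.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.List;

/** Throwaway login test server; paths, credentials and cookie are invented for this sketch. */
final class CookieLoginTestServer {

  static final String COOKIE = "SESSION=letmein";

  /** Starts the server on the given port (0 picks a free port). */
  static HttpServer start(int port) throws IOException {
    HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
    server.createContext("/login", CookieLoginTestServer::handleLogin);
    server.createContext("/secret", CookieLoginTestServer::handleSecret);
    server.start();
    return server;
  }

  /** Accepts POSTed form data and sets the cookie when the credentials match. */
  private static void handleLogin(HttpExchange ex) throws IOException {
    String body = new String(ex.getRequestBody().readAllBytes(), StandardCharsets.UTF_8);
    boolean ok = "POST".equals(ex.getRequestMethod())
        && body.contains("username=foo") && body.contains("password=bar");
    if (ok) {
      ex.getResponseHeaders().add("Set-Cookie", COOKIE + "; Path=/");
    }
    respond(ex, ok ? 200 : 403, ok ? "logged in" : "login failed");
  }

  /** Returns the protected text only when the cookie is sent back. */
  private static void handleSecret(HttpExchange ex) throws IOException {
    List<String> cookies = ex.getRequestHeaders().get("Cookie");
    boolean ok = cookies != null && cookies.stream().anyMatch(c -> c.contains(COOKIE));
    respond(ex, ok ? 200 : 403, ok ? "secret content" : "please log in");
  }

  private static void respond(HttpExchange ex, int code, String text) throws IOException {
    byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
    ex.sendResponseHeaders(code, bytes.length);
    try (OutputStream out = ex.getResponseBody()) {
      out.write(bytes);
    }
  }

  public static void main(String[] args) throws IOException {
    HttpServer server = start(8080);
    System.out.println("Listening on http://localhost:" + server.getAddress().getPort());
  }
}
```

With this running, "http.cookie.login.page" would point at the `/login` URL with the two parameters in its query string, and a crawl of `/secret` shows whether the cookie was sent along.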

          Max Dzyuba added a comment -

          Hi Jasper,

          Thank you for the tip! I'll have to go that way then.

          Best regards,
          Max

          Max Dzyuba added a comment -

          Hi Jasper,

          I've set up a script that does just that (receives POSTed data, sets a cookie, returns some data if the cookie is set), but now I have this error in my log:

          2012-10-01 13:11:24,557 ERROR httpclient.Http - Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.
          2012-10-01 13:11:24,682 ERROR httpclient.Http - Unable to retrieve login page; code = 200

          The second line with response code 200 is what I don't understand. I'd appreciate any tips you could give in this regard.

          Thanks,
          Max

          Jasper van Veghel added a comment - - edited

          Looks like a pretty sloppy mistake in the patch ..

          +      if (code == 200 && Http.LOG.isTraceEnabled()) {
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          +      } else {
          +          Http.LOG.error("Unable to retrieve login page; code = " + code);
          +      }
          

          Change that to something like ..

          +      if (code == 200 && Http.LOG.isTraceEnabled()) {
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          +      } else if (code != 200) {
          +          Http.LOG.error("Unable to retrieve login page; code = " + code);
          +      }
          

          And also change this ..

          +          LOG.error("Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.");
          

          To something like this ..

          +          LOG.error("Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.", e);
          

          To see where the Exception is coming from. All it does after that LOG.error() is release the connection. So it shouldn't be throwing an Exception.
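
Why the original condition reports an error for a 200 response can be seen by reducing both versions to the log level they choose. The sketch below is a simplified model of the two `if` statements quoted above, not code from the patch; the `Level` enum and method names are hypothetical.

```java
/** Models which log level the original and the corrected condition select. */
final class LoginResponseLogging {

  enum Level { ERROR, TRACE, NONE }

  /** Original logic: anything that is not "200 with TRACE enabled" ends up as an error. */
  static Level originalLevelFor(int code, boolean traceEnabled) {
    if (code == 200 && traceEnabled) {
      return Level.TRACE;
    }
    return Level.ERROR;
  }

  /** Corrected logic: an error is logged only when the status code is not 200. */
  static Level fixedLevelFor(int code, boolean traceEnabled) {
    if (code == 200 && traceEnabled) {
      return Level.TRACE;
    } else if (code != 200) {
      return Level.ERROR;
    }
    return Level.NONE;
  }

  public static void main(String[] args) {
    // A successful login with TRACE disabled is the case that produced the confusing message.
    System.out.println("original: " + originalLevelFor(200, false));
    System.out.println("fixed:    " + fixedLevelFor(200, false));
  }
}
```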

          Max Dzyuba added a comment -

          Now I get the following error:

          2012-10-01 14:40:54,996 ERROR httpclient.Http - Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.
          java.lang.IllegalArgumentException: Entity enclosing requests cannot be redirected without user intervention
          at org.apache.commons.httpclient.methods.EntityEnclosingMethod.setFollowRedirects(EntityEnclosingMethod.java:225)
          at org.apache.nutch.protocol.httpclient.HttpCookieAuthentication.<init>(HttpCookieAuthentication.java:73)
          at org.apache.nutch.protocol.httpclient.Http.resolveCookieCredentials(Http.java:402)
          at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:387)
          at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:152)
          at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:440)
          at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:425)
          at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:403)
          at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:668)

          Sorry to bug you about this...

          Thanks for your time!
          Max

          Jasper van Veghel added a comment -

          That exception looks familiar — I think that we ended up solving that simply by removing ..

          +    method.setFollowRedirects(followRedirects);
          

          As redirects are not supported for POST-requests.

          Max Dzyuba added a comment -

          Hi Jasper,

          Thanks, removing that line fixed the exception problem.
          At the moment, the log file doesn't have any errors related to HTTPclient plugin or authentication process. However, my tests show that the cookie can't be read by the test auth page I've set up.

          Is there an easy way to verify if the cookie was created by Nutch and stored as intended?

          Thanks,
          Max

          Jasper van Veghel added a comment -
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          

          If you turn on TRACE logging, you should see messages like that.

          Max Dzyuba added a comment -

          Thank you, Jasper. I did just that and now I see that cookies are received (at least in some cases).

          Do you know of any reason why I still wouldn't be able to retrieve pages that require authentication (even though I see the cookies stored)? Does it have to do with those pages returning status code "200"?

          Thanks for the help!

          Lewis John McGibbney made changes -
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 1.6 [ 12319941 ]
          Lewis John McGibbney made changes -
          Fix Version/s 2.2 [ 12323285 ]
          Lewis John McGibbney made changes -
          Fix Version/s 2.3 [ 12324325 ]
          Fix Version/s 1.7 [ 12323281 ]
          Fix Version/s 2.2 [ 12323285 ]
          Sebastian Nagel made changes -
          Fix Version/s 1.8 [ 12324326 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.9 [ 12324611 ]
          Fix Version/s 2.3 [ 12324325 ]
          Fix Version/s 1.8 [ 12324326 ]
          jefferyyuan made changes -
          Attachment http-client-form-authtication.patch [ 12632311 ]
          jefferyyuan added a comment -

          I was assigned a task to use Nutch 2 to crawl a web site which uses form-based authentication.
          Based on Jasper's code, I made some improvements to make it work. Please view the patch: http-client-form-authtication.patch.

          To use it, first work out how to perform the form-based login successfully with the HTTP client. Chrome DevTools can be used to find the login form id and the username and password fields, and to capture the exact POST request; some form fields may need to be removed, or some headers added.

          private static void authTestAspWebApp() throws Exception {
            // Describe the login form: URL, form id and redirect flag.
            HttpFormAuthConfigurer authConfigurer = new HttpFormAuthConfigurer();
            authConfigurer.setLoginUrl("http://localhost:44444/Account/Login.aspx")
              .setLoginFormId("ctl01").setLoginRedirect(true);

            // Form fields to submit with the login request.
            Map<String, String> loginPostData = new HashMap<String, String>();
            loginPostData.put("ctl00$MainContent$LoginUser$UserName", "admin");
            loginPostData.put("ctl00$MainContent$LoginUser$Password", "admin123");
            authConfigurer.setLoginPostData(loginPostData);

            // Form fields to leave out of the login request.
            Set<String> removedFormFields = new HashSet<String>();
            removedFormFields.add("ctl00$MainContent$LoginUser$RememberMe");
            authConfigurer.setRemovedFormFields(removedFormFields);

            // Log in, then fetch a page that requires authentication.
            HttpFormAuthentication example = new HttpFormAuthentication(
              authConfigurer);
            example.login();
            String result = example
              .httpGetPageContent("http://localhost:44444/secret/needlogin.aspx");
            System.out.println(result);
          }
          

          After making the test code work, we define the form authentication info in httpclient-auth.xml:

          <?xml version="1.0"?>
          <auth-configuration>
            <credentials authMethod="formAuth" loginUrl="http://localhost:44444/Account/Login.aspx" loginFormId="ctl01" loginRedirect="true">
              <loginPostData>
                <field name="ctl00$MainContent$LoginUser$UserName" value="admin"/>
                <field name="ctl00$MainContent$LoginUser$Password" value="admin123"/>
              </loginPostData>
              <removedFormFields>
                <field name="ctl00$MainContent$LoginUser$RememberMe"/>
              </removedFormFields>
            </credentials>
          </auth-configuration>
          

          Be sure to use the protocol-httpclient plugin in nutch-site.xml, not protocol-http.
          If you are interested, you may read: http://lifelongprogrammer.blogspot.com/2014/02/part1-using-apache-http-client-to-do-http-post-form-authentication.html

          Julien Nioche made changes -
          Component/s protocol [ 12318529 ]
          Component/s fetcher [ 11591 ]
          Sebastian Nagel made changes -
          Link This issue is duplicated by NUTCH-1518 [ NUTCH-1518 ]
          Sebastian Nagel made changes -
          Link This issue relates to NUTCH-1613 [ NUTCH-1613 ]
          Julien Nioche made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.9 [ 12324611 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.11 [ 12329358 ]
          Fix Version/s 1.10 [ 12327187 ]
          Lewis John McGibbney made changes -
          Assignee Lewis John McGibbney [ lewismc ]
          Lewis John McGibbney made changes -
          Fix Version/s 2.4 [ 12324540 ]
          Lewis John McGibbney added a comment -

          I am working on this issue as I require form-based authentication for a current research task.

          Lewis John McGibbney added a comment -

          Patch for trunk.
          This has been tested and verified to enable access to various large databases requiring HTTP POST authentication. I would also like to mention that setting the redirect boolean flag to true is almost always required.
          Would really appreciate if folks could try this out and comment.

          Lewis John McGibbney made changes -
          Attachment NUTCH-827-trunk.patch [ 12696328 ]
          Lewis John McGibbney made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Lewis John McGibbney made changes -
          Link This issue is blocked by NUTCH-1086 [ NUTCH-1086 ]
          Lewis John McGibbney made changes -
          Link This issue is related to NUTCH-1929 [ NUTCH-1929 ]
          Lewis John McGibbney made changes -
          Fix Version/s 1.10 [ 12327187 ]
          Fix Version/s 1.11 [ 12329358 ]
          Sebastian Nagel added a comment -

          Looks promising:

          • successfully crawled one protected site
          • a second trial failed because the form element is referenced via "name" attribute instead of "id". That's obviously ok, maybe old-style/deprecated (cf. [1], [2]). I'll continue this trial to provide a fix/work-around.
          • log level TRACE should provide sufficient information what goes wrong when logging in
          • config file to be committed should be conf/httpclient-auth.xml.template instead of conf/httpclient-auth.xml
          Lewis John McGibbney added a comment -

          Sebastian Nagel fantastic, thanks

          a second trial failed because the form element is referenced via "name" attribute instead of "id". That's obviously ok, maybe old-style/deprecated (cf. [1], [2]). I'll continue this trial to provide a fix/work-around.

          Ah... possibly try id, if empty try name?

          log level TRACE should provide sufficient information what goes wrong when logging in

          +1

          config file to be committed should be conf/httpclient-auth.xml.template instead of conf/httpclient-auth.xml

          +1, patch coming up
          Thanks for review

          Lewis John McGibbney made changes -
          Status In Progress [ 3 ] Open [ 1 ]
          Lewis John McGibbney added a comment -

          Updated patch for trunk which takes on Sebastian Nagel's comments.

          • I've moved the additions to httpclient-auth.xml to httpclient-auth.xml.template
          • I've also added some primitive checking for form 'name' if we cannot locate an 'id'
              Element loginform = doc.getElementById(authConfigurer.getLoginFormId());
              if (loginform == null) {
                LOGGER.debug("'id' attribute for form element is null, trying 'name'.");
                loginform = doc.select("form.answer[name="+ authConfigurer.getLoginFormId() + "]").first();
                if (loginform == null) {
                  LOGGER.debug("'name' attribute for form element is also null.");
                  throw new IllegalArgumentException("No form exists: "
                      + authConfigurer.getLoginFormId());
                }
              }
          

          The rest seems to be OK to me and I am able to use this patch to fetch content from secure databases.

          Lewis John McGibbney made changes -
          Attachment NUTCH-827-trunkv2.patch [ 12697655 ]
          Lewis John McGibbney made changes -
          Fix Version/s 2.4 [ 12324540 ]
          Lewis John McGibbney made changes -
          Link This issue is related to NUTCH-1940 [ NUTCH-1940 ]
          Lewis John McGibbney added a comment -

          Would be great to commit and get in to 1.10

          Sebastian Nagel added a comment -

          Hi Lewis John McGibbney, attached patch fixes two points

          • the CSS selector used to select "form" elements by "name" attribute didn't work properly
          • (should be documented) the configuration allows to set <additionalPostHeaders>, e.g.
             <additionalPostHeaders>
               <field name="User-Agent"
                      value="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:35.0) Gecko/20100101 Firefox/35.0" />
             </additionalPostHeaders>
            

          One point is open (but we could delay it, it may take some work):

          • the form authentication is global and ignores <authScope>, so you have to restrict your crawl to the form authentication pages only. Ideally, form authentication should also be bound to a scope (one host, one URL prefix, etc.), the same as HTTP authentication.
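
          The scoping idea in the open point above could look roughly like the following sketch: form authentication applies only to URLs whose host and path prefix match a configured scope. The class FormAuthScopeSketch and its fields are hypothetical. This is not how the committed patch behaves (it applies form authentication globally), nor is it an existing Nutch API.

```java
import java.net.URI;

// Hypothetical scope check for illustration only; not part of the Nutch patch.
class FormAuthScopeSketch {
  private final String host;        // host the form login is valid for
  private final String pathPrefix;  // URL path prefix covered by the login

  FormAuthScopeSketch(String host, String pathPrefix) {
    this.host = host;
    this.pathPrefix = pathPrefix;
  }

  /** Returns true if the URL falls inside this scope and should use form authentication. */
  boolean matches(String url) {
    URI uri;
    try {
      uri = URI.create(url);
    } catch (IllegalArgumentException e) {
      return false; // unparsable URLs are never in scope
    }
    String urlHost = uri.getHost();
    if (urlHost == null || !urlHost.equalsIgnoreCase(host)) {
      return false;
    }
    String path = uri.getPath();
    if (path == null || path.isEmpty()) {
      path = "/"; // treat "http://host" like "http://host/"
    }
    return path.startsWith(pathPrefix);
  }

  public static void main(String[] args) {
    FormAuthScopeSketch scope = new FormAuthScopeSketch("example.org", "/private/");
    String[] urls = {
        "https://example.org/private/report.html",
        "https://example.org/public/index.html",
        "https://other.example.com/private/report.html"
    };
    for (String url : urls) {
      System.out.println(url + " in scope: " + scope.matches(url));
    }
  }
}
```

          With a check like this, a crawl could include pages outside the authenticated area without sending the login POST for them.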
          Sebastian Nagel made changes -
          Attachment NUTCH-827-trunk-v3.patch [ 12698825 ]
          Lewis John McGibbney added a comment -

          Fantastic Sebastian Nagel
          I will commit this patch and log an issue to accommodate and address your final suggestion (and an excellent one it is too!).
          Thanks Seb.

          Lewis John McGibbney added a comment -

          Committed @revision 1659697 in trunk
          Thank you to everyone involved. All credited in CHANGES

          Lewis John McGibbney made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Lewis John McGibbney made changes -
          Link This issue relates to NUTCH-1943 [ NUTCH-1943 ]
          Lewis John McGibbney added a comment -

          part 2 (new files) Committed @revision 1659701

          Hudson added a comment -

          SUCCESS: Integrated in Nutch-trunk #2976 (See https://builds.apache.org/job/Nutch-trunk/2976/)
          NUTCH-827 HTTP POST Authentication (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659701)

          • /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpFormAuthConfigurer.java
          • /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpFormAuthentication.java

          NUTCH-827 HTTP POST Authentication (lewismc: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1659697)

          • /nutch/trunk/CHANGES.txt
          • /nutch/trunk/conf/httpclient-auth.xml.template
          • /nutch/trunk/src/plugin/protocol-httpclient/ivy.xml
          • /nutch/trunk/src/plugin/protocol-httpclient/plugin.xml
          • /nutch/trunk/src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
          Tyler Palsulich added a comment -

          Lewis John McGibbney, I think the new template wasn't included in the commit.

          Sebastian Nagel added a comment -

          Hi Tyler Palsulich, what is meant by "new template"? The file "conf/httpclient-auth.xml.template" looks ok. When running Nutch from a dev environment, the file "conf/httpclient-auth.xml" needs to be replaced by (or merged with) the template. It's not automatically updated/overwritten (which is ok, in case there are local changes).

          Tyler Palsulich added a comment -

          Ahh. My mistake. You're right. Thanks!

          Lewis John McGibbney made changes -
          Labels authentication authentication memex
          Transition            Time In Source Status   Execution Times   Last Executer          Last Execution Date
          Open -> In Progress   1713d 11h 56m           1                 Lewis John McGibbney   04/Feb/15 00:04
          In Progress -> Open   6d 5h 5m                1                 Lewis John McGibbney   10/Feb/15 05:10
          Open -> Resolved      3d 16h 55m              1                 Lewis John McGibbney   13/Feb/15 22:06

            People

            • Assignee: Lewis John McGibbney
            • Reporter: Jasper van Veghel
            • Votes: 3
            • Watchers: 9

                Development