Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 1.1, nutchgora
    • Fix Version/s: 1.9
    • Component/s: protocol
    • Labels:
    • Patch Info: Patch Available

      Description

      I've created a patch against the trunk which adds very rudimentary support for POST-based authentication. It takes a link from nutch-site.xml containing the site to POST to and its respective parameters (username, password, etc.). On every request it then checks whether any cookies have been initialized and, if none have, fetches them from the given link.

      This isn't perfect but Works For Me (TM), as I generally only need to retrieve results from a single domain and so have no cookie overlap (i.e. if the domain cookies expire, all cookies disappear from the HttpClient and I can simply re-fetch them). A natural improvement would be to be able to specify one particular cookie to check the expiration date against. If anyone besides me is interested in this, I'd be glad to put some more effort into making it more universally applicable.

      1. nutch-http-cookies.patch
        9 kB
        Jasper van Veghel
      2. http-client-form-authtication.patch
        17 kB
        yuanyun.cn

          Activity

          Ian Piper added a comment -

          Could you possibly say exactly where the username and password need to go? I presently have these in runtime/local/conf/httpclient-auth.xml, but this doesn't seem to work. Also, what URL needs to go in the nutch-site.xml file?

          Markus Jelsma added a comment -

          20120304-push-1.6

          Max Dzyuba added a comment -

          Jasper, if you're still around, can you please answer Ian's question? I have a similar problem figuring out how exactly I can provide the username and password.

          Thank you!

          Jasper van Veghel added a comment -

          Hey guys,

          This has been some time back, but take a look at the patch:

          nutch-default.xml ..

          <name>http.cookie.login.page</name>
          <description>URL of the login page to derive the cookies from. Cookies will be stored upon initialization and re-initialized upon expiration. Any URL request attributes will be [..] POSTed to the page. [..]</description>

          Apologies for the poor grammar in the original. Basically:

          • Whenever protocol-httpclient performs an HTTP request, it will first check if there are cookies stored in the cookie jar.
          • If there are cookies in the cookie jar AND none of the cookies have expired, it will do nothing.
          • If there are no cookies in the cookie jar OR at least one of the cookies has expired, it will:
              • POST the URL / parameters provided in the "http.cookie.login.page" property
              • in the process of which the cookie jar should get filled with the cookies you need to perform subsequent (authenticated) requests

          The "http.cookie.login.page" property could contain something like "http://abc/def?username=foo&password=bar". The 'username' and 'password' parameters will then be POSTed to 'http://abc/def', which should result in cookies being added to the cookie jar that is used for each subsequent request.
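
          As a rough illustration of where this goes, the property could be set in nutch-site.xml along the following lines. The URL and credentials are the placeholder values from the comment above, not a real endpoint. Note that a literal & has to be written as &amp; inside an XML value.

```xml
<!-- nutch-site.xml: sketch only; URL and credentials are placeholders -->
<property>
  <name>http.cookie.login.page</name>
  <value>http://abc/def?username=foo&amp;password=bar</value>
</property>
```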

          This isn't exactly a fool-proof solution (what if other requests generate expired cookies? what if the login fails? etc.), but for the project for which I wrote the patch, it suited our needs. Hope it helps!
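
          The behaviour described above can be sketched in plain Java. This is not the code from the patch: the class CookieLoginSketch, the StoredCookie type and the method names are all hypothetical, and only the decision rule (log in when the jar is empty or any cookie has expired) and the splitting of the login URL into an address plus POST parameters are taken from the description.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.Collections;
import java.util.Date;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CookieLoginSketch {

    /** Hypothetical stand-in for a stored cookie; a null expiry means a session cookie. */
    static final class StoredCookie {
        final String name;
        final Date expires;
        StoredCookie(String name, Date expires) { this.name = name; this.expires = expires; }
        boolean isExpired(Date now) { return expires != null && !expires.after(now); }
    }

    /** A login is needed when the jar is empty or at least one cookie has expired. */
    static boolean needsLogin(List<StoredCookie> jar, Date now) {
        if (jar.isEmpty()) return true;
        for (StoredCookie cookie : jar) {
            if (cookie.isExpired(now)) return true;
        }
        return false;
    }

    /** Everything before the '?' is the address to POST to. */
    static String loginEndpoint(String loginPage) {
        int q = loginPage.indexOf('?');
        return q < 0 ? loginPage : loginPage.substring(0, q);
    }

    /** Everything after the '?' becomes the form fields of the POST body. */
    static Map<String, String> loginParameters(String loginPage) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        int q = loginPage.indexOf('?');
        if (q < 0) return params;
        for (String pair : loginPage.substring(q + 1).split("&")) {
            if (pair.isEmpty()) continue;
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        String page = "http://abc/def?username=foo&password=bar";
        System.out.println(loginEndpoint(page));
        System.out.println(loginParameters(page));
        System.out.println(needsLogin(Collections.<StoredCookie>emptyList(), new Date()));
    }
}
```

          The sketch also makes the caveat visible: a single expired cookie from any request triggers a new login, and nothing checks whether the login actually succeeded.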

          Max Dzyuba added a comment -

          Hi Jasper,

          Thanks a lot for the explanation! I applied the patch and compiled Nutch just fine, but can't confirm that it is working. Can you point me to a website where this patch successfully passed form authentication? I need to verify that it is working for me, but can't at the moment.

          Thanks in advance,
          Max

          Jasper van Veghel added a comment -

          Hi Max,

          I'm sorry, but I don't really use the patch anymore, so I wouldn't be able to tell you. We used it in conjunction with an internal SAP system that we needed to spider, so that's not a public source you could try it against. Why not write up your own quick script which lets you POST some data, sets a cookie, and then returns some specific piece of data only when that cookie is set?
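
          A quick test script of that kind could look roughly like the following, using the HTTP server bundled with the JDK. Everything here is an assumption made for illustration: the class name LoginTestServer, the /login and /secret paths, port 8080, the session=ok cookie and the foo/bar credentials. It is a sketch for local testing, not something the patch ships.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

final class LoginTestServer {

    // Hypothetical cookie and credentials, chosen only for this test.
    static final String SESSION_COOKIE = "session=ok";

    /** True when the request is a POST carrying the expected form fields. */
    static boolean credentialsMatch(String method, String body) {
        return "POST".equals(method)
            && body.contains("username=foo")
            && body.contains("password=bar");
    }

    /** True when the Cookie request header carries the session cookie. */
    static boolean hasSessionCookie(String cookieHeader) {
        return cookieHeader != null && cookieHeader.contains(SESSION_COOKIE);
    }

    static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(
            new InetSocketAddress(InetAddress.getLoopbackAddress(), port), 0);
        // POST /login sets the cookie when the credentials match.
        server.createContext("/login", exchange -> {
            String body = new String(readAll(exchange.getRequestBody()), StandardCharsets.UTF_8);
            boolean ok = credentialsMatch(exchange.getRequestMethod(), body);
            if (ok) {
                exchange.getResponseHeaders().add("Set-Cookie", SESSION_COOKIE + "; Path=/");
            }
            respond(exchange, ok ? 200 : 403, ok ? "logged in" : "login failed");
        });
        // GET /secret returns its content only when the cookie is sent back.
        server.createContext("/secret", exchange -> {
            boolean ok = hasSessionCookie(exchange.getRequestHeaders().getFirst("Cookie"));
            respond(exchange, ok ? 200 : 403, ok ? "secret content" : "not authenticated");
        });
        server.start();
        return server;
    }

    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, n);
        }
        return buffer.toByteArray();
    }

    static void respond(HttpExchange exchange, int status, String text) throws IOException {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        exchange.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = exchange.getResponseBody()) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        start(8080);
        System.out.println("Listening on http://localhost:8080/login and /secret");
    }
}
```

          With this running, the login page property would point at http://localhost:8080/login?username=foo&password=bar, and a crawl of http://localhost:8080/secret should only return "secret content" if the cookie was stored and sent back.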

          Good luck!

          Jasper

          Max Dzyuba added a comment -

          Hi Jasper,

          Thank you for the tip! I'll have to go that way then.

          Best regards,
          Max

          Max Dzyuba added a comment -

          Hi Jasper,

          I've set up a script that does just that (receives POSTed data, sets a cookie, returns some data if the cookie is set), but now I have this error in my log:

          2012-10-01 13:11:24,557 ERROR httpclient.Http - Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.
          2012-10-01 13:11:24,682 ERROR httpclient.Http - Unable to retrieve login page; code = 200

          The second line with response code 200 is what I don't understand. I'd appreciate any tips you could give in this regard.

          Thanks,
          Max

          Jasper van Veghel added a comment - edited

          Looks like a pretty sloppy mistake in the patch ..

          +      if (code == 200 && Http.LOG.isTraceEnabled()) {
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          +      } else {
          +          Http.LOG.error("Unable to retrieve login page; code = " + code);
          +      }
          

          Change that to something like ..

          +      if (code == 200 && Http.LOG.isTraceEnabled()) {
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          +      } else if (code != 200) {
          +          Http.LOG.error("Unable to retrieve login page; code = " + code);
          +      }
          

          And also change this ..

          +          LOG.error("Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.");
          

          To something like this ..

          +          LOG.error("Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.", e);
          

          To see where the Exception is coming from. All it does after that LOG.error() is release the connection. So it shouldn't be throwing an Exception.
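
          To make the branching mistake concrete, here is a small standalone sketch (the class LoginStatusLogging and its Outcome enum are hypothetical, not part of the patch) that compares which log call each version reaches. With the original condition, a successful response with TRACE logging disabled falls through to the error branch, which would explain an "Unable to retrieve login page; code = 200" message.

```java
final class LoginStatusLogging {

    enum Outcome { ERROR, TRACE, SILENT }

    /** Branching as in the original patch. */
    static Outcome original(int code, boolean traceEnabled) {
        if (code == 200 && traceEnabled) {
            return Outcome.TRACE;
        }
        return Outcome.ERROR; // also reached for code == 200 when TRACE is off
    }

    /** Branching after the suggested change. */
    static Outcome corrected(int code, boolean traceEnabled) {
        if (code == 200 && traceEnabled) {
            return Outcome.TRACE;
        } else if (code != 200) {
            return Outcome.ERROR;
        }
        return Outcome.SILENT; // success, nothing logged
    }

    public static void main(String[] args) {
        System.out.println(original(200, false));  // ERROR
        System.out.println(corrected(200, false)); // SILENT
        System.out.println(corrected(404, false)); // ERROR
    }
}
```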

          Max Dzyuba added a comment -

          Now I get the following error:

          2012-10-01 14:40:54,996 ERROR httpclient.Http - Cookie-based authentication failed; cookies will not be present for this request but an attempt to retrieve them will be made for the next one.
          java.lang.IllegalArgumentException: Entity enclosing requests cannot be redirected without user intervention
          at org.apache.commons.httpclient.methods.EntityEnclosingMethod.setFollowRedirects(EntityEnclosingMethod.java:225)
          at org.apache.nutch.protocol.httpclient.HttpCookieAuthentication.<init>(HttpCookieAuthentication.java:73)
          at org.apache.nutch.protocol.httpclient.Http.resolveCookieCredentials(Http.java:402)
          at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:387)
          at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:152)
          at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:440)
          at org.apache.nutch.protocol.http.api.RobotRulesParser.getRobotRulesSet(RobotRulesParser.java:425)
          at org.apache.nutch.protocol.http.api.HttpBase.getRobotRules(HttpBase.java:403)
          at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:668)

          Sorry to bug you about this...

          Thanks for your time!
          Max

          Jasper van Veghel added a comment -

          That exception looks familiar — I think that we ended up solving that simply by removing ..

          +    method.setFollowRedirects(followRedirects);
          

          As redirects are not supported for POST-requests.
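
          If the login page answers the POST with a redirect, one option is to follow it by hand with a separate GET. This is not something the thread confirms the patch does. The helper below is a hypothetical sketch that only works out the follow-up address from the status code and the Location header. Issuing the GET with the same client, so that the cookies are reused, is left to the caller.

```java
import java.net.URI;

final class PostRedirectSketch {

    /**
     * Returns the absolute URL to GET after a redirected POST,
     * or null when the response is not a 301/302/303 redirect.
     */
    static String redirectTarget(String requestUrl, int status, String location) {
        boolean redirect = status == 301 || status == 302 || status == 303;
        if (!redirect || location == null || location.isEmpty()) {
            return null;
        }
        // Location may be relative, so resolve it against the request URL.
        return URI.create(requestUrl).resolve(location).toString();
    }

    public static void main(String[] args) {
        System.out.println(redirectTarget("http://abc/def", 302, "/home"));
        System.out.println(redirectTarget("http://abc/def", 200, null));
    }
}
```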

          Max Dzyuba added a comment -

          Hi Jasper,

          Thanks, removing that line fixed the exception problem.
          At the moment, the log file doesn't have any errors related to HTTPclient plugin or authentication process. However, my tests show that the cookie can't be read by the test auth page I've set up.

          Is there an easy way to verify if the cookie was created by Nutch and stored as intended?

          Thanks,
          Max

          Jasper van Veghel added a comment -
          +        Http.LOG.trace("url: " + url +
          +            "; status code: " + code +
          +            "; cookies received: " + Http.getClient().getState().getCookies().length);
          

          If you turn on TRACE logging, you should see messages like that.
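
          Assuming a Nutch 1.x setup that is configured through log4j 1.x, TRACE output for the plugin could be enabled with a line like the one below. The file location is an assumption. The logger name is derived from the package shown in the stack trace earlier in this thread.

```properties
# conf/log4j.properties (assumed location; adjust to your installation)
log4j.logger.org.apache.nutch.protocol.httpclient=TRACE
```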

          Max Dzyuba added a comment -

          Thank you, Jasper. I did just that and now I see that cookies are received (at least in some cases).

          Do you know of any reason why I still wouldn't be able to retrieve pages that require authentication (even though I see the cookies stored)? Does it have to do with those pages returning status code "200"?

          Thanks for the help!

          yuanyun.cn added a comment -

          I was assigned a task to use Nutch 2 to crawl a website that uses form-based authentication.
          Based on Jasper's code, I made some improvements to make it work. Please see the patch: http-client-form-authtication.patch.

          To use it, first figure out how to perform the form-based login successfully with HTTP client. Chrome DevTools can be used to find the login form ID and the username and password fields, and to capture the exact POST request; some form fields may need to be removed, or some headers added.

          // Requires java.util.HashMap, HashSet, Map and Set.
          // HttpFormAuthConfigurer and HttpFormAuthentication come from the attached patch.
          private static void authTestAspWebApp() throws Exception {
            // Where the login form lives and which form on that page to submit.
            HttpFormAuthConfigurer authConfigurer = new HttpFormAuthConfigurer();
            authConfigurer.setLoginUrl("http://localhost:44444/Account/Login.aspx")
              .setLoginFormId("ctl01").setLoginRedirect(true);

            // Form fields to fill in with the credentials.
            Map<String, String> loginPostData = new HashMap<String, String>();
            loginPostData.put("ctl00$MainContent$LoginUser$UserName", "admin");
            loginPostData.put("ctl00$MainContent$LoginUser$Password", "admin123");
            authConfigurer.setLoginPostData(loginPostData);

            // Form fields that must not be sent with the POST.
            Set<String> removedFormFields = new HashSet<String>();
            removedFormFields.add("ctl00$MainContent$LoginUser$RememberMe");
            authConfigurer.setRemovedFormFields(removedFormFields);

            // Log in, then fetch a page that is only available after login.
            HttpFormAuthentication example = new HttpFormAuthentication(
              authConfigurer);
            example.login();
            String result = example
              .httpGetPageContent("http://localhost:44444/secret/needlogin.aspx");
            System.out.println(result);
          }
          

          After making the test code work, define the form authentication info in httpclient-auth.xml:

          <?xml version="1.0"?>
          <auth-configuration>
            <credentials authMethod="formAuth" loginUrl="http://localhost:44444/Account/Login.aspx" loginFormId="ctl01" loginRedirect="true">
              <loginPostData>
                <field name="ctl00$MainContent$LoginUser$UserName" value="admin"/>
                <field name="ctl00$MainContent$LoginUser$Password" value="admin123"/>
              </loginPostData>
              <removedFormFields>
                <field name="ctl00$MainContent$LoginUser$RememberMe"/>
              </removedFormFields>
            </credentials>
          </auth-configuration>
          

          Be sure to use the protocol-httpclient plugin, not protocol-http, in nutch-site.xml.
          If you are interested, you may read: http://lifelongprogrammer.blogspot.com/2014/02/part1-using-apache-http-client-to-do-http-post-form-authentication.html
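
          Switching plugins is done through the plugin.includes property. A sketch of what that might look like in nutch-site.xml follows. Only protocol-httpclient matters here: the other plugin names are illustrative, so keep whatever your installation already lists and replace protocol-http with protocol-httpclient.

```xml
<!-- nutch-site.xml: sketch only; plugin list other than protocol-httpclient is illustrative -->
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)</value>
</property>
```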


            People

            • Assignee: Unassigned
            • Reporter: Jasper van Veghel
            • Votes: 3
            • Watchers: 5
