Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: nutchgora, 1.5
    • Fix Version/s: 2.4
    • Component/s: protocol
    • Labels:
      None

      Description

      There are several issues about protocol-httpclient and several comments about rewriting the plugin with the new http client libraries. There is, however, not yet an issue for rewriting/reimplementing protocol-httpclient.

      http://hc.apache.org/httpcomponents-client-ga/

      1. Http.java
        13 kB
        Fabio Santagostino
      2. HttpResponse.java
        7 kB
        Fabio Santagostino

        Activity

        Talat UYARER added a comment -

        Hi Fabio Santagostino,

        Thanks for your contribution. Could you create a patch file? This document should help you: https://wiki.apache.org/nutch/HowToContribute

        Talat

        Fabio Santagostino added a comment -

        Hi,
        I have attempted to rewrite the component using httpclient 4.4. It works for me!
        My main goal was to use a correct implementation of NTLMv2 authentication for my corporate web sites.
        Anyway, it seems to be backward compatible with the previous implementation. Proxy support is the only part I have not tested yet.

        I only had to change two classes (attached):

        • /src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
        • /src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

        Of course, the package dependency files must be modified as well. In /ivy/ivy.xml:

        + added httpclient version 4.4

          <dependency org="org.apache.httpcomponents" name="httpclient" rev="4.4" conf="*->master" />

        + updated the codec version from

        <dependency org="commons-codec" name="commons-codec" rev="1.3" conf="*->default" />

        to

        <dependency org="commons-codec" name="commons-codec" rev="1.4" conf="*->default" />

        The attached files were tested against the v1.9 branch, but minor changes are probably needed to make them suitable for v2.3.

        Regards,
        Fabio

        Fabio Santagostino added a comment -

        Add httpclient 4.4 library

        Fabio Santagostino added a comment -

        Add httpclient 4.4 library

        Simon Zhu added a comment -

        Hi Talat/Julien/Markus,

        I tested NTCredentials in HttpComponents HttpClient 4.3.4 against a proxy server that requires NTLM authentication, and the response code was 200 OK. However, with NTCredentials from Commons HttpClient 3.1, which protocol-httpclient currently uses, the returned code was 407, which indicates that, as far as my proxy server is concerned, HttpClient 3.1 does not speak the NTLM protocol correctly. I suppose the reason is that Commons HttpClient 3.1 reached EOL in 2007, whereas the current NTLM version was released in 2008.

        Since HttpClient 4.x is not compatible with 3.1, IMHO it is not easy to address the NTLM authentication issue with a patch. I would be very happy if anyone could help develop such a patch.

        I would appreciate any advice, suggestions, or clues on the proxy server authentication issue, and I am more than happy to discuss this further.

        Regards

        Simon

        Talat UYARER added a comment -

        Hi Markus,

        Yes, I know that HttpClient is still being developed as part of Apache HttpComponents. Your second comment is very useful information for me. I actually asked because I found a small bug in protocol-http: even when http.content.limit is set, protocol-http fetches files of all sizes (larger files are fetched up to the limit).
        But at parse time, the parser skips incomplete files (the parser.skip.truncated setting). Partially fetching content larger than the limit seems like wasted effort if it is not going to be parsed.
        What do you think about this? I will upload a patch for this issue.
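
A minimal sketch of the check described above, assuming the server reports the body size in a Content-Length header before the download starts. The class and method names are hypothetical and not part of Nutch; treating a negative http.content.limit as "no limit" is also an assumption of this sketch.

```java
/** Hypothetical helper: decide whether downloading a response body is worthwhile. */
final class ContentLimitCheck {

    /**
     * @param declaredLength Content-Length reported by the server, or -1 if absent
     * @param contentLimit   value of http.content.limit (negative: no limit, assumed)
     * @param skipTruncated  value of parser.skip.truncated
     * @return false only when the body would be truncated and then skipped by the parser
     */
    static boolean worthFetching(long declaredLength, int contentLimit, boolean skipTruncated) {
        if (!skipTruncated) {
            return true; // truncated content would still be parsed
        }
        if (contentLimit < 0) {
            return true; // no limit configured, nothing gets truncated
        }
        if (declaredLength < 0) {
            return true; // size unknown, cannot decide up front
        }
        return declaredLength <= contentLimit;
    }

    public static void main(String[] args) {
        // A 2000-byte body with a 1000-byte limit would be truncated and skipped.
        System.out.println(worthFetching(2000, 1000, true));
        // The same body is still fetched when truncated content is parsed.
        System.out.println(worthFetching(2000, 1000, false));
    }
}
```

A response without a Content-Length header (for example, chunked transfer) cannot be judged this way, so the sketch falls back to fetching it.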

        Markus Jelsma added a comment -

        And to answer your question: no, I'm not working on this issue. We still manage with protocol-http and only use protocol-httpclient for TLS connections. It still works, for now.

        Markus Jelsma added a comment -

        Hi Talat - what do you mean by EOL of HttpClient? Version 4.3 was released just a few months ago. I assume you mean that Nutch's implementation of it is old; it is indeed! This issue is about completely rewriting Nutch's protocol-httpclient plugin for the most recent version of HttpClient 4.x.

        Talat UYARER added a comment -

        Markus,

        I guess httpclient is end of life. Are you doing any development on this issue?

        Ross Judson added a comment -

        The Oracle bug report # is 7129065. HttpUrlConnection-based NTLM auth to Sharepoint succeeds with JDK 6, and crashes the VM on JDK. I am investigating other solutions to this.

        Oleg Kalnichevski added a comment -

        For what it is worth to you, HttpClient users have been reporting the best NTLMv2 compatibility results when using JCIFS as an NTLM engine. The trouble is that the library is LGPL licensed and therefore may not be directly incorporated into ASF works. However, you might consider giving your users an option of hooking JCIFS up through an extension mechanism of some sort, similar to that used by HttpClient [1].

        Oleg

        [1] http://hc.apache.org/httpcomponents-client-ga/ntlm.html
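
A rough sketch of what such an extension mechanism might look like: the engine is named by class in configuration and loaded reflectively, so an LGPL library never has to ship with the ASF code. Everything here is hypothetical, including the NtlmEngineHook interface, which is far simpler than HttpClient's own NTLM engine interface and only illustrates the loading step.

```java
/** Hypothetical plug-in point; a real engine would generate NTLM messages. */
interface NtlmEngineHook {
    String name();
}

/** Fallback used when no external engine is configured or loadable. */
final class BuiltInEngine implements NtlmEngineHook {
    @Override
    public String name() {
        return "built-in";
    }
}

final class NtlmEngineLoader {

    /**
     * Loads the engine class named in configuration, if any.
     * Falls back to the built-in engine when the class is missing,
     * has no no-arg constructor, or does not implement the hook.
     */
    static NtlmEngineHook load(String className) {
        if (className == null || className.isEmpty()) {
            return new BuiltInEngine();
        }
        try {
            return Class.forName(className)
                    .asSubclass(NtlmEngineHook.class)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            // Optional dependency not on the classpath, or not a valid engine.
            return new BuiltInEngine();
        }
    }

    public static void main(String[] args) {
        // "com.example.JcifsEngine" is a made-up class name for illustration.
        System.out.println(load("com.example.JcifsEngine").name());
    }
}
```

With this shape, users who accept the LGPL terms could drop a JCIFS-backed implementation on the classpath and name it in their configuration, while the default distribution keeps working without it.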

        Ferdy Galema added a comment -

        Seems like a JVM bug; perhaps you could reproduce it using specific URLs? By the way, does anyone have an NTLMv2 example URL that is publicly accessible?

        Besides lacking NTLMv2 support, is there anything else that isn't working properly? Support for https is not entirely broken, because "https://www.iana.org/" for example can be fetched perfectly fine.

        Remi Tassing added a comment -

        With the dirty code I wrote on NTLMv2 and HttpUrlConnection, I'm having the following Java error from time to time. I believe it's due to the poor integration of my code with Nutch:

        #
        # A fatal error has been detected by the Java Runtime Environment:
        #
        #  EXCEPTION_ACCESS_VIOLATION (0xc0000005) at pc=0x762135c8, pid=7320, tid=5720
        #
        # JRE version: 7.0-b147
        # Java VM: Java HotSpot(TM) Client VM (21.0-b17 mixed mode windows-x86 )
        # Problematic frame:
        # C [Secur32.dll+0x35c8]
        #
        # Failed to write core dump. Minidumps are not enabled by default on client versions of Windows
        #
        # If you would like to submit a bug report, please visit:
        # http://bugreport.sun.com/bugreport/crash.jsp
        # The crash happened outside the Java Virtual Machine in native code.
        # See problematic frame for where to report the bug.
        #

        --------------- T H R E A D ---------------

        Current thread (0x4753a800): JavaThread "FetcherThread" daemon [_thread_in_native, id=5720, stack(0x48350000,0x483a0000)]

        siginfo: ExceptionCode=0xc0000005, reading address 0x00000010

        Registers:
        EAX=0x00000000, EBX=0x00000000, ECX=0x4839f0cc, EDX=0x02bdafe8
        ESP=0x4839f0c4, EBP=0x4839f0d4, ESI=0x002b0058, EDI=0x00000000
        EIP=0x762135c8, EFLAGS=0x00010202

        Top of Stack: (sp=0x4839f0c4)
        0x4839f0c4: 4839f0cc 65a5014e 002b0058 02bdafe8
        0x4839f0d4: 4839f0e4 6b62a15c 477c2d10 4753a928
        0x4839f0e4: 4839f180 6b62a2b1 477c2d10 477c2d00
        0x4839f0f4: 4753a800 437469b8 437469b8 052c98e8
        0x4839f104: 4839f320 025aa595 4839f354 4839f1c4
        0x4839f114: 4753a800 47657b90 47075798 470757d8
        0x4839f124: 00000000 00000001 4839f130 00000200
        0x4839f134: 00000002 4839f154 477c2d10 00000000

        Instructions: (pc=0x762135c8)
        0x762135a8: 00 e8 c2 f6 ff ff 8b f0 85 f6 74 1c 56 ff 35 54
        0x762135b8: 10 22 76 ff 15 20 11 21 76 8b 46 60 8d 4d f8 51
        0x762135c8: ff 50 10 5e c9 c2 04 00 b8 01 03 09 80 eb f4 90
        0x762135d8: 90 90 90 90 8b ff 55 8b ec 51 51 8b 45 08 8b 08

        Register to memory mapping:

        EAX=0x00000000 is an unknown value
        EBX=0x00000000 is an unknown value
        ECX=0x4839f0cc is pointing into the stack for thread: 0x4753a800
        EDX=0x02bdafe8 is an unknown value
        ESP=0x4839f0c4 is pointing into the stack for thread: 0x4753a800
        EBP=0x4839f0d4 is pointing into the stack for thread: 0x4753a800
        ESI=0x002b0058 is an unknown value
        EDI=0x00000000 is an unknown value

        Stack: [0x48350000,0x483a0000], sp=0x4839f0c4, free space=316k
        Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
        C [Secur32.dll+0x35c8] FreeCredentialsHandle+0x30
        C [net.dll+0xa15c] Java_sun_net_www_protocol_http_ntlm_NTLMAuthSequence_getCredentialsHandle+0x180
        C [net.dll+0xa2b1] Java_sun_net_www_protocol_http_ntlm_NTLMAuthSequence_getNextToken+0x137

        Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
        j sun.net.www.protocol.http.ntlm.NTLMAuthSequence.getNextToken(J[B)[B+0
        j sun.net.www.protocol.http.ntlm.NTLMAuthSequence.getAuthHeader(Ljava/lang/String;)Ljava/lang/String;+24
        j sun.net.www.protocol.http.ntlm.NTLMAuthentication.setHeaders(Lsun/net/www/protocol/http/HttpURLConnection;Lsun/net/www/HeaderParser;Ljava/lang/String;)Z+73
        j sun.net.www.protocol.http.HttpURLConnection.getServerAuthentication(Lsun/net/www/protocol/http/AuthenticationHeader;)Lsun/net/www/protocol/http/AuthenticationInfo;+760
        j sun.net.www.protocol.http.HttpURLConnection.getInputStream()Ljava/io/InputStream;+972
        j sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream()Ljava/io/InputStream;+4
        j org.apache.nutch.protocol.httpclient.HttpResponse.<init>(Lorg/apache/nutch/protocol/httpclient/Http;Ljava/net/URL;Lorg/apache/nutch/crawl/CrawlDatum;Z)V+453
        j org.apache.nutch.protocol.httpclient.Http.getResponse(Ljava/net/URL;Lorg/apache/nutch/crawl/CrawlDatum;Z)Lorg/apache/nutch/net/protocols/Response;+13
        j org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(Lorg/apache/hadoop/io/Text;Lorg/apache/nutch/crawl/CrawlDatum;)Lorg/apache/nutch/protocol/ProtocolOutput;+283
        j org.apache.nutch.fetcher.Fetcher$FetcherThread.run()V+646
        v ~StubRoutines::call_stub

        --------------- P R O C E S S ---------------

        Java Threads: ( => current thread )
        0x4708ac00 JavaThread "Thread-27" daemon [_thread_in_native, id=6032, stack(0x48cf0000,0x48d40000)]
        0x4708a400 JavaThread "MultiThreadedHttpConnectionManager cleanup" daemon [_thread_blocked, id=4920, stack(0x48450000,0x484a0000)]
        0x4708a000 JavaThread "FetcherThread" daemon [_thread_blocked, id=6244, stack(0x01210000,0x01260000)]
        0x47089800 JavaThread "FetcherThread" daemon [_thread_blocked, id=7148, stack(0x483a0000,0x483f0000)]
        =>0x4753a800 JavaThread "FetcherThread" daemon [_thread_in_native, id=5720, stack(0x48350000,0x483a0000)]
        0x4753a000 JavaThread "FetcherThread" daemon [_thread_blocked, id=7808, stack(0x48200000,0x48250000)]
        0x47539c00 JavaThread "FetcherThread" daemon [_thread_blocked, id=6348, stack(0x47300000,0x47350000)]
        0x47539000 JavaThread "FetcherThread" daemon [_thread_blocked, id=4668, stack(0x47410000,0x47460000)]
        0x470b7800 JavaThread "FetcherThread" daemon [_thread_blocked, id=4424, stack(0x480d0000,0x48120000)]
        0x4764c000 JavaThread "FetcherThread" daemon [_thread_blocked, id=1600, stack(0x48140000,0x48190000)]
        0x4764b800 JavaThread "FetcherThread" daemon [_thread_blocked, id=4476, stack(0x47b20000,0x47b70000)]
        0x4764b400 JavaThread "FetcherThread" daemon [_thread_blocked, id=8000, stack(0x47350000,0x473a0000)]
        0x4767ac00 JavaThread "SpillThread" daemon [_thread_blocked, id=5708, stack(0x47bd0000,0x47c20000)]
        0x47689400 JavaThread "communication thread" daemon [_thread_blocked, id=4976, stack(0x47260000,0x472b0000)]
        0x4711e800 JavaThread "Thread-11" [_thread_blocked, id=6608, stack(0x478d0000,0x47920000)]
        0x47089000 JavaThread "Service Thread" daemon [_thread_blocked, id=3652, stack(0x00b30000,0x00b80000)]
        0x4706c400 JavaThread "C1 CompilerThread0" daemon [_thread_blocked, id=5272, stack(0x473c0000,0x47410000)]
        0x4706b000 JavaThread "Attach Listener" daemon [_thread_blocked, id=3568, stack(0x00b90000,0x00be0000)]
        0x47069c00 JavaThread "Signal Dispatcher" daemon [_thread_blocked, id=6512, stack(0x472b0000,0x47300000)]
        0x0240f000 JavaThread "Finalizer" daemon [_thread_blocked, id=4252, stack(0x01260000,0x012b0000)]
        0x0240c800 JavaThread "Reference Handler" daemon [_thread_blocked, id=7492, stack(0x00aa0000,0x00af0000)]
        0x00b1dc00 JavaThread "main" [_thread_blocked, id=2896, stack(0x00820000,0x00870000)]

        Other Threads:
        0x02407800 VMThread [stack: 0x46fc0000,0x47010000] [id=8048]
        0x4709b000 WatcherThread [stack: 0x47460000,0x474b0000] [id=6908]

        VM state:not at safepoint (normal execution)

        VM Mutex/Monitor currently owned by a thread: None

        Heap
        def new generation total 81664K, used 13803K [0x045a0000, 0x09e30000, 0x192f0000)
        eden space 72640K, 19% used [0x045a0000, 0x0531ada0, 0x08c90000)
        from space 9024K, 0% used [0x08c90000, 0x08c90000, 0x09560000)
        to space 9024K, 0% used [0x09560000, 0x09560000, 0x09e30000)
        tenured generation total 181236K, used 108739K [0x192f0000, 0x243ed000, 0x42da0000)
        the space 181236K, 59% used [0x192f0000, 0x1fd20ff8, 0x1fd21000, 0x243ed000)
        compacting perm gen total 12288K, used 10197K [0x42da0000, 0x439a0000, 0x46da0000)
        the space 12288K, 82% used [0x42da0000, 0x43795548, 0x43795600, 0x439a0000)
        No shared spaces configured.

        Code Cache [0x025a0000, 0x027c8000, 0x045a0000)
        total_blobs=1160 nmethods=977 adapters=115 free_code_cache=30587Kb largest_free_block=31319360

        Dynamic libraries:
        ...

        VM Arguments:
        ...
        Launcher Type: SUN_STANDARD

        Environment Variables:
        ...

        --------------- S Y S T E M ---------------
        ...
        elapsed time: 336 seconds

        Remi Tassing added a comment -

        For the NTLMv2 issue I used a dirty solution in HttpResponse.java. Inside the constructor, after the getResponseBodyAsStream() attempt:
        1. I check whether the result code is 500 (inside finally {...})
        2. I use HttpUrlConnection to authenticate and open a connection
        3. Then I read the InputStream, get the content, and change the code to 200

        The problems with that solution are:
        1. The authentication keys are hardcoded
        2. It doesn't check whether the content is valid, but sets the return code to 200 anyway
        3. Error code 500 doesn't necessarily mean that it's an NTLMv2 authentication problem

        I have no idea how to write patches against the "trunk"...

        Remi
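
One way the third problem might be narrowed down, sketched under the assumption that the server or proxy announces NTLM in its challenge header (WWW-Authenticate on a 401, Proxy-Authenticate on a 407). The class and method names are hypothetical and not part of Nutch. The challenge does not say whether NTLMv1 or NTLMv2 is required, and a Negotiate challenge is not treated as NTLM here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Hypothetical helper: detect an NTLM challenge instead of guessing from a 500. */
final class NtlmChallengeCheck {

    /**
     * @param status      HTTP status code of the response
     * @param authHeaders values of WWW-Authenticate (for 401) or
     *                    Proxy-Authenticate (for 407); may be null
     * @return true if the response is an authentication challenge offering NTLM
     */
    static boolean isNtlmChallenge(int status, List<String> authHeaders) {
        if (status != 401 && status != 407) {
            return false; // not an authentication challenge at all
        }
        if (authHeaders == null) {
            return false;
        }
        for (String header : authHeaders) {
            if (header == null) {
                continue;
            }
            String value = header.trim().toLowerCase(Locale.ROOT);
            // Either the bare scheme name or the scheme followed by a token.
            if (value.equals("ntlm") || value.startsWith("ntlm ")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isNtlmChallenge(401, Arrays.asList("Negotiate", "NTLM")));
        System.out.println(isNtlmChallenge(500, Arrays.asList("NTLM")));
    }
}
```

A fallback guarded by a check like this would only run when the server actually asks for NTLM, and the credentials could then be read from configuration rather than hardcoded.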

        Lewis John McGibbney added a comment -

        When trying to access some SharePoint(IIS) website using NTLMv2 authentication, Nutch fails and gets an error code 500. HttpClient only supports an early version of NTLM but not NTLMv2. HttpUrlConnection can be used instead.

        [1] http://oaklandsoftware.com/papers/ntlm.html
        [2] http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html

        Aravind Srini added a comment -

        Thanks, Oleg, for pitching in and confirming the right thing.

        Meanwhile, SOLR-2727 has been logged independently to upgrade Solr to the httpclient 4.x codeline.

        Oleg Kalnichevski added a comment -

        The 4.1.3 release of HttpCore patched a regression affecting non-blocking (NIO) SSL transports only. There have been no changes between 4.1.2 and 4.1.3 releases in blocking transport components relevant for HttpClient.

        Please let me know if you need any help migrating off HttpClient 3.1 to HttpClient 4.1.x.

        Oleg

        Aravind Srini added a comment -

        Some transitive dependencies:

        • Solr 3.1.0 seems to depend on commons-httpclient 3.1.

        I started a separate email thread with the Solr community ("solr - httpclient from 3.x to 4.1.x") to open it up for discussion.

        • Hadoop 0.20.2 depends on commons-httpclient 3.0.1 as well.

        Also, httpclient 4.1.2 depends on httpcore 4.1.2, but there seems to have been an emergency release of httpcore 4.1.3 (with no corresponding re-release of httpclient), so both need to be declared explicitly in ivy.xml (or pom.xml).

        Ken Krugler added a comment -

        For what it's worth, there's a SimpleHttpFetcher in crawler-commons that uses HttpClient 4.1.

        Markus Jelsma added a comment -

        Preferably the 4.1.x version. Nutch still uses the deprecated 3.x and there are a lot of issues to be resolved such as HTTPS support.

        Aravind Srini added a comment -

        Are we talking about httpclient 4.0.1 ?


          People

          • Assignee:
            Fabio Santagostino
            Reporter:
            Markus Jelsma
          • Votes:
            2
            Watchers:
            8
