
Key: NUTCH-91
Type: Bug
Status: Closed
Resolution: Fixed
Priority: Major
Assignee: Unassigned
Reporter: Michael Nebel
Votes: 0
Watchers: 0
Nutch

empty encoding causes exception

Created: 10/Sep/05 09:35 PM   Updated: 10/Mar/06 05:17 AM
Component/s: None
Affects Version/s: 0.8
Fix Version/s: 0.7.2, 0.8

Time Tracking:
Not Specified

Resolution Date: 10/Mar/06 05:17 AM


Description
I found some sites where the response header says "Content-Type: text/html; charset=", with an empty charset value. This causes an exception in the HtmlParser. My suggestion:

Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy)
@@ -120,7 +120,7 @@
 byte[] contentInOctets = content.getContent();
 InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
 String encoding = StringUtil.parseCharacterEncoding(contentType);
-if (encoding!=null) {
+if (encoding!=null && !"".equals(encoding)) {
 metadata.put("OriginalCharEncoding", encoding);
 if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
 metadata.put("CharEncodingForConversion", encoding);
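
The guard in the patch can be tried out in isolation with a small standalone sketch. Everything here is an assumption made for illustration: the class EmptyCharsetDemo and the helpers parseCharset and resolveOrNull are hypothetical stand-ins, not Nutch's StringUtil methods, and the report does not say which exception HtmlParser actually threw. The sketch only shows that a header ending in "charset=" yields an empty string rather than null, so a plain null check lets it through to code that expects a real charset name.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyCharsetDemo {

    // Matches the charset parameter of a Content-Type value; the value may be empty.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*([^;]*)", Pattern.CASE_INSENSITIVE);

    // Hypothetical stand-in for a header parser: returns null when there is no
    // charset parameter, and "" when the parameter is present but empty.
    static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).trim() : null;
    }

    // Applies the same test as the patch (non-null and non-empty) before
    // asking the JDK to resolve the name; returns null if nothing usable.
    static Charset resolveOrNull(String encoding) {
        if (encoding == null || "".equals(encoding)) {
            return null;
        }
        try {
            return Charset.forName(encoding);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String encoding = parseCharset("text/html; charset=");
        System.out.println("parsed: \"" + encoding + "\"");
        System.out.println("resolved: " + resolveOrNull(encoding));
    }
}
```

Without the empty-string test, the empty name would be passed on to the charset lookup; in the JDK, Charset.forName("") rejects it with an IllegalCharsetNameException.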


Piotr Kosiorowski made changes - 10/Mar/06 05:17 AM
Field            Original Value   New Value
Fix Version/s    (none)           0.7.2-dev [ 12310360 ]
Status           Open [ 1 ]       Closed [ 6 ]
Fix Version/s    (none)           0.8-dev [ 12310224 ]
Resolution       (none)           Fixed [ 1 ]