
| Key: | NUTCH-91 |
| Type: | Bug |
| Status: | Closed |
| Resolution: | Fixed |
| Priority: | Major |
| Assignee: | Unassigned |
| Reporter: | Michael Nebel |
| Votes: | 0 |
| Watchers: | 0 |
| Resolution Date: | 10/Mar/06 05:17 AM |
Description
|
I found some sites where the header says "Content-Type: text/html; charset=", i.e. the charset parameter is present but its value is empty. This causes an exception in the HtmlParser. My suggestion:
Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (revision 279397)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java (working copy)
@@ -120,7 +120,7 @@
     byte[] contentInOctets = content.getContent();
     InputSource input = new InputSource(new ByteArrayInputStream(contentInOctets));
     String encoding = StringUtil.parseCharacterEncoding(contentType);
-    if (encoding!=null) {
+    if (encoding!=null && !"".equals(encoding)) {
       metadata.put("OriginalCharEncoding", encoding);
       if ((encoding = StringUtil.resolveEncodingAlias(encoding)) != null) {
         metadata.put("CharEncodingForConversion", encoding);
|
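For illustration only, the guard in the patch can be reproduced outside Nutch. The sketch below is a standalone approximation, not Nutch's `StringUtil.parseCharacterEncoding`: the class `CharsetGuard` and the method `parseCharset` are hypothetical names, and the parsing rules (split on `;`, optional double quotes) are assumptions. It shows one way to treat an empty `charset=` value the same as a missing one, so that callers only need a null check.

```java
import java.util.Locale;

// Hypothetical helper, not part of Nutch: extracts the charset parameter
// from a Content-Type header value and treats an empty value as absent.
class CharsetGuard {

    /** Returns the charset value, or null when it is missing or empty. */
    public static String parseCharset(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String param = part.trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Assumption: strip one pair of surrounding double quotes if present.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1).trim();
                }
                // Same intent as the patch: an empty string counts as "no encoding".
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(parseCharset("text/html; charset="));      // prints "null"
        System.out.println(parseCharset("text/html; charset=UTF-8")); // prints "UTF-8"
        System.out.println(parseCharset("text/html"));                // prints "null"
    }
}
```

Normalizing the empty value to null at parse time is a design choice that differs from the patch, which leaves the parser untouched and adds the empty-string check at the call site.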
made changes - 10/Mar/06 05:17 AM
| Field | Original Value | New Value |
| Fix Version/s | | 0.7.2-dev [ 12310360 ] |
| Status | Open [ 1 ] | Closed [ 6 ] |
| Fix Version/s | | 0.8-dev [ 12310224 ] |
| Resolution | | Fixed [ 1 ] |