Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Fixed
0.9.0
-
None
-
None
Description
During parsing, some URLs (PDFs in particular, it seems) can produce <some_key, null> pairs in ParseData's parseMeta.
When Metadata.write() tries to write such a pair, it throws a NullPointerException.
The stack trace looks something like this:
at org.apache.hadoop.io.Text.encode(Text.java:373)
at org.apache.hadoop.io.Text.encode(Text.java:354)
at org.apache.hadoop.io.Text.writeString(Text.java:394)
at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)
I can consistently reproduce this with the following URL:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf
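
The failure mode can be sketched without Nutch or Hadoop on the classpath. The class below is a hypothetical stand-in, not the Nutch Metadata class and not necessarily the fix that was applied: it uses java.io.DataOutputStream.writeUTF() in place of Text.writeString(), since both fail when handed a null string. The names MetadataWriteSketch, writeNaive and writeSkippingNulls are assumptions made for illustration, and skipping null values is only one possible guard (preventing the parser from storing nulls in the first place would be another).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a metadata container; NOT the Nutch Metadata class.
class MetadataWriteSketch {

    // Naive serialization: writes every pair as-is, so a null value
    // reaches writeUTF() and triggers a NullPointerException.
    static byte[] writeNaive(Map<String, String> meta) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(meta.size());
            for (Map.Entry<String, String> e : meta.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue()); // NPE when the value is null
            }
        }
        return bytes.toByteArray();
    }

    // Defensive serialization (an assumed guard): only pairs with a
    // non-null value are counted and written.
    static byte[] writeSkippingNulls(Map<String, String> meta) throws IOException {
        int count = 0;
        for (String value : meta.values()) {
            if (value != null) {
                count++;
            }
        }
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(count);
            for (Map.Entry<String, String> e : meta.entrySet()) {
                if (e.getValue() == null) {
                    continue; // drop <some_key, null> pairs
                }
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // LinkedHashMap permits null values, mimicking the bad parseMeta entry.
        Map<String, String> meta = new LinkedHashMap<>();
        meta.put("Content-Type", "application/pdf");
        meta.put("some_key", null);

        try {
            writeNaive(meta);
        } catch (NullPointerException e) {
            System.out.println("naive write failed with a NullPointerException");
        }
        System.out.println("null-safe write produced "
                + writeSkippingNulls(meta).length + " bytes");
    }
}
```

Skipping the pair on write keeps the serialized count consistent with the number of entries actually written, so a reader that trusts the count does not run past the end of the record.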