Issue Details (XML | Word | Printable)

Key: NUTCH-406
Type: Bug Bug
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Chris A. Mattmann
Reporter: Doğacan Güney
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Nutch

Metadata tries to write null values

Created: 23/Nov/06 01:25 PM   Updated: 23/Nov/06 05:19 PM
Return to search
Component/s: None
Affects Version/s: 0.9.0
Fix Version/s: 0.9.0

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works NUTCH-406.patch 2006-11-23 04:16 PM Doğacan Güney 0.6 kB
Text File Licensed for inclusion in ASF works NUTCH-406.patch 2006-11-23 01:27 PM Doğacan Güney 0.5 kB

Resolution Date: 23/Nov/06 05:17 PM


 Description  « Hide
During parsing, some urls (especially pdfs, it seems) may create <some_key, null> pairs in ParseData's parseMeta.
When Metadata.write() tries to write such a pair, it causes an NPE.

Stack trace will be something like this:
at org.apache.hadoop.io.Text.encode(Text.java:373)
at org.apache.hadoop.io.Text.encode(Text.java:354)
at org.apache.hadoop.io.Text.writeString(Text.java:394)
at org.apache.nutch.metadata.Metadata.write(Metadata.java:214)

I can consistently reproduce this using the following url:
http://www.efesbev.com/corporate_governance/pdf/MergerAgreement.pdf



 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Doğacan Güney added a comment - 23/Nov/06 01:27 PM
A simple patch that writes nulls as empty strings.

Andrzej Bialecki added a comment - 23/Nov/06 03:57 PM
Null value is not equivalent to an empty String - perhaps we should simply skip such values.

Doğacan Güney added a comment - 23/Nov/06 04:16 PM
How about something like this then?

Chris A. Mattmann added a comment - 23/Nov/06 04:25 PM
Hi Andrzej, Do?acan,

+1. I think it makes a lot of sense to just not include the null key in the Met container. Do?acan, in the future, when you attach a new version of a patch for a JIRA issue, please indicate the change by renaming the patch. Not a big deal, but good style points

I'll commit this patch shortly.

Cheers,
Chris


Andrzej Bialecki added a comment - 23/Nov/06 04:43 PM
Erhm, -1 from me. This code checks only if the first value is null, and then discards all other values (which may be non-null), thus we could lose valuable data if only the first value happens to be null ...

I think we should indeed check if the first value is null, but then if it is then loop over all other values, count non-nulls, and if the count > 0 then write out the <key, <non-null values>> set.


Chris A. Mattmann added a comment - 23/Nov/06 04:47 PM
Hi Do?acan,

Loooking at your latest patch, I'm not sure that it completely does the right behavior. For example, what happens if there are 3 met values for a key k, and one of them is null, but the other 2 are not? Specifically, what if the first value is null, but the other 2 are not. In that case, your patch would skip over writing all of the keys. Wouldn't it just be easier to do something like this?

Index: src/java/org/apache/nutch/metadata/Metadata.java
===================================================================
— src/java/org/apache/nutch/metadata/Metadata.java (revision 478613)
+++ src/java/org/apache/nutch/metadata/Metadata.java (working copy)
@@ -211,7 +211,9 @@
values = getValues(names[i]);
out.writeInt(values.length);
for (int j = 0; j < values.length; j++) {

  • Text.writeString(out, values[j]);
    + if(values[j] != null && !values[j].equals("")){ + Text.writeString(out, values[j]); + }
    }
    }
    }

Chris A. Mattmann added a comment - 23/Nov/06 04:49 PM
Hi Andrzej,

Yup, you caught the same thing as me. +1 for your solution. I will extend my above patch by writing getNumNonNullValues(values) instead of values.length.

Cheers,
Chris


Chris A. Mattmann added a comment - 23/Nov/06 05:17 PM
Fix applied and tested in trunk.

Chris A. Mattmann added a comment - 23/Nov/06 05:19 PM