[NUTCH-1591] Incorrect conversion of ByteBuffer to String - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Critical
Resolution: Fixed
Affects Version/s: 2.2
Fix Version/s: 2.2.1
Component/s: crawldb, indexer, parser, storage
Labels: None
Environment: Mac O/S 10.8.4, JDK 1.6.0_51
Patch Info: Patch Available

Description

There are many occurrences of the following ByteBuffer-to-String conversion throughout the Nutch codebase:

ByteBuffer buf = ...;
return new String(buf.array());

This approach assumes that the ByteBuffer and its underlying array are aligned (i.e. ByteBuffer.arrayOffset() is equal to 0, ByteBuffer.position() is 0, and the length of the underlying array is the same as ByteBuffer.remaining()). That assumption often does not hold. The correct way to convert a ByteBuffer to a String (or stream thereof) is the following:

ByteBuffer buf = ...;
return new String(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
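For illustration, here is a minimal, self-contained sketch of the difference between the two conversions. The class name, the helper names (naive, correct, asStream) and the sample data are hypothetical and are not taken from the attached patch. The explicit UTF-8 charset is also an assumption; the snippets above rely on the platform default charset.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class ByteBufferToString {
    // Assumption: the bytes are UTF-8 encoded text.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Problematic pattern: decodes the whole backing array and
    // ignores arrayOffset(), position() and remaining().
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), UTF8);
    }

    // Decodes only the bytes the buffer actually exposes.
    static String correct(ByteBuffer buf) {
        return new String(buf.array(),
                buf.arrayOffset() + buf.position(),
                buf.remaining(), UTF8);
    }

    // Same idea when a stream over the buffer's bytes is needed.
    static ByteArrayInputStream asStream(ByteBuffer buf) {
        return new ByteArrayInputStream(buf.array(),
                buf.arrayOffset() + buf.position(),
                buf.remaining());
    }

    public static void main(String[] args) {
        // One larger array holding several "columns"; the buffer is a
        // view on the middle one only (7 bytes starting at index 5).
        byte[] backing = "col1|content|col3".getBytes(UTF8);
        ByteBuffer view = ByteBuffer.wrap(backing, 5, 7).slice();

        System.out.println(naive(view));   // prints "col1|content|col3"
        System.out.println(correct(view)); // prints "content"
    }
}
```

Both helpers assume an array-backed buffer: ByteBuffer.array() throws for direct or read-only buffers, so checking hasArray() first may be appropriate in code that can receive those.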

I noticed this bug when using Nutch with Cassandra. In most cases, the parsed content contains data from other columns (as well as garbage content), because the Cassandra client library returns ByteBuffers that are views on top of a larger byte[]. Others appear to have hit this as well:

http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop

I've attached a patch based on the release-2.2 tag of the 2.x branch on GitHub:

https://github.com/apache/nutch/tree/release-2.2

Attachments

NUTCH-1591.patch, 21/Jun/13 19:56, 32 kB, Jason Howes
NUTCH-1591.zip, 21/Jun/13 19:45, 87 kB, Jason Howes
Nutch1591Test.java, 24/Jun/13 23:32, 7 kB, Jason Howes

People

Assignee: Unassigned
Reporter: Jason Howes
Votes: 1
Watchers: 6

Dates

Created: 21/Jun/13 18:20
Updated: 11/Oct/19 15:36
Resolved: 27/Jun/13 17:07