  1. Nutch
  2. NUTCH-1591

Incorrect conversion of ByteBuffer to String

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.2.1
    • Component/s: crawldb, indexer, parser, storage
    • Labels:
      None
    • Environment:

      Mac O/S 10.8.4, JDK 1.6.0_51

    • Patch Info:
      Patch Available

      Description

      There are many occurrences of the following ByteBuffer-to-String conversion throughout the Nutch codebase:

      ByteBuffer buf = ...;
      return new String(buf.array());
      

      This approach assumes that the ByteBuffer and its underlying array are aligned (i.e. ByteBuffer.arrayOffset() and ByteBuffer.position() are both 0, and the length of the underlying array equals ByteBuffer.remaining()). This often does not hold. The correct way to convert a ByteBuffer to a String (or a stream thereof) is the following:

      ByteBuffer buf = ...;
      return new String(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
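
The difference between the two conversions can be reproduced with a small standalone program. This is only a sketch: the class name, the helper names (naive, correct, viewOf) and the sample data are made up for illustration and are not taken from the attached patch. It also passes an explicit UTF-8 charset, which is an assumption; the snippets above rely on the platform default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

public class ByteBufferToStringDemo {
    // Assumption: the bytes are UTF-8 text.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    // The problematic pattern: decodes the whole backing array.
    static String naive(ByteBuffer buf) {
        return new String(buf.array(), UTF8);
    }

    // The conversion from the description: decodes only the bytes
    // between position and limit of the buffer.
    static String correct(ByteBuffer buf) {
        return new String(buf.array(),
                buf.arrayOffset() + buf.position(),
                buf.remaining(),
                UTF8);
    }

    // Hypothetical helper: builds a view of `length` bytes starting at
    // `offset` in a larger array. slice() yields a buffer whose
    // arrayOffset() is non-zero.
    static ByteBuffer viewOf(byte[] backing, int offset, int length) {
        return ByteBuffer.wrap(backing, offset, length).slice();
    }

    public static void main(String[] args) {
        byte[] backing = "key:value:other".getBytes(UTF8);
        ByteBuffer view = viewOf(backing, 4, 5);

        System.out.println(naive(view));   // prints "key:value:other"
        System.out.println(correct(view)); // prints "value"
    }
}
```

Both helpers require an array-backed buffer (hasArray() returns true); array() throws for direct or read-only buffers.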
      

      I noticed this bug when using Nutch with Cassandra. In most cases, the parsed content contains data from other columns (as well as garbage content) because the Cassandra client library returns ByteBuffers that are views on top of a larger byte[]. Others appear to have hit this as well:

      http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop
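
The same offset arithmetic applies to the "stream thereof" case mentioned in the description, when the buffer is a view on top of a larger byte[]. The following is a minimal sketch under that assumption; the class and method names are hypothetical and do not come from Nutch or the Cassandra client.

```java
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;

public class ByteBufferStreamSketch {
    // Problematic: the stream exposes the entire backing array,
    // including bytes that belong to neighbouring values.
    static ByteArrayInputStream naiveStream(ByteBuffer buf) {
        return new ByteArrayInputStream(buf.array());
    }

    // Streams only the bytes between the buffer's position and limit.
    static ByteArrayInputStream toStream(ByteBuffer buf) {
        return new ByteArrayInputStream(buf.array(),
                buf.arrayOffset() + buf.position(),
                buf.remaining());
    }

    public static void main(String[] args) {
        byte[] backing = new byte[15];
        // A 5-byte view starting at offset 4 of the 15-byte array.
        ByteBuffer view = ByteBuffer.wrap(backing, 4, 5).slice();

        System.out.println(naiveStream(view).available()); // prints 15
        System.out.println(toStream(view).available());    // prints 5
    }
}
```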

      I've attached a patch based on the release-2.2 tag of the 2.x branch on GitHub:

      https://github.com/apache/nutch/tree/release-2.2

        Attachments

        1. NUTCH-1591.patch
          32 kB
          Jason Howes
        2. NUTCH-1591.zip
          87 kB
          Jason Howes
        3. Nutch1591Test.java
          7 kB
          Jason Howes

            People

            • Assignee:
              Unassigned
              Reporter:
              jasonhowes Jason Howes
            • Votes:
              1
              Watchers:
              6
