Uploaded image for project: 'Nutch'
  1. Nutch
  2. NUTCH-1591

Incorrect conversion of ByteBuffer to String

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.2
    • Fix Version/s: 2.2.1
    • Component/s: crawldb, indexer, parser, storage
    • Labels:
      None
    • Environment:

      Mac O/S 10.8.4, JDK 1.6.0_51

    • Patch Info:
      Patch Available

      Description

      There are many occurrences of the following ByteBuffer-to-String conversion throughout the Nutch codebase:

      ByteBuffer buf = ...;
      return new String(buf.array);
      

      This approach assume that the ByteBuffer and its underlying array are aligned (i.e. ByteBuffer.arrayOffset() is equal to 0 and the length of the underlying array is the same as ByteBuffer.remaining()). In many cases this is not the case. The correct way to convert a ByteBuffer to a String (or stream thereof) is the following:

      ByteBuffer buf = ...;
      return new String(buf.array(), buf.arrayOffset() + buf.position(), buf.remaining());
      

      I noticed this bug when using Nutch with Cassandra. In most cases, the parsed content contains data from other columns (as well as garbage content) since the Cassandra client library returns ByteBuffers that are views on top of a larger byte[]. It also seems that others have hit this as well:

      http://grokbase.com/p/nutch/user/132jnq8s4r/slow-parse-on-hadoop

      I've attached a patch based on the release-2.2 tag of the 2.x branch on GitHub:

      https://github.com/apache/nutch/tree/release-2.2

      1. Nutch1591Test.java
        7 kB
        Jason Howes
      2. NUTCH-1591.patch
        32 kB
        Jason Howes
      3. NUTCH-1591.zip
        87 kB
        Jason Howes

        Activity

        Hide
        hudson Hudson added a comment -

        Integrated in Nutch-nutchgora #664 (See https://builds.apache.org/job/Nutch-nutchgora/664/)
        NUTCH-1591 Incorrect conversion of ByteBuffer to String (Revision 1497447)

        Result = SUCCESS
        lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1497447
        Files :

        • /nutch/branches/2.x/CHANGES.txt
        • /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/MD5Signature.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/SignatureComparator.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexUtil.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserChecker.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/storage/Host.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/util/Bytes.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/util/EncodingDetector.java
        • /nutch/branches/2.x/src/java/org/apache/nutch/util/StringUtil.java
        • /nutch/branches/2.x/src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java
        • /nutch/branches/2.x/src/plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java
        • /nutch/branches/2.x/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java
        • /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java
        • /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java
        • /nutch/branches/2.x/src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/TestHTMLLanguageParser.java
        • /nutch/branches/2.x/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagIndexingFilter.java
        • /nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
        • /nutch/branches/2.x/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
        • /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
        • /nutch/branches/2.x/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
        • /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java
        • /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java
        Show
        hudson Hudson added a comment - Integrated in Nutch-nutchgora #664 (See https://builds.apache.org/job/Nutch-nutchgora/664/ ) NUTCH-1591 Incorrect conversion of ByteBuffer to String (Revision 1497447) Result = SUCCESS lewismc : http://svn.apache.org/viewvc/nutch/branches/2.x/?view=rev&rev=1497447 Files : /nutch/branches/2.x/CHANGES.txt /nutch/branches/2.x/src/java/org/apache/nutch/api/DbReader.java /nutch/branches/2.x/src/java/org/apache/nutch/crawl/DbUpdateReducer.java /nutch/branches/2.x/src/java/org/apache/nutch/crawl/MD5Signature.java /nutch/branches/2.x/src/java/org/apache/nutch/crawl/SignatureComparator.java /nutch/branches/2.x/src/java/org/apache/nutch/crawl/WebTableReader.java /nutch/branches/2.x/src/java/org/apache/nutch/indexer/IndexUtil.java /nutch/branches/2.x/src/java/org/apache/nutch/parse/ParserChecker.java /nutch/branches/2.x/src/java/org/apache/nutch/storage/Host.java /nutch/branches/2.x/src/java/org/apache/nutch/util/Bytes.java /nutch/branches/2.x/src/java/org/apache/nutch/util/EncodingDetector.java /nutch/branches/2.x/src/java/org/apache/nutch/util/StringUtil.java /nutch/branches/2.x/src/plugin/creativecommons/src/java/org/creativecommons/nutch/CCIndexingFilter.java /nutch/branches/2.x/src/plugin/creativecommons/src/test/org/creativecommons/nutch/TestCCParseFilter.java /nutch/branches/2.x/src/plugin/index-basic/src/java/org/apache/nutch/indexer/basic/BasicIndexingFilter.java /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/HTMLLanguageParser.java /nutch/branches/2.x/src/plugin/language-identifier/src/java/org/apache/nutch/analysis/lang/LanguageIndexingFilter.java /nutch/branches/2.x/src/plugin/language-identifier/src/test/org/apache/nutch/analysis/lang/TestHTMLLanguageParser.java /nutch/branches/2.x/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag/RelTagIndexingFilter.java /nutch/branches/2.x/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java /nutch/branches/2.x/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java /nutch/branches/2.x/src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java /nutch/branches/2.x/src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java /nutch/branches/2.x/src/test/org/apache/nutch/crawl/TestInjector.java /nutch/branches/2.x/src/test/org/apache/nutch/fetcher/TestFetcher.java
        Hide
        lewismc Lewis John McGibbney added a comment -

        Committed @revision 1497447 in 2.x head
        Thank you v much for the patch Jason and the insight.

        Show
        lewismc Lewis John McGibbney added a comment - Committed @revision 1497447 in 2.x head Thank you v much for the patch Jason and the insight.
        Hide
        jasonhowes Jason Howes added a comment -

        Got it. Thanks for the background. I'd definitely support committing the patch. I've been testing quite a bit and haven't been able to recreate the issue, and I'd rather go forward with an official build rather than my hacked version.

        Show
        jasonhowes Jason Howes added a comment - Got it. Thanks for the background. I'd definitely support committing the patch. I've been testing quite a bit and haven't been able to recreate the issue, and I'd rather go forward with an official build rather than my hacked version.
        Hide
        lewismc Lewis John McGibbney added a comment -

        In Gora, with all data stores, we store everything as Bytes. This was not always the case and I hope that this provides some more context. Lets just say that in 'some' data stores, previously we stored data types 'as they were' instead of converting everything to Bytes.
        As you said, we saw the light and now everything is persisted down into Bytes within Gora. This directly delegates the responsibility and task of dealing with the Bytes to the client code.
        If he wishes Renato can chime in here with some more commentary. He explained this very well at CassandraSummit a couple weeks back when discussing and explaining some changes to gora-cassandra which are now in 0.3.
        I would like to commit this patch tomorrow. I've been running since last Friday and I have no quims.
        Thanks.

        Show
        lewismc Lewis John McGibbney added a comment - In Gora, with all data stores, we store everything as Bytes. This was not always the case and I hope that this provides some more context. Lets just say that in 'some' data stores, previously we stored data types 'as they were' instead of converting everything to Bytes. As you said, we saw the light and now everything is persisted down into Bytes within Gora. This directly delegates the responsibility and task of dealing with the Bytes to the client code. If he wishes Renato can chime in here with some more commentary. He explained this very well at CassandraSummit a couple weeks back when discussing and explaining some changes to gora-cassandra which are now in 0.3. I would like to commit this patch tomorrow. I've been running since last Friday and I have no quims. Thanks.
        Hide
        jasonhowes Jason Howes added a comment -

        Interesting. I guess it comes down to how and when the metadata value in question is used. It seems to me there was a conscientious decision to leave the deserialization (for lack of a better word) of various WebPage attributes up to the consumer. This makes sense to me, as it allows one to simply "pass through" the binary values until they need to be interpreted, thus reducing overall CPU utilization. I'm sure I'm missing some context as I've just started using Nutch, but I assume the fix for NUTCH-1511 is to decode the CASH_KEY value using Bytes.toFloat() after retrieving it from the Gora store?

        Show
        jasonhowes Jason Howes added a comment - Interesting. I guess it comes down to how and when the metadata value in question is used. It seems to me there was a conscientious decision to leave the deserialization (for lack of a better word) of various WebPage attributes up to the consumer. This makes sense to me, as it allows one to simply "pass through" the binary values until they need to be interpreted, thus reducing overall CPU utilization. I'm sure I'm missing some context as I've just started using Nutch, but I assume the fix for NUTCH-1511 is to decode the CASH_KEY value using Bytes.toFloat() after retrieving it from the Gora store?
        Hide
        lewismc Lewis John McGibbney added a comment -

        I think what you've described is the fix for NUTCH-1511
        Never mind the fact that it discusses MySQL, the behavior is consistent for other Gora stores as well where the metadata field is overwritten and populated with some incorrect chars when scoring-opic is used.

        Show
        lewismc Lewis John McGibbney added a comment - I think what you've described is the fix for NUTCH-1511 Never mind the fact that it discusses MySQL, the behavior is consistent for other Gora stores as well where the metadata field is overwritten and populated with some incorrect chars when scoring-opic is used.
        Hide
        jasonhowes Jason Howes added a comment -

        Hi Lewis,

        This looks correct to me. It appears the value for the CASH_KEY isn't a String, but rather, a float converted to a byte array. If you'd like to reconstruct the float value, you need to do something like:

        ByteBuffer buf = row.getFromMetadata(CASH_KEY);
        float cash = Bytes.toFloat(buf.array(), buf.arrayOffset() + buf.position());
        

        Jason

        Show
        jasonhowes Jason Howes added a comment - Hi Lewis, This looks correct to me. It appears the value for the CASH_KEY isn't a String, but rather, a float converted to a byte array. If you'd like to reconstruct the float value, you need to do something like: ByteBuffer buf = row.getFromMetadata(CASH_KEY); float cash = Bytes.toFloat(buf.array(), buf.arrayOffset() + buf.position()); Jason
        Hide
        lewismc Lewis John McGibbney added a comment -

        One more this here, when opic-scoring is activated and used, we do

        row.putToMetadata(CASH_KEY, ByteBuffer.wrap(Bytes.toBytes(cash + adjust)));
        

        which always results in

        metadata _csh_ : 	;s��
        

        Any idea what we need to do to correct this?
        Thanks

        Show
        lewismc Lewis John McGibbney added a comment - One more this here, when opic-scoring is activated and used, we do row.putToMetadata(CASH_KEY, ByteBuffer.wrap(Bytes.toBytes(cash + adjust))); which always results in metadata _csh_ : ;s�� Any idea what we need to do to correct this? Thanks
        Hide
        lewismc Lewis John McGibbney added a comment -

        Any objections to committing this patch and rolling an RC for 2.2.1?

        Show
        lewismc Lewis John McGibbney added a comment - Any objections to committing this patch and rolling an RC for 2.2.1?
        Hide
        lewismc Lewis John McGibbney added a comment -

        My criteria for testing your patch is not as technically minded as yours Jason
        I used the readdb task to verify that no crap was in the content and text field's for the URLs.
        I can also verify (as previously stated) that both my memory and CPU is significantly lower when doing ALL operations on the Nutch side.
        I get dodgy characters when reading from Cassandra when the WebPage content and text is Chinese, but I think this is to be expected as no parsing implementation I have for that compensates for such characters.
        @Jason thanks for the test you provide. I don't think we need to pull this in to the codebase, but it is always great to have there should we need to do some testing like this again.

        Show
        lewismc Lewis John McGibbney added a comment - My criteria for testing your patch is not as technically minded as yours Jason I used the readdb task to verify that no crap was in the content and text field's for the URLs. I can also verify (as previously stated) that both my memory and CPU is significantly lower when doing ALL operations on the Nutch side. I get dodgy characters when reading from Cassandra when the WebPage content and text is Chinese, but I think this is to be expected as no parsing implementation I have for that compensates for such characters. @Jason thanks for the test you provide. I don't think we need to pull this in to the codebase, but it is always great to have there should we need to do some testing like this again.
        Hide
        jasonhowes Jason Howes added a comment - - edited

        Hi Renato,

        I've attached my test to this issue, although I'm not sure how useful it will be for performance testing. Perhaps Lewis can share his test(s) that he used to validate the fix?

        Jason

        Show
        jasonhowes Jason Howes added a comment - - edited Hi Renato, I've attached my test to this issue, although I'm not sure how useful it will be for performance testing. Perhaps Lewis can share his test(s) that he used to validate the fix? Jason
        Hide
        renato2099 Renato Javier Marroquín Mogrovejo added a comment -

        Hi Jason,

        I am coming from GoraLand (: do you think you could share this awesome test of yours?? We are trying to benchmark Gora's performance, and having an starting point would be just great!
        Thanks in advance!

        Show
        renato2099 Renato Javier Marroquín Mogrovejo added a comment - Hi Jason, I am coming from GoraLand (: do you think you could share this awesome test of yours?? We are trying to benchmark Gora's performance, and having an starting point would be just great! Thanks in advance!
        Hide
        lewismc Lewis John McGibbney added a comment -

        Dynamite. Well as I said, I am running this patch and have left the job to do it's thing.
        I'll monitor before I leave (for pub ) tonight and I think this may be cause for a 2.2.1 as it is a critical and extremely useful catch indeed.
        Thanks for the explanation and your logic.

        Show
        lewismc Lewis John McGibbney added a comment - Dynamite. Well as I said, I am running this patch and have left the job to do it's thing. I'll monitor before I leave (for pub ) tonight and I think this may be cause for a 2.2.1 as it is a critical and extremely useful catch indeed. Thanks for the explanation and your logic.
        Hide
        jasonhowes Jason Howes added a comment -

        Awesome! Glad to help out. I too had initially assumed that it was a gora-cassandra issue. I was able to narrow it down by writing a simple test client that queried the content of a WebPage directly using a Cassandra client (1) and compared it to a test case where I obtained it via Gora (2). I noticed that when I persisted the content via (2) I was able to pull it out just fine with (1) and visa versa. That led me to believe it was an issue with how the content was being converted to a String by Nutch.

        Thanks again,
        Jason

        Show
        jasonhowes Jason Howes added a comment - Awesome! Glad to help out. I too had initially assumed that it was a gora-cassandra issue. I was able to narrow it down by writing a simple test client that queried the content of a WebPage directly using a Cassandra client (1) and compared it to a test case where I obtained it via Gora (2). I noticed that when I persisted the content via (2) I was able to pull it out just fine with (1) and visa versa. That led me to believe it was an issue with how the content was being converted to a String by Nutch. Thanks again, Jason
        Hide
        lewismc Lewis John McGibbney added a comment -

        I am running this right now on my cluster. Performance seems to be magnitudes better.
        I am curious, how did you narrow this to the ByteBuffer? I immediately started questioning stuff further down @gora-cassandra.
        Honestly, all is looking swell here, I think this justifies another RC. 2.2.1 as this is really really important. Parsing was a major headache for me and others with Cassandra as backend.

        Show
        lewismc Lewis John McGibbney added a comment - I am running this right now on my cluster. Performance seems to be magnitudes better. I am curious, how did you narrow this to the ByteBuffer? I immediately started questioning stuff further down @gora-cassandra. Honestly, all is looking swell here, I think this justifies another RC. 2.2.1 as this is really really important. Parsing was a major headache for me and others with Cassandra as backend.
        Hide
        jasonhowes Jason Howes added a comment -

        Hi Lewis,

        I just attached a patch. Thanks!

        Jason

        Show
        jasonhowes Jason Howes added a comment - Hi Lewis, I just attached a patch. Thanks! Jason
        Hide
        lewismc Lewis John McGibbney added a comment - - edited

        Hi Jason, this is dynamite. You've hit the nail right on the head with this one.
        Please produce the patch from your Guthub branch like this

        git diff --no-prefix branch_name > NUTCH-1591.patch

        Thank you

        Show
        lewismc Lewis John McGibbney added a comment - - edited Hi Jason, this is dynamite. You've hit the nail right on the head with this one. Please produce the patch from your Guthub branch like this git diff --no-prefix branch_name > NUTCH-1591 .patch Thank you
        Hide
        jasonhowes Jason Howes added a comment -

        Gotcha. Any change you guys can take a GitHub pull or are you using GitHub just as a read-only mirror?

        Show
        jasonhowes Jason Howes added a comment - Gotcha. Any change you guys can take a GitHub pull or are you using GitHub just as a read-only mirror?
        Hide
        markus17 Markus Jelsma added a comment -

        Hi Jason,

        Please use svn diff to create patches and upload diffs instead of complete sources. It's almost impossible to work with complete source trees

        Thanks

        Show
        markus17 Markus Jelsma added a comment - Hi Jason, Please use svn diff to create patches and upload diffs instead of complete sources. It's almost impossible to work with complete source trees Thanks

          People

          • Assignee:
            Unassigned
            Reporter:
            jasonhowes Jason Howes
          • Votes:
            1 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development