NUTCH-1252

SegmentReader -get shows wrong data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4, 1.5
    • Fix Version/s: 1.6
    • Component/s: None
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      The -get option of SegmentReader may show data that belongs to a different URL than the one given.

      To reproduce:

      % mkdir -p test_readseg/urls
      % echo -e "http://nutch.apache.org/\ttest=ApacheNutch\nhttp://abc.test/\ttest=AbcTest\tnutch.score=10.0" > test_readseg/urls/seeds
      
      % nutch inject test_readseg/crawldb test_readseg/urls
      Injector: starting at 2012-01-18 09:32:25
      Injector: crawlDb: test_readseg/crawldb
      Injector: urlDir: test_readseg/urls
      Injector: Converting injected urls to crawl db entries.
      Injector: Merging injected urls into crawl db.
      Injector: finished at 2012-01-18 09:32:28, elapsed: 00:00:03
      
      % nutch generate test_readseg/crawldb test_readseg/segments/
      Generator: starting at 2012-01-18 09:32:30
      Generator: Selecting best-scoring urls due for fetch.
      Generator: filtering: true
      Generator: normalizing: true
      Generator: jobtracker is 'local', generating exactly one partition.
      Generator: Partitioning selected urls for politeness.
      Generator: segment: test_readseg/segments/20120118093232
      Generator: finished at 2012-01-18 09:32:34, elapsed: 00:00:03
      
      % nutch readseg -get test_readseg/segments/* 'http://nutch.apache.org/' -nocontent -noparse -nofetch -noparsedata -noparsetext
      SegmentReader: get 'http://nutch.apache.org/'
      Crawl Generate::
      Version: 7
      Status: 1 (db_unfetched)
      Fetch time: Wed Jan 18 09:32:26 CET 2012
      Modified time: Thu Jan 01 01:00:00 CET 1970
      Retries since fetch: 0
      Retry interval: 2592000 seconds (30 days)
      Score: 10.0
      Signature: null
      Metadata: _ngt_: 1326875550401test: AbcTest
      

      The metadata and the score indicate that the CrawlDatum shown is the wrong one: it is the one associated with http://abc.test/, not the one for http://nutch.apache.org/.
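
To make the mismatch easier to see, the following minimal sketch parses the two seed lines from the reproduction and prints the metadata and score each URL should carry. It is not Nutch code: the class `SeedLine` and its `parse` method are hypothetical names used only for illustration, and the sketch assumes the tab-separated `key=value` seed format shown above, with `nutch.score` treated as the score rather than as ordinary metadata.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Nutch: models one line of the seed file.
class SeedLine {
    final String url;
    final Map<String, String> metadata = new LinkedHashMap<>();
    Float score; // stays null when the seed line sets no nutch.score

    SeedLine(String url) {
        this.url = url;
    }

    /** Parses "url<TAB>key=value<TAB>key=value..." as written to test_readseg/urls/seeds. */
    static SeedLine parse(String line) {
        String[] fields = line.split("\t");
        SeedLine seed = new SeedLine(fields[0].trim());
        for (int i = 1; i < fields.length; i++) {
            int eq = fields[i].indexOf('=');
            if (eq < 0) {
                continue; // skip fields that are not in key=value form
            }
            String key = fields[i].substring(0, eq).trim();
            String value = fields[i].substring(eq + 1).trim();
            if (key.equals("nutch.score")) {
                seed.score = Float.parseFloat(value);
            } else {
                seed.metadata.put(key, value);
            }
        }
        return seed;
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://nutch.apache.org/\ttest=ApacheNutch",
            "http://abc.test/\ttest=AbcTest\tnutch.score=10.0"
        };
        for (String line : seeds) {
            SeedLine seed = parse(line);
            System.out.println(seed.url + " metadata=" + seed.metadata + " score=" + seed.score);
        }
    }
}
```

Only http://abc.test/ carries test=AbcTest and an explicit score of 10.0, which is why the readseg output above cannot belong to http://nutch.apache.org/.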

      1. NUTCH-1252.patch
        1 kB
        Sebastian Nagel
      2. NUTCH-1252-v2.patch
        1 kB
        Sebastian Nagel

        Activity

        Sebastian Nagel added a comment -

        The new patch also fixes a second bug:
        getMapRecords assumes that a given key (URL) has only one value,
        but there may be more. For example, crawl_fetch can contain:

        1. one fetch datum (e.g., fetch_success)
        2. one or more linked datums from redirects, which are stored in crawl_fetch if the number of redirects exceeds http.redirect.max
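
The idea of returning every value stored under a key, not just the first, can be sketched as follows. This is not the committed patch and does not use the Hadoop MapFile API: the class `MultiValueLookup`, the `Record` type, and the methods `firstIndexOf` and `getAll` are hypothetical stand-ins, and a sorted in-memory list plays the role of a segment part sorted by URL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not the actual SegmentReader code.
class MultiValueLookup {

    /** One key/value record; stands in for a (URL, CrawlDatum) pair. */
    static final class Record {
        final String key;
        final String value;

        Record(String key, String value) {
            this.key = key;
            this.value = value;
        }
    }

    /** Returns the index of the first record with the given key, or -1. The list must be sorted by key. */
    static int firstIndexOf(List<Record> sorted, String key) {
        int lo = 0;
        int hi = sorted.size();
        while (lo < hi) { // lower-bound binary search
            int mid = (lo + hi) >>> 1;
            if (sorted.get(mid).key.compareTo(key) < 0) {
                lo = mid + 1;
            } else {
                hi = mid;
            }
        }
        return (lo < sorted.size() && sorted.get(lo).key.equals(key)) ? lo : -1;
    }

    /** Collects all values stored under the key by scanning forward from the first match. */
    static List<String> getAll(List<Record> sorted, String key) {
        List<String> values = new ArrayList<>();
        int i = firstIndexOf(sorted, key);
        if (i < 0) {
            return values; // key not present
        }
        while (i < sorted.size() && sorted.get(i).key.equals(key)) {
            values.add(sorted.get(i).value);
            i++;
        }
        return values;
    }

    public static void main(String[] args) {
        List<Record> crawlFetch = Arrays.asList(
            new Record("http://abc.test/", "fetch_success"),
            new Record("http://abc.test/", "linked"),
            new Record("http://nutch.apache.org/", "fetch_success"));
        System.out.println(getAll(crawlFetch, "http://abc.test/"));
        System.out.println(getAll(crawlFetch, "http://nutch.apache.org/"));
    }
}
```

A lookup that stops at the first match would report only fetch_success for http://abc.test/ and silently drop the linked datum; scanning forward while the key stays the same returns both.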
        Markus Jelsma added a comment -

        Thanks. Marked for 1.5, keeping it on the radar.

        Markus Jelsma added a comment -

        20120304-push-1.6

        Sebastian Nagel added a comment -

        committed to trunk (revision 1397281)

        Hudson added a comment -

        Integrated in nutch-trunk-maven #451 (See https://builds.apache.org/job/nutch-trunk-maven/451/)
        NUTCH-1252 SegmentReader -get shows wrong data (Revision 1397281)

        Result = SUCCESS
        snagel :
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java
        Hudson added a comment -

        Integrated in Nutch-trunk #1985 (See https://builds.apache.org/job/Nutch-trunk/1985/)
        NUTCH-1252 SegmentReader -get shows wrong data (Revision 1397281)

        Result = SUCCESS
        snagel : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1397281
        Files :

        • /nutch/trunk/CHANGES.txt
        • /nutch/trunk/src/java/org/apache/nutch/segment/SegmentReader.java

          People

          • Assignee: Sebastian Nagel
          • Reporter: Sebastian Nagel
          • Votes: 0
          • Watchers: 2
