  Nutch / NUTCH-2787

CrawlDb JSON dump does not export metadata primitive data types correctly

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.17
    • Fix Version/s: 1.17
    • Component/s: crawldb
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      To reproduce:

      • Activate the scoring-depth plugin.
      • Create a new crawldb from a seed URL.
      • Dump the crawldb as JSON.
      • Inspect the JSON output.
      $ nutch inject crawl/crawldb seeds.txt
      $ rm -rf out; nutch readdb crawl/crawldb -dump out -format json
      $ cat out/part-r-00000 | head -1 | python -m json.tool
      {
          "url": "http://example.com/",
          "statusCode": 1,
          "statusName": "db_unfetched",
          "fetchTime": "Thu Jun 04 15:19:02 CEST 2020",
          "modifiedTime": "Thu Jan 01 01:00:00 CET 1970",
          "retriesSinceFetch": 0,
          "retryIntervalSeconds": 2592000,
          "retryIntervalDays": 30,
          "score": 1.0,
          "signature": "null",
          "metadata": {
              "_depth_": {},
              "_maxdepth_": {}
          }
      }

      KO: `_depth_` and `_maxdepth_` are exported as empty objects (`{}`) instead of integers.

      The fields are correct in the crawldb, as shown by a CSV dump:

      $ rm -rf out; nutch readdb crawl/crawldb -dump out -format csv
      $ cat out/part-r-00000 
      Url,Status code,Status name,Fetch Time,Modified Time,Retries since fetch,Retry interval seconds,Retry interval days,Score,Signature,Metadata
      "http://example.com/",1,"db_unfetched",Thu Jun 04 15:19:02 CEST 2020,Thu Jan 01 01:00:00 CET 1970,0,2592000.0,30.0,1.0,"null","_depth_:1|||_maxdepth_:5|||" 

      Code is here:

      https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/CrawlDbReader.java#L269

      I do not know Java very well, but I think the cause is that IntWritable and the other Writable wrapper classes are not POJO types (or at least not in the way we want them to be).

      One fix might be to:

      • Map each primitive-type Writable class to a function that casts from the base interface and calls "get" (maybe boxing the value as well).
      • Call that function in the metadata conversion loop.
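
      The two steps above could be sketched roughly as follows. This is only an illustration of the idea, not the actual Nutch patch. The nested Writable, IntWritable, LongWritable, FloatWritable and Text types are minimal stand-ins for the Hadoop classes of the same names, so that the sketch compiles without Hadoop on the classpath. The names WritableUnboxSketch, CONVERTERS, toJsonValue and convertMetadata are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

public class WritableUnboxSketch {

    // Stand-ins (assumption) for org.apache.hadoop.io types; the real classes
    // also expose a get() accessor on the primitive wrappers.
    interface Writable {}

    static final class IntWritable implements Writable {
        private final int value;
        IntWritable(int value) { this.value = value; }
        int get() { return value; }
    }

    static final class LongWritable implements Writable {
        private final long value;
        LongWritable(long value) { this.value = value; }
        long get() { return value; }
    }

    static final class FloatWritable implements Writable {
        private final float value;
        FloatWritable(float value) { this.value = value; }
        float get() { return value; }
    }

    static final class Text implements Writable {
        private final String value;
        Text(String value) { this.value = value; }
        @Override public String toString() { return value; }
    }

    // Step 1: map each primitive-type Writable class to a function that casts
    // from the base interface and calls get(); the result is auto-boxed.
    private static final Map<Class<?>, Function<Writable, Object>> CONVERTERS =
            new LinkedHashMap<>();
    static {
        CONVERTERS.put(IntWritable.class, w -> ((IntWritable) w).get());
        CONVERTERS.put(LongWritable.class, w -> ((LongWritable) w).get());
        CONVERTERS.put(FloatWritable.class, w -> ((FloatWritable) w).get());
    }

    // Unknown types fall back to their string form.
    static Object toJsonValue(Writable w) {
        Function<Writable, Object> converter = CONVERTERS.get(w.getClass());
        return converter != null ? converter.apply(w) : w.toString();
    }

    // Step 2: apply the conversion in the metadata loop, producing plain Java
    // values that a JSON serializer can write as numbers or strings.
    static Map<String, Object> convertMetadata(Map<Writable, Writable> metadata) {
        Map<String, Object> out = new LinkedHashMap<>();
        for (Map.Entry<Writable, Writable> entry : metadata.entrySet()) {
            out.put(entry.getKey().toString(), toJsonValue(entry.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<Writable, Writable> metadata = new LinkedHashMap<>();
        metadata.put(new Text("_depth_"), new IntWritable(1));
        metadata.put(new Text("_maxdepth_"), new IntWritable(5));
        System.out.println(convertMetadata(metadata)); // prints {_depth_=1, _maxdepth_=5}
    }
}
```

      With values converted this way, `_depth_` and `_maxdepth_` would reach the JSON writer as boxed Integer objects rather than as Writable instances.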

              People

              • Assignee:
                snagel Sebastian Nagel
                Reporter:
                pmezard Patrick Mézard