Nutch / NUTCH-2935

DeduplicationJob: failure on URLs with invalid percent encoding


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.18
    • Fix Version/s: 1.19
    • Component/s: crawldb
    • Labels: None
    • Patch Info: Patch Available

    Description

      The DeduplicationJob may fail with an IllegalArgumentException on invalid percent encodings in URLs:

      2021-11-25 04:36:41,747 INFO mapreduce.Job: Task Id : attempt_1637669672674_0018_r_000193_0, Status : FAILED
      Error: java.lang.IllegalArgumentException: URLDecoder: Illegal hex characters in escape (%) pattern - Error at index 0 in: "YR"
              at java.base/java.net.URLDecoder.decode(URLDecoder.java:232)
              at java.base/java.net.URLDecoder.decode(URLDecoder.java:142)
              at org.apache.nutch.crawl.DeduplicationJob$DedupReducer.getDuplicate(DeduplicationJob.java:211)
      ...
      Exception in thread "main" java.lang.RuntimeException: Crawl job did not succeed, job status:FAILED, reason: Task failed task_1637669672674_0018_r_000193
      Job failed as tasks failed. failedMaps:0 failedReduces:1 killedMaps:0 killedReduces: 0
      

      The IllegalArgumentException should be caught and logged. If only one of the two URLs with duplicated content is invalid, the invalid URL should be flagged as the duplicate while the valid URL "survives".
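
      A minimal sketch of the proposed handling (not the actual NUTCH-2935 patch); the class and method names below are hypothetical. The idea is to decode defensively and, if exactly one of the two URLs carries an invalid percent encoding, flag that URL as the duplicate so the valid one survives:

      import java.io.UnsupportedEncodingException;
      import java.net.URLDecoder;

      public class DedupUrlDecodeSketch {

        /** Decodes the URL, or returns null if its percent encoding is invalid. */
        static String tryDecode(String url) {
          try {
            return URLDecoder.decode(url, "UTF-8");
          } catch (IllegalArgumentException | UnsupportedEncodingException e) {
            // log instead of failing the reduce task
            System.err.println("Invalid percent encoding in " + url + ": " + e.getMessage());
            return null;
          }
        }

        /** Returns the URL to be flagged as duplicate. */
        static String chooseDuplicate(String urlA, String urlB) {
          boolean aValid = tryDecode(urlA) != null;
          boolean bValid = tryDecode(urlB) != null;
          if (!aValid && bValid) {
            return urlA;   // only urlA is undecodable: flag it as duplicate
          }
          if (!bValid && aValid) {
            return urlB;   // only urlB is undecodable: flag it as duplicate
          }
          // both decodable (or both invalid): fall back to another criterion,
          // here simply letting the shorter URL survive
          return urlA.length() >= urlB.length() ? urlA : urlB;
        }

        public static void main(String[] args) {
          // "%YR" reproduces "Illegal hex characters in escape (%) pattern"
          System.out.println(chooseDuplicate("http://example.com/?q=%YR",
                                             "http://example.com/?q=YR"));
        }
      }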

            People

              Assignee: Sebastian Nagel (snagel)
              Reporter: Sebastian Nagel (snagel)
              Votes: 0
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: