Nutch / NUTCH-153

TextParser is only supposed to parse plain text, but if given postscript, it can take hours and then fail

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.8
    • Fix Version/s: 1.0.0
    • Component/s: fetcher
    • Labels: None
    • Environment: all

    Description

      If TextParser is given PostScript, it can run for hours and then fail. Careful configuration avoids this, but if the server reports the wrong MIME type and the basename of the URL has no "file extension", then this parser will take a long time and fail every time.

      Analysis: The real problem is in OutlinkExtractor.java, as reported in bug NUTCH-150, but that patch cannot fully address the problem because the first call to the regular expression match() can take a long time, despite quantifier limits.

      Suggested fix: Reject files with "%!PS-Adobe" in the first 40 characters of the file.
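
      The check described above could look roughly like the following. This is a minimal sketch, not the attached patch: the class name PostScriptSniffer, the method looksLikePostScript, and the choice to require the whole marker to fit inside the first 40 bytes are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of Nutch): sniffs raw content for a PostScript header. */
final class PostScriptSniffer {
    /** Marker named in the suggested fix. */
    static final String PS_MAGIC = "%!PS-Adobe";
    /** Only this many leading bytes are inspected. */
    static final int SNIFF_LIMIT = 40;

    /**
     * Returns true if the marker occurs entirely within the first
     * SNIFF_LIMIT bytes of the content (an assumed reading of
     * "in the first 40 characters").
     */
    static boolean looksLikePostScript(byte[] content) {
        if (content == null) {
            return false;
        }
        int n = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps each byte to exactly one char, so arbitrary
        // binary input cannot cause a decoding error here.
        String head = new String(content, 0, n, StandardCharsets.ISO_8859_1);
        return head.contains(PS_MAGIC);
    }
}
```

      A parser could call such a check before doing any other work and reject the content when it returns true, so that the expensive outlink extraction is never reached.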

      Experience has shown that, as a fail-safe, it is worth protecting against GIGO (garbage in, garbage out) directly in TextParser for this case, even though the suggested fix is not a general solution. (A general solution would be a timeout on match().)
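
      java.util.regex has no built-in timeout, so the general solution mentioned above has to be approximated. One common approach is to wrap the input in a CharSequence whose charAt() throws once a deadline has passed. The sketch below assumes that approach. The names TimedRegex, DeadlineCharSequence and RegexTimeoutException are hypothetical and do not come from Nutch or from the attached patch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Nutch): regex matching with a time budget. */
final class TimedRegex {

    /** Thrown when a match runs past its time budget. */
    static final class RegexTimeoutException extends RuntimeException {
        RegexTimeoutException(String message) {
            super(message);
        }
    }

    /** CharSequence wrapper whose charAt() fails once the deadline has passed. */
    static final class DeadlineCharSequence implements CharSequence {
        private final CharSequence inner;
        private final long deadlineNanos;

        DeadlineCharSequence(CharSequence inner, long deadlineNanos) {
            this.inner = inner;
            this.deadlineNanos = deadlineNanos;
        }

        @Override
        public char charAt(int index) {
            // The regex engine reads its input through charAt(), so a
            // runaway match hits this check and gets aborted.
            if (System.nanoTime() - deadlineNanos > 0) {
                throw new RegexTimeoutException("regex match exceeded its time budget");
            }
            return inner.charAt(index);
        }

        @Override
        public int length() {
            return inner.length();
        }

        @Override
        public CharSequence subSequence(int start, int end) {
            // Sub-sequences share the same deadline.
            return new DeadlineCharSequence(inner.subSequence(start, end), deadlineNanos);
        }

        @Override
        public String toString() {
            return inner.toString();
        }
    }

    /** Returns a Matcher over the input that aborts after timeoutMillis. */
    static Matcher matcher(Pattern pattern, CharSequence input, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        return pattern.matcher(new DeadlineCharSequence(input, deadline));
    }
}
```

      A caller would catch RegexTimeoutException around find() or matches() and treat the document as unparseable. Reading the clock on every charAt() adds overhead, so a real implementation might check only every few thousand calls.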

      Attachments

        1. TextParser.java.patch
          1 kB
          Paul Baclace

        Activity

          People

            Assignee: Unassigned
            Reporter: Paul Baclace (pbaclace)
            Votes: 1
            Watchers: 1

            Dates

              Created:
              Updated:
              Resolved: