Nutch / NUTCH-1389

parsechecker and indexchecker to report truncated content


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: nutchgora, 1.5
    • Fix Version/s: 1.7, 2.2
    • Component/s: indexer, parser
    • Labels: None

    Description

      ParserChecker and IndexingFiltersChecker should report when a document is truncated due to {http,file,ftp}.content.limit.
      Truncated content may cause text and metadata extraction to fail for PDF and other binary document formats.
      A hint that truncation (and not a broken plugin) is the possible reason would be useful.
      See NUTCH-965 and ParseSegment.isTruncated(content).
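      To illustrate the kind of check being asked for, here is a minimal, self-contained sketch. It is not the attached patch and not the code of ParseSegment.isTruncated(content); the class TruncationHint and its methods are hypothetical names invented for this example. It assumes truncation can be detected by comparing the Content-Length value reported by the server with the number of bytes actually stored, and it treats a missing or unparsable header as "not truncated" because nothing can be concluded in that case.

```java
/**
 * Hypothetical helper (not part of Nutch): decides whether fetched content
 * is shorter than the length the server declared, and builds a hint message.
 */
final class TruncationHint {

    private TruncationHint() {
        // static helpers only
    }

    /**
     * Returns true when the declared Content-Length is larger than the
     * number of bytes actually stored. Missing, empty, or non-numeric
     * headers yield false, since no conclusion can be drawn from them.
     */
    static boolean isTruncated(String contentLengthHeader, byte[] content) {
        if (contentLengthHeader == null || content == null) {
            return false;
        }
        String trimmed = contentLengthHeader.trim();
        if (trimmed.isEmpty()) {
            return false;
        }
        long declared;
        try {
            declared = Long.parseLong(trimmed);
        } catch (NumberFormatException e) {
            return false; // unparsable header: cannot tell
        }
        return declared > content.length;
    }

    /** Builds the one-line hint a checker tool could print for a URL. */
    static String describe(String url, String contentLengthHeader, byte[] content) {
        if (isTruncated(contentLengthHeader, content)) {
            return url + ": content is truncated (" + content.length + " of "
                    + contentLengthHeader.trim() + " bytes fetched), parsing may fail";
        }
        return url + ": content is not truncated";
    }

    public static void main(String[] args) {
        // Assumed scenario: the server declared 2048 bytes, only 1024 were kept.
        byte[] fetched = new byte[1024];
        System.out.println(describe("http://example.org/doc.pdf", "2048", fetched));
    }
}
```

      A checker could print such a hint before attempting to parse, so that a failed PDF extraction is attributed to the content limit rather than to the parser plugin.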

      Attachments

        1. NUTCH-1389-trunk.patch
          2 kB
          Sebastian Nagel
        2. NUTCH-1389-2x.patch
          1 kB
          Sebastian Nagel

          People

            Assignee: Sebastian Nagel (snagel)
            Reporter: Sebastian Nagel (snagel)
            Votes: 0
            Watchers: 3
