Nutch / NUTCH-2703

parse-tika: Boilerpipe should not run for non-(X)HTML pages

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: parser, plugin
    • Labels:
      None

      Description

      Boilerpipe runs on non-(X)HTML pages, which requires more resources than necessary.

      In my test scenario, my websites contain large PDFs, and with Boilerpipe enabled I have to assign 8500 MB of Java heap for the crawl job to finish without issues.

      Disabling Boilerpipe lets me reduce the JVM heap to 500 MB with no issues.
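
The behaviour requested above could be sketched roughly as follows. This is a minimal, standalone illustration and not the attached patch: the class name BoilerpipeGuard, the method shouldRunBoilerpipe, and the set of MIME types treated as (X)HTML are all assumptions made for the example.

```java
import java.util.Locale;
import java.util.Set;

public class BoilerpipeGuard {

    // Assumption: these two MIME types are what counts as "(X)HTML".
    // The actual list used by parse-tika may differ.
    private static final Set<String> HTML_TYPES =
            Set.of("text/html", "application/xhtml+xml");

    /**
     * Hypothetical helper: returns true only when Boilerpipe is enabled
     * AND the document's content type is (X)HTML.
     */
    public static boolean shouldRunBoilerpipe(boolean boilerpipeEnabled, String contentType) {
        if (!boilerpipeEnabled || contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
                .trim()
                .toLowerCase(Locale.ROOT);
        return HTML_TYPES.contains(base);
    }

    public static void main(String[] args) {
        System.out.println(shouldRunBoilerpipe(true, "text/html; charset=UTF-8"));
        System.out.println(shouldRunBoilerpipe(true, "application/pdf"));
    }
}
```

With a check like this in front of the extractor, large PDFs would bypass Boilerpipe entirely while HTML pages would still have boilerplate removed.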

        Attachments

        1. NUTCH-2703.patch (1 kB, Markus Jelsma)

              People

              • Assignee: Markus Jelsma (markus17)
              • Reporter: Hany Shehata (hanyshehata)
              • Votes: 0
              • Watchers: 5

                Dates

                • Created:
                  Updated:
                  Resolved: