Nutch / NUTCH-2703

parse-tika: Boilerpipe should not run for non-(X)HTML pages

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.15
    • Fix Version/s: 1.16
    • Component/s: parser, plugin
    • Labels: None

    Description

      Boilerpipe runs for non-(X)HTML pages, which requires more resources than necessary.

      In my test scenario, my websites contain large PDFs, and with Boilerpipe enabled I have to assign 8500 MB of Java heap for the crawl job to finish without issues.

      Disabling Boilerpipe allows me to reduce the JVM heap to 500 MB with no issues.
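
      The behaviour requested in the title could be sketched roughly as below. This is only an illustration of the idea, not the attached patch: the class BoilerpipeGate, the method shouldRunBoilerpipe, and the choice of text/html and application/xhtml+xml as the (X)HTML types are all assumptions made for this example.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper (not part of Nutch or Tika): decides whether Boilerpipe should run. */
public class BoilerpipeGate {

    // Assumption: only these MIME types are treated as (X)HTML.
    private static final Set<String> HTML_TYPES =
        Set.of("text/html", "application/xhtml+xml");

    /**
     * Returns true only if Boilerpipe is enabled and the content type is (X)HTML.
     * Parameters such as "; charset=UTF-8" are ignored.
     */
    public static boolean shouldRunBoilerpipe(boolean boilerpipeEnabled, String contentType) {
        if (!boilerpipeEnabled || contentType == null) {
            return false;
        }
        int semicolon = contentType.indexOf(';');
        String base = (semicolon >= 0 ? contentType.substring(0, semicolon) : contentType)
            .trim()
            .toLowerCase(Locale.ROOT);
        return HTML_TYPES.contains(base);
    }

    public static void main(String[] args) {
        // HTML page: Boilerpipe may run.
        System.out.println(shouldRunBoilerpipe(true, "text/html; charset=UTF-8"));
        // PDF: Boilerpipe is skipped, so the document is not passed through the extra extraction step.
        System.out.println(shouldRunBoilerpipe(true, "application/pdf"));
    }
}
```

      With a check of this kind, a large PDF would go through the normal parse path only, which is the case where the heap difference described above was observed.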

      Attachments

        1. NUTCH-2703.patch
          1 kB
          Markus Jelsma

            People

              Assignee: Markus Jelsma (markus17)
              Reporter: Hany Shehata (hanyshehata)
              Votes: 0
              Watchers: 5