Nutch / NUTCH-961

Expose Tika's boilerpipe support

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Tika 0.8 comes with the Boilerpipe content handler, which can be used to detect and strip boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration.
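
      As a rough illustration of what "exposing Boilerpipe in the configuration" could mean, the sketch below reads two settings and decides which extractor, if any, to use. It is not taken from any of the attached patches. The property names `tika.boilerpipe` and `tika.boilerpipe.extractor`, the class name, and the fallback to `ArticleExtractor` are all assumptions made for this example, and `java.util.Properties` stands in for the Nutch/Hadoop configuration object. The three extractor names are classes that ship with Boilerpipe.

```java
import java.util.Optional;
import java.util.Properties;
import java.util.Set;

// Hypothetical sketch: config-driven selection of a Boilerpipe extractor.
class ExtractorConfigSketch {
    // Assumed property names, invented for this example.
    static final String ENABLED_KEY = "tika.boilerpipe";
    static final String EXTRACTOR_KEY = "tika.boilerpipe.extractor";
    // Assumed default when the configured name is missing or unknown.
    static final String DEFAULT_EXTRACTOR = "ArticleExtractor";

    // Extractor class names provided by the Boilerpipe library.
    static final Set<String> KNOWN =
        Set.of("ArticleExtractor", "DefaultExtractor", "CanolaExtractor");

    // Returns the extractor name to use, or empty if Boilerpipe is disabled.
    static Optional<String> resolveExtractor(Properties conf) {
        boolean enabled =
            Boolean.parseBoolean(conf.getProperty(ENABLED_KEY, "false"));
        if (!enabled) {
            return Optional.empty();
        }
        String name = conf.getProperty(EXTRACTOR_KEY, DEFAULT_EXTRACTOR).trim();
        // Unknown names fall back to the default rather than failing the parse.
        return Optional.of(KNOWN.contains(name) ? name : DEFAULT_EXTRACTOR);
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(ENABLED_KEY, "true");
        conf.setProperty(EXTRACTOR_KEY, "CanolaExtractor");
        System.out.println(resolveExtractor(conf).orElse("boilerpipe disabled"));
    }
}
```

      Keeping the feature off by default would leave existing crawls unchanged unless a user opts in. A parser plugin would then map the resolved name to the matching Boilerpipe extractor instance before wrapping its content handler.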

      1. BoilerpipeExtractorRepository.java
        3 kB
        Markus Jelsma
      2. NUTCH-961-1.3-3.patch
        2 kB
        Markus Jelsma
      3. NUTCH-961-1.3-tikaparser.patch
        2 kB
        Markus Jelsma
      4. NUTCH-961-1.3-tikaparser1.patch
        3 kB
        Gabriele Kahlout
      5. NUTCH-961-1.4-dombuilder-1.patch
        0.6 kB
        Markus Jelsma
      6. NUTCH-961-1.5-1.patch
        7 kB
        Markus Jelsma
      7. NUTCH-961-1.8-1.patch
        7 kB
        Markus Jelsma
      8. NUTCH-961-2.1-v1.patch
        7 kB
        Roland von Herget
      9. NUTCH-961-2.1-v2.patch
        7 kB
        Roland von Herget
      10. NUTCH-961v2.patch
        3 kB
        Gabriele Kahlout

            People

            • Assignee:
              Markus Jelsma
              Reporter:
              Markus Jelsma
            • Votes:
              7
              Watchers:
              12
