  1. Nutch
  2. NUTCH-961

Expose Tika's boilerpipe support

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.11
    • Fix Version/s: 1.12
    • Component/s: parser
    • Labels:
      None
    • Patch Info:
      Patch Available

      Description

      Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate content from HTML pages and keep the main text. We should see how we can expose Boilerpipe in the Nutch configuration.

      Use the following properties to enable and control Boilerpipe.

      <property>
        <name>tika.extractor</name>
        <value>none</value>
        <description>
        Which text extraction algorithm to use. Valid values are: boilerpipe or none.
        </description>
      </property>
       
      <property> 
        <name>tika.extractor.boilerpipe.algorithm</name>
        <value>ArticleExtractor</value>
        <description> 
        Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
        or CanolaExtractor.
        </description>
      </property>
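
      Because tika.extractor defaults to none, Boilerpipe stays off until that value is overridden. A minimal sketch of such an override follows, assuming it goes into conf/nutch-site.xml (the usual Nutch override file; this issue does not name the file) and picking ArticleExtractor purely as an example:

      <configuration>
        <!-- Assumption: placed in conf/nutch-site.xml to override the defaults above -->
        <property>
          <name>tika.extractor</name>
          <value>boilerpipe</value>
        </property>

        <!-- Optional: any of DefaultExtractor, ArticleExtractor or CanolaExtractor -->
        <property>
          <name>tika.extractor.boilerpipe.algorithm</name>
          <value>ArticleExtractor</value>
        </property>
      </configuration>

      Setting tika.extractor back to none should disable Boilerpipe again without touching the algorithm property.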
      

        Attachments

        1. NUTCH-961.patch
          6 kB
          Markus Jelsma
        2. NUTCH-961.patch
          3 kB
          Markus Jelsma
        3. NUTCH-961-1.11-1.patch
          7 kB
          Vincent Slot
        4. nutch-2.x-boilerpipe.patch
          5 kB
          Alexander Kingson
        5. NUTCH-961-1.8-1.patch
          7 kB
          Markus Jelsma
        6. NUTCH-961-2.1-v2.patch
          7 kB
          Roland von Herget
        7. NUTCH-961-2.1-v1.patch
          7 kB
          Roland von Herget
        8. NUTCH-961-1.5-1.patch
          7 kB
          Markus Jelsma
        9. NUTCH-961-1.4-dombuilder-1.patch
          0.6 kB
          Markus Jelsma
        10. NUTCH-961-1.3-3.patch
          2 kB
          Markus Jelsma
        11. NUTCH-961v2.patch
          3 kB
          Gabriele Kahlout
        12. NUTCH-961-1.3-tikaparser1.patch
          3 kB
          Gabriele Kahlout
        13. BoilerpipeExtractorRepository.java
          3 kB
          Markus Jelsma
        14. NUTCH-961-1.3-tikaparser.patch
          2 kB
          Markus Jelsma

              People

              • Assignee:
                markus17 Markus Jelsma
                Reporter:
                markus17 Markus Jelsma
              • Votes:
                6
                Watchers:
                16
