[NUTCH-961] Expose Tika's boilerpipe support - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 1.11
Fix Version/s: 1.12
Component/s: parser
Labels: None
Patch Info: Patch Available

Description

Tika 0.8 comes with the Boilerpipe content handler, which can be used to strip boilerplate content from HTML pages and keep only the main text. We should see how we can expose Boilerpipe in the Nutch configuration.

Use the following properties to enable and control Boilerpipe.

<property>
  <name>tika.extractor</name>
  <value>none</value>
  <description>
  Which text extraction algorithm to use. Valid values are: boilerpipe or none.
  </description>
</property>
 
<property> 
  <name>tika.extractor.boilerpipe.algorithm</name>
  <value>ArticleExtractor</value>
  <description> 
  Which Boilerpipe algorithm to use. Valid values are: DefaultExtractor, ArticleExtractor
  or CanolaExtractor.
  </description>
</property>
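
As a minimal sketch, the snippet below shows how Boilerpipe extraction might be switched on. It assumes the properties are overridden in conf/nutch-site.xml, as is usual for Nutch settings, and that HTML is being parsed by the Tika parser plugin; neither assumption is stated in this issue. Only the two property names and their documented values are taken from the description above.

```xml
<?xml version="1.0"?>
<!-- Assumed location: conf/nutch-site.xml (site-level overrides) -->
<configuration>

  <!-- Switch the extractor from the default "none" to "boilerpipe" -->
  <property>
    <name>tika.extractor</name>
    <value>boilerpipe</value>
  </property>

  <!-- One of: DefaultExtractor, ArticleExtractor, CanolaExtractor -->
  <property>
    <name>tika.extractor.boilerpipe.algorithm</name>
    <value>ArticleExtractor</value>
  </property>

</configuration>
```

Leaving tika.extractor at none keeps the previous behaviour, so the algorithm setting should only matter once the extractor is set to boilerpipe.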

Attachments

- NUTCH-961v2.patch, 02/Jun/11 09:58, 3 kB, Gabriele Kahlout
- NUTCH-961-2.1-v2.patch, 06/Mar/13 16:16, 7 kB, Roland von Herget
- NUTCH-961-2.1-v1.patch, 06/Mar/13 10:32, 7 kB, Roland von Herget
- NUTCH-961-1.8-1.patch, 17/Jun/13 14:34, 7 kB, Markus Jelsma
- NUTCH-961-1.5-1.patch, 22/Nov/11 12:51, 7 kB, Markus Jelsma
- NUTCH-961-1.4-dombuilder-1.patch, 17/Jul/11 14:07, 0.6 kB, Markus Jelsma
- NUTCH-961-1.3-tikaparser1.patch, 12/May/11 01:25, 3 kB, Gabriele Kahlout
- NUTCH-961-1.3-tikaparser.patch, 18/Apr/11 13:11, 2 kB, Markus Jelsma
- NUTCH-961-1.3-3.patch, 27/Jun/11 13:14, 2 kB, Markus Jelsma
- NUTCH-961-1.11-1.patch, 08/Dec/15 10:48, 7 kB, Vincent Slot
- NUTCH-961.patch, 16/Feb/16 14:10, 3 kB, Markus Jelsma
- NUTCH-961.patch, 16/Feb/16 14:39, 6 kB, Markus Jelsma
- nutch-2.x-boilerpipe.patch, 01/Apr/15 22:00, 5 kB, Alexander Kingson
- BoilerpipeExtractorRepository.java, 26/Apr/11 09:48, 3 kB, Markus Jelsma

Issue Links

depends upon

- TIKA-676 Boilerpipe fails (Resolved)
- NUTCH-967 Upgrade to Tika 0.9 (Closed)

relates to

- NUTCH-1375 extract main content of a html file (Closed)
- NUTCH-1233 Rely on Tika for outlink extraction (Closed)

People

Assignee: Markus Jelsma
Reporter: Markus Jelsma
Votes: 6
Watchers: 17

Dates

Created: 23/Jan/11 13:23
Updated: 13/Mar/24 14:51
Resolved: 16/Feb/16 14:43