Description
Tika provides outlink extraction features that Nutch does not currently use. To use them in Nutch, we need Tika to return the rel attribute value of each link, which it currently doesn't. There's a patch for Tika 1.1 that adds this. Once that patch is included in Tika and we upgrade to that version, this issue can be worked on. Here's preliminary code that does both Tika-based and the current outlink extraction. It also includes parts of the Boilerpipe code.
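To illustrate why the rel value matters for outlink extraction, here is a minimal sketch of filtering extracted links by their rel attribute. It assumes the goal is to skip links marked `nofollow`, which the description does not state explicitly. `OutlinkRelFilter` and `ExtractedLink` are hypothetical names made up for this example; they are not Tika or Nutch classes, and the sketch does not call Tika at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class OutlinkRelFilter {

    /** Hypothetical stand-in for an extracted link that carries its rel value. */
    public static final class ExtractedLink {
        final String uri;
        final String anchor;
        final String rel; // null or empty when the tag has no rel attribute

        public ExtractedLink(String uri, String anchor, String rel) {
            this.uri = uri;
            this.anchor = anchor;
            this.rel = rel;
        }
    }

    /** Returns true if the space-separated rel value contains the token "nofollow". */
    public static boolean isNoFollow(String rel) {
        if (rel == null) {
            return false;
        }
        for (String token : rel.trim().split("\\s+")) {
            if (token.toLowerCase(Locale.ROOT).equals("nofollow")) {
                return true;
            }
        }
        return false;
    }

    /** Keeps only the URIs of links that are not marked nofollow (assumed policy). */
    public static List<String> followableUris(List<ExtractedLink> links) {
        List<String> uris = new ArrayList<>();
        for (ExtractedLink link : links) {
            if (!isNoFollow(link.rel)) {
                uris.add(link.uri);
            }
        }
        return uris;
    }

    public static void main(String[] args) {
        List<ExtractedLink> links = Arrays.asList(
            new ExtractedLink("http://example.org/a", "A", null),
            new ExtractedLink("http://example.org/b", "B", "external nofollow"),
            new ExtractedLink("http://example.org/c", "C", "tag"));
        System.out.println(followableUris(links));
    }
}
```

Without the rel value on each link, a filter like `isNoFollow` has nothing to inspect, which is why the extraction step has to expose it.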
Attachments
Issue Links
- depends upon
  - TIKA-1835 LinkContentHandler skips iframe and rel tags (Resolved)
  - TIKA-824 Extract rel attr with LinkContentHandler (Resolved)
  - NUTCH-1234 Upgrade to Tika 1.1 (Closed)
  - NUTCH-2210 Upgrade to Tika 1.12 (Closed)
- is related to
  - NUTCH-961 Expose Tika's boilerpipe support (Closed)
- relates to
  - TIKA-975 LinkBuilder to optionally collapse anchor whitespace (Resolved)
- requires
  - TIKA-1835 LinkContentHandler skips iframe and rel tags (Resolved)