Details
-
New Feature
-
Status: Open
-
Major
-
Resolution: Unresolved
-
0.2.0
-
None
-
None
Description
We should make greater use of Tika. Newer versions of Tika can do some of the things we've written our own code for.
In addition, new content handlers can provide interesting data. The BoilerpipeContentHandler, for example, tries to keep only the main content of a page and drop the surrounding boilerplate.
The Metadata class can return all sorts of useful values, such as the title or the robots meta field, without us having to parse them out of the document ourselves.
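As a rough illustration of both points, here is a minimal sketch that wraps a plain text handler in a BoilerpipeContentHandler and then reads the title and robots values back from Metadata. It assumes a Tika 1.x classpath, where BoilerpipeContentHandler lives in org.apache.tika.parser.html (the package may differ in other releases). The class name TikaSketch and the method extract are hypothetical, and reading the robots value under the key "robots" assumes the HTML parser copies meta tags into Metadata under their own names.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.metadata.TikaCoreProperties;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.BoilerpipeContentHandler; // Tika 1.x location (assumption)
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, not part of this project or of Tika.
class TikaSketch {

    /** Parses an HTML string and returns { main text, title, robots }. */
    static String[] extract(String html) throws Exception {
        Metadata metadata = new Metadata();

        // Collects the text that survives boilerplate removal.
        BodyContentHandler text = new BodyContentHandler();

        // Filters the SAX events so that only the main content reaches 'text'.
        BoilerpipeContentHandler handler = new BoilerpipeContentHandler(text);

        byte[] bytes = html.getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            new HtmlParser().parse(in, handler, metadata, new ParseContext());
        }

        return new String[] {
            text.toString(),
            metadata.get(TikaCoreProperties.TITLE), // document title
            metadata.get("robots")                  // <meta name="robots"> (assumed key)
        };
    }
}
```

Calling TikaSketch.extract on a fetched page would give the filtered text alongside the two metadata values, which could replace hand-written title and robots parsing if the assumptions above hold for the Tika version in use.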