Description
Many thanks to lewismc for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
Any other candidate corpora?
willp-bl, have anything handy you'd like to contribute?
http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite
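For what it's worth, the regular run could start out as something like the rough sketch below: walk a corpus directory, try to parse each file, and tally successes and exception types. This is only an illustration, not a proposed implementation. `run_corpus` and `fake_parse` are hypothetical names, and `fake_parse` is a stand-in for whatever actually invokes Tika (which this sketch does not do).

```python
import os
from collections import Counter
from typing import Callable, Dict


def run_corpus(root: str, parse_fn: Callable[[str], str]) -> Dict[str, object]:
    """Walk `root`, call `parse_fn` on every file, and tally the outcomes."""
    total = 0
    succeeded = 0
    exceptions = Counter()  # exception class name -> number of files that raised it
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            total += 1
            try:
                parse_fn(path)
                succeeded += 1
            except Exception as exc:  # keep going; one bad file should not stop the run
                exceptions[type(exc).__name__] += 1
    return {
        "total": total,
        "succeeded": succeeded,
        "failed": total - succeeded,
        "exceptions": dict(exceptions),
    }


def fake_parse(path: str) -> str:
    """Hypothetical stand-in for a real Tika call; fails on '.bad' files."""
    if path.endswith(".bad"):
        raise ValueError("simulated parse failure")
    with open(path, "rb") as fh:
        return fh.read().decode("utf-8", errors="replace")
```

Keeping the parser as a plain callable means the same loop could be pointed at a fresh Tika build on each run, and the per-exception counts could be compared from run to run.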
Attachments
1. Create cron job to pull fresh versions of Tika | Open | Unassigned
2. Add presentation layer for results of each run | Open | Unassigned
3. Build simple stacktrace search interface | Open | Tim Allison