Many thanks to Lewis John McGibbney for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.
One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.
Any other candidate corpora?
William Palmer, have anything handy you'd like to contribute?
http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite
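As a rough illustration of the "run regularly against a large set of docs and report metrics" idea, here is a minimal sketch. It is not Tika's API or the tika-eval implementation: `evaluate_corpus` and the `extract` callable are hypothetical names, and `extract` stands in for whatever actually invokes Tika on a file. The sketch walks a corpus directory, calls the extractor on each file, and tallies successes, extracted character counts, and exceptions by class name.

```python
from collections import Counter
from pathlib import Path
from typing import Callable, Dict


def evaluate_corpus(root: Path, extract: Callable[[Path], str]) -> Dict[str, object]:
    """Run `extract` on every file under `root` and tally simple metrics.

    `extract` is a hypothetical stand-in for the real parser call; it should
    return the extracted text or raise an exception on failure.
    """
    metrics = {"files": 0, "ok": 0, "chars": 0, "exceptions": Counter()}
    # Sort for a stable processing order so runs are comparable.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        metrics["files"] += 1
        try:
            text = extract(path)
        except Exception as exc:
            # Count failures by exception class; a real run would also need
            # timeouts and isolation from crashes or out-of-memory errors.
            metrics["exceptions"][type(exc).__name__] += 1
        else:
            metrics["ok"] += 1
            metrics["chars"] += len(text)
    return metrics
```

Comparing the resulting dictionaries between two builds would show which exception classes appeared or disappeared, which is the kind of regression signal a regular run could report.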
1. | Add robust tika-batch code | Resolved | Tim Allison |
2. | Find/configure a vm and gather initial corpus | Resolved | Tim Allison |
3. | Create tika-eval module | Resolved | Tim Allison |
4. | Create cron job to pull fresh versions of Tika | Open | Unassigned |
5. | Add presentation layer for results of each run | Open | Unassigned |
6. | Build simple stacktrace search interface | Open | Tim Allison |