Tika / TIKA-1302

Let's run Tika against a large batch of docs nightly

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: cli, general, server
    • Labels:
      None

      Description

      Many thanks to Lewis John McGibbney for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.

      One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.

      Any other candidate corpora?
      William Palmer, have anything handy you'd like to contribute?
      http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite

        Activity

        tallison@mitre.org Tim Allison added a comment -

        Thank you, Julien Nioche! I'm unpacking and staging now.

        jnioche Julien Nioche added a comment -

        FYI have extracted data from the CommonCrawl dataset using Behemoth and put that on the server. See http://digitalpebble.blogspot.co.uk/2014/11/generating-test-corpus-for-apache-tika.html for a description of the steps. Roughly 220GB of compressed data, 2M documents of all mime-types, mostly non HTML.
        Tim Allison please let me know if you have any problems with the data

        chrismattmann Chris A. Mattmann added a comment -

Sure Tim, I'll help get the scientific data files for the corpus. Paul Zimdars, can you help here? We want to transfer data off our Amazon box and onto our new VM here for Tika, donated by RackSpace. Tim Allison has the details.

        anjackson Andrew Jackson added a comment -

We have two more sets of data. One is the same as the 1996-2010 stuff, but from 2010 to April 2013, and for each item a copy can generally be accessed via the Internet Archive. We are planning to extend our indexing to the entire 1996-2013 dataset soon, but in reality it's going to be a few months yet due to technical difficulties and other priorities. The second set of data runs from 2013 onwards and, due to the legal constraints on that material, cannot be made available. However, for the next year or two, most of it will still be available on the live web, so that's the fallback option. That material has been indexed (although with an older Tika version), but we're going to re-index that too shortly, so we should also be able to make that available. (n.b. 'shortly' still means weeks or months!)

        Both of these data sets are large and contain more large files. There were c. 2 billion resources in the 1996-2010 chunk, and there are 1.5-2 billion in the 2010-2013 chunk, and over 2 billion per year since then, and in contrast to the early material, we do not limit the size per resource. So that should be interesting.

However, it would be good to run against a broader range of material, given that I stop Tika from recursively processing ZIPs etc. and that web archives are rather weak on A/V files, system files, software, etc. I'm not aware of a good A/V corpus, but on the systems and software side, there are the system images also held at digitalcorpora.org and the various files used by a Red Hat dev to regression-test the 'file' command. There is also this small corpus of example files that I have been contributing to lately, the evolt browser archive and the disktype filesystem image samples.

        jnioche Julien Nioche added a comment -

Sure, will get back to you re: details of scp when I have the data ready.

        tallison@mitre.org Tim Allison added a comment - - edited

        Looks like I'll need to rm govdocs1 zips to clear some space or link another drive!

Julien Nioche, near term, would you be willing to scp some files to the VM we're building for this? Longer term, once we get the process running in a conventional environment, it'd be great to move to Hadoop.

        Chris A. Mattmann, same with you?

Is a 300GB-ish sample for each corpus reasonable?

        chrismattmann Chris A. Mattmann added a comment -

How about images and scientific data? We are actively crawling NSF and NASA Polar data sites; see:

        https://github.com/NSF-Polar-Cyberinfrastructure/datavis-hackathon/issues/1

We have some of this data in Amazon S3 buckets and would easily be able to share it. Great work, Tim.

        jnioche Julien Nioche added a comment -

Hi Tim Allison,
It would be easy to do that with Behemoth. I'm not sure CC contains many multimedia files, but it will certainly have the other types you mentioned. We could either dump the content of the URLs to an archive to process with Tika later or do the Tika parsing with Behemoth as well.

        tallison@mitre.org Tim Allison added a comment - - edited

        Andrew Jackson, I'm attaching some summary stats on the exceptions file you posted. Thank you for sharing.

        In these summary stats, I took the literal exception message, and then I also pared it down to the chunk of text before the first ":". Without the full stacktrace, this will conflate exceptions, but it still might be useful.

        I'm just getting started on the tika-eval code, but one of the things I've run into is that the literal exception message can be problematic if the task is to bin and count exception causes. What I'm currently doing is truncating the message as I did with your data and then running group by on the full stacktrace. One limitation of this, though, is that we can't easily compare exceptions across different versions of the software because line numbers are included, and if one changes, a comparison of "group by" output fails.
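
For illustration, here's a rough sketch (hypothetical helper names, not the actual tika-eval code) of the message truncation and line-number stripping described above:

import java.io.PrintWriter;
import java.io.StringWriter;

public class ExceptionBinner {

    // Keep only the text before the first ':' in the exception's toString(),
    // e.g. "org.xml.sax.SAXParseException: Open quote is expected..." -> "org.xml.sax.SAXParseException"
    static String truncateMessage(Throwable t) {
        String msg = String.valueOf(t);
        int colon = msg.indexOf(':');
        return colon > -1 ? msg.substring(0, colon) : msg;
    }

    // Render the stack trace and strip line numbers, so that traces from different
    // Tika versions can still be grouped even when a line number shifts.
    static String normalizedTrace(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw));
        return sw.toString().replaceAll("\\(([^()]+\\.java):\\d+\\)", "($1)");
    }
}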

        On the SAX exceptions, the XML parser accounted for nearly a quarter of the exceptions on govdocs1 with Tika 1.7-SNAPSHOT. Apologies for the repetition...in tika-server, it looks like we've hard-coded the selection of the more forgiving html parser instead of the XML parser. Depending on your use case, that change in your Tika config might make sense.

        On another note, with govdocs1, we have very few modern pdfs, (ppt|doc|xls)[xm], rtf, msg, open office and multimedia files...Other Tikis, what other formats do we need? I might be willing to crawl for docs, but I don't have a good starting point/list of links, and the search engine APIs aren't as generous as they used to be. So, do you happen to have link data fresher than 2010, by chance? Would you be willing to share a list of links or is it publicly available? Or Julien Nioche, how easy would it be to export a few hundred thousand of those file types from CommonCrawl?

        tallison@mitre.org Tim Allison added a comment - - edited

HPC is way beyond the current status of tika-batch, which is initially aimed at conventional/single-box computing. I heartily welcome tika-batch-hadoop and any other tika-batch-HPC packages!

If you do want to join the effort on tika-batch, please do! I need plenty of help with code review, unit tests, usability and edge case (i.e., bug) discovery. I'd also love to halve the amount of code while keeping the robustness, extensibility and logging.

You can grab my dev version of tika-batch from my GitHub fork. See some background on the wiki. I finished an initial integration with tika-app, and you should be able to run tika-app with:

        java -jar tika-app.jar <srcDirectory>
        

        That will iterate through the srcDirectory and output files in a directory named "output" with a mirror of the srcDirectory's structure. This sounds underwhelming, I know, but the code is robust against OOM and permanent hangs, it is multi-threaded, and the user should be able to interrupt the process (and child process) gracefully.

There are lots of command-line arguments available. I'm going to update the usage wiki shortly, but the usual -? from the app will give you some of the options. I've updated the TikaBatchUsage wiki just now. Let me know when you have questions.

        tpalsulich Tyler Palsulich added a comment -

        I just got access to an HPC cluster at NYU. How are you running Tika against the govdocs corpus, Tim? I'm downloading it right now and would like to reproduce your results.

        anjackson Andrew Jackson added a comment - - edited

        Tim Allison I've created a download folder on our own site, and included a dump of about 1/8th of the SAX errors, here: http://www.webarchive.org.uk/datasets/ukwa.ds.2/for-tika/

        Looking through the SAX exceptions, they do seem to be from resources that are identified as XML (application/*xml) by Tika. i.e. the exceptions do not seem to be coming from malformed HTML, which is consistent with the standard Tika configuration you described above (which I can confirm is what we ran with).

        Unfortunately, I can't recover the full stack traces from that run, and it's not clear if we'll be able to do that in the future because of the way we're doing the indexing, but we'll look at it and hopefully be able to record the full error in the future. For now, you'll have to re-run the source item through Tika to reproduce the error - sorry about that.

        tallison@mitre.org Tim Allison added a comment -

Andrew Jackson, the Google Docs link is down at the moment, so I can't see the full doc. If there is any way to capture the full stacktrace so that we can compare with our govdocs1 runs, that would be fantastic. You can see our current output format comparing two versions of PDFBox over on TIKA-1442. This is ongoing work (from my perspective), and there's no need to rush. Whichever option is easier for you...thank you for sharing!

"I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check)."

        Y, if you could check, I'd be interested. I think the default behavior would be to send XML through the DcXMLParser, which is far stricter than the default HtmlParser. You can see by our choice on tika-server, though, that at least one dev prefers to have our HtmlParser handle xml.

        Thank you, again!

        chrismattmann Chris A. Mattmann added a comment -

I'd say extract the errors; we'd appreciate them. Thank you, Andrew Jackson.

        anjackson Andrew Jackson added a comment -

        Shall I go ahead and extract the XML errors? Or would you rather I waited until we've re-run with the new version that will catch the permanent hangs and regenerate all the data?

        anjackson Andrew Jackson added a comment - - edited

        Okay, so the c.300,000 exceptions are here: https://www.dropbox.com/s/ka19fguaxflp725/parse_errors.csv.gz?dl=0 - let me know if you'd like it placed elsewhere (it's 14MB of compressed CSV).

        This conversation has helped me spot a gap in our code. We currently do a Tika.detect() before we do a Tika.parse(), and only do the latter if the former succeeded. Sadly, the version of the code that I used to generate this data did not record the Tika exception for the .detect() step, only the .parse() step. This will explain why there are no hung-thread events in this result set - the interrupted .detect() was not recorded properly. We'll be re-running this scan soonish, so I'll make sure the next version records all the exceptions. IIRC, from looking at the MIME types, the permanent hangs were mostly ZIPs, Office documents, and maybe some PDFs.
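
For clarity, the detect-then-parse flow (and where both sets of exceptions need to be recorded) looks roughly like this; a sketch using the org.apache.tika.Tika facade, not our actual indexing code:

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.Tika;

public class DetectThenParse {

    private final Tika tika = new Tika();

    // Returns the first exception hit in either step, or null if both succeeded.
    public Exception process(Path file) {
        String mediaType;
        try (InputStream in = Files.newInputStream(file)) {
            mediaType = tika.detect(in);             // step 1: detection
        } catch (Exception e) {
            return e;                                // record detect() failures too
        }
        try (InputStream in = Files.newInputStream(file)) {
            String text = tika.parseToString(in);    // step 2: parsing
            // ... send text + mediaType to the indexer here ...
        } catch (Exception e) {
            return e;                                // record parse() failures
        }
        return null;
    }
}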

        Note that the CSV includes the Content-Type from the .detect() step, and this should indicate which module was run on the resource (i.e. whatever the Tika 1.5 mapping was for that MIME type). I don't think we changed the parse configuration significantly, so it seems HTML and XHTML and XML should all have gone through the HtmlParser (I'm not 100% sure about this, and will try to check).

        I'm not sure it's worth giving you all the SAX exceptions, as there are a lot of repeats of the same problems. I think a random sample of about 50,000 should be plenty. Does that sound okay to you?

        EDIT: Oh, and I meant to say, I'm glad to hear about Giuseppe Totaro and Tim Allison's efforts to run this on GovDocs, and would be interested in comparing results. We already publish format profile data about web archives, and would love to have more data to refer to.

        chrismattmann Chris A. Mattmann added a comment -

        Andrew Jackson thanks for sharing. Giuseppe Totaro has been working in this area and is currently running Tika in an HPC environment against govdocs (as is Tim Allison). It would be great to coordinate here in Tika. Thanks for sharing this.

        willp-bl William Palmer added a comment -

        
        I have left the British Library (as of 20th October 2014). Please contact Maureen.Pennock@bl.uk if you need to contact someone.

        Any FOI requests should be sent to FOI-Enquiries@bl.uk.


        tallison@mitre.org Tim Allison added a comment - - edited

That would be a fantastic resource. Thank you for sharing! We could do a bit of munging to prioritize the most common exceptions in dependencies.

        Your 0.1% exception rate is smaller than the 0.7% exception rate I'm finding on the govdocs1 corpus, but in the same ballpark. Interesting.

        Do you know how many permanent hangs you had and can you identify those files easily enough? I had about 6 in the govdocs1 corpus.

        Thank you!

        P.S. On the SAXParseExceptions...did those come from the XMLParser or from the HtmlParser? I recently discovered that we hardcode an override in TikaResource within tika-server:

         parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
        

        Not sure that we should hardcode that, but it does make sense to use that configuration!
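
If you want the same behaviour in your own code rather than via tika-server, one option (a sketch, not necessarily a recommended configuration) is to override the media-type-to-parser map on an AutoDetectParser:

import java.util.Map;

import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.html.HtmlParser;

public class LenientXmlParsing {
    public static AutoDetectParser buildParser() {
        AutoDetectParser parser = new AutoDetectParser();
        // Route application/xml to the more forgiving HtmlParser, as tika-server does.
        Map<MediaType, Parser> parsers = parser.getParsers();
        parsers.put(MediaType.APPLICATION_XML, new HtmlParser());
        parser.setParsers(parsers);
        return parser;
    }
}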

        anjackson Andrew Jackson added a comment -

I have 2,358,167 errors from one collection (2 billion resources), but the majority are SAXParseExceptions. It's made up of UK web archive content from 1996-2010, so there's lots of broken HTML/XML in there. If I strip out the SAXParseExceptions, there are just 317,548 miscellaneous errors, which are perhaps more interesting.

        Here's an example including the SAX exceptions:

        wayback_date,url,content_length,content_type_tika,parse_error
        20100713041445,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=2737187,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20091017141202,http://www.expedia.co.uk:80/pub/agent.dll/qscr=dspv/nojs=1/htid=34830/crti=4/hotel-pictures,"org.xml.sax.SAXParseException: Open quote is expected for attribute ""ID"" associated with an  element type  ""COMMENT""."
        20091017143741,http://www.madfun.co.uk:80/-10?ref=31,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20061020021825,http://reservations.talkingcities.co.uk:80/nexres/hotels/map_hotels.cgi?hid=10055548&map_only=yes&type=overview,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20061020022224,http://www.ravensportal.co.uk:80/forum/index.php?PHPSESSID=1688184d9bb881cfc73600b1670ecaf5&amp;type=rss;action=.xml,org.xml.sax.SAXParseException: The character reference must end with the ';' delimiter.
        20101227142905,http://www.etc-online.co.uk:80/style4.asp?pn=courses&sn=26,org.xml.sax.SAXParseException: The markup in the document following the root element must be well-formed.
        20060926015856,http://www.qca.org.uk/4412.html,"org.xml.sax.SAXParseException: The entity ""nbsp"" was referenced\, but not declared."
        20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,java.lang.ArrayIndexOutOfBoundsException: -1
        20030124193820,http://www.mgcars.org.uk:80/cgi-bin/gen5?runprog=porter&cov=&mode=buy&o=4854130936&code=9123&cu=&,"org.xml.sax.SAXParseException: The element type ""META"" must be terminated by the matching end-tag ""</META>""."
        20100121205831,http://www.epupz.co.uk:80/clas/viewdetails.asp?view=307389,org.xml.sax.SAXParseException: The entity name must immediately follow the '&' in the entity reference.
        

        ...and for the others...

        wayback_date,url,content_length,content_type_tika,parse_error
        20100928070438,http://redtyger.co.uk/discuss/projectexternal.php,7524,application/rss+xml,java.lang.NullPointerException: null
        20040827075658,http://users.ox.ac.uk:80/~sedm1731/Work/Ex%20parte%20St%20Germain.doc,44997,application/msword,java.lang.ArrayIndexOutOfBoundsException: -1
        20060303154606,http://www.dfes.gov.uk:80/rsgateway/DB/SFR/s000286/sfr37-2001.doc,562004,application/msword,java.lang.IllegalArgumentException: Position 698368 past the end of the file
        20041225033311,http://members.lycos.co.uk:80/worldofradio/distance.pdf,57891,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20041121095540,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/PDP2148.pdf,191115,application/pdf,"java.io.IOException: Error: Expected a long type\, actual='25#0/'"
        20041121095849,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/SER2549.pdf,157148,application/pdf,java.util.zip.DataFormatException: oversubscribed literal/length tree
        20041121100005,http://scom.hud.ac.uk:80/scomzl/conference/chenhua/040528_01E/MSV_Foreword.pdf,12773,application/pdf,java.util.zip.DataFormatException: oversubscribed dynamic bit lengths tree
        20060925090249,http://www2.rgu.ac.uk/library_edocs/resource/exam/0405engineering/EN3581%20OFFSHORE%20ENGINEERING.pdf,1684742,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20060925091406,http://www2.rgu.ac.uk/library_edocs/resource/exam/0304engineering/EE31060304s1.pdf,149238,application/pdf,org.apache.pdfbox.exceptions.CryptographyException: Error: The supplied password does not match either the owner or user password in the document.
        20040612212128,http://www.swhst.org.uk:80/Linked%20Files/spr%20contact%20addresses.xls,23040,application/vnd.ms-excel,org.apache.poi.EncryptedDocumentException: Default password is invalid for docId/saltData/saltHash
        20051111183952,http://freeweb.co.uk:80/show_nw.php?ref=258&target=B&show=aff&PHPSESSID=a150a130c58fcea048866fb965ef7dfb,232436,text/html; charset=iso-8859-1,org.apache.tika.sax.SecureContentHandler$SecureSAXException: Suspected zip bomb: 100 levels of XML element nesting
        20071025140555,http://www.honleyhigh.kirklees.sch.uk/MFL/MFL_Links/PowerPoint%20Presentations/German/Geryear-9-future-tense.ppt,2664960,application/vnd.ms-powerpoint,"org.apache.poi.hslf.exceptions.OldPowerPointFormatException: Based on the Current User stream\, you seem to have supplied a PowerPoint95 file\, which isn't supported"
        20071207004337,http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt,155136,application/vnd.ms-powerpoint,java.lang.ArrayIndexOutOfBoundsException: 20
        

The first two columns identify the item. The next two are the size of the item in bytes and the result of using Tika to identify the format (.detect only, no parse). The last column contains the first line of the parse exception(s).

        Note that to download the original item, you can get them from the Internet archive using this template:

        http://web.archive.org/web/{wayback_date}/{url}
        

        i.e. for the last exception listed above, you can download the item at: http://web.archive.org/web/20071207004337/http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt
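
If it helps, a minimal sketch of fetching one of these items programmatically (identifiers taken from the last row above; no error handling):

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class WaybackFetch {
    public static void main(String[] args) throws Exception {
        String waybackDate = "20071207004337";
        String url = "http://www.jisc.org.uk/uploaded_documents/e-port-brief.ppt";
        // Template: http://web.archive.org/web/{wayback_date}/{url}
        URL archived = new URL("http://web.archive.org/web/" + waybackDate + "/" + url);
        try (InputStream in = archived.openStream()) {
            Files.copy(in, Paths.get("e-port-brief.ppt"), StandardCopyOption.REPLACE_EXISTING);
        }
    }
}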

It might take me a while to generate the full output for the 2.3 million, so I'll try to pull out the 300 thousand other errors first. Our Solr index is having some performance issues, so it might be a bit slow.

        kkrugler Ken Krugler added a comment -

        Andrew - that sounds amazing! Could you provide an example of such an exception, so we could see what information is currently being captured? And do you have any idea how many (of the 4B) are failing, and thus the size of the exception list? Thanks.

        anjackson Andrew Jackson added a comment -

        At the UK Web Archive we run Apache Tika over all our collections (it's been run over about 4 billion resources so far). We record the results in Apache Solr, to act as a search facet, and we also collect the Exceptions that are thrown when Tika fails. We can't make the content available to you directly, but perhaps there are datasets we can produce that would be useful to you? e.g. would a list of the exceptions that we've seen (along with the URL to the resource that caused the exception) be of interest?

        tallison@mitre.org Tim Allison added a comment -

        I just transitioned development on TIKA-1302 subtasks (TIKA-1330 and TIKA-1332) to my fork on github under the TIKA-1302 branch (https://github.com/tballison/tika/tree/TIKA-1302).

        tpalsulich Tyler Palsulich added a comment -

        Hi Lewis John McGibbney and Tim Allison. I'm definitely interested in helping out with these issues. I'll read up and/or comment on them over the next few days.

        lewismc Lewis John McGibbney added a comment -

I would love to work with Tyler Palsulich to address the issues as above. If we could address this in the near future, it would be a large step forward for Tika's public exposure and would enable further understanding of how to embed Tika in applications based on the REST API and WebService endpoint. Tyler Palsulich, please state whether these issues are of interest to you.

        tallison@mitre.org Tim Allison added a comment -

        Agreed.

        If there's a grad student with some time on his/her hands interested in helping out on this issue, there's still plenty to do. Especially on TIKA-1332.

        lewismc Lewis John McGibbney added a comment -

        Tyler Palsulich

"So, we get the nightly build running, then we add this on top?"

Not quite. The aim of the VM established in INFRA-7751 is to get a web application/service running there. This requires work on both TIKA-894 and TIKA-1269.

        tpalsulich Tyler Palsulich added a comment -

        Are there any updates with this? We have the VM we need for TIKA-1301 (INFRA-7751). So, we get the nightly build running, then we add this on top?

        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.6 #29 (See https://builds.apache.org/job/tika-trunk-jdk1.6/29/)
        fix potential null pointer exception in PDFParser; found while working on TIKA-1302 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600996)

        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        hudson Hudson added a comment -

        SUCCESS: Integrated in tika-trunk-jdk1.7 #29 (See https://builds.apache.org/job/tika-trunk-jdk1.7/29/)
        fix potential null pointer exception in PDFParser; found while working on TIKA-1302 (tallison: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1600996)

        • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
        chrismattmann Chris A. Mattmann added a comment -

        +1 this sounds good to me, Tim.

        tallison@mitre.org Tim Allison added a comment - - edited

Y, that's an important question. It all depends on the size of the corpus and what we want for processing time.

        Let's assume we start with govdocs1 or a sample of it.

        Complete back of envelope...

        On my laptop (4 cores with -Xmx1g), it takes a multithreaded indexer ~40 seconds to index 1000 files from govdocs1 (let's assume the time to index is roughly equivalent to the time it'll take to write out the diagnostic stuff we'll want to record for each file).

        That would be 10k files in 6.6 minutes, 100k files in a bit more than an hour and 1M files in 11 hours.

So, if we wanted to start small, we could start with 100k. The full govdocs1 takes up 470GB. A 100k sample would take up roughly 47GB.

        We'd want probably (ballpark) 10x input corpus size to store the output so that we can compare different versions of Tika. So, 0.5 TB. Let's double that for some growth: 1 TB.

        So, with a modest 4 cores, let's say 4 GB RAM, and 1 TB of storage, we could run Tika against 100k files in a bit more than an hour. Add another few minutes to compare output for comparison statistics.

        ***These numbers are based on a purely in-memory run. We'll probably want to run against a server (not the public one, of course) so that'll add some to the time.

        Do these numbers jibe with what others are experiencing?

        The big gotcha, of course, is that we'll want to harden the server and/or create a server daemon to restart the server(s) for OOM and infinite hangs. But I think those features are badly needed and this project will give good motivation for these improvements.

        chrismattmann Chris A. Mattmann added a comment -

Tim Allison this is a good question – the VM that Lewis set up I believe is so that anyone can try out Tika via the JAX-RS service. I would imagine if we do the large batch of docs nightly test (which I think would be awesome, btw) we'll need to figure out the specs we would need and then compare them to the VM that Lewis just had set up. How much RAM, CPU, disk, etc. do you think we'll need, Tim?

        tallison@mitre.org Tim Allison added a comment -

        Chris A. Mattmann, Nick Burch, Lewis John McGibbney and All,
Would it be OK to start trying to work on this on the VM that Lewis just had set up for TIKA-1301? I figure we can take baby steps on that, and if this kind of process turns out to be useful to the community and we need more resources, then we can set up a separate VM.

        tallison@mitre.org Tim Allison added a comment - - edited

        Ok, I think we might be talking about different things. For example, when I pull the metadata out of 002454 with Tika 1.5, I see:

        [{
"dcterms:modified":["2004-05-26T15:31:39Z"],
        "meta:creation-date":["2004-05-26T15:31:31Z"],
        "meta:save-date":["2004-05-26T15:31:39Z"],
        "dc:creator":["Slimjimbob"],
        "Last-Modified":["2004-05-26T15:31:39Z"],
        "Author":["Slimjimbob"],
        "dcterms:created":["2004-05-26T15:31:31Z"],
        date":["2004-05-26T15:31:39Z"],
        "modified":["2004-05-26T15:31:39Z"],
        "creator":["Slimjimbob"],
        "xmpTPg:NPages":["1"],
        "Creation-Date":["2004-05-26T15:31:31Z"],
        "title":["CoverMay/June04.qxd"],
        "meta:author":["Slimjimbob"],
        "created":["Wed May 26 11:31:31 EDT 2004"],
        "producer":["Acrobat Distiller 5.00 for Macintosh"],
        "Content-Type":["application/pdf"],
        "xmp:CreatorTool":["QuarkXPress. 4.04: LaserWriter 8 8.7.1"],
        "Last-Save-Date":["2004-05-26T15:31:39Z"],
        "dc:title":["CoverMay/June04.qxd"]
        }]
        

        This includes more than is available here:
002454: http://digitalcorpora.org/cgi-bin/info.cgi?docid=002454
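
(For reference, metadata like the above can be pulled with something along these lines; a standard-usage sketch rather than my exact code, and the file path is illustrative.)

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.sax.BodyContentHandler;

public class DumpMetadata {
    public static void main(String[] args) throws Exception {
        Metadata metadata = new Metadata();
        AutoDetectParser parser = new AutoDetectParser();
        try (InputStream in = Files.newInputStream(Paths.get("002454.pdf"))) {
            // -1 disables the write limit on the extracted text
            parser.parse(in, new BodyContentHandler(-1), metadata, new ParseContext());
        }
        for (String name : metadata.names()) {
            System.out.println(name + " = " + Arrays.toString(metadata.getValues(name)));
        }
    }
}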

        Are you saying that there is no metadata truth set against which to evaluate or are we using "metadata" to mean different things?

        Thank you again, and I look forward to seeing your paper!

        gostep Giuseppe Totaro added a comment -

        Hi Tim,

        I was referring to the metadata schema of each govdocs1 file. At http://digitalcorpora.org/corpora/files, you can read:

        The following metadata is provided for each of the files:

        The URL from which the file was downloaded.
        The date and time of the download.
        The search term that was used.
        The search engine that provided the document.
        The length and SHA1 of the file.
        A Simple Dublin Core for the file.

        Of course, when our paper is published I'll explain our work and the dataset in more detail.

        tallison@mitre.org Tim Allison added a comment - - edited

        Julien Nioche, very cool corpus. My dream would be to run Tika via Hadoop against a corpus that big, diverse, and noisy whenever there's a commit (or, better yet, let developers upload their mods as a tika-server, go for a cup of coffee, and come back to see the results). For initial steps, govdocs1 seems like a great start...perhaps we could include a random sample from CC downloaded to Apache servers, if that is consistent with both the Apache and CC licenses?

        Chris A. Mattmann, thank you for pointing out Giuseppe Totaro's work!

        Giuseppe Totaro, please post a link to your work on this issue when it is published. Are there any evaluation components that you'd like to contribute? Do you think there would be a way to share your datasets? And, finally, I'm not sure what you mean by all of the metadata being the same. I am just getting started with the govdocs corpus...when I open two different PDF files, they have different PDF versions, different authors, and different producers.

        gostep Giuseppe Totaro added a comment -

        Thank you Chris.

        I'm working with Tika on large sets of data. Govdocs1 is a good choice for testing Tika's performance against a large number of heterogeneous documents. Using Tika 1.4, I noticed that about 10% of the govdocs1 files are not parsed correctly.
        Unfortunately, every file in the govdocs1 corpus has the same metadata properties (http://digitalcorpora.org/corpora/files), so this corpus does not make it possible to realistically test metadata extraction.
        Once our paper is published in the proceedings, I will be happy to describe the results in more detail and share them with the community.
        We are now also working with other corpora that we constructed ourselves from realistic (unclassified) disk images.
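
        For reference, a rough way to reproduce that kind of failure-rate count over a local copy of the corpus — a minimal sketch, assuming the files sit under an illustrative /data/govdocs1 directory and reducing "not correctly parsed" to "the parse threw", which is cruder than whatever was actually measured — is:

        import java.io.InputStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.concurrent.atomic.AtomicInteger;
        import java.util.stream.Stream;

        import org.apache.tika.metadata.Metadata;
        import org.apache.tika.parser.AutoDetectParser;
        import org.apache.tika.parser.ParseContext;
        import org.apache.tika.sax.BodyContentHandler;

        public class ParseFailureCount {
            public static void main(String[] args) throws Exception {
                Path corpus = Paths.get("/data/govdocs1");   // illustrative location
                AutoDetectParser parser = new AutoDetectParser();
                // AtomicIntegers because the lambda below can only touch effectively-final locals
                AtomicInteger total = new AtomicInteger();
                AtomicInteger failed = new AtomicInteger();
                try (Stream<Path> paths = Files.walk(corpus)) {
                    paths.filter(Files::isRegularFile).forEach(p -> {
                        total.incrementAndGet();
                        try (InputStream in = Files.newInputStream(p)) {
                            // Discard the extracted text; we only care whether the parse throws
                            parser.parse(in, new BodyContentHandler(-1), new Metadata(), new ParseContext());
                        } catch (Throwable t) {
                            failed.incrementAndGet();
                        }
                    });
                }
                System.out.printf("%d of %d files failed to parse%n", failed.get(), total.get());
            }
        }

        A serious run would also want per-MIME-type breakdowns and timeouts for hanging parsers, but the overall shape is the same.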

        I hope this meets your interests.

        chrismattmann Chris A. Mattmann added a comment -

        GovDocs - Giuseppe Totaro from the University of Rome (PhD student there) has been working on this dataset and even has a paper to report out on. Giuseppe Totaro can you comment on this one?

        willp-bl William Palmer added a comment -

        Ross Spencer has made the openplanets format-corpus I mentioned above more usable, see https://github.com/ross-spencer/opf-format-corpus/

        Ref: https://twitter.com/beet_keeper/status/468626971337838593

        jnioche Julien Nioche added a comment -

        How large do you want that batch to be? If we are talking millions of pages, one option would be to use the Tika module of Behemoth on the CommonCrawl dataset. See http://digitalpebble.blogspot.co.uk/2011/05/processing-enron-dataset-using-behemoth.html for comparable work we did some time ago on the Enron dataset. Behemoth already has a module for ingesting data from CommonCrawl. This, of course, means having Hadoop up and running.

        Alternatively, it would be simple to extract the documents from the CC dataset onto the server's filesystem and use the TikaServer without Hadoop. I'm not sure what the legal implications of using these documents would be, though.
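
        For reference, a minimal sketch of that second option — assuming the extracted files sit under an illustrative /data/commoncrawl-extracted directory and that a tika-server instance is already listening on its default port, 9998 — could send each file to the server's /tika endpoint for plain-text extraction:

        import java.io.InputStream;
        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.stream.Stream;

        public class TikaServerBatchClient {
            public static void main(String[] args) throws Exception {
                Path extracted = Paths.get("/data/commoncrawl-extracted");   // illustrative location
                try (Stream<Path> paths = Files.walk(extracted)) {
                    paths.filter(Files::isRegularFile).forEach(p -> {
                        try {
                            HttpURLConnection conn =
                                    (HttpURLConnection) new URL("http://localhost:9998/tika").openConnection();
                            conn.setRequestMethod("PUT");
                            conn.setDoOutput(true);
                            conn.setRequestProperty("Accept", "text/plain");
                            // Stream the raw file bytes as the PUT body
                            byte[] buf = new byte[8192];
                            try (InputStream in = Files.newInputStream(p);
                                 OutputStream out = conn.getOutputStream()) {
                                int n;
                                while ((n = in.read(buf)) != -1) {
                                    out.write(buf, 0, n);
                                }
                            }
                            // Non-200 responses typically indicate the server could not parse the file
                            System.out.println(p + " -> HTTP " + conn.getResponseCode());
                            conn.disconnect();
                        } catch (Exception e) {
                            System.out.println(p + " -> request failed: " + e);
                        }
                    });
                }
            }
        }

        Swapping /tika for /meta would return metadata instead of extracted text.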

        The beauty of using the CommonCrawl dataset is that, apart from volume, it is a good sample of the web, with all the weird and beautiful things the web contains (broken documents, large ones, etc.).

        willp-bl William Palmer added a comment -

        This one might be worth a look - https://github.com/openplanets/format-corpus - some of the files there are (intentionally) broken, and some are there as examples of format features (e.g. password-protected PDFs, embedded fonts, etc.). If the license is not clear enough for any files then please raise an issue; I'm sure people will be glad to help.

        Unfortunately I can't share any of the web content I describe using in that blog post.


          People

          • Assignee:
            Unassigned
          • Reporter:
            tallison@mitre.org Tim Allison
          • Votes:
            0
          • Watchers:
            14
