Tika / TIKA-1302

Let's run Tika against a large batch of docs nightly

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: cli, general, server
    • Labels: None

      Description

      Many thanks to Lewis John McGibbney for TIKA-1301! Once we get nightly builds up and running again, it might be fun to run Tika regularly against a large set of docs and report metrics.

      One excellent candidate corpus is govdocs1: http://digitalcorpora.org/corpora/files.

      Any other candidate corpora?
      William Palmer, have anything handy you'd like to contribute?
      http://www.openplanetsfoundation.org/blogs/2014-03-21-tika-ride-characterising-web-content-nanite
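
      A rough sketch of what such a nightly run could look like, in Python. The names run_tika and summarize, the tika-app.jar location, the 60-second timeout, and the ok/error/timeout buckets are all hypothetical choices for illustration. It also assumes the Tika app exits with a non-zero code when a parse fails, which should be checked before relying on the numbers.

```python
import subprocess
import time
from collections import Counter
from pathlib import Path


def run_tika(path, tika_jar="tika-app.jar", timeout=60):
    """Extract text from one file with the Tika app; return (status, seconds).

    Assumption: tika_jar points at a local tika-app jar, and a non-zero
    exit code means the parse failed.
    """
    start = time.monotonic()
    try:
        proc = subprocess.run(
            ["java", "-jar", tika_jar, "--text", str(path)],
            stdout=subprocess.DEVNULL,  # only the outcome matters here
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        status = "ok" if proc.returncode == 0 else "error"
    except subprocess.TimeoutExpired:
        status = "timeout"
    return status, time.monotonic() - start


def summarize(corpus_dir, parse=run_tika):
    """Walk corpus_dir, parse every file, and tally outcomes and total time."""
    counts = Counter()
    total_seconds = 0.0
    for path in sorted(Path(corpus_dir).rglob("*")):
        if path.is_file():
            status, seconds = parse(path)
            counts[status] += 1
            total_seconds += seconds
    return {"files": sum(counts.values()), "seconds": total_seconds, **counts}
```

      Passing the parser in as an argument keeps the tallying logic testable without a JVM. A nightly job could call summarize on a local copy of the corpus and compare the counts with the previous night's run.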

    People

    • Assignee: Unassigned
    • Reporter: Tim Allison (tallison@mitre.org)
