[TIKA-2822] Update common tokens files for tika-eval - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Trivial
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.21
Component/s: tika-eval
Labels: None

Description

We initially created the common tokens files (the top 20k tokens by document frequency in Wikipedia) with Lucene 6.x. We should rerun that code with an updated Lucene version, on the off chance that tokenization has changed slightly.

While doing this work, I found a trivial bug in the filtering of common tokens that we should fix as well.
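
For readers unfamiliar with the term, here is a minimal sketch of what "top N tokens by document frequency" means. It is not the tika-eval code: the class name `CommonTokensSketch`, the method `topByDocFreq`, and the lowercase-and-split-on-non-word-characters tokenizer are all assumptions made for illustration, whereas the real files were generated with a Lucene analyzer, which is why a Lucene upgrade could change the result.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

public class CommonTokensSketch {

    // Hypothetical helper: returns the topN tokens ranked by document frequency,
    // i.e. by the number of documents that contain the token at least once.
    static List<String> topByDocFreq(List<String> docs, int topN) {
        Map<String, Integer> docFreq = new HashMap<>();
        for (String doc : docs) {
            // Assumed tokenizer: lowercase, then split on runs of non-word characters.
            // A Set ensures each token is counted once per document.
            Set<String> seen = new HashSet<>(
                    Arrays.asList(doc.toLowerCase(Locale.ROOT).split("\\W+")));
            seen.remove(""); // split() can yield an empty leading token
            for (String token : seen) {
                docFreq.merge(token, 1, Integer::sum);
            }
        }
        // Sort by descending document frequency; break ties alphabetically
        // so the output is deterministic.
        return docFreq.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(topN)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("the cat sat", "the dog sat", "the the bird");
        System.out.println(topByDocFreq(docs, 2));
    }
}
```

Note that "the" occurs twice in the third document but still contributes only one to its document frequency. Any change in how the tokenizer splits or normalizes text changes which strings are counted, so the ranked list can shift between analyzer versions.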

People

Assignee: Tim Allison

Reporter: Tim Allison

Votes: 0

Watchers: 2

Dates

Created: 28/Jan/19 16:46

Updated: 30/Jan/19 19:17

Resolved: 30/Jan/19 18:43