[LUCENE-1103] WikipediaTokenizer - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.3
Component/s: modules/analysis
Labels: None

Lucene Fields: Patch Available

Description

I have extended StandardTokenizer to recognize Wikipedia syntax and mark tokens with certain attributes. It is not necessarily complete, but it works well enough for others to use and improve.

It sets the Token.type() value according to the Wikipedia syntax a token appears in (links, internal links, bold, italics, etc.), based on my reading of http://en.wikipedia.org/wiki/Wikipedia:Tutorial
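
As a rough illustration of the idea (this is not the code in the patch), the following self-contained Java sketch assigns a type label to each token depending on the markup around it. The class name WikiTypeSketch, the TypedToken holder, and the type strings are all hypothetical and chosen only for this example. The actual tokenizer extends StandardTokenizer, and its Token.type() values may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: labels tokens by the wiki markup they appear in.
// It does not use Lucene and handles only a few constructs.
class WikiTypeSketch {

    /** A token's text plus a type label, loosely analogous to Token.type(). */
    static final class TypedToken {
        final String text;
        final String type;

        TypedToken(String text, String type) {
            this.text = text;
            this.type = type;
        }

        @Override
        public String toString() {
            return text + "/" + type;
        }
    }

    // Alternatives are tried left to right, so ''' (bold) is tested before '' (italics).
    private static final Pattern MARKUP = Pattern.compile(
        "\\[\\[([^\\]|]+)(?:\\|[^\\]]*)?\\]\\]" // group 1: [[internal link]] target
        + "|'''(.+?)'''"                        // group 2: '''bold'''
        + "|''(.+?)''"                          // group 3: ''italics''
        + "|(\\w+)");                           // group 4: plain word

    static List<TypedToken> tokenize(String wikiText) {
        List<TypedToken> out = new ArrayList<>();
        Matcher m = MARKUP.matcher(wikiText);
        while (m.find()) {
            if (m.group(1) != null) {
                addWords(out, m.group(1), "internal_link");
            } else if (m.group(2) != null) {
                addWords(out, m.group(2), "bold");
            } else if (m.group(3) != null) {
                addWords(out, m.group(3), "italics");
            } else {
                out.add(new TypedToken(m.group(4), "word"));
            }
        }
        return out;
    }

    // Splits a marked-up span on whitespace; every word inherits the span's type.
    private static void addWords(List<TypedToken> out, String span, String type) {
        for (String w : span.trim().split("\\s+")) {
            if (!w.isEmpty()) {
                out.add(new TypedToken(w, type));
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(
            tokenize("See [[Apache Lucene]] for '''fast''' and ''flexible'' search"));
    }
}
```

Running main prints [See/word, Apache/internal_link, Lucene/internal_link, for/word, fast/bold, and/word, flexible/italics, search/word]. A downstream filter could then keep, drop, or boost tokens by type, which is the kind of use the type value is meant to enable.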

I have only tested it against the Wikipedia content used by the benchmark EnwikiDocMaker, where it seems to do a decent job.

Caveats: I am not sure how best to handle testing. Since the content is licensed under the GNU Free Documentation License, I believe I can't copy and paste a whole document into the unit test. I have hand-coded one document, and I have another test that simply runs over the benchmark Wikipedia download.

One more question is where to put it. It could go in analysis, but the tests, at least, will have a dependency on Benchmark. I am thinking of adding a new contrib/wikipedia, where this could grow to include other Wikipedia-related code (perhaps we would move EnwikiDocMaker there?) and reverse the dependency on Benchmark.

I will post a patch over the next few days.

Attachments

- LUCENE-1103.patch, 04/Jan/08 03:10, 84 kB, Grant Ingersoll
- LUCENE-1103.patch, 04/Jan/08 02:36, 81 kB, Grant Ingersoll
- LUCENE-1103.patch, 03/Jan/08 22:09, 80 kB, Grant Ingersoll
- LUCENE-1103.patch, 02/Jan/08 22:45, 76 kB, Grant Ingersoll
- LUCENE-1103.patch, 02/Jan/08 19:46, 71 kB, Grant Ingersoll
- LUCENE-1103.patch, 02/Jan/08 19:32, 69 kB, Grant Ingersoll
- LUCENE-1103.patch, 02/Jan/08 16:44, 69 kB, Grant Ingersoll

People

Assignee: Grant Ingersoll

Reporter: Grant Ingersoll

Votes: 0

Watchers: 1

Dates

Created: 28/Dec/07 17:18

Updated: 28/Aug/22 11:44

Resolved: 04/Jan/08 14:29