[UIMA-2318] Define automatic distribution of a closed term set over multiple fields in one field definition. - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.4.0Addons
Component/s: Sandbox-Lucas
Labels: None

Description

In the course of my work I needed LuCas to do a few things which were not possible, or at least not easy, out of the box. I checked out the latest LuCas version and adapted it to suit my needs.

The main extension arose from the following idea:

In the documents I want to index, gene names and identifiers are tagged (as a UIMA type 'Gene'). These identifiers are indexed so you can search for them. For faceting purposes I send these identifiers into a Lucene field named 'facetTerms'. However, I have a large number of identifiers AND the identifiers are organized in multiple categories in my application. The best thing for me would be to have a single field for each of these categories, containing only the gene identifiers belonging to that category.
This makes it easy to obtain facet counts per category.

Now I have over 20 categories and I did not like the idea of a LuCas mapping file with 20 copies of nearly the same field definition.

So I added new attributes to the field element in the mapping file. These attributes specify:

A file determining the association between each possible term and its category (same format as hypernym file, so one term can belong to multiple categories);
The naming scheme of the new fields;
Whether to ignore case when comparing the entries of the above-mentioned file to the actual terms extracted from documents.
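The three attributes above can be illustrated with a small sketch. This is not the code from the patch: it assumes the association file uses lines of the form `term=category1|category2` (my reading of the hypernym file format), and both the class `TermCategorySketch` and the naming scheme `baseName_category` are hypothetical, chosen only for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of LuCas or of the attached patch.
class TermCategorySketch {

    // Builds a category -> terms map from lines assumed to look like "term=cat1|cat2".
    // Lines without a term or without '=' are skipped.
    static Map<String, Set<String>> parse(List<String> lines, boolean ignoreCase) {
        Map<String, Set<String>> termsByCategory = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (eq <= 0) {
                continue;
            }
            String term = trimmed.substring(0, eq).trim();
            if (ignoreCase) {
                // Normalize once here so later lookups can be case-insensitive.
                term = term.toLowerCase(Locale.ROOT);
            }
            for (String category : trimmed.substring(eq + 1).split("\\|")) {
                String name = category.trim();
                if (!name.isEmpty()) {
                    termsByCategory.computeIfAbsent(name, k -> new TreeSet<>()).add(term);
                }
            }
        }
        return termsByCategory;
    }

    // Assumed naming scheme: one field per category, named "<baseName>_<category>".
    static String fieldName(String baseName, String category) {
        return baseName + "_" + category;
    }
}
```

With this sketch, a term listed under two categories ends up in the term set of both, so one field definition could expand into one field per category.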

I wrote a class which distributes the terms to their categories by creating the corresponding TokenStreams. Each TokenStream is supposed to let only those tokens pass which belong to its category. These tokens are determined by the association file described above. Thus we need the opposite of a StopWordFilter, and I've added the 'SelectFilter' for this purpose. This filter takes a set representing a closed vocabulary, lets tokens pass which are included in the set, and rejects all other tokens (this is where the ignore-case option comes into play).
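To make the filtering idea concrete, here is a minimal sketch that works on plain string lists instead of Lucene TokenStreams. It is not the SelectFilter from the patch; the class `SelectFilterSketch` and its methods are hypothetical and only mirror the behaviour described above: keep tokens that are in the closed vocabulary, drop the rest, and optionally ignore case.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical stand-in for the described filter; operates on lists, not TokenStreams.
class SelectFilterSketch {

    // Keeps only tokens contained in the vocabulary (the opposite of a stop word filter).
    static List<String> select(List<String> tokens, Set<String> vocabulary, boolean ignoreCase) {
        Set<String> lookup = new HashSet<>();
        for (String entry : vocabulary) {
            lookup.add(ignoreCase ? entry.toLowerCase(Locale.ROOT) : entry);
        }
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String key = ignoreCase ? token.toLowerCase(Locale.ROOT) : token;
            if (lookup.contains(key)) {
                kept.add(token); // the original token text is passed through unchanged
            }
        }
        return kept;
    }

    // Runs one select pass per category, yielding the tokens for each category field.
    static Map<String, List<String>> distribute(List<String> tokens,
                                                Map<String, Set<String>> termsByCategory,
                                                boolean ignoreCase) {
        Map<String, List<String>> tokensByCategory = new TreeMap<>();
        for (Map.Entry<String, Set<String>> entry : termsByCategory.entrySet()) {
            tokensByCategory.put(entry.getKey(), select(tokens, entry.getValue(), ignoreCase));
        }
        return tokensByCategory;
    }
}
```

In this sketch every category gets its own pass over the same token sequence, which corresponds to creating one filtered stream per category.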

Another thing I did was to implement a RegExp replacement filter. It simply matches each token string against a regular expression. On a match, the token string is replaced by a given replacement string (which may include regexp replacement characters like &).
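The replacement step might look roughly like the following sketch. It is not the filter from the patch: the class `RegexReplaceSketch` is hypothetical, it assumes the whole token must match the expression, and it uses the `java.util.regex` replacement syntax, where group references are written `$1`, `$2` and so on. Which replacement syntax the patch actually supports is not stated here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the described replacement filter, applied to one token string.
class RegexReplaceSketch {

    // Returns the replacement if the whole token matches the pattern, else the token itself.
    static String replace(String token, Pattern pattern, String replacement) {
        Matcher matcher = pattern.matcher(token);
        if (!matcher.matches()) {
            return token; // non-matching tokens pass through unchanged
        }
        StringBuffer result = new StringBuffer();
        // Expands group references such as $1 against the successful whole-token match.
        matcher.appendReplacement(result, replacement);
        return result.toString();
    }
}
```

For example, under these assumptions the pattern `(\w+)-(\d+)` with the replacement `$2:$1` would turn a token `gene-42` into `42:gene` and leave a token such as `the` untouched.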

Please note that the delivered patch file is not complete in terms of documentation, file headers etc. I would add these things if the changes are accepted.

Attachments


termstoFieldsDistr.patch (160 kB), 22/Jul/13 08:41, Erik Faessler
termsToFieldsDistr.patch (113 kB), 19/Jul/13 15:42, Erik Faessler
termsToFieldsDistr.patch (16 kB), 09/Jan/12 08:53, Erik Faessler

People

Assignee: Tommaso Teofili

Reporter: Erik Faessler

Votes: 0

Watchers: 2

Dates

Created: 09/Jan/12 08:47

Updated: 22/Jul/13 11:52

Resolved: 22/Jul/13 10:36

Time Tracking

Estimated, Remaining, Logged: Not Specified