  1. UIMA
  2. UIMA-2318

Define automatic distribution of a closed term set over multiple fields in one field definition.


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.0Addons
    • Component/s: Sandbox-Lucas
    • Labels: None

    Description

      In the course of my work I needed LuCas to do a few things that were not possible, or at least not easy, out of the box. I checked out the latest LuCas version and adapted it to suit my needs.

      The main extension arose from the following idea:

      In the documents I want to index, gene names and identifiers are tagged (as the UIMA type 'Gene'). These identifiers are indexed so that they can be searched. For faceting purposes I send these identifiers to a Lucene field named 'facetTerms'. However, I have a large number of identifiers, and they are organized into multiple categories in my application. Ideally, each category would have its own field containing only the gene identifiers that belong to that category.
      This makes it easy to obtain facet counts per category.

      Now I have over 20 categories and I did not like the idea of a LuCas mapping file with 20 copies of nearly the same field definition.

      So I added new attributes to the field element in the mapping file. These attributes specify:

      • A file determining the association between each possible term and its category (same format as hypernym file, so one term can belong to multiple categories);
      • The naming scheme of the new fields;
      • Whether to ignore case when comparing the entries of the above-mentioned file with the actual terms extracted from documents.
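
To make the three attributes more concrete, here is a minimal Java sketch of how an association file could be turned into one term set per category. It is not taken from the patch. The line format "term=categoryA|categoryB", the class name TermCategoryMap, the "_" naming scheme, and the sample gene names are all assumptions made for illustration.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of LuCas or of the attached patch.
final class TermCategoryMap {

    // Parses lines of the ASSUMED form "term=categoryA|categoryB"
    // and returns a map from category name to the terms it contains.
    static Map<String, Set<String>> parse(List<String> lines, boolean ignoreCase) {
        Map<String, Set<String>> termsByCategory = new TreeMap<>();
        for (String line : lines) {
            String trimmed = line.trim();
            int eq = trimmed.indexOf('=');
            if (trimmed.isEmpty() || eq <= 0) {
                continue; // skip blank or malformed lines
            }
            String term = trimmed.substring(0, eq).trim();
            if (ignoreCase) {
                term = term.toLowerCase(Locale.ROOT);
            }
            // One term may belong to several categories.
            for (String rawCategory : trimmed.substring(eq + 1).split("\\|")) {
                String category = rawCategory.trim();
                if (!category.isEmpty()) {
                    termsByCategory.computeIfAbsent(category, k -> new HashSet<>()).add(term);
                }
            }
        }
        return termsByCategory;
    }

    // ASSUMED naming scheme: base field name, underscore, category name.
    static String fieldName(String baseField, String category) {
        return baseField + "_" + category;
    }
}
```

With the made-up lines "BRCA1=cancer|dna_repair" and "TP53=cancer", this yields two categories, and fieldName("facetTerms", "cancer") gives "facetTerms_cancer". A single field definition plus such a file would then stand in for one near-identical field definition per category.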

      I wrote a class that distributes the terms to their categories by creating the corresponding TokenStreams. Each TokenStream is supposed to let only those tokens pass that belong to its category. These tokens are determined by the association file described above. Thus we need the opposite of a StopWordFilter. I've added the 'SelectFilter' for this purpose. This filter takes a set representing a closed vocabulary, lets tokens pass that are included in the set, and rejects all other tokens (this is where the ignore-case option comes into play).
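
The "opposite of a stop word filter" idea can be sketched in plain Java as follows. This is an illustration only: the actual SelectFilter in the patch would wrap a Lucene TokenStream, which is left out here to keep the example self-contained. The class name SelectFilterSketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Illustrative stand-in for the 'SelectFilter' idea; it operates on plain
// strings instead of Lucene tokens (an assumption made for brevity).
final class SelectFilterSketch {
    private final boolean ignoreCase;
    private final Set<String> vocabulary = new HashSet<>();

    SelectFilterSketch(Collection<String> closedVocabulary, boolean ignoreCase) {
        this.ignoreCase = ignoreCase;
        for (String term : closedVocabulary) {
            vocabulary.add(normalize(term));
        }
    }

    private String normalize(String s) {
        return ignoreCase ? s.toLowerCase(Locale.ROOT) : s;
    }

    // A token passes only if it is part of the closed vocabulary.
    boolean accepts(String token) {
        return vocabulary.contains(normalize(token));
    }

    // Keeps accepted tokens in their original form and order; drops the rest.
    List<String> filter(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (accepts(token)) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

One such filter per category, each built from that category's term set, would produce the per-category streams described above. A stop word filter would use the inverse test, dropping tokens found in the set.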

      Another thing I did was to implement a RegExp replacement filter. It simply matches the token string against a regular expression. On a match, the token string is replaced by a given replacement string (which may include reg exp replacement characters like &).
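
A regular-expression replacement step of this kind might look like the following sketch, built on java.util.regex. It is not the filter from the patch: the class name RegexReplaceSketch is hypothetical, and the choice to replace every match within the token (rather than requiring the whole token to match) is an assumption. Note that java.util.regex writes back-references in the replacement string as $0, $1, and so on.

```java
import java.util.regex.Pattern;

// Hypothetical sketch of a regular-expression replacement step applied to
// a single token string; the real filter would sit in a Lucene token chain.
final class RegexReplaceSketch {
    private final Pattern pattern;
    private final String replacement;

    RegexReplaceSketch(String regex, String replacement) {
        this.pattern = Pattern.compile(regex);
        this.replacement = replacement;
    }

    // Replaces every match of the pattern in the token. Tokens without a
    // match are returned unchanged. "$1" etc. refer to capture groups.
    String apply(String token) {
        return pattern.matcher(token).replaceAll(replacement);
    }
}
```

For example, new RegexReplaceSketch("^gene:(\\d+)$", "GENEID_$1").apply("gene:672") returns "GENEID_672", while a token such as "protein" passes through untouched.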

      Please note that the delivered patch file is not complete in terms of documentation, file headers etc. I would add these things if the changes are accepted.

      Attachments

        1. termsToFieldsDistr.patch
          16 kB
          Erik Faessler
        2. termsToFieldsDistr.patch
          113 kB
          Erik Faessler
        3. termstoFieldsDistr.patch
          160 kB
          Erik Faessler

          People

            Assignee: Tommaso Teofili (teofili)
            Reporter: Erik Faessler (chew)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved:

              Time Tracking

                Estimated: 2h
                Remaining: 2h
                Logged: Not Specified
