  1. UIMA
  2. UIMA-1033

ConceptMapper--a highly configurable, token-based dictionary lookup UIMA component



    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.3S
    • Component/s: Sandbox-ConceptMapper
    • Labels: None
    • Environment: Java 5


      ConceptMapper is a token-based dictionary lookup UIMA component. It was
      designed specifically to allow any external tokenizer that is a UIMA
      component to be used to tokenize its dictionary. Using the same tokenizer
      on both the dictionary and for subsequent text processing prevents
      situations where a particular dictionary entry is not found, though it
      exists, because it was tokenized differently than the text being processed.
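
To make the tokenizer-mismatch problem concrete, here is a minimal sketch. Both tokenizers and the class name are hypothetical stand-ins, not UIMA or ConceptMapper components. They differ only in how they treat parentheses, which is enough to make the same string produce two different token sequences.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from ConceptMapper.
class TokenizerMismatchDemo {

    // Tokenizer A: splits on whitespace only, so "(c48.1)" stays a single token.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // Tokenizer B: splits on whitespace and also emits each parenthesis as its own token.
    static List<String> punctuationTokens(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String chunk : text.trim().split("\\s+")) {
            StringBuilder word = new StringBuilder();
            for (char c : chunk.toCharArray()) {
                if (c == '(' || c == ')') {
                    if (word.length() > 0) {
                        tokens.add(word.toString());
                        word.setLength(0);
                    }
                    tokens.add(String.valueOf(c));
                } else {
                    word.append(c);
                }
            }
            if (word.length() > 0) {
                tokens.add(word.toString());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String entry = "mesenteric fibromatosis (c48.1)";
        System.out.println(whitespaceTokens(entry));  // [mesenteric, fibromatosis, (c48.1)]
        System.out.println(punctuationTokens(entry)); // [mesenteric, fibromatosis, (, c48.1, )]
    }
}
```

If the dictionary were built with tokenizer A and the text processed with tokenizer B, a token-by-token comparison would fail at the third token even though the underlying strings are identical.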

      ConceptMapper is highly configurable, in terms of:

      • the way dictionary entries are mapped to resultant annotations
      • the way input documents are processed
      • the availability of multiple lookup strategies
      • its various output options.

      Additionally, a set of post-processing filters is supplied, as well as an
      interface to easily create new filters. This allows for overgenerating
      results during the lookup phase, if so desired, then reducing the result
      set according to particular rules.
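
The overgenerate-then-filter idea can be sketched as follows. The interface and class names here (`ResultFilter`, `CandidateMatch`, `ContainedMatchFilter`) are assumptions made for illustration; the actual filter interface shipped with ConceptMapper may look quite different. The example rule drops any candidate whose span lies inside a longer candidate.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical result record: character offsets plus the matched concept.
class CandidateMatch {
    final int begin;      // offset where the match starts
    final int end;        // offset just past the end of the match
    final String concept; // canonical form of the matched entry

    CandidateMatch(int begin, int end, String concept) {
        this.begin = begin;
        this.end = end;
        this.concept = concept;
    }
}

// Hypothetical filter contract: take all candidates, return the ones to keep.
interface ResultFilter {
    List<CandidateMatch> apply(List<CandidateMatch> candidates);
}

// Example rule: drop a candidate when a strictly longer candidate covers its span.
class ContainedMatchFilter implements ResultFilter {
    public List<CandidateMatch> apply(List<CandidateMatch> candidates) {
        List<CandidateMatch> kept = new ArrayList<CandidateMatch>();
        for (CandidateMatch m : candidates) {
            boolean contained = false;
            for (CandidateMatch other : candidates) {
                boolean covers = other.begin <= m.begin && m.end <= other.end;
                boolean longer = (other.end - other.begin) > (m.end - m.begin);
                if (covers && longer) {
                    contained = true;
                    break;
                }
            }
            if (!contained) {
                kept.add(m);
            }
        }
        return kept;
    }
}
```

With such a contract, the lookup phase can be configured to return all possible matches, and the choice of which ones survive becomes a separate, replaceable rule.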

      More details:

      The structure of the dictionary itself is quite flexible. Entries can have
      any number of variants (synonyms), and arbitrary features can be associated
      with dictionary entries. Individual variants inherit features from their
      parent token (i.e., the canonical form), but can override them or add additional
      features. In the following sample dictionary entry, there are 5 variants of
      the canonical form, and as described earlier, each inherits the SemClass
      and POS attributes from the canonical form, with the exception of the
      variant "mesenteric fibromatosis (c48.1)", which overrides the value of the
      SemClass attribute (this is somewhat of a contrived example, just to make
      that point):

      <token canonical="abdominal fibromatosis" SemClass="Diagnosis" POS="NN">
        <variant base="abdominal fibromatosis" />
        <variant base="abdominal desmoid" />
        <variant base="mesenteric fibromatosis (c48.1)"
                 SemClass="Diagnosis-Site" />
        <variant base="mesenteric fibromatosis" />
        <variant base="retroperitoneal fibromatosis" />
      </token>
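
The inheritance rule in this entry can be modeled as a simple map merge. The sketch below is an assumption about the semantics described above, not ConceptMapper's actual dictionary loader: a variant starts from the canonical form's attributes, then its own attributes override or extend them. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of attribute inheritance between a canonical form and a variant.
class VariantFeatures {

    // Returns the variant's effective attributes: canonical values first,
    // then the variant's own values, which replace or add to them.
    static Map<String, String> resolve(Map<String, String> canonical,
                                       Map<String, String> variant) {
        Map<String, String> effective = new LinkedHashMap<String, String>(canonical);
        effective.putAll(variant);
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> canonical = new LinkedHashMap<String, String>();
        canonical.put("SemClass", "Diagnosis");
        canonical.put("POS", "NN");

        // Attributes declared on the variant "mesenteric fibromatosis (c48.1)".
        Map<String, String> variant = new LinkedHashMap<String, String>();
        variant.put("SemClass", "Diagnosis-Site");

        System.out.println(resolve(canonical, variant)); // {SemClass=Diagnosis-Site, POS=NN}
    }
}
```

A variant with no attributes of its own, such as "abdominal desmoid", would resolve to exactly the canonical form's attributes.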

      Input tokens are processed one span at a time, where both the token and
      span (usually a sentence) annotation types are configurable. Additionally,
      the particular feature of the token annotation to use for lookups can be
      specified; otherwise its covered text is used. Other input configuration
      settings are whether to use case-sensitive matching, an optional class name
      of a stemmer to apply to the tokens, and a list of stop words to ignore
      during lookup. One additional input control mechanism is the ability to
      skip tokens during lookups based on particular feature values. In this way,
      it is easy to skip, for example, all tokens with particular part of speech
      tags, or with some previously computed semantic class.
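
A rough sketch of how these input settings could combine is shown below. All names are hypothetical, the stemmer step is omitted, and the stop word list is assumed to be normalized the same way as the tokens. It reduces a span's tokens to the keys that would actually be looked up.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical token: covered text plus a part-of-speech tag.
class LookupToken {
    final String text;
    final String pos;

    LookupToken(String text, String pos) {
        this.text = text;
        this.pos = pos;
    }
}

class InputFilterSketch {

    // Returns the lookup keys for one span after applying case folding,
    // the stop word list, and feature-based skipping (here: by POS tag).
    static List<String> lookupKeys(List<LookupToken> span,
                                   boolean caseSensitive,
                                   Set<String> stopWords,
                                   Set<String> skippedPosTags) {
        List<String> keys = new ArrayList<String>();
        for (LookupToken token : span) {
            String key = caseSensitive ? token.text : token.text.toLowerCase(Locale.ROOT);
            if (stopWords.contains(key)) {
                continue; // ignored during lookup
            }
            if (skippedPosTags.contains(token.pos)) {
                continue; // skipped because of its feature value
            }
            keys.add(key);
        }
        return keys;
    }
}
```

For example, with case-insensitive matching, stop words "of" and "the", and a skipped tag for punctuation, the span "Fibromatosis of the Abdomen ," would reduce to the keys "fibromatosis" and "abdomen".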

      Output is in the form of new annotations, and the type of resulting
      annotations can be specified in a descriptor file. The mapping from
      dictionary entry attributes to the result annotation features can also be
      specified. Additionally, a string containing the matched text, a list of
      matched tokens, and the span enclosing the match can be specified to be set
      in the result annotations. It is also possible to indicate dictionary
      attributes to write back into each of the matched tokens.

      Dictionary lookup is controlled by three parameters in the descriptor. One
      allows for order-independent lookup (i.e., A B == B A), and another
      toggles between finding only the longest match and finding all possible
      matches. The final parameter specifies the search strategy, of which there
      are three. The default search strategy only considers contiguous tokens
      (not including tokens from the stop word list or otherwise skipped tokens),
      and then begins the subsequent search after the longest match. The second
      strategy allows for ignoring non-matching tokens, allowing for disjoint
      matches, so that a dictionary entry of

      A C

      would match against the text

      A B C

      As with the default search strategy, the subsequent search begins after the
      longest match. The final search strategy is identical to the previous,
      except that subsequent searches begin one token ahead, instead of after the
      previous match. This enables overlapped matching.
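
The three search strategies can be compared with a small, simplified sketch. Everything here is an assumption for illustration: the enum names are made up, the dictionary is a plain list of token sequences, "longest" is taken to mean the entry with the most tokens, and stop words and skipped tokens are assumed to have been removed already. It is not ConceptMapper's implementation.

```java
import java.util.ArrayList;
import java.util.List;

class SearchStrategySketch {

    // Hypothetical names for the three strategies described above.
    enum Strategy { CONTIGUOUS, SKIP_NON_MATCHING, SKIP_NON_MATCHING_OVERLAP }

    // Tries to match 'entry' against 'text' with its first token at 'start'.
    // Returns the index of the last text token consumed, or -1 if there is no match.
    static int matchEnd(List<String> text, int start, List<String> entry, boolean allowGaps) {
        if (entry.isEmpty()) {
            return -1;
        }
        int pos = start;
        for (int k = 0; k < entry.size(); k++) {
            if (k > 0 && allowGaps) {
                // Ignore non-matching text tokens before the next entry token.
                while (pos < text.size() && !text.get(pos).equals(entry.get(k))) {
                    pos++;
                }
            }
            if (pos >= text.size() || !text.get(pos).equals(entry.get(k))) {
                return -1;
            }
            pos++;
        }
        return pos - 1;
    }

    // Returns matches formatted as "<entry tokens>@<first index>-<last index>".
    static List<String> findMatches(List<String> text, List<List<String>> dictionary,
                                    Strategy strategy) {
        boolean allowGaps = strategy != Strategy.CONTIGUOUS;
        List<String> found = new ArrayList<String>();
        int i = 0;
        while (i < text.size()) {
            List<String> best = null;
            int bestEnd = -1;
            for (List<String> entry : dictionary) {
                int end = matchEnd(text, i, entry, allowGaps);
                if (end >= 0 && (best == null || entry.size() > best.size())) {
                    best = entry;
                    bestEnd = end;
                }
            }
            if (best != null) {
                found.add(String.join(" ", best) + "@" + i + "-" + bestEnd);
            }
            if (best == null || strategy == Strategy.SKIP_NON_MATCHING_OVERLAP) {
                i++;             // resume one token ahead
            } else {
                i = bestEnd + 1; // resume after the match
            }
        }
        return found;
    }
}
```

Under these assumptions, the text `A B C D` with dictionary entries `A C` and `B D` yields no matches for the contiguous strategy, only `A C` for the second strategy (the search resumes at `D`), and both `A C` and `B D` for the third, since the search resumes at `B` and the two matches overlap.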


        1. conceptMapper.zip
          55 kB
          Michael Tanenblatt
        2. conceptMapper.zip.md5
          0.1 kB
          Michael Tanenblatt



            mbaessler Michael Baessler
            tanenblatt Michael Tanenblatt



              Time Tracking

                Original Estimate - 24h
                Remaining Estimate - 24h
                Time Spent - Not Specified