cTAKES / CTAKES-498

UmlsOverlapLookupAnnotator skips tokens inconsistently?


Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved

    Description

      Hi Sean,

      I'm perplexed. It seems as if the number of tokens that the UmlsOverlapLookupAnnotator will skip varies with the content of the RareWordDictionary.

      Here's my setup. I think I've included enough information to replicate my perplexity, if you have time/inclination to do that; let me know if I've left anything out.

      I have a custom dictionary built from UMLS sources including SNOMEDCT_US:

      sql> select cui,text from cui_terms where text='chronic kidney disease' or cui in (2316786,2316787);
       CUI TEXT
      ------- --------------------------------
      1561643 chronic kidney disease
      2316787 stage 3 chronic kidney disease
      2316787 chronic kidney disease stage 3
      2316787 chronic kidney disease , stage 3
      2316787 ckd stage 3
      2316786 chronic kidney disease stage 2
      2316786 chronic kidney disease , stage 2
      2316786 stage 2 chronic kidney disease
      2316786 ckd stage 2
      Fetched 9 rows.
      sql> 
      

      My documents contain acronym expansions and Roman numerals for stages, like this:

      Problem List:
      CKD (chronic kidney disease), stage II
      Decubitus ulcer - grade II

      So I create a BSV RareWordDictionary to capture the Roman numerals.
      I don't want to have to guess at all the possible punctuation variations,
      so I try to make my entries as general as I safely can,
      using the UmlsOverlapLookupAnnotator with consecutiveSkips set to 2.

       

      C2316786|chronic kidney disease II
      C2316787|chronic kidney disease III
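
Entries like these could also be derived from the staged rows in the query above rather than typed by hand. A minimal sketch, assuming the stage always appears as "stage N" with N from 1 to 5; the names `to_bsv_line` and `bsv_lines` are made up for illustration and are not part of cTAKES:

```python
import re

# Arabic stage digits mapped to the Roman numerals used in the documents.
ROMAN = {"1": "I", "2": "II", "3": "III", "4": "IV", "5": "V"}

# Staged rows copied from the cui_terms query output above.
ROWS = [
    (2316787, "stage 3 chronic kidney disease"),
    (2316787, "chronic kidney disease stage 3"),
    (2316787, "chronic kidney disease , stage 3"),
    (2316787, "ckd stage 3"),
    (2316786, "chronic kidney disease stage 2"),
    (2316786, "chronic kidney disease , stage 2"),
    (2316786, "stage 2 chronic kidney disease"),
    (2316786, "ckd stage 2"),
]


def to_bsv_line(cui, text):
    """Return 'C<cui>|<term> <roman>' with 'stage N' and commas removed."""
    match = re.search(r"\bstage ([1-5])\b", text)
    if match is None:
        raise ValueError(f"no 'stage N' found in {text!r}")
    # Drop the 'stage N' span, then drop standalone comma tokens.
    rest = text[:match.start()] + text[match.end():]
    words = [tok for tok in rest.split() if tok != ","]
    return f"C{cui:07d}|{' '.join(words)} {ROMAN[match.group(1)]}"


def bsv_lines(rows):
    """Return the de-duplicated BSV lines for all rows, sorted."""
    return sorted({to_bsv_line(cui, text) for cui, text in rows})


for line in bsv_lines(ROWS):
    print(line)
```

For the eight staged rows this collapses to four entries: the two shown above plus "ckd II" and "ckd III".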

       

      I add dictionary and dictionaryConceptPair entries for my BSV file to cTakesHsql.xml as shown in the example/ directory, using SemanticCleanupTermConsumer as rareWordConsumer.

      Success! Now "chronic kidney disease), stage II" gets annotated as a DiseaseDisorderMention with CUI C2316786.

      But a couple of things confuse me.

      1. Removing an entry

      If I remove the other BSV entry, "chronic kidney disease III",
      "chronic kidney disease), stage II" isn't identified anymore:
      suddenly it only annotates "chronic kidney disease", with C1561643.

      2. Adding an entry

      My documents also have language like "Decubitus ulcer - stage II".

      If I add an entry for this to my BSV dictionary, so now I have:

       

      C2316786|chronic kidney disease II
      C2316787|chronic kidney disease III
      C1720518|decubitus ulcer II
      

      and annotate this text:

      Problem List:
      CKD (chronic kidney disease), stage II
      Decubitus ulcer - grade II

      then "Decubitus ulcer - grade II" gets annotated as a DiseaseDisorderMention with C1720518, as hoped.
      But only "chronic kidney disease" is identified, as before – not "chronic kidney disease), stage II".
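
One thing that does differ across cases 1 and 2 is how often each token occurs within the BSV file. The sketch below is only a hypothesis, not confirmed against the cTAKES source. It assumes a rare-word dictionary indexes each term under its least frequent token, counted across the whole dictionary, with ties going to the earliest token. The function `rare_word_index` is hypothetical.

```python
from collections import Counter


def rare_word_index(terms):
    """Map each term to the token it would be indexed under.

    Assumption: the index token is the term's least frequent token
    across all terms; on a tie, the earliest token in the term wins.
    """
    counts = Counter(tok for term in terms for tok in term.split())
    # min() returns the first minimal element, which gives the tie-break.
    return {term: min(term.split(), key=lambda tok: counts[tok])
            for term in terms}


CKD2 = "chronic kidney disease II"
CKD3 = "chronic kidney disease III"
ULCER = "decubitus ulcer II"

original = rare_word_index([CKD2, CKD3])         # the setup that works
removed = rare_word_index([CKD2])                # case 1
added = rare_word_index([CKD2, CKD3, ULCER])     # case 2

print(original[CKD2], removed[CKD2], added[CKD2])
```

Under that assumption, the "chronic kidney disease II" entry is indexed under "II" only in the original two-entry dictionary. In cases 1 and 2 it ties on every token and falls back to "chronic". That would at least be consistent with the lookup behaving differently as entries are added or removed.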

      3. Adding a comma

      If I add an entry with a comma in it:

      C2316786|chronic kidney disease , II

      then "chronic kidney disease), stage II" gets picked up, no matter what.

      What perplexes me is that the UmlsOverlapLookupAnnotator seems willing to skip more tokens depending on what else is in the dictionary.
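
For comparison, here is a naive model of skip matching. It is an illustration only, not the annotator's actual algorithm. It assumes the text tokenizes on whitespace as `chronic kidney disease ) , stage II`, and it models only the consecutive-skip limit. The names `toks` and `matches_with_skips` are hypothetical.

```python
def toks(text):
    """Lowercase and split on whitespace (assumed tokenization)."""
    return text.lower().split()


def matches_with_skips(term, text, max_skips):
    """True if the term's tokens occur in order in the text with at most
    max_skips consecutive non-matching tokens between neighbours."""
    if not term:
        return False

    def match_from(term_idx, text_idx):
        # All term tokens placed: the term matched.
        if term_idx == len(term):
            return True
        # Try skipping 0..max_skips tokens before the next term token.
        for skip in range(max_skips + 1):
            pos = text_idx + skip
            if pos >= len(text):
                return False
            if text[pos] == term[term_idx] and match_from(term_idx + 1, pos + 1):
                return True
        return False

    return any(text[i] == term[0] and match_from(1, i + 1)
               for i in range(len(text)))


ckd_text = toks("chronic kidney disease ) , stage II")
ulcer_text = toks("Decubitus ulcer - grade II")

print(matches_with_skips(toks("chronic kidney disease II"), ckd_text, 2))
print(matches_with_skips(toks("chronic kidney disease , II"), ckd_text, 2))
print(matches_with_skips(toks("decubitus ulcer II"), ulcer_text, 2))
```

Under this model, three tokens separate "disease" from "II", so "chronic kidney disease II" should not match with a limit of 2. The comma entry and "decubitus ulcer II" should match. If the real annotator works anything like this, case 3 is the expected result and the original success is the surprising one.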

      Can you replicate this, or is there something else I missed about my configuration?
      Is this expected behavior?
      If so, can you help me understand what to expect?
      At this point I hesitate to add anything to the BSV dictionary!


          People

            Assignee: Unassigned
            Reporter: Kean Kaufmann (keanR1)