  1. Tika
  2. TIKA-1645

Extraction of biomedical information using CTAKESParser



    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels:


      As mentioned in TIKA-1642, CTAKESContentHandler is preliminary work toward integrating Apache cTAKES into Tika, allowing users to extract biomedical information from clinical text.
      Essentially, this work includes a wrapper around the CAS serializers, which dump the identified annotations in XML-based formats.

      The attached patch includes CTAKESParser, a new parser that decorates AutoDetectParser and relies on a new version of CTAKESContentHandler, revised based on feedback from TIKA-1642. This parser generates the same output as AutoDetectParser and, in addition, metadata containing the clinical annotations identified by cTAKES.
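
      The decorator idea described above can be sketched in plain Java. This is only an illustration of the pattern, not the code in the patch: TextParser, PlainTextParser and AnnotatingParser are hypothetical stand-ins (they are not Tika or cTAKES classes), and a simple keyword match replaces the real cTAKES analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical parser contract: returns extracted text and fills a metadata map.
interface TextParser {
    String parse(String input, Map<String, List<String>> metadata);
}

// Hypothetical base parser, standing in for the parser being decorated.
class PlainTextParser implements TextParser {
    @Override
    public String parse(String input, Map<String, List<String>> metadata) {
        metadata.computeIfAbsent("Content-Type", k -> new ArrayList<>()).add("text/plain");
        return input.trim();
    }
}

// Decorator: delegates parsing unchanged, then adds annotation metadata on top.
class AnnotatingParser implements TextParser {
    private final TextParser delegate;
    private final List<String> vocabulary; // toy replacement for a real annotator

    AnnotatingParser(TextParser delegate, List<String> vocabulary) {
        this.delegate = delegate;
        this.vocabulary = vocabulary;
    }

    @Override
    public String parse(String input, Map<String, List<String>> metadata) {
        // Same text output as the wrapped parser.
        String text = delegate.parse(input, metadata);
        // Extra metadata: every vocabulary term found in the extracted text.
        String lower = text.toLowerCase(Locale.ROOT);
        for (String term : vocabulary) {
            if (lower.contains(term.toLowerCase(Locale.ROOT))) {
                metadata.computeIfAbsent("annotations", k -> new ArrayList<>()).add(term);
            }
        }
        return text;
    }
}

class DecoratorDemo {
    public static void main(String[] args) {
        Map<String, List<String>> metadata = new LinkedHashMap<>();
        TextParser parser =
                new AnnotatingParser(new PlainTextParser(), Arrays.asList("aspirin", "fever"));
        String text = parser.parse("  Patient given aspirin for fever. ", metadata);
        System.out.println(text);
        System.out.println(metadata);
    }
}
```

      The point of the design is that callers still receive everything the wrapped parser produces, and the annotations arrive as additional metadata entries.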

      To run a cTAKES AnalysisEngine through Tika's CTAKESParser, you first need to install the latest stable release of cTAKES (3.2.2), following the instructions in the User Install Guide. You can then launch Tika as follows:

      java -cp tika-app-1.10-SNAPSHOT.jar:${CTAKES_HOME}/desc:${CTAKES_HOME}/resources:${CTAKES_HOME}/lib/*:/path/to/CTAKESConfig org.apache.tika.cli.TikaCLI --config=/path/to/tika-config.xml /path/to/input

      In the example above, /path/to/CTAKESConfig is the parent directory of the file org/apache/tika/parser/ctakes/CTAKESConfig.properties, which contains the configuration properties used to build the cTAKES AnalysisEngine; tika-config.xml is a custom Tika configuration file that lists the MIME types CTAKESParser will parse.
      Examples of both CTAKESConfig.properties and tika-config.xml, set up to parse ISA-Tab files using cTAKES, are attached.
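
      To illustrate why the parent directory (rather than the file itself) goes on the classpath, here is a minimal Java sketch of classpath resource lookup. It assumes the properties file is read as a classpath resource; how the patch actually loads it is not shown here. ClasspathConfigSketch, loadFrom and the key exampleKey are hypothetical names used only for illustration and are not real CTAKESConfig properties.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class ClasspathConfigSketch {
    // Resource path as given in the description above.
    static final String RESOURCE = "org/apache/tika/parser/ctakes/CTAKESConfig.properties";

    // Treats configDir as a classpath entry and resolves RESOURCE relative to it.
    static Properties loadFrom(Path configDir) throws IOException {
        URL[] entries = { configDir.toUri().toURL() };
        try (URLClassLoader loader = new URLClassLoader(entries, null);
             InputStream in = loader.getResourceAsStream(RESOURCE)) {
            if (in == null) {
                throw new IOException(RESOURCE + " not found under " + configDir);
            }
            Properties props = new Properties();
            props.load(in);
            return props;
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory that mimics /path/to/CTAKESConfig.
        Path configDir = Files.createTempDirectory("CTAKESConfig");
        Path file = configDir.resolve(RESOURCE);
        Files.createDirectories(file.getParent());
        // exampleKey is a placeholder, not an actual configuration property.
        Files.write(file, "exampleKey=exampleValue\n".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(loadFrom(configDir).getProperty("exampleKey"));
    }
}
```

      The lookup succeeds only when the directory that contains the org/ subtree is a classpath entry, which matches the role of /path/to/CTAKESConfig in the command line.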

      You need UMLS credentials in order to use the UMLS-based components of cTAKES.

      I would really appreciate your feedback.
      Thanks to Selina Chu, Chris A. Mattmann and Lewis John McGibbney for supporting me in this work.


        1. CTAKESConfig.properties
          0.2 kB
          Giuseppe Totaro
        2. TIKA-1645.patch
          32 kB
          Giuseppe Totaro
        3. TIKA-1645.v02.patch
          33 kB
          Giuseppe Totaro
        4. tika-config.xml
          0.2 kB
          Giuseppe Totaro

              • Assignee:
                chrismattmann Chris A. Mattmann
                gostep Giuseppe Totaro
              • Votes:
                0
              • Watchers:
                3


                • Created: