As mentioned in TIKA-1642, CTAKESContentHandler is preliminary work toward integrating Apache cTAKES into Tika, allowing users to extract biomedical information from clinical text.
Essentially, this work includes a wrapper around the CAS serializers, which dump the identified annotations to XML-based formats.
The attached patch adds CTAKESParser, a new parser that decorates AutoDetectParser and relies on a new version of CTAKESContentHandler, revised based on feedback from TIKA-1642. The parser produces the same output as AutoDetectParser and, in addition, metadata containing the clinical annotations identified by cTAKES.
To run a cTAKES AnalysisEngine through Tika's CTAKESParser, first install the latest stable release of cTAKES (3.2.2), following the instructions in the User Install Guide. Then launch Tika as follows:
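A minimal sketch of such an invocation, not necessarily the exact command from the patch. It assumes a Tika application jar built with the patch, a cTAKES installation rooted at CTAKES_HOME with desc, resources, and lib subdirectories, and placeholder paths throughout; the jar name and input path are hypothetical.

```shell
# Placeholder locations (assumptions); adjust them to your own installation.
TIKA_APP_JAR="${TIKA_APP_JAR:-/path/to/tika-app.jar}"
CTAKES_CONFIG_DIR="${CTAKES_CONFIG_DIR:-/path/to/CTAKESConfig}"
CTAKES_HOME="${CTAKES_HOME:-/path/to/ctakes}"

# Classpath: the Tika app, the parent directory of
# org/apache/tika/parser/ctakes/CTAKESConfig.properties, and the cTAKES
# descriptors, resources, and libraries (assumed directory layout).
CP="$TIKA_APP_JAR:$CTAKES_CONFIG_DIR:$CTAKES_HOME/desc:$CTAKES_HOME/resources:$CTAKES_HOME/lib/*"

# Print the command for inspection; remove the leading "echo" to run it.
echo java -cp "$CP" org.apache.tika.cli.TikaCLI \
  --config=/path/to/tika-config.xml /path/to/input
```

The classpath is quoted so that the shell passes lib/* to the JVM unexpanded, letting Java resolve the wildcard itself.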
In the example above, /path/to/CTAKESConfig is the parent directory of the file org/apache/tika/parser/ctakes/CTAKESConfig.properties, which contains the configuration properties used to build the cTAKES AnalysisEngine; tika-config.xml is a custom Tika configuration file that lists the MIME types that CTAKESParser will parse.
The attachments include example CTAKESConfig.properties and tika-config.xml files for parsing ISA-Tab files with cTAKES.
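For readers without the attachments at hand, the following sketch writes a tika-config.xml of the general shape described above. It is not a copy of the attached file: the MIME type shown for ISA-Tab is an assumption, and the parser class name is inferred from the package path of CTAKESConfig.properties.

```shell
# Output file name is a placeholder (assumption); override via CONFIG_FILE.
CONFIG_FILE="${CONFIG_FILE:-tika-config.xml}"

# Write a Tika configuration that routes one MIME type to CTAKESParser.
cat > "$CONFIG_FILE" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.ctakes.CTAKESParser">
      <!-- Assumed MIME type for ISA-Tab; replace with the types you need. -->
      <mime>application/x-isatab</mime>
    </parser>
  </parsers>
</properties>
EOF
```

Each additional MIME type that CTAKESParser should handle would get its own mime element inside the parser element.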
You need UMLS credentials in order to use the UMLS-based components of cTAKES.