  1. UIMA
  2. UIMA-4286

Ruta: HTMLConverter: Option to convert tags outside body tags

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.2.1ruta
    • Fix Version/s: 2.3.0ruta
    • Component/s: Ruta
    • Labels:
      None

      Description

      The HTML converter only converts tags found inside the body tag. As a result, information-carrying tags such as citations are left out when the converter is applied to XML articles with extensive metadata. It would be useful to add an option to convert all tags, since this would allow content outside the body to be processed by natural language analysers as well.

      As the name implies, the converter was originally conceived for HTML documents. With this option, and together with the HTML Annotator, it becomes more generally useful: it enables NL parsing of a broader class of documents, such as articles stored as XML.

      Disabling the "inBody" flag inside the HTMLConverterVisitor gives an example of how this option might work. That example also illustrates which offsets to apply to such annotations; otherwise the document annotation offsets can be used. Empty tags can still be ignored, but tags that have attributes and no content should preferably be converted.
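
      The effect of such a flag can be sketched outside of Ruta. The following is a minimal illustration in Python using the standard `html.parser` module, not the HTMLConverterVisitor code itself; the class name `TextExtractor`, the function `extract`, and the option name `only_body` are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content, optionally restricted to the <body> element."""

    def __init__(self, only_body=True):
        super().__init__()
        self.only_body = only_body  # hypothetical option, mirrors the "inBody" idea
        self.in_body = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False

    def handle_data(self, data):
        # Keep text if we are inside <body>, or if the restriction is disabled.
        if self.in_body or not self.only_body:
            text = data.strip()
            if text:
                self.chunks.append(text)


def extract(markup, only_body=True):
    """Return the list of non-empty text chunks found in the markup."""
    parser = TextExtractor(only_body=only_body)
    parser.feed(markup)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    doc = ("<article><front><citation>Smith 2014</citation></front>"
           "<body><p>Main text.</p></body></article>")
    print(extract(doc, only_body=True))
    print(extract(doc, only_body=False))
```

      With the restriction enabled, only the paragraph inside the body is kept; with it disabled, the citation in the front matter is collected as well.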

      Experiments with disabling the "in body" constraint reveal an additional need to separate the content of metadata tags in the converted text view. An NL parser reading the text will in many cases read the content of different tags as one word or one sentence, which is not desirable. A text delimiter should therefore be inserted between tags where required, and it could optionally be customizable as well.
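
      The delimiter problem can be reproduced with a small sketch. This is an assumption-laden illustration in Python (standard `html.parser` only), not the converter's implementation; `DelimitingConverter`, `convert`, and the `delimiter` parameter are hypothetical names. As a simplification it treats every tag as a boundary, including inline tags, which a real converter would probably want to make configurable.

```python
from html.parser import HTMLParser


class DelimitingConverter(HTMLParser):
    """Builds a plain-text view, inserting a delimiter between text of different tags."""

    def __init__(self, delimiter="\n"):
        super().__init__()
        self.delimiter = delimiter  # hypothetical, customizable separator
        self.parts = []
        self.boundary = False  # True if a tag was seen since the last text chunk

    def handle_starttag(self, tag, attrs):
        self.boundary = True

    def handle_endtag(self, tag):
        self.boundary = True

    def handle_data(self, data):
        if not data.strip():
            return  # ignore whitespace-only text between tags
        if self.parts and self.boundary:
            self.parts.append(self.delimiter)
        self.parts.append(data)
        self.boundary = False


def convert(markup, delimiter="\n"):
    """Return the converted text, with the delimiter placed at tag boundaries."""
    converter = DelimitingConverter(delimiter)
    converter.feed(markup)
    converter.close()
    return "".join(converter.parts)


if __name__ == "__main__":
    front = "<front><title>Example Title</title><author>A. Author</author></front>"
    print(repr(convert(front, delimiter="")))    # contents run together
    print(repr(convert(front, delimiter="\n")))  # contents separated
```

      With an empty delimiter the title and author fuse into a single token sequence, which is the behaviour described above; a newline keeps them apart for a downstream sentence splitter.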

            People

            • Assignee:
              pkluegl Peter Klügl
              Reporter:
              mjuric Mario Juric
            • Votes:
              0
              Watchers:
              2