[UIMA-4286] Ruta: HTMLConverter: Option to convert tags outside body tags - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.2.1ruta
Fix Version/s: 2.3.0ruta
Component/s: Ruta
Labels: None

Description

The HTML converter only converts tags found inside the body tag. As a result, information-carrying tags such as citations are left out when the converter is applied to XML articles that contain a lot of metadata. An option to convert all tags would be useful, since it would allow content outside the body to be parsed by natural language analysers as well.

As its name implies, the converter was originally conceived for HTML documents. Together with the HTML Annotator, however, such an option would make it more generally useful by enabling NL parsing of a broader class of documents, such as articles stored as XML.

An example of how this option might work can be obtained by disabling the "inBody" flag inside the HTMLConverterVisitor. The example also illustrates which offsets to apply to such annotations; otherwise the document annotation offsets can be used. Empty tags can still be ignored, but tags that have only attributes and no content should preferably be converted.
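To make the effect of such a flag concrete, here is a minimal, self-contained sketch. It is not the Ruta implementation: the class `BodyFlagSketch`, the method `toPlainText` and the parameter `onlyInsideBody` are hypothetical names, and a simple regular expression stands in for the real HTML parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the actual HTMLConverterVisitor.
final class BodyFlagSketch {
    // Matches an opening or closing tag: group 1 is "/" for a closing tag,
    // group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9:_-]*)[^>]*>");

    // Strips tags from the markup. When onlyInsideBody is true, text outside
    // <body>...</body> is dropped (the behaviour described in this issue);
    // when it is false, all text between tags is kept.
    static String toPlainText(String markup, boolean onlyInsideBody) {
        StringBuilder out = new StringBuilder();
        boolean inBody = !onlyInsideBody; // with the constraint off, always "in body"
        int last = 0;
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            if (inBody) {
                out.append(markup, last, m.start()); // text preceding this tag
            }
            last = m.end();
            if (onlyInsideBody && m.group(2).equalsIgnoreCase("body")) {
                inBody = m.group(1).isEmpty(); // <body> opens, </body> closes
            }
        }
        if (inBody) {
            out.append(markup, last, markup.length()); // trailing text, if any
        }
        return out.toString();
    }
}
```

For the input `<article><front><citation>Smith 2014</citation></front><body><p>Main text.</p></body></article>`, calling `toPlainText(input, true)` yields `Main text.`, while `toPlainText(input, false)` yields `Smith 2014Main text.`, so the citation is kept but runs directly into the body text.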

Experiments with disabling the "in body" constraint reveal an additional need to separate the content of metadata tags in the converted text view. An NL parser reading the text will in many cases read the content of different tags as one word or one sentence, which is not desirable. A text delimiter should therefore be inserted between tags where required, and it could optionally be customizable.
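The delimiter idea could look roughly like the sketch below. It is an assumption-laden illustration, not the converter's code: `TagDelimiterSketch`, `toPlainText` and the `delimiter` parameter are hypothetical, and a regular expression again replaces a real parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Ruta.
final class TagDelimiterSketch {
    // Matches any tag, opening or closing.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Strips tags and joins the non-blank text segments found between them
    // with the given delimiter, so adjacent tag contents do not run together.
    static String toPlainText(String markup, String delimiter) {
        StringBuilder out = new StringBuilder();
        int last = 0;
        Matcher m = TAG.matcher(markup);
        while (m.find()) {
            appendSegment(out, markup.substring(last, m.start()), delimiter);
            last = m.end();
        }
        appendSegment(out, markup.substring(last), delimiter);
        return out.toString();
    }

    private static void appendSegment(StringBuilder out, String segment, String delimiter) {
        if (segment.trim().isEmpty()) {
            return; // nothing but whitespace between two tags
        }
        if (out.length() > 0) {
            out.append(delimiter); // separate from the previous segment
        }
        out.append(segment);
    }
}
```

With the input `<title>Deep Parsing</title><author>Doe</author>`, an empty delimiter gives `Deep ParsingDoe`, whereas a newline delimiter gives the two values on separate lines. This sketch inserts the delimiter at every tag boundary, including inline tags inside a sentence; a real option would presumably restrict it to the tags where separation is required.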

People

Assignee: Peter Klügl

Reporter: Mario Juric

Votes: 0

Watchers: 2

Dates

Created: 13/Mar/15 22:45

Updated: 30/Apr/15 08:45

Resolved: 29/Apr/15 12:54