  1. UIMA
  2. UIMA-4115

TikaAnnotator: incorrect order of tag processing

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.1Addons
    • Fix Version/s: None
    • Component/s: addons
    • Labels:

      Description

      org.apache.uima.tika.MarkupAnnotator produces incorrect output because of a bug in org.apache.uima.tika.MarkupHandler. The problem is in the end-element event handler: MarkupHandler#endElement should close open tags by removing them from the stack in last-in, first-out order (when an end tag is found, the most recently added corresponding start tag should be removed first). The current implementation instead removes start elements beginning from the first open element, so the processor annotates incorrect text spans.

      The fix is trivial:
      in MarkupHandler#endElement replace startedAnnotations.iterator() with
      startedAnnotations.descendingIterator().
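
The effect of the iteration order can be illustrated with a minimal, self-contained sketch. This is not the actual MarkupHandler code: the class TagStackSketch, the OpenTag type, and the offset-based startElement/endElement signatures are hypothetical, and the sketch assumes open tags are appended to the end of a Deque and matched by name. Only Deque#iterator() and Deque#descendingIterator() are taken from the report.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

public class TagStackSketch {

    /** Hypothetical stand-in for an annotation opened by a start tag. */
    static final class OpenTag {
        final String name;
        final int begin;

        OpenTag(String name, int begin) {
            this.name = name;
            this.begin = begin;
        }
    }

    // Assumption: open tags are appended at the tail, so the newest is last.
    private final Deque<OpenTag> startedAnnotations = new ArrayDeque<>();
    private final List<String> spans = new ArrayList<>();
    private final boolean newestFirst;

    TagStackSketch(boolean newestFirst) {
        this.newestFirst = newestFirst;
    }

    void startElement(String name, int offset) {
        startedAnnotations.addLast(new OpenTag(name, offset));
    }

    void endElement(String name, int offset) {
        // iterator() walks oldest to newest; descendingIterator() walks
        // newest to oldest, which is the order the report asks for.
        Iterator<OpenTag> it = newestFirst
                ? startedAnnotations.descendingIterator()
                : startedAnnotations.iterator();
        while (it.hasNext()) {
            OpenTag open = it.next();
            if (open.name.equals(name)) {
                it.remove();
                spans.add(name + "[" + open.begin + "," + offset + ")");
                return;
            }
        }
    }

    /** Replays <div>ab<div>cd</div>ef</div> over the text "abcdef". */
    static List<String> run(boolean newestFirst) {
        TagStackSketch handler = new TagStackSketch(newestFirst);
        handler.startElement("div", 0); // outer <div>
        handler.startElement("div", 2); // inner <div>
        handler.endElement("div", 4);   // closes the inner <div>
        handler.endElement("div", 6);   // closes the outer <div>
        return handler.spans;
    }

    public static void main(String[] args) {
        System.out.println("iterator():           " + run(false));
        System.out.println("descendingIterator(): " + run(true));
    }
}
```

With iterator(), the first end tag is paired with the outer start tag, giving the spans div[0,4) and div[2,6), which overlap instead of nesting. With descendingIterator(), the spans are div[2,4) and div[0,6), matching the nesting of the markup.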

            People

            • Assignee: Unassigned
            • Reporter: Vadym Oliinyk (voli)
            • Votes: 0
            • Watchers: 1

              Dates

              • Created:
                Updated:

                Time Tracking

                • Original Estimate: 1h
                • Remaining Estimate: 1h
                • Time Spent: Not Specified