Tika / TIKA-889

XHTMLContentHandler won't emit a newline when an HTML element matches the ENDLINE set


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: parser
    • Labels: None

    Description

      XHTMLContentHandler.endElement checks whether the element name is in the ENDLINE set to decide if it should emit a newline. The element names in ENDLINE are all lowercase, but the HtmlParser class uses XHTMLDowngradeHandler, which converts all HTML element names to uppercase. As a result, no HTML element in the parsed page ever matches an entry in the ENDLINE set, so no newline is emitted.

      The same problem affects the INDENT set.
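
      The reported mismatch can be illustrated with a minimal, self-contained sketch. It is not Tika's code: the class name, the method names, and the contents of the set below are hypothetical stand-ins, and the normalizing lookup is only one possible way to make the check case-insensitive. Note that the issue was resolved as Cannot Reproduce, so this shows the mechanism as described, not confirmed behavior.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EndlineLookupSketch {
    // Hypothetical stand-in for a lowercase ENDLINE set; the real contents may differ.
    static final Set<String> ENDLINE =
            new HashSet<>(Arrays.asList("p", "div", "br", "li"));

    // Case-sensitive lookup, as described in the report.
    static boolean matchesExact(String name) {
        return ENDLINE.contains(name);
    }

    // Assumed alternative: normalize the name before the lookup.
    static boolean matchesIgnoringCase(String name) {
        return ENDLINE.contains(name.toLowerCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        // Simulates an element name that an upstream handler has uppercased.
        String upper = "p".toUpperCase(Locale.ENGLISH);
        System.out.println(matchesExact(upper));        // prints "false"
        System.out.println(matchesIgnoringCase(upper)); // prints "true"
    }
}
```

      Using a fixed locale for the case conversion avoids locale-specific surprises, such as the Turkish dotless "i", when normalizing element names.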

          People

            Assignee: Kenneth William Krugler (kkrugler)
            Reporter: John Conwell (jconwell)
            Votes: 0
            Watchers: 2
