Tika / TIKA-889

XHTMLContentHandler won't emit newline when HTML element matches ENDLINE set

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: parser
    • Labels:
      None

      Description

      XHTMLContentHandler.endElement checks whether the element is in the ENDLINE set to decide whether to emit a newline. The HTML element names in ENDLINE are all lowercase, but the HtmlParser class uses the XHTMLDowngradeHandler handler to uppercase all HTML element names. This means that none of the HTML elements in the web page will match the entries in the ENDLINE set.

      The INDENT set has the same problem.
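      The mismatch described above comes down to a case-sensitive set lookup. The following is a minimal sketch of that behavior only, not Tika's code: the class name, the method names, and the three entries in the set are assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class EndlineLookupSketch {
    // Illustrative stand-in for the ENDLINE set; the entries are assumed,
    // the only property that matters here is that they are lowercase.
    static final Set<String> ENDLINE =
            new HashSet<>(Arrays.asList("p", "li", "br"));

    // Exact lookup: String equality is case-sensitive, so "LI" is not "li".
    static boolean matchesExact(String elementName) {
        return ENDLINE.contains(elementName);
    }

    // Lookup after lowercasing the incoming element name.
    static boolean matchesNormalized(String elementName) {
        return ENDLINE.contains(elementName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("exact lookup of LI:      " + matchesExact("LI"));
        System.out.println("normalized lookup of LI: " + matchesNormalized("LI"));
    }
}
```

      If element names really did reach endElement in upper case, the exact lookup is the one that would miss.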

        Activity

        Ken Krugler added a comment -

        Added unit test to validate in r137506

        Ken Krugler added a comment -

        Hi John - I tried this with trunk, and it works as expected.

        Yes, it's true that XHTMLDowngradeHandler will uppercase the element names, but then DefaultHtmlMapper.mapSafeElement() lower-cases them (I know, seems odd to me too). So the comparison works, and I see the expected output.

        I'm adding a test case to validate behavior, at least for a simple <ul><li>xxx</li></ul> example.
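        The name flow described in this comment can be sketched without Tika on the classpath. This is a simplified model under stated assumptions: `upper` and `mapped` are hypothetical stand-ins for the uppercasing and lowercasing stages, the event encoding (`/name` for an end tag, `#text` for character data) is invented for the sketch, and the set holds only the two elements from the example.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class NameFlowSketch {
    // Assumed subset of a lowercase ENDLINE-style set.
    static final Set<String> ENDLINE =
            new HashSet<>(Arrays.asList("ul", "li"));

    // Hypothetical stand-in for the stage that uppercases element names.
    static String upper(String name) {
        return name.toUpperCase(Locale.ROOT);
    }

    // Hypothetical stand-in for the stage that lowercases them again.
    static String mapped(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    // Walks a flat event list and appends a newline after each end tag
    // whose name, after both stages, is found in ENDLINE.
    static String render(String[] events) {
        StringBuilder out = new StringBuilder();
        for (String event : events) {
            if (event.startsWith("/")) {
                String name = mapped(upper(event.substring(1)));
                if (ENDLINE.contains(name)) {
                    out.append('\n');
                }
            } else if (event.startsWith("#")) {
                out.append(event.substring(1));
            }
            // Start tags produce no text in this sketch.
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Events for <ul><li>xxx</li></ul>
        String[] events = {"ul", "li", "#xxx", "/li", "/ul"};
        System.out.print(render(events));
    }
}
```

        Because the lowercasing stage runs after the uppercasing stage, the lookup sees "li" and "ul" and the sketch emits a newline for each closing tag.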

        Chris A. Mattmann added a comment -
        • classify

          People

          • Assignee:
            Ken Krugler
            Reporter:
            John Conwell
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:
