  1. Tika
  2. TIKA-343

Some parsers produce glued words

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5, 0.6
    • Fix Version/s: 0.6
    • Component/s: parser
    • Labels:
      None

      Description

      Some parsers ignore word/line delimiters.

      Document:
      "<html><head></head><body>test<br>test</body></html>"
      is decoded by HtmlParser to "testtest".

      I think the HtmlParser.mapSafeElement method should be extended with the following mappings:

      if ("BR".equals(name)) return "br";
      if ("DIV".equals(name)) return "div";
      if ("HR".equals(name)) return "hr";
      if ("ADDRESS".equals(name)) return "address";
      if ("FIELDSET".equals(name)) return "fieldset";
      if ("FORM".equals(name)) return "form";
      if ("NOSCRIPT".equals(name)) return "noscript";
      if ("NOFRAMES".equals(name)) return "noframes";
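
      A minimal, self-contained sketch of the mapping proposed above. The class name SafeElementSketch and the set-based lookup are illustrative assumptions, not Tika's actual HtmlParser code. It also assumes the method receives upper-case element names and returns null for elements it does not map.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical stand-in for the proposed extension; not Tika's real class.
public class SafeElementSketch {
    // Block-level or line-breaking elements listed in this issue.
    private static final Set<String> PROPOSED = new HashSet<>(Arrays.asList(
            "BR", "DIV", "HR", "ADDRESS", "FIELDSET", "FORM", "NOSCRIPT", "NOFRAMES"));

    /**
     * Returns the lower-case element name for the proposed elements,
     * or null (assumed convention) when the element is not mapped.
     */
    public static String mapSafeElement(String name) {
        return PROPOSED.contains(name) ? name.toLowerCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(mapSafeElement("BR"));    // prints "br"
        System.out.println(mapSafeElement("BLINK")); // prints "null"
    }
}
```

      Once elements such as br and div are passed through, a downstream text handler has a boundary at which it can emit whitespace, so "test<br>test" no longer collapses to "testtest".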

      Also, application/xml documents are parsed by removing unknown tags instead of replacing them with spaces.
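
      The XML behaviour asked for here can be illustrated with a plain SAX handler that appends a space whenever an element closes, so text from adjacent elements is not glued together. This is only a sketch using the JDK's built-in SAX parser; the names XmlTextSketch, SpacingHandler and extractText are hypothetical and do not reflect how Tika's XML parser is implemented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not Tika's XML parser.
public class XmlTextSketch {
    /** Collects character data and appends a space after every closing tag. */
    static class SpacingHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Replace the tag boundary with whitespace instead of dropping it.
            text.append(' ');
        }
    }

    public static String extractText(String xml) throws Exception {
        SpacingHandler handler = new SpacingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return handler.text.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        // prints "test test" rather than "testtest"
        System.out.println(extractText("<doc><a>test</a><b>test</b></doc>"));
    }
}
```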

              People

              • Assignee:
                jukkaz Jukka Zitting
                Reporter:
                bartex Piotr Bartosiewicz
