Tika / TIKA-3024

Extra whitespace appended within a tag element's text

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.16, 1.20
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Website: http://www.thevanitycase.com/about-us.php

      When the page is parsed with the Tika parser, the text of one element is split into several chunks, and each chunk is passed to crawler4j for content handling, even though the text is contained within a single tag (a span tag). The content handler appends extra whitespace ("  ") after each chunk, as it normally does for any text it receives, so the whitespace ends up inside the element's text.

      Text of the element: "Tel: +91-22-61801700".
      That is:
      Expected text: "<text before this>Tel: +91-22-61801700<text after this>"

      Actual text: "<text before this>Tel: +91-22-6180170  0<text after this>"

      The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span
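
A possible workaround on the handler side, sketched under assumptions: SAX allows a parser to deliver one run of text through several characters() calls, so a handler that appends whitespace on every call can split a word. The sketch below buffers the chunks and appends the separator only at element boundaries. The class name BufferedTextHandler and the demo() method are hypothetical, and this is not crawler4j's actual handler; it only illustrates the buffering idea with the JDK's DefaultHandler.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text chunks and separates them
// only at element boundaries, never between two characters() calls.
public class BufferedTextHandler extends DefaultHandler {
    private final StringBuilder out = new StringBuilder();
    private final StringBuilder pending = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only buffer here; a single text run may arrive in several chunks.
        pending.append(ch, start, length);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    // Move buffered text to the output, followed by the two-space separator.
    private void flush() {
        if (pending.length() > 0) {
            out.append(pending).append("  ");
            pending.setLength(0);
        }
    }

    public String text() {
        flush();
        return out.toString();
    }

    // Simulates the reported case: the span text arrives in two chunks.
    public static String demo() {
        BufferedTextHandler h = new BufferedTextHandler();
        char[] first = "Tel: +91-22-6180170".toCharArray();
        char[] second = "0".toCharArray();
        h.startElement("", "span", "span", null);
        h.characters(first, 0, first.length);
        h.characters(second, 0, second.length);
        h.endElement("", "span", "span");
        return h.text();
    }
}
```

With this approach the two chunks are joined before the separator is added, so the phone number stays intact and the whitespace follows the whole element text.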

      Attachments

        1. one.odt
          10 kB
          Claas Aug.
        2. one.odt-parsed.html
          3 kB
          Claas Aug.

          People

            Assignee: Unassigned
            Reporter: Vivek (vivek_0079)
            Votes: 1
            Watchers: 2

            Dates

              Created:
              Updated: