Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects versions: 1.16, 1.20
Description
Website: http://www.thevanitycase.com/about-us.php
When the page is parsed with the Tika parser, the text of one element is split into several chunks, and each chunk is sent to crawler4j for content handling, even though the text is contained within a single tag (a span tag). The content handler appends a space (" ") after every chunk it receives, as it normally does for any text, so an extra space ends up in the middle of the text.
Text: "Tel: +91-22-61801700".
That is,
Expected text: "<text before this>Tel: +91-22-61801700<text after this>"
Actual text: "<text before this>Tel: +91-22-6180170 0<text after this>"
The JS path of the element: body > div > div:nth-child(6) > div > div.footer-full.footer-btm > div > p > span
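The behaviour described above can be reproduced without Tika or crawler4j. The following is a minimal sketch using only the JDK's SAX classes. The class and method names (SplitTextDemo, SpaceAppendingHandler, BoundaryAwareHandler, feed) are hypothetical and are not taken from either library, and the chunk boundary ("Tel: +91-22-6180170" followed by "0") is assumed from the actual text reported above. The second handler is only one possible way to avoid the problem, not a confirmed fix.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SplitTextDemo {

    // Common base so both hypothetical handlers expose the collected text.
    static abstract class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();
    }

    // Mimics the reported behaviour: a space is appended after every
    // characters() call, even when the calls belong to one text node.
    static class SpaceAppendingHandler extends TextCollector {
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
            text.append(' ');
        }
    }

    // Possible alternative (an assumption, not the project's fix):
    // append chunks verbatim and add the separator only when an element ends.
    static class BoundaryAwareHandler extends TextCollector {
        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            text.append(' ');
        }
    }

    // Simulates a parser that delivers the text of one <span> in several chunks.
    static String feed(TextCollector handler, String... chunks) throws SAXException {
        handler.startElement("", "span", "span", new AttributesImpl());
        for (String chunk : chunks) {
            handler.characters(chunk.toCharArray(), 0, chunk.length());
        }
        handler.endElement("", "span", "span");
        return handler.text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumed chunk boundary, taken from the actual text in the report.
        String[] chunks = {"Tel: +91-22-6180170", "0"};
        System.out.println("[" + feed(new SpaceAppendingHandler(), chunks) + "]");
        System.out.println("[" + feed(new BoundaryAwareHandler(), chunks) + "]");
    }
}
```

The SAX contract allows a parser to deliver one run of character data in any number of characters() calls, so a handler that treats each call as a separate piece of text can insert separators in the middle of a word or number.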