Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Duplicate
- Affects Version: 0.8
- None
- None
- Environment: Win Pro 7, x64, jdk1.6.0_22, jre 6.0.220.4
Description
The included PDF (a legal filing from the web) when parsed by Tika 0.7 has the following as its first several lines of plain text:
------- start ---------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON
DIVISION ONE
SERGEY SAVCHUK, )
) No. 64269-3-I
Appellant, )
v. )
) UNPUBLISHED OPINION
STEVEN G. JERDE and )
DARLYCE J. JERDE, husband and wife )
)
Respondents. )
_______________________________ ) FILED: November 1, 2010
--------------- end ---------------------
Tika 0.8 has this instead:
-------------- start ---------------------
IN THE COURT OF APPEALS OF THE STATE OF WASHINGTONDIVISION ONESERGEYSAVCHUK,))No. 64269-3-IAppellant,)v.))UNPUBLISHED OPINIONSTEVENG. JERDE and )DARLYCE J. JERDE, husband and wife))Respondents.)_______________________________ )FILED: November 1, 2010schindler, j
--------------- end ---------------------
Notice that, as part of the improved paragraph breaking for PDF files, the lines of the document's "header" are concatenated with no whitespace between them, creating run-on words (e.g. "WASHINGTONDIVISION" and "ONESERGEYSAVCHUK"). Some spaces within a line are dropped as well (e.g. "SERGEYSAVCHUK", "STEVENG."). Compare the original PDF against the extracted text for more details.
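For anyone who wants to catch this kind of regression in a test, here is a rough sketch of a check that compares two extractions and reports words in the newer output that are really several words of the older output glued together. It does not call Tika at all; it only works on the extracted strings. The class and method names (`RunOnWordCheck`, `findRunOns`) are made up for illustration, and the sample strings are shortened from the output quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: flags run-on words by comparing
// a known-good extraction against a newer one.
public class RunOnWordCheck {

    /** Splits extracted text on any whitespace, dropping empty tokens. */
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    /** True if word equals two or more consecutive expected tokens joined with no separator. */
    static boolean isGluedSequence(List<String> expected, String word) {
        for (int start = 0; start < expected.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int end = start; end < expected.size(); end++) {
                joined.append(expected.get(end));
                if (joined.length() > word.length()) {
                    break; // already longer than the candidate, no match from this start
                }
                if (end > start && joined.toString().equals(word)) {
                    return true;
                }
            }
        }
        return false;
    }

    /** Returns the tokens of actual that are run-ons of adjacent tokens in expected. */
    static List<String> findRunOns(String expected, String actual) {
        List<String> expectedTokens = tokens(expected);
        List<String> runOns = new ArrayList<>();
        for (String word : tokens(actual)) {
            if (isGluedSequence(expectedTokens, word)) {
                runOns.add(word);
            }
        }
        return runOns;
    }

    public static void main(String[] args) {
        String expected = "IN THE COURT OF APPEALS OF THE STATE OF WASHINGTON\n"
                + "DIVISION ONE\n"
                + "SERGEY SAVCHUK, )";
        String actual = "IN THE COURT OF APPEALS OF THE STATE OF "
                + "WASHINGTONDIVISION ONESERGEYSAVCHUK,)";
        // prints [WASHINGTONDIVISION, ONESERGEYSAVCHUK,)]
        System.out.println(findRunOns(expected, actual));
    }
}
```

The check is deliberately simple: it only reports a token when it is an exact concatenation of neighbouring tokens from the reference text, so ordinary words that appear in both outputs are not flagged.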
Attachments
Issue Links
- duplicates TIKA-548 PDF content extracted as single line (Closed)