Details
Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version: 2.4.1
None
None
Description
For certain PDFs, the text extracted using tika-app-2.4.1 is split into paragraphs incorrectly.
For example, with the attached PDF as input, the output text starts a new paragraph at the end of every line. Consider the first paragraph of that file:
Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.
This is output as five separate paragraphs, as shown below:
Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which
term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and
directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the
relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by
the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.
Is there a way to configure Tika to recognise such text as a single paragraph, or can this be fixed in Tika itself?
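As a stopgap, the wrapped lines could be rejoined after extraction. The following is only a minimal sketch of that idea, not a Tika feature: the function name `merge_wrapped_lines` is hypothetical, and it assumes that a line which does not end in sentence-final punctuation continues on the next line.

```python
import re

# Assumption (heuristic, not Tika behaviour): a line ending in ".", "!" or "?"
# (optionally followed by closing quotes or brackets) ends a paragraph;
# any other line is treated as wrapped and joined with the next one.
_SENTENCE_END = re.compile(r"[.!?][\"')\]]*$")


def merge_wrapped_lines(text: str) -> list[str]:
    """Rejoin lines that were split at the visual line end into paragraphs."""
    paragraphs: list[str] = []
    current: list[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # A blank line always closes the paragraph in progress.
            if current:
                paragraphs.append(" ".join(current))
                current = []
            continue
        current.append(line)
        if _SENTENCE_END.search(line):
            paragraphs.append(" ".join(current))
            current = []
    if current:
        # Flush trailing text that never reached sentence-final punctuation.
        paragraphs.append(" ".join(current))
    return paragraphs
```

Applied to the five lines above, this heuristic would yield one paragraph, since only the last line ends with a full stop. It would still split a paragraph wherever a wrapped line happens to end exactly at a sentence boundary, so it is not a substitute for correct paragraph detection during extraction.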
Thank you.