[TIKA-3805] Incorrect end of paragraph detection - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Minor
Resolution: Unresolved
Affects Version/s: 2.4.1
Fix Version/s: None
Component/s: tika-app
Labels: None

Description

For certain PDFs, the text extracted using tika-app-2.4.1 is split into paragraphs incorrectly.

For example, when the attached PDF is used as input, the output text is split into paragraphs at the end of each line. Consider the first paragraph of that file:

Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.

This is output as five separate paragraphs, as shown below:

Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which

term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and

directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the

relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by

the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.

Is there a way to configure Tika to recognise such input as a single paragraph, or could this be fixed in the parser?
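
As a stopgap on the consumer side, the plain-text output could be post-processed to re-join blocks that look like wrapped lines. The sketch below is not a Tika feature or API: `merge_wrapped_paragraphs` is a hypothetical helper, and its rule (a block that does not end in sentence-final punctuation continues in the next block) is an assumption that happens to fit the sample above.

```python
import re

# Closing quotes/brackets that may follow sentence-final punctuation.
_CLOSERS = "\"'”’)]"


def _ends_sentence(block: str) -> bool:
    """Return True if the block ends in ., !, ? or : (ignoring closers)."""
    core = block.rstrip(_CLOSERS)
    return core.endswith((".", "!", "?", ":"))


def merge_wrapped_paragraphs(text: str) -> str:
    """Re-join paragraphs that were split at visual line ends.

    Heuristic (an assumption, not Tika behaviour): a block that does not
    end in sentence-final punctuation is treated as continuing in the
    next block.
    """
    # Paragraphs in the extracted text are separated by blank lines.
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    merged = []
    for block in blocks:
        block = " ".join(block.split())  # collapse internal whitespace
        if merged and not _ends_sentence(merged[-1]):
            merged[-1] = merged[-1] + " " + block
        else:
            merged.append(block)
    return "\n\n".join(merged)
```

This heuristic has clear limits: a line that happens to wrap exactly at the end of a sentence stays split, and headings without trailing punctuation get merged into the text that follows them. It does not address the underlying paragraph detection.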

Thank you.

Attachments

gcb-formerly-vm-base-prospectus-050319.pdf (2.56 MB), attached 29/Jun/22 16:32 by Jacopo Carrani

People

Assignee: Unassigned

Reporter: Jacopo Carrani

Votes: 0

Watchers: 1

Dates

Created: 29/Jun/22 16:39

Updated: 29/Jun/22 16:39