TIKA-3805

Incorrect end of paragraph detection

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.4.1
    • Fix Version/s: None
    • Component/s: tika-app
    • Labels: None

    Description

      For certain PDFs, the text extracted using tika-app-2.4.1 is split into paragraphs incorrectly.

      For example, with the attached PDF as input, the output text starts a new paragraph at the end of every line. The first paragraph of the file reads:

      Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.

      Tika outputs this as five separate paragraphs:

      Under the €7 billion covered bond programme described in this Prospectus (the "Programme"), Virgin Money plc (the "Issuer", which 

      term shall include any Part VII Successor (as defined in the Conditions)), subject to compliance with all relevant laws, regulations and 

      directives, may from time to time issue bonds (the "Covered Bonds") denominated in any currency agreed between the Issuer and the 

      relevant Dealer(s) (as defined below). The price and amount of the Covered Bonds to be issued under the Programme will be determined by 

      the Issuer and the relevant Dealers at the time of issue in accordance with prevailing market conditions.
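      A possible stopgap, outside Tika itself, is to post-process the extracted text and rejoin blocks that do not end a sentence. The sketch below is only an illustration of that idea: the function names `ends_sentence` and `merge_split_paragraphs` are hypothetical, nothing here is a Tika API, and the heuristic will wrongly split a paragraph wherever a line happens to end on a full stop.

```python
import re

# Punctuation that is assumed to close a paragraph when it ends a block.
# This rule is an assumption of the sketch, not something Tika defines.
SENTENCE_END = (".", "!", "?", ":")


def ends_sentence(block: str) -> bool:
    """Return True if the block ends with sentence-final punctuation,
    ignoring any trailing closing quotes or brackets."""
    stripped = block.rstrip("\"')]”’")
    return stripped.endswith(SENTENCE_END)


def merge_split_paragraphs(text: str) -> str:
    """Rejoin blank-line-separated blocks when the preceding block
    does not end a sentence."""
    blocks = [b for b in re.split(r"\n\s*\n", text) if b.strip()]
    merged = []
    for block in blocks:
        block = " ".join(block.split())  # collapse inner whitespace and newlines
        if merged and not ends_sentence(merged[-1]):
            merged[-1] = merged[-1] + " " + block
        else:
            merged.append(block)
    return "\n\n".join(merged)
```

      Applied to the five blocks above, this yields a single paragraph, because only the last block ends with a full stop.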

      Is there a way to configure Tika to recognise such input as a single paragraph, or could this behaviour be fixed?

      Thank you.

          People

            Assignee: Unassigned
            Reporter: Jacopo Carrani (jcarrani)
            Votes: 0
            Watchers: 1