Tika / TIKA-3170

PDF extraction space issue


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.25
    • Component/s: parser
    • Labels: None

    Description

      When extracting text from PDF files, we see spurious spaces inserted between the letters of some words.

      According to the PDFParserConfig documentation,

      https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html

      this can be resolved by disabling the Enable Auto Space property. However, when we disable it, a different problem appears: spaces that should separate other pieces of text go missing.
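      For reference, a minimal sketch of how the property can be toggled per parse through a ParseContext, so both outputs can be compared on the same file. The class name PdfAutoSpaceSketch and its helper methods are hypothetical, and the default file name is simply the attachment on this issue. The sketch assumes tika-core and tika-parsers are on the classpath, and it prints plain text rather than the XHTML shown in the outputs here.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class, for illustration only.
public class PdfAutoSpaceSketch {

    // Build a ParseContext carrying a PDFParserConfig with the given auto-space setting.
    static ParseContext contextWithAutoSpace(boolean enableAutoSpace) {
        PDFParserConfig config = new PDFParserConfig();
        config.setEnableAutoSpace(enableAutoSpace);
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, config);
        return context;
    }

    // Extract the body text of one PDF with the given auto-space setting.
    static String extract(Path pdf, boolean enableAutoSpace) throws Exception {
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        try (InputStream in = Files.newInputStream(pdf)) {
            new AutoDetectParser().parse(in, handler, new Metadata(),
                    contextWithAutoSpace(enableAutoSpace));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        // Assumption: the PDF path is passed as the first argument.
        Path pdf = Paths.get(args.length > 0 ? args[0] : "document_example.pdf");
        System.out.println("--- enableAutoSpace = true ---");
        System.out.println(extract(pdf, true));
        System.out.println("--- enableAutoSpace = false ---");
        System.out.println(extract(pdf, false));
    }
}
```

      Running it against the attached PDF should make the trade-off described here easy to reproduce side by side.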

      With Enable Auto Space 

      < <p>2014 C H A M B R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015

      Without Enable Auto Space
      > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e ZITTINGSPERIODE2015

       

      Now there is no space between 2014 and CHAMBRE.

      Is there a configuration option that overcomes this issue?

      Attachments

        1. image-2020-08-18-20-23-16-159.png
          26 kB
          Akash
        2. document_example.pdf
          139 kB
          Akash

        People

            Assignee: Unassigned
            Reporter: Akash (akki1607)
            Votes: 0
            Watchers: 2
