Tika / TIKA-3170

PDF extraction space issue

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.24.1
    • Fix Version/s: 1.25
    • Component/s: parser
    • Labels:
      None

      Description

      When extracting text from PDF files, we see spaces inserted between the letters of some words.

      According to the PDFParserConfig documentation,

      https://tika.apache.org/1.24.1/api/org/apache/tika/parser/pdf/PDFParserConfig.html

      we can resolve this by disabling the enableAutoSpace property. However, when we disable it, a different problem appears elsewhere in the text.
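
      For reference, this is a minimal sketch of one way to switch the property off through a tika-config.xml file. It assumes the Tika 1.x per-parser parameter syntax, and that the parameter name enableAutoSpace mirrors the setEnableAutoSpace setter on PDFParserConfig. It has not been verified against 1.24.1.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep the default parsers, but exclude the PDF parser so it can be configured separately -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Assumed parameter name: enableAutoSpace (defaults to true per the PDFParserConfig Javadoc) -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="enableAutoSpace" type="bool">false</param>
      </params>
    </parser>
  </parsers>
</properties>
```

      The same setting can also be applied in code by passing a PDFParserConfig instance through the ParseContext.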

      With enableAutoSpace enabled:

      < <p>2014 C H A M B R E 2 e S E S S I O N D E L A 5 4 e L É G I S L A T U R EK A M E R 2 e Z I T T I N G V A N D E 5 4 e Z I T T I N G S P E R I O D E 2015

      With enableAutoSpace disabled:
      > <p>*2014CHA*MBRE 2e SESSION DE LA 54e LÉGISLATUREKAMER 2e ZITTING VAN DE 54e ZITTINGSPERIODE2015

       

      Now there is no space between 2014 and CHAMBRE.

       

      Is there a configuration option that overcomes this issue?

        Attachments

        1. image-2020-08-18-20-23-16-159.png
          26 kB
          Akash
        2. document_example.pdf
          139 kB
          Akash

            People

            • Assignee:
              Unassigned
              Reporter:
              akki1607 Akash
            • Votes:
              0
              Watchers:
              2

              Dates

              • Created:
                Updated:
                Resolved: