Tika / TIKA-724

PDF text sometimes has extra space between letters

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels:
      None

      Description

      I have a PDF with simple text "Here is some formatted text", but when
      I extract with Tika I get extra spaces inserted:

      H e re  i s  so me  fo rma tte d  te x t
      

      When I created the text in this PDF (I used the PDFpen tool on OS X),
      I set the style of the text to "loosen" (i.e., increase the space
      slightly between the letters), so I think Tika (PDFBox) is trying to
      "respect" that whitespace. It would be nice to be able to turn this
      off (if it won't mess up other places where we DO want the
      whitespace).

      When I copy/paste the text from the PDF, it is copied correctly.
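
      To illustrate the suspected cause, here is a minimal, self-contained
      sketch of a gap-based spacing heuristic. It is not PDFBox's actual
      algorithm: the class GapSpacingSketch, the methods layout and extract,
      and the glyphWidth, letterSpacing, wordGap, and tolerance parameters
      are all hypothetical names invented for this example. It assumes
      fixed-width glyphs and inserts a space whenever the horizontal gap
      between two glyphs exceeds tolerance * glyphWidth.

```java
public class GapSpacingSketch {

    // Hypothetical layout model: every glyph is glyphWidth wide and is followed
    // by letterSpacing of extra room. Spaces in the input are not emitted as
    // glyphs; they only advance the pen by wordGap.
    // Returns the left edge of each non-space glyph.
    static double[] layout(String text, double glyphWidth,
                           double letterSpacing, double wordGap) {
        double[] xs = new double[text.replace(" ", "").length()];
        double pen = 0.0;
        int n = 0;
        for (char c : text.toCharArray()) {
            if (c == ' ') {
                pen += wordGap;
                continue;
            }
            xs[n++] = pen;
            pen += glyphWidth + letterSpacing;
        }
        return xs;
    }

    // Hypothetical extraction heuristic: rebuild the string from positioned
    // glyphs, inserting a space when the gap between the right edge of the
    // previous glyph and the left edge of this one exceeds
    // tolerance * glyphWidth.
    static String extract(String glyphs, double[] xs,
                          double glyphWidth, double tolerance) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < glyphs.length(); i++) {
            if (i > 0) {
                double gap = xs[i] - (xs[i - 1] + glyphWidth);
                if (gap > tolerance * glyphWidth) {
                    out.append(' ');
                }
            }
            out.append(glyphs.charAt(i));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Here is some";
        String glyphs = text.replace(" ", "");

        // Normal letter spacing: only the word gaps exceed the threshold.
        double[] tight = layout(text, 10.0, 0.0, 6.0);
        System.out.println(extract(glyphs, tight, 10.0, 0.3));

        // "Loosened" letter spacing: every gap now exceeds the threshold.
        double[] loose = layout(text, 10.0, 4.0, 6.0);
        System.out.println(extract(glyphs, loose, 10.0, 0.3));

        // A higher tolerance ignores the loosened gaps but keeps the word gaps.
        System.out.println(extract(glyphs, loose, 10.0, 0.5));
    }
}
```

      In this toy model, raising the tolerance recovers the words for the
      loosened text, but a tolerance that is too high would also swallow
      genuine word gaps in tightly set text, which is the trade-off
      mentioned above.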

      1. TIKA-724.patch (6 kB, Michael McCandless)
      2. extraSpaces.pdf (20 kB, Michael McCandless)

          Activity

          (Changes are listed as Field: Original Value -> New Value)

          Michael McCandless created issue
          Michael McCandless made changes
            Attachment: extraSpaces.pdf [ 12495118 ]
          Michael McCandless made changes
            Assignee: Michael McCandless [ mikemccand ]
          Michael McCandless made changes
            Attachment: TIKA-724.patch [ 12499666 ]
          Michael McCandless made changes
            Status: Open [ 1 ] -> Resolved [ 5 ]
            Fix Version/s: 1.0 [ 12317967 ]
            Resolution: Fixed [ 1 ]
          Jan Høydahl made changes
            Link: This issue is required by SOLR-2930

            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 0
