Tika / TIKA-724

PDF text sometimes has extra space between letters

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.0
    • Component/s: parser
    • Labels: None

      Description

      I have a PDF with simple text "Here is some formatted text", but when
      I extract with Tika I get extra spaces inserted:

      H e re  i s  so me  fo rma tte d  te x t
      

      When I created the text in this PDF (I used the PDFpen tool on OS X),
      I set the style of the text to "loosen" (ie, increase space slightly
      between the letters), so I think Tika (PDFBox) is trying to "respect"
      that whitespace, but it'd be nice to turn this off (if it won't mess
      up other places where we DO want the whitespace).

      When I copy/paste, the text is copied correctly.

      1. TIKA-724.patch (6 kB, Michael McCandless)
      2. extraSpaces.pdf (20 kB, Michael McCandless)

          Activity

          Michael McCandless added a comment -

          I dug into this one some more.

          Handling space between words is tricky in PDF! This is because a PDF
          need not actually include space characters; instead it can (and does!)
          simply place the glyphs at x/y positions with added whitespace between
          them. This easily happens for whitespace-delimited languages too.

          Yet, sometimes PDFs do include space characters themselves (the attached
          PDF is such an example). Ideally we would be able to somehow detect
          this (eg if the PDF is encoded differently internally, or something) but
          I don't know how to do this / if it's even possible.

          So for the time being I made a simple addition to PDFParser, adding an
          option set/getEnableAutoSpace, defaulting to enabled (ie keeping the
          behavior today). So at least if an app hits PDFs like the one
          attached here, or somehow knows its PDFs always include explicit
          space characters, it can turn this option off.
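
          To illustrate the trade-off described above, here is a minimal,
          self-contained Java sketch of gap-based space inference. It is not
          PDFBox's or Tika's actual algorithm: the Glyph class, the layOut
          helper, and the GAP_FACTOR threshold are all hypothetical, and only
          the enableAutoSpace flag name mirrors the option from the patch.

```java
import java.util.ArrayList;
import java.util.List;

public class AutoSpaceSketch {

    /** One positioned glyph: the character, its left x coordinate, and its width. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    // Hypothetical threshold: a gap wider than 30% of the previous glyph counts as a space.
    static final double GAP_FACTOR = 0.3;

    /**
     * Rebuilds text from positioned glyphs. When enableAutoSpace is true, a space is
     * inferred wherever two adjacent non-space glyphs are separated by a large gap.
     * When it is false, only space characters present in the glyph stream are kept.
     */
    static String extract(List<Glyph> glyphs, boolean enableAutoSpace) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (enableAutoSpace && prev != null && prev.ch != ' ' && g.ch != ' ') {
                double gap = g.x - (prev.x + prev.width);
                if (gap > GAP_FACTOR * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    /** Lays out text left to right, adding `tracking` units of extra space after each glyph. */
    static List<Glyph> layOut(String text, double width, double tracking) {
        List<Glyph> glyphs = new ArrayList<>();
        double x = 0;
        for (char ch : text.toCharArray()) {
            glyphs.add(new Glyph(ch, x, width));
            x += width + tracking;
        }
        return glyphs;
    }

    public static void main(String[] args) {
        // "Loosened" text that also contains an explicit space character.
        List<Glyph> loosened = layOut("Here is", 10.0, 4.0);
        System.out.println(extract(loosened, true));   // prints "H e r e i s"
        System.out.println(extract(loosened, false));  // prints "Here is"

        // Two words separated only by position, with no space character at all.
        List<Glyph> positioned = new ArrayList<>(layOut("Here", 10.0, 0.0));
        for (Glyph g : layOut("is", 10.0, 0.0)) {
            positioned.add(new Glyph(g.ch, g.x + 50.0, g.width));
        }
        System.out.println(extract(positioned, true));   // prints "Here is"
        System.out.println(extract(positioned, false));  // prints "Hereis"
    }
}
```

          With the flag on, the loosened text picks up a spurious space between
          letters, while the purely positioned text gets its word break back;
          with the flag off the reverse holds, which is why keeping the option
          enabled by default is the safer choice.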

          Michael McCandless added a comment -

          Patch.

          Ravish Bhagdev added a comment -

          Is there a way to control this flag from Solr? Would have expected I could add something in solrconfig.xml to control this flag?

          As I typed this I realized this might not be the place, so is there a way to control this from command line in tika-app?

          Ravish Bhagdev added a comment -

          and also in tika.config

          Michael McCandless added a comment -

          Alas, no, I don't believe you can control this from Solr today; maybe open a Solr issue?

          Likewise for TikaCLI.. would be nice to expose that. Maybe open an issue / cons up a patch? Thanks!

          Ravish Bhagdev added a comment -

          OK, will open the issue with Solr/Lucene. Many thanks for your help.


            People

            • Assignee: Michael McCandless
            • Reporter: Michael McCandless