PDFBox / PDFBOX-3248

Unwanted spaces in text extraction (2)


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.8.11, 2.0.0
    • Fix Version/s: None
    • Component/s: Text extraction

    Description

      The attached file, provided by Francisco on the user mailing list, produces unwanted spaces in text extraction regardless of the spacingTolerance or averageCharTolerance settings. I was unable to extract "Cada frasco ampolla", which looks straightforward when rendered; it always came out as "Ca da fras co ampo lla". Adobe Reader has no such problem.

      The content stream has this:

           6 0 1.058 6 122.0924 312.51 Tm
           (Ca) Tj
           /Span << /ActualText (\376\377\000\255) >> BDC
             ( ) Tj
           EMC
           [ (da ) -301 (fras) ] TJ
           /Span << /ActualText (\376\377\000\255) >> BDC
             ( ) Tj
           EMC
           [ (co ) -301 (ampo) ] TJ
           /Span << /ActualText (\376\377\000\255) >> BDC
             ( ) Tj
           EMC
           [ (lla ) -301 (con) ] TJ
      

      So there are really spaces there, and we keep them. Adobe is smarter, and ignores them because they are overwritten thanks to the "-301" backwards positioning.

      Would /ActualText help? However it is always the same here...
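
      For reference, the /ActualText string is the same four bytes in every span: octal \376\377\000\255 is FE FF 00 AD, i.e. a UTF-16BE byte order mark followed by U+00AD (SOFT HYPHEN). Below is a minimal sketch that checks this using only the JDK. The class and method names are made up for illustration and are not PDFBox API, and the parser only handles strings made up entirely of \ddd octal escapes.

```java
import java.nio.charset.StandardCharsets;

public class ActualTextDecode {
    // Turns a PDF literal string consisting only of \ddd octal escapes into raw bytes.
    static byte[] parseOctalEscapes(String literal) {
        String[] parts = literal.split("\\\\"); // split on the backslash character
        byte[] bytes = new byte[parts.length];
        int n = 0;
        for (String part : parts) {
            if (part.isEmpty()) {
                continue; // the piece before the first backslash is empty
            }
            bytes[n++] = (byte) Integer.parseInt(part, 8);
        }
        byte[] result = new byte[n];
        System.arraycopy(bytes, 0, result, 0, n);
        return result;
    }

    // Decodes the bytes as UTF-16; the leading FE FF byte order mark is consumed by the decoder.
    static String decode(String literal) {
        return new String(parseOctalEscapes(literal), StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        String text = decode("\\376\\377\\000\\255");
        System.out.println(String.format("U+%04X", text.codePointAt(0))); // prints U+00AD
    }
}
```

      The marked spans sit at syllable boundaries (Ca|da, fras|co, ampo|lla). That would be consistent with discretionary hyphenation points, but this is an interpretation and is not confirmed anywhere in this issue.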

      Would it help to ignore spaces and decide based on positions only, maybe as an option? I added these two lines below the first existing one:

                      // existing line
                      String characterValue = position.getUnicode();
                      // added: skip space glyphs so that word breaks are decided by position only
                      if (" ".equals(characterValue))
                          continue;
      

      The output looks promising:

      F ó r m u l a :
      Cronopen® Balsámico Adultos:
      Cada frasco ampolla contiene: ampicilina (como ampicilina sódica)
      100 mg; ampicilina (como ampicilina benzatínica) 500 mg.
      Cada ampolla solvente de 5 ml contiene: dipirona 1000 mg; guaife­
      nesina 100 mg. Exc.: bisulfito de sodio; agua destilada.
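
      The idea of ignoring space glyphs and deciding word breaks from positions only can be sketched outside PDFBox as follows. This is a simplified illustration, not the PDFTextStripper algorithm: the Chunk class, the fixed gap threshold and all coordinates are assumptions invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

public class PositionOnlySpacing {
    /** Hypothetical stand-in for a positioned piece of text; not a PDFBox class. */
    static final class Chunk {
        final String unicode;
        final float x;     // left edge of the chunk
        final float width; // horizontal advance of the chunk

        Chunk(String unicode, float x, float width) {
            this.unicode = unicode;
            this.x = x;
            this.width = width;
        }
    }

    /** Drops space chunks and emits a space only where the horizontal gap exceeds the threshold. */
    static String extract(List<Chunk> chunks, float gapThreshold) {
        StringBuilder out = new StringBuilder();
        float lastEnd = Float.NaN; // right edge of the previous non-space chunk
        for (Chunk chunk : chunks) {
            if (" ".equals(chunk.unicode)) {
                continue; // ignore space glyphs entirely
            }
            if (!Float.isNaN(lastEnd) && chunk.x - lastEnd > gapThreshold) {
                out.append(' '); // word break inferred from position only
            }
            out.append(chunk.unicode);
            lastEnd = chunk.x + chunk.width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Chunk> line = new ArrayList<>();
        line.add(new Chunk("Ca", 0f, 10f));
        line.add(new Chunk(" ", 10f, 0f));    // space glyph that takes up no room (made-up value)
        line.add(new Chunk("da", 10f, 10f));
        line.add(new Chunk(" ", 20f, 3f));    // real word space, also skipped
        line.add(new Chunk("fras", 25f, 20f));
        System.out.println(extract(line, 2f)); // prints "Cada fras"
    }
}
```

      With these made-up coordinates, the first space is dropped without leaving a gap, and the second is dropped but re-created from the 5-unit gap before "fras".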

      A complete test run shows many differences; most are harmless or are improvements. Only one test case really fails: hello3.pdf. The original extract is "Hello محمد World.", the new extract is "Hello .Worldمحمد".

      More from Francisco

      As additional information, I've found two related posts (about other tools) on StackOverflow:
      http://stackoverflow.com/questions/34579824/itext-how-to-tweak-text-extraction
      http://stackoverflow.com/questions/22671974/itext-reading-pdf-1s-as-up-arrows-error/22688775#22688775

          People

            Assignee: Unassigned
            Reporter: Tilman Hausherr