  1. PDFBox
  2. PDFBOX-3058 Support TIKA Migration to PDFBox 2.0
  3. PDFBOX-3127

Text with vertical font not extracted correctly

Details

    • Type: Sub-task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.10, 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      The attached file has a vertical font, although the text is horizontal.

      Extraction with 1.8:

      NOTI CE OF PUBLI C HEARI NG
      The Sout h Caroli na Depart ment of I nsurance will hol d a publi c
      heari ng i n accordance wit h t he require ments of Secti on 38-3-
      110? 5? Thursday, April 29, 2010 at The Conf erence and Busi ness
      Cent er at t he Grand Strand Ca mpus of t he Horry- Georgetown
      Techni cal Coll ege, 950 Crabtree Lane, Myrtl e Beach, S. C., 29577
      fro m 5: 30 p. m.-7: 00 p. m. The purpose of t hi s heari ng i s t o provi de
      an opportunity t o di scuss and off er i nput concerni ng t he st atus of t he
      coastal property i nsurance market. The Conf erence Cent er i s l ocat ed
      one mil e sout h of t he Myrtl e Beach I nt ernati onal Airport bet ween
      Hi ghway 17 Busi ness and Hi ghway 17 Bypass. The t el ephone
      nu mber f or t he Conf erence and Busi ness Cent er i s 843-477-2042.

      Extraction with 2.0:

      N O T I C E O F P U B L I C H E A R I N G

      T h e S o u t h C a r o l i n a D e p a r t m e n t o f I n s u r a n c e w i l l h o l d a p u b l i c
      h e a r i n g i n a c c o r d a n c e w i t h t h e r e q u i r e m e n t s o f S e c t i o n 3 8 - 3 -
      1 1 0 ︵5 ︶ T h u r s d a y , A p r i l 2 9 , 2 0 1 0 a t T h e C o n f e r e n c e a n d B u s i n e s s
      C e n t e r a t t h e G r a n d S t r a n d C a m p u s o f t h e H o r r y - G e o r g e t o w n
      T e c h n i c a l C o l l e g e , 9 5 0 C r a b t r e e L a n e , M y r t l e B e a c h , S . C . , 2 9 5 7 7
      f r o m 5 : 3 0 p . m . - 7 : 0 0 p . m . T h e p u r p o s e o f t h i s h e a r i n g i s t o p r o v i d e
      a n o p p o r t u n i t y t o d i s c u s s a n d o f f e r i n p u t c o n c e r n i n g t h e s t a t u s o f t h e
      c o a s t a l p r o p e r t y i n s u r a n c e m a r k e t . T h e C o n f e r e n c e C e n t e r i s l o c a t e d
      o n e m i l e s o u t h o f t h e M y r t l e B e a c h I n t e r n a t i o n a l A i r p o r t b e t w e e n
      H i g h w a y 1 7 B u s i n e s s a n d H i g h w a y 1 7 B y p a s s . T h e t e l e p h o n e
      n u m b e r f o r t h e C o n f e r e n c e a n d B u s i n e s s C e n t e r i s 8 4 3 - 4 7 7 - 2 0 4 2 .

      A brute-force change that uses the correct width, and that works only with this file, produces this:

      NOTICE OF PUBLIC HEARING

      The South Carolina Department of Insurance will hold a public
      hearing in accordance with the requirements of Section 38-3-
      110 ︵5 ︶ Thursday, April 29, 2010 at The Conference and Business
      Center at the Grand Strand Campus of the Horry-Georgetown
      Technical College, 950 Crabtree Lane, Myrtle Beach, S.C., 29577
      from 5:30 p.m.-7:00 p.m. The purpose of this hearing is to provide
      an opportunity to discuss and offer input concerning the status of the
      coastal property insurance market. The Conference Center is located
      one mile south of the Myrtle Beach International Airport between
      Highway 17 Business and Highway 17 Bypass. The telephone
      number for the Conference and Business Center is 843-477-2042.

      The problem is that PDFTextStreamEngine doesn't work well with vertical fonts. The red lines in the attached image show that the computed glyph size is only half of what's needed. It may be related to PDCIDFont.getDefaultPositionVector(), but changing that alone isn't enough.
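
      To illustrate why a halved width would put a space after every letter, here is a minimal, self-contained sketch. It is not PDFBox code: SpacingSketch, Glyph, layOut, extract, and the 0.3 threshold are all hypothetical, and the gap test is a simplified stand-in for whatever heuristic the real text stripper applies. It only assumes that a space is emitted when the gap between the reported end of one glyph and the start of the next is large relative to the reported glyph width.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in PDFBox.
final class SpacingSketch {

    /** A placed glyph: its character, start x, and the width the extractor believes it has. */
    static final class Glyph {
        final char ch;
        final double x;
        final double width;

        Glyph(char ch, double x, double width) {
            this.ch = ch;
            this.x = x;
            this.width = width;
        }
    }

    /**
     * Lays text out left to right with a fixed real advance per character.
     * Each glyph reports advance * widthFactor as its width, so a factor of
     * 0.5 models a width that is half of what is needed. Space characters
     * move the pen but produce no glyph.
     */
    static List<Glyph> layOut(String text, double advance, double widthFactor) {
        List<Glyph> glyphs = new ArrayList<>();
        double pen = 0.0;
        for (char c : text.toCharArray()) {
            if (c != ' ') {
                glyphs.add(new Glyph(c, pen, advance * widthFactor));
            }
            pen += advance;
        }
        return glyphs;
    }

    /**
     * Rebuilds the text, inserting a space whenever the gap after the
     * previous glyph exceeds spaceThreshold times that glyph's reported width.
     */
    static String extract(List<Glyph> glyphs, double spaceThreshold) {
        StringBuilder out = new StringBuilder();
        Glyph prev = null;
        for (Glyph g : glyphs) {
            if (prev != null) {
                double gap = g.x - (prev.x + prev.width);
                if (gap > spaceThreshold * prev.width) {
                    out.append(' ');
                }
            }
            out.append(g.ch);
            prev = g;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Correct width: spaces appear only at word boundaries.
        System.out.println(extract(layOut("hold a public", 10.0, 1.0), 0.3));
        // Halved width: every glyph appears to be followed by a gap.
        System.out.println(extract(layOut("hold a public", 10.0, 0.5), 0.3));
    }
}
```

      With the full width the sketch prints "hold a public"; with the halved width it prints "h o l d a p u b l i c", the same pattern as the 2.0 output above.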

      Attachments

        1. RAU4G6QMOVRYBISJU7R6MOVZCRFUO7P4.pdf
          79 kB
          Tilman Hausherr
        2. RAU4G6QMOVRYBISJU7R6MOVZCRFUO7P4-marked-1.png
          262 kB
          Tilman Hausherr
        3. 103799.pdf
          132 kB
          Tilman Hausherr

          People

            lehmi Andreas Lehmkühler
            tilman Tilman Hausherr