PDFBox / PDFBOX-4482

True Type vs Embedded Text Output

Details

    • Type: Bug
    • Status: Open
    • Priority: Trivial
    • Resolution: Unresolved
    • Affects Version/s: 2.0.14
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None
    • Environment: Windows or Linux

    Description

      This is kind of difficult to describe, but here goes.

      We use the TinyMCE editor and then process the resulting HTML document through MS Word to create a PDF. All is good there. Once we have the PDF, we need to send its content to another system that only recognizes text, so we need to preserve the vertical spacing between parts of the document.

      If Arial font is used all works well.

      If the Times font is used, the <p> paragraphs are messed up.

      Source HTML for Times is here:

      <p><span style="font-family: times new roman,times; font-size: 10pt;">COMPARISON: None</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">TECHNIQUE: Axial CT images obtained of the spine with sagittal and coronal reconstructions.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique section.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">FINDINGS: No acute fracture, dislocation or abnormal lesion is shown. There is loss of the cervical lordosis. Osteophyte is noted at C4-5, and C5-6 levels with moderate disc space narrowing. The spinal cord is normal. No Chiari malformation. No extradural soft tissue masses or paraspinal soft tissue masses. Upper thoracic spine is normal.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings section.  The previous lines are all one paragraph and should not have any breaks.  The next several lines are one line (one paragraph) per spine section.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C2-3: No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C3-4: Shallow central and right paracentral disc herniation. No central stenosis or neural foraminal stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C4-5: Diffuse disc bulge and right posterior lateral osteophyte with uncovertebral joint hypertrophy and moderate bilateral neural foraminal stenosis. No central stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C5-6: Shallow central and right posterior lateral disc herniation with mass-effect along the cervical cord, right lateral recess stenosis and bilateral neural foraminal stenosis, moderate on the left and mild on the right. No central stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C6-7: No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;">C7-T1: No disc herniation, central stenosis or neural foraminal stenosis.</span></p>
      <p><span style="font-family: times new roman,times; font-size: 10pt;"><span style="font-family: times new roman,times; font-size: 10pt;">IMPRESSION: </span><br /><span style="font-family: times new roman,times; font-size: 10pt;">1. Moderate degenerative disc disease with loss of cervical lordosis.  This is line 1 of the impression.</span><br /><br /><span style="font-family: times new roman,times; font-size: 10pt;">2. C3-4 level with shallow central and right paracentral disc herniation but no stenosis.  This is line 2 of the impression.</span><br /><br /><span style="font-family: times new roman,times; font-size: 10pt;">3. C4-5 level with diffuse disc bulge, right posterior lateral osteophyte and uncovertebral joint hypertrophy with moderate bilateral neural foraminal stenosis.  This is line 3 of the impression.</span><br /><br /><span style="font-family: times new roman,times; font-size: 10pt;">4. C5-6 level with shallow central and right posterior lateral disc herniation, right lateral recess stenosis and bilateral neural foraminal stenosis, moderate on the left greater than right.  This is line 4 of the impression.</span></span></p>

      :BREAK!

      This results in extracted output that closes and reopens the paragraph (<p>) at every visual line, instead of keeping each paragraph intact and preserving only the real line breaks.

      :SUBPART!

      INFO  <p>TECHNIQUE: Axial CT images obtained of the spine with sagittal and coronal reconstructions.  This is the technique
      INFO  </p>
      INFO  <p>section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique
      INFO  </p>
      INFO  <p>section.  This is the technique section.  This is the technique section.  This is the technique section.  This is the technique
      INFO  </p>
      INFO  <p>section.  This is the technique section.  This is the technique section.
      INFO  </p>
      INFO  <p>FINDINGS: No acute fracture, dislocation or abnormal lesion is shown. There is loss of the cervical lordosis. Osteophyte is
      INFO  </p>
      INFO  <p>noted at C4-5, and C5-6 levels with moderate disc space narrowing. The spinal cord is normal. No Chiari malformation. No
      INFO  </p>
      INFO  <p>extradural soft tissue masses or paraspinal soft tissue masses. Upper thoracic spine is normal.  This is the findings
      INFO  </p>
      INFO  <p>section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings
      INFO  </p>
      INFO  <p>section.  This is the findings section.  This is the findings section.  This is the findings section.  This is the findings
      INFO  </p>
      INFO  <p>section.  This is the findings section.  The previous lines are all one paragraph and should not have any breaks.  The next
      INFO  </p>
      INFO  <p>several lines are one line (one paragraph) per spine section.
      INFO  </p>
      INFO  <p>C2-3: No disc herniation, central stenosis or neural foraminal stenosis.
      INFO  </p>

      The original <p> is broken up line by line and not represented as a true paragraph. Am I doing something wrong, or is it the conversion?
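As a stopgap while the extraction itself is investigated, the fragments could be stitched back together after extraction. The following is only a minimal sketch in plain Java, not part of PDFBox: the class name `ParagraphMerger`, the `merge` method, and the rule "a fragment starting with an upper-case label and a colon opens a new paragraph" are all assumptions made for illustration, based on the report layout shown above (TECHNIQUE:, FINDINGS:, C2-3:, and so on).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphMerger {
    // Assumed heuristic (not from PDFBox): a fragment that begins with an
    // upper-case label followed by a colon, e.g. "TECHNIQUE:" or "C7-T1:",
    // is treated as the start of a new paragraph.
    private static final Pattern LABEL = Pattern.compile("^[A-Z][A-Z0-9-]*:");

    /** Joins line-level fragments into paragraphs using the label heuristic. */
    public static List<String> merge(List<String> fragments) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : fragments) {
            String fragment = raw.trim();
            if (fragment.isEmpty()) {
                continue; // skip empty fragments
            }
            if (current == null || LABEL.matcher(fragment).find()) {
                // A labelled fragment closes the previous paragraph, if any.
                if (current != null) {
                    paragraphs.add(current.toString());
                }
                current = new StringBuilder(fragment);
            } else {
                // Unlabelled fragments are continuations of the current paragraph.
                current.append(' ').append(fragment);
            }
        }
        if (current != null) {
            paragraphs.add(current.toString());
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Fragments shortened from the extracted output in this report.
        List<String> fragments = Arrays.asList(
            "TECHNIQUE: Axial CT images obtained of the spine. This is the technique",
            "section.  This is the technique section.",
            "FINDINGS: No acute fracture, dislocation or abnormal lesion is shown.",
            "noted at C4-5, and C5-6 levels with moderate disc space narrowing.",
            "C2-3: No disc herniation, central stenosis or neural foraminal stenosis.");
        for (String paragraph : merge(fragments)) {
            System.out.println("<p>" + paragraph + "</p>");
        }
    }
}
```

This only works when every real paragraph starts with such a label. It would also fold the numbered IMPRESSION lines into a single paragraph and lose their line breaks, so it is a workaround for the sample above rather than a fix for the extraction.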

      Any help appreciated!

      Attachments

        1. 581.pdf (28 kB, Stev Dempsey)
        2. 582.pdf (32 kB, Stev Dempsey)
        3. 583.pdf (27 kB, Stev Dempsey)
        4. 584.pdf (32 kB, Stev Dempsey)
        5. 585.pdf (87 kB, Stev Dempsey)

          People

            Assignee: Unassigned
            Reporter: Stev Dempsey (stevmon)