PDFBox / PDFBOX-1860

HTML converter escapes formatting close tags

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.3
    • Fix Version/s: 1.8.5, 2.0.0
    • Component/s: Text extraction
    • Labels:
      None

      Description

      Regression introduced in 1.8.3 by PDFBOX-1213 (HTML style information).
      Bold style tags are opened correctly, but the closing tags are HTML-escaped, so </b> is written as &lt;/b&gt; and the bold element is never closed.

      ~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf 
      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
      "http://www.w3.org/TR/html4/loose.dtd">
      <html><head><title>1725.PDF</title>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      </head>
      <body>
      <div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
      </p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE&lt;/b&gt;</p>
      
      </div></div>
      </body></html>
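
The output above is consistent with the closing tag being passed through the same escaping step as the extracted text. The following is a minimal sketch of that failure mode and of the usual remedy (escape only the text, never the markup). It is not the PDFBox code or the actual patch; the class `StyledHtmlSketch` and the methods `escape`, `buggyBold`, and `fixedBold` are hypothetical names used for illustration only.

```java
// Hypothetical illustration; not taken from PDFBox.
class StyledHtmlSketch {

    /** Escapes the characters that are special in HTML text content. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Assumed failure mode: the close tag is appended before escaping. */
    static String buggyBold(String text) {
        return "<b>" + escape(text + "</b>");
    }

    /** Remedy: escape only the extracted text, then append the raw close tag. */
    static String fixedBold(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    public static void main(String[] args) {
        String text = "A VERY SMALL PDF FILE";
        System.out.println(buggyBold(text)); // <b>A VERY SMALL PDF FILE&lt;/b&gt;
        System.out.println(fixedBold(text)); // <b>A VERY SMALL PDF FILE</b>
    }
}
```

Keeping markup and text on separate code paths means the escaping routine never sees a tag, whichever style tags are added later.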
      

              People

              • Assignee: Andreas Lehmkühler (lehmi)
              • Reporter: Cheng Leong (cheng@indeed.com)