PDFBOX-1860: HTML converter escapes formatting close tags

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.8.3
    • Fix Version/s: 1.8.5, 2.0.0
    • Component/s: Text extraction
    • Labels: None

    Description

      This bug was introduced in 1.8.3 by the HTML style information change in PDFBOX-1213.
      Bold style tags are opened correctly, but the close tags are HTML-escaped: the output contains &lt;/b&gt; where </b> is expected.

      ~/work/pdfbox ((1.8.3))$ java -jar app/target/pdfbox-app-1.8.3.jar ExtractText -html -nonSeq -console pdftest.pdf 
      <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
      "http://www.w3.org/TR/html4/loose.dtd">
      <html><head><title>1725.PDF</title>
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      </head>
      <body>
      <div style="page-break-before:always; page-break-after:always"><div><p>E:\M55\!\1725.fm 2003-01-01 18:15 P Tagg, IPM, University of Liverpool
      </p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE
      &lt;/b&gt;</p>
      <p><b>A VERY SMALL PDF FILE&lt;/b&gt;</p>
      
      </div></div>
      </body></html>
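
The symptom above is typical of a close tag being passed through the same escaping step as the extracted text. The following is a minimal sketch of that mistake and of the intended behaviour. It is not PDFBox's actual code: the class CloseTagEscapeSketch and the methods escape, boldBuggy, boldFixed and hasEscapedCloseTag are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

public class CloseTagEscapeSketch {

    // Hypothetical helper: escapes characters that are special in HTML text content.
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Assumed shape of the bug: the close tag is appended to the text
    // first, so it gets escaped together with the text.
    static String boldBuggy(String text) {
        return "<b>" + escape(text + "</b>");
    }

    // Intended behaviour: only the extracted text is escaped,
    // the tags are written as markup.
    static String boldFixed(String text) {
        return "<b>" + escape(text) + "</b>";
    }

    // Heuristic check for converter output: matches an escaped close tag
    // such as &lt;/b&gt;.
    private static final Pattern ESCAPED_CLOSE_TAG =
            Pattern.compile("&lt;/[A-Za-z][A-Za-z0-9]*&gt;");

    static boolean hasEscapedCloseTag(String html) {
        return ESCAPED_CLOSE_TAG.matcher(html).find();
    }

    public static void main(String[] args) {
        String text = "A VERY SMALL PDF FILE";
        System.out.println(boldBuggy(text));
        System.out.println(boldFixed(text));
        System.out.println(hasEscapedCloseTag(boldBuggy(text)));
        System.out.println(hasEscapedCloseTag(boldFixed(text)));
    }
}
```

The check in hasEscapedCloseTag is only a heuristic for regression testing: a PDF whose text really contains a literal close tag would also be escaped to the same form, so a match is a hint to inspect the output rather than proof of the bug.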
      

      Attachments

        1. PDFBOX-1860_Do_not_escape_html_formatting_close_tags.patch
          3 kB
          Cheng Leong
        2. pdftest.pdf
          3 kB
          Cheng Leong

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Cheng Leong (cheng@indeed.com)
              Votes: 0
              Watchers: 2
