PDFBox / PDFBOX-1130

ExtractText -html doesn't always close the <p> tags it opens


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels: None

    Description

      I have a test document (the same one as on PDFBOX-1129). When it is run through ExtractText -html, the page number is extracted for each page, but in each case the output looks like:

      <p>N<p>Text of page N...

      I.e., the <p> tag opened for the page number is never closed.

      Maybe related: if I run ExtractText without -html, there is no space between the page number and the next word, i.e. I see words like 1Massachusetts, 2Course, 3also, 4the.
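
      One way to check extracted output for the unclosed <p> pattern described above is sketched here. This is only an illustration and not part of PDFBox or of the attached patch: the class name UnclosedParagraphCheck and the method countUnclosedParagraphs are hypothetical, and the check assumes the output never nests one paragraph inside another.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): counts <p> tags that are
// opened but not closed before the next <p> or the end of the input.
public class UnclosedParagraphCheck {

    // Matches <p>, <p attr="...">, and </p>, but not tags such as <pre>.
    private static final Pattern P_TAG =
            Pattern.compile("<(/?)p(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    public static int countUnclosedParagraphs(String html) {
        int unclosed = 0;
        boolean open = false;
        Matcher m = P_TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (closing) {
                open = false;
            } else {
                // A new <p> while the previous one is still open
                // means the previous one was never closed.
                if (open) {
                    unclosed++;
                }
                open = true;
            }
        }
        // A paragraph still open at the end of the input is also unclosed.
        if (open) {
            unclosed++;
        }
        return unclosed;
    }

    public static void main(String[] args) {
        // Shape reported in this issue: the page-number paragraph is left open.
        System.out.println(countUnclosedParagraphs("<p>1<p>Text of page 1...</p>"));
        // Expected shape: every paragraph is closed.
        System.out.println(countUnclosedParagraphs("<p>1</p><p>Text of page 1...</p>"));
    }
}
```

      Running such a check over the -html output before and after applying a fix would show whether the page-number paragraphs are now closed.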

      Attachments

        1. 000086.pdf
          41 kB
          Michael McCandless
        2. PDFBOX-1130.patch
          1 kB
          Michael McCandless

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: Michael McCandless (mikemccand)
              Votes: 0
              Watchers: 3
