PDFBox / PDFBOX-884

Error in HTML output when detecting paragraph boundaries

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 1.3.1, 1.4.0
    • Fix Version/s: None
    • Component/s: Text extraction
    • Labels: None

    Description

      There is an error in paragraph detection in version 1.3 and in the current development trunk. In some situations the HTML output contains an unclosed <p> tag. It happens on new pages when, I think, the first paragraph is empty (an empty string).

      I attach a fairly simple patch that solves this problem. It is perhaps not very elegant, but it is quite difficult for me to understand the paragraph logic in the current development trunk.
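
      The symptom described above (a <p> that is opened but never closed when a page starts with an empty paragraph) can be illustrated with a minimal sketch. This is not the attached patch and not PDFBox code: the class ParagraphTagSketch and the method renderPage are hypothetical names, and the sketch assumes that paragraphs arrive as plain strings, one list per page. It tracks whether a paragraph is currently open, so empty paragraphs are skipped and every opening tag gets a matching close. HTML escaping is left out for brevity.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; this is NOT how PDFBox implements paragraph output.
final class ParagraphTagSketch {

    // Renders the paragraphs of one page as HTML, keeping <p> and </p> balanced.
    static String renderPage(List<String> paragraphs) {
        StringBuilder html = new StringBuilder();
        boolean paragraphOpen = false; // true while a <p> is waiting for its </p>

        for (String text : paragraphs) {
            // Skip empty paragraphs so they never emit a dangling <p>.
            if (text == null || text.isEmpty()) {
                continue;
            }
            // Close the previous paragraph before opening the next one.
            if (paragraphOpen) {
                html.append("</p>\n");
            }
            html.append("<p>").append(text);
            paragraphOpen = true;
        }

        // Close whatever is still open at the end of the page.
        if (paragraphOpen) {
            html.append("</p>\n");
        }
        return html.toString();
    }

    public static void main(String[] args) {
        // A page whose first paragraph is empty, as in the report.
        System.out.print(renderPage(Arrays.asList("", "First", "Second")));
    }
}
```

      The design choice here is to keep a single open/closed flag rather than closing tags at fixed points in the output, so the result stays balanced regardless of where empty paragraphs appear on the page.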

      Attachments

        1. input.pdf
          65 kB
          David Rodríguez Alfayate
        2. output_error.html
          3 kB
          David Rodríguez Alfayate
        3. pdfbox-paragraph-detection.patch
          1 kB
          David Rodríguez Alfayate

            People

              Assignee: Andreas Lehmkühler (lehmi)
              Reporter: David Rodríguez Alfayate (erudil)
              Votes: 0
              Watchers: 1

              Dates

                Created:
                Updated:
                Resolved: