[PDFBOX-1130] ExtractText -html doesn't always close the <p> tags it opens - ASF JIRA

Details

Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.8.0
Component/s: None
Labels: None

Description

I have a test document (the same one as on PDFBOX-1129). When it is run through ExtractText -html, the page number for each page is extracted, but in each case the output looks like:

<p>N<p>Text of page N...

I.e., the <p> tag for the page number wasn't closed.
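To check ExtractText -html output for this symptom, a small standalone checker can help. The following is only a sketch, not part of PDFBox or of the attached patch: the class name UnclosedParagraphCheck and the method countUnclosed are hypothetical, and it assumes the output has already been read into a String. It counts every <p> that is followed by another <p>, or by the end of the input, before a </p> appears.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting extracted HTML; not a PDFBox class.
public class UnclosedParagraphCheck {

    // Matches an opening <p> (optionally with attributes) or a closing </p>.
    // Group 1 is "/" for a closing tag and empty for an opening tag.
    private static final Pattern P_TAG =
            Pattern.compile("<(/?)p(?:\\s[^>]*)?>", Pattern.CASE_INSENSITIVE);

    /** Counts <p> tags that are not closed before the next <p> or the end of input. */
    public static int countUnclosed(String html) {
        int unclosed = 0;
        boolean open = false;
        Matcher m = P_TAG.matcher(html);
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            if (closing) {
                open = false;
            } else {
                if (open) {
                    // A new paragraph started while the previous one was still open.
                    unclosed++;
                }
                open = true;
            }
        }
        if (open) {
            // The last paragraph was never closed.
            unclosed++;
        }
        return unclosed;
    }

    public static void main(String[] args) {
        // Same shape as the output quoted in this report.
        String sample = "<p>1<p>Text of page 1...</p>";
        System.out.println(countUnclosed(sample)); // prints 1
    }
}
```

A result greater than zero on the extracted HTML for 000086.pdf would indicate the behaviour described here.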

Maybe related: if I run ExtractText without -html, there is no space after the page number and before the next word, i.e. I see words like 1Massachusetts, 2Course, 3also, 4the.
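The fused tokens can be located in the plain-text output with a simple pattern match. This is a rough sketch under stated assumptions: the class FusedPageNumberCheck and the method findFused are hypothetical names, not PDFBox API, and the pattern will also flag legitimate tokens such as "2nd", so hits need a manual look.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for inspecting extracted plain text; not a PDFBox class.
public class FusedPageNumberCheck {

    // One or more digits at a word boundary, immediately followed by ASCII letters.
    private static final Pattern FUSED = Pattern.compile("\\b\\d+[A-Za-z]+");

    /** Returns every token in which a number runs directly into a word. */
    public static List<String> findFused(String text) {
        List<String> hits = new ArrayList<>();
        Matcher m = FUSED.matcher(text);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        // Tokens taken from the examples in this report.
        System.out.println(findFused("1Massachusetts ... 2Course ... 3also ... 4the"));
    }
}
```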

Attachments

- PDFBOX-1130.patch (1 kB), 04/Oct/11 10:29, Michael McCandless
- 000086.pdf (41 kB), 04/Oct/11 10:24, Michael McCandless

Issue Links

relates to

PDFBOX-2160 PDFTextStripper doesn't always write paragraph start (Closed)

People

Assignee: Andreas Lehmkühler

Reporter: Michael McCandless

Votes: 0

Watchers: 3

Dates

Created: 04/Oct/11 10:23

Updated: 02/Mar/15 20:51

Resolved: 13/Oct/12 13:59