[PDFBOX-1104] Improves parsing speed of a pdf by an average of 45% when extracting text from one random page in the document. - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.6.0
Fix Version/s: 1.6.0
Component/s: Parsing, Utilities
Labels: None

Description

The proposed parser reads only the minimum required from the PDF file, as defined by the PDF specification. A random page can be parsed without having to parse the entire document first. Existing parsing code was reused so that existing bug fixes and compliance fixes carry over to this parser.
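The idea of parsing only what a page needs can be sketched as a lazy object table. This is a minimal illustration, assuming the parser resolves objects through the byte offsets in the PDF cross-reference table. `LazyObjectTable`, `parseAt`, and `parseCount` are hypothetical names and are not taken from the attached QuickParser.java or from the PDFBox API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongFunction;

/**
 * Hypothetical sketch of on-demand object loading; not PDFBox code.
 * Objects are parsed only when first requested, then cached.
 */
final class LazyObjectTable {
    // Object number -> byte offset, as a cross-reference table would provide.
    private final Map<Integer, Long> offsets;
    // Objects that have actually been parsed so far.
    private final Map<Integer, String> parsed = new HashMap<>();
    // Stand-in for "seek to this offset and parse one object".
    private final LongFunction<String> parseAt;
    private int parseCount = 0;

    LazyObjectTable(Map<Integer, Long> offsets, LongFunction<String> parseAt) {
        this.offsets = offsets;
        this.parseAt = parseAt;
    }

    /** Returns the object, parsing it on first access only. */
    String get(int objectNumber) {
        String cached = parsed.get(objectNumber);
        if (cached != null) {
            return cached;
        }
        Long offset = offsets.get(objectNumber);
        if (offset == null) {
            throw new IllegalArgumentException("No xref entry for object " + objectNumber);
        }
        String value = parseAt.apply(offset);
        parseCount++;
        parsed.put(objectNumber, value);
        return value;
    }

    /** Number of objects parsed so far; untouched objects cost nothing. */
    int parseCount() {
        return parseCount;
    }
}
```

With this shape, extracting one page touches only the objects reachable from that page, which is where a speedup of the kind reported here would come from.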

The parser has been tested with the text extraction tool, but not with the viewer or other PDF tools. Some tools may need to be recoded before they can use the parser without hitting null pointer exceptions, since the COSDocument will contain null pointers for COSObjects that have not been parsed. For example, the current text extractor assumes the entire document is loaded. This submission also includes a modified text extractor named OnePagePDFTextStripper. The class has a method that extracts the text from a PDPage supplied by the programmer.
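The null pointer hazard described above can be illustrated without PDFBox. The sketch below is an assumption about the failure mode, not code from the attachments: `PartialDocument` and `TextLengthTool` are hypothetical stand-ins for a partially parsed document and a tool that reads from it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for a document in which only some objects were parsed. */
final class PartialDocument {
    private final Map<Integer, String> loaded = new HashMap<>();

    void markParsed(int objectNumber, String content) {
        loaded.put(objectNumber, content);
    }

    /** Returns null for objects that were never parsed, mirroring the hazard. */
    String rawLookup(int objectNumber) {
        return loaded.get(objectNumber);
    }

    /** Makes the "not parsed yet" case explicit for callers. */
    Optional<String> lookup(int objectNumber) {
        return Optional.ofNullable(loaded.get(objectNumber));
    }
}

/** Hypothetical tool showing an unguarded and a guarded access. */
final class TextLengthTool {
    // Assumes the whole document is loaded; throws NullPointerException otherwise.
    static int unsafeLength(PartialDocument doc, int objectNumber) {
        return doc.rawLookup(objectNumber).length();
    }

    // Treats an unparsed object as empty instead of failing.
    static int safeLength(PartialDocument doc, int objectNumber) {
        return doc.lookup(objectNumber).map(String::length).orElse(0);
    }
}
```

Whether an unparsed object should count as empty, trigger a parse on demand, or raise an error is a design choice for each tool; the guarded variant only shows one option.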

Attachments

- fast_parser.diff, 12 kB, 18/Aug/11 20:04, Jeremy Villalobos
- OnePagePDFTextStripper.java, 67 kB, 18/Aug/11 20:04, Jeremy Villalobos
- PagesNotExpectedHere.java, 0.9 kB, 18/Aug/11 20:04, Jeremy Villalobos
- ParseTester.java, 4 kB, 18/Aug/11 20:04, Jeremy Villalobos
- QuickParser.java, 25 kB, 18/Aug/11 20:04, Jeremy Villalobos

People

Assignee: Unassigned

Reporter: Jeremy Villalobos

Votes: 0

Watchers: 0

Dates

Created: 18/Aug/11 19:59

Updated: 24/Mar/13 14:30

Resolved: 18/Aug/11 20:01