[PDFBOX-3058] Support TIKA Migration to PDFBox 2.0 - ASF JIRA

Details

Type: Task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 2.0.0
Fix Version/s: 2.0.0
Component/s: Text extraction
Labels: None

Description

This issue tracks fixes for problems that came up as part of TIKA-1285 (Upgrade to PDFBox 2.0.0 when available), mainly:

- new exceptions compared to PDFBox 1.8.x
- regressions in text extraction
- lower quality text extraction

There should be individual issues to track tasks/bugs arising from that.

Attachments

textLostFromACausedByNewExceptionsInB.zip	17/Nov/15 15:50	42 kB	Tim Allison
NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_2_0.json	26/Oct/15 19:53	2 kB	Tim Allison
NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU_1_8_10.json	26/Oct/15 19:53	2 kB	Tim Allison
content_diffs-4.xlsx	04/Jan/16 17:26	3.14 MB	Tilman Hausherr
content_diffs-1.8-to-2.0.xlsx	25/Oct/15 13:01	1.52 MB	Tilman Hausherr

Issue Links

depends upon

PDFBOX-3051	COSArray.getObject() incorrect handling of indirect reference to COSNull	Closed
PDFBOX-3059	java.io.IOException: Error: Unknown annotation type COSNull{}	Closed

Sub-Tasks

1.	COSArray.getObject() incorrect handling of indirect reference to COSNull	Closed	Tilman Hausherr
2.	NPE in CFFParser.parseType1Dicts()	Closed	Tilman Hausherr
3.	Text extraction fails with type 3 fonts	Closed	Tilman Hausherr
4.	NPE in PDFStreamEngine.ShowText when no font set	Closed	Tilman Hausherr
5.	java.io.IOException: Error: Unknown annotation type COSNull{}	Closed	Unassigned
6.	Catalog cannot be found	Closed	Andreas Lehmkühler
7.	Word concatenation in 2.0 not in 1.8	Closed	Tilman Hausherr
8.	Text extraction and height different in 2.0	Closed	Tilman Hausherr
9.	Null metadata in 2.0 in some files that had metadata in 1.8.10 with old parser	Closed	Tilman Hausherr
10.	Avoid crazy /Length1 values in font descriptor	Closed	Tilman Hausherr
11.	Text extraction partially garbled in this file, was OK in 1.8	Closed	Tilman Hausherr
12.	Text extraction garbled in this file, was OK in 1.8	Closed	Tilman Hausherr
13.	IndexOutOfBoundsException in PDFont.getWidth()	Closed	Tilman Hausherr
14.	IndexOutOfBoundsException in PfbParser.parsePfb	Closed	Tilman Hausherr
15.	NullPointerException in PDFStreamEngine.showText()	Closed	Tilman Hausherr
16.	Text with vertical font not extracted correctly	Closed	Andreas Lehmkühler
17.	Text extraction garbled in this file, was OK in 1.8	Closed	Unassigned
18.	Parsing fails when XRef stream object is 1 byte later	Closed	Andreas Lehmkühler
19.	The trailer rebuild mechanism doesn't work	Closed	Andreas Lehmkühler
20.	One 32kb truncated file causes OOM in 2.0.0-trunk	Closed	Andreas Lehmkühler
21.	Rare new NPE in 2.0.0-trunk	Resolved	Unassigned

People

Assignee: Andreas Lehmkühler

Reporter: Maruan Sahyoun

Votes: 0

Watchers: 4

Dates

Created: 25/Oct/15 08:54

Updated: 28/Mar/16 19:51

Resolved: 23/Jan/16 18:02