[PDFBOX-570] Wingdings font recognition + spacing issue - ASF JIRA

Details

Type: Wish
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: 0.7.3
Fix Version/s: None
Component/s: Text extraction
Labels:
- wingdings
Environment:
Windows XP / Java JDK 1.6.0_15 / Tika 0.4 with PDFbox-0.7.3.jar and fontbox-0.1.0.jar embedded

Description

Wingdings characters issue
--------------------------

I first filed this request in Tika's wish list (TIKA-331), but Ken Krugler suggested it was a PDFBox issue.

I have PDF files that include some characters in the Wingdings font.
The Tika parser replaces them with Unicode characters unrelated to the original glyphs and, in some cases, with alphabetic characters. That is expected given the character codes these glyphs occupy in the Wingdings font, but when pictures of hands are replaced by alphabetic characters such as A, B, etc., it disrupts later lexical analysis.

Would it be possible to improve the parsing and replace these characters with more accurate Unicode characters?
(see http://www.alanwood.net/demos/wingdings.html for possible correspondences).
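
To illustrate the kind of replacement requested, here is a minimal sketch of a lookup from Wingdings character codes to Unicode symbols. `WingdingsRemap` and `remap` are hypothetical names, not part of PDFBox or Tika. The four entries are only a small sample and should be checked against the table linked above. The sketch also assumes the caller already knows that a given text run was set in Wingdings, which is information the extractor would have to provide.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of PDFBox or Tika. */
final class WingdingsRemap {
    // Sample entries only: Wingdings character code -> Unicode symbol.
    private static final Map<Character, String> TABLE = new HashMap<>();
    static {
        TABLE.put((char) 0x46, "\u261E"); // code 0x46 ('F') -> U+261E WHITE RIGHT POINTING INDEX
        TABLE.put((char) 0x4A, "\u263A"); // code 0x4A ('J') -> U+263A WHITE SMILING FACE
        TABLE.put((char) 0xF0, "\u21E8"); // code 0xF0       -> U+21E8 RIGHTWARDS WHITE ARROW
        TABLE.put((char) 0xFC, "\u2713"); // code 0xFC       -> U+2713 CHECK MARK
    }

    /**
     * Remaps a run of text that is assumed to have been set in Wingdings.
     * Codes without a table entry are passed through unchanged.
     */
    static String remap(String wingdingsRun) {
        StringBuilder out = new StringBuilder(wingdingsRun.length());
        for (int i = 0; i < wingdingsRun.length(); i++) {
            char c = wingdingsRun.charAt(i);
            String mapped = TABLE.get(c);
            out.append(mapped != null ? mapped : String.valueOf(c));
        }
        return out.toString();
    }
}
```

Applying such a table to text that is not in Wingdings would corrupt ordinary letters, so the lookup has to be keyed on the font of each run rather than applied to the whole extracted text.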

Attached files:

test1.pdf is a PDF file that includes Wingdings characters. Some are commonly used, others less frequently.

Parsing_Result1.txt is the text file produced by Tika.

test2.pdf is another example: the same Word source file converted to PDF with another tool. Parsing_Result2.txt is the Tika parsing result. The Wingdings characters are translated into different Unicode characters than in the first example.

Spacing issue
-------------

Compare lines 10 and 11 of test2.pdf with lines 11 and 12 of the Tika parsing result (Parsing_Result2.txt):

ðLocalisation des zones de livraison et de stockage
ðLocalisation des zones dangereuses

There is no space between ð and Localisation (ð is how Tika renders the Wingdings "Rightwards white arrow").

If you copy lines 10 and 11 from test2.pdf and paste them into a Notepad window, you get:

ð Localisation des zones de livraison et de stockage
ð Localisation des zones dangereuses

...with a space between ð and Localisation.

In my case, the missing space in the Tika output causes "ðLocalisation" to be treated as a single word in subsequent analysis.
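
Until the extractor itself emits the space, a downstream workaround is possible. The following is a minimal sketch, assuming that U+00F0 (ð) at the start of a line is always a bullet stand-in, as it is in Parsing_Result2.txt. `BulletSpacing` and `separateBullets` are hypothetical names, not PDFBox or Tika API.

```java
import java.util.regex.Pattern;

/** Hypothetical post-processing helper; not part of PDFBox or Tika. */
final class BulletSpacing {
    // Assumption: a line-initial U+00F0 is a bullet glyph, not a real letter.
    // Matches that character only when it is directly followed by non-whitespace.
    private static final Pattern GLUED_BULLET =
            Pattern.compile("(?m)^(\u00F0)(?=\\S)");

    /** Inserts one space after a line-initial bullet that is glued to the next word. */
    static String separateBullets(String extracted) {
        return GLUED_BULLET.matcher(extracted).replaceAll("$1 ");
    }
}
```

This only patches the text after extraction. It would also split genuine words that begin with ð (in Icelandic, for example), so it is no substitute for correct word-spacing detection in the extractor.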

Regards

Attachments

- test1.pdf, 48 kB, 26/Nov/09 09:07, MRIT64
- Parsing_Result1.txt, 2 kB, 26/Nov/09 09:07, MRIT64
- test2.pdf, 206 kB, 26/Nov/09 09:07, MRIT64
- Parsing_Result2.txt, 1 kB, 26/Nov/09 09:07, MRIT64

Issue Links

depends upon: PDFBOX-11 CID to Unicode mapping (Closed)

People

Assignee: Andreas Lehmkühler
Reporter: MRIT64
Votes: 0
Watchers: 1

Dates

Created: 26/Nov/09 09:06
Updated: 13/Oct/14 17:33
Resolved: 13/Oct/14 17:33