[TIKA-1315] Basic list support in WordExtractor - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 1.6
Fix Version/s: 1.10
Component/s: parser
Labels: None

Description

Hello guys, I am really sorry to post an issue like this, but I have no other way of contacting you, and I don't quite understand how you manage forks and pull requests (I don't think you do that). I also don't know your coding style.

In my project I needed Tika to parse numbered lists from Word .doc documents, but Tika doesn't support that. I looked for a solution and found one here: http://developerhints.blog.com/2010/08/28/finding-out-list-numbers-in-word-document-using-poi-hwpf/ . I adapted this solution to Apache Tika with a few fixes and improvements. Feel free to use any of it, so it can help people who struggle with lists in Tika like I did.

Attached files are:
- Updated test (WordParserTest.java.patch)
- Fixed WordExtractor (WordExtractor.java.patch)
- Added ListUtils (ListUtils.java)
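
For readers unfamiliar with the problem: Word stores a numbered paragraph with a reference to its list and a nesting level, not with the visible number, so an extractor has to rebuild the numbers by counting. The sketch below illustrates only that counting step. It is not the attached ListUtils; the class name `ListNumberTracker` and its `next` method are hypothetical, and it assumes plain decimal outline numbering ("1.", "1.1.") instead of the per-level formats and start values a real document defines. The `listId` and `level` arguments stand in for the values that POI's HWPF `Paragraph` exposes through `getIlfo()` and `getIlvl()`.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: keeps running counters for numbered lists. */
class ListNumberTracker {
    // Word list definitions have at most nine nesting levels (0 to 8).
    private static final int MAX_LEVELS = 9;

    // list id -> one counter per nesting level
    private final Map<Integer, int[]> counters = new HashMap<>();

    /**
     * Returns the label for the next paragraph of the given list at the
     * given zero-based level. Assumes decimal numbering on every level.
     */
    String next(int listId, int level) {
        if (level < 0 || level >= MAX_LEVELS) {
            throw new IllegalArgumentException("level out of range: " + level);
        }
        int[] c = counters.computeIfAbsent(listId, id -> new int[MAX_LEVELS]);

        // Advance this level and restart every deeper level.
        c[level]++;
        for (int i = level + 1; i < MAX_LEVELS; i++) {
            c[i] = 0;
        }

        // Build an outline label such as "2.1."; a parent level that was
        // never visited is treated as 1 so the label never contains a 0.
        StringBuilder label = new StringBuilder();
        for (int i = 0; i <= level; i++) {
            if (c[i] == 0) {
                c[i] = 1;
            }
            label.append(c[i]).append('.');
        }
        return label.toString();
    }
}
```

Calling `next(1, 0)`, `next(1, 0)`, `next(1, 1)` on one tracker yields "1.", "2.", "2.1."; a paragraph from a different list id starts again at "1.". Keeping counters per list id is what lets two interleaved lists in the same document number independently.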

Attachments

- ListUtils.java, 30/May/14 15:54, 8 kB, Filip Bednárik
- WordExtractor.java.patch, 30/May/14 16:38, 3 kB, Filip Bednárik
- WordParserTest.java.patch, 30/May/14 16:38, 0.6 kB, Filip Bednárik
- ListManager.tar.bz2, 21/Sep/14 16:22, 8 kB, Moritz Dorka
- ListNumbering.patch, 21/Sep/14 16:22, 7 kB, Moritz Dorka
- complex_list_test.doc, 05/May/15 12:00, 52 kB, Moritz Dorka

Issue Links

depends upon: TIKA-1667 Upgrade to POI 3.13-beta1 when available (Resolved)
duplicates: TIKA-1440 Auto-Paragraph numbers not extracted from Word Document (Resolved)

People

Assignee: Tim Allison
Reporter: Filip Bednárik
Votes: 1
Watchers: 6

Dates

Created: 30/May/14 15:53
Updated: 23/Jul/15 17:31
Resolved: 23/Jul/15 17:31