[ANY23-280] Refactor ContentExtractor to improve extraction flexibility - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.1
Fix Version/s: 2.2
Component/s: core, extractors
Labels: None

Description

As discussed on ~~ANY23-247~~, the ContentExtractor is not fit for purpose. Since this problem was discovered, its cause has plagued our builds. Any extractor that implements BaseRDFExtractor is based on Extractor.ContentExtractor and hence works off an 'unfixed' raw data stream, as opposed to a more flexible model such as the TagSoupDOMExtractor.
This issue should refactor the RDF extractors to allow more flexibility and to avoid the problems we encounter with the strict SAX parsing logic.

Issue Links

is superseded by

ANY23-318 ExtractionException handling in BaseRDFExtractor.java kills entire extraction

Resolved

links to

GitHub Pull Request #24

People

Assignee: Lewis John McGibbney

Reporter: Lewis John McGibbney

Votes: 1

Watchers: 3

Dates

Created: 02/Apr/16 20:01

Updated: 07/Apr/22 20:19

Resolved: 27/Dec/17 20:10