[TIKA-203] Earlier metadata extraction in ParsingReader - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 0.3
Component/s: parser
Labels: None

Description

The normal parse() method guarantees that all extracted metadata will be available in the metadata object once the method returns. But since the ParsingReader class runs the parse() method in a background thread, one can only assume that extracted metadata is available once the entire character stream has been consumed. This is troublesome, for example, when creating Lucene Document objects, because Lucene postpones reading the given character stream until the already constructed Document is passed to an IndexWriter. The result is that, depending on thread scheduling and the structure of the input document format, metadata may not yet be available when the Document is constructed, and so cannot be included in the indexed Document.

One way of fixing this issue is to add a small character buffer in ParsingReader, and to make sure that the buffer is filled with extracted text before the ParsingReader constructor returns. This should ensure that relevant document metadata is almost always available, since the majority of document formats have all or most metadata at the beginning of the document stream.
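
The buffering idea above could be sketched roughly as follows. This is a minimal illustration and not the actual ParsingReader code: `PrefilledReaderSketch`, `PrefilledReader` and `SketchParser` are hypothetical names, and the sketch assumes the parser records its metadata before it emits the first character of text. Only standard `java.io` classes are used; the constructor blocks on a one-character `mark`/`read`/`reset` so that some text (or end of stream) has arrived before it returns.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.PipedReader;
import java.io.PipedWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class PrefilledReaderSketch {

    /** Hypothetical stand-in for a parser: fills in metadata and writes extracted text. */
    interface SketchParser {
        void parse(Map<String, String> metadata, Writer text) throws IOException;
    }

    /** Reader whose constructor waits until the first character (or EOF) is buffered. */
    static class PrefilledReader extends Reader {
        private final BufferedReader reader;

        PrefilledReader(SketchParser parser, Map<String, String> metadata) throws IOException {
            PipedReader pipe = new PipedReader();
            PipedWriter writer = new PipedWriter(pipe);
            this.reader = new BufferedReader(pipe);

            // Run the parser in a background thread, as described in the issue.
            Thread worker = new Thread(() -> {
                try (Writer out = writer) {
                    parser.parse(metadata, out);
                } catch (IOException e) {
                    // The consumer closed the pipe early; ignored in this sketch.
                }
            });
            worker.setDaemon(true);
            worker.start();

            // Block until the parser has produced some text or finished,
            // then rewind so the caller still sees the whole stream.
            reader.mark(1);
            reader.read();
            reader.reset();
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            return reader.read(cbuf, off, len);
        }

        @Override
        public void close() throws IOException {
            reader.close();
        }
    }

    static String readAll(Reader in) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[256];
        int n;
        while ((n = in.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    /** Reads the metadata right after construction, before consuming any text. */
    static String demo() {
        Map<String, String> metadata = new ConcurrentHashMap<>();
        SketchParser parser = (md, out) -> {
            md.put("title", "Example"); // metadata comes before the text
            out.write("body text");
        };
        try (PrefilledReader reader = new PrefilledReader(parser, metadata)) {
            String title = metadata.get("title");
            return "title=" + title + "; text=" + readAll(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the title is readable as soon as the constructor returns only because the toy parser sets it before writing any text. A format that stores metadata after a large amount of text would still hit the original problem, which is why the description says "almost always".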

Attachments

lipsum.doc (53 kB), attached 17/Jul/09 13:17 by Daan de Wit

Issue Links

relates to: TIKA-262 ParsingReader does not parse metadata for larger MS Office documents (Closed)

People

Assignee: Jukka Zitting
Reporter: Jukka Zitting
Votes: 0
Watchers: 0

Dates

Created: 11/Feb/09 16:14
Updated: 17/Jul/09 13:24
Resolved: 13/Feb/09 23:09