[JCR-2219] Improved background text extraction - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.0-alpha7
Component/s: indexing, jackrabbit-core
Labels: None

Description

As recently discussed on the mailing list (see http://markmail.org/message/syt7lc2guzapt7la), the current approach to text extraction in background threads does not work well, especially with the Tika-based extractors that support streamed parsing of many document types.

Also, all of the extracted text streams are currently buffered into Strings before being passed into the Lucene index. It would be good if we could somehow get back to passing just Readers to Lucene.
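To illustrate the difference between the two styles mentioned above, here is a minimal sketch in plain Java. It is not taken from the attached patch and does not use the Jackrabbit, Tika, or Lucene APIs; the class `ExtractionSketch`, the interface `StreamingIndexer`, and both methods are hypothetical names chosen for illustration. The first method drains a Reader into a String before anything else can happen, while the second hands the Reader to a consumer that processes it chunk by chunk.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

// Illustrative only: none of these names come from Jackrabbit, Tika, or Lucene.
class ExtractionSketch {

    /** Hypothetical stand-in for an indexer that consumes text as it is read. */
    interface StreamingIndexer {
        void index(Reader text);
    }

    /** Buffered style: drain the whole extracted text into one String first. */
    static String bufferToString(Reader text) {
        StringBuilder buffer = new StringBuilder();
        char[] chunk = new char[4096];
        try {
            int n;
            while ((n = text.read(chunk)) != -1) {
                buffer.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }

    /** Streaming style: hand the Reader over; only one chunk is held at a time. */
    static long countCharsStreaming(Reader text) {
        long[] total = {0};
        StreamingIndexer indexer = reader -> {
            char[] chunk = new char[4096];
            try {
                int n;
                while ((n = reader.read(chunk)) != -1) {
                    total[0] += n; // a real indexer would tokenize the chunk here
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        indexer.index(text);
        return total[0];
    }
}
```

In the buffered style, memory use grows with the size of the extracted text; in the streaming style it is bounded by the chunk size, which is presumably the motivation for going back to Readers.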

Attachments

JCR-2219.patch (24 kB), attached by Jukka Zitting on 16/Jul/09 15:10

People

Assignee: Unassigned

Reporter: Jukka Zitting

Votes: 0

Watchers: 0

Dates

Created: 16/Jul/09 14:52

Updated: 13/Aug/09 15:01

Resolved: 31/Jul/09 13:46