Issue Details (XML | Word | Printable)

Key: JCR-415
Type: Improvement Improvement
Status: Closed Closed
Resolution: Fixed
Priority: Minor Minor
Assignee: Unassigned
Reporter: Marcel Reutegger
Votes: 0
Watchers: 0
Operations

If you were logged in you would be able to see more operations.
Jackrabbit Content Repository

Enhance indexing of binary content

Created: 27/Apr/06 04:14 PM   Updated: 25/Apr/07 08:45 AM
Component/s: indexing
Affects Version/s: 0.9, 1.0, 1.0.1
Fix Version/s: 1.3

Time Tracking:
Not Specified

File Attachments:
  Size
Text File Licensed for inclusion in ASF works jackrabbit-extractor-r420472.patch 2006-07-10 06:50 PM Jukka Zitting 46 kB
Text File Licensed for inclusion in ASF works jackrabbit-query-r420472.patch 2006-07-10 06:55 PM Jukka Zitting 36 kB
Text File Licensed for inclusion in ASF works jackrabbit-query-r421461.patch 2006-07-14 05:02 AM Jukka Zitting 40 kB
Image Attachments:

1. org.apache.jackrabbit.core.query-extractor.jpg
(23 kB)

2. org.apache.jackrabbit.core.query.lucene-extractor.jpg
(64 kB)

3. org.apache.jackrabbit.extractor.jpg
(46 kB)

Resolution Date: 20/Dec/06 03:28 PM


 Description  « Hide
Indexing of binary content should be enhanced in order to allow either configuration what fields are indexed or provide better support for custom NodeIndexer implementations.

The current design has a couple of flaws that should be addressed at the same time:
- Reader instances are requested from the text filters even though the reader might never be used
- only jcr:data properties of nt:resource nodes are fulltext indexed
- It is up to the text filter implementation to decide the lucene field name for the text representation, responsibility should be moved to the NodeIndexer. A text filter should only provide a Reader instance.

With those changes a custom NodeIndexer can then decide if a binary property has one or more representations in the index.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Jukka Zitting added a comment - 01/Jul/06 02:37 PM

Jukka Zitting added a comment - 10/Jul/06 06:50 PM
Attached a proposal patch containing a mostly complete implementation of the TextExtractor idea I discussed briefly on the mailing list. This covers just a part of this issue, but should simplify further work considerably.

The attached patch (jackrabbit-extractor-r420472.patch) contains just the TextExtractor interface and related classes placed in org.apache.jackrabbit.extractor. I chose to place them outside of o.a.j.core as they have no Jackrabbit dependencies, and would probably make a good contribution to Apache Lucene once battle-tested.

I'll continue with a separate patch that backwards-compatibly replaces the current TextFilter usage in o.a.j.core.query[.lucene], and with some class diagrams that give a quick overview before diving into the javadocs.

Jukka Zitting added a comment - 10/Jul/06 06:55 PM
Attached patch (jackrabbit-query-r420472.patch) to integrate the TextExtractor proposal with Jackrabbit core.

Marcel Reutegger added a comment - 11/Jul/06 03:18 PM
Looks very good.

Some minor suggestions:

- NodeIndexer.addBinaryValue() is protected to allow subclasses to override it but it uses the private method getValue(). Thus getValue() should be protected final in order to be usable for a subclass.

- Extracting text should be deferred to the time when the lucene Document acutally requests character from Reader that is assigned to a Field. See http://issues.apache.org/jira/browse/JCR-264.

Jukka Zitting added a comment - 11/Jul/06 04:23 PM
Marcel:
> NodeIndexer.addBinaryValue() is protected to allow subclasses to override it but it uses the private
> method getValue(). Thus getValue() should be protected final in order to be usable for a subclass.

OK.

> Extracting text should be deferred to the time when the lucene Document acutally requests character
> from Reader that is assigned to a Field. See http://issues.apache.org/jira/browse/JCR-264.

I think it would make more design sense to try to postpone the creation of the Document instances instead of delaying text extraction. But I'm not too familiar with the details, so I'm OK with adding lazy reading to the mix. In any case I think it's best to layer the lazy reading on top of the TextExtractor interface instead of below it. A utility class like the following could achieve this as long as the given InputStream remains valid until the document has been read.

    class TextExtractorReader extends Reader {

        private final TextExtractor extractor;
        private final InputStream stream;
        private final String type;
        private final String encoding;

        private Reader reader;

        public TextExtractorReader(
                TextExtractor extractor, InputStream stream,
                String type, String encoding) {
            this.extractor = extractor;
            this.stream = stream;
            this.type = type;
            this.encoding = encoding;
            this.reader = null;
        }

        public int read(char[] buffer, int offset, int length) throws IOException {
            if (reader == null) {
                reader = extractor.extractText(stream, type, encoding);
            }
            return reader.read(buffer, offset, length);
        }

        public void close() throws IOException {
            if (reader != null) {
                reader.close();
            } else {
                stream.close();
            }
        }

    }

I can update the query patch accordingly.

Marcel Reutegger added a comment - 11/Jul/06 07:24 PM
Jukka wrote:
> I think it would make more design sense to try to postpone the creation of the Document instances
> instead of delaying text extraction. But I'm not too familiar with the details, so I'm OK with adding lazy
> reading to the mix. In any case I think it's best to layer the lazy reading on top of the TextExtractor interface
> instead of below it. A utility class like the following could achieve this as long as the given InputStream
> remains valid until the document has been read.

Yes, you are right. I thought I could get away with the dirty solution ;)
While going through your patch I was actually also thinking about a design that should create the document
only when it is really added to the index.
For now we can maybe use the TextExtractorReader you proposed and then in a next step change the design
to create the Document in a later stage of the indexing process.

Jukka Zitting added a comment - 14/Jul/06 05:02 AM
> Yes, you are right. I thought I could get away with the dirty solution ;)

Well, as they say, do the simplest thing that could possibly work!

> For now we can maybe use the TextExtractorReader you proposed and then in a next step
> change the design to create the Document in a later stage of the indexing process.

Sounds good. Attached a revised patch (jackrabbit-query-r421461.patch) that uses the TextExtractorReader solution.

Marcel Reutegger added a comment - 18/Dec/06 03:43 PM
I would like to get this change into the next major release (1.3) and propose the following changes:

- Create a new module jackrabbit-text-extractors which will initially contain the jackrabbit-extractor patch provided by Jukka
- Migrate the jackrabbit-text-filters into the new extractors module
- Add jackrabbit-text-filters as dependency to jackrabbit-core
- Remove the jackrabbit-text-filters module and do not create releases anymore for this module. Jackrabbit would still support existing releases of jackrabbit-text-filters but the interface TextFilter will be deprecated (see Jukkas' patch) and developers are encouraged to use the new TextExtractor interface.

Does this make sense?

Marcel Reutegger added a comment - 20/Dec/06 03:28 PM
In addition to the previously mentioned steps I also implemented the deferred text extraction as discussed with Jukka in this jira issue. The helper class TextExtractorReader has been removed again.

Fixed in revision: 489112

If noone objects within a week I will remove the now obsolete module jackrabbit-index-filters.