Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Trivial Trivial
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      In our application, we have some kind of generic API that is handling how we are using Lucene. The different other applications are using this API with different semantics, and are using the Lucene fields quite differently. We wrote some usefull functions to do this mapping. Today, as the Document class cannot be overriden, we are obliged to make a document wrapper by application, ie some MyAppDocument and MyOtherAppDocument which have a property holding a real Lucene Document. Then, when MyApp or MyOtherApp want to use our generic lucene API, we have to "get out" the Lucene document, ie do some genericLuceneAPI.writeDoc(myAppDoc.getLuceneDocument()). This work fine, but it becomes quite tricky to use the other function of our generic API which is genericLuceneAPI.writeDocs(Collection<Document> docs).

      I don't know the rational behind making final Document, but removing it will allow more object-oriented code.

        Activity

        Hide
        Erick Erickson added a comment -

        SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.

        Show
        Erick Erickson added a comment - SPRING_CLEANING_2013 We can reopen if necessary. Think this code has been extensively re-worked anyway.
        Hide
        Yonik Seeley added a comment -

        > getStoredFields() methods that return a Map<String, Field>

        Or a Map<String,Field[]> or Map<String,Object> where Object could be Field or Field[]
        Unless it was a LinkedHashMap, that would also lose the ordering of different fields.

        Document is lightweight now...pretty much just a List<Fieldable> that the user can get directly at.
        If anything, a factory for Fieldable (which for Field, isn't as lightweight) would be more useful than a Document factory.

        Show
        Yonik Seeley added a comment - > getStoredFields() methods that return a Map<String, Field> Or a Map<String,Field[]> or Map<String,Object> where Object could be Field or Field[] Unless it was a LinkedHashMap, that would also lose the ordering of different fields. Document is lightweight now...pretty much just a List<Fieldable> that the user can get directly at. If anything, a factory for Fieldable (which for Field, isn't as lightweight) would be more useful than a Document factory.
        Hide
        Michael Busch added a comment -

        Maybe we should deprecate the document() methods and add
        getStoredFields() methods that return a Map<String, Field>?
        Then we could also definalize Document. Thoughts?

        Show
        Michael Busch added a comment - Maybe we should deprecate the document() methods and add getStoredFields() methods that return a Map<String, Field>? Then we could also definalize Document. Thoughts?
        Hide
        Nicolas Lalevée added a comment -

        Rethinking about the function "public void document(int n, Document doc)", in fact it will completely break the work I have done for LUCENE-662.

        And finally, I agree with you Hoss. Two different interfaces, and let the user implement the document he wants. As a first step, the user will decorate his document. And in a second time, Lucene could provide the user the possibility to have his own DocumentFactory.

        Show
        Nicolas Lalevée added a comment - Rethinking about the function "public void document(int n, Document doc)", in fact it will completely break the work I have done for LUCENE-662 . And finally, I agree with you Hoss. Two different interfaces, and let the user implement the document he wants. As a first step, the user will decorate his document. And in a second time, Lucene could provide the user the possibility to have his own DocumentFactory.
        Hide
        Hoss Man added a comment -

        Grant: I have no serious objection to making Document non-final ... i was just pointing out that as long as we are mucking with the Document related APIs, this may be a good time to solve the perception issue people have with what a "Document" is ... that single container class gets used in two very different ways, and clarifying what those ways are with seperate Interfaces would be useful. from a more parcticle standpoint, if i want to subclass Document and add custom behavior, what methods impact the way a document gets indexed? what methods impact the behavior of returned data? ... the answers to thouse questions aren't obvious to most novice users as is – but we tend to punt on them since no one can "replace" a Document instance at the moment ... we'd need to be more clear about that if subclassing is allowed and DocumentFactories or passing empty Beans to IndexReader.document start being allowed.

        regarding your specific question about...
        public void document(int n, Document doc)

        ...i'm leary of that approach just because of all the places people might want it ... with FieldSelector or without; from a Searcher, or an IndexReader, or a Hits instance, etc...

        I'd be more in favor of adding a "setDocumentFactory" method on IndexReader or something like that if we went that route ... but as i said, if Document (or whatever interface/abstract class we make IndexReader.document return) isn't final, then it's not clera to me why a Decorator model wouldn't be good enough.

        Show
        Hoss Man added a comment - Grant: I have no serious objection to making Document non-final ... i was just pointing out that as long as we are mucking with the Document related APIs, this may be a good time to solve the perception issue people have with what a "Document" is ... that single container class gets used in two very different ways, and clarifying what those ways are with seperate Interfaces would be useful. from a more parcticle standpoint, if i want to subclass Document and add custom behavior, what methods impact the way a document gets indexed? what methods impact the behavior of returned data? ... the answers to thouse questions aren't obvious to most novice users as is – but we tend to punt on them since no one can "replace" a Document instance at the moment ... we'd need to be more clear about that if subclassing is allowed and DocumentFactories or passing empty Beans to IndexReader.document start being allowed. regarding your specific question about... public void document(int n, Document doc) ...i'm leary of that approach just because of all the places people might want it ... with FieldSelector or without; from a Searcher, or an IndexReader, or a Hits instance, etc... I'd be more in favor of adding a "setDocumentFactory" method on IndexReader or something like that if we went that route ... but as i said, if Document (or whatever interface/abstract class we make IndexReader.document return) isn't final, then it's not clera to me why a Decorator model wouldn't be good enough.
        Hide
        Nicolas Lalevée added a comment -

        Hoss> OK I got it.
        In fact my concern was about "semantic". I agree with you that the API make the Lucene user think that every index data are retrieved while doing reader.doc(int i) (as I thought the first days using Lucene). But here you are proposing to completely separate the indexing from the retrieving, saying that they are not compatible with each other. I think this is wrong because basically you can retrieve a document and repushed the same in the index, even if it has no sense. But this was pure semantic concerns.
        Now looking at the implementation you are proposing of a "YourDocumentWraper", we can make it work correctly without any performance issue. So I won't make a war is a such design is implemented

        Grant>In fact the discussion derived from the original issue. BTW, this would be nice !

        Show
        Nicolas Lalevée added a comment - Hoss> OK I got it. In fact my concern was about "semantic". I agree with you that the API make the Lucene user think that every index data are retrieved while doing reader.doc(int i) (as I thought the first days using Lucene). But here you are proposing to completely separate the indexing from the retrieving, saying that they are not compatible with each other. I think this is wrong because basically you can retrieve a document and repushed the same in the index, even if it has no sense. But this was pure semantic concerns. Now looking at the implementation you are proposing of a "YourDocumentWraper", we can make it work correctly without any performance issue. So I won't make a war is a such design is implemented Grant>In fact the discussion derived from the original issue. BTW, this would be nice !
        Hide
        Grant Ingersoll added a comment -

        I feel like I'm missing something in this discussion, but couldn't we just make Document non-final and add:

        public void document(int n, Document doc)

        or even mark it as expert and call it populateDocument(int n, Document doc) or something like that

        The semantics of which are to add the fields for document n to the Document object doc.

        From the looks of the code, most of the Readers first thing in the document call is: Document result = new Document() so it is not like we are doing some complicated construction that is optimized for the different types of readers.

        Show
        Grant Ingersoll added a comment - I feel like I'm missing something in this discussion, but couldn't we just make Document non-final and add: public void document(int n, Document doc) or even mark it as expert and call it populateDocument(int n, Document doc) or something like that The semantics of which are to add the fields for document n to the Document object doc. From the looks of the code, most of the Readers first thing in the document call is: Document result = new Document() so it is not like we are doing some complicated construction that is optimized for the different types of readers.
        Hide
        Hoss Man added a comment -

        > 1) it does matter from the user point of view : we cannot do anymore
        > writer.addDocument(reader.doc(10)).

        ...and i argue that is precisely what we want ... it's not generally safe for people do do that now, but becuase it is so easy, they think they can. If we change the API so that doc(int) doesn't return the same class that addDocument takes we make go out of their way a little bit and need to write something like...

        writer.addDocument(new ReIndexedDocument(reader.doc(10)));

        ...where the RedIndexedDocument class can have some good javadocs explaining what it is, when to use it, and most importantly: when it doesn't work.

        > 2) Effectively I can implement a DecoratedDocument. But I cannot make Lucene
        > instanciating my own document, the

        ...i'm still not clear why you feel you need lucene to instantiate your specific class ... why wouldn't a decorator class work for the use case you describe?

        public class YourDocumentWraper implements IndexableDocument {
        public YourDocumentWraper(ReturnableDocument r, Object specialStuff)

        { ... }

        public Fieldable getFieldable(String f)

        { // ...check if you want special stuff, if not... return r.getFieldable(f); }

        }

        Show
        Hoss Man added a comment - > 1) it does matter from the user point of view : we cannot do anymore > writer.addDocument(reader.doc(10)). ...and i argue that is precisely what we want ... it's not generally safe for people do do that now, but becuase it is so easy, they think they can. If we change the API so that doc(int) doesn't return the same class that addDocument takes we make go out of their way a little bit and need to write something like... writer.addDocument(new ReIndexedDocument(reader.doc(10))); ...where the RedIndexedDocument class can have some good javadocs explaining what it is, when to use it, and most importantly: when it doesn't work. > 2) Effectively I can implement a DecoratedDocument. But I cannot make Lucene > instanciating my own document, the ...i'm still not clear why you feel you need lucene to instantiate your specific class ... why wouldn't a decorator class work for the use case you describe? public class YourDocumentWraper implements IndexableDocument { public YourDocumentWraper(ReturnableDocument r, Object specialStuff) { ... } public Fieldable getFieldable(String f) { // ...check if you want special stuff, if not... return r.getFieldable(f); } }
        Hide
        Nicolas Lalevée added a comment -

        Note that I was talking about the future API, with some deprecated functions removed. So the API will look like :
        class IndexReader

        { ReturnableDocument doc(int n); }

        class IndexWriter

        { void addDocument(IndexableDocument doc); }

        1) it does matter from the user point of view : we cannot do anymore writer.addDocument(reader.doc(10)).

        2) Effectively I can implement a DecoratedDocument. But I cannot make Lucene instanciating my own document, the reader will still return some ReturnableDocument. Unless you want to allow the user to customize in Lucene the instanciation of Document by providing a factory of Document ?

        Show
        Nicolas Lalevée added a comment - Note that I was talking about the future API, with some deprecated functions removed. So the API will look like : class IndexReader { ReturnableDocument doc(int n); } class IndexWriter { void addDocument(IndexableDocument doc); } 1) it does matter from the user point of view : we cannot do anymore writer.addDocument(reader.doc(10)). 2) Effectively I can implement a DecoratedDocument. But I cannot make Lucene instanciating my own document, the reader will still return some ReturnableDocument. Unless you want to allow the user to customize in Lucene the instanciation of Document by providing a factory of Document ?
        Hide
        Hoss Man added a comment -

        1) from a design standpoint, as long as IndexReader/IndexSearcher deal with documents using a different interface name i nthe signature then IndexWriter, it doens't really matter what name is used – what i suggested tackles a second point: keeping the existing interface backwards compatible for the forseeable future while migrating to this modified API.

        2) if you are already decorating the docs you get back from your "master index" before adding them to your specialized indexes, then nothing i described would prevent that from continuing to work ... the bulk of your code could be example the same, you would just need a simple DecoratedDocument which impliments IndexableDocument and wraps a ReturnableDocument.

        something like that might even be a common enough use case to inlcude in the core ... anyone with a legitimate use cases for doing round trips could use it as well.

        Show
        Hoss Man added a comment - 1) from a design standpoint, as long as IndexReader/IndexSearcher deal with documents using a different interface name i nthe signature then IndexWriter, it doens't really matter what name is used – what i suggested tackles a second point: keeping the existing interface backwards compatible for the forseeable future while migrating to this modified API. 2) if you are already decorating the docs you get back from your "master index" before adding them to your specialized indexes, then nothing i described would prevent that from continuing to work ... the bulk of your code could be example the same, you would just need a simple DecoratedDocument which impliments IndexableDocument and wraps a ReturnableDocument. something like that might even be a common enough use case to inlcude in the core ... anyone with a legitimate use cases for doing round trips could use it as well.
        Hide
        Nicolas Lalevée added a comment -

        Marker interface is a nice idea, but I think this will make Document handling more painfull. In my use case this will not be optimal.
        In our application, we have a kind of workflow of Document. We have a big/master index which is holding every data on the system, and then we have specialized index which a part of the big one. The big one is for making big global queries on the whole data. The specialized are specialized for the end application. So the workflow is :

        • from the raw data, make them as Document and index it in the master index
        • for each specialized index :
          • do the specific query on the master index
          • from the retreived document, redecorate it with specialized indexed field
          • index the decorated documents in the specialized index

        Here I just have to decorate the Document retrieved form the master index. With incompatible interfaces, this won't be possible anymore. I will have to reinstanciate a Document each time and repopulate it.
        So why not keep IndexWriter#addDocument(Document), and just change IndexReader#doc(int) to make it return a kind of DocumentWithOnlyStoredData, with DocumentWithOnlyStoredData extends Document. (the proposed named is horrible, I know !)

        Show
        Nicolas Lalevée added a comment - Marker interface is a nice idea, but I think this will make Document handling more painfull. In my use case this will not be optimal. In our application, we have a kind of workflow of Document. We have a big/master index which is holding every data on the system, and then we have specialized index which a part of the big one. The big one is for making big global queries on the whole data. The specialized are specialized for the end application. So the workflow is : from the raw data, make them as Document and index it in the master index for each specialized index : do the specific query on the master index from the retreived document, redecorate it with specialized indexed field index the decorated documents in the specialized index Here I just have to decorate the Document retrieved form the master index. With incompatible interfaces, this won't be possible anymore. I will have to reinstanciate a Document each time and repopulate it. So why not keep IndexWriter#addDocument(Document), and just change IndexReader#doc(int) to make it return a kind of DocumentWithOnlyStoredData, with DocumentWithOnlyStoredData extends Document. (the proposed named is horrible, I know !)
        Hide
        Hoss Man added a comment -

        From email...

        http://www.nabble.com/-jira--Created%3A-%28LUCENE-778%29-Allow-overriding-a-Document-tf3026011.html

        : A simple solution might be a 'classname' setup for the Document
        : creation - like the default Directory implementation uses. As long as
        : the subclass has a no-arg ctor it is trivial.

        a differnet tack on the topic: there is really no good reason why the
        "Document" class used for indexing data should be the same as the
        "Document" classs ued for returning results ... using the same class in
        this way results in all sort of confusio abotu which methods can be called
        in which context, and frequently leads people to assume they can do safe
        "round trips" of their Documents ... doing a search, modifying a field
        value, and then re-inexing it – not considering what happens to
        non-STOREd fields or field/document boosts.

        any work done to change the Document API to make it easier to subclass
        should probably start with a seperation of these too completley different
        concepts.

        One approach off the top of my head: make an IndexableDocument interface
        for clients to pass to IndexWriter and a "ReturnableDocument" class for
        IndexReader/IndexSearcher to return ... the existing Document class can
        subclass ReturnableDocument and impliment IndexableDocument, the existing
        methods with Document in their sig would be deprecated and replaced with
        methods using one of these new class names

        ...some followup comments can be found in the thread archive.

        Show
        Hoss Man added a comment - From email... http://www.nabble.com/-jira--Created%3A-%28LUCENE-778%29-Allow-overriding-a-Document-tf3026011.html : A simple solution might be a 'classname' setup for the Document : creation - like the default Directory implementation uses. As long as : the subclass has a no-arg ctor it is trivial. a differnet tack on the topic: there is really no good reason why the "Document" class used for indexing data should be the same as the "Document" classs ued for returning results ... using the same class in this way results in all sort of confusio abotu which methods can be called in which context, and frequently leads people to assume they can do safe "round trips" of their Documents ... doing a search, modifying a field value, and then re-inexing it – not considering what happens to non-STOREd fields or field/document boosts. any work done to change the Document API to make it easier to subclass should probably start with a seperation of these too completley different concepts. One approach off the top of my head: make an IndexableDocument interface for clients to pass to IndexWriter and a "ReturnableDocument" class for IndexReader/IndexSearcher to return ... the existing Document class can subclass ReturnableDocument and impliment IndexableDocument, the existing methods with Document in their sig would be deprecated and replaced with methods using one of these new class names ...some followup comments can be found in the thread archive.
        Hide
        Karl Wettin added a comment -

        Doug Cutting [17/Jan/07 09:24 AM]
        > If we decide to make Document non-final, then we must document
        > very clearly the ways in which it may be subclassed, in particular
        > that instances returned from IndexReader.document() will not be
        >of the subclass.

        +1 for a more polymorphic Document.

        In fact, I would like to introduce interfaces for all "low level" classes. reader, writer, modifier, term, document, et c. And I did that that I did in Lucene-550. I've had to definalized a lot of these classes and their methods to make Lucene compliant with a number of formula 1-A patterns.

        Show
        Karl Wettin added a comment - Doug Cutting [17/Jan/07 09:24 AM] > If we decide to make Document non-final, then we must document > very clearly the ways in which it may be subclassed, in particular > that instances returned from IndexReader.document() will not be >of the subclass. +1 for a more polymorphic Document. In fact, I would like to introduce interfaces for all "low level" classes. reader, writer, modifier, term, document, et c. And I did that that I did in Lucene-550. I've had to definalized a lot of these classes and their methods to make Lucene compliant with a number of formula 1-A patterns.
        Hide
        Doug Cutting added a comment -

        If we decide to make Document non-final, then we must document very clearly the ways in which it may be subclassed, in particular that instances returned from IndexReader.document() will not be of the subclass.

        > Should we move this discussion to lucene-dev ?

        Since Jira comments are sent to lucene-dev, I see no reason. The mailing list may be more appropriate for discussing general, process-related issues, but for specific technical issues, Jira is usually better.

        Show
        Doug Cutting added a comment - If we decide to make Document non-final, then we must document very clearly the ways in which it may be subclassed, in particular that instances returned from IndexReader.document() will not be of the subclass. > Should we move this discussion to lucene-dev ? Since Jira comments are sent to lucene-dev, I see no reason. The mailing list may be more appropriate for discussing general, process-related issues, but for specific technical issues, Jira is usually better.
        Hide
        Nicolas Lalevée added a comment -

        Just after committing the jira issue, I just figured out that I haven't searched for topics about it. Sorry.
        BTW, thanks for the pointers.

        But what my request here is not making Lucene providing customized documents, like in LUCENE-662, it is about allowing passing to writer.addDocument() a document that is extending Document, nothing more.

        For instance, I would like to do something like that :

        public RDFDocument extends Document {
        public RDFDocument(String uri)

        { add(new Field("uri", uri); }

        public void addStatement(String prop, String value)

        { add(new Field(prop, value)); }

        }

        Should we move this discussion to lucene-dev ?

        Show
        Nicolas Lalevée added a comment - Just after committing the jira issue, I just figured out that I haven't searched for topics about it. Sorry. BTW, thanks for the pointers. But what my request here is not making Lucene providing customized documents, like in LUCENE-662 , it is about allowing passing to writer.addDocument() a document that is extending Document, nothing more. For instance, I would like to do something like that : public RDFDocument extends Document { public RDFDocument(String uri) { add(new Field("uri", uri); } public void addStatement(String prop, String value) { add(new Field(prop, value)); } } Should we move this discussion to lucene-dev ?
        Hide
        Michael McCandless added a comment -

        One reason is because "Lucene controls the construction of Documents from Hits":

        http://www.gossamer-threads.com/lists/lucene/java-dev/31447

        This issue periodically comes up, eg:

        http://www.gossamer-threads.com/lists/lucene/java-dev/32733

        And there was at least this [closed, unresolved] issue from the past:

        http://issues.apache.org/jira/browse/LUCENE-358

        Show
        Michael McCandless added a comment - One reason is because "Lucene controls the construction of Documents from Hits": http://www.gossamer-threads.com/lists/lucene/java-dev/31447 This issue periodically comes up, eg: http://www.gossamer-threads.com/lists/lucene/java-dev/32733 And there was at least this [closed, unresolved] issue from the past: http://issues.apache.org/jira/browse/LUCENE-358

          People

          • Assignee:
            Unassigned
            Reporter:
            Nicolas Lalevée
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development