Solr
  1. Solr
  2. SOLR-193

General SolrDocument interface to manage field values.

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: None
    • Labels:
      None

      Description

      In an effort to make SOLR-139 (the "modify" command) more manageable, i extracted out a large chunk. This patch adds a general SolrDocument interface and includes a concrete implementation (SimpleSolrDoc)

      SOLR-139 needs some way to transport document values independent of the lucene Document. This is required for the INCREMENT command and useful for modifying documents. SolrDocument is also generally useful for SOLR-20

      • - - - - -

      The one (potentially) controversial part is that I added a function to FieldType:

      public Object toExternalValue(Fieldable f);

      This asks each field type to convert its Fieldable into its real type, for example IntField.java has:

      public Integer toExternalValue(Fieldable f)

      { return Integer.valueOf( toExternal(f) ); }

      By default, it returns a string value. If this addition is too much, there are other (less clean) ways to handle the INCREMENT command. My real motivation for this addition is that it makes it possible to implement an embeddable SOLR-20 client that does not need an HTTP connection.

      • - - -

      The SimpleSolrDoc implementation was written for SOLR-20. It needs to play nice with EL, so it implements a few extra map function that may not seem necessary:

      $

      {doc.values['name']]}

      gets a collection
      $

      {doc.valueMap['name']]}

      gets a single value for the field

      • - - -

      The tests cover all "toExternalValue" changes in schema.*
      SimpleSolrDoc and DocumentBuilder have 100% test coverage.

      1. SOLR-193-SimpleSolrDocument.patch
        12 kB
        Ryan McKinley
      2. SOLR-193-SimpleSolrDocument.patch
        14 kB
        Ryan McKinley
      3. SOLR-193-SolrDocument.patch
        40 kB
        Ryan McKinley
      4. SOLR-193-SolrDocument.patch
        40 kB
        Ryan McKinley
      5. SOLR-193-SolrDocument.patch
        40 kB
        Ryan McKinley
      6. SOLR-193-SolrDocument.patch
        40 kB
        Ryan McKinley
      7. SOLR-193-SolrDocument.patch
        40 kB
        Ryan McKinley
      8. SOLR-193-SolrDocument.patch
        38 kB
        Ryan McKinley

        Issue Links

          Activity

          Hide
          Ryan McKinley added a comment -

          One other note:
          the "id" field in the test schema.xml was a multiValued field. I changed it to single valued.

          Show
          Ryan McKinley added a comment - One other note: the "id" field in the test schema.xml was a multiValued field. I changed it to single valued.
          Hide
          Yonik Seeley added a comment -

          > toExternalValue(Fieldable f)

          "external" value sort of already means the human-readable serialized textual representation.

          What about something like toObject() which would return the appropriate Java object (such as Integer, Long, Date, etc)?

          Show
          Yonik Seeley added a comment - > toExternalValue(Fieldable f) "external" value sort of already means the human-readable serialized textual representation. What about something like toObject() which would return the appropriate Java object (such as Integer, Long, Date, etc)?
          Hide
          Ryan McKinley added a comment -

          yes:
          Object toObject( Fieldable );
          is better.

          Show
          Ryan McKinley added a comment - yes: Object toObject( Fieldable ); is better.
          Hide
          Erik Hatcher added a comment -

          The latest patch has a typo:

          public Document bulid....

          should be "build".

          Show
          Erik Hatcher added a comment - The latest patch has a typo: public Document bulid.... should be "build".
          Hide
          Ryan McKinley added a comment -

          Damn dyslexia! I can hardly see the difference even when its pointed out.

          thanks eric.

          Show
          Ryan McKinley added a comment - Damn dyslexia! I can hardly see the difference even when its pointed out. thanks eric.
          Hide
          Ryan McKinley added a comment -

          applies cleanly with trunk

          Show
          Ryan McKinley added a comment - applies cleanly with trunk
          Hide
          Ryan McKinley added a comment -

          applies with trunk, fixed some spelling...

          Show
          Ryan McKinley added a comment - applies with trunk, fixed some spelling...
          Hide
          Hoss Man added a comment -

          i'm not sure that i understand a lot of what's going on here ... the basic API for SolrDocument makes sense to me, but i'm not sure that i understand some of the methods in SimpleSolrDoc ... what is setDistinctByDefault, or setDistinctOrderMatters ?

          Also, what is the purpose/use of DocumentBuilder.build and DocumentBuilder.loadStoredFields ... neither seems to be used anywhere ... if they are not intended for use by existing clients of DocumentBuilder, but new client code not year written that won't care about any of the existing stateful methods in DocumentBuilder, perhaps they (the two new methods) should live in a separate class?

          The spirit of DocumentBuilder.build makes sense to me in the context of the issue title – but loadStoredFields on the other hand really doesn't make sense to me at all...
          1) DocumentBuilder is only involved in in building Lucene Document objects to index ... so why have a method in it for converting from a Lucene Document object to something else?
          2) I thought the SolrDocument API was for incoming documents ... why a method for adding values to it from an existing Lucene Document, or special logic for looking at stored fields?
          3) if the goal is for SolrDocument to be general enough to handle pre-indexing or post-searching Document representation, then we should not attempt to model boosts in it ... those should only live in a subclass used for indexing purposes (Lucene made this mistake early on, and it's caused countless amounts of confusion to this date) ... the loadStoredFields seems to suffer from this confusion by trying to access the field boosts of a Lucene Document that (appears to be) the result of a search — they don't exist in those instances of Lucene Documents.

          If these methods are not intended for use by existing clients of DocumentBuilder, but new client code not year written that doesn't care about any of the existing stateful methods in DocumentBuilder, perhaps they (the two new methods) should live in a separate class?)

          Hmmm... rereading the issue summary and the comments about playing nice with EL i see the goal is for a generic representation both in a java client sending docs to and reading docs back from Solr, as well as internally within Solr (or embedded Solr contexts) ... I think it's a mistake to try and have one single Interface for all three. At the very least there should be a seperate API for the indexing side and the query side (because of the boost issue) which can be subclass/superclass relationships.

          I ranted about this in a related Lucene Jira issue (note also the email discussion linked to from one of my comments in that issue) ...

          https://issues.apache.org/jira/browse/LUCENE-778

          Show
          Hoss Man added a comment - i'm not sure that i understand a lot of what's going on here ... the basic API for SolrDocument makes sense to me, but i'm not sure that i understand some of the methods in SimpleSolrDoc ... what is setDistinctByDefault, or setDistinctOrderMatters ? Also, what is the purpose/use of DocumentBuilder.build and DocumentBuilder.loadStoredFields ... neither seems to be used anywhere ... if they are not intended for use by existing clients of DocumentBuilder, but new client code not year written that won't care about any of the existing stateful methods in DocumentBuilder, perhaps they (the two new methods) should live in a separate class? The spirit of DocumentBuilder.build makes sense to me in the context of the issue title – but loadStoredFields on the other hand really doesn't make sense to me at all... 1) DocumentBuilder is only involved in in building Lucene Document objects to index ... so why have a method in it for converting from a Lucene Document object to something else? 2) I thought the SolrDocument API was for incoming documents ... why a method for adding values to it from an existing Lucene Document, or special logic for looking at stored fields? 3) if the goal is for SolrDocument to be general enough to handle pre-indexing or post-searching Document representation, then we should not attempt to model boosts in it ... those should only live in a subclass used for indexing purposes (Lucene made this mistake early on, and it's caused countless amounts of confusion to this date) ... the loadStoredFields seems to suffer from this confusion by trying to access the field boosts of a Lucene Document that (appears to be) the result of a search — they don't exist in those instances of Lucene Documents. If these methods are not intended for use by existing clients of DocumentBuilder, but new client code not year written that doesn't care about any of the existing stateful methods in DocumentBuilder, perhaps they (the two new methods) should live in a separate class?) Hmmm... rereading the issue summary and the comments about playing nice with EL i see the goal is for a generic representation both in a java client sending docs to and reading docs back from Solr, as well as internally within Solr (or embedded Solr contexts) ... I think it's a mistake to try and have one single Interface for all three. At the very least there should be a seperate API for the indexing side and the query side (because of the boost issue) which can be subclass/superclass relationships. I ranted about this in a related Lucene Jira issue (note also the email discussion linked to from one of my comments in that issue) ... https://issues.apache.org/jira/browse/LUCENE-778
          Hide
          Ryan McKinley added a comment -

          For background. This class has functionality used for other issues including SOLR-104, SOLR-139. For a while i tried keeping the functionality in different patches, but it became too much of a nightmare to maintain. Perhaps it would be better to leave out the edge cases and just focus on the SolrDocument interface now...

          > what is setDistinctByDefault, or setDistinctOrderMatters ?
          >

          These options let you say if the field values should be backed by a Map<String> or a List<String>, the DistinctOrderMatters says if it should be Map<String> or LinkedHashMap<String>

          These were useful for SOLR-104 when you SQL join a table and may get duplicate rows, but only want the distinct values to keep fields.

          Now that you point it out, (and there is a good chance it will be in trunk soon) It would make more sense to implement these features as different subclasses of SimpleSolrDocument.

          > Also, what is the purpose/use of DocumentBuilder.build and DocumentBuilder.loadStoredFields

          This is for SOLR-139. to 'modify' a document, you load the existing Document - change it - then store it back.

          These two functions can happily live in a new class, and could be attached to SOLR-139.

          > 2) I thought the SolrDocument API was for incoming documents ...

          I hope it is also useful for modifying existing Documents and transforming incoming/outgoing documents (but I'll raise that issue later

          > I think it's a mistake to try and have one single Interface for all three. ... At the very least there should be a seperate API for the indexing side and the query side (because of the boost issue) which can be subclass/superclass relationships.
          >

          This sounds fine. We should defiantly solve any know problems with the Lucene document interface. Just using an interface (rather then a concrete class) will be a huge help.

          Is the only difference between the input Document and output Document that it has boosts?

          Should we have:
          SolrDocument
          + BoostedSolrDocument

          or

          SolrDocument
          + IndexSolrDocument

          Any thoughts on the common use case where I want to pull a document out of the index (no boosts) change it, then put it back? Do i need to make a new class and copy all the fields? Should SOLR-20 be able to index a SolrDocument (no boosts) as well as a BoostedSolrDocument? I think so...

          Thanks for looking at this!

          Show
          Ryan McKinley added a comment - For background. This class has functionality used for other issues including SOLR-104 , SOLR-139 . For a while i tried keeping the functionality in different patches, but it became too much of a nightmare to maintain. Perhaps it would be better to leave out the edge cases and just focus on the SolrDocument interface now... > what is setDistinctByDefault, or setDistinctOrderMatters ? > These options let you say if the field values should be backed by a Map<String> or a List<String>, the DistinctOrderMatters says if it should be Map<String> or LinkedHashMap<String> These were useful for SOLR-104 when you SQL join a table and may get duplicate rows, but only want the distinct values to keep fields. Now that you point it out, (and there is a good chance it will be in trunk soon) It would make more sense to implement these features as different subclasses of SimpleSolrDocument. > Also, what is the purpose/use of DocumentBuilder.build and DocumentBuilder.loadStoredFields This is for SOLR-139 . to 'modify' a document, you load the existing Document - change it - then store it back. These two functions can happily live in a new class, and could be attached to SOLR-139 . > 2) I thought the SolrDocument API was for incoming documents ... I hope it is also useful for modifying existing Documents and transforming incoming/outgoing documents (but I'll raise that issue later > I think it's a mistake to try and have one single Interface for all three. ... At the very least there should be a seperate API for the indexing side and the query side (because of the boost issue) which can be subclass/superclass relationships. > This sounds fine. We should defiantly solve any know problems with the Lucene document interface. Just using an interface (rather then a concrete class) will be a huge help. Is the only difference between the input Document and output Document that it has boosts? Should we have: SolrDocument + BoostedSolrDocument or SolrDocument + IndexSolrDocument Any thoughts on the common use case where I want to pull a document out of the index (no boosts) change it, then put it back? Do i need to make a new class and copy all the fields? Should SOLR-20 be able to index a SolrDocument (no boosts) as well as a BoostedSolrDocument? I think so... Thanks for looking at this!
          Hide
          Yonik Seeley added a comment -

          > This sounds fine. We should defiantly solve any know problems with the Lucene document interface.
          > Just using an interface (rather then a concrete class) will be a huge help.

          I know this runs contrary to common java OO wisdom, but interfaces can really suck.
          They don't hurt the consumer of a class, but cause major headaches for the provider, trying to evolve an interface and still provide backward compatibility (it's pretty much impossible).

          In Lucene, where we have had a class (like Analyzer), it was trivial adding new functionality like getPositionIncrement(). If it had been an interface, it would have been impossible without breaking all the custom analyzers out there. Where we have had interfaces, and added a new method, we simply broke some peoples code.

          So if it's something that a customer might possibly subclass, a class used as an interface is a much better option.
          If it's internal, or package projected, or something where you really need multiple inheritance, then an interface is fine.

          Show
          Yonik Seeley added a comment - > This sounds fine. We should defiantly solve any know problems with the Lucene document interface. > Just using an interface (rather then a concrete class) will be a huge help. I know this runs contrary to common java OO wisdom, but interfaces can really suck. They don't hurt the consumer of a class, but cause major headaches for the provider , trying to evolve an interface and still provide backward compatibility (it's pretty much impossible). In Lucene, where we have had a class (like Analyzer), it was trivial adding new functionality like getPositionIncrement(). If it had been an interface, it would have been impossible without breaking all the custom analyzers out there. Where we have had interfaces, and added a new method, we simply broke some peoples code. So if it's something that a customer might possibly subclass, a class used as an interface is a much better option. If it's internal, or package projected, or something where you really need multiple inheritance, then an interface is fine.
          Hide
          Ryan McKinley added a comment -

          Here is a much much smaller patch that only adds the SolrDocument class and BoostableSolrDocument subclass.

          We can work through the other bits later, but this would be sufficient for SOLR-20

          It is a quick eclipes refactoring, so the comments may not make sense. I'll check that over in better detail after you all get a chance to look at it...

          Show
          Ryan McKinley added a comment - Here is a much much smaller patch that only adds the SolrDocument class and BoostableSolrDocument subclass. We can work through the other bits later, but this would be sufficient for SOLR-20 It is a quick eclipes refactoring, so the comments may not make sense. I'll check that over in better detail after you all get a chance to look at it...
          Hide
          Hoss Man added a comment -

          these comments are very happhazard, and in the best order i can think of (not hte order i wrote them)

          > Perhaps it would be better to leave out the edge cases and just focus on the SolrDocument

          ...i don't mind big patches that have a lot of things ... it's just weird when there is a big patch with a lot of stuff and it's not clear what it's for ... i was mainly looking for someplace where an UpdateHandler was making a SolrDocument and then calling build on it.

          > Is the only difference between the input Document and output Document that it has boosts?

          there is some more complexity in Lucene docs because of things like the Fieldable options but i don't think those really impact a SolrDocument API since that info is abstracted into the schema and can't be set on a per document basis.

          > Should we have:
          > SolrDocument
          > + BoostedSolrDocument

          BoostedSolrDocument seems to specific to the methods added, and not to the purpose of the class ... i would call it a "SolrInputDocument" (IndexSolrDocument is too vague since the term "index" is used so much in the code base)

          The basic structure in the new patch looks fine by the way, no real concerns from me once the comments are cleaned up (one question: should SolrDocument implement Map<String,Collection<Object>> ??)

          > This is for SOLR-139. to 'modify' a document, you load the existing Document - change it -
          > then store it back.
          >
          > These two functions can happily live in a new class, and could be attached to SOLR-139.

          ...oh, right, i forgot about the "update in place" patch .... yeah i don't think those methods should live in DocumentBuilder (am i alone in thinking DocumentBuilder should probably be deprecated completely once this stuff is commited? ... or ... hmmm ... it could probably be subclassed by one that supports adding a whole SolrInputDocument at once, or one that can start with an older Document and update it with a new SolrInputDocument ... but we can worry about that later)

          "updating" is a direct example of the type of thing i refered to in LUCENE-778 about why a single Lucene Document class is bad. to support updating you should have an explicitly means of composing the output class into the input class ... but in that case you're dealing directly with Lucene Documents – i can understand why we would need to modify a Lucene document into a SolrInputDocument ... but i don't think we really need to worry about the SolrDocument => SolrInputDocument case right?

          Show
          Hoss Man added a comment - these comments are very happhazard, and in the best order i can think of (not hte order i wrote them) > Perhaps it would be better to leave out the edge cases and just focus on the SolrDocument ...i don't mind big patches that have a lot of things ... it's just weird when there is a big patch with a lot of stuff and it's not clear what it's for ... i was mainly looking for someplace where an UpdateHandler was making a SolrDocument and then calling build on it. > Is the only difference between the input Document and output Document that it has boosts? there is some more complexity in Lucene docs because of things like the Fieldable options but i don't think those really impact a SolrDocument API since that info is abstracted into the schema and can't be set on a per document basis. > Should we have: > SolrDocument > + BoostedSolrDocument BoostedSolrDocument seems to specific to the methods added, and not to the purpose of the class ... i would call it a "SolrInputDocument" (IndexSolrDocument is too vague since the term "index" is used so much in the code base) The basic structure in the new patch looks fine by the way, no real concerns from me once the comments are cleaned up (one question: should SolrDocument implement Map<String,Collection<Object>> ??) > This is for SOLR-139 . to 'modify' a document, you load the existing Document - change it - > then store it back. > > These two functions can happily live in a new class, and could be attached to SOLR-139 . ...oh, right, i forgot about the "update in place" patch .... yeah i don't think those methods should live in DocumentBuilder (am i alone in thinking DocumentBuilder should probably be deprecated completely once this stuff is commited? ... or ... hmmm ... it could probably be subclassed by one that supports adding a whole SolrInputDocument at once, or one that can start with an older Document and update it with a new SolrInputDocument ... but we can worry about that later) "updating" is a direct example of the type of thing i refered to in LUCENE-778 about why a single Lucene Document class is bad. to support updating you should have an explicitly means of composing the output class into the input class ... but in that case you're dealing directly with Lucene Documents – i can understand why we would need to modify a Lucene document into a SolrInputDocument ... but i don't think we really need to worry about the SolrDocument => SolrInputDocument case right?
          Hide
          Ryan McKinley added a comment -

          Thanks for your feedback Hoss.

          I think this one is good to go.

          Show
          Ryan McKinley added a comment - Thanks for your feedback Hoss. I think this one is good to go.
          Hide
          Ryan McKinley added a comment -

          > The basic structure in the new patch looks fine by the way, no real concerns from me once the comments are cleaned up (one question: should SolrDocument implement Map<String,Collection<Object>> ??)

          If the class implements Map, JSTL treats it differently – for example, $

          {doc.fieldNames}

          does not call:
          Collection<String> getFieldNames();
          it calls: Map.get( "fieldNames" );

          this is really annoying! Essentially any real function you add is hidden (unless i'm incompetent in JSTL land...)

          This is why I had:

          + ///////////////////////////////////////////////////////////////////
          + // Utility functions to help JSTL (map interface to single field
          + ///////////////////////////////////////////////////////////////////
          +
          + public Map<String,Collection<Object>> getValuesMap()
          +

          { + return _fields; + }

          +
          + public Map<String,Object> getValueMap() {
          + return new Map<String,Object>() {
          + // The only reason we are implementing map!
          + public Object get(Object key)

          { + return getFieldValue( (String)key); + }

          +
          + // Easily Supported methods
          + public boolean containsKey(Object key)

          { return _fields.containsKey( key ); }

          + public Set<String> keySet()

          { return _fields.keySet(); }

          + public int size()

          { return _fields.size(); }

          + public boolean isEmpty()

          { return _fields.isEmpty(); }

          +
          + // Unsupported operations. These are not necessary for JSTL
          + public void clear()

          { throw new UnsupportedOperationException(); }

          + public boolean containsValue(Object value)

          {throw new UnsupportedOperationException();}
          + public Set<java.util.Map.Entry<String, Object>> entrySet() {throw new UnsupportedOperationException();}

          + public Object put(String key, Object value)

          {throw new UnsupportedOperationException();}
          + public void putAll(Map<? extends String, ? extends Object> t) {throw new UnsupportedOperationException();}

          + public Object remove(Object key)

          {throw new UnsupportedOperationException();}
          + public Collection<Object> values() {throw new UnsupportedOperationException();}

          + };
          + }

          I would still like to include these in the API.

          Show
          Ryan McKinley added a comment - > The basic structure in the new patch looks fine by the way, no real concerns from me once the comments are cleaned up (one question: should SolrDocument implement Map<String,Collection<Object>> ??) If the class implements Map, JSTL treats it differently – for example, $ {doc.fieldNames} does not call: Collection<String> getFieldNames(); it calls: Map.get( "fieldNames" ); this is really annoying! Essentially any real function you add is hidden (unless i'm incompetent in JSTL land...) This is why I had: + /////////////////////////////////////////////////////////////////// + // Utility functions to help JSTL (map interface to single field + /////////////////////////////////////////////////////////////////// + + public Map<String,Collection<Object>> getValuesMap() + { + return _fields; + } + + public Map<String,Object> getValueMap() { + return new Map<String,Object>() { + // The only reason we are implementing map! + public Object get(Object key) { + return getFieldValue( (String)key); + } + + // Easily Supported methods + public boolean containsKey(Object key) { return _fields.containsKey( key ); } + public Set<String> keySet() { return _fields.keySet(); } + public int size() { return _fields.size(); } + public boolean isEmpty() { return _fields.isEmpty(); } + + // Unsupported operations. These are not necessary for JSTL + public void clear() { throw new UnsupportedOperationException(); } + public boolean containsValue(Object value) {throw new UnsupportedOperationException();} + public Set<java.util.Map.Entry<String, Object>> entrySet() {throw new UnsupportedOperationException();} + public Object put(String key, Object value) {throw new UnsupportedOperationException();} + public void putAll(Map<? extends String, ? extends Object> t) {throw new UnsupportedOperationException();} + public Object remove(Object key) {throw new UnsupportedOperationException();} + public Collection<Object> values() {throw new UnsupportedOperationException();} + }; + } I would still like to include these in the API.
          Hide
          Ryan McKinley added a comment -

          moved DocumentBuilder issues to SOLR-262

          Show
          Ryan McKinley added a comment - moved DocumentBuilder issues to SOLR-262

            People

            • Assignee:
              Ryan McKinley
              Reporter:
              Ryan McKinley
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development