Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      Need a way to rapidly do a bulk update of a single field for use as a component in a function query (no need to be able to search on it).
      Idea: create an ExternalValueSource fieldType that reads its values from a file. The file could be simple id,val records, and stored in the index directory so it would get replicated.

      Values could optionally be updated more often than the searcher (hashCode/equals should take this into account to prevent caching issues).
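The caching concern above can be illustrated with a minimal sketch. It assumes the value source is identified by its field name plus some version marker for the external file (for example a last-modified timestamp); the class name `ExternalValueKey` and both fields are hypothetical and are not taken from the Solr code base.

```java
import java.util.Objects;

// Hypothetical identity object for a value source backed by an external file.
// Including the file version in equals/hashCode means a cache keyed on this
// object will not return results computed from an older copy of the file.
final class ExternalValueKey {
    private final String fieldName;  // field the external values belong to
    private final long fileVersion;  // assumed version marker, e.g. last-modified time

    ExternalValueKey(String fieldName, long fileVersion) {
        this.fieldName = Objects.requireNonNull(fieldName);
        this.fileVersion = fileVersion;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ExternalValueKey)) return false;
        ExternalValueKey other = (ExternalValueKey) o;
        return fileVersion == other.fileVersion && fieldName.equals(other.fieldName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fieldName, fileVersion);
    }
}
```

Two keys for the same field compare equal only while the file version is unchanged, so a reload of the file naturally invalidates cached entries.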

          Activity

          Yonik Seeley added a comment -

          Obstacle #1: how to find the index directory. We really need the Solr core passed in at some point, perhaps during FieldType.init().
          It would be nice if a SolrQueryRequest object were passed during all calls to things like getValueSource(), but it may be too late for that (and much more difficult).

          Hoss Man added a comment -

          FieldType is abstract, so we can always add a getValueSource(SolrQueryRequest req, SchemaField field) that delegates to getValueSource(SchemaField field) by default.

          (Ditto for the init method if needed... I almost think we should do both, since I can imagine situations where a FieldType might want to go ahead and pre-compute some info, but doing something like this would raise a lot of questions about what to do when newSearchers are opened. It might be better to stick with the request-type access for now until we have stronger use cases for anything else.)
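The delegating overload suggested here might look roughly like the sketch below. The types are simplified stand-ins declared only so the example compiles on its own; they are not the real Solr classes, and `PlainFieldType` is a hypothetical subclass.

```java
// Simplified stand-ins for illustration only; the real Solr classes differ.
class SchemaField {
    final String name;
    SchemaField(String name) { this.name = name; }
}

class SolrQueryRequest { }

class ValueSource {
    final String description;
    ValueSource(String description) { this.description = description; }
}

abstract class FieldType {
    // Existing extension point that subclasses already implement.
    public abstract ValueSource getValueSource(SchemaField field);

    // New overload: ignores the request by default, so existing subclasses
    // keep working and only request-aware types need to override it.
    public ValueSource getValueSource(SolrQueryRequest req, SchemaField field) {
        return getValueSource(field);
    }
}

// Hypothetical subclass that only knows about the older one-argument form.
class PlainFieldType extends FieldType {
    @Override
    public ValueSource getValueSource(SchemaField field) {
        return new ValueSource("plain:" + field.name);
    }
}
```

Because the base class supplies the default, adding the overload is source-compatible for subclasses that never look at the request.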

          Yonik Seeley added a comment -

          > a lot of questions about what to do when newSearchers are opened
          Yeah, it's not actually getValueSource() where one would normally want the request. A ValueSource is like a Query: relatively independent of context. getValues(IndexReader reader) is where you normally want it, but there are layers of Lucene in between that know nothing of Solr.

          Yonik Seeley added a comment -

          Hmmm, due to the pluggable query parsers patch, I pass around the QParser everywhere now, so perhaps
          it is easiest and most consistent to extend that to getValueSource:
          public ValueSource getValueSource(SchemaField field, QParser parser)

          Easy change... it's only called in one place in the source.

          Yonik Seeley added a comment - edited

          OK, the latest SOLR-334 patch includes external value source.

          Here is what the fieldType in the schema looks like:
          <fieldType name="file" class="solr.ExternalFileField" keyField="id" defVal="1" stored="false" indexed="false" valType="float"/>

          • keyField will normally be the unique key field, but it doesn't have to be.
            • it's OK to have a keyField value that can't be found in the index
            • it's OK to have some documents without a keyField in the file (defVal is used as the default)
            • it's OK for a keyField value to point to multiple documents (no uniqueness requirement)
          • valType is a reference to another fieldType to define the value type of this field (must currently be FloatField (float))

          The format of the external file is simply
          keyFieldValue=floatValue
          keyFieldValue=floatValue
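As a rough illustration of the file format and the defVal fallback described above, here is a minimal parsing sketch. It is not the committed implementation: the class `ExternalFileParser` and its methods are hypothetical, and two behaviours are assumptions made for the example (malformed lines are skipped, and the last `=` on a line separates key from value).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that reads "keyFieldValue=floatValue" records.
final class ExternalFileParser {

    // Parses one record per line. Lines without '=' or with a non-numeric
    // value are skipped (an assumption, not documented behaviour).
    static Map<String, Float> parse(Reader in) throws IOException {
        Map<String, Float> values = new HashMap<>();
        BufferedReader reader = new BufferedReader(in);
        String line;
        while ((line = reader.readLine()) != null) {
            int eq = line.lastIndexOf('=');  // assumption: split on the last '='
            if (eq < 0) continue;
            String key = line.substring(0, eq);
            try {
                values.put(key, Float.parseFloat(line.substring(eq + 1).trim()));
            } catch (NumberFormatException e) {
                // skip records whose value is not a float
            }
        }
        return values;
    }

    // Convenience overload for in-memory content.
    static Map<String, Float> parse(String content) throws IOException {
        return parse(new StringReader(content));
    }

    // Documents whose key is absent from the file fall back to defVal.
    static float valueFor(Map<String, Float> values, String key, float defVal) {
        Float v = values.get(key);
        return v != null ? v : defVal;
    }
}
```

For example, with the content `doc1=1.5` and `doc2=0.25` on separate lines and defVal set to 1, looking up `doc1` gives 1.5 while any key not in the file gives 1.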

          Solr looks for the external file in the index directory under the name of
          external_<fieldname> or external_<fieldname>.*

          If any files of the latter pattern appear, the last (after being sorted by name) will be used and previous versions will be deleted. This is to help support systems where one may not be able to overwrite a file (like Windows, if the file is in use).
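The file selection rule can be sketched as follows. This is only an illustration of "sort by name, take the last": the class `LatestExternalFile` is hypothetical, it works on a plain list of file names rather than a directory, and it does not delete older versions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper that picks which external file to load for a field.
final class LatestExternalFile {

    // Returns the chosen file name, or null if no candidate exists.
    static String choose(List<String> fileNames, String fieldName) {
        String base = "external_" + fieldName;
        List<String> candidates = new ArrayList<>();
        for (String name : fileNames) {
            // Matches external_<fieldname> and external_<fieldname>.*
            if (name.equals(base) || name.startsWith(base + ".")) {
                candidates.add(name);
            }
        }
        if (candidates.isEmpty()) return null;
        Collections.sort(candidates);                  // plain lexicographic order
        return candidates.get(candidates.size() - 1);  // last name wins
    }
}
```

Since the ordering in this sketch is lexicographic, a suffix such as `.10` would sort before `.2`; fixed-width suffixes (timestamps or zero-padded counters) avoid that surprise.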

          If the external file has already been loaded, and it is changed, those changes will not be visible until a commit has been done.

          The external file may be sorted or unsorted by the key field, but it will be substantially slower (untested) if it isn't sorted.

          Yonik Seeley added a comment -

          Attaching patch (separated out from pluggable query parsers, but still depends on it).
          I'll commit shortly barring objections.

          J.J. Larrea added a comment -

          My apologies for these last-minute peanut-gallery comments, and especially if they're completely off-target (I've not yet used Function Queries), but reviewing the patch raised these questions and ideas:

          1. Why force a 1:1 mapping between the fieldname and the filename? Could there ever be a situation where multiple fields would want to share the same file, e.g. if the file is a sampling of a generic weighting function, or, even if it is field-specific, if it needs to be shared across multiple Solr instances/cores?

          Within the current structure, an extra file="<baseFile>" argument to ExternalFileField and FileFloatSource couldn't hurt; it could still default to external_<fieldname>, relative paths could still resolve to ffs.indexDir, and the getLatestFile extension logic could still be applied, but specified names with relative or absolute paths would be allowed.

          2. Is ExternalFileField useful apart from being used as input to function queries, e.g. could one sort or facet against it?

          2a. If not (or even if so), couldn't one get enhanced flexibility and simplicity by creating a function interface to FileFloatSource that uses a sub-ValueSource to obtain key values? That way the domain of the mapping function isn't limited to a literal set of Terms. For example, a function of the form

          filemap( <keyFieldName>[, "baseFilePath"] )

          could be applied as, for example,

          boost( filemap( keyField ) )
          boost( filemap( div( ord( someField ), const( 1426 ) ), "/var/data/termBooster" ) )

          I'm thinking something like this (added to FunctionQParser):

          vsParsers.put("filemap", new VSParser() {
            ValueSource parse(FunctionQParser fp) throws ParseException {
              ValueSource source = fp.parseValSource();
              fp.sp.expect(",");
              String base = fp.sp.getQuotedString();
              // (would also want to get the default in there)
              return new FileFloatFunction(source, base);
            }
          });

          One would think a FileFloatFunction could extend FieldCacheSource, but I assume there was a good reason the FC code was duplicated rather than referenced, e.g. limited access.

          2b. If the external file could be useful for sorting/faceting, and if it could be implemented as a Function as above, then perhaps ExternalFileField could be recast as a more general FunctionField which takes a QueryParser.StrParser string in an attribute?

          <fieldType name="file" class="solr.FunctionField" expression="filemap(id)" stored="false" indexed="false" valType="float"/>

          Is there any sense to these (even if the scope is way too large to be implemented in the foreseeable future)?

          I also have some thoughts on SOLR-334 which I'll write up in a few days.

          Yonik Seeley added a comment -

          Thanks for the review JJ, I had missed it earlier somehow (I just committed this code).

          Re: specifying the filename... yes, I thought it might be useful in the future, especially being able to specify somewhere other than the index directory. I simply left it out for now because nothing is lost by deferring it.

          > 2. Is ExternalFileField useful apart from being used as input to function queries, e.g. could one sort or facet against it?

          Not currently. Perhaps in the future it would be possible to make it searchable... not sure. And it seems like a good idea to allow sorting by a ValueSource in the future. Faceting: yes, I think so (again, in the future).

          > 2a. If not (or even if so), couldn't one get enhanced flexibility and simplicity by creating a function interface to FileFloatSource that uses a sub-ValueSource to obtain key values? That way the domain of the mapping function isn't limited to a literal set of Terms. For example, a function of the form

          Hmmm, I hadn't thought of hooking it directly via a new type of function, but that would work.
          add(1, filevalues("myexternalfilename","float") )

          I'm not sure I understand the form you picked though (a ValueSource param to filemap).

          > One would think a FileFloatFunction could extend FieldCacheSource, but I assume there was a good reason the FC code was duplicated rather than referenced, e.g. limited access.

          Right, Lucene doesn't allow write access to the FieldCache.

          > 2b. If the external file could be useful for sorting/faceting, and if it could be implemented as a Function as above, then perhaps ExternalFileField could be recast as a more general FunctionField which takes a QueryParser.StrParser string in an attribute?
          >
          > <fieldType name="file" class="solr.FunctionField" expression="filemap(id)" stored="false" indexed="false" valType="float"/>

          So a FunctionField would be a shortcut or alias to any function query expression... that's a pretty interesting idea.
          Since the signature for getValueSource now includes the QParser, this should be doable.

          Shalin Shekhar Mangar added a comment -

          This issue has been fixed by Yonik in revision 587098 and released with 1.3


            People

            • Assignee:
              Yonik Seeley
              Reporter:
              Yonik Seeley