Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.5
    • Fix Version/s: 4.0-ALPHA
    • Component/s: None
    • Labels:
      None

      Description

      PreAnalyzedFieldType provides functionality for indexing (and optionally storing) content that has already been processed and split into tokens by some external processing chain. This implementation defines a serialization format for sending tokens with any currently supported Attributes (e.g. type, posIncr, payload, ...). This data is deserialized into a regular TokenStream that is returned in Field.tokenStreamValue() and thus added to the index as index terms; an optional stored part is returned in Field.stringValue() and added as the stored value of the field.

      This field type is useful for integrating Solr with existing text-processing pipelines, such as third-party NLP systems.
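      As an illustrative sketch of what a client in such a pipeline might send, the snippet below builds a pre-analyzed field value in the JSON form used by the default JSON parser. The key names ("v" for format version, "str" for the stored part, and per-token "t"/"s"/"e"/"i"/"y" for term, offsets, position increment and type) are an assumption here; consult the PreAnalyzedField wiki page for the authoritative format.

```python
import json

def preanalyzed_json(stored, tokens):
    """Serialize an externally analyzed field value.

    `stored` becomes the optional stored part ("str"); `tokens` is a list
    of dicts with assumed keys: "t" (term text), "s"/"e" (start/end
    offset), "i" (position increment), "y" (token type).
    """
    return json.dumps({"v": "1", "str": stored, "tokens": tokens})

# Two tokens produced by some external analysis chain:
value = preanalyzed_json("Hello, World!", [
    {"t": "hello", "s": 0, "e": 5, "i": 1, "y": "word"},
    {"t": "world", "s": 7, "e": 12, "i": 1, "y": "word"},
])
```

      The resulting string is what would be sent as the field's value in an update request; Solr would deserialize it into a TokenStream instead of running an analyzer.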

      1. SOLR-1535.patch
        45 kB
        Andrzej Bialecki
      2. SOLR-1535.patch
        28 kB
        Andrzej Bialecki
      3. SOLR-1535.patch
        27 kB
        Andrzej Bialecki
      4. preanalyzed.patch
        25 kB
        Andrzej Bialecki
      5. preanalyzed.patch
        25 kB
        Andrzej Bialecki

        Issue Links

          Activity

          Andrzej Bialecki added a comment -

          Let's move this discussion to SOLR-4619 .

          John Berryman added a comment - edited

          Ah, I see. This is a bit lower level than I was thinking. Still useful, but different. I was thinking about having PreAnalyzedField extend directly from TextField rather than from FieldType, and then being able to build up whatever analysis chain you want in the usual TextField sense. Query analysis would proceed as with a normal TextField, but index analysis would automatically detect whether the input was already parsed. If it was not, it would go through the normal analysis. If it was already parsed, the token stream would go straight into the index (the assumption being that someone upstream understands what they're doing).

          This way, you could build some extra functionality into the SolrJ client so that PreAnalyzedTextFields would be parsed client-side and sent to Solr. In my current application, we have one Solr instance and N indexers on different machines. The setup described here would take a big load off of Solr. The other benefit of this setup is that query analysis proceeds as it always does. I don't see how someone would search over a PreAnalyzed field as it currently stands without a bit of extra work/custom code on the client.

          One pitfall of my idea is that you'd have to create a similar PreAnalyzedIntField, PreAnalyzedLocationField, PreAnalyzedDateField, etc. I wish Java had mixins or multiple inheritance.

          Thoughts?
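          The "detect whether the input was already parsed" dispatch described above could be sketched roughly as follows (purely illustrative, not part of any patch; the detection heuristic - a JSON object with a "tokens" key - is a hypothetical choice):

```python
import json

def index_tokens(raw, analyze):
    """If `raw` parses as a JSON object with a "tokens" key, treat it as
    pre-analyzed and use its token stream directly; otherwise fall back
    to the normal index-time analysis chain `analyze`."""
    try:
        parsed = json.loads(raw)
        if isinstance(parsed, dict) and "tokens" in parsed:
            return parsed["tokens"]  # trust the upstream analysis
    except ValueError:
        pass  # not JSON at all: plain text input
    return analyze(raw)  # normal index-time analysis
```

          A field type built this way would accept both plain text and pre-analyzed payloads on the same field, at the cost of the detection heuristic occasionally being ambiguous.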

          Andrzej Bialecki added a comment -

          John, this is primarily a Solr feature. I added a short example to the wiki page.

          John Berryman added a comment - edited

          Hey Andrzej, would it be possible to get a minimal example posted on the documentation page? I'd like to use this feature, but I don't really know where to start.

          UPDATE: Looking over your tests in your code, I realize that this is currently a Lucene-only thing. I wonder what it would take to get this into Solr or maybe SolrJ. Food for thought.

          Alexandre Rafalovitch added a comment -

          Not sure where to ask this, as the feature is so new. But how does this work at query time? This needs to be paired somehow with query-time tokenizers/filters, right? I could not find a trivial (or complex) example showing this in action.

          Andrzej Bialecki added a comment - edited

          Hoss was wrong - there is no way to do this, because there is no way to express it in a TokenStream. You should view the PreAnalyzed field type as a serialized TokenStream (with the added ability to specify the stored part independently).

          Edit: I started adding some documentation to http://wiki.apache.org/solr/PreAnalyzedField .

          Neil Hooey added a comment -

          When I asked Hoss at Lucene Revolution yesterday, he said you could manually set term frequency in a pre-analyzed field, but I couldn't find any reference to it in the JSON parser.

          Is there a way to specify term frequency for each term in the field?
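          The serialization format has no per-term frequency attribute, but a term's frequency in Lucene is simply the number of times it occurs in the token stream. So one hedged workaround (not something this patch documents) is to repeat the token, using a position increment of 0 for the repeats so they all stack on the same position:

```python
def with_frequency(term, freq, pos_incr=1):
    """Emit `term` `freq` times: the first occurrence advances the
    position by `pos_incr`, the repeats use a position increment of 0.
    The indexed term frequency then equals `freq`."""
    tokens = [{"t": term, "i": pos_incr}]
    tokens += [{"t": term, "i": 0} for _ in range(freq - 1)]
    return tokens
```

          The resulting token dicts would go into the "tokens" array of the pre-analyzed payload.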

          Andrzej Bialecki added a comment -

          Committed with minor tweaks in rev. 1327982.

          Mark Miller added a comment -

          +1 - I have not fully reviewed the patch, but did a quick eye scan. It looks isolated enough (e.g. it doesn't affect existing classes) and this issue has been open long enough (and look at the votes/watchers) - let's get it into trunk now and we can iterate there if/as needed.

          Jan Høydahl added a comment -

          +1 (not tested, but positive to splitting up the elephant)

          Andrzej Bialecki added a comment -

          The latest patch implements requested improvements. If there are no objections I'd like to commit it shortly, and track further improvements as separate issues.

          Andrzej Bialecki added a comment -

          This patch contains the following improvements:

          • abstracted parser implementations (PreAnalyzedParser)
          • configurable implementation via field init args
          • JsonPreAnalyzedParser that supports a JSON-based format, used as default.
          Andrzej Bialecki added a comment -

          Nice idea about a pluggable format... Hmm. This should then be specified in the field type definition, I think, and not in a preamble of the data itself (the UTF BOM mess comes to mind). I can implement the JSON version and the current "simple" format, each with a version attribute.

          New patch coming soon.

          Jan Høydahl added a comment -

          I wish I had time to do the Avro stuff now, but just go ahead with whatever you choose.

          Since this format will potentially be adopted by many 3rd-party frameworks, we should take multi-language support and back-compat seriously so we do not end up in a situation similar to JavaBin v1/v2... Perhaps a JSON structure with Base64 for binaries and a mandatory version attribute is a good generic start?

          Chris Male added a comment -

          Can't we just provide an abstraction so people can choose whatever format they want? You might use JSON out of the box, but Jan could implement an Avro alternative if he wanted to. That also gives us a way to grow the format as our needs change.

          Andrzej Bialecki added a comment -

          Patch updated to the latest trunk. This still uses the custom serialization format. Please weigh in with suggestions about how to proceed - I see the following options:

          • keep the custom format as is (it's compact and easy to produce)
          • use JSON instead (easy to produce, but more chatty, binary values would have to be base64 encoded)
          • use Avro (compact and backward/forward compatible, self-describing, but adds dependencies and is not that easy to construct by hand)
          Hoss Man added a comment -

          Bulk of fixVersion=3.6 -> fixVersion=4.0 for issues that have no assignee and have not been updated recently.

          email notification suppressed to prevent mass-spam
          pseudo-unique token identifying these issues: hoss20120321nofix36

          Andrzej Bialecki added a comment -

          Avro adds yet another dependency, which would make sense if Solr used Avro instead of JavaBin - but that's a separate discussion that merits a separate JIRA issue... Since it isn't used now, I'd rather avoid putting an additional burden on clients just for the sake of this patch.

          JSON could be a nice alternative if only it supported binary data natively (it doesn't - one has to use base64, though that's not as awful as you might think). I wanted to avoid complex formats like XML - too much boilerplate for such small bits of data. So the current custom serialization tries to strike a balance between simplicity, flexibility and low overhead.
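          For concreteness, base64-encoding a binary token payload for a JSON payload could look like this (the "p" key for the payload is an assumption about the format, not confirmed by this issue):

```python
import base64

def encode_payload(data: bytes) -> str:
    """Base64-encode a binary token payload so it can be embedded in a
    JSON string value. The ~33% size overhead is the cost JSON imposes
    for lacking a native binary type."""
    return base64.b64encode(data).decode("ascii")

# A token carrying a 3-byte binary payload under the hypothetical "p" key:
token = {"t": "term", "i": 1, "p": encode_payload(b"\x00\x01\x02")}
```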

          Serialization of terms was also discussed in SOLR-1632 - e.g. this patch doesn't serialize binary terms properly.

          David Smiley added a comment -

          Yes, definitely Avro instead of home-grown serialization/binary format. Same for JavaBin.

          Jan Høydahl added a comment -

          Became aware of this during EuroCon. This is great stuff.
          Have you thought about going with <buzzwordAlert>Avro</buzzwordAlert> for the serialization format? It would better support changing the serialization format in new versions, and be more compact, especially when serializing binary data (instead of using base64). The Avro version of the document could also be the new binary serialization format to replace JavaBin, so that clients other than SolrJ can benefit from binary streaming.

          Robert Muir added a comment -

          3.4 -> 3.5

          Robert Muir added a comment -

          Bulk move 3.2 -> 3.3

          Andrzej Bialecki added a comment -

          Updated patch. This patch also implements getAnalyzer()/getQueryAnalyzer() so that it's possible to test fields in analysis.jsp.

          Hoss Man added a comment -

          Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

          A unique token for finding these 240 issues in the future: hossversioncleanup20100527

          Andrzej Bialecki added a comment -

          Sigh... attaching the correct patch.

          Andrzej Bialecki added a comment -

          Oops... the previous patch produced NPEs. This one doesn't.

          Andrzej Bialecki added a comment -

          Patch updated to the current trunk.


            People

            • Assignee:
              Andrzej Bialecki
              Reporter:
              Andrzej Bialecki
            • Votes:
              9
              Watchers:
              10
