Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: modules/highlighter
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Highlighter still uses char[] terms (it consumes tokens from the analyzer as char[], not as BytesRef), which causes problems for merging SOLR-2497 to trunk.
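
      For reference, a minimal sketch (not the actual Highlighter internals; the analyzer, field name, and text variables here are placeholders) of the char[]-based consumption pattern in question - the highlighter reads terms through CharTermAttribute and character offsets through OffsetAttribute:

          TokenStream ts = analyzer.tokenStream("body", new StringReader(storedText));
          CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
          OffsetAttribute offsetAtt = ts.addAttribute(OffsetAttribute.class);
          ts.reset();
          while (ts.incrementToken()) {
            String term = termAtt.toString();          // term text backed by char[]
            int start = offsetAtt.startOffset();       // character offsets into the stored value
            int end = offsetAtt.endOffset();
            // ... score the term against the query and mark [start, end) in the fragment ...
          }
          ts.end();
          ts.close();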

        Issue Links

          Activity

          Uwe Schindler added a comment -

           This issue blocks merging SOLR-2497, because NumericFields in Solr now have stored field content that is no longer binary, so DefaultSolrHighlighter wants to "highlight" the field. For that it passes Document.getValues() [which are now string representations "cast" from the real NumericField for backwards-compatibility reasons - see the explanation in LUCENE-3065] to the highlighter using the TokenStream that it gets from the SchemaField - which is a NumericTokenStream. Highlighter then fails with an IllegalArgumentException because numeric fields are binary only and don't support CharTermAttribute.

           I fixed this in the linked issue by adding a "hack", so please remember to fix it properly in Solr once this issue is resolved (maybe just remove the hack; not sure) - see DefaultSolrHighlighter, line 402ff.:

          private void doHighlightingByHighlighter( Query query, SolrQueryRequest req, NamedList docSummaries,
                int docId, Document doc, String fieldName ) throws IOException {
              final SolrIndexSearcher searcher = req.getSearcher();
              final IndexSchema schema = searcher.getSchema();
              
              // TODO: Currently in trunk highlighting numeric fields is broken (Lucene) -
              // so we disable them until fixed (see LUCENE-3080)!
              // BEGIN: Hack
              final SchemaField schemaField = schema.getFieldOrNull(fieldName);
              if (schemaField != null && (
                (schemaField.getType() instanceof org.apache.solr.schema.TrieField) ||
                (schemaField.getType() instanceof org.apache.solr.schema.TrieDateField)
              )) return;
              // END: Hack
              ...
          
          Robert Muir added a comment -

          I'm sorry, I don't think we should change the highlighter to use BytesRef.

           Not unless offsets are changed to byte offsets to match; otherwise this will make highlighting even more complicated for no good reason.

          Uwe Schindler added a comment -

          In the past this was not an issue, because stored numeric fields in Solr were only binary, but now they are consumed by the highlighter component.

           If we don't fix the highlighter, I have no problem with that. We should add some checks in the highlighter to prevent NumericFields from being highlighted (e.g. if the TokenStream does not support CharTermAttribute, it should stop highlighting). The above hack was just faster to implement in Solr so as not to slow down the merge (because index formats from 3.x and trunk are now incompatible, trunk needs to be committed ASAP).

           This is why I opened this issue: just to record that it's a little bit inconsistent. But of course your point is valid; it may create more problems.
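
           A minimal sketch of that kind of check (placement and surrounding variable names are assumed, not actual Lucene code):

               // Bail out before highlighting if the stream cannot produce character terms,
               // e.g. a NumericTokenStream, whose attribute factory rejects CharTermAttribute.
               if (!tokenStream.hasAttribute(CharTermAttribute.class)) {
                 return;   // nothing a char[]-based highlighter can do with this field
               }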

          Robert Muir added a comment -

           Right. Because highlighters are designed to do a lot of high-level work on text to produce (hopefully good) snippets, e.g. various segmentation and fragmenting and whatever they do, I think it would be best if they continued to work on char[]. That makes it easier for people to extend the APIs.

          I agree with the idea of somehow adding a check or assert to prevent Numerics from being highlighted... I don't think it makes sense to highlight NumericFields.

          Mike Sokolov added a comment -

           There could be a good reason, though, for using byte offsets in highlighting. I have in mind an optimization that would pull in text from an external file or other source, enabling highlighting without stored fields. For best performance the snippet should be pulled from the external source using random access to storage, but this requires byte offsets. I think this might be a big win for large field values.

           This could only be done if the highlighter doesn't need to perform any text manipulation itself, so it's not really appropriate for Highlighter, as Robert said, but in the case of FVH it might be possible to implement. I'm looking at this, but before I get too deep in I'm wondering whether anyone can comment on the feasibility of using byte offsets. I'm unclear on what offsets get used for other than highlighting: would it cause problems to have a CharFilter that returns "corrected" offsets, such that char positions in the analyzed text are translated into byte positions in the source text?
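
           To illustrate the random-access idea, a rough sketch (assuming the external source is a UTF-8 file and that byte offsets are available from the index; the class and method names are made up):

               import java.io.IOException;
               import java.io.RandomAccessFile;
               import java.nio.charset.StandardCharsets;

               public class ExternalSnippetSource {
                 /** Reads the bytes [startByte, endByte) of the external source and decodes them. */
                 public static String readSnippet(String path, long startByte, long endByte) throws IOException {
                   try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
                     byte[] buf = new byte[(int) (endByte - startByte)];
                     raf.seek(startByte);            // jump straight to the match; no need to load the whole value
                     raf.readFully(buf);
                     return new String(buf, StandardCharsets.UTF_8);   // assumes the source is UTF-8
                   }
                 }
               }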

          Robert Muir added a comment -

           Mike, it's an interesting idea, as I think the offsets are intended to be opaque to the app (so you should be able to use byte offsets if you want).

           There are some problems, though, especially token filters that muck with offsets:
          NGramTokenFilter, WordDelimiterFilter, ...

           In general there are assumptions here that offsets are UTF-16.
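
           A small illustration of the mismatch (plain Java, not library code): character offsets and UTF-8 byte offsets diverge as soon as the text contains non-ASCII characters.

               String text = "caf\u00e9 latte";
               int charOffset = text.indexOf("latte");      // 5 (UTF-16 code units)
               int byteOffset = text.substring(0, charOffset)
                   .getBytes(java.nio.charset.StandardCharsets.UTF_8).length;   // 6 ('\u00e9' takes 2 bytes in UTF-8)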

          Robert Muir added a comment -

           I'm not trying to scare you off from that byte-offset-converting CharFilter idea with my comment above, either - I think this would be interesting to fix! Maybe we just need to fix the situation so that token filters can pass down the call to 'correctOffset' and then use it; I think it might only be defined on Tokenizer...

          Mike Sokolov added a comment -

           It might be a bit more complicated: it looks like WordDelimiterFilter, in generatePart and concatenate, for example, performs arithmetic on the offsets. So it would either need to know the units of the offsets it was passed, or be given more than just a correctOffset() method: it seems to require something like addCharsToOffset(offset, charOffsetIncr), where charOffsetIncr is a number of chars but offset is in some unspecified unit.
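
           As a concrete illustration of the unit mismatch (simplified; not the actual WordDelimiterFilter code), arithmetic like the following only works if both operands use the same unit:

               // partLength is counted in chars; if startOffset were a byte offset, the
               // computed end offset would be wrong for any multi-byte text.
               offsetAtt.setOffset(startOffset, startOffset + partLength);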

          Robert Muir added a comment -

           Yes: in general I think it would be problematic, especially since most tests use only all-ASCII data.

           Another problem on this issue is that if you want to use bytes with the Tokenizer analysis chain, it only takes a Reader, so you cannot assume anything about the original bytes or their encoding (e.g. that it's UTF-8).

          Mike Sokolov added a comment -

           Yeah, I knew that at some point, but stuffed it away as something to think about later. There really is no way to pass byte streams into the analysis chain. Maybe providing a character encoding to the filter could enable it to compute the needed byte offsets.
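
           A rough sketch of that idea (hypothetical helper, not existing Lucene code; assumes the input is decoded with a known encoding such as UTF-8): record the byte position at the start of every char position while decoding, so that a CharFilter-style correct() can map char offsets back to byte offsets.

               import java.nio.charset.StandardCharsets;

               public class ByteOffsetMap {
                 private final int[] charToByte;                  // charToByte[i] = byte offset of char i

                 public ByteOffsetMap(String decodedText) {
                   charToByte = new int[decodedText.length() + 1];
                   int bytes = 0;
                   for (int i = 0; i < decodedText.length(); i++) {
                     charToByte[i] = bytes;
                     // NOTE: supplementary characters (surrogate pairs) would need extra care here.
                     bytes += String.valueOf(decodedText.charAt(i)).getBytes(StandardCharsets.UTF_8).length;
                   }
                   charToByte[decodedText.length()] = bytes;
                 }

                 /** What a byte-aware correctOffset() would return for a char offset. */
                 public int correct(int charOffset) {
                   return charToByte[charOffset];
                 }
               }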

          Robert Muir added a comment -

           Well, personally I am hesitant to introduce any encodings or bytes into our current analysis chain, because it's unnecessary complexity that will introduce bugs (at the moment, it's the user's responsibility to create the appropriate Reader, etc.).

           Furthermore, not all character sets can be 'corrected' with a linear conversion like this: for example, some actually order the text in a different direction, and things like that... there are many quirks to non-Unicode character sets.

           Maybe as a start, it would be useful to prototype some simple experiments with a "binary analysis chain" and hack up a highlighter to work with them? That way we would have an understanding of what the potential performance gain is.

           Here's some example code for a dead simple binary analysis chain that only uses bytes the whole way through; you could take these ideas and prototype one with just all-ASCII terms, splitting on the space byte and such:
          http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/TestBinaryTerms.java
          http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/src/test/org/apache/lucene/index/BinaryTokenStream.java
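
           For the "split on the space byte" prototype idea, a tiny sketch in plain Java (not tied to the Lucene analysis API; names are made up): tokenize raw bytes and keep byte offsets for each term.

               import java.util.ArrayList;
               import java.util.List;

               public class SpaceByteTokenizer {
                 /** A token as byte offsets into the original content. */
                 public static class ByteToken {
                   public final int startByte, endByte;
                   public ByteToken(int startByte, int endByte) { this.startByte = startByte; this.endByte = endByte; }
                 }

                 /** Splits on the space byte (0x20), ASCII-style, recording byte offsets. */
                 public static List<ByteToken> tokenize(byte[] content) {
                   List<ByteToken> tokens = new ArrayList<ByteToken>();
                   int start = 0;
                   for (int i = 0; i <= content.length; i++) {
                     if (i == content.length || content[i] == 0x20) {
                       if (i > start) tokens.add(new ByteToken(start, i));
                       start = i + 1;
                     }
                   }
                   return tokens;
                 }
               }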

          Mike Sokolov added a comment -

          I agree it's necessary to prove there is some point to all this - I'm working on getting some numbers. At the moment I'm just assuming ASCII encoding, but I'll take a look at the binary stuff too - thanks.

          Jack Krupansky added a comment -

          Hmmm... there hasn't been any activity on this issue for over two years, but the code in branch_4x for org.apache.solr.highlight.DefaultSolrHighlighter#doHighlightingByHighlighter has this TODO comment and "hack" code:

              // TODO: Currently in trunk highlighting numeric fields is broken (Lucene) -
              // so we disable them until fixed (see LUCENE-3080)!
              // BEGIN: Hack
              final SchemaField schemaField = schema.getFieldOrNull(fieldName);
              if (schemaField != null && (
                (schemaField.getType() instanceof org.apache.solr.schema.TrieField) ||
                (schemaField.getType() instanceof org.apache.solr.schema.TrieDateField)
              )) return;
              // END: Hack
          

          And, there is no note in the Solr highlighting wiki/refguide concerning this released "hack".


            People

            • Assignee: Unassigned
            • Reporter: Michael McCandless
            • Votes: 0
            • Watchers: 1
