Solr / SOLR-2400

FieldAnalysisRequestHandler; add information about token-relation

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.3, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      The XML output (simplified example attached) is missing one small piece of information which could be very useful to build a nice Analysis output, and that's the "Token Relation" (if there is a special/correct word for this, please correct me).

      Meaning, it is actually not possible to "follow" the analysis process (completely) when the Tokenizers/Filters drop tokens (e.g. StopWord) or split them into multiple tokens (e.g. WordDelimiter).

      Would it be possible to include this information? If so, it would be possible to create an improved Analysis page for the new Solr Admin (SOLR-2399) - short scribble attached.

      1. 110303_FieldAnalysisRequestHandler_output.xml
        3 kB
        Stefan Matheis (steffkes)
      2. 110303_FieldAnalysisRequestHandler_view.png
        4 kB
        Stefan Matheis (steffkes)
      3. SOLR-2400.patch
        8 kB
        Uwe Schindler
      4. field.xml
        13 kB
        Uwe Schindler
      5. SOLR-2400.patch
        8 kB
        Uwe Schindler
      6. SOLR-2400.patch
        8 kB
        Uwe Schindler
      7. SOLR-2400.patch
        48 kB
        Uwe Schindler
      8. SOLR-2400-revision1.patch
        44 kB
        Uwe Schindler
      9. SOLR-2400-revision1.patch
        45 kB
        Uwe Schindler

        Issue Links

          Activity

          Uwe Schindler added a comment -

          The "position" is used e.g. in analysis.jsp to do exactly what you want. It is the token position. As long as no "broken" TokenFilters are used that fail to modify the posIncr attribute correctly, you can simply use it for alignment.

          Stefan Matheis (steffkes) added a comment -

          Uwe, that was the first thing I thought myself, yes - but .. let's take "flat" (starting at position 4) and follow it. Passing the StopFilter, it's still at position 4; arriving at the WordDelimiter, it's at position 6 - the dash was dropped for being a StopWord, and VA902B got split up into three tokens.

          So, what I guess is missing .. is some kind of information that, for example, the original token at position 2 (VA902B) was split and is now (partially) placed at positions 3 through 6 .. and also, for example, that "flat" is no longer at position 4, because it moved to 6.

          Or did I just miss something really simple?

          Uwe Schindler added a comment - - edited

          Stefan, this is a general issue with TokenStreams that add tokens. TokenStreams that remove tokens should automatically preserve position, but not even all of those do that correctly (we were fixing some of them lately). The way Lucene analysis works makes it impossible to guarantee any correspondence of the position numbers, because for the indexer it's only important what comes out at the end; the steps in between are not interesting. The AnalysisReqHandler, on the other hand, does some bad "hacks" to look "inside" the analysis (by using temporary TokenStreams that buffer tokens), which is not the general use case of TokenStreams.

          I wonder a little bit about your XML file: it only contains text and position, but it should also contain rawTerm, startOffset, and endOffset. When I call analysis I get all of those attributes, not only two of them. Is this a hand-made file, or what is the problem? Which Solr version?

          One possibility to handle this might be the char offsets in the original text: the request handler could use the character offsets of the begin and end of the token in the original stream instead of the token position, but this is likely to break for lots of TokenFilters (WordDelimiterFilter would work as long as you don't do stemming before...). The problem is incorrect handling of offset calculation (also leading to bugs in highlighting) when the inserted terms are longer than their originals.

          Altogether: it's unlikely that you can implement this so that it works for all combinations of TokenStream components.

          Stefan Matheis (steffkes) added a comment -

          Uwe, thanks for your reply.

          I wonder a little bit about your XML file: it only contains text and position, but it should also contain rawTerm, startOffset, and endOffset. When I call analysis I get all of those attributes, not only two of them. Is this a hand-made file, or what is the problem? Which Solr version?

          My fault - indeed it's not the original output. I thought it would be enough to demonstrate the point I was talking about; sorry for that.

          My Solr 4.x nightly build from last week only has the following output; there is no rawTerm - which would be extremely helpful, because with this information it should be possible to establish the relation I talked about earlier.

          <!-- .. -->
          <arr name="org.apache.lucene.analysis.standard.StandardTokenizer">
            <lst>
              <str name="text">this</str>
              <str name="type">&lt;ALPHANUM&gt;</str>
              <int name="start">0</int>
              <int name="end">4</int>
              <int name="position">1</int>
            </lst>
            <!-- .. -->
          </arr>
          <!-- .. -->

          Did I miss an important configuration setting for having rawTerm in the analysis output?

          Stefan Matheis (steffkes) added a comment -

          I've checked out the current trunk revision .. but could not see any change there, especially regarding the raw-term thing. Did I miss something else? Is a special setting required to get this property?

          Uwe Schindler added a comment -

          Hi Stefan,

          sorry for missing your last response.

          About the raw term: the raw term is currently only shown by Solr if the term is binary (like numerics) or similar (when the FieldType does some transformation, like with the deprecated Sortable* field types). I just mentioned it as an example of attributes I was missing in your example output. For solving your problem it is of no use.

          I already mentioned:

          One possibility to handle this might be the char offsets in the original text: the request handler could use the character offsets of the begin and end of the token in the original stream instead of the token position, but this is likely to break for lots of TokenFilters (WordDelimiterFilter would work as long as you don't do stemming before...). The problem is incorrect handling of offset calculation (also leading to bugs in highlighting) when the inserted terms are longer than their originals.

          This might be your only chance (using the OffsetAttribute), but it is likely to break. What you want is not possible with the Lucene analysis API, as some information is missing (it's not needed during analysis - the absolute positions are not important for the indexer, so TokenStreams don't preserve them).

          A possibility to preserve the original positions would be a trick in the analysis RequestHandler: it could insert a fake TokenFilter directly after the Tokenizer that adds an additional attribute with the absolute position (incremented on each call to input.incrementToken()). This could be a hack to achieve what you want.

          Maybe I can help you; this needs some refactoring in the AnalysisRequestHandlers, but might be a good idea.

          Uwe Schindler added a comment -

          Hi Stefan,

          you seem to be working on this admin interface again. How about my last proposal: adding an internal TokenFilter in the FieldAnalysisRequestHandler, inserted directly after the Tokenizer and before the first TokenFilter? This one could simply count the tokens emitted by the Tokenizer and add the count as a special attribute. That way, every token emitted by the Tokenizer would get a unique ID (an integer). If some TokenFilter later splits a token, all parts would get the same ID. Please note: this only works for the first Tokenizer and all TokenFilters together. If another TokenFilter later splits tokens produced by a previous TokenFilter, all of those would get the original ID from the Tokenizer.

          Any comments? This should be quite simple to implement.
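The counting-filter proposal can be modeled outside Lucene. The sketch below is hypothetical Python, not Solr code, and `tag_with_ids`/`word_delimiter` are made-up names: tokens emitted by a tokenizer get increasing integer IDs, and a toy splitting filter propagates each ID to every part, so a consumer can group derived tokens under their original.

```python
import re

def tag_with_ids(tokens):
    # Assign a unique, increasing ID to each token the tokenizer emits,
    # mimicking the proposed counting TokenFilter.
    return [{"text": t, "id": i} for i, t in enumerate(tokens)]

def word_delimiter(tokens):
    # Toy stand-in for WordDelimiterFilter: split on letter/digit
    # boundaries, propagating the originating token's ID to every part.
    out = []
    for tok in tokens:
        parts = re.findall(r"[A-Za-z]+|\d+", tok["text"]) or [tok["text"]]
        for p in parts:
            out.append({"text": p, "id": tok["id"]})
    return out

tagged = tag_with_ids(["VA902B", "flat"])
result = word_delimiter(tagged)
# "VA902B" splits into "VA", "902", "B" -- all three share id 0, so an
# admin UI could tell they stem from the same original token.
```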

          Stefan Matheis (steffkes) added a comment -

          Hi Uwe,

          sorry, missed your last comment :/

          Any comments? This should be quite simple to implement.

          Sounds great! Sample/static output of the analysis handler should be enough as a first step to check whether we could (easily, more or less *g*) integrate that.

          Thanks
          Stefan

          Uwe Schindler added a comment -

          After thinking about it a little more, I think it would even be possible to add this filter after each step to track tokens. The resulting attribute would then contain the whole tracking of positions:

          • After the Tokenizer this attribute would contain "0", "1", "2", ...
          • After the first TokenFilter: "0.0", "1.1", "1.2", "1.3", "2.2" (where the second token (1) emitted by the Tokenizer was split into 3 tokens). I think this would help? Additionally, the filter could use PositionIncrement to track same-position tokens - or this could be left to the consumer (so if 1.2 and 1.3 have posIncr 0, the consumer knows that they are all at the same position). If the TokenFilter used the posIncr to increment the unique IDs, then this would be solved (so 1.x tokens would always get "1.1" as ID if at the same position).

          I will think about it and supply a patch that enriches the FieldAnalysisContentHandler with this extra attribute.

          We can then iterate. But today is an Easter holiday, so a little bit later...
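The per-stage tracking can be sketched roughly as follows (plain Python with illustrative names, not the actual Solr attribute classes): after every stage, the tracking filter appends each token's current position to the history it inherited, so split tokens share a prefix and moved tokens show both positions.

```python
def record_positions(tokens):
    # Append each token's current position to its running history,
    # as the proposed tracking filter would do after every stage.
    for t in tokens:
        t["history"].append(t["pos"])
    return tokens

# Stage 0: tokenizer output -- positions 0..2, history starts empty.
tokens = [{"text": w, "pos": i, "history": []}
          for i, w in enumerate(["moo", "VA902B", "flat"])]
record_positions(tokens)

# Stage 1: a splitting filter breaks "VA902B" into three parts; each part
# inherits the original token's history, then records its new position.
# (Unlike the "2.2" in the example above, no posIncr-0 collapsing is
# modeled here, so "flat" simply moves to position 4.)
split = [
    {"text": "moo",  "pos": 0, "history": tokens[0]["history"][:]},
    {"text": "VA",   "pos": 1, "history": tokens[1]["history"][:]},
    {"text": "902",  "pos": 2, "history": tokens[1]["history"][:]},
    {"text": "B",    "pos": 3, "history": tokens[1]["history"][:]},
    {"text": "flat", "pos": 4, "history": tokens[2]["history"][:]},
]
record_positions(split)

histories = [".".join(map(str, t["history"])) for t in split]
# histories -> ["0.0", "1.1", "1.2", "1.3", "2.4"]: all parts of "VA902B"
# share the leading "1", and "flat" visibly moved from position 2 to 4.
```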

          Stefan Matheis (steffkes) added a comment -

          I will think about it and supply a patch that enriches the FieldAnalysisContentHandler with this extra attribute.

          It gets better and better - the more information we have, the more we can display :> Everything that's possible and helps to understand the analysis would be great!

          We can then iterate. But today is an Easter holiday, so a little bit later...

          Whenever you find the time; I'll continue working on another topic. Thanks anyway, Uwe.

          Uwe Schindler added a comment -

          Here is a first & quick patch for TRUNK (it may not apply to 3.x).

          The FieldAnalysisRequestHandler behaves as before, except that it adds an additional property "positionHistory" to the named lists with attributes. This property contains all positions this token had before, with the last one being the actual position, repeated. "2.2.4.4" means that this token had position 2 after the Tokenizer, still 2 after the first filter, but then changed to 4 after the second filter. The actual position after the 3rd filter is 4.

          By the way, this also fixes a bug in the RequestHandler: the list of tokens is sorted on printout (by position), and the original list is modified by that. Later filters then see the tokens in the new order, which is a bug. The new code copies the list to an array first so it doesn't touch the tokens. This bug only affects strange TokenStreams with negative position increments, so we can fix it together with this issue (once it is committed).

          An example output is:
          http://localhost:8983/solr/analysis/field?analysis.fieldtype=text&analysis.fieldvalue=moo-moo+dontstems+foo-bar+and+this+fucking+token

          (default schema, Solr trunk)

          Hope that helps.
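On the consumer side, a view like the attached scribble could be derived from this property. A minimal sketch (the helper name is hypothetical; this assumes the "."-separated form of this first patch revision):

```python
def position_after_stage(history, stage):
    # "positionHistory" carries one position per pipeline stage,
    # "."-separated in this revision; stage 0 is the Tokenizer output.
    return [int(p) for p in history.split(".")][stage]

# "2.2.4.4": position 2 after the Tokenizer and the first filter,
# position 4 after the second and third filters.
assert position_after_stage("2.2.4.4", 0) == 2
assert position_after_stage("2.2.4.4", 3) == 4
```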

          Uwe Schindler added a comment -

          Here is the output of the above analysis in XML.

          Uwe Schindler added a comment -

          Updated patch (the deep clone in the attribute was not needed)

          Stefan Matheis (steffkes) added a comment -

          Yes =) Thank you Uwe, I applied the patch: works perfectly! I've tried splitting on words and also removing stopwords - both look good.
          Will see how we can integrate this - first for the normal languages and their analysis .. afterwards for the Japanese one.

          Uwe Schindler added a comment -

          Hi Stefan,

          do you have any additional requirements for this patch? It might be a good idea to commit it now, so you can produce a full-featured analysis GUI in your great new admin interface, showing all token relations and their attributes.

          That would really be an improvement over analysis.jsp!

          By the way, to test custom attributes, you can simply show the tokens of a numeric field type like "tint"; it will add some additional attributes (like shift...)!

          I would like to change only one part of my patch: the separator for the hierarchy levels is currently ".", I would prefer "/" (like a fs path) - any other ideas from the other committers?

          Stefan Matheis (steffkes) added a comment -

          Uwe,

          do you have any additional requirements for this patch?

          No, it's perfect – thank you

          It might be a good idea to commit it now, so you can produce a full-featured analysis GUI in your great new admin interface, showing all token relations and their attributes.

          Yes, that would be good. The analysis page is not actually using it yet, but I will integrate it while working on Otis' feedback from SOLR-2399.

          By the way, to test custom attributes, you can simply show the tokens of a numeric field type like "tint"; it will add some additional attributes (like shift...)!

          Ah cool, will do so

          I would like to change only one part of my patch: the separator for the hierarchy levels is currently ".", I would prefer "/" (like a fs path) - any other ideas from the other committers?

          Go for it, fine with me

          Stefan

          Ryan McKinley added a comment -

          I would prefer "/" (like a fs path), any other ideas from the other committers?

          Sounds good to me

          Uwe Schindler added a comment -

          Updated patch with '/' as history separator and updated to trunk.

          There is still a good test missing (test coverage of the FieldAnalysisReqHandler is already bad...).

          Uwe Schindler added a comment -

          Here is a new patch with a test case for the positionHistory. It also adds another test for WDF in combination with the FieldAnalysisReqHandler, as it's more complicated there to track the token position history.

          I think that is ready to commit!

          Uwe Schindler added a comment -

          Committed trunk revision: 1134685
          Committed 3.x revision: 1134692

          Uwe Schindler added a comment -

          I am reopening this issue, as I don't really like the position history as a String.

          I revised the patch to return an int[] (serialized as Integer[]) for positionHistory.
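A hypothetical client-side sketch of what the array form enables (function and variable names are made up): grouping the final tokens by the first entry of positionHistory - their position right after the Tokenizer - recovers which original token each derived token comes from.

```python
from collections import defaultdict

def group_by_origin(tokens):
    # tokens: (text, positionHistory) pairs, history as a list of ints.
    # The first history entry is the position after the Tokenizer, so it
    # identifies the original token each final token derives from.
    groups = defaultdict(list)
    for text, history in tokens:
        groups[history[0]].append(text)
    return dict(groups)

final_stage = [("VA", [1, 1]), ("902", [1, 2]), ("B", [1, 3]), ("flat", [2, 4])]
# group_by_origin(final_stage) -> {1: ["VA", "902", "B"], 2: ["flat"]}
```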

          Uwe Schindler added a comment -

          Will commit soon.

          Uwe Schindler added a comment -

          Small improvement: cache the array when sorting.

          Uwe Schindler added a comment -

          Committed trunk revision: 1135154
          Committed 3.x branch revision: 1135156

          Robert Muir added a comment -

          Bulk close for 3.3


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Stefan Matheis (steffkes)
            • Votes: 0
            • Watchers: 0
