SOLR-3287

3x tutorial tries to demo schema features that don't work with 3x schema

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels: None

      Description

      I just audited the tutorial on the 3x branch to ensure everything would work for the 3.6 release, and ran into two sections where things were very confusing and seemed broken to me (even as a Solr expert).

      https://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x/solr/core/src/java/doc-files/tutorial.html

      1) "Text Analysis": of the 5 queries in this section, only the "pixima" example works (power-shot matches documents, but not the ones the tutorial suggests it should, and for different reasons). The lead-in para does explain that you have to edit your schema.xml in order for these links to work, but it's confusing, and I honestly read it 3 times before I realized what it was saying (the first two times I thought it was saying that because the content is in English, English-specific field types are used, and you can change those to text_general if you don't use English).

      Bottom line: the links are confusing since they don't work "out of the box" with the simple commands shown so far

      If you know your textual content is English, as is the case for the example documents in this tutorial, and you'd like to apply English-specific stemming and stop word removal, as well as split compound words, you can use the text_en_splitting fieldType instead. Go ahead and edit the schema.xml under the solr/example/solr/conf directory, and change the type for fields text and features from text_general to text_en_splitting. Restart the server and then re-post all of the documents, and then these queries will show the English-specific transformations:

      • A search for power-shot matches PowerShot, and adata matches A-DATA due to the use of WordDelimiterFilter and LowerCaseFilter.
      • A search for features:recharging matches Rechargeable due to stemming with the EnglishPorterFilter.
      • A search for "1 gigabyte" matches things with GB, and the misspelled pixima matches Pixma due to use of a SynonymFilter.
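The schema edit described in the quoted paragraph can be sketched roughly as follows. This is only an illustration: the schema excerpt is hypothetical (every attribute other than name and type is an assumption, not copied from the 3x example schema), and retype_fields is a made-up helper name. The tutorial itself asks the reader to edit schema.xml by hand.

```python
import xml.etree.ElementTree as ET

# Hypothetical excerpt of schema.xml. Attributes other than "name" and
# "type" are assumptions for illustration only.
SCHEMA_EXCERPT = """<schema name="example" version="1.5">
  <fields>
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
    <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
  </fields>
</schema>"""


def retype_fields(schema_xml, field_names, new_type):
    """Return the schema XML with the named fields switched to new_type."""
    root = ET.fromstring(schema_xml)
    for field in root.iter("field"):
        if field.get("name") in field_names:
            field.set("type", new_type)
    return ET.tostring(root, encoding="unicode")


# Switch "text" and "features" as the tutorial paragraph suggests;
# "name" is left on text_general.
updated = retype_fields(SCHEMA_EXCERPT, {"text", "features"}, "text_en_splitting")
print(updated)
```

Round-tripping a real schema.xml through ElementTree would drop its comments, so for an actual install a manual edit is probably the safer route. The server restart and re-posting of documents mentioned in the tutorial text are still needed afterwards.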

      2) "Analysis Debugging"

      Likewise, all of the analysis.jsp example URLs attempt to show off how various features work, but the fields used don't demonstrate the analysis being discussed unless the user has edited the schema as discussed in the previous section

      This shows how "Canon Power-Shot SD500" would be indexed as a value in the name field. Each row of the table shows the resulting tokens after having passed through the next TokenFilter in the analyzer for the name field. Notice how both powershot and power, shot are indexed. Tokens generated at the same position are shown in the same column, in this case shot and powershot.

      Selecting verbose output will show more details, such as the name of each analyzer component in the chain, token positions, and the start and end positions of the token in the original text.

      Selecting highlight matches when both index and query values are provided will take the resulting terms from the query value and highlight all matches in the index value analysis.

      Here is an example of stemming and stop-words at work.

          Activity

          Hoss Man added a comment -

          I don't have a "great" suggestion for dealing with this. Fundamentally it comes down to a conflict between trying to make the field types used by the example fields general and generic enough to be useful for any language so people can re-use them, vs having fields in the example that let us show off some features that aren't necessarily things all users will want in all of their text fields if they copy the schema.

          We could use copyField to create "_en" versions of all these fields, but this type of solution has also led to confusion/problems in the past, with people leaving those copyFields in the schema.xml when they copy it, and winding up with indexes that are twice as big as they need to be.

          My best suggestions are:

          • For the search links in #1:
            • leave the verbiage as is, but maybe put this line in bold: Go ahead and edit the schema.xml under the solr/example/solr/conf directory, and change the type for fields text and features from text_general to text_en_splitting ... I would also suggest changing it to: Go ahead and edit the schema.xml under the solr/example/solr/conf directory to use type="text_en_splitting" for the fields "text" and "features"
            • include a <pre> box showing an example of what the <field/> declarations will look like in XML if the user makes these changes
            • I think we should also change the example queries so they aren't actually links, and just show the query syntax. My thinking is that this will act as a mental cue that these are examples of valid queries, but they don't work "out of the box"
          • For the analysis.jsp link in #2: I think we should switch from using the "name=name" and "name=text" params to using "type=text_en" (with a tweak in verbiage to make it clear what the URLs are showing) so these work even if the user doesn't edit the schema.
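For reference, the five searches from item 1 of the description can be written out as plain query strings and turned into URLs as below. This is a rough sketch: the base URL assumes the example server's select handler on its default host and port, and select_url is a hypothetical helper. It does not cover the analysis.jsp links, whose exact parameters are not spelled out in this issue.

```python
from urllib.parse import urlencode

# Assumed location of the example server's select handler.
SELECT_URL = "http://localhost:8983/solr/select"

# The five queries named in the tutorial's "Text Analysis" section.
QUERIES = ["power-shot", "adata", "features:recharging", '"1 gigabyte"', "pixima"]


def select_url(query):
    """Build a select URL with the query URL-encoded into the q parameter."""
    return SELECT_URL + "?" + urlencode({"q": query})


for q in QUERIES:
    print(q, "->", select_url(q))
```

Showing the raw query next to its encoded URL keeps the query syntax visible, which is the point of the suggestion above.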

          Anyone have any better ideas?

          Hoss Man added a comment -

          Committed revision 1306166.
          Committed revision 1306167.

          a) I added the <pre> block with the <field/> changes as described, but I tweaked the wording of the intro para a bit from what I initially suggested and didn't bother bolding that one sentence to try and draw attention to it (the <pre> block is eye-catching enough I think)
          b) I left the links in for those queries that only work if you change the schema, but tweaked the wording to be speculative (i.e. "can match" instead of "matches") so it's more accurate even if they don't change the schema
          c) switched the problematic analysis.jsp links to use field type instead of field name
          d) added some more analysis.jsp examples using text_cjk, text_ja, and text_ar (thanks rmuir!)


            People

            • Assignee: Hoss Man
            • Reporter: Hoss Man