SOLR-3442

Example schema switch to DisMax instead of CopyField

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Schema and Analysis
    • Labels:

      Description

      Spinoff from SOLR-3439:

      The use of copyField in today's example schema is an anti-pattern, since we indirectly teach people to duplicate most of their content, while most would be better off using DisMax, or at least a combination.

        Activity

        Chris Male added a comment -

        I think it's a pretty bold claim to call it an anti-pattern. I've seen it successfully used in many projects and it continues to fulfill user needs.

        Jan Høydahl added a comment -

        Sure, I've seen it successfully used too, and I use it myself now and then to reduce the number of fields required in "qf".

        For very small indexes without much need for tuning analysis or relevancy it does not matter very much. But I'm arguing that copyField is the legacy way of searching multiple fields in one go, while DisMax is the current recommendation. So why stick to the legacy in the default example?

        Jack Krupansky added a comment -

        Maybe Solr has outgrown the concept of a single example schema/config. "Full function" and "maximal performance" conflict to some degree, and picking one arbitrary point on the design spectrum does a disservice to those who have varying requirements. The current example already has performance tips and a warning not to use it for benchmarking. And SolrCell documents having "core", common metadata is somewhat distinct from full-custom schema design.

        The copyField to "text" pattern is more clearly targeted at non-dismax users, where "text" is the single default search field.

        This issue essentially raises the question: Is non-dismax query parsing dead? If not, the copyField/text pattern still seems relevant.

        Maybe it would be worth having a modest library of schema/config files that the user can select from when running "example". OTOH, maintaining a lot of somewhat similar files can be a pain. A way to configure the schema/config files (conditionals) would be helpful.

        Jan Høydahl added a comment -

        I'm not saying anything is "dead". Both the "lucene" query parser and copyField have their place and are supported, and you can mix and match these with DisMax to fit your needs. But for the example we should select the most useful and flexible way to show indexing and search, and that is no longer the "text" catch-all and copyField. Aside from doubling the size of your index, it is inflexible in that adding or removing a field from search means a schema update and re-indexing. Catch-all fields with copyField can sometimes be used as a performance optimization, but that is not where you start.
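
To illustrate the flexibility point above, here is a minimal Python sketch of the two request styles. The base URL and the field names (`text`, `name`, `features`, `cat`) are assumptions for illustration and are not taken from this issue; the helper functions are hypothetical. With the catch-all approach the set of searched fields is fixed by copyField at index time, whereas with DisMax it is just the `qf` request parameter.

```python
from urllib.parse import urlencode

# Assumed endpoint and field names, for illustration only; adjust to your setup.
SELECT_URL = "http://localhost:8983/solr/select"


def catch_all_query(user_text):
    """Lucene parser searching a single catch-all field filled by copyField."""
    return {"defType": "lucene", "df": "text", "q": user_text}


def dismax_query(user_text, field_boosts):
    """DisMax searching the source fields directly, listed with boosts in qf."""
    qf = " ".join(f"{field}^{boost}" for field, boost in field_boosts.items())
    return {"defType": "dismax", "qf": qf, "q": user_text}


if __name__ == "__main__":
    print(SELECT_URL + "?" + urlencode(catch_all_query("ipod")))
    # Adding or dropping a searched field only changes this dict, not the index.
    boosts = {"name": 2.0, "features": 1.0, "cat": 0.5}
    print(SELECT_URL + "?" + urlencode(dismax_query("ipod", boosts)))
```

Changing which fields are searched, or how they are weighted, is then a query-time decision rather than a schema change.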

        Maintaining many examples has not proven to be a very good strategy: look at the multi-core and DIH examples, which lag several versions behind when it comes to schema version and new solrconfig syntax. Instead, a single schema which can do both the product search and document search use cases well is easy to achieve. The Velocity GUI can be extended with two tabs if need be, one "products" tab and one "documents" tab. If we choose the example documents to index wisely, e.g. user guides for the products, we get a nice connection: you can search for "ipod" and see both products and user guides matching your search.

        Jack Krupansky added a comment -

        I don't disagree with the gist of your argument, but I would cringe a little if we change the schema so that it doesn't work very well if the user does drop back to the lucene query parser with &defType=lucene, which has only a single default field.

        OTOH, maybe that is simply the cost of making the example schema (and config) be more representative of "best practices". But, that sort of implies that the Lucene query parser is not a "best practice", at least when searchable text content is spread over multiple fields.

        Yonik Seeley added a comment -

        "I would cringe a little if we change the schema so that it doesn't work very well if the user does drop back to the lucene query parser"

        The lucene query parser generally shouldn't be used for user queries, only programmatically generated ones. Using explicit field names (or specifying df) for that case should be fine.
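
As a rough sketch of what a programmatically generated query with explicit field names could look like, the Python below builds a fielded query string and, alternatively, request parameters that set `df`. The helper names and the field names (`name`, `cat`) are hypothetical, and the set of escaped characters is an assumption based on the Lucene query syntax; check it against the version you run.

```python
import re

# Characters treated as special by the Lucene query syntax (assumed list).
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/])')


def escape_term(value):
    """Backslash-escape special characters so the value is matched literally."""
    return _SPECIAL.sub(r"\\\1", value)


def fielded_query(clauses, operator="AND"):
    """Build a query with an explicit field name on every clause."""
    parts = [f'{field}:"{escape_term(value)}"' for field, value in clauses]
    return f" {operator} ".join(parts)


def lucene_params(query, default_field=None):
    """Request parameters for the lucene parser; df is set only if given."""
    params = {"defType": "lucene", "q": query}
    if default_field is not None:
        params["df"] = default_field
    return params


if __name__ == "__main__":
    q = fielded_query([("name", "ipod"), ("cat", "electronics")])
    print(lucene_params(q))
    print(lucene_params("ipod", default_field="name"))
```

Either way, the generated query does not depend on a schema-wide default search field.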

        Jack Krupansky added a comment -

        "The lucene query parser generally shouldn't be used for user queries..."

        If that is the general sentiment, then having the default example user query parser be edismax makes perfect sense.

        Jack Krupansky added a comment -

        When I initially read this issue I mistakenly read it as edismax rather than dismax. So, I would request that the intent be crystal clear - is it reasonable to switch the default query parser handler to edismax, or is it being suggested that the more limited dismax query parser be the new default? If the latter, we won't even be able to query specific fields without config changes.

        Some of the discussion over on SOLR-2368 might be relevant, as to whether the default query for example should be severely "locked-down" as opposed to highly functional (fields, Lucene syntax, etc.)

        I was going to proceed with an edismax-based patch, but now I am not so sure.


          People

          • Assignee: Unassigned
          • Reporter: Jan Høydahl
          • Votes: 0
          • Watchers: 1