Details

    • Type: New Feature New Feature
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.6, Trunk
    • Component/s: None
    • Labels:
      None

      Description

      We have encountered a use case in our system where we have a few fields (Severity. Risk etc) with a closed set of values, where the sort order for these values is pre-determined but not lexicographic (Critical is higher than High). Generically this is very close to how enums work.

      To implement, I have prototyped a new type of field: EnumField where the inputs are a closed predefined set of strings in a special configuration file (similar to currency.xml).

      The code is based on 4.2.1.

      1. enumsConfig.xml
        0.5 kB
        Elran Dvir
      2. schema_example.xml
        52 kB
        Elran Dvir
      3. Solr-5084.patch
        43 kB
        Elran Dvir
      4. Solr-5084.patch
        35 kB
        Elran Dvir
      5. Solr-5084.patch
        42 kB
        Elran Dvir
      6. Solr-5084.patch
        101 kB
        Elran Dvir
      7. Solr-5084.trunk.patch
        49 kB
        Erick Erickson
      8. Solr-5084.trunk.patch
        49 kB
        Erick Erickson
      9. Solr-5084.trunk.patch
        44 kB
        Elran Dvir
      10. Solr-5084.trunk.patch
        43 kB
        Erick Erickson
      11. Solr-5084.trunk.patch
        43 kB
        Elran Dvir
      12. Solr-5084.trunk.patch
        40 kB
        Erick Erickson
      13. Solr-5084.trunk.patch
        40 kB
        Elran Dvir
      14. Solr-5084.trunk.patch
        40 kB
        Elran Dvir
      15. Solr-5084.trunk.patch
        42 kB
        Elran Dvir

        Activity

        Hide
        Elran Dvir added a comment -

        Thank you, Erick!

        Show
        Elran Dvir added a comment - Thank you, Erick!
        Hide
        Erick Erickson added a comment -

        Thanks Elran!!

        Show
        Erick Erickson added a comment - Thanks Elran!!
        Hide
        Erick Erickson added a comment -

        I had a bit of work to reconcile the trunk and 4x versions, there's StorableField used in the patch that's not available in 4.x, but I think it's just a rename.

        We can open up any issues with enums in new JIRAs.

        Show
        Erick Erickson added a comment - I had a bit of work to reconcile the trunk and 4x versions, there's StorableField used in the patch that's not available in 4.x, but I think it's just a rename. We can open up any issues with enums in new JIRAs.
        Hide
        ASF subversion and git services added a comment -

        Commit 1539128 from Erick Erickson in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1539128 ]

        SOLR-5084: new enum field type

        Show
        ASF subversion and git services added a comment - Commit 1539128 from Erick Erickson in branch 'dev/branches/branch_4x' [ https://svn.apache.org/r1539128 ] SOLR-5084 : new enum field type
        Hide
        Erick Erickson added a comment -

        Elran:

        I broke out the addition of the enums into the schema.xml used for testing into it's own file so we didn't have issues with other tests FWIW.

        Show
        Erick Erickson added a comment - Elran: I broke out the addition of the enums into the schema.xml used for testing into it's own file so we didn't have issues with other tests FWIW.
        Hide
        ASF subversion and git services added a comment -

        Commit 1539111 from Erick Erickson in branch 'dev/trunk'
        [ https://svn.apache.org/r1539111 ]

        SOLR-5084: added enum field type to Solr

        Show
        ASF subversion and git services added a comment - Commit 1539111 from Erick Erickson in branch 'dev/trunk' [ https://svn.apache.org/r1539111 ] SOLR-5084 : added enum field type to Solr
        Hide
        Erick Erickson added a comment -

        Added entry to CHANGES.txt for Solr.

        Show
        Erick Erickson added a comment - Added entry to CHANGES.txt for Solr.
        Hide
        Erick Erickson added a comment -

        Some changes/additions. Mostly this is making tests work. The addition of the enum to the schema.xml file caused failures in other places that used that file but didn't have access to the underlying config files. I suspect they were somehow being found on Elran's machine but not on mine.

        Anyway, new patch. I'll be committing momentarily.

        Show
        Erick Erickson added a comment - Some changes/additions. Mostly this is making tests work. The addition of the enum to the schema.xml file caused failures in other places that used that file but didn't have access to the underlying config files. I suspect they were somehow being found on Elran's machine but not on mine. Anyway, new patch. I'll be committing momentarily.
        Hide
        Erick Erickson added a comment -

        OK, looks good. On my Mac, I get two failures, but I also get those same failures in trunk without the patch.

        So unless someone objects, I'll commit this over the weekend.

        Show
        Erick Erickson added a comment - OK, looks good. On my Mac, I get two failures, but I also get those same failures in trunk without the patch. So unless someone objects, I'll commit this over the weekend.
        Hide
        Elran Dvir added a comment -

        Attached new patch
        Fixes our test issues. Added "putConfig("conf1", zkClient, solrhome, "enumsConfig.xml");" in buildZooKeeper in LeaderElectionIntegrationTest.
        There are still test errors but I don't think it's related to our patch (thread leaked from SUITE scope at org.apache.solr.servlet.SolrRequestParserTest).

        Thanks.

        Show
        Elran Dvir added a comment - Attached new patch Fixes our test issues. Added "putConfig("conf1", zkClient, solrhome, "enumsConfig.xml");" in buildZooKeeper in LeaderElectionIntegrationTest. There are still test errors but I don't think it's related to our patch (thread leaked from SUITE scope at org.apache.solr.servlet.SolrRequestParserTest). Thanks.
        Hide
        Erick Erickson added a comment -

        Fixes EOL warnings on OS X, passes precommit (there were a couple of Javadoc warnings), addes ASF header to one file that didn't have it.

        However, tests don't seem to be passing, not sure whether this is a problem with this patch or not. The tests are distributed e.g. LeaderElectionIntegrationTest

        Show
        Erick Erickson added a comment - Fixes EOL warnings on OS X, passes precommit (there were a couple of Javadoc warnings), addes ASF header to one file that didn't have it. However, tests don't seem to be passing, not sure whether this is a problem with this patch or not. The tests are distributed e.g. LeaderElectionIntegrationTest
        Hide
        Elran Dvir added a comment -

        Hi Erick,

        I have added the necessary JavaDocs. Now 'ant precommit' finishes successfully
        I have attached updated patch.
        You can ignore the 4.x version. It's only here because it's the version I started with.

        Thank you very much!

        Show
        Elran Dvir added a comment - Hi Erick, I have added the necessary JavaDocs. Now 'ant precommit' finishes successfully I have attached updated patch. You can ignore the 4.x version. It's only here because it's the version I started with. Thank you very much!
        Hide
        Erick Erickson added a comment -

        Cleans up a couple of import statements, adds a license header etc.

        Elran:

        Please try running 'ant precommit' after applying the patch. That step complains about missing JavaDocs, there are a couple of files that don't have any.

        What's the purpose of a separate trunk and 4x patch? In general we try to just have a patch for trunk and then use the merge capability. I'm asking mostly to see whether there are incompatibilities between trunk and 4x. If there's no specific reason, I'll ignore the 4x patch and do the merge thing when we commit.

        All:
        If there are no objections, I'll probably commit this over the weekend after we work out the precommit issues and insure that tests run, so speak up....

        Erick

        Show
        Erick Erickson added a comment - Cleans up a couple of import statements, adds a license header etc. Elran: Please try running 'ant precommit' after applying the patch. That step complains about missing JavaDocs, there are a couple of files that don't have any. What's the purpose of a separate trunk and 4x patch? In general we try to just have a patch for trunk and then use the merge capability. I'm asking mostly to see whether there are incompatibilities between trunk and 4x. If there's no specific reason, I'll ignore the 4x patch and do the merge thing when we commit. All: If there are no objections, I'll probably commit this over the weekend after we work out the precommit issues and insure that tests run, so speak up.... Erick
        Hide
        Elran Dvir added a comment -

        Thanks!

        Show
        Elran Dvir added a comment - Thanks!
        Hide
        Erick Erickson added a comment -

        Elran:

        I haven't forgotten, I've just been swamped. I'll try to get to it Real Soon Now, but I'm traveling for the next 10 days. Anyone who wants to go ahead and grab and commit this please feel free!

        Erick

        Show
        Erick Erickson added a comment - Elran: I haven't forgotten, I've just been swamped. I'll try to get to it Real Soon Now, but I'm traveling for the next 10 days. Anyone who wants to go ahead and grab and commit this please feel free! Erick
        Hide
        Elran Dvir added a comment -

        Hi all,

        Did anyone have a chance to examine the latest patch? Is everthing fine?

        Thanks.

        Show
        Elran Dvir added a comment - Hi all, Did anyone have a chance to examine the latest patch? Is everthing fine? Thanks.
        Hide
        Elran Dvir added a comment -

        Hi Erick,

        Thanks for the feedback.

        1) I think your suggestion would be much better, especially if we can keep the syntax fairly compact. However I would like to separate that effort from this change (it might apply to CurrencyField and other use cases so might warrant a different issue)

        1.5) gone in today's patch.

        2)I'm not seeing 17 in the patch, I have ENUM_FIELD_VALUE as 18. I rechecked for the today's patch as well - not sure where the difference is coming from

        3)I am not checking null value of this.max. I am checking null value of the parameter max of the function. So this.max should be set to non-null.

        4) usage of isNullOrEmpty - again, I've removed the isNullOrEmpty code in the Sep 1st patch. Can't explain why you are still seeing it... I also have an override of storedToIndexed showing in my patch

        5) there were inclusive range tests in the function testEnumRangeSearch. I have added now exclusive range tests in today's patch .

        I have removed leading underscores from members in today's patch.

        Thanks!

        Show
        Elran Dvir added a comment - Hi Erick, Thanks for the feedback. 1) I think your suggestion would be much better, especially if we can keep the syntax fairly compact. However I would like to separate that effort from this change (it might apply to CurrencyField and other use cases so might warrant a different issue) 1.5) gone in today's patch. 2)I'm not seeing 17 in the patch, I have ENUM_FIELD_VALUE as 18. I rechecked for the today's patch as well - not sure where the difference is coming from 3)I am not checking null value of this.max. I am checking null value of the parameter max of the function. So this.max should be set to non-null. 4) usage of isNullOrEmpty - again, I've removed the isNullOrEmpty code in the Sep 1st patch. Can't explain why you are still seeing it... I also have an override of storedToIndexed showing in my patch 5) there were inclusive range tests in the function testEnumRangeSearch. I have added now exclusive range tests in today's patch . I have removed leading underscores from members in today's patch. Thanks!
        Hide
        Ryan Ernst added a comment -

        What do you (and others) think about putting the enum right in the schema.xml file,
        perhaps as as many children of <fields><enum></enum><enum></enum></fields> as necessary?
        That would at least keep them together and not introduce a separate file.

        Inlining within schema.xml would definitely be better. But do we need children at all? Why not just an attribute with a comma separated list (and allow for comma escaping)?

        Show
        Ryan Ernst added a comment - What do you (and others) think about putting the enum right in the schema.xml file, perhaps as as many children of <fields><enum></enum><enum></enum></fields> as necessary? That would at least keep them together and not introduce a separate file. Inlining within schema.xml would definitely be better. But do we need children at all? Why not just an attribute with a comma separated list (and allow for comma escaping)?
        Hide
        Erick Erickson added a comment -

        Rough draft comments. I won't be able to do anything more until this evening (Pacific time), in meetings all day.

        Elran

        1> What do you (and others) think about putting the enum right in the schema.xml file,
        perhaps as as many children of <fields><enum></enum><enum></enum></fields> as necessary?
        That would at least keep them together and not introduce a separate file.

        1.5> There's still the IntelliJ headers in the new files, should be removed.

        2> JavabinCodec.java
        SOLRINPUTDOC_CHILDS = 17,

        //Why is this value identical to SOLRINPUTDOC_CHILDS? It may be fine, verifying
        ENUM_FIELD_VALUE = 17,

        3> StatsValuesFactory.updateMinMax. How does this.max ever get set to non-null?

        4> EnumField.java.
        > You have a custom field isNullOrEmpty. There's nothing wrong with this,
        but StringUtils.isBlank already takes care of this, less code to maintain.

        > storedToIndexed isn't referenced and, at least in trunk, doesn't show as being
        overridden (haven't checked 4x). May be OK but....

        5> EnumFieldTest.java
        > How about a couple of tests that exercise inclusive and exclusive ranges with defined
        endpoints rather than [* TO *]?
        > The general coding practice (even though I like the leading underscore for member
        vars personally) is to not have them, please change.

        Show
        Erick Erickson added a comment - Rough draft comments. I won't be able to do anything more until this evening (Pacific time), in meetings all day. Elran 1> What do you (and others) think about putting the enum right in the schema.xml file, perhaps as as many children of <fields><enum></enum><enum></enum></fields> as necessary? That would at least keep them together and not introduce a separate file. 1.5> There's still the IntelliJ headers in the new files, should be removed. 2> JavabinCodec.java SOLRINPUTDOC_CHILDS = 17, //Why is this value identical to SOLRINPUTDOC_CHILDS? It may be fine, verifying ENUM_FIELD_VALUE = 17, 3> StatsValuesFactory.updateMinMax. How does this.max ever get set to non-null? 4> EnumField.java. > You have a custom field isNullOrEmpty. There's nothing wrong with this, but StringUtils.isBlank already takes care of this, less code to maintain. > storedToIndexed isn't referenced and, at least in trunk, doesn't show as being overridden (haven't checked 4x). May be OK but.... 5> EnumFieldTest.java > How about a couple of tests that exercise inclusive and exclusive ranges with defined endpoints rather than [* TO *] ? > The general coding practice (even though I like the leading underscore for member vars personally) is to not have them, please change.
        Hide
        Elran Dvir added a comment -

        Thanks!

        Show
        Elran Dvir added a comment - Thanks!
        Hide
        Erick Erickson added a comment -

        I have a flight coming up, I'll see if I can give it a look-see.

        Show
        Erick Erickson added a comment - I have a flight coming up, I'll see if I can give it a look-see.
        Hide
        Elran Dvir added a comment -

        Hi all,

        Did any have a chance to examine the latest patch?

        Thanks.

        Show
        Elran Dvir added a comment - Hi all, Did any have a chance to examine the latest patch? Thanks.
        Hide
        Elran Dvir added a comment -

        Hi all,

        I attached a new patch that contains:
        1.Integer values in enum configuration are now implicit 0-N
        2.Throw exception when indexed value is not in the configuration (either numeric or string)

        Should I change the example directory to demonstrate the use? If so, how?

        Thanks.

        Show
        Elran Dvir added a comment - Hi all, I attached a new patch that contains: 1.Integer values in enum configuration are now implicit 0-N 2.Throw exception when indexed value is not in the configuration (either numeric or string) Should I change the example directory to demonstrate the use? If so, how? Thanks.
        Hide
        Erick Erickson added a comment -

        Assigned it to myself just so it stays on my radar, Elran is still driving the bus....

        Show
        Erick Erickson added a comment - Assigned it to myself just so it stays on my radar, Elran is still driving the bus....
        Hide
        David Smiley added a comment -

        If you are worried about idiot users, add a comment around the field type to the example:

        +1 A comment suffices.

        Show
        David Smiley added a comment - If you are worried about idiot users, add a comment around the field type to the example: +1 A comment suffices.
        Hide
        Robert Muir added a comment -

        As long as the config forces them to be explicit about the values (and has error checking at startup that the values start a "0" and are monotomicly increasing ints) then anyone who wants to "insert" values into their config is going to have to pause and think about the fact that there is a concrete int associated with the existing values – and is more likely to realize that changing those ints has consequences.

        If the values are implicitly 0, 1, 2, ... n, then why force the user to write that out?

        If you are worried about idiot users, add a comment around the field type to the example:

        <!-- note: you cannot change the order/existing values without reindexing.
             but you can always add new values to the end. -->
        

        Otherwise it just makes the configuration overly verbose to have them write 0..n themselves.

        Show
        Robert Muir added a comment - As long as the config forces them to be explicit about the values (and has error checking at startup that the values start a "0" and are monotomicly increasing ints) then anyone who wants to "insert" values into their config is going to have to pause and think about the fact that there is a concrete int associated with the existing values – and is more likely to realize that changing those ints has consequences. If the values are implicitly 0, 1, 2, ... n, then why force the user to write that out? If you are worried about idiot users, add a comment around the field type to the example: <!-- note: you cannot change the order/existing values without reindexing. but you can always add new values to the end. --> Otherwise it just makes the configuration overly verbose to have them write 0..n themselves.
        Hide
        Elran Dvir added a comment -

        Hi all,

        I attached a new patch.
        The patch is based on trunk.
        It contains changes regarding the issues Robert mentioned (Thanks Robert):
        1. fixed the bug where string inputs weren't mapped into their numeric values in ValueSourceScorer.getRangeScorer and getRangeQuery
        2. removed analysis chain.

        In the next following days, I will attach fixes for the remaining issues:
        1.Verify value strictness on startup (numeric values start at 0, increment by 1).
        2.Throwing exception when indexed value is not in the configuration (either number or string).

        Thank you all.

        Show
        Elran Dvir added a comment - Hi all, I attached a new patch. The patch is based on trunk. It contains changes regarding the issues Robert mentioned (Thanks Robert): 1. fixed the bug where string inputs weren't mapped into their numeric values in ValueSourceScorer.getRangeScorer and getRangeQuery 2. removed analysis chain. In the next following days, I will attach fixes for the remaining issues: 1.Verify value strictness on startup (numeric values start at 0, increment by 1). 2.Throwing exception when indexed value is not in the configuration (either number or string). Thank you all.
        Hide
        Hoss Man added a comment -

        And I still would really like it if we didn't need a separate XML file for each enumerated type: its like a parallel schema.xml: I think it would be much better if we could nest this underneath the fieldtype.

        it would be nice, but as far as i know there is no way for a FieldType to do this – making this FieldType use an attribute to refer to another file (just like ExternalFile field does, or StopWordsFilterFactory, or SynonymFilterFactory, etc...) seems like a suitable approach for now, and if/when someone enhances FieldType configuration in general, then it can be revisted. (ie: it doesn't seem fair to Elran to object to this patch/feature given that he's working iwth the APIs available)

        Finally, I still think the ordinals should be implicit in the list (as i mentioned before). This way the thing can actually be efficient.

        I agree that it makes sense to require that the ordinals be "dense" (ie: start at 0, no gaps allowed).

        But in my opinion, from a usability standpoint, I think it's actually better to force the Solr admin writing the config to explicit about the numeric mappings in the config so that they have to be aware of the fact that a specific numeric value is used under the covers (ie: in hte indexed/docValues fields) for each value that the end users get. It seems like it will help minimize the risk of someone assuming that only the "labels" matter in the configs and the can insert new ones to get the sorting they want.

        Example:

        If the config looked like this...

        <enum name="priority">
          <value>LOW</value>
          <value>HIGH</value>
        </enum>
        

        ...then a user might not realize there is anything wrong with making the following additions w/o re-indexing...

        <enum name="priority">
          <value>NONE</value>
          <value>LOW</value>
          <value>MEDIUM</value>
          <value>HIGH</value>
        </enum>
        

        ...and if they did that they would silently get bogus results – no obvious error at runtime.

        As long as the config forces them to be explicit about the values (and has error checking at startup that the values start a "0" and are monotomicly increasing ints) then anyone who wants to "insert" values into their config is going to have to pause and think about the fact that there is a concrete int associated with the existing values – and is more likely to realize that changing those ints has consequences.

        Show
        Hoss Man added a comment - And I still would really like it if we didn't need a separate XML file for each enumerated type: its like a parallel schema.xml: I think it would be much better if we could nest this underneath the fieldtype. it would be nice, but as far as i know there is no way for a FieldType to do this – making this FieldType use an attribute to refer to another file (just like ExternalFile field does, or StopWordsFilterFactory, or SynonymFilterFactory, etc...) seems like a suitable approach for now, and if/when someone enhances FieldType configuration in general, then it can be revisted. (ie: it doesn't seem fair to Elran to object to this patch/feature given that he's working iwth the APIs available) Finally, I still think the ordinals should be implicit in the list (as i mentioned before). This way the thing can actually be efficient. I agree that it makes sense to require that the ordinals be "dense" (ie: start at 0, no gaps allowed). But in my opinion, from a usability standpoint, I think it's actually better to force the Solr admin writing the config to explicit about the numeric mappings in the config so that they have to be aware of the fact that a specific numeric value is used under the covers (ie: in hte indexed/docValues fields) for each value that the end users get. It seems like it will help minimize the risk of someone assuming that only the "labels" matter in the configs and the can insert new ones to get the sorting they want. Example: If the config looked like this... <enum name="priority"> <value>LOW</value> <value>HIGH</value> </enum> ...then a user might not realize there is anything wrong with making the following additions w/o re-indexing... <enum name="priority"> <value>NONE</value> <value>LOW</value> <value>MEDIUM</value> <value>HIGH</value> </enum> ...and if they did that they would silently get bogus results – no obvious error at runtime. As long as the config forces them to be explicit about the values (and has error checking at startup that the values start a "0" and are monotomicly increasing ints) then anyone who wants to "insert" values into their config is going to have to pause and think about the fact that there is a concrete int associated with the existing values – and is more likely to realize that changing those ints has consequences.
        Hide
        Erick Erickson added a comment -

        Yeah, if you would Actually, 4x (eventually 4.5) would be
        better, and against trunk would be even best. Uf/when we
        apply it we'll merge it into the 4x branch.

        But also take a look at Robert's latest comments, he's one
        of the deep-level Lucene knowledge.

        Best
        Erick

        Show
        Erick Erickson added a comment - Yeah, if you would Actually, 4x (eventually 4.5) would be better, and against trunk would be even best. Uf/when we apply it we'll merge it into the 4x branch. But also take a look at Robert's latest comments, he's one of the deep-level Lucene knowledge. Best Erick
        Hide
        Elran Dvir added a comment -

        Hi Erick,

        I developed the feature with Solr 4.2.1 source code.
        I can create a similar patch based on Solr 4.4. Do you want me to create and attach it?

        Thanks.

        Show
        Elran Dvir added a comment - Hi Erick, I developed the feature with Solr 4.2.1 source code. I can create a similar patch based on Solr 4.4. Do you want me to create and attach it? Thanks.
        Hide
        Robert Muir added a comment -

        I still really don't think we should add an enumerated type that isn't strict about inputs actually matching enumerated values: it loses all of its value to me.

        There are a few inconsistencies where string inputs aren't mapped into their numeric value: like ValueSourceScorer.getRangeScorer and getRangeQuery in the fieldtype (in some cases).

        I don't understand the need for an analysis chain: The TrieTokenizerFactory is unnecessary and I have a patch to remove it on some issue somewhere but just havent gotten around to committing it.

        And I still would really like it if we didn't need a separate XML file for each enumerated type: its like a parallel schema.xml: I think it would be much better if we could nest this underneath the fieldtype.

        Finally, I still think the ordinals should be implicit in the list (as i mentioned before). This way the thing can actually be efficient.

        Show
        Robert Muir added a comment - I still really don't think we should add an enumerated type that isn't strict about inputs actually matching enumerated values: it loses all of its value to me. There are a few inconsistencies where string inputs aren't mapped into their numeric value: like ValueSourceScorer.getRangeScorer and getRangeQuery in the fieldtype (in some cases). I don't understand the need for an analysis chain: The TrieTokenizerFactory is unnecessary and I have a patch to remove it on some issue somewhere but just havent gotten around to committing it. And I still would really like it if we didn't need a separate XML file for each enumerated type: its like a parallel schema.xml: I think it would be much better if we could nest this underneath the fieldtype. Finally, I still think the ordinals should be implicit in the list (as i mentioned before). This way the thing can actually be efficient.
        Hide
        Erick Erickson added a comment -

        Hmmm, it doesn't apply cleanly to 4x or trunk. Usually patches are made against trunk and back-ported/merged to 4x BTW....

        Robert Muir
        Chris Hostetter
        Aside from applying cleanly to the current 4x and trunk, what do you think about the current state of the patch?

        Show
        Erick Erickson added a comment - Hmmm, it doesn't apply cleanly to 4x or trunk. Usually patches are made against trunk and back-ported/merged to 4x BTW.... Robert Muir Chris Hostetter Aside from applying cleanly to the current 4x and trunk, what do you think about the current state of the patch?
        Hide
        Elran Dvir added a comment -

        Hi,

        I have attached a patch containing unit tests.

        Thanks.

        Show
        Elran Dvir added a comment - Hi, I have attached a patch containing unit tests. Thanks.
        Hide
        Elran Dvir added a comment -

        In the latest patch I removed all redundant format changes. There were no changes in logic.
        I hope to attach the patch with unit tests in the next following days.

        Thanks again for the attention.

        Show
        Elran Dvir added a comment - In the latest patch I removed all redundant format changes. There were no changes in logic. I hope to attach the patch with unit tests in the next following days. Thanks again for the attention.
        Hide
        Erick Erickson added a comment -

        I have a few cycles to devote to this.

        Elran Dvir What's the state of the most recent patch? You were going to attach a new patch that had some unit tests, is that forthcoming or did I miss it being attached?

        Show
        Erick Erickson added a comment - I have a few cycles to devote to this. Elran Dvir What's the state of the most recent patch? You were going to attach a new patch that had some unit tests, is that forthcoming or did I miss it being attached?
        Hide
        Robert Muir added a comment -

        I think sorting is a major use case. With some of these previous examples like risk or issue tracker status, you want to sort by the field and for 'high' risk to sort after 'low', maybe 'closed' after 'created' and so on.

        Show
        Robert Muir added a comment - I think sorting is a major use case. With some of these previous examples like risk or issue tracker status, you want to sort by the field and for 'high' risk to sort after 'low', maybe 'closed' after 'created' and so on.
        Hide
        Erick Erickson added a comment -

        Ahhh, OK. Then Hoss says sorting, so no wonder I'm confused!

        There's no reason one couldn't sort by a field of this type, right? Frankly though, it seems kind of low-utility since there are probably only going to be a few values in the common use-case, but I'd guess it's still a possibility...

        Show
        Erick Erickson added a comment - Ahhh, OK. Then Hoss says sorting, so no wonder I'm confused! There's no reason one couldn't sort by a field of this type, right? Frankly though, it seems kind of low-utility since there are probably only going to be a few values in the common use-case, but I'd guess it's still a possibility...
        Hide
        Robert Muir added a comment -

        Wait: i said sort order (not sorting).

        So to me the multivalued case of an enum field makes total sense (it is kinda like java's EnumSet). And the sort order defines what is used in faceting, range queries, and so on.

        Show
        Robert Muir added a comment - Wait: i said sort order (not sorting). So to me the multivalued case of an enum field makes total sense (it is kinda like java's EnumSet). And the sort order defines what is used in faceting, range queries, and so on.
        Hide
        Erick Erickson added a comment -

        @Elran

        bq: Why do you say the assumption is the type is restricted to single value?...

        Parts of the discussion mentioned sorting, which is undefined on multivalued fields. If sorting is required for an enum-type field then it shouldn't be mutliValued. There's no reason it needs to be restricted to single values, it's fine for the enum type to be just like any other field; it's up to the user to only put one value in the field if it's to be used to sorting.

        Mostly getting it straight in my head what the characteristics are, not saying it should be single-valued-only...

        Erick

        Show
        Erick Erickson added a comment - @Elran bq: Why do you say the assumption is the type is restricted to single value?... Parts of the discussion mentioned sorting, which is undefined on multivalued fields. If sorting is required for an enum-type field then it shouldn't be mutliValued. There's no reason it needs to be restricted to single values, it's fine for the enum type to be just like any other field; it's up to the user to only put one value in the field if it's to be used to sorting. Mostly getting it straight in my head what the characteristics are, not saying it should be single-valued-only... Erick
        Hide
        Elran Dvir added a comment -

        The patch is finally attached.
        I'll attach a patch with unit tests ASAP.

        Show
        Elran Dvir added a comment - The patch is finally attached. I'll attach a patch with unit tests ASAP.
        Hide
        Elran Dvir added a comment -

        I failed attaching the new patch. I will attach it ASAP.
        The only changes are formatting.

        Show
        Elran Dvir added a comment - I failed attaching the new patch. I will attach it ASAP. The only changes are formatting.
        Hide
        Elran Dvir added a comment -

        Thank you all very much for your very quick feedback.

        @Hoss,

        1) I eliminated all formatting changes and attached a new patch. I hope it'll be more readable now.
        2) I will try to create unit test as soon as possible and attach it.
        3) I returned the value as EnumFieldValue in JavaBin format because I would like to allow for a use case of sorting the values when returned to my application by SolrJ.
        4) Maybe it could, but I tried to keep the implementation simple and it didn’t appear to give much more value. If someone feels strongly about it I can revisit the implementation

        @Robert,

        In the configuration, I chose to specify the numeric values because I want to also enable indexing and querying numeric values.
        For example, the queries risk:[1 TO 3] and risk:[Low TO High] are both valid.
        Currently:

        • If a bogus string value is sent, the value is indexed as -1.
        • If a bogus integer value is sent, if the value is not a negative number, the value is indexed as is. If it’s negative – the value is indexed as -1.
        • The display value is of course string. If we don’t find a matching label to the numeric value in the configuration, the actual numeric value is displayed.
          Adding a new value at the end, would work.
          Changing a display string for a value, will also work for retrieving data – new data will have to be inserted using the new name (or by int value)
          Removing a legal value from the list would retrieve the numeric value for existing data

        I have no use case for removing a previously legal value, so I can either document the behavior, or implement a different behavior – depending on how this discussion goes

        @Erick,

        I didn't intend to make this type single valued on purpose, it’s just that my use case is single valued. I changed the field's configuration to multi value and it seems to work fine
        (facet, pivot, stats, return stored field). Why do you say the assumption is the type is restricted to single value?
        Thanks again.

        Show
        Elran Dvir added a comment - Thank you all very much for your very quick feedback. @Hoss, 1) I eliminated all formatting changes and attached a new patch. I hope it'll be more readable now. 2) I will try to create unit test as soon as possible and attach it. 3) I returned the value as EnumFieldValue in JavaBin format because I would like to allow for a use case of sorting the values when returned to my application by SolrJ. 4) Maybe it could, but I tried to keep the implementation simple and it didn’t appear to give much more value. If someone feels strongly about it I can revisit the implementation @Robert, In the configuration, I chose to specify the numeric values because I want to also enable indexing and querying numeric values. For example, the queries risk: [1 TO 3] and risk: [Low TO High] are both valid. Currently: If a bogus string value is sent, the value is indexed as -1. If a bogus integer value is sent, if the value is not a negative number, the value is indexed as is. If it’s negative – the value is indexed as -1. The display value is of course string. If we don’t find a matching label to the numeric value in the configuration, the actual numeric value is displayed. Adding a new value at the end, would work. Changing a display string for a value, will also work for retrieving data – new data will have to be inserted using the new name (or by int value) Removing a legal value from the list would retrieve the numeric value for existing data I have no use case for removing a previously legal value, so I can either document the behavior, or implement a different behavior – depending on how this discussion goes @Erick, I didn't intend to make this type single valued on purpose, it’s just that my use case is single valued. I changed the field's configuration to multi value and it seems to work fine (facet, pivot, stats, return stored field). Why do you say the assumption is the type is restricted to single value? Thanks again.
        Hide
        Erick Erickson added a comment -

        Hmmmm...

        So the underlying assumption here is that the enum type is restricted to being single valued. Not saying it should be multiValued, just making sure it's explicit.

        Hoss's point about inserting/removing values in the middle of the list leading to "interesting" behavior is well taken, but I'd rather deal with that by big fat warnings in the documentation, telling users they need to re-index. Let's wait for the compelling use-case before complexifying it...

        FWIW

        Show
        Erick Erickson added a comment - Hmmmm... So the underlying assumption here is that the enum type is restricted to being single valued. Not saying it should be multiValued, just making sure it's explicit. Hoss's point about inserting/removing values in the middle of the list leading to "interesting" behavior is well taken, but I'd rather deal with that by big fat warnings in the documentation, telling users they need to re-index. Let's wait for the compelling use-case before complexifying it... FWIW
        Hide
        Robert Muir added a comment -

        For sorting and value sources etc... nothing special happens – they still have the same numeric value under the covers; it's just that when writing out the "stored" values (ie: label) you act as if they have no value in the field at all (shouldn't affect efficiency at all.)

        Then this is just renaming a label to some special value.

        I really think the best thing is to keep it simple, like java.lang.Enum. Just give a list of values. This way it will be efficient everywhere since the values will be dense. Its also conceptually simple.

        Otherwise, things get complicated. and the implementation may suffer due to sparse "ordinals". Really, i dont care, as docvalues will do the right thing as long as you have < 256 values (regardless of sparsity). Fieldcache wont, but doesn't bother me a bit.

        But still, there is no sense in making things complicated and inefficient for no good reason. Someone could make a HairyComplicatedAndInefficientEnumType for that.

        Show
        Robert Muir added a comment - For sorting and value sources etc... nothing special happens – they still have the same numeric value under the covers; it's just that when writing out the "stored" values (ie: label) you act as if they have no value in the field at all (shouldn't affect efficiency at all.) Then this is just renaming a label to some special value. I really think the best thing is to keep it simple, like java.lang.Enum. Just give a list of values. This way it will be efficient everywhere since the values will be dense. Its also conceptually simple. Otherwise, things get complicated. and the implementation may suffer due to sparse "ordinals". Really, i dont care, as docvalues will do the right thing as long as you have < 256 values (regardless of sparsity). Fieldcache wont, but doesn't bother me a bit. But still, there is no sense in making things complicated and inefficient for no good reason. Someone could make a HairyComplicatedAndInefficientEnumType for that.
        Hide
        Hoss Man added a comment -

        but removing a legal value from the list, i think you need to reindex. Because what to do with documents that have that integer value?

        For sorting and value sources etc... nothing special happens – they still have the same numeric value under the covers; it's just that when writing out the "stored" values (ie: label) you act as if they have no value in the field at all (shouldn't affect efficiency at all.)

        If the user wants some other behavior the burden is on them to re-index or delete the affected docs – but the simple stuff stays just as simple as if they were dealing with the int<->label mappings in their own code, the validation of legal labels just moves from the client to solr.

        Show
        Hoss Man added a comment - but removing a legal value from the list, i think you need to reindex. Because what to do with documents that have that integer value? For sorting and value sources etc... nothing special happens – they still have the same numeric value under the covers; it's just that when writing out the "stored" values (ie: label) you act as if they have no value in the field at all (shouldn't affect efficiency at all.) If the user wants some other behavior the burden is on them to re-index or delete the affected docs – but the simple stuff stays just as simple as if they were dealing with the int<->label mappings in their own code, the validation of legal labels just moves from the client to solr.
        Hide
        Robert Muir added a comment -

        I think adding new values to the end of the list is no issue at all. neither is renaming labels.

        but removing a legal value from the list, i think you need to reindex. Because what to do with documents that have that integer value?

        in general i'm just trying to make sure we keep things sane here, so that the underlying shit is efficient.

        Show
        Robert Muir added a comment - I think adding new values to the end of the list is no issue at all. neither is renaming labels. but removing a legal value from the list, i think you need to reindex. Because what to do with documents that have that integer value? in general i'm just trying to make sure we keep things sane here, so that the underlying shit is efficient.
        Hide
        Hoss Man added a comment -

        If you want to rename a label, thats fine, but you cant really change the sort order without reindexing.

        No, no .. of course not ... i wasn't suggestiong you could change the order, just:

        • remove a legal value from the list (w/o causing hte validation to complain)
        • add new values to the end of the list
        • (as you mentioned) modify the label on an existing value

        See the example i posted before about removing "Medium" but keeping "High" & "Critical" exactly as they are – no change in indexed data, just a way to tell the validation logic you were talking about adding "skip this value, i removed it on purpose" (or i suppose: "skip this value, i'm reserving it as a placeholder for future use")

        Show
        Hoss Man added a comment - If you want to rename a label, thats fine, but you cant really change the sort order without reindexing. No, no .. of course not ... i wasn't suggestiong you could change the order, just: remove a legal value from the list (w/o causing hte validation to complain) add new values to the end of the list (as you mentioned) modify the label on an existing value See the example i posted before about removing "Medium" but keeping "High" & "Critical" exactly as they are – no change in indexed data, just a way to tell the validation logic you were talking about adding "skip this value, i removed it on purpose" (or i suppose: "skip this value, i'm reserving it as a placeholder for future use")
        Hide
        Robert Muir added a comment -

        i dunno ... that seems like it would really kill the utility of a field for a lot of use cases – if it had that kind of limitation, i would just use an "int" field an managing the mappings myself so id always know i could add/remove (EDIT) fields values w/o needing to reindex.

        This isnt really going to work here, because the idea is you want to assign sort order (not just values mapped to ints). If you want to rename a label, thats fine, but you cant really change the sort order without reindexing.

        Show
        Robert Muir added a comment - i dunno ... that seems like it would really kill the utility of a field for a lot of use cases – if it had that kind of limitation, i would just use an "int" field an managing the mappings myself so id always know i could add/remove (EDIT) fields values w/o needing to reindex. This isnt really going to work here, because the idea is you want to assign sort order (not just values mapped to ints). If you want to rename a label, thats fine, but you cant really change the sort order without reindexing.
        Hide
        Hoss Man added a comment - - edited

        Well i guess i look at it differently. That this is in a sense like an analyzer. you cant change the config without reindexing.

        i dunno ... that seems like it would really kill the utility of a field for a lot of use cases – if it had that kind of limitation, i would just use an "int" field an managing the mappings myself so id always know i could add/remove (EDIT) fields values w/o needing to reindex.

        to follow your example: if i completley change hte analyzer, then yes i have ot reindex – but if want to stop using a ynonym, i don't have to re-index every doc, just the ones that had that used that synonyms.

        as far as name->int, you want a hash anyway.

        right ... never mind, i was thinking about it backwards.

        Show
        Hoss Man added a comment - - edited Well i guess i look at it differently. That this is in a sense like an analyzer. you cant change the config without reindexing. i dunno ... that seems like it would really kill the utility of a field for a lot of use cases – if it had that kind of limitation, i would just use an "int" field an managing the mappings myself so id always know i could add/remove (EDIT) fields values w/o needing to reindex. to follow your example: if i completley change hte analyzer, then yes i have ot reindex – but if want to stop using a ynonym, i don't have to re-index every doc, just the ones that had that used that synonyms. as far as name->int, you want a hash anyway. right ... never mind, i was thinking about it backwards.
        Hide
        Robert Muir added a comment -

        ...nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files.

        i don't think the schema parsing code can handle that – it's attribute based, not nested element based

        Right but code can change. Other parts of solr allow this kinda stuff.

        yeah ... it would be tempting to not even let the config specify numeric values – just an ordered list, except:

        1) all hell would break loose if someone accidently inserted a new element anywhere other then the end of the list
        2) you'd need/want a way to "disable" values form the middle of the list from working again.

        Well i guess i look at it differently. That this is in a sense like an analyzer. you cant change the config without reindexing.

        I was actually thinking it would be nice to support multiple legal names (with one canonical for respones) per value, but that would preent the simple array lookps...

        Why? I'm talking about int->canonical name (e.g. in the valuesource impl) not anything else. as far as name->int, you want a hash anyway.

        Show
        Robert Muir added a comment - ...nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files. i don't think the schema parsing code can handle that – it's attribute based, not nested element based Right but code can change. Other parts of solr allow this kinda stuff. yeah ... it would be tempting to not even let the config specify numeric values – just an ordered list, except: 1) all hell would break loose if someone accidently inserted a new element anywhere other then the end of the list 2) you'd need/want a way to "disable" values form the middle of the list from working again. Well i guess i look at it differently. That this is in a sense like an analyzer. you cant change the config without reindexing. I was actually thinking it would be nice to support multiple legal names (with one canonical for respones) per value, but that would preent the simple array lookps... Why? I'm talking about int->canonical name (e.g. in the valuesource impl) not anything else. as far as name->int, you want a hash anyway.
        Hide
        Hoss Man added a comment -

        ...nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files.

        i don't think the schema parsing code can handle that – it's attribute based, not nested element based

        should we enforce from the enum config that the integer values are 0-N or something simple? ...

        yeah ... it would be tempting to not even let the config specify numeric values – just an ordered list, except:

        1) all hell would break loose if someone accidently inserted a new element anywhere other then the end of the list
        2) you'd need/want a way to "disable" values form the middle of the list from working again.

        #2 is a problem you'd need to worry about even if we keep the mappings explicit but enforce 0-N ... there needs to be something like...

          <enum name="severity">
            <pair name="Not Available" value="0"/>
            <pair name="Low" value="1"/>
        
            <!-- value w/o name passes validation but prevents it from being used --><
            <pair value="2"/> <!-- "Medium" use to exist, but was phased out -->
        
            <pair name="High" value="3"/>
            <pair name="Critical" value="4"/>       
        
            <!-- this however would fail, because we skipped 5-10 -->
            <pair name="Super Nova" value="11"/>
          </enum>
        

        ... This way, things like valuesources dont have to do hashing but simple array lookups.

        I was actually thinking it would be nice to support multiple legal names (with one canonical for respones) per value, but that would preent the simple array lookps...

          <enum name="severity">
            <value int="0"><label>Not Available</label></value>
            <value int="1"><label>Low</label></value>
        
            <!-- value w/o name passes validation but prevents it from being used --><
            <value int="2" /> <!-- "Medium" use to exist, but was phased out -->
        
            <value int="3"><label>High</label></value>
        
            <value int="4">
              <label canonical="true">Critical</label>
              <label>Highest</label>
            </value>
          </enum>
        
        Show
        Hoss Man added a comment - ...nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files. i don't think the schema parsing code can handle that – it's attribute based, not nested element based should we enforce from the enum config that the integer values are 0-N or something simple? ... yeah ... it would be tempting to not even let the config specify numeric values – just an ordered list, except: 1) all hell would break loose if someone accidently inserted a new element anywhere other then the end of the list 2) you'd need/want a way to "disable" values form the middle of the list from working again. #2 is a problem you'd need to worry about even if we keep the mappings explicit but enforce 0-N ... there needs to be something like... < enum name= "severity" > <pair name= "Not Available" value= "0" /> <pair name= "Low" value= "1" /> <!-- value w/o name passes validation but prevents it from being used -->< <pair value= "2" /> <!-- "Medium" use to exist, but was phased out --> <pair name= "High" value= "3" /> <pair name= "Critical" value= "4" /> <!-- this however would fail, because we skipped 5-10 --> <pair name= "Super Nova" value= "11" /> </ enum > ... This way, things like valuesources dont have to do hashing but simple array lookups. I was actually thinking it would be nice to support multiple legal names (with one canonical for respones) per value, but that would preent the simple array lookps... < enum name= "severity" > <value int = "0" ><label>Not Available</label></value> <value int = "1" ><label>Low</label></value> <!-- value w/o name passes validation but prevents it from being used -->< <value int = "2" /> <!-- "Medium" use to exist, but was phased out --> <value int = "3" ><label>High</label></value> <value int = "4" > <label canonical= " true " >Critical</label> <label>Highest</label> </value> </ enum >
        Hide
        Robert Muir added a comment -

        I agree with Hossman.. stick with it though, I really like the idea of an efficient enumerated type.

        A few other ideas/questions (just from a glance, i could be wrong):

        • should we enforce from the enum config that the integer values are 0-N or something simple? This way, things like valuesources dont have to do hashing but simple array lookups.
        • it isnt clear to me what happens if you send a bogus value. I think an enumerated type would be best if its "strongly-typed" and just throws exception if the value is bogus.
        • should the config, instead of being a separate config file, just be a nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files.
        Show
        Robert Muir added a comment - I agree with Hossman.. stick with it though, I really like the idea of an efficient enumerated type. A few other ideas/questions (just from a glance, i could be wrong): should we enforce from the enum config that the integer values are 0-N or something simple? This way, things like valuesources dont have to do hashing but simple array lookups. it isnt clear to me what happens if you send a bogus value. I think an enumerated type would be best if its "strongly-typed" and just throws exception if the value is bogus. should the config, instead of being a separate config file, just be a nested element underneath the field type? I dont know if this is even possible or a good idea, but its an that would remove some xml files.
        Hide
        Hoss Man added a comment -

        Elran:

        1) there's still several sections in your patch that have a lot of reformatting making it hard to see what exactly you've added. (I realize that the formatting may not be 100% uniform in all of these files, but the key to making patches easy to read is not to change anything that does't have to be changed ... formatting changes should be done seperately and independently from functionality changes)

        2) could you please add a few unit tests to show how the type can be used when indexing/querying/faceting/returning stored fields so it's more clear what this patch does?

        3) I'm not sure that it makes sense to customize the response writers and the JavaBinCodec to know about hte enum values – it seems like it would make a lot more sense (and by much simpler) to have clients just treat the enum values as strings

        4) a lot of your code seems to be cut/paste from TrieField ... why can't the EnumField class subclass TrieField to re-use this behavior (or worst case: wrap a TrieIntField similar to how TrieDateField works)

        Show
        Hoss Man added a comment - Elran: 1) there's still several sections in your patch that have a lot of reformatting making it hard to see what exactly you've added. (I realize that the formatting may not be 100% uniform in all of these files, but the key to making patches easy to read is not to change anything that does't have to be changed ... formatting changes should be done seperately and independently from functionality changes) 2) could you please add a few unit tests to show how the type can be used when indexing/querying/faceting/returning stored fields so it's more clear what this patch does? 3) I'm not sure that it makes sense to customize the response writers and the JavaBinCodec to know about hte enum values – it seems like it would make a lot more sense (and by much simpler) to have clients just treat the enum values as strings 4) a lot of your code seems to be cut/paste from TrieField ... why can't the EnumField class subclass TrieField to re-use this behavior (or worst case: wrap a TrieIntField similar to how TrieDateField works)
        Hide
        Elran Dvir added a comment -

        I reformatted the code.
        I hope it's OK now.

        Thanks.

        Show
        Elran Dvir added a comment - I reformatted the code. I hope it's OK now. Thanks.
        Hide
        Robert Muir added a comment -

        This issue is interesting but its difficult to see the changes due to all the reformatting in the patch (e.g. changing existing 2-space indent to 4-space indent).

        Is it possible to fix this in the patch? See http://wiki.apache.org/solr/HowToContribute#Creating_the_patch_file

        Thanks

        Show
        Robert Muir added a comment - This issue is interesting but its difficult to see the changes due to all the reformatting in the patch (e.g. changing existing 2-space indent to 4-space indent). Is it possible to fix this in the patch? See http://wiki.apache.org/solr/HowToContribute#Creating_the_patch_file Thanks
        Hide
        Jack Krupansky added a comment -

        Hmmm... I'm wondering how this might tie in with Lucene faceted fields - optimized for a limited number of values.

        Show
        Jack Krupansky added a comment - Hmmm... I'm wondering how this might tie in with Lucene faceted fields - optimized for a limited number of values.

          People

          • Assignee:
            Erick Erickson
            Reporter:
            Elran Dvir
          • Votes:
            1 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development