SOLR-308

Add a field that generates a unique id when you have none in your data to index

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3
    • Component/s: search
    • Labels:
      None

      Description

      This patch adds a field that generates a unique id when the data you want to index has no unique id of its own.

      1. UUIDField.patch
        5 kB
        Thomas Peuss
      2. UUIDField.patch
        5 kB
        Thomas Peuss
      3. UUIDField.patch
        3 kB
        Thomas Peuss
      4. UUIDField.patch
        7 kB
        Thomas Peuss
      5. UUIDField.patch
        5 kB
        Thomas Peuss

        Activity

        Erik Hatcher added a comment -

        Can the client get the generated id back when adding a document?

        Thomas Peuss added a comment -

        Well, indirectly yes. It is viewable in the response when you store the field. We use this field because we mainly rely on 3rd party data, where we do not have much control over the data.

        Otis Gospodnetic added a comment -

        What type does the id end up being after this? String?

        Hoss Man added a comment -

        i'm confused by this issue .. what's the need?

        solr doesn't require that you have a uniqueKey field, so if there isn't a unique id for your data, why add one artificially?

        Ryan McKinley added a comment -

        If I'm following correctly, this is a FieldType that generates a UUID regardless of the input value:

        public Field createField(SchemaField field, String externalVal, float boost) {
          // We ignore the external value and have our own
          return super.createField(field, UUID.randomUUID().toString(), boost);
        }

        What is a use case for that?

        If you are looking for something like the SQL auto-increment, it might be a good candidate for the newfangled 'UpdateRequestProcessor' – this could check if the input document has a uniqueKey - if not, add one and add the new value to the response.
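
The processor idea above can be sketched without any Solr classes. This is only an illustration under stated assumptions: `UniqueKeyFiller`, `ensureUniqueKey`, and the plain `Map` document are hypothetical stand-ins, not Solr's UpdateRequestProcessor API. The sketch checks whether the document already carries a key, adds a random UUID if it does not, and returns the key so it could be echoed back to the client.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical stand-in for an update hook; NOT Solr's UpdateRequestProcessor API.
class UniqueKeyFiller {

    // Adds a random UUID under keyField when the document has none,
    // and returns the key the document ends up with.
    static String ensureUniqueKey(Map<String, Object> doc, String keyField) {
        Object existing = doc.get(keyField);
        if (existing != null && existing.toString().length() > 0) {
            return existing.toString(); // caller supplied an id: leave it alone
        }
        String generated = UUID.randomUUID().toString();
        doc.put(keyField, generated);
        return generated; // could be added to the response for the client
    }

    public static void main(String[] args) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("catalogId", "vendor-42"); // illustrative field and value
        String id = ensureUniqueKey(doc, "id");
        System.out.println("assigned id: " + id);
    }
}
```

Returning the key from the hook is what would let a client learn the generated id at add time, which is the gap Erik's question points at.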

        Thomas Peuss added a comment - - edited

        The use case is the following:

        • We get catalog data from vendors (300+). We have no control over the data.
        • The only unique thing is the catalogid, which is of course the same for all rows in one catalog.
        • In our webapp we request first only a few fields that are needed for the search result display.
        • When the customer clicks on a product in the search result he gets a detailed page. To get the info from Solr we need a unique id to read the rest of the fields (50+). This id is generated by this code.

        Of course we could add the unique id in a preprocessing step but we wanted to achieve this with Solr alone.

        The update procedure goes like this:

        • Delete all documents with a specific catalogId
        • Insert the updated catalog data

        So you see we need this id to find the exact same document we have in the search result. We do nothing more with it.

        Maybe I overlooked something and this can be achieved with existing code. Any hint is welcome.

        Pieter Berkel added a comment -

        From the use case you have provided, it sounds like the unique id will change every time you delete and re-insert the document. If this is the case, then perhaps it might be more efficient to use the lucene document id as your unique id value rather than a separate field? However, as far as I'm aware, there currently isn't any way to access the lucene doc id from solr (except perhaps the luke request handler)?

        Thomas Peuss added a comment -

        That would be a good replacement for my problem. From the Lucene docs I see that the document id is 32 bits (int). I don't know if the docid "wraps around" when this address space is exhausted (I assume not). Or is the docid field recomputed on "optimize"?

        I will try to add the functionality to see the document id in the response. So we can close this issue for now.

        Yonik Seeley added a comment -

        Lucene docids are transient (they change when the index changes) - they should not be used across different instances of an IndexReader

        Ryan McKinley added a comment -

        The easiest option is to add a UUID when you index the data.

        Other options would be to make this FieldType a plugin and put it in the 'lib' directory.

        Hoss Man added a comment -

        I understood your data entry/delete reindexing strategy, but i hadn't considered the use case of doing a query, and then issuing a followup query to get more details about specific items.

        As yonik points out, exposing the internal lucene docid would be a bad idea since it may change every time an IndexReader is opened ... even if the doc you are interested in is still in the index (ie: hasn't been deleted) other deletions may have changed its internal id.

        i have no objection to adding a FieldType that can generate UUID on demand for use cases like this, but having it ignore the input seems a little sketchy to me. it seems like a better approach would be to have UUIDFieldType with a toInternal() method that tests its input for some marker token (like "NEW" or "*") and if it sees that token, generates a new UUID, otherwise it uses the literal value. then you can configure the id field with a defaultValue of "NEW" in the schema and any doc without an id will get a unique one, but if someone tries to update an existing doc whose id they already know, it will still work as well.
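
A minimal sketch of the marker-token approach described above, assuming "NEW" as the token. The class name `MarkerTokenId` is hypothetical and the method is written standalone rather than as an override of Solr's FieldType, so this shows only the branching logic, not the patch itself.

```java
import java.util.UUID;

// Hypothetical, standalone illustration of the marker-token idea.
class MarkerTokenId {
    static final String NEW = "NEW"; // marker token assumed for this sketch

    // Generates a fresh UUID for the marker token, otherwise keeps the literal value.
    static String toInternal(String val) {
        if (NEW.equals(val)) {
            return UUID.randomUUID().toString();
        }
        return val; // an id the caller already knows is used as given
    }

    public static void main(String[] args) {
        System.out.println(toInternal("NEW"));         // a newly generated UUID
        System.out.println(toInternal("my-known-id")); // the literal value, unchanged
    }
}
```

With the schema default set to the marker token, a document that arrives without an id reaches this method as "NEW" and gets a generated value, while an update that names an existing id passes through untouched.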

        Thomas Peuss added a comment -

        Hoss Man: I will change the code in the way you described. Thanks for your notes on that.

        Thomas Peuss added a comment -

        Patch for a UUIDField and associated test.

        Thomas Peuss added a comment -

        An updated version of the patch. In the XML response the UUIDField is now rendered as <uuid>...</uuid>.

        Hoss Man added a comment -

        a few misc comments...

        1) ...val.startsWith("NEW")... seems like a bad idea, why not just val.equals("NEW") ?

        2) classes like IntField and DateField don't currently do strong parsing validation in the toInternal method, but this UUIDField class does ... should it?

        3) should toObject be strongly typed to return UUID ?

        4) there shouldn't be new methods in the output writers for this field type ... output writers should only need to know about the most primitive types of data that should be viable regardless of the client language (ie: string, int, float, date, list, etc...) the UUIDField should just write itself out as a string (using <str> in the xml response writer)

        Thomas Peuss added a comment -

        1.) I will change it.
        2.) I will remove the check. I understand that this has a performance impact.
        3.) I changed it to what DateField and IntField do.
        4.) I will remove that as well.

        If we don't do strong parsing we should call this IDField instead of UUIDField. If we don't enforce that this is a UUID we shouldn't name it like that. What do you think?

        Thomas Peuss added a comment -

        Changes based on comments...

        Thomas Peuss added a comment -

        BTW: The DateField does strong parsing of the input... It tries to convert the input value to the internal representation and throws a SolrException when that is not possible...

        Thomas Peuss added a comment -

        Added the missing test class and re-added strong checking that the given value is indeed a valid UUID. So this now behaves like DateField.

        Hoss Man added a comment -

        > BTW: The DateField does strong parsing of the input... It tries to convert the input value to
        > the internal representation and throws a SolrException when that is not possible...

        ...no, not quite. DateField.toInternal(String) only does a quick sanity check to see if the string ends in a Z, if it does it assumes it's in the correct date format, and does no parsing – if it does not end in a Z, then it does DateMathParsing (which may include parsing the date and throwing an exception if that can't be done) ... that parsing is only done if necessary for the date math.

        that was my point - if the UUIDField class is going to index the UUID value using the original human readable format, then there isn't really any reason to attempt to parse it – except as a form of validation, i was just raising the question as to whether or not we think it should do that validation.

        Thomas Peuss added a comment -

        Changed the input validation to do only basic checks. We now only check if the thing looks like a UUID.

        Thomas Peuss added a comment -

        I personally would prefer strong input checking. This avoids problems at search time. Better we find the problem at index time than the customer at search time... Maybe I am a bit paranoid here. But we get content from many suppliers and the quality is often not that good (commas instead of dots as decimal separator in floats - even changing from row to row of the catalogue).

        Hoss Man added a comment -

        Thomas: I understand your concerns, but in the balance of performance vs safety Solr tends to err on the side of performance when dealing with indexing data – since that comes from a finite number of controlled sources (you may get it from dozens of places, but you must trust them at least a little and have the chance to sanitize their data before deciding to use it) while query inputs are treated much more delicately since they typically come from a much more diverse group of users many of whom you may outright distrust.

        that said, i went ahead and left in the remaining validation you had, although i had to replace the isEmpty() call (Solr still uses Java 1.5)

        I also changed the toInternal methods to always lowercase whatever value they get (the hex values need to be case insensitive in case someone tries to query/update using a different case than was originally indexed)
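
The thread does not show the exact checks that were committed, so the following is only a sketch of what "looks like a UUID" plus lowercasing could mean. It assumes a length check and a dash-position check; `LenientUuidNormalizer` and `normalize` are hypothetical names, and `IllegalArgumentException` stands in for whatever exception the real field type throws.

```java
import java.util.Locale;

// Hypothetical sketch: cheap shape check plus lowercasing, no full UUID parsing.
class LenientUuidNormalizer {

    // Returns the lowercased value if it has the 8-4-4-4-12 shape of a UUID.
    static String normalize(String val) {
        // length() == 0 instead of isEmpty(), which Java 1.5 does not have
        if (val == null || val.length() == 0) {
            throw new IllegalArgumentException("Empty UUID value");
        }
        boolean shapeOk = val.length() == 36
                && val.charAt(8) == '-' && val.charAt(13) == '-'
                && val.charAt(18) == '-' && val.charAt(23) == '-';
        if (!shapeOk) {
            throw new IllegalArgumentException("Does not look like a UUID: " + val);
        }
        // Lowercase so index-time and query-time values match regardless of input case
        return val.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // prints 550e8400-e29b-41d4-a716-446655440000
        System.out.println(normalize("550E8400-E29B-41D4-A716-446655440000"));
    }
}
```

A check like this deliberately accepts strings that are not valid hex, which is the trade-off discussed above: it catches grossly malformed input cheaply without paying for a full parse on every indexed value.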

        Hoss Man added a comment -

        Committed revision 569279.

        rassen added a comment -

        I have a small question:
        how do I use these files?

        Ryan McKinley added a comment -

        if you are using trunk (the nightly builds, not 1.2) it is included.

        Lance Norskog added a comment -

        This field type and its use is not documented in the Wiki: search for 'UUID' finds only custom code in ExtractingRequestHandler.

        Otis Gospodnetic added a comment -

        Lance - anyone can add/modify a Wiki page. Do you mind adding info about this field type?

        Thomas Peuss added a comment -

        Some documentation can be found here: http://lucene.apache.org/solr/api/org/apache/solr/schema/UUIDField.html

        Thomas Peuss added a comment -

        Fields are defined by:

        <fieldType name="uuid" class="solr.UUIDField" indexed="true" />

        and used by

        <field name="id" type="uuid" indexed="true" stored="true" default="NEW"/>


          People

          • Assignee: Hoss Man
          • Reporter: Thomas Peuss