Solr / SOLR-6020

Auto-generate a unique key in schema-less mode if data does not have an "id" field

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.10, 6.0
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      Currently it is not possible to use the schema-less example if my data does not have an "id" field.

      I was indexing data in schema-less mode where the unique field was named "url". This requires first changing the uniqueKey name in the schema, then starting Solr, and only then indexing documents. If Solr has already been started, one first needs to remove managed-schema, rename schema.xml.bak to schema.xml, and then make the necessary changes in schema.xml. I don't think we should fail on such simple things.

      Here's what I propose:

      1. We remove "id" and uniqueKey from the managed schema example
      2. If there's a field named "id" in the document, we use that as the uniqueKey
      3. Else we fall back on generating a UUID or a signature field via an update processor and store it as the unique key field. We can name it "id" or "_id".
      4. But if a uniqueKey is already present in the original schema.xml, then we should expect the incoming data to have that field, and we should preserve the current behavior of failing loudly.

      Attachments

      1. SOLR-6020.patch (12 kB, Shalin Shekhar Mangar)
      2. SOLR-6020.patch (12 kB, Shalin Shekhar Mangar)
      3. SOLR-6020.patch (11 kB, Shalin Shekhar Mangar)
      4. SOLR-6020.patch (38 kB, Vitaliy Zhovtyuk)

        Activity

        Hoss Man added a comment -

        Wouldn't the simplest solution in this case be...

        • leave uniqueKey (id,string) in example-schemaless/solr/collection1/conf/managed-schema
        • add UUIDUpdateProcessorFactory (id) to example-schemaless/solr/collection1/conf/solrconfig.xml ?

        UUIDUpdateProcessorFactory will already do the right thing and not generate a new ID if the document being added already has one.
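
        The "only fill in a value when none is present" behavior described above can be sketched in plain Java. This is a stand-alone illustration, not Solr's code: a java.util.Map stands in for the input document, and the class and method names (UuidDefaultSketch, withDefaultId) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Stand-alone illustration; a plain Map stands in for a Solr input document.
class UuidDefaultSketch {

    // Returns a copy of doc in which fieldName holds a random UUID string,
    // unless the document already carries a value for that field.
    static Map<String, Object> withDefaultId(Map<String, Object> doc, String fieldName) {
        Map<String, Object> copy = new HashMap<>(doc);
        copy.putIfAbsent(fieldName, UUID.randomUUID().toString());
        return copy;
    }

    public static void main(String[] args) {
        Map<String, Object> withId = new HashMap<>();
        withId.put("id", "doc-1");
        withId.put("url", "http://example.com/a");

        Map<String, Object> withoutId = new HashMap<>();
        withoutId.put("url", "http://example.com/b");

        // The existing value is kept for the first document.
        System.out.println(withDefaultId(withId, "id").get("id"));
        // The second document receives a freshly generated UUID string.
        System.out.println(withDefaultId(withoutId, "id").get("id"));
    }
}
```

        Under this assumption, a document keyed only by "url" would still index, because the "id" field is populated before the uniqueKey check is reached.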

        Shalin Shekhar Mangar added a comment -

        +1

        That's even better. I wasn't aware that UUIDUpdateProcessor can do that.

        Hoss Man added a comment -

        A related improvement that might be easy: change UUIDUpdateProcessorFactory so that if no fieldName is configured, it defaults to the uniqueKey field in the schema (if the schema has one - else error just like it does right now if you forget to configure the fieldName on the processor)
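
        A minimal sketch of the fallback rule proposed above, assuming the configured field name and the schema's uniqueKey are both available as nullable strings. The class name, method name and exception type are hypothetical; Solr itself reports configuration errors through its own exception class.

```java
// Stand-alone illustration of the proposed field-name fallback.
class FieldNameResolutionSketch {

    // Prefers the explicitly configured field name, then the schema's uniqueKey,
    // and fails when neither is available.
    static String resolveFieldName(String configuredFieldName, String schemaUniqueKey) {
        if (configuredFieldName != null) {
            return configuredFieldName;
        }
        if (schemaUniqueKey != null) {
            return schemaUniqueKey;
        }
        throw new IllegalStateException(
            "no fieldName configured and the schema defines no uniqueKey");
    }

    public static void main(String[] args) {
        System.out.println(resolveFieldName("uuid_field", "id")); // configured name wins
        System.out.println(resolveFieldName(null, "id"));         // falls back to uniqueKey
    }
}
```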

        Vitaliy Zhovtyuk added a comment -

        Added patch with changes to UUIDUpdateProcessorFactory and test.
        UUIDUpdateProcessorFactory will use the uniqueKey field if it is of UUID type and no field is defined in the processor configuration.
        It may make sense to throw an exception if the configured field or the uniqueKey field is not of UUID type. Currently this is ignored.

        Shalin Shekhar Mangar added a comment -

        Thanks Vitaliy.

        With this patch, a fieldName must still be specified (even if empty) in solrconfig.xml. This isn't ideal. We should be able to omit the fieldName declaration completely and still have it work. We should override the init method in UUIDUpdateProcessorFactory and set fieldName ourselves.

        "It may make sense to throw an exception if the configured field or the uniqueKey field is not of UUID type. Currently this is ignored."

        +1, we should do that.

        Shalin Shekhar Mangar added a comment -

        "We should override the init method in UUIDUpdateProcessorFactory and set fieldName ourselves."

        Oh I see, we can't do that because we need the request object to get the uniqueKey field name. I think UUIDUpdateProcessorFactory should not extend AbstractDefaultValueUpdateProcessorFactory and should instead handle the fieldName itself.

        Shalin Shekhar Mangar added a comment -

        Here's a patch which makes it possible to specify a UUIDUpdateProcessorFactory without specifying a field name. The uniqueKey is automatically picked up in this case.

        I had to make UUIDUpdateProcessorFactory inherit from UpdateRequestProcessorFactory directly instead of going through AbstractDefaultValueUpdateProcessorFactory, because AbstractDefaultValueUpdateProcessorFactory stipulates that the fieldName must be specified. Any workaround would have been ugly.
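
        The shape of that change can be sketched without Solr on the classpath: the factory keeps an optional field name at init time and resolves the final name only when a processor is created for a request, since the uniqueKey is reachable only through the request. Everything here is a simplified stand-in and not the patch itself: Request, Processor and UuidFactory are hypothetical types, and Solr's real factory method also receives the response and the next processor in the chain.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Simplified stand-ins for the factory/processor split; not Solr's classes.
class DeferredFieldNameSketch {

    // Hypothetical per-request handle that exposes the schema's uniqueKey name.
    interface Request {
        String uniqueKeyFieldName();
    }

    interface Processor {
        void processAdd(Map<String, Object> doc);
    }

    static class UuidFactory {
        private String configuredFieldName; // stays null when fieldName is omitted

        // Called once at startup: no request, hence no schema, is available here.
        void init(Map<String, String> args) {
            configuredFieldName = args.get("fieldName");
        }

        // Called per request: the uniqueKey can now be looked up.
        Processor getInstance(Request req) {
            String fieldName = configuredFieldName != null
                ? configuredFieldName
                : req.uniqueKeyFieldName();
            if (fieldName == null) {
                throw new IllegalStateException(
                    "no fieldName configured and the schema defines no uniqueKey");
            }
            return doc -> doc.putIfAbsent(fieldName, UUID.randomUUID().toString());
        }
    }

    public static void main(String[] args) {
        UuidFactory factory = new UuidFactory();
        factory.init(new HashMap<>());           // fieldName omitted entirely
        Map<String, Object> doc = new HashMap<>();
        doc.put("title", "hello");
        factory.getInstance(() -> "id").processAdd(doc);
        System.out.println(doc.containsKey("id")); // the uniqueKey field was filled in
    }
}
```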

        Shalin Shekhar Mangar added a comment -

        This patch adds the UUID processor to the default update chain of the schema-less example (example-schemaless).

        With this change, we can add any doc to the schema-less example and not worry about the unique key. If "id" is present, it is used; otherwise the unique key is set to a generated UUID.

        Shalin Shekhar Mangar added a comment -
        1. Updated javadoc to link to SchemaField
        2. Removed formatting changes to javadocs
        3. Fixed javadoc which said that uniqueKey must be UUID – that's not true anymore, it can be anything which accepts a string.
        4. Fixed a bug in UUIDUpdateProcessor which was checking for fieldName != null needlessly.

        I think this is ready to go.

        Steve Rowe added a comment -

        +1, LGTM

        Erik Hatcher added a comment -

        Ditto, +1, just reviewed the patch and approach. Nice improvement.

        ASF subversion and git services added a comment -

        Commit 1614416 from shalin@apache.org in branch 'dev/trunk'
        [ https://svn.apache.org/r1614416 ]

        SOLR-6020: Auto-generate a unique key in schema-less example if data does not have an id field

        ASF subversion and git services added a comment -

        Commit 1614417 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1614417 ]

        SOLR-6020: Auto-generate a unique key in schema-less example if data does not have an id field

        Shalin Shekhar Mangar added a comment -

        Thanks Hoss, Vitaliy, Steve and Erik!

        ASF subversion and git services added a comment -

        Commit 1614498 from Timothy Potter in branch 'dev/trunk'
        [ https://svn.apache.org/r1614498 ]

        SOLR-6020: Fix broken JavaDoc found by precommit.

        ASF subversion and git services added a comment -

        Commit 1614554 from shalin@apache.org in branch 'dev/branches/branch_4x'
        [ https://svn.apache.org/r1614554 ]

        SOLR-6020: Fix broken JavaDoc found by precommit.

        Shalin Shekhar Mangar added a comment -

        Thanks Tim! I merged your commit to branch_4x as well.


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: Shalin Shekhar Mangar
          • Votes: 0
          • Watchers: 4
