Solr - SOLR-1229

deletedPkQuery feature does not work when pk and uniqueKey field do not have the same value

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Labels:
      None

      Description

      There is a problem doing a delta-import such that records marked as "deleted" in the database are removed from Solr using deletedPkQuery.

      Here's a config I'm using against a mocked test database:

      <dataConfig>
       <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/db"/>
       <document name="tests">
         <entity name="test"
                 pk="board_id"
                 transformer="TemplateTransformer"
                 deletedPkQuery="select board_id from boards where deleted = 'Y'"
                 query="select * from boards where deleted = 'N'"
                 deltaImportQuery="select * from boards where deleted = 'N'"
                 deltaQuery="select * from boards where deleted = 'N'"
                 preImportDeleteQuery="datasource:board">
           <field column="id" template="board-${test.board_id}"/>
           <field column="datasource" template="board"/>
           <field column="title" />
         </entity>
       </document>
      </dataConfig>
      

      Note that the uniqueKey in Solr is the "id" field, and its value comes from a template, board-<PK>.

      I noticed that the javadoc comment in DocBuilder#collectDelta says "Note: In our definition, unique key of Solr document is the primary key of the top level entity". This, of course, isn't really an appropriate assumption.
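
      For context, the relevant part of my schema.xml looks roughly like this (the field types here are illustrative, not the exact definitions):

       <field name="id" type="string" indexed="true" stored="true" required="true"/>
       <field name="datasource" type="string" indexed="true" stored="true"/>
       <field name="title" type="text" indexed="true" stored="true"/>

       <uniqueKey>id</uniqueKey>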

      Attachments

      1. SOLR-1229.patch
        4 kB
        Noble Paul
      2. SOLR-1229.patch
        41 kB
        Erik Hatcher
      3. SOLR-1229.patch
        3 kB
        Noble Paul
      4. SOLR-1229.patch
        2 kB
        Erik Hatcher
      5. SOLR-1229.patch
        2 kB
        Erik Hatcher
      6. SOLR-1229.patch
        0.8 kB
        Noble Paul
      7. tests.patch
        38 kB
        Lance Norskog

        Activity

        Erik Hatcher added a comment -

        Maybe the pk field returned from deletedPkQuery should be run through the transformation process on the uniqueKey (id in this case) field to get the actual value??

        Noble Paul added a comment -

        There is a huge problem with the current implementation.
        The whole delta-import process is built like an afterthought. I wish to revamp the whole design so that all the rows returned, from deletedPkQuery or deltaQuery etc., also go through the transformations.

        Erik Hatcher added a comment -

        FYI - some prior discussion on solr-user too: http://www.lucidimagination.com/search/document/b8ef8e2e9e1de005/pk_vs_uniquekey_with_dih_delta_import, about trying "select concat('board-',board_id)
        from boards where deleted = 'Y'", to no avail.

        Noble Paul added a comment -

        Erik, please let me know if this helps.

        Lance Norskog added a comment -

        Someone with magic powers, please mark this as "to be fixed in Solr 1.4".

        Erik Hatcher added a comment -

        Noble - that looks like the right thing to add for delta queries, but that didn't help with the deleted one. I've attached a patch that fixes things in my limited test.

        We really need to add some unit tests for this - tricky business though. Lance - do you have some unit tests to add that show it broken and then fixed with this patch?

        Noble Paul added a comment -

        Erik, the fix is not right

        see this

        key = map.get(dataImporter.getSchema().getUniqueKeyField().getName());
        

        The name of the uniqueKey field in the schema and the one you have in the map do not have to be the same. DIH really gives you an option for it to be different. The attribute 'pk' is only used for this purpose.

        Erik Hatcher added a comment -

        Noble - they do have to be the same, and that is why applying the transformations as you patched is part of the right answer. But in order to call writer.deleteDoc(key) in DocBuilder#deleteAll it must use the value of the uniqueKey field, not of the pk one. That is the crux of the issue here. In my case, in the example at the top of this issue, the pk field is board_id with a value of "1", and the id field after applying transformations is "board-1". The delete-by-id to Solr must be done using value "board-1" and that is only obtained by looking up the uniqueKey field (id in this example) from the schema and pulling it from the map.

        The latest patch I supplied worked in my test case and I analyzed it through a debugger.

        Noble Paul added a comment (edited) -

        But in order to call writer.deleteDoc(key) in DocBuilder#deleteAll it must use the value of the uniqueKey field, not of the pk one

        Why can't you keep the pk the same as the uniqueKey, as shown below? It is not used anywhere else (well, there is one other place, which is going to be deprecated). That should solve the problem. DIH is agnostic of the Solr primary key. The point is that the key name does not matter, only the value matters. As long as the value is correct, the delete should work automatically.

        the following should be just fine

        <entity name="test"
        pk="id"
        ...
        
        Erik Hatcher added a comment -

        If the pk attribute is not used, then let's just remove it altogether. DIH should NOT be agnostic to Solr's uniqueKey field though - and it would be silly to mandate users duplicate the uniqueKey setting as DIH pk values when it can just be gotten from the schema. So -1 to misusing a pk-named attribute this way, when the most intuitive interpretation would be that pk is the Primary Key of the table being indexed.

        Erik Hatcher added a comment -

        Here's another patch that removes references to pk from DocBuilder#deleteAll.

        I've just looked for uses of pk myself, and indeed it seems it should be removed. DIH must know and use the uniqueKey field from the schema for updates/deletes - there is no reason for DIH configs to have to specify it again.

        Erik Hatcher added a comment -

        In my latest patch, I added a comment with a question about the use of the loop in deleteAll. I don't see why it is needed. Noble?

        Lance Norskog added a comment -

        There are two bugs in this issue:

        1) The deletedPkQuery must say "select field from table". It may not say "select expression" or "select field AS FIELD from table". The former does not work and the latter may work with some databases and not others. In other words, the first example works and the others fail:

        deletedPkQuery="select id from item where EXPRESSION"
        deletedPkQuery="select concat('',id) from item where EXPRESSION"
        deletedPkQuery="select id from item AS ID where EXPRESSION"
        deletedPkQuery="select id from item AS ITEM.ID where EXPRESSION"

        2) If a template is used to format the contents of the Solr primary key, that template is declared at the <field> level and is not visible in the parent <entity> level. Since the deletedPkQuery is declared in the <entity> level, it cannot see the field's template. Thus the template cannot be applied to the results of the deletedPkQuery. Any formatting in the <field> template must be duplicated (in SQL syntax) in the deletedPkQuery.

        This is a workaround that achieves the goal of #2: if the Solr primary key is declared as

        <field name="id" template="prefix-${item.id}"/>

        then the deletedPkQuery must be declared as

        deletedPkQuery="select concat('prefix',id) from item where EXPRESSION"

        Because of bug #1, it is not possible to use this workaround.

        The attached patch fixes bug #1, making it possible to use the workaround. It is Erik's second patch but only with the fix for bug #1. It does not create an automatic way for results of deletedPkQuery to be formatted with the primary key's template.

        Erik Hatcher added a comment -

        Lance - your patch misses the all important EntityProcessorWrapper.java changes that both Noble and I contributed in our patches. That's needed to make this work!

        Erik Hatcher added a comment -

        I've retested my last patch and all works fine in my simple single-record test case, using the exact configuration (and MySQL DB) posted here. I do a full-import and one record goes in. I then change deleted to 'Y', do a delta-import, and the document is removed from Solr.

        Lance - please fall back to my last patch - the applyTransformer call in EntityProcessorWrapper#nextDeletedRow that I added is crucial. It gives an id=board-1 value which is then used for deletion in DocBuilder#deleteAll.

        Noble Paul added a comment -

        If the pk attribute is not used, then let's just remove it altogether.....

        Hi Erik, you got me wrong. Let me try to explain. There are two potential names for a field: one is the 'column' and the other is the 'name'.

        'column' is the name of the field in the source or in DIH. The 'name' is the name of that field in Solr. DIH uses the 'name' attribute only when it writes the field to Solr.

        The relationship between the 'pk' attribute and the '<uniqueKey>' in Solr is the same. The distinction is important. Otherwise the user would be forced to use the same name in both the DB and Solr (assuming no transformations are done).
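
        For instance, a mapping like the following (the names are just an illustration) reads the DB column board_id and writes it to the Solr field id:

        <field column="board_id" name="id"/>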

        Lance Norskog added a comment -

        Yup, I've deleted my patch. After a lot more examination of the code, and combinatoric testing, I see why it works and is the right fix for this incarnation of the DIH.

        And, yes, unit tests for more of these options are on the way!

        Erik Hatcher added a comment -

        Noble - I'm not following your explanation here. The pk attribute seems like it should simply be removed (and the usages of it refactored to use Solr's uniqueKey setting when doing updated/deleted activities - that's the key to the Solr docs after all).

        Regarding 'column'/'name' - there are cases where it is quite confusing - where column is used to specify the Solr field name - and that feels wrong to me. 'name' should always be the Solr name, and if it is a templated field, it should be 'name'/'template' not 'column'/'template', for example.

        Erik Hatcher added a comment -

        I plan on committing this patch later this week. It fixes the problems we've had. Please review and comment further if there are still holes... but we really should be discussing this with unit tests. I hope to see some of those this week from Lance.

        Noble Paul added a comment (edited) -

        The pk attribute seems like it should simply be removed

        The pk attribute IS REQUIRED for this use case. I guess the solution to your problem is to set the pk correctly to the name you find in the "source row" map. If we remove it, every user will be forced to keep the uniqueKey the same as the name it has in the source DB.

        Take the following use case:

           <entity name="test"
                   pk="db_id"
                   deletedPkQuery="select db_id from boards where deleted = 'Y'">
             <field column="db_id" name="solr_id"/>
           </entity>
        

        elsewhere, in schema.xml

        <uniqueKey>solr_id</uniqueKey>
        

        If you omit that pk attribute, this will fail.

        Erik Hatcher added a comment -

        The pk attribute currently is used to track only modified/deleted records during a delta-import. But only the uniqueKey setting makes sense, and that will get mapped from the transformations applied. I fail to see how removing pk (and of course refactoring its current uses to the uniqueKey value) will cause problems. The only thing that matters here is that transformations are applied such that the uniqueKey field value is accurate. There is nothing in what I'm suggesting here that requires the DB primary key field to match the Solr uniqueKey field name.

        Again, tests are really required so we can be sure we have the bases covered, but it's been tested pretty thoroughly that my patch fixes a serious issue.

        Noble Paul added a comment -

        Hi Erik, it works in your case, but not in the use case I have given above (which is a more common one).

        Let me explain with the above example.

        // this code is taken from your patch
        SchemaField uniqueKeyField = dataImporter.getSchema().getUniqueKeyField();
        // now the value of uniqueKeyField is 'solr_id'
        if (uniqueKeyField == null) return;
        Object key = map.get(uniqueKeyField.getName());
        // the map contains {"db_id" -> "12345"}
        // so key == null; and it will do nothing and return
        // on the contrary, if you set pk="db_id", then it works
        
        Noble Paul added a comment -

        committed r788587

        Erik Hatcher added a comment -

        Noble - I don't really appreciate you committing a partial patch of this. What you committed requires us to set pk="id", which is just silly and nonsensical. I'm opening this back up to continue to fix this awkward area of DIH.

        Shalin Shekhar Mangar added a comment -

        Erik, the most common use-case as far as I have seen is that the primary key in tables is different from the uniqueKey in Solr (think about multiple tables with each having a root-entity). Yes, the pk can be transformed (or one can alias it in sql) but this being the most common use-case, I feel pk should be kept as-is.

        Let me give a few possible cases

        1. The name of table's primary key is different from solr's unique key name and the deletedPkQuery returns only one column (most common use-case)
        2. The name of table's primary key is different from solr's unique key name and the deletedPkQuery returns multiple columns
        3. The name of table's primary key is same as solr's unique key name and the deletedPkQuery returns only one column
        4. The name of table's primary key is same as solr's unique key name and the deletedPkQuery returns multiple columns

        For #1, 'pk' does not matter because we can use the single column coming back from deletedPkQuery
        For #2, 'pk' is required, otherwise the user is forced to use a transformer (or alias). For non-database use-cases (there are none right now), there is no aliasing, so the user must write a transformer
        For #3, neither 'pk' nor 'uniqueKey' matters
        For #4, we can use solr's uniqueKey name (I guess this is your use-case?). I think that this is a rare use-case.

        If we do decide to go with uniqueKey only, the right way to do that would be to use the corresponding column mapping for looking up the unique key. For the example below, we should use "db-id" to look up in the map returned by deletedPkQuery if solr-id is the uniqueKey in Solr:

        <field column="db-id" name="solr-id" />
        

        However, even though the above approach is the 'right' one, it is very tricky and hard to explain to users. Also, there could be multiple columns mapped to the same Solr key (think of a template for the unique key based on a flag column that distinguishes 'types' of documents). This may be very error-prone.

        What do you think?

        Erik Hatcher added a comment -

        In all of those cases, as long as what is returned from the query and run through transformations ends up with a key in the map that is the same as the uniqueKey setting in schema.xml, then all is fine. I still don't see a need to set pk="solr-id". Won't there always be a uniqueKey-named key in that map after transformations are applied? uniqueKey definitely matters... that's the field that must be used for deletions, and what I'm consistently seeing mentioned here is that you want to duplicate that by saying pk="<uniqueKeyFieldName>", which is unnecessary duplication. When would you set pk to anything else?

        It'll be later next week at the earliest, but I hope to get some unit tests contributed so we can discuss this topic through tests rather than prose.

        My use case is exactly the config at the top of this issue, where the uniqueKey value is a templated transformation (because of multiple DIH configurations bringing in various data sources, so the unique key value must be fabricated to be guaranteed to be unique across different datasources that may have the same primary keys in the databases) - this corresponds to your #1.

        Noble Paul added a comment -

        don't really appreciate you committing a partial patch of this

        I am sorry about this. But my fix was good enough for the bug's description. I was planning to open another issue for this.

        Either we should change the description to cover deprecating 'pk', or we can open another issue for the same.
        DIH works as intended with this fix, and the deprecation can be taken up.

        Won't there always be a uniqueKey-named key in that map after transformations..

        The assumption is that a transformation is always done. My experience with DIH support says that users don't always use it. Transformer is just an add-on.

        Anyway, the right fix would be to find out the right mapping in DIH for the given Solr <uniqueKey> and use it as the pk. That can definitely be another JIRA issue.

        Your thoughts?

        Lance Norskog added a comment -

        This patch contains two unit test files addressing the topics of this bug:
        1) database and Solr have different primary key names
        2) solr primary key value is processed with a template

        TestSqlEntityProcessorDelta.java does general tests of delta-import

        TestSqlEntityProcessorDelta2.java does the same tests with a different solr id name and a template for the solr id value.

        Also, there was a mistake in the test fixture which caused delta imports to always empty the index first.

        I believe that these tests should work. Please make all changes necessary so that we have working unit tests for these features, and please change the wiki to match all syntax changes.

        Lance Norskog added a comment -

        Also, I recommend that a feature be removed: creating a deltaImportQuery if the user does not declare one. From the wiki:

        "deltaImportQuery : (Only used in delta-import). If this is not present, DIH tries to construct the import query by (after identifying the delta) modifying the 'query' (this is error prone). There is a namespace ${dataimporter.delta.<column-name>} which can be used in this query, e.g.: select * from tbl where id=${dataimporter.delta.id}. Solr1.4."

        The user should just be required to supply the deltaImportQuery. Now that the deltaImportQuery declaration exists, this "helper feature" is just a trap for the unwary. The phrase "this is error prone" is always a bad sign.
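
        For example, an entity with the delta queries spelled out explicitly might look roughly like this (table and column names are illustrative):

        <entity name="item" pk="id"
                query="select * from item"
                deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'"
                deltaImportQuery="select * from item where id='${dataimporter.delta.id}'">
          <field column="id"/>
          <field column="title"/>
        </entity>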

        Lance Norskog added a comment -

        Shalin, how can a deletedPkQuery return multiple columns? It is supposed to return a list of Solr uniqueKeys. These are then sent in with the Solr delete-by-id operation.

        Erik Hatcher added a comment (edited) -

        Lance - the SQL query could return more than one column. A bit nonsensical, but it could be specified that way. SELECT * FROM BOARDS WHERE deleted='Y', for example.

        And no, it isn't supposed to return Solr uniqueKey IDs!!! That's the main crux of the dilemma here... how to reconcile what is returned from a SQL resultset with Solr unique keys.

        Noble Paul added a comment -

        Also, I recommend that a feature be removed: creating a deltaImportQuery if the user does not declare one. From the wiki:

        It will be removed. But, we cannot remove it from the wiki until we release 1.4.

        Noble Paul added a comment -

        I believe that these tests should work.

        Hi Lance, do the test cases run? Or do we require some fix to make them work? I applied the patch and the test cases are failing.

        Erik Hatcher added a comment -

        Lance - I too get failures with your tests. Here's one example:

        junit.framework.AssertionFailedError: query failed XPath: //*[@numFound='0'] xml response was: <?xml version="1.0" encoding="UTF-8"?>
        <response>
        <lst name="responseHeader"><int name="status">0</int><int name="QTime">1</int><lst name="params"><str name="rows">20</str><str name="start">0</str><str name="q">desc:hello OR XtestCompositePk_DeltaImport_replace_nodelete</str><str name="qt">standard</str><str name="version">2.2</str></lst></lst><result name="response" numFound="1" start="0"><doc><arr name="desc"><str>hello</str></arr><str name="id">1</str><date name="timestamp">2009-06-30T13:55:58.2Z</date></doc></result>
        </response>

        at org.apache.solr.util.AbstractSolrTestCase.assertQ(AbstractSolrTestCase.java:182)
        at org.apache.solr.util.AbstractSolrTestCase.assertQ(AbstractSolrTestCase.java:172)
        at org.apache.solr.handler.dataimport.TestSqlEntityProcessorDelta.testCompositePk_DeltaImport_replace_nodelete(TestSqlEntityProcessorDelta.java:194)

        Lance Norskog added a comment -

        I apologize, I did not put it clearly.

        Yes, these tests do not work against the current code base. I believe they are correct tests for the features we have talked about in this issue, SOLR-1229. The code in these tests should work, when this bug is resolved. They will not work with the partial patch currently committed.

        These tests are variations of the original test set. They handle two cases: 1) the solr uniqueKey has a different name than the DB primary key, and 2) the solr uniqueKey's value is a transformation of the DB primary key's value.

        In other words, if your code actually works, and passes your by-hand tests, it should then pass all of these tests. If there is something wrong in these tests, please fix it.

        Should you wish to write further tests, the pre/post remove query features at present have no coverage.

        Noble Paul added a comment -

        The testcases are wrong.

        Take the case of testCompositePk_DeltaImport_empty().

        The uniqueKey is 'solr_id' and there is no field which maps to 'solr_id'.

        See the mapping:

        <field column="id" name="id"/>
        
        Noble Paul added a comment -

        Erik, here is the new fix.

        This eliminates the need for 'pk' in most cases. It tries to identify the corresponding column mapping for the Solr uniqueKey, but the user can override this by specifying the pk attribute if the guess is not going to be correct.
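
        Roughly, the idea (a sketch, reusing the db_id/solr_id names from my earlier example): with the mapping below, the uniqueKey solr_id can be traced back to the db_id column, so pk can usually be left out and only needs to be set (e.g. pk="db_id") when that guess would be wrong.

        <!-- sketch: pk omitted; the uniqueKey solr_id maps back to the db_id column -->
        <entity name="test"
                deletedPkQuery="select db_id from boards where deleted = 'Y'">
          <field column="db_id" name="solr_id"/>
        </entity>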

        Noble Paul added a comment -

        Erik, did you have a chance to look at the fix?

        Erik Hatcher added a comment -

        Latest patch (adapted by Lance). All tests pass and this will be committed shortly.

        Erik Hatcher added a comment -

        Committed to r792963. Thanks Lance and Noble for iterating on this. We've at least got it working well enough for our current needs.

        Noble Paul added a comment -

        This fix does not allow the user to override the pk key name from the data-config.xml; the deduced key takes precedence. In most cases the pk attribute is not required, but when it is specified it should take precedence.

        Lance Norskog added a comment -

        Noble-

        With this change to DocBuilder.deleteAll(), the Delta2 test runs. With the code in the other order, the two delete operations do not work because the template is not applied.

        Please alter this config file (from unit test Delta2) to embody the use case you mention. The database pk is 'id' and the schema pk is 'solr_id'. And please describe the primary key design of the database and solr schemas.

        <dataConfig>
          <document>
            <entity name="x" pk="id" transformer="TemplateTransformer"
                    query="select * from x"
                    deletedPkQuery="select id from x where last_modified > NOW AND deleted='true'"
                    deltaImportQuery="select * from x where id='${dataimporter.delta.id}'"
                    deltaQuery="select id from x where last_modified > NOW">
              <field column="solr_id" template="prefix-${x.id}"/>
              <entity name="y" query="select * from y where y.A='${x.id}'">
                <field column="desc" />
              </entity>
            </entity>
          </document>
        </dataConfig>

        Noble Paul added a comment -

        Ideally, for your use cases, the pk attribute is not required, so I have removed it. Now it uses the user-provided pk; if it is not present, it falls back to the Solr schema uniqueKey.

        Lance Norskog added a comment (edited) -

        Ok - these run. Thanks.

        Just to make sure I understand. The 'pk' attribute declares 2 things:
        1) that this column must exist for a document to be generated, and
        2) that this entity is the level where documents are created. Is this true?

        tmpid appears as an unused name merely so that ${x.id} is sent into solr_id. Maybe name="" would be more clear for this purpose?

        Lance

        Lance Norskog added a comment (edited) -

        Something is documented on the wiki but not used: multiple PKs in one entity.

        On the wiki page, see the config file after "Writing a huge deltaQuery" - there is an attribute:

        pk="ITEM_ID, CATEGORY_ID"

        There is code to parse this in DataImporter.InitEntity() and store the list in Entity.primaryKeys. But this list of PKs is never used.

        I think the use case for this is that the user requires more fields besides the uniqueKey for a document. Is this right? This is definitely on my list of must-have features. The second field may or may not be declared "required" in the schema, so looking at the schema is not good enough. The field has to be declared "required" in the dataconfig.

        Lance

        (I split this comment out from the previous since they are not related. However, this is another feature inside the same bit of code upon which we are ceaselessly chewing.)

        Noble Paul added a comment -

        1) that this column must exist for a document to be generated, and

        no,
        2) that this entity is the level where documents are created. Is this true?
        no: rootEntity decides that

        pk is currently used for delete. XPathEntityProcessor also uses it to decide if a row is a candidate for a doc.

        multiple pk values were supported for deltas (to construct the deltaImportQuery when it is not present). That is going to be deprecated from 1.4.

        Noble Paul added a comment -

        I guess the latest patch can be committed and this issue can be closed

        Noble Paul added a comment -

        I guess the fix was not done

        Erik Hatcher added a comment -

        Do you have a failing test case?

        Noble Paul added a comment -

        Nope. The current trunk has changed an existing feature: the pk attribute is ignored. That is why I reopened it in the first place.

        Lance Norskog added a comment -

        There are a couple of other features to remove in this code:

        1) multiple primary keys
        2) deltaImportQuery is created automatically if it is not given in the dataconfig.xml file

        Do we want to attack all of those in this issue?

        Noble Paul added a comment -

        Do we want to attack all of those in this issue?

        We must remove them. Let us have a separate issue for them.

        Lance Norskog added a comment -

        The Delta2 test handles both of the problems in this issue. Should this issue be closed?

        Noble Paul added a comment (edited) -

        The issue was reopened because it breaks existing functionality. The latest patch fixes that.

        Lance Norskog added a comment -

        Noble, what is this broken functionality?

        I just did a full update and all unit tests work. If you describe the problem I will do the unit test and fix it.

        Noble Paul added a comment -

        what is this broken functionality?

        Until this fix, the user-provided 'pk' was always honoured. Now the derived pk can never be overridden. My latest patch has the fix and corresponding changes to the tests.

        Lance Norskog added a comment -

        The latest patch works against the latest SVN checkout. Please commit it and close this issue.

        Thanks!

        Noble Paul added a comment -

        committed r807537

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee:
            Erik Hatcher
            Reporter:
            Erik Hatcher
          • Votes:
            1
            Watchers:
            1

            Dates

            • Created:
              Updated:
              Resolved:
