Derby
DERBY-2212

Add "Unique where not null" to create index

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.2.1.6
    • Fix Version/s: None
    • Component/s: SQL
    • Urgency: Normal

      Description

      Derby prohibits creation of unique constraints on nullable columns (as well as when only some columns in the constraint list are nullable) and treats nulls in unique indexes as normal values (i.e. only one row with null values in the indexed columns may be inserted into the table). This behavior is very restrictive; it does not fully comply with the SQL standard (both letter and intent), with business needs, or with the intended meaning of NULL values (two null values are not considered equal: this comparison shall return NULL, and for selection criteria a boolean null is treated as FALSE).
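      To make the described restriction concrete, here is a minimal sketch (table and index names are made up for illustration) of the behavior in question:

```sql
CREATE TABLE emp (id INT PRIMARY KEY, badge VARCHAR(10));
CREATE UNIQUE INDEX emp_badge_idx ON emp (badge);

INSERT INTO emp VALUES (1, NULL);
-- A second row with a NULL badge is rejected, because Derby
-- treats the two NULLs as equal values in the unique index:
INSERT INTO emp VALUES (2, NULL);
```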

      This behavior, as far as I can see, is modelled after DB2 (and differs from the behavior of most other major databases, like Sybase, Oracle, etc.).

      But even DB2 provides some means to alleviate these restrictions, namely the "UNIQUE WHERE NOT NULL" clause for the CREATE INDEX statement.

      It would be very good if such a "UNIQUE WHERE NOT NULL" clause were introduced in Derby.
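      A hypothetical sketch of the requested syntax, modelled on DB2's CREATE INDEX clause (the exact keyword placement in Derby would be a design decision):

```sql
-- Uniqueness would be enforced only for keys in which
-- every indexed column is non-null:
CREATE UNIQUE WHERE NOT NULL INDEX emp_badge_idx ON emp (badge);
```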

      Regards,
      Oleksandr Alesinskyy

      1. derby-2212preview.diff
        16 kB
        Anurag Shekhar
      2. derby-2212preview2.diff
        24 kB
        Anurag Shekhar
      3. FunctionalSpec.html
        2 kB
        Anurag Shekhar
      4. FunctionlaSpecv2.html
        4 kB
        Anurag Shekhar
      5. FunctionalSpecV3.html
        6 kB
        Anurag Shekhar
      6. FunctionalSpecV3_comment.html
        6 kB
        Mike Matrigali

        Issue Links

          Activity

          Uwe Kubosch added a comment -

          Thank you for the example! That helps a lot!

          I still vote for changing CREATE UNIQUE INDEX to have the same semantics as on PostgreSQL and others, and have a compatibility flag for the old behavior.

          Knut Anders Hatlen added a comment -

          Uwe,

          You can create such a constraint with CREATE TABLE or ALTER TABLE:

          CREATE TABLE T (X INT, Y INT, Z INT, UNIQUE(X,Y))

          or

          CREATE TABLE T (X INT, Y INT, Z INT, CONSTRAINT C_XY UNIQUE (X,Y))

          or

          CREATE TABLE T (X INT, Y INT, Z INT)
          ALTER TABLE T ADD CONSTRAINT C_XY UNIQUE (X,Y)

          or, if the unique constraint is on only one column,

          CREATE TABLE T (X INT UNIQUE, Y INT, Z INT)

          Uwe Kubosch added a comment -

          I have reread the previous comment a dozen times, but I cannot understand what the solution is.

          DERBY-3330 made it into the released 10.4.2.0 version, right?

          Could you please give an example of how to create a constraint equaling a unique index for one or more nullable columns while still allowing multiple rows with the same values as long as at least one value included in the constraint in each row is NULL? That is, the constraint should only enforce uniqueness for rows where all values in the columns included in the constraint are non-null.

          An example would make it clear how to solve the problem. If there is further documentation on this, please include a reference.

          Mike Matrigali added a comment -

          DERBY-3330 has been implemented, which provides the SQL standard unique constraint that allows multiple nulls. I do not believe anyone is currently working on providing the "Unique where not null" syntax for create index. The same behavior as creating the index can be achieved by using the alter table add constraint syntax.

          If anyone needs this, the project would not be too hard, as it would be mostly a parser project, and of course new testing and possible changes to existing tests. The underlying technology to support such an index has been implemented as part of DERBY-3330.

          The consensus at the time of the DERBY-3330 implementation was to avoid breaking existing applications, and continue to support the existing behavior of Derby unique indexes (i.e. only allowing a single null), unless a user used a new syntax that specifically called for multiple nulls. DERBY-3330 enabled the constraint case, as that was the SQL standard, and did not implement the create index syntax, as it seemed redundant to provide two different ways to do the same thing.
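          The distinction described here can be sketched as follows (table and constraint names are made up; the constraint form shows the DERBY-3330 behavior, which allows multiple nulls):

```sql
CREATE TABLE t (x INT, y INT);
ALTER TABLE t ADD CONSTRAINT c_x UNIQUE (x);

INSERT INTO t VALUES (NULL, 1);
INSERT INTO t VALUES (NULL, 2); -- allowed: NULL keys do not collide
INSERT INTO t VALUES (1, 1);
INSERT INTO t VALUES (1, 2);    -- rejected: duplicate non-null key
```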

          Anurag Shekhar added a comment -

          I have created a separate issue, DERBY-3330, to handle unique constraints over nullable fields and am moving my work there. I will not be working on this issue for the 10.4 release.

          Øystein Grøvlen added a comment -

          Thanks for the updated func spec, Anurag.

          I agree with the others who have commented that it would be good if
          you also include some implementation notes in your spec.

          I do not see that you have addressed my previous comments from Oct 31:

          • I would still like you to explicitly state whether you propose
            to keep the existing behavior for unique index.
          • I still have a hard time understanding the section on hard
            upgrade. What do you mean by "No update will be required ..."?
            Update of what? What is the difference between what is stated in
            the second sentence and in the first part of the third sentence?

          Some comments on the test case section:

          • I also think you should test updates that change column values
            to/from null, not just inserts.
          • I do not understand what you mean by testing "null
            ordering of unique constraints".
          • I would think adding not null constraints also needs to be
            tested.
          • You will probably also need some upgrade tests.
          Oleksandr Alesinskyy added a comment -

          The initial request (filed by me) was to allow "unique where not null" indexes
          (à la DB2).

          The latest expressed intention was to allow unique constraints that permit
          duplication of nulls (i.e. if at least one of the constrained columns contains
          a null value, the record satisfies the constraint regardless of which other
          records are already present in the table).

          Regards,
          Oleksandr

          Mike Matrigali added a comment -

          FunctionalSpecV3_comment.html contains some typo/grammatical error corrections
          to the FunctionalSpecV3.html version.

          Mike Matrigali added a comment -

          I do want to say that once it is clear what functionality is being provided, it would be great to then write up the implementation details on how we are going to achieve that functionality. I requested we first concentrate on the functionality, as it was not clear to me at all what was being implemented - just a patch that did something to index internals. It was impossible for me to comment on the proposed implementation without understanding what the actual change wanted to accomplish.

          I think I now understand the functionality being proposed, though the spec could use a little grammatical cleanup. I have also, in the discussion, given one approach that I believe will fit well with the existing system to implement the new unique constraint with nulls functionality.

          I did not mean to stifle the implementation discussion/writeup - I just wanted to first understand what the proposed functionality was. I may have been imposing my closed source background, where functional specs and implementation specs were always two different documents and were written/discussed/reviewed separately.

          Oleksandr Alesinskyy added a comment -

          >The functional spec shouldn't have any reference to the index, especially because it talks about a feature related to the unique constraint.
          >How it is implemented shouldn't be mentioned in this document. (Please check Mike's feedback on the first version)

          I have seen Mike's feedback, but I do not agree with it. Yes, implementation details shall be clearly separated from the specification itself, but,
          if they are important and affect interaction with other parts of the system (as in this case), they may and shall be mentioned, while clearly marked as such.
          I.e. the specification structure may be as follows:

          Functional specification

          1. Specification itself
          Bla-bla-bla....

          2. Implementation notes
          Bla-bla-bla....

          Anurag Shekhar added a comment -

          >In my opinion the functional specification must point out that for
          the time being the implementation depends on an appropriate extension
          of the behavior of the unique index, and refer to the task of exposing
          this new behavior to the user.

          >One more point - this change will for sure affect optimizer behavior
          and will require modifications in it, but the specification completely
          misses this point.

          The functional spec shouldn't have any reference to the index, especially because it talks about a feature related to the unique constraint. How it is implemented shouldn't be mentioned in this document. (Please check Mike's feedback on the first version)

          >There are still hordes of typos (both in the functional specification
          and your last comment). I'm really afraid that the patch source would
          be done with the same level of accuracy.

          Please don't be afraid; anything committable goes through a really strict process before becoming part of the source code. The chances of bad code finding a place in the code base are really, really low.

          Thanks for pointing out the mistakes. I will upload a new version.

          Oleksandr Alesinskyy added a comment -

          Anurag,

          There are still hordes of typos (both in the functional specification
          and your last comment). I'm really afraid that the patch source would
          be done with the same level of accuracy.

          In my opinion the functional specification must point out that for
          the time being the implementation depends on an appropriate extension
          of the behavior of the unique index, and refer to the task of exposing
          this new behavior to the user.

          One more point - this change will for sure affect optimizer behavior
          and will require modifications in it, but the specification completely
          misses this point.

          Regards,
          Oleksandr



          Anurag Shekhar added a comment -

          Thanks Oleksandr and Øystein for the feedback. In this new version I have fixed the typos and elaborated the sections related to the introduction, test cases and documentation.

          About the changes in the unique index: a new index will be required to support unique constraints on nullable columns, but whether to expose the new type of index to the user is not clear.
          Probably I should split this issue into one about the unique constraint and another about exposing the backing index directly to the users. I will take up the 2nd task too if it is decided that it is a good idea to have it.

          Anurag Shekhar added a comment -

          Sorry about the duplicate comment. Initially I had added the comment by mail, but when it didn't show up in JIRA I added it again from the web interface. It seems the mail I had sent before is now showing up after 2-3 weeks.

          Anurag Shekhar added a comment -

          Thanks Øystein, for the review and for trying out the patch.

          1. I do not understand the reasoning behind the return values. I notice that if I swap them, dropping a table does not fail anymore (I have not checked if that breaks other operations).

          In this case the return value (1) is not important, but the sign is important, because it determines the sequence of sorted keys (used while creating an index on a table with existing records). While creating the B-tree for the first time, the keys are supposed to be in sequence, and this is verified by invoking a method in ControlRow. So if duplicate nulls are present in the table, they should still conform to the sequence (depending on the ascOrDesc[] flag passed).
          Yes, I saw that the drop works fine if I change the sequence, but in that case creating an index on a table with existing records will fail.

          I have found the problem that is causing the drop table to fail. It happens while scanning the catalog's table to remove the table index. Some of the attributes in the index record are null, and while comparing, these two nulls are treated as unequal and the delete fails. I am making changes to fix this and will post them after incorporating your other comments.

          2. What are the implications if there are several columns for which both have null? You only test ascOrDesc for the latest column.

          I am changing this part of the code. Now the first null will ensure a mismatch, and the ascOrDesc flag for that column will be used. But I think it shouldn't matter for which null ascOrDesc is used, as this is used only by the Sorter, and it doesn't make any difference in which order these two records are sorted (at least until we have something like a row id).

          I will get back about the other questions.

          Øystein Grøvlen added a comment -

          Thanks for the new version of the func spec, Anurag.

          In this version you do not mention anything about 'CREATE UNIQUE INDEX'. I guess that means that you propose to keep the existing behavior, and not introduce a new syntax to match the behavior of unique constraints for nullable columns. I think it would be good if that was stated explicitly in the func spec, and you should also explain why you think a new CREATE INDEX is not necessary. (And if this is no longer about CREATE INDEX, maybe this func spec is attached to the wrong issue?)

          Under 'Hard Upgrade' you write: 'No update will be required as this won't change the existing behavior'. I am not quite sure what that is supposed to mean. Do you mean that a user will not have to make any changes to her applications after a hard upgrade? And what do you mean by 'existing behavior'? That constraints defined as 'UNIQUE NOT NULL' will work as before? I think it needs to be defined what is meant by "existing behavior".

          Another issue that is not clear to me is the implications of changing the nullability of a column with ALTER TABLE. It should probably be discussed how that will be affected by this work.
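          One concrete case behind that question, sketched with made-up names: when a column backing a unique constraint is made nullable, it is unclear whether the statement should be allowed, and if so whether the backing index must be converted to the null-tolerant variant:

```sql
CREATE TABLE t (x INT NOT NULL, CONSTRAINT c_x UNIQUE (x));
-- Does this succeed, and does the backing index for c_x then
-- have to switch to the null-allowing implementation?
ALTER TABLE t ALTER COLUMN x NULL;
```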

          Oleksandr Alesinskyy added a comment -

          New version of specification looks much better, but still has some problems.

          1. There are too many typos, and they sometimes affect the meaning, e.g.

          "null,null,1 null.null,1 no exception" - what is "null.null" (dot instead of comma)?

          "null,1,null null,null,1 no exception" - likely should be "null,1,null null,1,null no exception"

          2. After "no exception" there shall be "and the table contains both inserted rows" (otherwise it is legal to assume that the second row may be silently swallowed).

          3. The issue is named 'Add "Unique where not null" to create index', but discussion of unique indexes is completely absent from the functional specification.

          Regards,
          Oleksandr

          Anurag Shekhar added a comment -

          In this version of the functional spec I have removed all implementation details and have added new sections relating to upgrades and test cases.

          I have added a few more examples and have tried to explain them, but the list of scenarios still doesn't cover all possible combinations of 3-part keys. I felt there would be too many combinations to list, and that the description of the functionality would be able to explain what the behavior will be in each case.

          Oleksandr Alesinskyy added a comment -

          This change not "would" but "may" introduce incompatibilities, as it is quite possible to avoid incompatibilities altogether.

          Kathey Marsden added a comment - edited

          I have not been following this conversation in detail but did look at the functional spec. I did not feel particularly clear after reading it as to the incompatibilities that this change would introduce. In the revised spec, could you give an example of an SQL statement that would cause a difference in behavior after the change? Perhaps include the proposed release note so it is clear.

          Thanks

          Kathey

          Anurag Shekhar added a comment -

          Thanks Mike for the comments. I will update the functional spec and upload it shortly.

          Mike Matrigali added a comment -

          comments on 1st draft of functional spec:

          1) In section "Proposed behavior of unique Constraint"
          There should be more than a set of examples. There should be some description
          of the behavior being implemented. Examples are excellent but are not a
          complete description of the behavior. For instance even in the 3 column
          key case the following cases are not included:
          value, null, null
          value, null, value
          value, value, null
          null, value, null
          null, value, value

From the discussion it looks like a key difference between some DBs is
whether one (1, null, null) row and another (1, null, null) row are both
allowed. So this would be a good explicit example to use.

          I think what you are proposing is that if any column in a key contains a null
          then no duplicate checking is performed on insert.
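
The behavior Mike describes here can be demonstrated concretely: SQLite, like PostgreSQL, already treats NULLs as distinct in unique indexes, so a short sketch against it shows which inserts the proposed semantics would accept. Purely illustrative — this is SQLite, not Derby, and the table and index names are invented:

```python
import sqlite3

# SQLite, like PostgreSQL, treats NULLs as distinct in unique indexes,
# which matches the behavior proposed here for Derby.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
conn.execute("CREATE UNIQUE INDEX idx ON t (a, b, c)")

# Two identical keys containing a NULL: both inserts succeed, because
# no duplicate checking applies when any key column is NULL.
conn.execute("INSERT INTO t VALUES (1, NULL, NULL)")
conn.execute("INSERT INTO t VALUES (1, NULL, NULL)")

# Fully non-NULL duplicate keys are still rejected.
conn.execute("INSERT INTO t VALUES (1, 2, 3)")
try:
    conn.execute("INSERT INTO t VALUES (1, 2, 3)")
except sqlite3.IntegrityError as e:
    print("duplicate rejected:", e)
```

Under Derby's current behavior, the second `(1, NULL, NULL)` insert would fail as well.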

          2) I don't agree with the following:
          Unique Constraint is internally backed by Index so for Unique Constraint to
          support a feature it is mandatory to have a type of index which supports same
          behavior. So to support T591 there is a need of a unique index which doesn't
          treats nulls as equals ie allows duplicates as long as at least one part of
          the key is null.

          This describes how derby currently implements unique constraint on non-nullable
          columns, but is an internal implementation detail. There is nothing that
          says unique constraint on nullable columns must use the same internal
          implementation as the one used for non-nullable columns. All that is necessary
is that, once a user declares a unique constraint on a key with a nullable
value, it implements the SQL standard. The standard does not mandate that a
"unique index" be
          used. There are a number of ways such a feature could be implemented without
          it being "mandatory" to have a unique index that allows duplicates.

          3) it would be nice to separate implementation from function. The key
          functional information I would like to see are:
          a) In case of soft upgrade, what is the behavior of create unique index
          on nullable columns.
          b) In case of soft upgrade, what is the behavior of inserts into
          pre-existing unique indexes on nullable columns.
c) in case of hard upgrade, what is the behavior of create unique index
on nullable columns.
d) in case of hard upgrade, what is the behavior of inserts into pre-existing
unique indexes on nullable columns.
          e) will there be new syntax to create index to allow for the creation
          of a unique index on nullable columns with different behavior with
          respect to nulls than the existing create index behavior.
          f) In case of soft upgrade, what is the behavior of creating constraint
          on nullable columns.
          g) In case of hard upgrade, what is the behavior of creating constraint
          on nullable columns.

          Discussion on some of these points is going on, but need to understand
          intended function before implementation.
          My preference for item 3 would be to keep existing behavior of unique
          indexes on nullable columns and only implement the new behavior as part
          of implementing SQL feature T591, unique constraints on nullable columns.
          This avoids a number of upgrade/backward compatibility problems and even
          possible performance regressions for existing indexes depending on
          implementation.

          So the answers would be:
          a) no change to current behavior
          b) no change to current behavior
          c) no change to current behavior
          d) no change to current behavior
          e) no new syntax necessary, only enabling existing syntax to work when
          requesting constraint on nullable columns.
          f) Existing error would be thrown.
g) create constraint on nullable column would succeed and would implement
          SQL standard behavior. Actual implementation specifics to be determined
          later.

          Anurag Shekhar added a comment -

          Functional Spec for Unique Constraint

          Oleksandr Alesinskyy added a comment -

To avoid user confusion, the current syntax should retain the existing behavior, and a new, consistent syntax extension should be provided for both indexes and constraints.

          Øystein Grøvlen added a comment -

          I agree that we should try to avoid incompatibilities with older releases, but I am just as worried that a slightly different semantics for unique constraints and unique index will be confusing to users regardless of whether we add another syntax or not.

          Also, Derby will today not allow the explicit creation of an index if such an index already has been implicitly created through the definition of a constraint. How should this behave if we keep the old index behavior? Should we allow two different kinds of unique indexes on the same columns?

          With respect to upgrade, I assume that after soft upgrade only the old type of unique indexes should be created.
After hard upgrade, you should get the new behavior. The question is whether we should replace any of the old unique indexes with the new type on hard upgrade. I am not sure I see any good reason for doing that. Can we not leave it to the user to decide whether he/she wants to drop and recreate them?

          Oleksandr Alesinskyy added a comment -

Yes, a unique constraint is not the same thing as a unique index, but it requires a unique index to be enforced (yes, it is an implementation detail, but it exists for now and for the foreseeable future as far as I can tell). So we end up with two different behaviors of the unique index depending on the mode of creation - direct or indirect.

As I have already pointed out in this discussion, it would cause headaches for users (and major ones if the two kinds were not clearly distinguishable in dictionary views).

          Regards,
          Oleksandr

          Mike Matrigali added a comment -

No, I don't think you have overlooked a compatibility issue. But you do recognize there
is a difference in behavior. I have found that no matter how slight the behavior change,
there will be some number of applications that depended on it. I would rather see Derby
not introduce this incompatibility.

          If the proposal is to introduce the incompatibility I would like to see how this will be handled in soft vs. hard upgrade and in both cases how it will be handled by existing indexes vs. indexes created after the soft or hard upgrade.

So I would prefer that a constraint on a nullable column have the new behavior, while a
unique index (which is not the same thing as a unique constraint) keeps the old behavior.
I also don't think we should create a new create index with extra syntax for the new
behavior, as the functionality is fully available via the constraint syntax; but if someone
feels it is necessary or more user friendly I would not care much.

          Oleksandr Alesinskyy added a comment -

I may agree as far as we are speaking about single-column indexes, but for multi-column indexes, where not all columns need to be null to allow duplicates, things may get worse.

          Øystein Grøvlen added a comment -

I do not see how changing the existing create unique index command to create an index which allows multiple null values will create severe backward compatibility issues. The new index would be less restrictive than the existing index implementation, so all operations that succeed on the current index should also succeed with the new implementation.

I see that applications can no longer rely on a unique index to ensure that only a single null value (but not two) is inserted, but I doubt that such needs are very common.

          Have I overlooked some compatibility issues here?

          Oleksandr Alesinskyy added a comment -

Keeping backward compatibility in this case shall be of much higher priority than compliance with the SQL standard (BTW, every major database deviates from the SQL standard on this point). So, what would we achieve if CREATE INDEX did not provide additional syntax?

There are 2 possibilities:

1. CREATE UNIQUE INDEX retains existing behavior. Then indexes created implicitly by a UNIQUE constraint would differ in behavior from indexes created by CREATE UNIQUE INDEX. Customer headache is ensured.

2. CREATE UNIQUE INDEX creates indexes with the new behavior. Then many existing applications will be broken. Instead of a headache, customers encounter a pain in the ass. Not much better.

          Daniel John Debrunner added a comment -

I would recommend implementing the SQL standard behaviour (UNIQUE constraint on nullable columns) as a first step.

A second, separate issue could be to add additional syntax to CREATE INDEX, but I personally don't see the point. Why add new alternate syntax that will confuse the user (just have one way to achieve one thing), especially as the new syntax would be non-standard while a standard way exists.

          Mike Matrigali added a comment -

Derby currently does not support creating a unique constraint on a column with nullable keys, and SQL defines what the behavior should be in this case, which I believe is the intent of this work. So in the case of constraints I don't think additional syntax is necessary; internally we would use this new implementation to allow a unique constraint that supports duplicate nulls.

If we decide to also allow creating indexes through the create index command that
match this behavior, then I agree additional syntax should be used.

I also agree that this implementation would lead to a "third" kind of index, and that should be somehow reflected in the system catalog description of the index. Using this, the optimizer
should be able to make correct assumptions about non-null uniqueness. Some work will
have to be done on code that tries to reuse existing indexes (when a user asks for the same kind of index to be created) so that it understands the difference between this index and the others.

I didn't put much thought to the create index behavior. The storage system does provide a way to optimize building indexes: it allows the tree to be loaded at create time by providing
a sorted stream of rows. The btree itself does not own how the stream is produced. It already does a sanity check on each adjacent row to insure the sort order of the keys, so it would be simple to add a different check in the case of this new btree type. See
BTreeController.java!do_load_insert().

Derby does sometimes optimize this by throwing all rows through the sorter and then building the tree from the bottom up. As was suggested, this method could still be used at the cost of adding a single compare of
each row to its previous row if it has no null in it. The system already does a compare at this
stage to insure

          Oleksandr Alesinskyy added a comment -

I guess changing the behaviour of existing unique indexes/constraints is way too dangerous; many applications will suffer.
So syntax that allows specifying how nulls are treated is a must.

          Regards,
          Oleksandr

          Anurag Shekhar added a comment -

          Thanks Mike for the detailed input.
          I will be attaching a functional spec in the issue some time next week.

I have been thinking along the lines of changing the unique index behavior to match that of PostgreSQL (which I think is doing the right thing according to the spec).

I haven't thought of adding another syntax to retain the current behavior of the unique index, but I am OK with doing that if the community thinks it is the right thing to do.

          Øystein Grøvlen added a comment -

          My previous comment assumed that creation is done in a more efficient
          way than inserting records one-by-one. Is that the case? If not,
          there should be no issue with index creation.

          Oleksandr Alesinskyy added a comment -

          Hello Mike,

concerning indexes and unique constraints - I did not completely catch your point, sorry.
Do you propose to add additional syntax to the CONSTRAINT clause only, or to just go the
Oracle way, which treats all unique indexes the same way?

          And I completely agree that before implementation some specification concerning both
          SQL syntax and index behavior shall be agreed.

In my opinion, for compatibility reasons the DB2 way is better, as any index created without
the additional syntax (no matter whether in the constraint clause or in the create index command)
behaves exactly as before, so no existing applications would be affected. And an index created
with the additional syntax provides the new behavior. BTW, it may be considered a new, third type
of index (which likely requires system dictionary expansion).

          Regards,
          Oleksandr

          Øystein Grøvlen added a comment -

          Thank you Mike for taking the time to write down your clear thoughts
          on the workings of the Derby B-tree. It certainly made it clearer for
          me. Your suggestion for how one with a "non-unique" index can still
          guarantee uniqueness for non-null keys seems like a very good idea.

          However, one will also need to do something to index creation, since a
          traditional non-unique index will not flag if there are duplicate
          non-null keys. A simple, but not very efficient, way of achieving that
          would be a sequential scan at leaf-level comparing neighboring
          entries. I do not know Derby's implementation of index creation
          well enough to say whether one can easily come up with a more
          efficient way.
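
The adjacent-entry scan Øystein sketches could look roughly like this (a toy model in Python, not Derby code; the function name and the use of `None` to stand for NULL are assumptions of the sketch):

```python
def check_unique_not_null(sorted_keys):
    """Scan adjacent entries of an already-sorted key stream; report a
    duplicate only when the repeated key contains no NULL (None here)."""
    for prev, cur in zip(sorted_keys, sorted_keys[1:]):
        if prev == cur and None not in cur:
            raise ValueError(f"duplicate non-null key: {cur}")

# Duplicate keys containing a NULL pass the check...
check_unique_not_null([(1, None), (1, None), (2, 3)])

# ...while fully non-NULL duplicates are rejected.
try:
    check_unique_not_null([(1, 2), (1, 2)])
except ValueError as e:
    print(e)
```

As the comment notes, this is the simple sequential leaf-level scan; a real implementation could fold the same neighbor comparison into the sort or bulk-load phase instead.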

          Mike Matrigali added a comment -

          I don't think I have explained very clearly why I don't think the current
          approach is best. I especially have been using unique/non-unique incorrectly.
          In the following I am only talking about the internal
          store btree implementation, for this discussion ignore the SQL index built
          on top of them.

          The btree design is ARIES based. The implementation actually only implements
          a unique btree, but by picking key columns and options one can implement
          both unique and non-unique indexes using the same code. Only supporting unique
          btree actually simplifies a lot of the binary search, leaf/branch splitting,
          and node merging code. So code that uses the btree to implement something like
          a unique constraint doesn't actually tell the code it is unique. There are
          a number of options you specify when you are creating the btree. There are
          2 that are of interest to this discussion:
the number of columns in the key, and the number of leading columns in the key that
uniquely identify the row in the btree.

          As the most simple example take the implementation of a single column SQL
          index. In that case the btree is actually created with 2 columns, which
          when combined are always unique (whether or not it is a unique or non-unique
          SQL index). In the current implementation of a unique single column index
          the btree is created with 2 columns and it knows that a row can be uniquely
          identified by the 1st column. In the case of a non-unique single column
index the btree is created with 2 columns and told that it takes both
          columns to uniquely identify a row in the index. The system insures that all
          the rows in the index in this case are unique by insuring that the row
location column is unique; Derby goes out of its way to never reuse a row
          id and thus insure that every row inserted in a base table has a unique
          row id.
          The reason we use number of columns rather than a simple unique vs. non-unique
          setting is that someday we may also add columns to the index that actually
          don't participate at all in the searching of the index and in that case there
          might be a 7 column btree where only 3 leading columns are necessary (this
          is often requested for optimizing performance using covered queries for an
          index).
          The setting of how many
          columns it takes to uniquely identify a row is critical, and is used for a
          number of different operations within the btree some of which have nothing to
          do with returning result sets to SQL queries.
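
Mike's description of the (key, row location) design can be modelled with a toy in-memory structure (purely illustrative; the name `n_unique` and the sorted-list stand-in for the btree are inventions of this sketch, not Derby internals):

```python
import bisect

class ToyBtree:
    """Sorted list of (key..., rowid) tuples. n_unique is the number of
    leading columns that must uniquely identify a row, mirroring the
    store-level option described above."""
    def __init__(self, n_unique):
        self.n_unique = n_unique
        self.rows = []          # kept sorted

    def insert(self, row):
        pos = bisect.bisect_left(self.rows, row)
        # Duplicate check: does a neighbor match on the leading
        # n_unique columns?
        for neighbor in self.rows[max(pos - 1, 0):pos + 1]:
            if neighbor[:self.n_unique] == row[:self.n_unique]:
                raise ValueError(f"duplicate key: {row[:self.n_unique]}")
        self.rows.insert(pos, row)

# A "unique" single-column SQL index: 2 stored columns, 1 identifies a row.
unique_ix = ToyBtree(n_unique=1)
unique_ix.insert((10, "rowid-1"))
try:
    unique_ix.insert((10, "rowid-2"))   # rejected: same leading column
except ValueError as e:
    print(e)

# A "non-unique" index: same 2 stored columns, but both form the identity.
nonunique_ix = ToyBtree(n_unique=2)
nonunique_ix.insert((10, "rowid-1"))
nonunique_ix.insert((10, "rowid-2"))    # accepted
```

The point the sketch tries to capture is that the same storage code serves both cases; only the count of identifying columns differs.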

          At the btree level it is important that comparisons of each element remain
          consistent. So even if SQL thinks null's can't be compared, in order to be
          able to find the rows at the btree level there must be a consistent comparison
          between one null and another. The current implementation is that a null
          equals a null, and seems the easiest to implement at this level.

          All of the above have led to the current situation where in a unique index
          one null is allowed but not 2.

          As soon as you want to implement a 2 column btree which allows the insertion
of: (null, row id 1), (null, row id 2), then it is clear that the minimum
number of columns necessary to uniquely identify the row is 2 and not 1. To me
          that makes it a non-unique index, but I don't actually care what the language
          system catalogs call it – as long as btree is correctly created indicating
          that it takes 2 columns to uniquely identify the row.

          But of course it is not that simple. The job of insert includes, but is
          not limited to:
          1) determine if the row is valid to insert
          2) determine exactly the "right" place in the index to insert the row
3) lock the correct rows to implement whatever the isolation level of the xact requires

          Current "unique index" implementation (not the whole story; leaving out details concerning finding and dealing with deleted rows):
          A. We search the index for an exact match on the number of columns necessary
          to uniquely identify the row, so in the case of a single column index we just
          search using 1 column. If we get a match then there is a duplicate key
          error.
          B. The search in step A is set up such that if there is no exact match, the
          same code has found the exact row in the tree just before which the new row
          is to be inserted. This works because we "know" that the above search,
          which only used the 1st column, must find the right place in the tree;
          the value of the 2nd column does not matter.
          C. Lock the row being inserted.
          D. Depending on isolation level we may also lock the previous row to ensure
          no phantom inserts happen to a serializable reader. Note the previous row
          may not be on the current leaf page.

          Current "non-unique" index implementation is exactly the same, except that the
          checks are altered based on the number of columns it takes to uniquely identify
          a row.
          A. We search the index for an exact match on the number of columns necessary
          to uniquely identify the row, so in the case of a single column index we
          search using 2 columns. If we get a match then there is a duplicate key
          error, but only a bug will make this happen, as the 2nd column should always
          be unique by itself. But if you somehow forced the insert of
          "1, rowid 1", "1, rowid 1" into this btree, a duplicate key error would be
          thrown (or maybe an ASSERT, not sure).
          B. The search in step A is set up such that if there is no exact match, the
          same code has found the exact row in the tree just before which the new row
          is to be inserted. In this case the search has used all columns, so if it
          doesn't find an exact match it has again ended up exactly where the row
          should be inserted.
          C. Lock the row being inserted.
          D. Depending on isolation level we may also lock the previous row to ensure
          no phantom inserts happen to a serializable reader. Note the previous row
          may not be on the current leaf page.

          The search code interfaces were carefully crafted to do exactly what we wanted
          in these 2 cases. I think the new behavior does not lend itself to a single
          search that performs both A and B. Assume the null duplicate behavior we are
          looking for can be described as (I am waiting on an answer to whether this
          assumption is valid):
          if there are any nulls in any column of the row, don't do duplicate
          checking; otherwise do duplicate checking on the leading n-1 columns.
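          As a minimal sketch (hypothetical helper names, not Derby code), that rule amounts to a single scan of the key for nulls, with Java `null` standing in for SQL NULL:

```java
import java.util.Arrays;

public class NullKeyRule {
    // Hypothetical helper illustrating the rule above: duplicate
    // checking is skipped whenever any key column is SQL NULL
    // (represented here as a Java null element).
    static boolean skipDuplicateCheck(Object[] keyColumns) {
        return Arrays.stream(keyColumns).anyMatch(c -> c == null);
    }

    public static void main(String[] args) {
        System.out.println(skipDuplicateCheck(new Object[]{1, null})); // true
        System.out.println(skipDuplicateCheck(new Object[]{1, 2}));    // false
    }
}
```

          Under this rule a multi-column key behaves like a single-column one: one null anywhere in the key disables the uniqueness check for that row.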

          Here is what I think may work best for the new behavior,
          with the specific example of a 1 column index:

          1) set the btree to 2 columns
          2) set the number of columns to uniquely identify the row to be 2 also
          3) set a new flag that indicates special duplicate checking processing needed
          4)
          if (magic duplicate checking flag)
          {
              null_present_in_key_column = search the row for nulls;

              Perform the search exactly as we do above in the non-unique index
              case as described in step A; you will have the right place to insert
              the row.

              if (null_present_in_key_column)
              {
                  you are done; insert the row, no duplicate checking necessary
              }
              else
              {
                  // The following may not be exactly right, but I think you can
                  // do duplicate checking just based on the row to your left
                  // and the row to your right. The idea is that, worst case,
                  // depending on the order of your row location you have landed
                  // either left or right of the single key that matches the
                  // insert row.
                  logically compare the insert row to the row to the left and the
                  row to the right using only 1 column, no magic null handling
                  necessary. If either compares equal then throw duplicate key.
              }
          }
          5) do same isolation locking as existing code for duplicate indexes.

          This might not be the most elegant, but I think it has the least chance of
          breaking a lot of the system, at the cost of 1 null row check
          and 2 row compares.
          The code is isolated to insert, and no change is necessary to any compare
          interfaces. It should perform better than forcing the language layer to
          first do a probe and then do an insert for this kind of unique index.
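          A toy sketch of the proposal above, simplified to a single int key plus a row id, with a sorted List standing in for the btree leaf (all names are hypothetical, not Derby's actual interfaces):

```java
import java.util.*;

public class MostlyUniqueInsert {
    static class Entry {
        final Integer key;   // single indexed column; null models SQL NULL
        final int rowId;     // row location column, always unique
        Entry(Integer key, int rowId) { this.key = key; this.rowId = rowId; }
    }

    // Full-key ordering: null sorts low and null == null, so navigation
    // stays consistent; the row id column breaks ties.
    static final Comparator<Entry> FULL_KEY =
        Comparator.comparing((Entry e) -> e.key,
                             Comparator.nullsFirst(Comparator.<Integer>naturalOrder()))
                  .thenComparingInt(e -> e.rowId);

    // Returns false on a duplicate key, true if the row was inserted.
    static boolean insert(List<Entry> tree, Entry row) {
        boolean nullInKey = (row.key == null);               // scan the key for nulls
        int pos = Collections.binarySearch(tree, row, FULL_KEY);
        int insertAt = (pos >= 0) ? pos : -(pos + 1);        // non-unique style search
        if (!nullInKey) {
            // duplicate check against left and right neighbors, key column only
            if (insertAt > 0 && row.key.equals(tree.get(insertAt - 1).key)) return false;
            if (insertAt < tree.size() && row.key.equals(tree.get(insertAt).key)) return false;
        }
        tree.add(insertAt, row);
        return true;
    }
}
```

          With this sketch, (null, rowid 1) and (null, rowid 2) both insert, while a second row with key 1 is rejected as a duplicate, matching the behavior described in the steps above.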

          Mike Matrigali added a comment -

          Along the same lines: in the case of multi-column keys, what null behavior is being
          implemented? In this bug description I think I have seen at least Oracle with one behavior and
          DB2/Postgres with another.

          Can the DB2/Postgres behavior be implemented as simply as: if there is a null in any column
          of a key (either single column or multi-column), don't do any uniqueness checking on insert
          of that row?

          Mike Matrigali added a comment -

          There are 2 discussions here. Let's settle first what SQL level behavior is being implemented. I believe it is clear that the focus is to allow Derby to create unique
          constraints on a set of columns where one or more of the columns is nullable. This
          is the SQL standard, and I don't see any backward compatibility issues. Many users have
          asked for this, and it adds to our standards implementation, so it seems like a good feature to add to Derby.

          My opinion is that we should not change the existing behavior for unique indexes. Changing
          the behavior presents possible application upgrade and backward compatibility issues. The
          change seems minor, but it has been my experience that no matter how small the behavior change, a number of applications will depend on it. I don't think this is a SQL standard issue,
          as create index is implementation-specific DDL and is not covered by the standard.

          I lean toward not providing extra syntax in create index to create the "new" multiple-null unique indexes, as it just seems like another way to do the same thing you can do with a constraint. But I don't feel strongly about this one; if someone wanted to do it I would not -1 it. Historically I think alter table add constraint came after create index, so originally
          one needed the create index command to add more unique indexes after the table was
          created.

          I do think the decision on support of old unique index may affect best internal implementation details.

          Øystein Grøvlen added a comment -

          To my previous comment: I was rather confused when I asked whether we need to have indexes that treat null as equal. As far as I can tell, we want look-ups (and deletes) to treat nulls as equal (in order to find rows where a specific column is null), but we want nulls to be treated as not equal on insert. (I think the reason for my confusion was that I mixed this with whether we need some indexes that treat null as equal on insert and other indexes that treat null as not equal on insert. I think we only need the latter. Please, correct me if I am wrong.)

          So Anurag's approach is to add a parameter to compare methods so that on navigation on insert, compare is told to treat null as not equal while on look-ups compare is told to treat null as equal. The question is whether there is a clean way to separate the two navigational modes. With the current patch, it is not that straight-forward to detect whether there are situations where nulls are treated as equal on insert or where nulls are treated as not equal on look-ups. Is there a way to make that more clear and less error-prone?
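          One way to keep the two modes separate and explicit, sketched here with hypothetical names (not the actual Derby compare interfaces): a single ordering compare used for all navigation, plus a distinct duplicate test applied only on insert.

```java
public class NullCompareModes {
    // Ordering compare used for navigation and look-ups: null is a real
    // value that sorts low and compares equal to another null, so the
    // tree stays consistently ordered.
    static int navCompare(Integer a, Integer b) {
        if (a == null) return (b == null) ? 0 : -1;
        if (b == null) return 1;
        return a.compareTo(b);
    }

    // Duplicate test applied only at insert time: two nulls never
    // count as a duplicate, matching the desired constraint behavior.
    static boolean isDuplicate(Integer a, Integer b) {
        return a != null && a.equals(b);
    }
}
```

          Rather than threading a boolean flag through every compare call, only the insert path invokes the duplicate test, which may make it easier to audit where each mode is in effect.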

          Oleksandr Alesinskyy added a comment -

          I'm the original reporter.
          I suggested the DB2-like approach for the sole reason that Derby is already very DB2-like in many areas.
          If some other approach that allows duplicate NULLs in unique indexes is adopted, that is OK with me.

          In this very case it looks like the DB2 approach is somewhat more consistent than the Oracle one. Namely, Oracle treats
          NULLs in indexes in a special manner if and only if all columns in the index contain NULL. I.e., for
          create table EXAMPLE (
          x number,
          y number
          );

          create unique index IDX_EXAMPLE on EXAMPLE(x,y);

          insert into EXAMPLE values(null,null);
          insert into EXAMPLE values(null,null);

          would succeed, resulting in 2 records with NULL values, but

          insert into EXAMPLE values(1,null);
          insert into EXAMPLE values(1,null);

          would cause an exception on the second insert.
          Such behavior is somewhat cumbersome.

          On the other hand Oracle treats unique indexes and unique constraints uniformly.
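          The two rules in this example can be contrasted directly (hypothetical helpers; Java `null` stands in for SQL NULL):

```java
import java.util.Arrays;

public class NullSkipRules {
    // Oracle-style: the uniqueness check is skipped only when ALL
    // indexed columns are NULL.
    static boolean oracleSkipsCheck(Object[] key) {
        return Arrays.stream(key).allMatch(c -> c == null);
    }

    // DB2/Postgres-style: the check is skipped when ANY indexed
    // column is NULL.
    static boolean db2SkipsCheck(Object[] key) {
        return Arrays.stream(key).anyMatch(c -> c == null);
    }
}
```

          For the (1, null) rows above, the Oracle rule still enforces uniqueness (hence the exception on the second insert), while the DB2/Postgres rule would allow both rows.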

          Øystein Grøvlen added a comment -

          I agree with Mike that it seems a bit complicated and error prone to
          sometimes treat nulls as equal and sometimes not.

          On the other hand, it seems to me that building the requested
          functionality on top of non-unique indexes will not be very efficient.
          There is a reason why unique indexes have been implemented in
          Derby, and I think that is because they are a more efficient way of
          enforcing uniqueness than doing lookups in a non-unique index.

          I also agree with Mike that it needs to be better defined what
          functionality one is aiming for here. The reporter of this Jira
          suggest to have a special index type where multiple null values are
          allowed (like DB2). Others have requested to be able to define a
          unique constraint on a column that may have null values. Unique
          constraints are currently enforced by an underlying unique index. The
          latter is the main reason why I would like to have an index that
          enforces uniqueness for all values except null. I think I would
          prefer that all unique indexes behaved that way, and not to have
          non-standard syntax, like DB2's, for creating such indexes.

          My question is what do we gain from having indexes that treat null
          values as equal? Instead of having unique indexes that sometimes
          treat nulls as equal and sometimes not, could not all indexes always
          treat null values as unequal? (I know this will probably be a lot
          more work.) I must admit that I do not fully understand the
          consequences of such an approach, and it would be good if someone
          could explain if there are cases where it is necessary to treat nulls as equal.
          (Mike mentions an example from recovery, but if I understand him correctly
          that could be solved by including the row location column in the
          look-up.)

          Mike Matrigali added a comment -

          Also, is there some sort of functional spec for this? The initial description seems to indicate that you will provide some extra syntax to allow the creation of unique indexes which allow nulls; is that still the case? I didn't see any parser changes, but again I understand that this is a preview patch.

          Also, what is the intent of the change for existing indexes?

          Mike Matrigali added a comment -

          I continue to think that it is a bad idea to introduce duplicate values to the current btree unique index (and sorting paradigm). I guess you are building a "mostly unique" index. I understand that SQL wants to treat the nulls differently, but a lot of the basic assumptions in the structure of the unique btree (splitting, searching, logical recovery, branch node keys,
          serializable locking, ....) are all based on the actual lower level physical nature of the keys.

          For instance, while the SQL layer may never do a search where key=null, the logical recovery code may execute a search on the unique index using the key, expecting null to
          match null. So if the index
          is a single column unique index with, say, 1,000,000 null values, we may issue a search
          in the unique index for "null" assuming we will find the single row because it is a unique
          index. I believe the current search algorithm is optimized for unique indexes to not include
          the row location column in the search comparisons.

          Introducing duplicate nulls to unique indexes I believe will unnecessarily complicate the code and likely introduce bugs which will be hard to identify in the future.

          Safest would be to use existing non-unique indexes, with extra work by the language layer to
          verify uniqueness at insert time (indexes don't support updates, so they default to a delete and
          insert; thus just one path to code/check). It may be worth fiddling with the optimizer to recognize that this index is a "mostly unique" non-unique index so that it can get better estimates than a regular non-unique index.

          Adding support to the btree itself to perform the check at insert time is more efficient at the cost of complicating the btree code, but I think could be isolated to a single compare call at point just before value is about to be inserted into tree.

          I know it is enticing to use the unique tree but sometimes treat nulls as not equal and sometimes as equal; I don't think that is the right approach.

          Anurag Shekhar added a comment -

          1) Current code changes B2I so needs upgrade code or else no old database with any index will be able to be read.

          I do have code for soft and hard upgrade on my todo list. I will create a jira for the same.

          2) The approach seems complicated and error prone to me. Treating compare of nulls one way sometimes, through an extra overloaded arg, and another way at other times seems overly complicated.

          It is just one of the possible approaches. I am trying it out only because I felt it is a good solution. I am open to taking another approach.

          3) I need to think about it more carefully to understand how treating compare of nulls differently in the same tree might affect things like logical undo during restart recovery after an uncommitted insert, where the row may have moved drastically since the insert due to lots of other inserts.

          I will write some tests related to this scenario and get back.

          4) before commit needs a bunch of null heavy additional test cases.

          Yes, I need to write several test cases for the new behaviour and also make sure it is not breaking something else in some other part of the code base.

          1) optimizer probably will assume 1 row from unique index in this case, now it may have to use some other estimate when dealing with nulls. If the index was non-unique rather than unique current code would just work.

          Optimizer has a provision to treat IS NULL differently from '='. I am using that in my current patch and it appears to be working fine for very basic tests.

          2) unique index locking isolation level algorithm is different from duplicate index locking isolation level for serializable. It may have to now change. Previously a previous key lock is not necessary to protect a "phantom" insert range for a unique index - after your change maybe it is? Again if the index was marked non-unique at implementation level then current code would just work.

          I need to check on that. I will test and debug the code in multi-user mode and get back.

          Anurag Shekhar added a comment -

          thanks Mike for revisiting the issue.

          I felt letting the index remain unique will be better because:

          1. It avoids the need to rescan the index to check for duplicates after every update/insert.
          2. Marking the index as non-unique would cause the optimizer to conclude that there might be
          duplicate records for equality predicates (not just the IS NULL case), so it may sometimes
          select a genuinely non-unique index when it could have selected this (almost unique)
          index. Right now (in this patch) the optimizer only assumes multiple records when there
          is an IS NULL in the where clause.

          Many internal routines don't go through the optimizer to look for records but call internal store
          methods directly (as in the drop table case). In such cases changes will be required regardless of
          the approach we use.

          Mike Matrigali added a comment -

          I am not sure any of the following are problems, but they are things to think about. They all stem from the fact that you now have a unique index that may return multiple rows for a given value if that value is null
          (or, in the more interesting case, if one of the columns of a multi-column key is null). The current Derby system may assume this is not possible, which may affect the following areas:

          1) The optimizer probably assumes 1 row from a unique index in this case; now it may have to use some other estimate when dealing with nulls. If the index were non-unique rather than unique,
          the current code would just work.
          2) The unique-index locking algorithm for serializable isolation is different from the duplicate-index one, and may now have to change. Previously a previous-key lock was not necessary to protect a "phantom" insert range for a unique index - after your change maybe it is? Again, if the index were marked non-unique at the implementation level, the current code would
          just work.
          3) Some language and store positioning code may assume that an exact key on a unique index resolves to at most one row - now it may not. I don't know where all this code is; I am just warning that it is something to look for. Again, if the code were marked non-unique vs unique,
          it would continue to work.

          I still think that, at the store level, putting multiple nulls in makes the index non-unique, and the code is cleaner if we just treat it that way, rather than special-casing the compares as
          your patch does.

          Mike Matrigali added a comment -

          How does your current patch compare with a solution that instead uses a non-unique btree and changes the insert code to do some special processing to recognize a duplicate - probably using either what you have already written, or something very similar, to do the right thing with nulls for this index?

          Mike Matrigali added a comment -

          I have only looked briefly at the most recent patch. I had not been following this thread, as the last note I had read said that the solution was not going to involve changing the btree code.

          My first overall comments:
          1) The current code changes B2I, so it needs upgrade code; otherwise no old database with any index
          will be readable.

          2) The approach seems complicated and error-prone to me. Treating the compare of nulls one way through an extra overloaded argument sometimes, and another way otherwise, seems overly complicated.

          3) I need to think more carefully to understand how treating the compare of nulls differently in the same tree might affect things like logical undo during restart recovery after an uncommitted insert, where the row may have moved drastically since the insert due to lots of other inserts.

          4) Before commit, this needs a bunch of null-heavy additional test cases.

          Øystein Grøvlen added a comment -

          Anurag Shekhar (JIRA) wrote:
          > Partial key matching is used only for searching. While searching
          > (either for update or select) null should be treated equal. So this
          > code shouldn't get executed.

          It concerns me a bit that methods with generic names like 'compare'
          make assumptions about the context in which they are used. If
          treatNullsEqual is false, I think nulls should not be treated as equal
          regardless of the current use. If the caller wants nulls to be
          treated equal, it should pass in true for treatNullsEqual.

          > nullsOrderedLow is used for ordering nulls with respect to not null
          > values (in order by clause with null ordering option) so when two
          > nulls are being compared this flag is not relevant.
          >
          > -1 or 1 result of the null comparison will only effect where new
          > node will be inserted (left or right of the existing node). The spec
          > in my opinion doesn't mandates it. So I think its ok to return
          > either -1 or 1. Please let me know if I am wrong.

          I think you are right about the effect on current usage, but again, this is
          a generic method for comparison of data values, and even if its
          current usage is limited to insertion into B-tree, I do not think this
          code should make such an assumption.
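Øystein's point can be illustrated with a small, hypothetical comparator (not Derby's actual `DataValueDescriptor` API; the names and null ordering are illustrative): the flag alone decides how two nulls compare, with no hidden assumption about the call site.

```java
public class NullAwareCompare {
    /**
     * Three-way compare of two nullable Integers.
     * treatNullsEqual: if true, null compares equal to null; if false, two
     * nulls are deliberately unequal (a null pair never yields 0), which is
     * what lets duplicate nulls slip past a uniqueness check.
     */
    static int compare(Integer a, Integer b, boolean treatNullsEqual) {
        if (a == null && b == null) {
            return treatNullsEqual ? 0 : 1;
        }
        if (a == null) return 1;   // nulls ordered high in this sketch
        if (b == null) return -1;
        return a.compareTo(b);
    }

    public static void main(String[] args) {
        System.out.println(compare(null, null, true));   // nulls equal
        System.out.println(compare(null, null, false));  // never "equal"
        System.out.println(compare(1, 1, false));        // flag only affects nulls
    }
}
```

Note that with treatNullsEqual false the method is no longer a total order (it is deliberately asymmetric for a null pair), which is exactly why a generically named compare method making context-dependent assumptions is worrying.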

          Anurag Shekhar added a comment -

          Sorry for the delay in posting the patch. It took me a lot longer than expected to fix
          the drop table issue.

          The major change in this patch is an additional attribute in B2I. This attribute
          declares whether nulls should be treated as equal for this particular index, and it
          is persisted when the index is stored. The attribute is required to ensure that the
          data dictionary indexes retain the old behavior.

          Data dictionary classes use the ControlRow class to locate an index while deleting
          a table; in my previous patch, drop table was failing because ControlRow was
          treating nulls as unequal unconditionally. In this patch I pass an additional
          attribute to ControlRow to tell it whether nulls should be equal or not. That
          information is fetched from the BTree class.

          Description of the patch:

          java/engine/org/apache/derby/impl/sql/compile/FromBaseTable.java

          Modified to make the IS NULL predicate optimizable, as IS NULL on an indexed
          field doesn't ensure a single record in the result.

          java/engine/org/apache/derby/impl/sql/execute/CreateIndexConstantAction.java

          Modified to set nulls as not equal in the sorter. This routine is executed only while
          a user is creating an index, so it doesn't affect internal data dictionary indexes.

          java/engine/org/apache/derby/impl/store/access/sort/SortBuffer.java
          java/engine/org/apache/derby/impl/store/access/sort/MergeSort.java
          java/engine/org/apache/derby/impl/store/access/sort/MergeInserter.java
          java/engine/org/apache/derby/impl/store/access/heap/HeapScan.java
          java/engine/org/apache/derby/iapi/store/access/SortController.java

          Modified to add functionality to handle unequal nulls.

          java/engine/org/apache/derby/impl/store/access/btree/BTree.java

          Added two abstract methods to set and get how nulls should be treated (equal
          or unequal).

          java/engine/org/apache/derby/impl/store/access/btree/BTreeController.java

          Modified to call areNullsEqual() on BTree and pass it on to ControlRow.

          java/engine/org/apache/derby/impl/store/access/btree/BTreeScan.java

          Added a method to modify and retrieve the null-treatment behaviour.

          java/engine/org/apache/derby/impl/store/access/btree/ControlRow.java

          Added a new method that accepts a flag describing how nulls should be treated
          and passes it on to DataValueDescriptor.

          java/engine/org/apache/derby/impl/store/access/btree/index/B2I.java

          Added an attribute to indicate whether this index treats nulls as equal or not;
          added code to store and retrieve this flag in secondary storage;
          added code to retrieve this property during creation.

          java/engine/org/apache/derby/iapi/types/DataValueDescriptor.java
          java/engine/org/apache/derby/iapi/types/DataType.java

          Added a method to specify how nulls should be compared.

          I have also fixed the spelling and whitespace issues, Øystein.
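The persisted-flag idea in B2I can be sketched with plain Java serialization (a hypothetical `IndexDescriptor` class, not the real B2I, which uses Derby's Formatable machinery): the index writes its null-handling variant when stored and reads it back on open, so old on-disk indexes keep their old null semantics.

```java
import java.io.*;

public class IndexDescriptor implements Serializable {
    private static final long serialVersionUID = 1L;

    final boolean unique;
    // New attribute: whether nulls compare equal in this index.
    // Pre-upgrade indexes would default this to true, preserving old behavior.
    final boolean nullsEqual;

    IndexDescriptor(boolean unique, boolean nullsEqual) {
        this.unique = unique;
        this.nullsEqual = nullsEqual;
    }

    /** Serialize the descriptor as it would be stored with the index. */
    byte[] store() throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
            out.writeObject(this);
        }
        return bos.toByteArray();
    }

    /** Read the descriptor back, recovering the persisted flag. */
    static IndexDescriptor load(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (IndexDescriptor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        IndexDescriptor idx = new IndexDescriptor(true, false);
        IndexDescriptor back = IndexDescriptor.load(idx.store());
        System.out.println(back.unique);
        System.out.println(back.nullsEqual);
    }
}
```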

          Anurag Shekhar added a comment -

          >3. This code will not be executed for partial key matching. Hence, null values will always be treated equal in that case. Is that intentional? If yes, it would be good if that would be clear from the specification of the method, and not just be an implicit assumption based on how partial key matching is currently used.

          Partial key matching is used only for searching. While searching (either for update or select) nulls should be treated as equal, so this code shouldn't get executed.

          DataType#compare: Are you sure that nullsOrderedLow does not matter when both values are null? I am a bit confused,so it would be good if you could add a convincing comment. And why return -1 and not 1 if nulls are not equal? Does this work equally well for both ascending and descending scans?

          nullsOrderedLow is used for ordering nulls with respect to non-null
          values (in an order by clause with a null-ordering option), so when two
          nulls are being compared this flag is not relevant.

          A -1 or 1 result of the null comparison will only affect where the new
          node is inserted (left or right of the existing node). The spec,
          in my opinion, doesn't mandate either. So I think it is ok to return
          either -1 or 1. Please let me know if I am wrong.

          Anurag Shekhar added a comment -

          thanks Øystein, for the review and trying out the patch.

          1. I do not understand the reasoning behind the return values. I notice that if I swap them, dropping a table does not fail anymore (I have not checked if that breaks other operations).

          In this case the return value (1) is not important, but the sign is important because it determines the sequence of sorted keys (used while creating an index on a table with existing records). When the B-tree is first created, the keys are supposed to be in sequence, and this is verified by invoking a method in ControlRow. So if duplicate nulls are present in the table, they should still conform to the sequence (depending on the ascOrDesc[] flags passed in).
          Yes, I saw that the drop works fine if I swap the values, but in that case creating an index on a table with existing records will fail.

          I have found the problem that causes drop table to fail. It happens while scanning the catalog's table to remove the table's index. Some of the attributes in the index record are null, and while comparing, these nulls are treated as unequal and the delete fails. I am making changes to fix this and will post them after incorporating your other comments.

          2. What are the implications if there are several columns for which both have null? You only test ascOrDesc for the latest column.

          I am changing this part of the code. Now the first null will ensure a mismatch and the ascOrDesc flag for that column will be used. But I think it shouldn't matter for which null ascOrDesc is used, as this is used only by the sorter, and it makes no difference in which order these two records are sorted (at least until we have something like a row id).

          I will get back about the other questions.

          Øystein Grøvlen added a comment -

          Thanks for the patch, Anurag. Your approach to solving this seems OK to me. I have a few questions and comments:

          ControlRow: I do not quite understand all the implications of your changes here, and some comments explaining them would be good. Regarding the following code which, if I have understood things correctly, is executed if key and index row are equal for all columns and there are columns where both have null:

          if (nullPart != -1)
          {
              if (ascOrDesc[nullPart]) // true - ascending order
                  return 1;
              else
                  return -1;
          }

          1. I do not understand the reasoning behind the return values. I notice that if I swap them, dropping a table does not fail anymore (I have not checked if that breaks other operations).
          2. What are the implications if there are several columns for which both have null? You only test ascOrDesc for the latest column.
          3. This code will not be executed for partial key matching. Hence, null values will always be treated equal in that case. Is that intentional? If yes, it would be good if that would be clear from the specification of the method, and not just be an implicit assumption based on how partial key matching is currently used.

          DataType#compare: Are you sure that nullsOrderedLow does not matter when both values are null? I am a bit confused, so it would be good if you could add a convincing comment. And why return -1 and not 1 if nulls are not equal? Does this work equally well for both ascending and descending scans?

          FromBaseTable: The comment above your code needs to be updated to reflect your change.

          SortController: Typos: 'defaul', 'other wise', 'returns if nulls are ...' should be 'returns whether ...', 'true if they ...' should be 'true if null values ...'

          Nit: There are a lot of whitespace issues that I hope you will correct before a final patch.

          Anurag Shekhar added a comment -

          This patch is not meant for commit.
          It is just for review of the approach I am taking for this issue.

          Description of the patch:
          This patch contains changes for the 3 issues involved in supporting duplicate nulls.

          1. Treat nulls as unequal while inserting.
          I have modified the ControlRow class of the BTree access. Now, while searching the tree to look for duplicates, nulls are not treated as equal. If all the fields of the key find a match and one of the parts is null, it returns -1 or 1 depending on whether ascending or descending order is requested.

          2. Assume multiple rows in the result if the where clause has IS NULL for any of the key parts.
          I have modified FromBaseTable to support this.

          3. While creating an index on an existing table, don't consider nulls as equal for the key parts.

          I have modified the DataType class and the DataValueDescriptor interface and added an additional compare method to which a flag can be passed to specify whether nulls are to be treated as equal. I have also modified CreateIndexConstantAction so that it sets a flag in MergeInserter, which makes nulls unequal by passing this flag on to the MergeSort and SortBuffer classes.

          This patch is incomplete. The main issues are:
          1. Right now the distinct clause is dropped when the search is performed on a unique index. I need to modify the optimizer not to drop it if the distinct is on a unique index (to eliminate duplicate nulls).

          2. Dropping a table fails.

          Anurag Shekhar added a comment -

          I am trying to allow duplicate nulls in a unique index by making null
          not equal to null in the case of a unique index. The changes are in the
          ControlRow class. But this is only part of the implementation; the
          other major part is to change the scan method used when there is
          a null in the search criterion while scanning the index.

          The changes are something like this:

          --- java/engine/org/apache/derby/impl/store/access/btree/ControlRow.java (revision 572236)
          +++ java/engine/org/apache/derby/impl/store/access/btree/ControlRow.java (working copy)
          @@ -1293,7 +1293,9 @@
                      // int r = indexcol.compare(keycol);

                      int r = indexrow[i].compare(key[i]);
          -
          +            if (r == 0 && key[i].isNull())
          +            {
          +                r = 1;
          +            }

          (Doing this corrupts the btree index and I get an exception while deleting
          the table. I am checking out how to fix this.)

          I am also modifying the optimizer to select a bulk index scan when there
          is an "is null" in the select.

          This change will make indexes incompatible with the older behaviour.
          I plan to support soft upgrades by adding another attribute in the index
          header and extending the B2I class.

          I think this will be less expensive than letting the duplicates go into the
          tree and then checking whether there is some duplicate key without null as any
          of its parts. Please let me know if my assumption is wrong.
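The effect of the "r == 0 && key is null, then r = 1" tweak can be sketched outside Derby: a binary-search insert that treats an all-equal match with a null key as "greater" never reports a duplicate for null keys, so several nulls can coexist in an otherwise unique ordered structure. This is hypothetical stand-in code, not the actual ControlRow search logic.

```java
import java.util.ArrayList;
import java.util.List;

public class UniqueNullInsert {
    final List<Integer> keys = new ArrayList<>();

    /** Returns false on a rejected duplicate, true if inserted. */
    boolean insert(Integer key) {
        int lo = 0, hi = keys.size();
        while (lo < hi) {
            int mid = (lo + hi) / 2;
            int r = compare(keys.get(mid), key);
            if (r == 0) return false;          // genuine duplicate: reject
            if (r < 0) lo = mid + 1; else hi = mid;
        }
        keys.add(lo, key);
        return true;
    }

    /** Nulls sort high and never compare equal (the r == 0 -> 1 tweak). */
    static int compare(Integer indexVal, Integer key) {
        if (indexVal == null && key == null) return 1; // never a duplicate
        if (indexVal == null) return 1;
        if (key == null) return -1;
        return indexVal.compareTo(key);
    }

    public static void main(String[] args) {
        UniqueNullInsert idx = new UniqueNullInsert();
        System.out.println(idx.insert(5));     // inserted
        System.out.println(idx.insert(5));     // real duplicate rejected
        System.out.println(idx.insert(null));  // inserted
        System.out.println(idx.insert(null));  // second null also accepted
    }
}
```

The sketch also shows why the sign matters: the asymmetric compare must still steer the search consistently (all nulls end up grouped at one end), or the tree positioning breaks, which matches the corruption Anurag describes.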

          Oleksandr Alesinskyy added a comment -

          Oracle behavior:

          (null,null,null) vs (null,null,null) - allowed
          (1,null,null) vs (null,null,null) - allowed
          (1,null,null) vs (1,null,null) - NOT allowed
          (1,2,null) vs (1,null,null) - allowed
          (1,2,null) vs (1,2,null) - NOT allowed
          (1,2,3) vs (1,2,null) - allowed
          (1,2,3) vs (1,2,3) - NOT allowed

          DB2 for iSeries

          without the "where not null" clause:

          (null,null,null) vs (null,null,null) - NOT allowed
          (1,null,null) vs (null,null,null) - allowed
          (1,null,null) vs (1,null,null) - NOT allowed
          (1,2,null) vs (1,null,null) - allowed
          (1,2,null) vs (1,2,null) - NOT allowed
          (1,2,3) vs (1,2,null) - allowed
          (1,2,3) vs (1,2,3) - NOT allowed

          with the "where not null" clause:

          (null,null,null) vs (null,null,null) - allowed
          (1,null,null) vs (null,null,null) - allowed
          (1,null,null) vs (1,null,null) - allowed
          (1,2,null) vs (1,null,null) - allowed
          (1,2,null) vs (1,2,null) - allowed
          (1,2,3) vs (1,2,null) - allowed
          (1,2,3) vs (1,2,3) - NOT allowed
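          The rules listed above can be captured as small predicates. This is
          an illustrative sketch only, not vendor code: the DuplicateRules
          class and its method names are invented, an Integer[] stands in for a
          multi-column index key, and a null element models a SQL NULL.

```java
import java.util.Arrays;
import java.util.Objects;

// Illustrative predicates (invented names) encoding the duplicate-key rules
// listed above. Two keys being "duplicates" means the second insert is
// rejected. A null Integer element models a SQL NULL.
public class DuplicateRules {

    // DB2 default: NULL compares equal to NULL, so even an all-NULL key
    // collides with itself.
    public static boolean isDuplicateDefault(Integer[] a, Integer[] b) {
        return Arrays.equals(a, b);
    }

    // Oracle: an all-NULL key is excluded from the index entirely; otherwise
    // NULL compares equal to NULL for uniqueness purposes.
    public static boolean isDuplicateOracle(Integer[] a, Integer[] b) {
        boolean aAllNull = Arrays.stream(a).allMatch(Objects::isNull);
        boolean bAllNull = Arrays.stream(b).allMatch(Objects::isNull);
        if (aAllNull || bAllNull) {
            return false;
        }
        return Arrays.equals(a, b);
    }

    // DB2 "where not null": a key containing any NULL never collides; only
    // fully non-NULL, equal keys are duplicates.
    public static boolean isDuplicateWhereNotNull(Integer[] a, Integer[] b) {
        boolean aHasNull = Arrays.stream(a).anyMatch(Objects::isNull);
        boolean bHasNull = Arrays.stream(b).anyMatch(Objects::isNull);
        if (aHasNull || bHasNull) {
            return false;
        }
        return Arrays.equals(a, b);
    }
}
```

          The interesting difference is the mixed case (1,null,null) vs
          (1,null,null): a duplicate under both the Oracle rule and the DB2
          default, but not under "where not null".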

          Anurag Shekhar added a comment -

          I tried the scenarios you mentioned in PostgreSQL; here are the
          results for an index on (i,j,k):

          (null, null, null) vs (null, null, null) - allowed

          (a, a, a) vs (a, a, a) - NOT allowed

          (a, null, b) vs (a, null, b) - allowed
          (a, null, null) vs (a, null, null) - allowed
          (null, a, b) vs (null, a, b) - allowed

          This agrees with treating two nulls as unequal: if at least one part
          of the key is null, the rows are considered distinct and the insert
          is allowed.

          Anurag Shekhar added a comment -

          SQL:2003 doesn't specifically talk about unique indexes, but if we
          use the description of the unique predicate (as Bernt has summarized,
          http://tinyurl.com/yvqloj), two nulls are not equal, and duplicate
          keys should be allowed as long as at least one part of the key is
          null.

          The section describing the equals operation also hints that two
          nulls are not equal, as the "=" operator can't be used on nulls.

          Anurag Shekhar added a comment -

          Thanks Mike and Oleksandr for your interest and input.

          > From your last comment I can't tell what you plan on changing to
          > implement this. Are you planning on changing the btree index
          > implementation, which is a high risk change.

          No, I am still in the process of finding the best way of implementing
          this, and I think it won't require changes to the btree
          implementation.

          Mike Matrigali added a comment -

          What does the standard say about multi-column keys and the null
          uniqueness semantics for this feature? The Oracle semantics
          definitely seem easier to implement/understand. If that is not the
          standard, what should happen with the following 3-part multi-column
          unique key that allows duplicate null instances?

          clearly allowed:
          null, null, null vs null, null, null

          clearly not allowed:
          a, a, a vs a, a, a

          Is the following a duplicate or not?
          a, null, b vs a, null, b
          a, null, null vs a, null, null
          null, a, b vs null, a, b

          The Derby store is architected to allow the SQL engine to create an
          index which contains only a subset of the rows of the base table.
          That part would be easy. The challenge would be to enhance the
          existing engine code to understand how it could use such an index.
          For instance, it would be wrong to use this new index for an
          optimized isNull() lookup; it also won't work for count(*), and the
          consistency checker would probably need to change, as it assumes a
          matching number of rows between the base table and its indexes.
          Depending on the answer above, the optimizer may be even more limited
          in what it could use the index for. There are probably other things.

          Long term, I believe there are other uses for these kinds of
          non-standard indexes, so I think work in this area may help other
          projects, like functional indexes.

          Oleksandr Alesinskyy added a comment -

          Just for your information - Oracle implements this feature as
          follows:

          Rows in which all indexed columns have NULL values are excluded from
          the index; rows that have a non-NULL value in at least one indexed
          column must be unique.

          Mike Matrigali added a comment -

          From your last comment I can't tell what you plan on changing to
          implement this. Are you planning on changing the btree index
          implementation, which is a high-risk change? Or are you planning on
          implementing something that uses the existing btree index support to
          get what you want?

          At one point I looked at implementing this and started along the
          path of changing the existing btree index implementation, but saw
          that it was going to add a lot of overhead to the existing paths
          through the code. The main issue is the way "unique" violations are
          currently coded: language does an insert and then catches the error
          from store when it recognizes a problem. This just happens to be the
          way Derby currently enforces constraints; one could also enforce
          constraints by doing a probe, using proper locking to transactionally
          check for the existence of a duplicate, and then doing the subsequent
          insert if there is no problem.

          Off the top of my head, it might work to just associate an existing
          btree index that allows duplicates with this new constraint, and then
          change the insert code to do extra processing for this special
          constraint. It may be easiest to first do the insert, and then after
          the insert probe the index to see if there is more than one matching
          non-null key; if so, do a statement backout and return a constraint
          violation. I think the locking might be easier than enforcing a
          separate serializable read on probing before the insert. The plus
          here is no chance to break/slow down existing index paths for all the
          indexes that don't care about allowing duplicate nulls and
          non-duplicate other keys.
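          This insert-then-probe idea can be modeled with a toy in-memory
          structure. This is a sketch only, not Derby internals: the class name
          is invented, a plain list stands in for a duplicate-allowing btree,
          and the probe and "statement backout" are simplified to list
          operations.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model (invented names, not Derby code) of insert-then-probe constraint
// enforcement: insert optimistically into a duplicate-allowing index, then
// probe for a second fully non-null match and back the insert out if found.
public class ProbeAfterInsert {
    private final List<List<Integer>> index = new ArrayList<>(); // allows dups

    // Returns true if the insert stands, false if it was backed out because
    // it would violate the null-tolerant unique constraint.
    public boolean insert(List<Integer> key) {
        index.add(key);                      // optimistic insert
        if (key.contains(null)) {
            return true;                     // keys with a NULL never conflict
        }
        // Probe: is there now more than one occurrence of this non-null key?
        long matches = index.stream().filter(key::equals).count();
        if (matches > 1) {
            index.remove(index.size() - 1);  // "statement backout"
            return false;                    // constraint violation
        }
        return true;
    }
}
```

          The real implementation would of course need the locking discussed
          above so that two concurrent transactions cannot both pass the probe.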

          Anurag Shekhar added a comment -

          One of the approaches suggested in the previous discussion was to
          eliminate nulls from the index and use a full table scan when
          performing a search involving nulls on the index (or part of it).
          This approach would be much cleaner (it would not require nulls to be
          treated as special values in B+ tree searches on insert), but it
          penalizes two cases:

          1. When the where clause has an "is null" criterion on an indexed
          field.
          2. When the search is performed on part of a multi-field index.
          Suppose there is an index on fields i,j,k and a query is performed
          with i = val and j = val; then it is possible to perform an index
          scan on i and j, ignoring the third part, and then perform a leaf
          scan after finding the first leaf node with (i,j).
          But once we eliminate null values from the index, a full table scan
          is required, as there may be one or more rows with a null k matching
          the search criterion.

          This, I think, will cause a major performance drop, so I think it
          will be better to go for an implementation where duplicate null
          values are allowed in the index.

          Anurag Shekhar added a comment -

          There was a mail thread about this issue some time back on the
          developer mailing list. Nabble link to that thread:

          http://preview.tinyurl.com/yvqloj

          Anurag Shekhar added a comment -

          The description of this issue mentions two issues:
          1. Allowing duplicate nulls in a unique index (nulls should be
          treated as unequal on insert).
          2. Support for unique constraints over nullable columns.

          I plan to add support for allowing duplicate nulls in a unique index
          (issue 1) under this JIRA issue.


            People

            • Assignee: Unassigned
            • Reporter: Oleksandr Alesinskyy
            • Votes: 6
            • Watchers: 4