Derby
  1. Derby
  2. DERBY-4007

Optimization of IN with nested SELECT

    Details

    • Type: Improvement Improvement
    • Status: Open
    • Priority: Minor Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.4.2.0
    • Fix Version/s: None
    • Component/s: SQL
    • Environment:
      Linux
    • Urgency:
      Normal
    • Issue & fix info:
      Repro attached
    • Bug behavior facts:
      Performance

      Description

      The problem is with the following query:

      UPDATE summa_records SET base='foobar' WHERE id IN ( SELECT parentId FROM summa_relations WHERE childId='horizon_2615441');

      It takes in the order of 30s to run when we expect something in the order of 1-2ms.

      We have a setup with two tables

      summa_records: 1,5M rows
      summa_relations: ~350000 rows

      summa_records have and 'id' column that is also indexed and is the primary key. The summa_relations table holds mappings between different ids.

      In our case the nested SELECT produces 2 hits, say, 'foo' and 'bar'. So the UPDATE on these two hits should be quite snappy. If we run the SELECT alone it runs in an instant, and also if we run with hardcoded ids for the IN clause:

      UPDATE summa_records SET base='foobar' WHERE id IN ('foo', 'bar');

      We have instant execution. I'll attach a query plan in a sec.

      1. CreateDatabase4007.java
        1 kB
        Knut Anders Hatlen
      2. dblook_p_index.log
        1.0 kB
        Mikkel Kamstrup Erlandsen
      3. dblook.log
        1.0 kB
        Mikkel Kamstrup Erlandsen
      4. derby_p_index.log
        5 kB
        Mikkel Kamstrup Erlandsen
      5. derby.log
        4 kB
        Mikkel Kamstrup Erlandsen

        Activity

        Hide
        Knut Anders Hatlen added a comment -

        Triaged for 10.5.2. Reclassified to improvement since the behaviour is correct, although slower than desired.

        Show
        Knut Anders Hatlen added a comment - Triaged for 10.5.2. Reclassified to improvement since the behaviour is correct, although slower than desired.
        Hide
        Knut Anders Hatlen added a comment -

        Attached is a java class (CreateDatabase4007.java) which creates a database with the schema Mikkel posted and populates it with about the same number of rows as he reported.

        When I try the update statement I see that it takes ~19 seconds.

        The time is reduced to ~4.5 seconds if I create an extra index:

        create unique index cp on summa_relations(childid, parentid);

        This index is more efficient than the existing PC index when there's a restriction on childid. However, the statement still performs a full scan of SUMMA_RECORDS, which sounds sub-optimal.

        Show
        Knut Anders Hatlen added a comment - Attached is a java class (CreateDatabase4007.java) which creates a database with the schema Mikkel posted and populates it with about the same number of rows as he reported. When I try the update statement I see that it takes ~19 seconds. The time is reduced to ~4.5 seconds if I create an extra index: create unique index cp on summa_relations(childid, parentid); This index is more efficient than the existing PC index when there's a restriction on childid. However, the statement still performs a full scan of SUMMA_RECORDS, which sounds sub-optimal.
        Hide
        Romi Ou added a comment -

        I ran into this issue too recently.
        Here is my table schema:
        create table TRADE_HISTORY (SEQID bigint not null unique, TRADEID varchar(16) not null, VERSION integer, STREAM varchar(20), TIMESTAMP varchar(31) not null, STATE char(1) default 'V', COMMENT varchar(512), CONTENT varchar(512), unique(TRADEID,DATASTREAMREF));

        I added 50k records into the table. The SEQID of the records are from 0 to 49999.
        Updated the STATE of 10k records:
        update TRADE_HISTORY set STATE='E' where seqid >=40000 and seqid < 50000;

        Then I query the table with below sql:
        ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>=
        40000 and seqid<40100 group by TRADEID) ;
        1
        -----------
        100

        1 row selected
        ELAPSED TIME = 2375 milliseconds
        ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>=
        40000 and seqid<40200 group by TRADEID) ;
        1
        -----------
        200

        1 row selected
        ELAPSED TIME = 9875 milliseconds
        ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>=
        40000 and seqid<40400 group by TRADEID) ;
        1
        -----------
        400

        1 row selected
        ELAPSED TIME = 38812 milliseconds

        After optimizing the query sql, the performance is much better.
        ij> select count from (select * from TRADE_HISTORY where ( STATE= 'E' ) as errorTabl
        e, (select max(SEQID) maxSeqID from TRADE_HISTORY where seqid>=40000 and seqid<40400 group by TRADEID) as tempT where er
        rorTable.SEQID =tempT.maxSeqID ;
        1
        -----------
        400

        1 row selected
        ELAPSED TIME = 47 milliseconds

        By referring to Tuning Derby Doc (http://db.apache.org/derby/docs/10.4/tuning/tuning-single.html#rtuntransform582), I noticed that there is a "Simple IN predicate transformations" in Derby.
        I'm not quite sure about the root cause of this issue.
        Is it estimated cost or the sql is transferred to another sql?

        Show
        Romi Ou added a comment - I ran into this issue too recently. Here is my table schema: create table TRADE_HISTORY (SEQID bigint not null unique, TRADEID varchar(16) not null, VERSION integer, STREAM varchar(20), TIMESTAMP varchar(31) not null, STATE char(1) default 'V', COMMENT varchar(512), CONTENT varchar(512), unique(TRADEID,DATASTREAMREF)); I added 50k records into the table. The SEQID of the records are from 0 to 49999. Updated the STATE of 10k records: update TRADE_HISTORY set STATE='E' where seqid >=40000 and seqid < 50000; Then I query the table with below sql: ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>= 40000 and seqid<40100 group by TRADEID) ; 1 ----------- 100 1 row selected ELAPSED TIME = 2375 milliseconds ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>= 40000 and seqid<40200 group by TRADEID) ; 1 ----------- 200 1 row selected ELAPSED TIME = 9875 milliseconds ij> select count from TRADE_HISTORY where STATE= 'E' and SEQID in (select max(SEQID) from TRADE_HISTORY where seqid>= 40000 and seqid<40400 group by TRADEID) ; 1 ----------- 400 1 row selected ELAPSED TIME = 38812 milliseconds After optimizing the query sql, the performance is much better. ij> select count from (select * from TRADE_HISTORY where ( STATE= 'E' ) as errorTabl e, (select max(SEQID) maxSeqID from TRADE_HISTORY where seqid>=40000 and seqid<40400 group by TRADEID) as tempT where er rorTable.SEQID =tempT.maxSeqID ; 1 ----------- 400 1 row selected ELAPSED TIME = 47 milliseconds By referring to Tuning Derby Doc ( http://db.apache.org/derby/docs/10.4/tuning/tuning-single.html#rtuntransform582 ), I noticed that there is a "Simple IN predicate transformations" in Derby. I'm not quite sure about the root cause of this issue. Is it estimated cost or the sql is transferred to another sql?
        Hide
        Mikkel Kamstrup Erlandsen added a comment -

        The dblook dump for the database with the PC ((parentId,childId) on summa_relations) index dropped and replaced by a non-unique index P om parentId only.

        In this setup the query still takes about 20-30s to complete.

        Show
        Mikkel Kamstrup Erlandsen added a comment - The dblook dump for the database with the PC ((parentId,childId) on summa_relations) index dropped and replaced by a non-unique index P om parentId only. In this setup the query still takes about 20-30s to complete.
        Hide
        Mikkel Kamstrup Erlandsen added a comment -

        This is the query plan where I replaced the 'PC' index ((parentId,childId) index on summa_relations) with a non-unique index P on the parentId column of summa_relations.

        Show
        Mikkel Kamstrup Erlandsen added a comment - This is the query plan where I replaced the 'PC' index ((parentId,childId) index on summa_relations) with a non-unique index P on the parentId column of summa_relations.
        Hide
        Mikkel Kamstrup Erlandsen added a comment -

        Output of dblook attached

        Show
        Mikkel Kamstrup Erlandsen added a comment - Output of dblook attached
        Show
        Knut Anders Hatlen added a comment - Link to discussion about the issue on derby-user: http://mail-archives.apache.org/mod_mbox/db-derby-user/200901.mbox/%3c1231761599.3683.18.camel@pc280%3e
        Hide
        Mikkel Kamstrup Erlandsen added a comment -

        I just tested with 10.4.2.0 and the problem persists.

        Show
        Mikkel Kamstrup Erlandsen added a comment - I just tested with 10.4.2.0 and the problem persists.
        Hide
        Mikkel Kamstrup Erlandsen added a comment -

        Log with query plan as promised

        Show
        Mikkel Kamstrup Erlandsen added a comment - Log with query plan as promised

          People

          • Assignee:
            Unassigned
            Reporter:
            Mikkel Kamstrup Erlandsen
          • Votes:
            1 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:

              Development