DERBY-3937

Select count(*) scans all the rows (and is therefore slow with big tables); is the number of rows not available/known, for example in an index?

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Store
    • Labels:
      None
    • Environment:
      Any
    • Bug behavior facts:
      Performance

      Description

      Create a table with 5,000,000 rows and create an index on a unique ID. A select count(*) on such a table is going to take quite some time.

      Shouldn't the index contain the number of indexed rows, so that the value can be taken from there?

      Additionally, queries of the form select count(*) from table where col1=value; take a lot of time (depending on the number of rows satisfying the WHERE clause) even if an index on col1 exists. Isn't it possible to find the first and last occurrence in the index, and then calculate the number of rows more efficiently than by scanning through all of them?

        Activity

        Mike Matrigali added a comment -

        I believe that the optimizer does consider doing a scan of an index vs. the base table for execution of the count. You should be able to tell from the query plan.
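        A minimal ij sketch of checking this, assuming a table T (SYSCS_SET_RUNTIMESTATISTICS and SYSCS_GET_RUNTIMESTATISTICS are standard SYSCS_UTIL routines; query output elided):

        ij> call SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
        0 rows inserted/updated/deleted
        ij> select count(*) from t;
        ...
        ij> -- the returned plan text shows whether an index scan or a table scan was chosen
        ij> values SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();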

        martin added a comment - edited

        > The current implementations of indexes and base tables do not maintain an exact count of rows.

        An index could be useful even if it does not maintain an "exact count". If scanning the index can provide the result, it should be faster than scanning the table itself. Is it feasible?
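        One way to experiment with this by hand is Derby's optimizer overrides; a sketch, assuming an index named T_ID_IDX exists on T (the override comment must sit at the end of the line after the table name, so the statement continues on the next line):

        ij> select count(*) from t --DERBY-PROPERTIES index=T_ID_IDX
        ;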

        Knut Anders Hatlen added a comment -

        I don't know how to access the table's estimated row count, but
        SYS.SYSSTATISTICS contains the cardinality statistics for non-unique
        indexes. These are not as accurate as the table's estimated row count,
        but if you call SYSCS_UTIL.SYSCS_UPDATE_STATISTICS every now and then,
        they should give a pretty good estimate. So provided that you have at
        least one non-unique index on the table, and the statistics have been
        created, a query such as this should give you the information you
        need:

        select st.statistics
        from
        sys.sysstatistics st, sys.sysschemas sc, sys.systables t
        where
        st.tableid = t.tableid and t.schemaid = sc.schemaid and
        sc.schemaname='APP' and t.tablename='T'

        The query should return something like "numunique= 3 numrows= 9",
        where numunique tells how many unique values the index contains, and
        numrows tells the total number of rows in the index, which should be
        the same as the number of rows in the table.
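        For completeness, a sketch of refreshing those statistics first, using the schema and table names assumed above (passing null as the third argument updates the statistics for all indexes on the table):

        ij> call SYSCS_UTIL.SYSCS_UPDATE_STATISTICS('APP', 'T', null);
        0 rows inserted/updated/deleted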

        Trejkaz added a comment -

        The query optimiser works on an estimate of the number of rows, not the actual number of rows.

        Related to this: in our case if we had a way to do APPROXCOUNT instead of COUNT, it would be fast and accurate "enough", as we are only using this on big tables to provide the user with an idea of how much stuff is left to do. It doesn't have to be the exact number of items, just in the ballpark. But we couldn't find a way to use the estimate, so we used COUNT and are thus bitten by this bug.

        martin added a comment -

        If the engine does not know the number of rows in a table, doesn't that hamper the query optimizer? The optimizer should work well on big tables, not just on small ones.

        Myrna van Lunteren added a comment -

        To work around this limitation, apart from what's already been replied (and you may have considered this already): if the exact, precise count is not essential (and select count is more habitual), and you don't have a lot of deletes, maybe an id/(generated) sequence number column - where one would take the highest sequence number to represent the count - would work in the application? (See the sketch below.)

        Or, if the exact count is essential and needs to be accessible in a snap: not according to database normalization rules, maybe, and definitely a 'crutch', but who knows, it might work for your application: a table that only has a counter, which gets updated on every insert & delete?
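        A minimal ij sketch of the sequence-column idea, assuming deletes are rare (the table and column names are illustrative):

        ij> create table items(id int generated always as identity primary key, payload varchar(100));
        0 rows inserted/updated/deleted
        ij> -- with few deletes, the highest identity value approximates the row count
        ij> select max(id) from items;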

        Martin Thelian added a comment -

        I have the same problem in my project. Executing a simple "select count(*)" on a table containing ~70 million entries took approximately 8 minutes.
        Using the trigger approach is not an option for me, because I'm using Derby via Hibernate, and using DB triggers would not allow me to keep my code database-independent.

        Any other suggestions for how to solve this problem? Thanks.

        Knut Anders Hatlen added a comment -

        > So back to my original question - is there any alternative to
        > COUNT and implementing paging over an ordered set of (million)
        > rows with constant performance towards the 'last' pages?

        One alternative is to create insert and delete triggers on the table
        and maintain the row count in a separate table. For example:

        ij> create table t(x int);
        0 rows inserted/updated/deleted
        ij> create table t_row_count(row_count int);
        0 rows inserted/updated/deleted
        ij> insert into t_row_count values 0;
        1 row inserted/updated/deleted
        ij> create trigger t_insert after insert on t for each row update t_row_count set row_count = row_count + 1;
        0 rows inserted/updated/deleted
        ij> create trigger t_delete after delete on t for each row update t_row_count set row_count = row_count - 1;
        0 rows inserted/updated/deleted
        ij> insert into t values 1,2,3;
        3 rows inserted/updated/deleted
        ij> select count(*) from t;
        1
        -----------
        3

        1 row selected
        ij> select row_count from t_row_count;
        ROW_COUNT
        -----------
        3

        1 row selected
        ij> insert into t values 4,5;
        2 rows inserted/updated/deleted
        ij> select row_count from t_row_count;
        ROW_COUNT
        -----------
        5

        1 row selected
        ij> delete from t where x > 2;
        3 rows inserted/updated/deleted
        ij> select row_count from t_row_count;
        ROW_COUNT
        -----------
        2

        1 row selected
        ij> select count(*) from t;
        1
        -----------
        2

        1 row selected

        Martin Hajduch added a comment -

        Ok, in the meantime I have read more about the topic and Derby.

        So back to my original question - is there any alternative to COUNT and implementing paging over an ordered set of (million) rows with constant performance towards the 'last' pages?
        If not at the moment - is there any development planned in this area (where I could possibly contribute as well)?
        If not, is there any development planned in the area of speeding up the index scan, so that a million rows take not 10 but, let's say, 1 second? Even such a constant-factor improvement would help in my particular case.

        I have tried several open source databases, found Derby to be the closest to my needs, and I am able to devote some resources in this direction - but I would like to get some pointers on what to concentrate on.
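        One application-level answer to the paging part of this question is keyset pagination; a sketch, assuming an indexed unique id column and illustrative names (the application remembers the last id of the previous page and reads just one page of rows from the ResultSet):

        ij> -- next page: rows that come after the last id seen on the previous page
        ij> select id, payload from items where id > 1000 order by id;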

        Mike Matrigali added a comment -

        Do not lump MIN/MAX together with COUNT; they represent a different problem.

        Derby can provide optimized performance for MIN/MAX, depending on the exact query, if you provide the correct backing index. For example, min(a) will be fast if there is an ascending index on a.
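        A minimal ij sketch of the case described above, assuming a table T with a column A:

        ij> create index a_idx on t(a);
        0 rows inserted/updated/deleted
        ij> -- min(a) can be answered from the first entry of the ascending index
        ij> select min(a) from t;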

        Martin Hajduch added a comment -

        Ok. I understand your explanation. At the same time, I can imagine that there are applications which need functions like COUNT/MIN/MAX to perform faster than by scanning through all rows. Also, some applications need to implement support for 'paging' - for which I haven't found any viable solution in Derby yet.

        Are there any plans for development in these directions? Does Derby have something 'else' that makes it possible to implement 'paging' over a huge number of rows? Or is this not a priority at the moment?

        Are some speedups possible? For example, a count of index entries per page? Or would that not bring any significant benefit over the current scan?

        Any idea how Oracle manages this?

        Mike Matrigali added a comment -

        The current implementations of indexes and base tables do not maintain an exact count of rows. While on the surface, for a single user using the table, this seems simple, there are some complicated things to take into account when implementing such a thing for a multi-user, transactional, row-locked system. Coordinating a single count across multiple users could create additional blocking locks, which many applications would not want.

        For an index, there is no way to tell how many rows there are given the first and last instance of a range in the index. The number of rows per page is variable. And often, physically, both the index and the base table contain entries that are actually marked deleted (whether committed deleted or not), which should not be included in such a count.


          People

          • Assignee:
            Unassigned
          • Reporter:
            Martin Hajduch
          • Votes:
            4
          • Watchers:
            6