OpenJPA
  1. OpenJPA
  2. OPENJPA-703

Cache ResultObjectProvider data to improve query performance

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.0-M2
    • Fix Version/s: 2.0.0-M3
    • Component/s: kernel
    • Labels:
      None

      Description

      Profiling indicated that JDBCStoreQuery.populateSelect consumes a significant amount of CPU, and is executed every time a query is run. While, in fact, the actual PreparedStatement is created and run only in QueryImpl.toResult. It seems like the returned ResultObjectProvider from JDBCStoreQuery.executeQuery can be at least partially cached, or even cached in its entirety (provided care is taken with the context parameters).
      It seems like such an improvement would significantly improve query performance.

        Issue Links

          Activity

          Hide
          Ron Pressler added a comment -

          Actually, I now realize this has a direct connection to OPENJPA-407, though I'm not sure if that fix addresses the JPQL issue discussed here.

          Show
          Ron Pressler added a comment - Actually, I now realize this has a direct connection to OPENJPA-407 , though I'm not sure if that fix addresses the JPQL issue discussed here.
          Hide
          Pinaki Poddar added a comment -

          It always looks promising from performance perspectives to cache stuff that takes long time to compute. But caution must be exercised to cache query results (as opposed to parsed query such as PreparedStatement). Because time of execution is extremely significant for result of a query. Caching a result may lead to erroneous behavior because a cached version of data collected at time T1 may not reflect the actual result at T2 when T2 > T1.

          As far as caching parsed query itself goes, as the experience with OPENJPA-407 shows, is a non-trivial undertaking because of the context of query execution. To cache we need a key and we have made up a QueryKey. The content of the key dictates what we consider as the context of execution. But care must be taken to define the context. For example, as FetchPlan not being part of the query key can lead to erroneous behavior as documented in [1]. Such caching also limits or at least makes it more coupled to add more context information to a query.

          http://issues.apache.org/jira/browse/OPENJPA-671

          Show
          Pinaki Poddar added a comment - It always looks promising from performance perspectives to cache stuff that takes long time to compute. But caution must be exercised to cache query results (as opposed to parsed query such as PreparedStatement). Because time of execution is extremely significant for result of a query. Caching a result may lead to erroneous behavior because a cached version of data collected at time T1 may not reflect the actual result at T2 when T2 > T1. As far as caching parsed query itself goes, as the experience with OPENJPA-407 shows, is a non-trivial undertaking because of the context of query execution. To cache we need a key and we have made up a QueryKey. The content of the key dictates what we consider as the context of execution. But care must be taken to define the context. For example, as FetchPlan not being part of the query key can lead to erroneous behavior as documented in [1] . Such caching also limits or at least makes it more coupled to add more context information to a query. http://issues.apache.org/jira/browse/OPENJPA-671
          Hide
          Ron Pressler added a comment -

          I was referring to caching only the parsed query, not the results (interestingly enough, the ResultObjectProvider returned from JDBCStoreQuery.executeQuery does not, in fact, contain any results).
          As for the fetch plan - if it indeed affects the generated SQL (as I assume it does), then it must be part of the cache key. It does not, however, make sense that, in a very common use-case. when I execute the same query over and over in my application (using the same fetch plan obviously), the SQL statement needs to be generated each time by analyzing the metadata of an entire class hierarchy. This was shown in practice to consume a lot of CPU resources.

          Show
          Ron Pressler added a comment - I was referring to caching only the parsed query, not the results (interestingly enough, the ResultObjectProvider returned from JDBCStoreQuery.executeQuery does not, in fact, contain any results). As for the fetch plan - if it indeed affects the generated SQL (as I assume it does), then it must be part of the cache key. It does not, however, make sense that, in a very common use-case. when I execute the same query over and over in my application (using the same fetch plan obviously), the SQL statement needs to be generated each time by analyzing the metadata of an entire class hierarchy. This was shown in practice to consume a lot of CPU resources.
          Hide
          Ron Pressler added a comment -

          Come to think of it, for queries there's yet a simpler solution. Don't cache the generated SQL in a shared cache - simply return it along with the "compiled query" in the Query object returned to the user. If the fetch plan changes - discard the statement and regenerate it during the next execution. True, this does not implement a true cache, but it will provide significant performance improvements in the common case.
          I just don't like seeing the SQL generation process taking over 30% of my program's execution time when I simply execute the SAME QUERY OBJECT over and over again.
          I would have loved to fix this issue myself, I'm just not yet familiar enough with the inner workings of OpenJPA. Getting there.

          Show
          Ron Pressler added a comment - Come to think of it, for queries there's yet a simpler solution. Don't cache the generated SQL in a shared cache - simply return it along with the "compiled query" in the Query object returned to the user. If the fetch plan changes - discard the statement and regenerate it during the next execution. True, this does not implement a true cache, but it will provide significant performance improvements in the common case. I just don't like seeing the SQL generation process taking over 30% of my program's execution time when I simply execute the SAME QUERY OBJECT over and over again. I would have loved to fix this issue myself, I'm just not yet familiar enough with the inner workings of OpenJPA. Getting there.
          Hide
          Pinaki Poddar added a comment -

          > I just don't like seeing the SQL generation process taking over 30% of my program's execution time when I simply execute the SAME QUERY OBJECT over and over again.
          Agreed. The issue with the proposed solution, however, is that the query instances are bound to a persistence context (aka. EntityManager) – unless one is reusing the same EntityManager - an unlikely scenario in JEE container environment – associating generated SQL with persistence context is not going to be fruitful as the context will live within a single transaction and then disappear.

          I have now introduced a mechanism similar to your proposed idea to cache the SQLs generated by OpenJPA. The cached SQL is indexed by the original JPQL or NamedQuery moniker at EntityManagerFactory level.

          The advantages of the approach are manifold:
          1. this caching mechanism caches any query. Not only findBy() as is the case with an earlier attempt to SQL level query caching.
          2. it provides a longer life span of the cache and hence same JPQL query Q when executed by different EntityManagers can be now reused efficiently.
          3. the key for the query is natural and indirectly specified by the user to be unique as the key is the JPQL itself or the NamedQuery moniker. So there is neither any overhead of computing the query key nor there is any code coupling as context of what makes a query unique grows.
          4. the mechanism is orthogonal and unobtrusive to OpenJPA kernel as well as the user. The JPQL queries during first execution gets translated to SQL and any subsequent execution of the query within the same persistent unit (not context) reuse the generated SQL directly, bypassing the entire query formation (that 30% you mentioned) logic of the kernel.
          5. The critical (and slightly hairy) issue is the parameter binding. As JPQL allows positional as well as named parameter binding but SQL allows only positional binding, hence a bit of care must be exercised to ensure that parameter binding remains consistent as we translate a JPQL to SQL under the hood.

          Initial performance result with this simple mechanics is promising and presented below:
          1. All measurements are taken on my laptop
          2. Logging was turned off
          3. The database was empty – so the query gets executed but no actual record is selected. This is done purposefully to accentuate the incremental benefit of query construction vs query caching.
          4. All measurements are taken by 100 repeated executions of each query

          ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          Query Time taken in ms %Improvement
          without cache with cache
          t1 t2 PCT[(t1-t2)/t1]
          ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
          1. select p from PObject p 625 578 +7%
          SELECT t0.id, t0.version, t0.name FROM PObject t0

          2. select p from PObject p where p.name = 'PObject' 562 516 +8%
          SELECT t0.id, t0.version, t0.name FROM PObject t0 WHERE (t0.name = 'PObject')

          3. select p from PObject p where p.name = :param 531 500 +6%
          SELECT t0.id, t0.version, t0.name FROM PObject t0 WHERE (t0.name = ?)

          4. select e from Employee e where e.name = :emp and e.department.name = :dept 594 546 +8%
          and e.department.company.name = :company and e.address.zip = :zip
          SELECT t0.id, t3.id, t3.city, t3.state, t3.street, t3.zip, t1.id, t2.id, t2.name, t1.name, t0.name
          FROM Employee t0 INNER JOIN Department t1 ON t0.DEPARTMENT_ID = t1.id
          INNER JOIN Address t3 ON t0.ADDRESS_ID = t3.id
          INNER JOIN Company t2 ON t1.COMPANY_ID = t2.id
          WHERE (t0.name = ? AND t1.name = ? AND t2.name = ? AND t3.zip = ?)
          -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

          The current limitation is:
          a) queries that result into more than one SQL statements (e.g. union or parallel queries) are not amenable to caching.
          b) no cache invalidation on fetch plan changes – but the existing caching does not address that issue either.

          An earlier attempt to caching SQL query (as configured by jdbc.QuerySQLCache) is based on associating a SQL string to the implementation-internal Select objects and caching the Select objects themselves. This current work does overlap with this effort and hence some explanation is called for on why I have considered an alternative implementation for SQL query caching while one already exists. My primary concern with existing implementation for query caching are as follows:
          a) it is non-orthogonal to other design constructs. For example, when a relation is being loaded (a very critical and common code logic) the code becomes aware whether query caching is in effect and, more critically, takes a distinctly different path.
          b) the code logic of JDBCStoreManager – another critical component of OpenJPA – also becomes aware of query caching and starts bifurcating code paths. This approach can lead to poor maintainability and consistency in future.
          c) it only works for findBy() and trying to generalize it for other queries will spread the code to many other mapping strategies – accentuating the above mentioned design concerns.
          d) it does need to compute a Select key and also caches a Select – a very heavy object – causing a poor memory usage profile.

          I did not find any test case that measure performance gain of this existing approach. Can anyone (whoever has cared to read up to this point please point me to any performance result obtained with existing caching implementation so that I can do a fair comparative performance evaluation of the existing approach and this current one?

          Show
          Pinaki Poddar added a comment - > I just don't like seeing the SQL generation process taking over 30% of my program's execution time when I simply execute the SAME QUERY OBJECT over and over again. Agreed. The issue with the proposed solution, however, is that the query instances are bound to a persistence context (aka. EntityManager) – unless one is reusing the same EntityManager - an unlikely scenario in JEE container environment – associating generated SQL with persistence context is not going to be fruitful as the context will live within a single transaction and then disappear. I have now introduced a mechanism similar to your proposed idea to cache the SQLs generated by OpenJPA. The cached SQL is indexed by the original JPQL or NamedQuery moniker at EntityManagerFactory level. The advantages of the approach are manifold: 1. this caching mechanism caches any query. Not only findBy() as is the case with an earlier attempt to SQL level query caching. 2. it provides a longer life span of the cache and hence same JPQL query Q when executed by different EntityManagers can be now reused efficiently. 3. the key for the query is natural and indirectly specified by the user to be unique as the key is the JPQL itself or the NamedQuery moniker. So there is neither any overhead of computing the query key nor there is any code coupling as context of what makes a query unique grows. 4. the mechanism is orthogonal and unobtrusive to OpenJPA kernel as well as the user. The JPQL queries during first execution gets translated to SQL and any subsequent execution of the query within the same persistent unit (not context) reuse the generated SQL directly, bypassing the entire query formation (that 30% you mentioned) logic of the kernel. 5. The critical (and slightly hairy) issue is the parameter binding. As JPQL allows positional as well as named parameter binding but SQL allows only positional binding, hence a bit of care must be exercised to ensure that parameter binding remains consistent as we translate a JPQL to SQL under the hood. Initial performance result with this simple mechanics is promising and presented below: 1. All measurements are taken on my laptop 2. Logging was turned off 3. The database was empty – so the query gets executed but no actual record is selected. This is done purposefully to accentuate the incremental benefit of query construction vs query caching. 4. All measurements are taken by 100 repeated executions of each query --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Query Time taken in ms %Improvement without cache with cache t1 t2 PCT [(t1-t2)/t1] --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 1. select p from PObject p 625 578 +7% SELECT t0.id, t0.version, t0.name FROM PObject t0 2. select p from PObject p where p.name = 'PObject' 562 516 +8% SELECT t0.id, t0.version, t0.name FROM PObject t0 WHERE (t0.name = 'PObject') 3. select p from PObject p where p.name = :param 531 500 +6% SELECT t0.id, t0.version, t0.name FROM PObject t0 WHERE (t0.name = ?) 4. select e from Employee e where e.name = :emp and e.department.name = :dept 594 546 +8% and e.department.company.name = :company and e.address.zip = :zip SELECT t0.id, t3.id, t3.city, t3.state, t3.street, t3.zip, t1.id, t2.id, t2.name, t1.name, t0.name FROM Employee t0 INNER JOIN Department t1 ON t0.DEPARTMENT_ID = t1.id INNER JOIN Address t3 ON t0.ADDRESS_ID = t3.id INNER JOIN Company t2 ON t1.COMPANY_ID = t2.id WHERE (t0.name = ? AND t1.name = ? AND t2.name = ? AND t3.zip = ?) ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- The current limitation is: a) queries that result into more than one SQL statements (e.g. union or parallel queries) are not amenable to caching. b) no cache invalidation on fetch plan changes – but the existing caching does not address that issue either. An earlier attempt to caching SQL query (as configured by jdbc.QuerySQLCache) is based on associating a SQL string to the implementation-internal Select objects and caching the Select objects themselves. This current work does overlap with this effort and hence some explanation is called for on why I have considered an alternative implementation for SQL query caching while one already exists. My primary concern with existing implementation for query caching are as follows: a) it is non-orthogonal to other design constructs. For example, when a relation is being loaded (a very critical and common code logic) the code becomes aware whether query caching is in effect and, more critically, takes a distinctly different path. b) the code logic of JDBCStoreManager – another critical component of OpenJPA – also becomes aware of query caching and starts bifurcating code paths. This approach can lead to poor maintainability and consistency in future. c) it only works for findBy() and trying to generalize it for other queries will spread the code to many other mapping strategies – accentuating the above mentioned design concerns. d) it does need to compute a Select key and also caches a Select – a very heavy object – causing a poor memory usage profile. I did not find any test case that measure performance gain of this existing approach. Can anyone (whoever has cared to read up to this point please point me to any performance result obtained with existing caching implementation so that I can do a fair comparative performance evaluation of the existing approach and this current one?
          Hide
          Pinaki Poddar added a comment -

          This commit introduces few changes that may impact other developers.

          1. The identity type for a Query Parameter expression has changed from String to Object. This changes are reflected in
          /openjpa/trunk/openjpa-kernel/src/main/java/org/apache/openjpa/kernel/exps/Parameter.java and all its subclasses.

          2. Query parsing preserves the runtime type of the parameter keys. The same parameter keys must be used as the key of the returned map from the following method of StoreQuery.Executor

          public LinkedMap getParameterTypes(StoreQuery q);

          This impact is shown in JPQLExpressionBuilder.getParameters().

          Both the above changes to preserve the actual type of the parameter key (numeric or String) is important to identify them later when a query is executed with a new set of user parameter values.

          3. As QueryExpression Tree is walked and each node contributes to fill in the SQLBuffer, the user defined parameters must distinguish themselves from any other SQL parameters being inserted in SQLBuffer. This distinction is critical for reparameterizing the cached SQL with a new set of user parameter values while keeping the internal parameter values intact. This aspect add a new overloaded SQLBuffer.appendValue(Object, Column, Parameter) method.

          Show
          Pinaki Poddar added a comment - This commit introduces few changes that may impact other developers. 1. The identity type for a Query Parameter expression has changed from String to Object. This changes are reflected in /openjpa/trunk/openjpa-kernel/src/main/java/org/apache/openjpa/kernel/exps/Parameter.java and all its subclasses. 2. Query parsing preserves the runtime type of the parameter keys. The same parameter keys must be used as the key of the returned map from the following method of StoreQuery.Executor public LinkedMap getParameterTypes(StoreQuery q); This impact is shown in JPQLExpressionBuilder.getParameters(). Both the above changes to preserve the actual type of the parameter key (numeric or String) is important to identify them later when a query is executed with a new set of user parameter values. 3. As QueryExpression Tree is walked and each node contributes to fill in the SQLBuffer, the user defined parameters must distinguish themselves from any other SQL parameters being inserted in SQLBuffer. This distinction is critical for reparameterizing the cached SQL with a new set of user parameter values while keeping the internal parameter values intact. This aspect add a new overloaded SQLBuffer.appendValue(Object, Column, Parameter) method.
          Hide
          Pinaki Poddar added a comment -

          This document describes the main issues and how they have been addressed in designing Prepared Query Cache – which bypasses normal pathway of JPQL query execution by reusing the target SQL and other internal constructs.

          One of the design goal is to minimize intrusion to current pathways. The outcome is that the only change a OpenJPA developer must keep in mind is
          while contributing to a SQLBuffer from the Query Expression Tree nodes is to distinguish between a user-supplied query parameter from an internally
          inserted parameter (such as Literal or Const).

          Show
          Pinaki Poddar added a comment - This document describes the main issues and how they have been addressed in designing Prepared Query Cache – which bypasses normal pathway of JPQL query execution by reusing the target SQL and other internal constructs. One of the design goal is to minimize intrusion to current pathways. The outcome is that the only change a OpenJPA developer must keep in mind is while contributing to a SQLBuffer from the Query Expression Tree nodes is to distinguish between a user-supplied query parameter from an internally inserted parameter (such as Literal or Const).
          Hide
          Catalina Wei added a comment -

          The JPQL queries are not cached until PreparedResultObjectProvider and SQLProjectResultObjectProvider can handle complex query results.
          Queries having following characteristics are not cached:

          1. resulting entity class is not candidate type,
          select e.dept from Employee e
          2: query returning any embeddable class
          select e.address from Employee e
          3. query returning more than one entity class
          select d, e from Dept d, in (d.emps) e
          4. query contains ORDER BY clause
          select e.name from Employee e order by e.salary
          5. query returning basic types but SQL result columns don't match with JPQL selections
          select e.name, e.name from Employee e

          6. JPQL query returning boolean value where SQL returning 0 or 1
          select e.name, e.isManager from Employee e

          The revision r773404 has code in place that detects the characteristics of queries and not caching them,
          It is to avoid cached query returning wrong answer.

          The query cache-ability detection logic may not complete (
          the query example 6 is currently cached and give wrong answer).

          This issue is reopened to re-evaluate the cache-ability of the queries having the characteristics as listed above.

          Show
          Catalina Wei added a comment - The JPQL queries are not cached until PreparedResultObjectProvider and SQLProjectResultObjectProvider can handle complex query results. Queries having following characteristics are not cached: 1. resulting entity class is not candidate type, select e.dept from Employee e 2: query returning any embeddable class select e.address from Employee e 3. query returning more than one entity class select d, e from Dept d, in (d.emps) e 4. query contains ORDER BY clause select e.name from Employee e order by e.salary 5. query returning basic types but SQL result columns don't match with JPQL selections select e.name, e.name from Employee e 6. JPQL query returning boolean value where SQL returning 0 or 1 select e.name, e.isManager from Employee e The revision r773404 has code in place that detects the characteristics of queries and not caching them, It is to avoid cached query returning wrong answer. The query cache-ability detection logic may not complete ( the query example 6 is currently cached and give wrong answer). This issue is reopened to re-evaluate the cache-ability of the queries having the characteristics as listed above.

            People

            • Assignee:
              Pinaki Poddar
              Reporter:
              Ron Pressler
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development