
SOLR-812: JDBC optimizations: setReadOnly, setMaxRows

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Labels: None

      Description

      I'm looking at the DataImport code as of Solr 1.3, using it with Postgres and very large data sets, and I have some improvement suggestions.

      1. Call setReadOnly(true) on the connection. DIH doesn't change the data, so this is obvious.

      2. Call setAutoCommit(false) on the connection. (This is needed by Postgres to ensure that the fetchSize hint actually works.)

      3. Call setMaxRows(X) on the statement when the dataimport.jsp debugger is only grabbing X rows. fetchSize is just a hint, and alone it isn't sufficient.
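
      As a minimal sketch of the three suggestions above (illustrative only; this is not the actual DIH code, and the variable names and fetch size are just placeholders), inside some method the calls would look roughly like:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        Connection conn = DriverManager.getConnection(url, user, password);
        conn.setReadOnly(true);      // (1) DIH never modifies data, so mark the connection read-only
        conn.setAutoCommit(false);   // (2) Postgres only honours the fetchSize hint inside a transaction
        Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                              ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(500);      // just a hint; the driver may ignore it
        stmt.setMaxRows(x);          // (3) hard upper bound when the debugger only needs x rows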

      Attachments

      1. SOLR-812.patch (8 kB) Shalin Shekhar Mangar
      2. SOLR-812.patch (8 kB) Shalin Shekhar Mangar

        Activity

        Noble Paul added a comment -

        1. call setReadOnly(true) on the connection. DIH doesn't change the data so this is obvious.

        2. call setAutoCommit(false) on the connection. (this is needed by Postgres to ensure that the fetchSize hint actually works)

        Actually, JdbcDataSource does allow the user to pass any query (even an update or a delete), so setting it read-only by default could break that usage. Maybe we should add readOnly and autoCommit attributes to JdbcDataSource; it would then be possible to turn them on by default. I am just wondering whether users are already using it to write as well.

        Glen Newton added a comment -

        You might also think about setting
        setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED)
        (see: http://media.datadirect.com/download/docs/ddxquery/allddxq/reference/wwhelp/wwhimpl/common/html/wwhelp.htm?context=Reference&file=grammarwrite7.html)

        Only DB2 appears to support TRANSACTION_NONE

        While, going by the previous comment, JdbcDataSource appears not to be exclusively a "source", you should still consider also setting:
        setHoldability(ResultSet.CLOSE_CURSORS_AT_COMMIT)

        It may reduce the driver resource usage.

        Certainly, exposing these and the above in JdbcDataSource via properties would be more flexible for users, but sensible defaults should be set for read-only.

        Glen Newton added a comment -

        This is a related issue, but since I just got involved with Solr yesterday and got a jira account today, I am reluctant to make a career-limiting error!

        If it is indeed valid, perhaps someone else can make it a full-fledged separate issue!

        Perusing: JdbcDataSource @version $Id: JdbcDataSource.java 696539 2008-09-18 02:16:26Z ryan
        Issue: MySQL fetchSize driver bug

        Both in my experience and according to the post below, MySQL does not properly handle any fetchSize > Integer.MIN_VALUE; the entire ResultSet is transferred and loaded into memory, which for large ResultSets can result in an out-of-memory error.

        http://benjchristensen.wordpress.com/2008/05/27/mysql-jdbc-memory-usage-on-large-resultset/

        In JdbcDataSource.java:
        175: stmt.setFetchSize(batchSize);

        where
        57: private int batchSize = FETCH_SIZE;

        and
        326: private static final int FETCH_SIZE = 500;

        As is, this code will trigger this bug in MySQL for large ResultSets.
        Even for smaller ResultSets that do not cause an out-of-memory error, keeping the whole ResultSet in memory uses memory unnecessarily.

        The work around for this MySQL issue is:
        stmt.setFetchSize(Integer.MIN_VALUE);
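
        For reference, a sketch of the full streaming recipe from the Connector/J documentation (the statement must also be created forward-only and read-only, or the driver still buffers the whole ResultSet; variable names are illustrative):

          // Streaming mode in MySQL Connector/J: forward-only, read-only,
          // and fetchSize == Integer.MIN_VALUE; rows are then read one at a time.
          Statement stmt = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY,
                                                ResultSet.CONCUR_READ_ONLY);
          stmt.setFetchSize(Integer.MIN_VALUE);
          ResultSet rs = stmt.executeQuery(query);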

        From the blog entry, see also:

        http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html
        http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html

        Shalin Shekhar Mangar added a comment -

        Thanks for going through this, Glen. This bug and a workaround are documented on the FAQ page.

        http://wiki.apache.org/solr/DataImportHandlerFaq

        Shalin Shekhar Mangar added a comment -

        Certainly exposing these and the above in JdbcDataSource via properties would be more flexible to users. But sensible defaults should be set for read-only.

        Any extra attributes that you specify on the <dataSource> element are passed on to the DriverManager in a Properties object, so if your driver supports URL parameters for these optimizations, you can use them right now. However, each driver has a different way of specifying this configuration, so we should support a way of making these settings configurable.

        I'm not very comfortable with making changes to the default configuration for the sake of backwards-compatibility. But we should make these possible and document their usage.
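
        As a hypothetical example of passing a driver-specific property through the <dataSource> element (useCursorFetch is a MySQL Connector/J property and is only illustrative here; check your own driver's documentation for its parameter names):

          <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver"
                      url="jdbc:mysql://host/db" user="user" password="password"
                      useCursorFetch="true"/>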

        Shalin Shekhar Mangar added a comment -

        Changes

        1. Supports the following through configuration (extra attributes in <dataSource> element or solrconfig.xml):
          1. readOnly
          2. autoCommit
          3. transactionIsolation
          4. holdability
          5. maxRows
        2. If readOnly is specified the following are added by default (but they will be overridden if specified explicitly):
          setAutoCommit(true);
          setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
          setHoldability(ResultSet.CLOSE_CURSORS_AT_COMMIT);
          
        3. If 'start' and 'rows' are specified as request parameters, then setMaxRows(start + rows) is called automatically, overriding the value specified in configuration
        4. No changes are made unless configuration is specified, so it is backwards compatible.
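
        A rough sketch of how the read-only defaults in item 2 might be applied when the connection is created (names and attribute handling simplified; the attached patch is the authoritative version):

          // Sketch only -- simplified compared to the attached patch.
          // Requires java.sql.Connection, java.sql.ResultSet, java.sql.SQLException, java.util.Properties.
          void applyReadOnlyDefaults(Connection conn, Properties attrs) throws SQLException {
            if (Boolean.parseBoolean(attrs.getProperty("readOnly"))) {
              conn.setReadOnly(true);
              // sane defaults for read-only mode (the real code lets explicit
              // attributes override each of these)
              conn.setAutoCommit(true);
              conn.setTransactionIsolation(Connection.TRANSACTION_READ_UNCOMMITTED);
              conn.setHoldability(ResultSet.CLOSE_CURSORS_AT_COMMIT);
            }
          }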

        I'd like to commit this in a day or two. We also need to add this documentation to the wiki page.

        Shalin Shekhar Mangar added a comment -

        Fixing a bug in the way maxRows is calculated: it was not being set if 'rows' was specified but 'start' was not.
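
        Illustratively, the fix presumably amounts to treating a missing 'start' as 0 rather than skipping the call (a sketch, not the actual patch; 'start' and 'rows' stand for the parsed request parameters as Integers):

          // 'start' defaults to 0 when absent, so maxRows is still set when only 'rows' is given
          if (rows != null) {
            stmt.setMaxRows((start != null ? start : 0) + rows);
          }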

        Shalin Shekhar Mangar added a comment -

        Committed revision 726241.

        Just to re-iterate, there are no changes to the default configuration except that maxRows is set automatically if the user specifies the 'start' and 'rows' parameters. Read-only mode has to be enabled explicitly, and then sane defaults for auto-commit, isolation and holdability are added. We can add the readOnly flag to our example configurations on the wiki.

        Thanks David and Glen!

        Martin Davidsson added a comment -

        I'm running postgres as well and I finally got rid of my memory issues with this data source configuration:

        <dataSource type="JdbcDataSource" driver="org.postgresql.Driver" url="jdbc:postgresql://host/db" user="user" password="password" readOnly="true" autoCommit="false" transactionIsolation="TRANSACTION_READ_COMMITTED" holdability="CLOSE_CURSORS_AT_COMMIT"/>

        I'm using v8.3 of the Postgres server and JDBC driver.

        Shalin Shekhar Mangar added a comment -

        I'm running postgres as well and I finally got rid of my memory issues with this data source configuration:

        Thanks for the information Martin. Can you please add this to the DataImportHandlerFaq wiki page?

        http://wiki.apache.org/solr/DataImportHandlerFaq

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee: Shalin Shekhar Mangar
          • Reporter: David Smiley
          • Votes: 0
          • Watchers: 1
