
Add JDBCStream for integration with external data sources

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Implemented
    • Affects Version/s: None
    • Fix Version/s: 6.0
    • Component/s: SolrJ
    • Labels:
      None

      Description

      The Streaming API can already merge and join multiple incoming SolrStreams to perform complex operations on the combined datasets, so I think it would be beneficial to also support incoming streams from other data sources.

      The JDBCStream will provide a Streaming API interface to any data source which provides a JDBC driver.

      1. SOLR-8479.patch
        49 kB
        Dennis Gove
      2. SOLR-8479.patch
        43 kB
        Dennis Gove
      3. SOLR-8479.patch
        36 kB
        Dennis Gove
      4. SOLR-8479.patch
        25 kB
        Dennis Gove
      5. SOLR-8479.patch
        9 kB
        Dennis Gove

          Activity

          dpgove Dennis Gove added a comment -

          This is a first pass at the JDBCStream. There are still open questions and unimplemented pieces but I'm putting this out there to start the conversation. No tests are included.

          1. Currently it loads the JDBC driver by requiring that the driver class name be provided, and then calls

          Class.forName(driverClassName);
          

          during open(). I'm wondering if there's a better way to handle this, particularly if we can do the loading via config file handling.

          joel.bernstein Joel Bernstein added a comment -

          +1

          Some great possibilities here. One I really like is combining it with the UpdateStream:

          update(jdbc("select ...")) 
          

          A simple way to populate a SolrCloud collection from a JDBC compliant data store. Since many NoSQL engines now support SQL we'll be able to pull data from places like Spark, Couchbase and any RDBMS.

          dpgove Dennis Gove added a comment - - edited

          Adds some simple tests for the raw stream and as embedded inside a SelectStream and MergeStream where it is being merged with a CloudSolrStream.

          The tests are using the in-memory database hsqldb with driver "org.hsqldb.jdbcDriver". I chose this as it's already being used in a contrib module. I'm open to other options as I'm not a huge fan of this particular in-memory database.

          Still doesn't implement Expressible interface (next on my list).

          joel.bernstein Joel Bernstein added a comment -

          You could also use Solr's JDBC driver.

          joel.bernstein Joel Bernstein added a comment -
          
              String zkHost = zkServer.getZkAddress();
              Properties props = new Properties();
              Connection con = DriverManager.getConnection("jdbc:solr://" + zkHost + "?collection=collection1", props);
              Statement stmt = con.createStatement();
              ResultSet rs = stmt.executeQuery("select id, a_i, a_s, a_f from collection1 order by a_i desc limit 2");
          
          
          dpgove Dennis Gove added a comment -

          I considered that but I wanted to be sure the test covered non-Solr code bases. I think there's value in showing that a non-Solr external source can be used and functions as expected.

          joel.bernstein Joel Bernstein added a comment -

          We can probably just make the driver class name optional. If the parameter is present then call Class.forName(). If it's not then just skip this step.

          dpgove Dennis Gove added a comment -

          New patch with a few changes.

          1. Added some new tests
          2. Made driverClassName an optional property. If it is provided, we call Class.forName(driverClassName); during open(). Also added a call to DriverManager.getDriver(connectionUrl) during open() to validate that a driver can be found for the connection URL. If none is found, an exception is thrown, which prevents us from continuing when the JDBC driver is not loaded.
          3. Changed the default type handling so that Double is handled as a direct class while Float is converted to a Double. This keeps it in line with the rest of the Streaming API.
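
The open() checks and the Float handling described in items 2 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name `JdbcOpenSketch` and the methods `checkDriver` and `normalizeValue` are hypothetical, and only `Class.forName` and `DriverManager.getDriver` come from the comment above.

```java
import java.sql.DriverManager;
import java.sql.SQLException;

/** Hypothetical helper; names are illustrative and not taken from the patch. */
final class JdbcOpenSketch {
    private JdbcOpenSketch() {}

    /**
     * Loads the driver class only when a class name was supplied, then
     * confirms that some registered driver accepts the connection URL.
     */
    static void checkDriver(String connectionUrl, String driverClassName)
            throws ClassNotFoundException, SQLException {
        if (driverClassName != null && !driverClassName.isEmpty()) {
            // Optional step: skipped entirely when no class name is given.
            Class.forName(driverClassName);
        }
        // Throws SQLException when no registered driver understands the URL.
        DriverManager.getDriver(connectionUrl);
    }

    /** Widens Float to Double; every other value is returned unchanged. */
    static Object normalizeValue(Object value) {
        if (value instanceof Float) {
            return ((Float) value).doubleValue();
        }
        return value;
    }
}
```

Calling `checkDriver` before opening a connection fails fast with a `SQLException` when the driver is missing, instead of failing later while rows are being read.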

          dpgove Dennis Gove added a comment -

          Previous patch was a diff between the wrong hashes in the repo. This one is correct.

          dpgove Dennis Gove added a comment -

          I intend to add a few more tests for failure scenarios and for setting connection properties. Barring any issues found with that, I think this will be ready to go.

          dpgove Dennis Gove added a comment -

          Enhanced tests to include one which sets properties on the connection. Rebased against trunk.

          dpgove Dennis Gove added a comment -

          This is ready to go. I intend to commit to trunk either tonight or tomorrow.

          jira-bot ASF subversion and git services added a comment -

          Commit 1723749 from dpgove@apache.org in branch 'dev/trunk'
          [ https://svn.apache.org/r1723749 ]

          SOLR-8479: Add JDBCStream to Streaming API and Streaming Expressions for integration with external data sources

          gerlowskija Jason Gerlowski added a comment - - edited

          This looks awesome.

          Only comment would be that we might regret not having a test chaining JDBCStream and UpdateStream together.

          As Joel mentioned, one of the interesting possibilities here is quick data-import using those two streams. Just thought it might be nice to have a test to catch any future regressions there.

          Maybe it's not worth it though, or adding tests should be pushed to a different JIRA (since it looks like you're already working on committing this, and I'm commenting at the 11th hour here).

          Oops, looks like I'm too late here. Nevermind then : )

          dpgove Dennis Gove added a comment -

          I think a test like that is a great idea. I'll add it at some point in the future (perhaps under the guise of cleaning up our tests which was mentioned in the UpdateStream ticket).

          joel.bernstein Joel Bernstein added a comment -

          This is a great ticket!

          One thing we can think about doing in the future is handling the defined sort differently. Possibly parsing it from the SQL statement.

          One of the cool things about this is it allows you to distribute a SQL database as well. For example you could send the same query to multiple SQL servers then stream it all back together.
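
As a rough illustration of streaming results from several servers back together, the sketch below merges several already-sorted result sequences into one sorted sequence with a priority queue. It is a plain-Java stand-in, not Solr's MergeStream: the class `SortedMerge` and its `merge` method are hypothetical, and it assumes each source is already ordered by the same comparator (for example because every server ran the same ORDER BY).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical k-way merge over sorted sources; not part of Solr. */
final class SortedMerge {
    private SortedMerge() {}

    /** Pairs the next unread value of a source with the rest of that source. */
    private static final class Head<T> {
        final T value;
        final Iterator<T> rest;

        Head(T value, Iterator<T> rest) {
            this.value = value;
            this.rest = rest;
        }
    }

    /** Merges sources that are each sorted by {@code order} into one sorted list. */
    static <T> List<T> merge(List<Iterator<T>> sources, Comparator<? super T> order) {
        Comparator<Head<T>> byValue = (a, b) -> order.compare(a.value, b.value);
        PriorityQueue<Head<T>> heads = new PriorityQueue<>(byValue);

        // Seed the queue with the first value of every non-empty source.
        for (Iterator<T> source : sources) {
            if (source.hasNext()) {
                heads.add(new Head<>(source.next(), source));
            }
        }

        List<T> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            // Take the smallest head, then refill from the source it came from.
            Head<T> smallest = heads.poll();
            merged.add(smallest.value);
            if (smallest.rest.hasNext()) {
                heads.add(new Head<>(smallest.rest.next(), smallest.rest));
            }
        }
        return merged;
    }
}
```

The merge is only correct if the declared sort matches the order in which each source actually returns rows, which is why the stream needs to know the sort of the SQL statement.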


            People

            • Assignee:
              dpgove Dennis Gove
              Reporter:
              dpgove Dennis Gove
            • Votes:
              0
              Watchers:
              5
