Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.3
    • Component/s: update
    • Labels: None

      Description

      We need a RequestHandler which can import data from a database or other data sources into the Solr index. Think of it as an advanced form of the SqlUpload plugin (SOLR-103).

      The way it works is as follows.

      • Provide a configuration file (XML) to the handler which takes in the necessary SQL queries and mappings to a Solr schema (see the configuration sketch after this list)
      • It also takes in a properties file for the data source configuration
      • Given the configuration, it can also generate the Solr schema.xml
      • It is registered as a RequestHandler which can take two commands: do-full-import and do-delta-import
      • do-full-import - dumps all the data from the database into the index (based on the SQL query in the configuration)
      • do-delta-import - dumps all the data that has changed since the last import (we assume a modified-timestamp column in tables)
      • It provides an admin page
      • where we can schedule it to run automatically at regular intervals
      • It shows the status of the handler (idle, full-import, delta-import)
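
      For illustration, here is a minimal sketch of what such a configuration could look like (element names, attributes, and queries are hypothetical placeholders, not a final format):

      <!-- Hypothetical data-config sketch: one SQL entity mapped to Solr fields.
           The deltaQuery assumes a modified-timestamp column, as described above. -->
      <dataConfig>
        <dataSource driver="org.apache.derby.jdbc.EmbeddedDriver"
                    url="jdbc:derby:exampleDB"/>
        <document>
          <entity name="item"
                  query="select id, name from item"
                  deltaQuery="select id from item where last_modified > '${dataimporter.last_index_time}'">
            <field column="id" name="id"/>
            <field column="name" name="name"/>
          </entity>
        </document>
      </dataConfig>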
      Attachments

      1. xpath-stream.patch
        10 kB
        Noble Paul
      2. SOLR-469-contrib.patch
        337 kB
        Shalin Shekhar Mangar
      3. SOLR-469-contrib.patch
        349 kB
        Shalin Shekhar Mangar
      4. SOLR-469-contrib.patch
        741 kB
        Noble Paul
      5. SOLR-469-contrib.patch
        367 kB
        Shalin Shekhar Mangar
      6. SOLR-469-contrib.patch
        756 kB
        Noble Paul
      7. SOLR-469-contrib.patch
        368 kB
        Shalin Shekhar Mangar
      8. SOLR-469-contrib.patch
        379 kB
        Noble Paul
      9. SOLR-469-contrib.patch
        380 kB
        Noble Paul
      10. SOLR-469-contrib.patch
        382 kB
        Noble Paul
      11. SOLR-469-contrib.patch
        382 kB
        Noble Paul
      12. SOLR-469-contrib.patch
        385 kB
        Noble Paul
      13. SOLR-469-contrib.patch
        393 kB
        Shalin Shekhar Mangar
      14. SOLR-469.patch
        114 kB
        Shalin Shekhar Mangar
      15. SOLR-469.patch
        101 kB
        Shalin Shekhar Mangar
      16. SOLR-469.patch
        102 kB
        Shalin Shekhar Mangar
      17. SOLR-469.patch
        377 kB
        Noble Paul
      18. SOLR-469.patch
        182 kB
        Shalin Shekhar Mangar
      19. SOLR-469.patch
        182 kB
        Shalin Shekhar Mangar
      20. SOLR-469.patch
        230 kB
        Shalin Shekhar Mangar
      21. SOLR-469.patch
        324 kB
        Shalin Shekhar Mangar
      22. SOLR-469.patch
        336 kB
        Shalin Shekhar Mangar

        Issue Links

          Activity

          David Smiley added a comment -

          However, doing so is a protocol crime – the HTTP GET verb should be read-only. Use HTTP POST instead.

          Shalin Shekhar Mangar added a comment -

          Thanks!

          Scheduling is not implemented inside Solr. You can use a cron job for scheduling automatic imports. For example, you can call "wget http://solr.host:port/solr/dataimport?command=full-import".
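
          Concretely, a crontab entry along these lines (host, port, and schedule are placeholders) would automate it:

          # hypothetical crontab entry: full import every night at 02:00
          0 2 * * * wget -q -O /dev/null "http://solr.host:port/solr/dataimport?command=full-import"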

          Mis Tigi added a comment -

          Thanks to everyone involved for this wonderful contribution. I find it extremely useful: it saves lots of time and has the added benefit of enabling rapid prototyping.

          One of the initial goals for this project was to be able to schedule automatic imports at regular intervals. I cannot find any reference to it beyond the initial goals. Has this been implemented? If not, any suggestions on the best way to accomplish that?

          I also have a question: is deleting documents something this handler can do, or should it be done outside of it?

          ms added a comment -

          Shalin,
          I just sent a test case for Firebird to your email ID. If it is a bug with Firebird, please let me know how I may report it. Thanks!

          Noble Paul added a comment -

          This issue has served its purpose. Any new requirements/bugs can be raised as separate issues.

          Shalin Shekhar Mangar added a comment -

          Great patch. I am trying to use this with Firebird. The root entity makes its way into the index - but not the sub-entities. On debugging with dataimport.jsp, the sub-entities seem to be correctly processed. I can submit a test case with embedded Firebird if necessary.

          You don't need to use the patch anymore. DataImportHandler has been released with Solr 1.3

          A test case to reproduce your problem would be nice. Just showing us the debug output will also help.

          ms added a comment -

          Great patch. I am trying to use this with Firebird. The root entity makes its way into the index - but not the sub-entities. On debugging with dataimport.jsp, the sub-entities seem to be correctly processed. I can submit a test case with embedded Firebird if necessary.

          Shalin Shekhar Mangar added a comment -

          Committed revision 682383.

          Noble Paul added a comment -

          The XPathEntityProcessor can stream rows one by one (for huge XML files) by setting stream="true".
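
          A sketch of how such an entity might be configured (names and paths are placeholders):

          <!-- hypothetical entity: stream a large XML file row by row -->
          <entity name="record" processor="XPathEntityProcessor" stream="true"
                  url="/data/huge.xml" forEach="/records/record">
            <field column="title" xpath="/records/record/title"/>
          </entity>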

          Shalin Shekhar Mangar added a comment -

          Committed revision 681182.

          A big thanks to Noble for designing these cool features and to Grant for his reviews, feedback and support! Thanks to everybody who helped us by using the patch, giving suggestions and pointing out bugs!

          Solr rocks!

          Shalin Shekhar Mangar added a comment -

          Thanks, Grant.

          I shall go over the javadocs once more and then commit it.

          Grant Ingersoll added a comment -

          Shalin,

          I don't have the time at the moment on this, so feel free to use your new powers. I think putting it into contrib and marking it as experimental is good (such that it is not bound as strictly by back-compat rules).

          I have some needs that I would like to see worked in, but I haven't had the time and I don't think they should hold back others, as it is obviously in significant use already. They are also nothing earth-shattering.

          So, it's all yours. Enjoy.

          Shalin Shekhar Mangar added a comment -

          Sorry for the spam due to my (multiple) mistakes. I think this one is the right one.

          A new patch containing the following changes:

          1. On further thinking about Interface vs. Abstract classes, we have decided to replace all interfaces with abstract classes. Transformer, Context, EntityProcessor, Evaluator, DataSource and VariableResolver are now abstract classes.
          2. The bug reported by Jonathan has been fixed and TestCachedEntityProcessor has been updated to catch it. This exception used to be thrown only if the first request to CachedEntityProcessor needed a row which was not in the cache. Subsequent requests were not affected.
          3. Javadoc improvements. In particular, all the API related classes are marked as experimental and subject to change.
          4. Set the svn Id property (propset) on all classes.

          Users who have written their own custom transformers using the API will need to change their code. Sorry for the inconvenience.

          Grant - Is there anything else we need to do to get it committed?

          Shalin Shekhar Mangar added a comment -

          Nice catch! We shall incorporate the fix into the next patch.

          Yes, indeed it can throw a NullPointerException when the key value does not exist in the cached row set. However, I am wondering what can cause such a cache miss.

          Jonathan Lee added a comment - edited

          When using CachedSqlEntityProcessor, an NPE is thrown (EntityProcessorBase.java:367) if a key value doesn't exist in the cached row set. This change to EntityProcessorBase.java should fix that, or let me know if I've missed something here!

          --- contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java	2008-07-28 12:49:21.000000000 -0400
          +++ contrib.new/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/EntityProcessorBase.java	2008-07-28 12:40:17.000000000 -0400
          @@ -348,7 +348 @@
          -    if (rowIdVsRows != null) {
          -      rows = rowIdVsRows.get(key);
          -      if (rows == null)
          -        return null;
          -      dataSourceRowCache = new ArrayList<Map<String, Object>>(rows);
          -      return getFromRowCacheTransformed();
          -    } else {
          +    if (rowIdVsRows == null) {
          @@ -367,6 +360,0 @@
          -        dataSourceRowCache = new ArrayList<Map<String, Object>>(rowIdVsRows.get(key));
          -        if (dataSourceRowCache.isEmpty()) {
          -          dataSourceRowCache = null;
          -          return null;
          -        }
          -        return getFromRowCacheTransformed();
          @@ -374,0 +363,5 @@
          +    rows = rowIdVsRows.get(key);
          +    if (rows == null)
          +      return null;
          +    dataSourceRowCache = new ArrayList<Map<String, Object>>(rows);
          +    return getFromRowCacheTransformed();
          
          Noble Paul added a comment -

          The previous patch did not take care of multi-row transformers for CachedSqlEntityProcessor. Added a test case and fixed that.

          Noble Paul added a comment -

          Ignore the previous patch.

          Noble Paul added a comment -

          Bug fix in CachedSqlEntityProcessor.

          Jonathan Lee added a comment - edited

          This patch has been a wonderful addition to solr - thanks for all the work!

          I believe that there is a bug in CachedSqlEntityProcessor that causes transformers to be ignored. Here is a patch that worked for me, but I am not sure it is entirely correct:

          --- contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/CachedSqlEntityProcessor.java
          +++ contrib.new/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/CachedSqlEntityProcessor.java
          @@ -43,18 +43,23 @@
             }
           
             public Map<String, Object> nextRow() {
          -    if (rowcache != null)
          -      return getFromRowCache();
          -    if (!isFirst)
          +    Map<String, Object> r;
          +    if (rowcache != null) {
          +      r = getFromRowCache();
          +    } else if (!isFirst) {
                 return null;
          -    String query = resolver.replaceTokens(context.getEntityAttribute("query"));
          -    isFirst = false;
          -    if (simpleCache != null) {
          -      return getSimplCacheData(query);
               } else {
          -      return getIdCacheData(query);
          +      String query = resolver.replaceTokens(context.getEntityAttribute("query"));
          +      isFirst = false;
          +      if (simpleCache != null) {
          +        r = getSimplCacheData(query);
          +      } else {
          +        r = getIdCacheData(query);
          +      }
               }
          -
          +    if (r == null)
          +      return null;
          +    return applyTransformer(r);
             }
           
             protected List<Map<String, Object>> getAllNonCachedRows() {
           
          Noble Paul added a comment -

          It was a bad patch.

          Noble Paul added a comment - edited

          • Added a method destroy() to EntityProcessor. Coupled with init(), this can be used for pre/post actions.
          • JdbcDataSource uses Statement#execute() instead of Statement#executeQuery(), so users can execute DDL/DML using JdbcDataSource.
          • Context has a new method getSolrCore() which returns the SolrCore instance.
          Noble Paul added a comment -
          • All interfaces are marked as experimental
          • The optimize=true bug is fixed
          • Added a new variable to the dataimporter namespace: ${dataimporter.index_start_time}

          Jeremy Hinegardner added a comment -

          I think there is a bug in the -contrib patch: setting optimize=false as a request parameter appears to turn off clean instead of turning off optimize.

          DataImporter.java line 496
          if (requestParams.containsKey("clean"))
              clean = Boolean.parseBoolean((String) requestParams.get("clean"));
          if (requestParams.containsKey("optimize"))
              clean = Boolean.parseBoolean((String) requestParams.get("optimize"));
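          // A minimal sketch of the likely fix (assuming an "optimize" field
          // exists alongside "clean"): only the second assignment changes.
          if (requestParams.containsKey("optimize"))
              optimize = Boolean.parseBoolean((String) requestParams.get("optimize"));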
          
          Shalin Shekhar Mangar added a comment -

          Patch applies cleanly, tests pass, although I notice several @ignore in there.

          The @ignore annotations are present in TestJdbcDataSource (for lack of MySQL to test with) and in TestScriptTransformer (script tests can only be run with Java 6, which has a JS ScriptEngine present by default). We can rewrite the test with Derby if needed.

          Also, I notice several interfaces that have a number of methods on them. Have you thought about abstract base classes instead?

          Apart from the ones Noble pointed out, there's Evaluator which users can use to extend the power of VariableResolver. The EvaluatorBag provides some generally useful implementations. Probably the context can be passed to Evaluator as well. Apart from that, I'm not sure if/how they would change in the future. An AbstractDataSource can be added – maybe we can templatize the query as well in addition to the return type.

          What relation does the Context have to the HttpDataSource?

          The Context is independent of a data source. It's just extra information which is passed along in case someone needs to use it. Most of the implementations do not actually use it.

          What if I wanted to slurp from a table on the fly?

          If you mean passing an SQL query on the fly as a request parameter, then no, it is not supported. We haven't seen a use-case for it yet – since the schema and indexing are well defined in advance, there is no harm in putting the query in the configuration. However, if someone really wants to do something like that, he/she can pass a full data-config as a request parameter (debug mode) which can be executed. The interactive mode uses this approach. An alternate approach is to extend SqlEntityProcessor and override the getQuery method to use Context#getRequestParameters; if an sql param is present, use that as the query instead of the SQL in the configuration.
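
          A rough sketch of that alternate approach (the subclass name is hypothetical, and the exact getQuery/Context signatures may differ from the patch):

          // Hypothetical subclass: prefer an "sql" request parameter over the configured query
          public class ParamSqlEntityProcessor extends SqlEntityProcessor {
            public String getQuery() {
              Object sql = context.getRequestParameters().get("sql");
              return sql != null ? (String) sql : super.getQuery();
            }
          }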

          Interactive mode has a bit of a chicken and the egg problem when it comes to JDBC, right, in that the Driver needs to be present in Solr/lib right?

          Yes, to play interactively while using a JdbcDataSource, one would need to have the driver jar present in the class-path beforehand. The interactive mode is, however, independent – HttpDataSource does not have this limitation (see the slashdot example on the wiki).

          In the JDBCDataSource, not sure I follow the connection stuff. Can you explain a bit?

          The connection is acquired once and used throughout the import process. It is closed if not used for 10 seconds. The idea behind the time-out was to avoid the connection getting closed by the server due to inactivity. Apart from that scenario, there's very little probability of a connection error happening – and even if it did, we may not have a way to deal with it.

          Noble Paul added a comment -

          I'd suggest that instead of relying on MySQL in TestJdbcDataSource, we instead use an embedded Derby or some sort of JDBC mock. I suggest Derby mainly b/c it's already ASF and I don't want to bother looking up licenses for HSQL or any of the others that might work.

          We must remove the TestJdbcDataSource if we cannot integrate Derby into the dev dependencies.

          Also, I notice several interfaces that have a number of methods on them. Have you thought about abstract base classes instead?

          Yes and no. A lot of the interfaces, like Context and VariableResolver, are never implemented by users. They are kept as interfaces to keep the APIs simple.
          The interfaces people need to implement are:

          • EntityProcessor: We expect users to extend EntityProcessorBase
          • Transformer: The most commonly implemented interface. I am ambivalent regarding this; I do not know if it will change
          • DataSource: This may be made an abstract class

          What relation does the Context have to the HttpDataSource?

          A DataSource is always created for an entity. The Context is the easiest way to get info about the entity. The current DataSources do not use that info, but because we have it readily available, we just pass it over.

          What if I wanted to slurp from a table on the fly?

          CachedSqlEntityProcessor already does that. It slurps the table and caches the info

          Interactive mode has a bit of a chicken and the egg problem when it comes to JDBC, right, in that the Driver needs to be present in Solr/lib right?

          Not sure if I got the question. The interactive dev mode does not need the drivers.

          In the JDBCDataSource, not sure I follow the connection stuff. Can you explain a bit?

          We create connections using DriverManager.getConnection(). No pooling, because the same connection is used throughout the indexing; one connection is created per entity.

          A PooledJdbcDataSource impl?

          Grant Ingersoll added a comment -

          Patch applies cleanly, tests pass, although I notice several @ignore in there. Docs look good in my preliminary perusing. I've only started looking at things, and have a lot of reading to catch up on, so please take these first comments with a grain of salt, as the English saying goes...

          I'd suggest that instead of relying on MySQL in TestJdbcDataSource, we instead use an embedded Derby or some sort of JDBC mock. I suggest Derby mainly b/c it's already ASF and I don't want to bother looking up licenses for HSQL or any of the others that might work.

          Also, I notice several interfaces that have a number of methods on them. Have you thought about abstract base classes instead? I know, there is a whole big debate over it, and people will argue that if you get the interface exactly correct, you should use interfaces. Nice in theory, but Lucene/Solr experience suggests that rarely happens. Of course, I think the correct way is to actually do both, as one can easily decorate an abstract base class with more interfaces as needed. Just food for thought, b/c what's going to quickly happen after release is someone is going to need a new method on the DataSource or something and then we are going to be stuck doing all kinds of workarounds due to back compatibility reasons. The alternative is to clearly mark all Interfaces as being experimental at this point and clearly note that we expect them to change. We may even want to consider both! The other point, though, is contrib packages need not be held to the same standard as core when it comes to back compat.

          What relation does the Context have to the HttpDataSource? In other words, the DataSource init method takes a Context, meaning the HttpDataSource needs a Context, yet in my first glance at the Context, it seems to be DB related.

          What if I wanted to slurp from a table on the fly? That is, I want to send in a select statement in my request and let the columns line up where they may, field-wise (i.e. via dynamic fields, or I rely on something like select id, colA as fieldA, colB as fieldB from MyTable;).
          Is that possible?

          Interactive mode has a bit of a chicken and the egg problem when it comes to JDBC, right, in that the Driver needs to be present in Solr/lib right? So, one can currently only interactively configure a JDBC DataSource if the driver is already in lib and loaded by the ClassLoader. If you haven't already, it might actually be useful to show what drivers are present by using the DriverManager.

          In the JDBCDataSource, not sure I follow the connection stuff. Can you explain a bit? Also, what if I wanted to plug in my own Connection Pooling library, as I may already have one setup for other things (if using Solr embedded)?

          Shalin Shekhar Mangar added a comment -

          This time with the correct name: SOLR-469-contrib.patch.

          Shalin Shekhar Mangar added a comment -

          The last patch wasn't generated correctly. This one fixes it. No changes in the code since the last patch.

          Noble Paul added a comment -

          Changes

          • Classloading is done using SolrResourceLoader; adding jars to solrhome/lib must work
          • The request can pass optimize=false as a parameter to disable the optimize step
          Shalin Shekhar Mangar added a comment -

          Changes

          • Updated the build.xml to compile Solr before building DataImportHandler and place DataImportHandler's javadoc jar in the solr/dist folder so that the javadocs are available in Solr nightly builds
          • Removed @author Javadoc tags from all source files in accordance with Solr coding conventions
          • Improved Javadocs for a lot of classes especially the public interfaces
          • Formatted code using the Eclipse codestyle xml given at HowToContribute wiki page
          • Added @since solr 1.3 to all source files
          • I've verified that the Apache license text is present in all the source files

          No changes have been made to the code (in terms of functionality)

          Note – The SOLR-563 patch must be applied before this patch to build Solr with DataImportHandler as a contrib project.

          A lot of people are using this patch and it would be easier for them if DataImportHandler is available in the nightly builds. Also, this patch has become huge and enhancements and bug fixes would also be easier if it were committed. Grant – We feel that this is ready to be committed now whenever you can take a look.

          Noble Paul added a comment - edited

          This patch contains

          • Integration with SOLR-505 (disable cache headers)
          • Tests in TestScriptTransformer are ignored (they require Java 6)
          • New feature: CachedSqlEntityProcessor. It can dramatically speed up indexing if there are sub-entities, by caching the rows and avoiding subsequent database calls. Consumes a lot of RAM. See the wiki and the sketch below
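
          A sketch of the kind of sub-entity this targets (attribute names follow the wiki; table and column names are placeholders):

          <!-- hypothetical cached sub-entity: the table is read once and joined from memory -->
          <entity name="detail" processor="CachedSqlEntityProcessor"
                  query="select * from detail" where="item_id=item.id"/>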
          patrick o'leary added a comment -

          With arow, I noticed that by nulling it, CMS GC was cleaning items up faster in eden space.
          Without it, Full GC kicked in more frequently. This was with indexing about 250 MB from MySQL.
          If you've not got that much data then there isn't much of a worry; it's just a little optimization that reduces the need
          to increase your JVM's mx and newsize settings.

          Another thing I was looking at is the SolrWriter: instead of calling an updateHandler directly, I think you should call
          the UpdateRequestProcessorFactory and allow the UpdateRequestProcessor chain to handle
          • processAdd
          • processDelete
          • processCommit
          • finish

          It allows for custom ChainedUpdateProcessorFactory's, which is a fantastic little-known feature.

          Shalin Shekhar Mangar added a comment -

          Copying changes in codebase from SOLR-469.patch to SOLR-469-contrib.patch

          Shalin Shekhar Mangar added a comment -

          A lot of people were using the older patch. I'm generating the contrib one too.

          Olivier Poitrey added a comment -

          No -contrib version this time?

          Noble Paul added a comment -

          Small correction:

          The Context interface has a new method getDataSource(String entityName) for getting a new DataSource instance for the given entity - Context, ContextImpl, DataImporter, DocBuilder

          /**
             * Gets a new DataSource instance with a name.
             * @return
             * @param name Name of the dataSource as defined in the dataSource tag
             */
            public DataSource getDataSource(String name);
          
          Shalin Shekhar Mangar added a comment - edited

          A new patch file (SOLR-469.patch) consisting of some important bug fixes and minor enhancements. The changes and the corresponding classes are given below.

          Changes

          • Set fetch size to Integer.MIN_VALUE if batchSize in configuration is -1 as per Patrick's suggestion – JdbcDataSource
          • Transformers can add a boost to a document by adding a key/value pair row.put("$docBoost", 2.0f) from any entity (see the sketch after this list) – DocBuilder, SolrWriter and DataImportHandler
          • Fixes for infinite loop in SqlEntityProcessor when delta query fails for some reason and NullPointerException is thrown in EntityProcessorBase – EntityProcessorBase
          • Fix for NullPointerException in TemplateTransformer and corresponding test – TemplateTransformer, TestTemplateTransformer
          • Enhancement for specifying table.column syntax for pk attribute in entity as per issue reported by Chris Moser and Olivier Poitrey – SqlEntityProcessor,TestSqlEntityProcessor2
          • Fix for NullPointerException in XPathRecordReader when attribute specified through xpath is null as per issue reported by Nicolas Pastorino in solr-user – XPathRecordReader, TestXPathRecordReader
          • Enhancement to DataSource interface to provide a close method – DataSource, FileDataSource, HttpDataSource, MockDataSource
          • Context interface has a new method getDataSource(String name) for getting a new DataSource instance as per the name specified in solrconfig.xml or data-config.xml – Context, ContextImpl, DataImporter, DocBuilder
          • FileListEntityProcessor implements olderThan and newerThan filtering parameters – FileListEntityProcessor, TestFileListEntityProcessor
          • Debug Mode can be disabled from solrconfig.xml by enableDebug=false – DataImporter, DataImportHandler
          • Running statistics are exposed on the Solr Statistics page in addition to cumulative statistics – DataImportHandler, DocBuilder
          • The dataSource attribute can be null when using certain EntityProcessors such as FileListEntityProcessor which does not need a dataSource. So when dataSource="null", no attempt is made to create a DataSource instance – DataImporter

          Updated as per Noble's comment below.
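
          As an illustration of the $docBoost convention above, a custom transformer could look roughly like this (a sketch only; Transformer was an interface in this patch and later became an abstract class, so the exact declaration may differ):

          import java.util.Map;
          import org.apache.solr.handler.dataimport.Context;
          import org.apache.solr.handler.dataimport.Transformer;

          // Hypothetical transformer: boost every document produced from this entity
          public class BoostTransformer extends Transformer {
            public Object transformRow(Map<String, Object> row, Context context) {
              row.put("$docBoost", 2.0f); // read back by DocBuilder as the document boost
              return row;
            }
          }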

          Noble Paul added a comment -

          Thanks for the suggestions.
          JdbcDataSource is a generic implementation for all JDBC drivers, so batchSize itself is a configurable parameter for JdbcDataSource. Set the value and it should be fine. Anyway, we will incorporate the changes you have suggested because it is convenient for users.

          The var arow goes out of scope immediately because the method terminates after this. I'm not sure it makes any difference if I explicitly set it to null.

          patrick o'leary added a comment -

          There's a slight problem using Connector/J for MySQL, in that it doesn't fully implement the JDBC spec for
          setFetchSize, resulting in all rows in MySQL being selected into memory.

          Connector/J states that you must pass ?useCursorFetch=true in the connect string, but that exposes another MySQL bug with server-side parsed queries throwing an error of "incorrect key file" on the temp tables generated by the cursor;
          as yet there isn't a fix in MySQL that I know of.

          Something that seems to work is to set the batchSize to Integer.MIN_VALUE:

          JdbcDataSource.java

           if (bsz != null) {
                try {
                  batchSize = Integer.parseInt(bsz);
                  if (batchSize < 0)
                      batchSize = Integer.MIN_VALUE;  // pjaol : setting batchSize to <0 in dataSource forces connector / j to use Integer.MIN_VALUE
                } catch (NumberFormatException e) {
                  LOG.log(Level.WARNING, "Invalid batch size: " + bsz);
                }
              }
          

          This basically puts the result set size at 1 row; a little slow, but if you can't set your JVM memory settings high enough
          it gives you an option.

          Also suggest null-ing the row hashmap in DocBuilder immediately after use to allow GC to clean up
          the reference faster within eden space.

          DocBuilder.java

              if (entity.isDocRoot) {
                if (stop.get())
                  return;
                boolean result = writer.upload(doc);
                doc = null;
                if (result)
                  importStatistics.docCount.incrementAndGet();
              }

              arow = null; // pjaol : set reference to hashmap to null to eliminate strong reference
          
                 } catch (DataImportHandlerException e)
          ..........
          
          Olivier Poitrey added a comment - edited

          Paul,

          The current version of the code seems not to allow the construction pk="forum.forumId" you're talking about. I did a small patch to make it possible. I don't know if it's the correct way to do it but it worked well for me.

          Here is the patch:

          --- a/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/SqlEntityProcessor.java
          +++ b/contrib/dataimporthandler/src/main/java/org/apache/solr/handler/dataimport/SqlEntityProcessor.java
          @@ -124,7 +124,9 @@ public class SqlEntityProcessor extends EntityProcessorBase {
                   sb.append(" and ");
                   first = false;
                 }
          -      Object val = resolver.resolve(primaryKeys[i]);
          +      // Only send the field part of the pk when pk includes the table ref
          +      String[] pkParts = primaryKeys[i].split("\\.");
          +      Object val = resolver.resolve(pkParts[pkParts.length - 1]);
                 sb.append(primaryKeys[i]).append(" = ");
                 if (val instanceof Number) {
                   sb.append(val.toString());
          

          Hope that helps.

          Hoss Man added a comment -

          creating a contrib structure and making the DataImportHandler a contrib definitely seems like the smart way to go ... particularly since it doesn't require any "core" changes.

          Noble Paul added a comment - edited

          Moser: I guess I now understand your requirement. The solution you have proposed is indeed a good one.
          How about this one:

          The pk is used only for this purpose, so you must be able to keep it as pk="forum.forumId"; when the query is generated I can use it as-is, and when I fetch the value, I can just use the part after the period (.)

          In addition I can make the getQuery() method in SqlEntityProcessor public so that you can implement your custom logic very easily

          Chris Moser added a comment -

          Hi Shalin,

          I'm indexing forums with Solr and have tables with a structure similar to this:

          posts
          ------
          forumid int
          messageid int
          deleted boolean
          message text
          
          forums
          ------
          forumid int
          name text
          deleted boolean
          
          

          The simplified data query I'm running goes like this:

          SELECT 
             p.forumid,
             p.messageid,
             IF (p.deleted OR f.deleted,true,false) as deleted,
             p.message
            
          FROM 
             posts p, forums f
          
          WHERE
             f.forumid = p.forumid
          

          The query checks to see if the post or the forum is deleted, and marks it in the index as deleted in either case (which is why I'm doing the join). The problem I'm running into is that the importer is running the WHERE clause like this:

          WHERE 
             f.forumid = p.forumid and forumid=123 and messageid=123456789
          

In this case, the forumid=123 part is ambiguous (forumid exists in both the posts and the forums tables), so this causes a SQL error. So I added an additional attribute to the entity definition (pkTable) which prefixes forumid=123 with the pkTable value, generating pkTable.forumid=123.

Not sure if this is the best way to do it, but it fixed the problem.

          Shalin Shekhar Mangar added a comment -

          Thanks Chris, nice catch! The first one is definitely a bug. I'll fix that, add a test and upload a new patch. I'm not sure if I understand your second point completely, can you please give an example?

          Chris Moser added a comment - - edited

          Hi,

          Thanks for all of your work with the dataimporter. It's made working with Solr much easier.

I think I found a small bug in SqlEntityProcessor.java starting on line 120:

          SqlEntityProcessor.java
          120:    boolean first = true;
          121:    String[] primaryKeys = context.getEntityAttribute("pk").split(",");
          122:    for (int i = 0; i < primaryKeys.length; i++) {
          123:      if (!first) {
          124:        sb.append(" and ");
          125:        first = false;
          126:      }
          

This causes problems in the generated SQL statement: the "and" separator is never added when more than one field is provided in the pk entity value, and the end result is a SQL syntax error.

Since first is initialized to true and is only set to false inside the if block, the condition on line 123 never fires (and first is never set to false). It would be more appropriate for the assignment on line 125 to happen after the if statement on line 123.
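A corrected version of the loop, as a sketch of the fix being described here (not the committed code), simply moves the assignment out of the conditional:

SqlEntityProcessor.java
boolean first = true;
String[] primaryKeys = context.getEntityAttribute("pk").split(",");
for (int i = 0; i < primaryKeys.length; i++) {
  if (!first) {
    sb.append(" and "); // separator between successive pk predicates
  }
  first = false; // moved after the if, so it flips once the first predicate is written
  Object val = resolver.resolve(primaryKeys[i]);
  sb.append(primaryKeys[i]).append(" = ");
  // ... value appended as in the original code
}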

This leads me to another issue: how do you specify the table of the primary key when the primary key is ambiguous? If there's a join condition in the SQL statement of a deltaQuery, and any of the primary key columns are present in the joined table, the key is ambiguous and will cause a SQL error.

Is there a way to specify the table for the primary key? Perhaps an attribute "pkTable" could be added as an option for the entity declaration, e.g. in SqlEntityProcessor.java:

          SqlEntityProcessor.java
          127:      Object val = resolver.resolve(primaryKeys[i]);
          -->	  if (context.getEntityAttribute("pkTable").length()>0)
          -->		sb.append(context.getEntityAttribute("pkTable")+".");
          128:	  sb.append(primaryKeys[i]).append(" = ");
          

          This removes any potential ambiguity issues with joins when pkTable is specified.

          Shalin Shekhar Mangar added a comment -

          This patch adds DataImportHandler as a contrib project into Solr. It uses standard Maven directory structure and a build.xml file. No changes have been made to the codebase.

Note - I've opened SOLR-563 to track contrib area creation in Solr. Using this patch with the SOLR-563 patch lets you compile, test and package DataImportHandler with the Solr war file.

          Noble Paul added a comment -

The best example of the simple use case can be seen here: http://wiki.apache.org/solr/DataImportHandler#shortconfig.
Here we have joined 4 different tables with very little configuration.

          Noble Paul added a comment -

hi Grant,
we started off with something like that and very soon realized that it cannot scale beyond the very basic use cases.
We need the ability to apply transformations such as splitting and merging fields.
Sometimes we need to put in a totally different piece of data, e.g. if a value is 1-5 put in the string 'low', if 5-10 put in 'medium', and so on.

All these are really driven by the business requirements.

And there is the need for joining one table with another using the values in one table, or merging one table with many tables.

Then we had use cases where data comes from a DB and, using a key, we have to fetch data from an XML/HTTP datasource.

So the fundamental design, the 'kernel' of the system, is supposed to be totally agnostic of the use cases, and we let users plug in implementations in Java/JS so that they can do what they actually want. And we want to share some of the components which can be common for others.
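To make the range-bucketing idea concrete, here is a minimal custom Transformer sketch; the class name, field name and thresholds are illustrative assumptions, not code from the patch:

RangeBucketTransformer.java
import java.util.Map;

import org.apache.solr.handler.dataimport.Context;
import org.apache.solr.handler.dataimport.Transformer;

public class RangeBucketTransformer extends Transformer {
  public Object transformRow(Map<String, Object> row, Context context) {
    Object v = row.get("score"); // hypothetical numeric column
    if (v instanceof Number) {
      double d = ((Number) v).doubleValue();
      // replace the raw number with a coarse label, as in the 1-5/5-10 example above
      row.put("score", d < 5 ? "low" : d < 10 ? "medium" : "high");
    }
    return row; // the returned map is what gets indexed
  }
}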

          Shalin Shekhar Mangar added a comment -

Grant, the limitation comes from multi-valued fields. When you join tables, it is most probably because you have a 1-to-many relationship. However, in that case a single row in the result does not contain all the information needed to create the Solr document. You'd need to combine many rows using the primary/foreign key to get all the data required in the Solr document. Btw, SOLR-103 is similar to the functionality you have in mind.
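To make the 1-to-many case concrete, DataImportHandler models it as nested entities whose inner query is templated on the outer row, so each inner row contributes one value to a multi-valued field. A sketch with hypothetical table and field names:

<entity name="post" query="select id, message from posts">
  <field column="message" name="message" />
  <entity name="tag" query="select tag from tags where post_id = '${post.id}'">
    <field column="tag" name="tag" />
  </entity>
</entity>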

          Grant Ingersoll added a comment -

I was just thinking of it as an option. How does it limit things? A single SQL statement can join across tables. I will try to update the patch w/ my merges when I get a breather. I have the case where I can send in something like:

select col1 as field1, col2 as field2, ... from table1, ... where ...;

and it goes and runs that SQL against the specified connection. Basically, any valid SQL select statement can be sent in.

Actually, I can send in multiple SQL statements as well, or just specify the table, or the table and certain columns.


          Noble Paul added a comment -

Giving a single SQL statement may limit the utility, because you may need to join more than one table in most of the use cases.

But it is possible to pass the whole dataconfig itself as a request parameter. We currently use that in the interactive development mode.

We have tried hard to cut down the verbosity of the configuration, patch after patch. Now the 'metadata', i.e. the extra information other than the queries themselves, is minimal. We leverage existing data such as the schema to achieve it.

The connections are created once and used throughout one import. We take in the details for creating connections in the configuration (see documentation).

          Grant Ingersoll added a comment -

          This is some really cool stuff and should be added at some point soon.

          Some high level questions/comments, as I haven't looked in depth into the patch yet:

Is it possible to just pass in SQL statements, etc. via a request? Or do they have to be configured ahead of time? What about connections? On the one hand, having to configure things ahead of time can lock them down and be a little more secure; on the other hand, it takes away flexibility. I hope to combine some of the stuff I've written to do this with your patch.

Not sure how to say it, but all the configuration starts to have the feel of Hibernate and/or the other ORMs. Would there be a way to leverage something that already exists? Although I do see from the comment the other day that you have reduced some of the verbosity.

          How is scheduling handled?

          Finally, I'm not totally sure where this should live. Solr doesn't currently have a "contrib" area, but this feels like a (major) contrib and may warrant adding it under a contrib area.

          Shalin Shekhar Mangar added a comment -

          This patch contains the following changes

          • DataSource definitions can now be added inside data-config.xml so there is no need to maintain configuration in two files. It also comes in handy with the interactive development mode.
          • XSLT support in XPathEntityProcessor can apply a given XSL on the XML document before processing it. For example: <entity name="e" processor="XPathEntityProcessor" xsl="/home/user/my.xsl">
          • XPathEntityProcessor now knows how to process Solr Add XMLs. This is handy when using XSLT to change fetched XML directly into Solr Add XML format. Add an extra attribute useSolrAddSchema="true" to enable this. If useSolrAddSchema="true" is specified, then there is no need to put fields in the entity.
• A new EntityProcessor called FileListEntityProcessor has been added which can operate over a filesystem (directory) and can be used to get files by name (using a regex) or size (in bytes), and can also exclude files matching a regex. Recursively operating over a directory is also supported (see the configuration sketch after this list).
• A TemplateTransformer which lets you put multiple fields into one field according to the given template. For example: <field column="name" template="${e.lastName}, ${e.firstName} ${e.middleName}" />

          • In-built transformers are now enhanced to operate on multi-valued fields also.
          • A Test harness has been created to make it easier to test DataImportHandler features. It is called AbstractDataImportHandlerTest and extends from AbstractSolrTestCase. Look at TestDataConfig and TestDocBuilder2 for examples

          We shall write documentation and examples on these changes on the wiki at http://wiki.apache.org/solr/DataImportHandler
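To illustrate the FileListEntityProcessor described above, a hypothetical configuration might look like this; the attribute names follow the description above, while the directory and regexes are placeholders:

<entity name="files" processor="FileListEntityProcessor"
        baseDir="/data/feeds" fileName=".*\.xml"
        excludes=".*\.tmp" recursive="true">
  ...
</entity>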

          Shalin Shekhar Mangar added a comment -

          A new patch consisting of a few bug fixes and some major new features. The changes include:

          • No need to write fields in data-config if the field name from DB/XML and field-name in schema.xml are the same. This removes a lot of useless verbosity from data-config.xml
• A cool new interactive development page, in which you write/change data-config.xml and see results immediately, making iterations extremely fast! Use http://host:port/solr/admin/dataimport.jsp or, if using multi-core, http://host:port/solr/core-name/admin/dataimport.jsp
          • You can start using the interactive mode without specifying data-config file in solrconfig.xml, however, specifying the data sources is necessary in solrconfig.xml
• Interactive development uses a new debug mode in DataImportHandler; add debug=on to the full-import command to see the actual documents created by DataImportHandler. This shows the first 10 documents created using the existing config, without committing them to Solr. It supports the start and rows parameters (just like query params) which you can use to see any document. This comes in very useful when, say, the 1000th document failed during indexing and you want to see the reason. If there are exceptions, the stacktrace is shown with the response. (A sample debug request is shown after this list.)
          • Verbose mode with verbose=on as a request parameter (used in conjunction with debug=on) which shows exactly how DataImportHandler created each document.
            • What query was executed?
            • How much time it took?
            • What rows it gave back?
            • What transformers were applied and what was the result?
            • Another advantage is that you can see the fields which are indexed but not stored
          • A show-config command has been added which gives the data-config.xml as a raw response (uses RawResponseWriter)
          • A new interface called Evaluator has been added which makes it possible to plugin new expression evaluators (for resolving variable names)
          • Using the same Evaluator interface, a few new evaluators have been added
  • formatDate - use as ${dataimporter.functions.formatDate('NOW', yyyy-MM-dd HH:mm)}; this will format NOW as per the given format and return a string which can be used in queries or urls. It supports the full DateMathParser syntax. You can also format fields, e.g. ${dataimporter.functions.formatDate(A.purchase_date, dd-MM-yyyy)}
  • encodeUrl - useful for URL-encoding parameters when making a HTTP call. Use as ${dataimport.functions.encodeUrl(emp.name)}
  • escapeSql - useful for escaping parameters supplied in sql statements. This can replace quotes with two quotes to avoid sql syntax errors. Use as ${dataimporter.functions.escapeSql(emp.name)}
          • Custom Evaluators can be specified in data-config.xml (more details and example will be added to the wiki)
• HttpDataSource now reads the content encoding from the response by default. Previously it assumed the default encoding to be UTF-8. This behavior can be overridden by explicitly specifying an encoding in solrconfig.xml
          • A FileDataSource has been added which can read content from local files (e.g. XML feed files on local disk).
          • Transformers can signal skipping a document by adding a key "$skipDoc" with value "true" in the returned map.
          • NumberFormatTransformer is a new transformer which can be used to extract/convert numbers from strings. It uses the java.text.NumberFormat class in Java to provide its features.
          • The Context interface has been enhanced to add new methods for getting/setting session variables which can be used by Transformers to share data. Also a new method called getParentContext can enable a Transformer/EntityProcessor to get the parent entity's context in full imports.

          Please let us know your comments and feedback. More details and examples will soon be added to the wiki page at http://wiki.apache.org/solr/DataImportHandler
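As a concrete illustration of the debug/verbose options above, a request that inspects only the 1000th document might look like this (host, port and the offset are placeholders):

http://host:port/solr/dataimport?command=full-import&debug=on&verbose=on&start=999&rows=1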

          Noble Paul added a comment -

          The priority is changed to major

          Noble Paul added a comment -

The scope has been changed from consuming just DB data; it is now designed to consume any type of structured data.

          Shalin Shekhar Mangar added a comment -

A change in behavior for XPathEntityProcessor: it now makes 'pk' optional for entities using XPathEntityProcessor.

          Shalin Shekhar Mangar added a comment -

          Fixes a bug with html handling in XPathRecordReader

          Noble Paul added a comment -

The last patch was generated from the wrong root. This one applies properly.

          Noble Paul added a comment -

This is the biggest feature release yet for this patch. It contains almost all the planned features for DataImportHandler, including:

          • support for xml/http datasources
          • Javascript for transformer (requires java 6)
          • Numerous performance enhancements and bug fixes
          • Better logging and error handling
          • An improved command interface
          • command to reload config
          • statistics integrated with solr statistics
• Can access request parameters
          • Extra configurable parameters can be passed from solrconfig.xml
          • Multiple transformers possible (chaining)
• Can put in the handler without a data-config.xml and datasource
• Can make an arbitrary entity a root entity

          More documentation in the wiki

          Noble Paul added a comment -

          The DB example was an easy one because we could get a schema out of the sample data.
          An RSS/ATOM example is in the works.

          Otis Gospodnetic added a comment - - edited

          I see, I just read to the bottom of the Wiki page - I spoke too soon. It would be great, then, to include another non-SQL/RDBMS example in there.

          What site are you using this on by the way? Oh, AOL? It looks like AOL is jumping on Solr, Hadoop, and friends and contributing - bravo!

          Shalin Shekhar Mangar added a comment -

          Hi Otis,

          Thanks for showing interest in this issue and your feedback.

Originally we started developing this to be a pure DB import tool. But our own requirements led us to keep this general enough to be used with other kinds of data sources. For example, we're using this internally for reading from REST APIs (including RSS/ATOM feeds). Therefore, we kept the name as DataImportHandler on purpose. Previously, our data source was JdbcDataSource and the EntityProcessor was called SqlEntityProcessor. We later extracted interfaces out of them as DataSource and EntityProcessor to make them as generic as possible. Also note that DataImportHandler does not care about the name of data-config.xml. It could be called anything; all we need is that it is specified in solrconfig.xml.

We're developing our generic REST datasources and entity processors and plan to contribute them as well. We too are looking forward to seeing this in Solr, and we're committed to do whatever it takes to make sure it becomes a part of Solr.

          Otis Gospodnetic added a comment -

          Haven't looked at the patch, but I've read most of http://wiki.apache.org/solr/DataImportHandler

Small comment: don't name that config file "data-config.xml". "data" is so generic. What is this? It's an RDBMS indexing tool implemented as a request handler. I'd pick a better, more specific name both for the config and the handler itself - DataImportHandler - does it import from a file? A BDB? An RDBMS? Another search engine? You can't tell from a generic name.

          Really well documented, good job, and I'm looking forward to seeing this in Solr!

          Noble Paul added a comment -

Add a facility to delete documents from the Solr index on the basis of a Solr query.
It is useful if you wish to expire documents after a certain period of time.

          Shalin Shekhar Mangar added a comment - - edited

          Changes

          • Support for deleted rows detection (Details will be added to Wiki soon)
          • Numerous bug fixes
          • Merged DataImporter and DataImporterContext together
          • Improved response format showing status messages of operation
          • DataImportHandler is now SolrCoreAware
          • Code refactorings
          • A Verifier which checks data-config.xml against the solr schema.xml to make sure that all fields defined in data-config.xml are defined in schema.xml and all (required) fields defined in solr schema.xml are mentioned in data-config.xml

We recently indexed around 1.7 million documents using this tool. The documents had mostly sint and sdouble fields in them (since we wanted to see the performance of this patch and not Lucene's speed). We were able to index 1.7 million documents in 166 seconds on our production hardware.

Note: Details of the API exposed in our work are now added to our Wiki. Also, an example Solr home is provided in the Wiki page (under the "Full Import Example" section) to try this out.

          Shalin Shekhar Mangar added a comment -

          The wiki page for DataImportHandler now has instructions for running an example for a full-import process. We've used the same data provided by example in Solr and created a hsqldb database out of it. Would love to have some feedback at this point.

          We'll add examples for delta-import soon.

          Shalin Shekhar Mangar added a comment -

          Changes

          • Eliminated schema creation step as per Ryan's suggestions.
• No need to put field attributes such as type, multiValued, indexed, stored etc. in data-config.xml; those are now read directly from the Solr IndexSchema
          • No need to put copyField information in data-config.xml since copy fields are managed by Solr

          The only attributes needed to be provided in data-config.xml for a field are:

• column (The column in the db from which the field's value comes, Required)
          • name (Optional, if the field name differs from the column name, the field name needs to be given)
          • boost (Optional, if the field needs to be boosted)

          I'll update the wiki document to reflect the above changes.
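Putting the above together, a complete field mapping under this scheme reduces to something like the following sketch (table, column and field names are illustrative):

<entity name="item" query="select ID, NAME from item">
  <field column="ID" name="id" />
  <field column="NAME" name="name" boost="2.0" />
</entity>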

          Shalin Shekhar Mangar added a comment -

Sorry, this new patch is the correct one. Still learning the ropes.

          Shalin Shekhar Mangar added a comment -

It seems my earlier patch wasn't generated in the correct way. It had absolute paths to all files instead of relative paths. This new patch corrects it. Also, it removes a test which had got in by mistake in the previous patch.

          Noble Paul added a comment -

We are planning to eliminate the schema creation step, so we won't need to put in those details which are already present in schema.xml; we can simplify the data-config and eliminate the <copyField> as well. We must then introduce a verifier which ensures that the data-config is in sync with the schema.xml.

          Noble Paul added a comment -

hi,
thanks for the inputs.
1) For very simple use cases we can avoid people touching the schema.xml altogether, because we usually have a standard schema.xml except for the <field> tags. People can choose to edit the schema after it is created, but if the two diverge completely they may not stay in sync and can throw errors. Anyway, we are open to suggestions.
2) We can use <copyField> instead of <copyFrom>.
--thanks
Noble Paul

          Ryan McKinley added a comment -

          Hi- thanks for posting this.

          I have not had a chance to look at this in depth, but a couple things jump out at me.

1. It looks like the model here is to treat "data-config.xml" as the master and generate schema.xml from that. To me this seems a bit strange and difficult to support long term. In my view, "schema.xml" should always be the place to define fields and indexing properties; "data-config.xml" should just be the place that maps SQL to the schema.

2. Why not just use the standard copyField stuff rather than rolling your own?

                  <field name="text">
                      <copyFrom>cat</copyFrom>
                      <copyFrom>name</copyFrom>
                      <copyFrom>manu</copyFrom>
                      <copyFrom>features</copyFrom>
                  </field>
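
        For comparison, the standard mechanism being suggested expresses the same mapping in schema.xml as source/dest pairs; this is stock Solr syntax, independent of the patch:

        <copyField source="cat" dest="text"/>
        <copyField source="name" dest="text"/>
        <copyField source="manu" dest="text"/>
        <copyField source="features" dest="text"/>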
          
          Shalin Shekhar Mangar added a comment -

          A patch out of our (Noble Paul's and Shalin Shekhar Mangar's) work on this issue. Please refer to http://wiki.apache.org/solr/DataImportHandler for a user guide.

          Our design philosophy for data imports is based on templatized SQL which gives the user of this tool a lot of flexibility. It can generate schemas, do full-imports and delta-imports. Please note that this is work in progress and there's a lot to be done for it to be committed. We plan to write more documentation and tests as we go on.

Start by looking at the changes to solrconfig.xml and then to DataImportHandler.java. The central class is DataImporter.java, which uses DocBuilder to do the actual full-dump and delta-dump operations.

          We expose a powerful API for applications to do custom tasks. This API was needed because even in our own tasks, there was frequent need to perform custom operations on rows/columns before they could be indexed. Assuming that other users may face the same problems, we expose Context.java, DataSource.java, EntityProcessor.java, Transformer.java as interfaces which can be used to provide custom data sources or transformations on column values before indexing. In our own project, we have used these interfaces to do tasks such as reading XML from a column and extracting relevant items to be indexed.

Looking forward to your feedback and comments. Let us know what it will take to get this feature into Solr.

          • Noble Paul & Shalin Shekhar Mangar

People

• Assignee: Shalin Shekhar Mangar
• Reporter: Noble Paul
• Votes: 7
• Watchers: 16
