Solr
SOLR-1499

SolrEntityProcessor - DIH EntityProcessor that queries an external Solr via SolrJ

    Details

      Description

      The SolrEntityProcessor queries an external Solr instance. The Solr documents returned are unpacked and emitted as DIH fields.

      The SolrEntityProcessor uses the following attributes (a sketch configuration combining them follows the list):

      • solr='http://localhost:8983/solr/sms'
        • This gives the URL of the target Solr instance.
          • Note: the connection to the target Solr uses the binary SolrJ format.
      • query='Jefferson&sort=id+asc'
        • This gives the base query string used with Solr. It can include any standard Solr request parameter. The attribute is processed under the variable resolution rules, so it can be driven from an inner stage of the indexing pipeline.
      • rows='10'
        • This gives the number of rows to fetch per request.
        • The SolrEntityProcessor always fetches every document that matches the request; rows only sets the page size.
      • fields='id,tag'
        • This selects the fields to be returned from the Solr request.
        • These must also be declared as <field> elements.
        • As with all fields, template processors can be used to alter the contents to be passed downwards.
      • timeout='30'
        • This limits the query to the given number of seconds (30 in this example). This can be used as a fail-safe to prevent the indexing session from freezing up. By default the timeout is 5 minutes.
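
      A minimal sketch putting these attributes together (the entity name "sep" is arbitrary, the values mirror the examples above, and the ampersand in the query value must be escaped as &amp;amp; inside an XML attribute):

      <dataConfig>
        <document>
          <!-- Root entity that pulls documents from the external Solr instance -->
          <entity name="sep"
                  processor="SolrEntityProcessor"
                  solr="http://localhost:8983/solr/sms"
                  query="Jefferson&amp;sort=id+asc"
                  rows="10"
                  fields="id,tag"
                  timeout="30">
            <!-- Each requested field must also be declared as a <field> element -->
            <field column="id"/>
            <field column="tag"/>
          </entity>
        </document>
      </dataConfig>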

      Limitations:

      • Solr errors are not handled correctly.
      • Loop control constructs have not been tested.
      • Multi-valued returned fields have not been tested.

      The unit tests give examples of how to use it as the root entity and an inner entity.

      1. SOLR-1499-trunk.patch
        83 kB
        Martijn van Groningen
      2. SOLR-1499-3x.patch
        88 kB
        Martijn van Groningen
      3. SOLR-1499.patch
        85 kB
        Luca Cavanna
      4. SOLR-1499.patch
        91 kB
        Luca Cavanna
      5. SOLR-1499.patch
        33 kB
        Luca Cavanna
      6. SOLR-1499.core.rev1182017.patch
        9 kB
        Pulkit Singhal
      7. SOLR-1499.tests.rev1182017.patch
        24 kB
        Pulkit Singhal
      8. SOLR-1499.patch
        33 kB
        Ahmet Arslan
      9. SOLR-1499.patch
        29 kB
        Lance Norskog
      10. SOLR-1499.patch
        25 kB
        Erik Hatcher
      11. SOLR-1499.patch
        24 kB
        Erik Hatcher
      12. SOLR-1499.patch
        24 kB
        Lance Norskog
      13. SOLR-1499.patch
        24 kB
        Lance Norskog

        Activity

        Lance Norskog added a comment -

        Hypothetically, why can't SqlEntityProcessor consume a SolrDocumentList from a SolrDataSource?

        Nobody has needed this feature. The DataSource abstraction seems to be important when multiple entities share a common connection, or a common sequence of data from the DataSource. The SolrEP does not have this problem.
        SolrEP supplies fields to its child entities, and that has been enough for its users.

        Mikhail Khludnev added a comment -

        @Noble,
        Agree. And why can't SolrDataSource get a URL, or the q= (or q=..&fq=..) part, from the entity processor and return an Iterator<Map<>> which is actually (or adaptable from) a SolrDocumentList?

        @Lance,
        Hypothetically, why can't SqlEntityProcessor consume a SolrDocumentList from a SolrDataSource?

        It seems I've missed something. Thanks for your replies. I'm asking because I need to do a lot of index scaffolding work, such as generating random documents, and I feel it has something in common with this ticket.
        I'm considering employing the ScriptTransformer idea and creating a ScriptDataSource or ScriptEntityProcessor. The former seems more promising to me, because it can be used with any of the entity processors. As a query I'd consider just a string which is passed to the script engine for evaluation. WDYT?

        Lance Norskog added a comment -

        There is no other EntityProcessor that would make use of a Solr DataSource, so it is ok for the SolrEntityProcessor to just have its own connection.

        Noble Paul added a comment -

        A DataSource is responsible for fetching data from an appropriate datastore, say a URL, DB, file, etc. An EntityProcessor should give the necessary input to the DataSource (e.g. an SQL query), get the rows of data back from the DataSource, apply the needed transformations, and put the result into the index.
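
        (As an illustrative aside, not part of this issue's patches: that split typically looks like the following in a DIH configuration, where the JDBC driver, URL, and SQL are placeholders.)

        <dataConfig>
          <!-- The DataSource only knows how to fetch data from its store (here a database) -->
          <dataSource type="JdbcDataSource" driver="org.hsqldb.jdbcDriver"
                      url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
          <document>
            <!-- The EntityProcessor supplies the input (an SQL query), takes the rows back,
                 applies transformations, and feeds them to the index -->
            <entity name="item" processor="SqlEntityProcessor" query="select id, name from item">
              <field column="id" name="id"/>
              <field column="name" name="name"/>
            </entity>
          </document>
        </dataConfig>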

        Mikhail Khludnev added a comment -

        Could you help me understand the DIH design: why is it an EntityProcessor and not a DataSource?

        Lance Norskog added a comment -

        Cool! Thanks everyone.

        Martijn van Groningen added a comment -

        Committed to trunk and 3x.

        Martijn van Groningen added a comment -

        Patch for trunk.

        Martijn van Groningen added a comment -

        Patch looks good. Made some minor changes:

        • Removed @Override from interface methods (3x is Java 5)
        • Fixed example configuration
        • Minor code format changes.

        I think it is ready to get committed.

        Luca Cavanna added a comment -

        I see the ThreadedEntityProcessorWrapper#nextRow() method synchronizes the calls to nextRow(). This means that EntityProcessor implementations don't have to worry about synchronization.

        Right, that's what I meant.

        Martijn van Groningen added a comment -

        Regarding the thread safety: of course the SolrEntityProcessor isn't thread-safe but the synchronization is handled outside that class, since nextRow() and init() methods are both called from a synchronized block. In fact, queries are not executed concurrently. I added some comments, and a specific unit test method which works like the EntityRunner. We can of course make the processor thread-safe itself adding some lock, but I don't think it's worthwhile.

        I see the ThreadedEntityProcessorWrapper#nextRow() method synchronizes the calls to nextRow(). This means that EntityProcessor implementations don't have to worry about synchronization.

        Luca Cavanna added a comment -

        Does someone have the time to have a look at my last patch? I'd like to know your opinions, hopefully we can have this feature available with the 3.6 version.

        Luca Cavanna added a comment -

        I attached a new version of the patch.
        I did some refactoring, added support for the fq parameter, and added some test methods.

        Regarding the thread safety: of course the SolrEntityProcessor isn't thread-safe but the synchronization is handled outside that class, since nextRow() and init() methods are both called from a synchronized block. In fact, queries are not executed concurrently. I added some comments, and a specific unit test method which works like the EntityRunner. We can of course make the processor thread-safe itself adding some lock, but I don't think it's worthwhile.

        Actually, I guess the patch should also be easy to apply to trunk, since it contains just new files.

        Please, let me know your thoughts!

        Luca Cavanna added a comment -

        You're of course right. There's also a comment at the beginning of the SolrEntityProcessor class: "Is not thread-safe". So, I guess at least the author of the comment was aware of it.

        Lance Norskog added a comment -

        Good eye! I'm sure this is true. This was written before threads were implemented.

        Martijn van Groningen added a comment -

        Yes, we can just point to the db or rss core that is also included in the example.

        After looking into the code I have a concern: when the SolrEntityProcessor is configured with threads > 1, it seems to me that the code will fail. Basically, the SolrQuery which is used to keep track of the offset is set as a field of the rowIterator, and the rowIterator is a field of SolrEntityProcessor (actually of its super class). It seems to me that when more than one thread is operating on the SolrEntityProcessor, each thread can overwrite the offset of another thread. It seems like we need some locking.

        Lance Norskog added a comment -

        You can have the example point to the same Solr. One use case for this is for database updates: instead of relying on a file, you actually go query Solr to decide whether a record should be updated.
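
        (A rough sketch of that pattern, not from any attached patch: the table, columns, field names, and core URL are hypothetical, and the 'url' attribute name follows the later patches. The SolrEntityProcessor runs as an inner entity so the existing Solr document can be looked up for each database row; the actual skip-or-update decision would then be made by a transformer, for instance via $skipDoc.)

        <dataConfig>
          <dataSource type="JdbcDataSource" driver="org.hsqldb.jdbcDriver" url="jdbc:hsqldb:/tmp/example/ex" user="sa"/>
          <document>
            <!-- Root entity: candidate rows from the database -->
            <entity name="record" processor="SqlEntityProcessor" query="select id, last_modified from record">
              <!-- Inner entity: look up the same id in the existing Solr index -->
              <entity name="existing"
                      processor="SolrEntityProcessor"
                      url="http://localhost:8983/solr"
                      query="id:${record.id}"
                      fields="id,timestamp">
                <field column="timestamp"/>
              </entity>
            </entity>
          </document>
        </dataConfig>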

        Luca Cavanna added a comment -

        I attached a new version of the patch.
        I cleaned up the code and added a new core to the example-DIH folder to show how the SolrEntityProcessor works. The only problem I see is that the example requires one more Solr instance running, and its address needs to be specified in the solr-data-config.xml file.

        I also have some doubts about the condition

        if (root) { solrQuery.setQuery(queryString); }

        inside the SolrEntityProcessor#init method, but I haven't yet had the time to write a specific test.
        Please let me know if you have some more suggestions!

        Martijn van Groningen added a comment -

        Looks good Luca! I see that the only differences between 3x and trunk are the changes in the DataImporter class, so it is easy to port this to trunk. I think we should move forward with this issue and get this feature committed. This issue was created more than 2 years ago! So let's try to get this into the 3.5 release.

        Luca Cavanna added a comment -

        I tried the last 3.x patch and I've found a bug: I had 230 documents to import and rows=50 (the default), but I imported only 200 documents (50*4); it means the last iteration, which had fewer than 50 rows to process, was skipped. I've fixed it.
        I've added a unit test to show the issue: it was failing before the correction and now it passes. I changed the MockSolrServer to have a more realistic behaviour with respect to the "rows" and "start" parameters.

        Furthermore, multiValued fields seem to be working, so I added a unit test also for this.

        I've corrected some path errors in the last 3.x patch; two tests (TestSolrEntityProcessorInner and TestSolrEntityProcessorOuter) were also failing due to a wrong path. I added a base class to avoid repeating the SolrInstance inner class in both of those tests.

        Since this is my first contribution, let me know if there's something wrong, I will be glad to correct my patch.

        Pulkit Singhal added a comment -

        @Lance - I've fixed the file path errors, but with the super.setup() errors I cannot figure out how the test cases were meant to be set up and function. Can you please take this further when you have the time?

        You will find the tests separated and attached to the latest trunk revision as of this comment:
        SOLR-1499.tests.rev1182017.patch

        The core functionality seems to work well on its own when I use it without the test cases, just configuring my own sanity tests in the data-config.xml, so I've updated and attached the core code against the latest trunk revision as of this comment:
        SOLR-1499.core.rev1182017.patch

        Hopefully as soon as the test cases are in business, one of the committers will review this and commit it.

        Pulkit Singhal added a comment -

        Searching through

        svn log -v

        shows:
        R /lucene/dev/trunk/solr/contrib/dataimporthandler/src/test-files/solr-dih/conf/contentstream-solrconfig.xml (from /lucene/dev/branches/solr2452/solr/contrib/dataimporthandler/src/test-files/solr-dih/conf/contentstream-solrconfig.xml:1144716)

        And a quick cmd+shift+r in Eclipse shows that a file with the same name exists at:
        /lucene_solr/solr/contrib/dataimporthandler/src/test-files/dih/solr/conf/contentstream-solrconfig.xml

        So it seems that the path fragment "/test-files/solr-dih/" got changed to "/test-files/dih/solr/"

        Pulkit Singhal added a comment -

        @ehatcher - Sure Erik, I'll keep that in mind from now on and will update the patch soon.
        @lancenorskog - Hey Lance, since you kicked this off, would you mind telling me what the purpose of contentstream-solrconfig.xml used to be so that I can find a replacement and include it with the patch update?

        Erik Hatcher added a comment -

        Pulkit - be sure to check the box to license your patches to the ASF via the ASL - otherwise we can't incorporate them.

        Lance Norskog added a comment -

        Hi-

        First, get the unit tests to work. After that, we're ready to work on it. You do a full build at the top with

        ant compile

        and then cd to solr/contrib/dataimporthandler and

        ant test

        When the unit tests do not work, something fundamental is broken and there is no point going further. In this case, the tests are broken because a solrconfig.xml sample file they depended on has gone away and you need to find replacements.

        Pulkit Singhal added a comment - edited

        The updated patch for lucene-solr trunk is attached. Sorry for naming it badly, but apparently I can't edit the file name after attaching it: SOLR-1499.rev1181269.buggy.patch

        I need to massage multivalued fields; is there any guidance around that? I know it's not tested, but how should one go about experimenting with it?

        FYI: To prove the patch works, I got a basic sanity-test to work where the data-config.xml file in my bbyopen2 core got its data from the initial bbyopen core:

        <dataConfig>
          <document>
            <entity name="sep"
                    processor="SolrEntityProcessor"
                    url="http://localhost:8983/solr/bbyopen"
                    query="sku:1000159"
                    format="javabin"
                    transformer="TemplateTransformer">
              <field column="sku" template="COPYOF-${sep.sku}"/>
            </entity>
          </document>
        </dataConfig>

        Robert Muir added a comment -

        3.4 -> 3.5

        Ahmet Arslan added a comment -

        Lance, I used it once to upgrade.

        Lance Norskog added a comment -

        Ahmet- are you still using this?

        Robert Muir added a comment -

        Bulk move 3.2 -> 3.3

        Ahmet Arslan added a comment -

        Bring up to trunk version 1082579.
        Add a format="javabin|xml" parameter; xml is needed for Solr upgrades where the Solr versions are not compatible. Test cases need to be updated.
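
        (For illustration only, not part of the attached patch text: with that parameter an upgrade-style configuration might look roughly like the following, where the old instance's URL and the field list are placeholders. format="xml" makes the processor read the source over XML instead of javabin, which is what lets incompatible versions talk to each other.)

        <dataConfig>
          <document>
            <entity name="legacy"
                    processor="SolrEntityProcessor"
                    url="http://old-solr-1.4.0-host:8080/solr"
                    query="*:*"
                    format="xml"
                    fields="id,title,body">
              <field column="id"/>
              <field column="title"/>
              <field column="body"/>
            </entity>
          </document>
        </dataConfig>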

        Erik Hatcher added a comment -

        not an active area of development currently.

        Erik Hatcher added a comment -

        Seems like a parameter to the SolrEntityProcessor that controls the format (format="javabin|xml", perhaps) is warranted. Performance hit? I'm sure somewhat... since it now adds XML parsing and a larger payload over the wire.

        As for copying over the old codec - I think I'd rather just see it use the XML format for sanity's sake.

        Ahmet Arslan added a comment -

        Erik,

        Thanks for the pointer. As you said, when I use

        new CommonsHttpSolrServer(new URL("http://solr1.4.0Instance:8080/solr"), null, new XMLResponseParser(), false);

        I was able to communicate with the Solr 1.4.0 instance using trunk SolrJ.

        Do you recommend modifying this patch in this manner? Any performance hits?

        Also, what do you think about copy-pasting JavaBinCodec.java from the source version into the destination version and using a custom BinaryResponseParser that uses that copied class? It seems to work for 1.4.0 to trunk.

        Or should I stick with writing a little script to do it?

        P.S. I am just trying to use a feature that will already be maintained by the Solr community.

        Erik Hatcher added a comment -

        SolrEntityProcessor uses the SolrJ version of the indexing server... and unfortunately you can't overcome this currently, though I imagine a slight change to get the SolrJ calls using XML rather than javabin would work.

        An alternative, if all you're doing is pulling stored fields and reindexing into another Solr, is to write a little script to do it. In solr-ruby-speak, for example, there is a SolrSource class that can be used to iterate results... and then could be used to pass back into an Indexer. The SolrEntityProcessor is more meant for when you need to blend data from other sources using the rest of DIH's capabilities.

        Ahmet Arslan added a comment -

        Hi Lance,

        I brought the patch up to the latest trunk. It required some changes, though.
        I pointed it at a Solr URL (version 1.4.0) to upgrade from 1.4.0 to trunk.

        I received:

        Caused by: java.lang.RuntimeException: Invalid version (expected 2, but 1) or the data in not in 'javabin' format
        at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:99)
        at org.apache.solr.client.solrj.impl.BinaryResponseParser.processResponse(BinaryResponseParser.java:41)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:478)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:245)

        What could be a workaround to overcome this?

        Lance Norskog added a comment - edited

        Yes you can!

        • The source index has to store all of the fields.
        • I would do a series of short queries rather than one long one.

        Thank you for thinking of this.

        It could also be used to recombine cores; you can change your partitioning strategy, for example.

        Ahmet Arslan added a comment -

        Hi,

        Can I use this to upgrade a Solr version, where the Lucene/Solr indices are not compatible?

        Thanks,
        Ahmet

        Hoss Man added a comment -

        Bulk updating 240 Solr issues to set the Fix Version to "next" per the process outlined in this email...

        http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

        Selection criteria was "Unresolved" with a Fix Version of 1.5, 1.6, 3.1, or 4.0. email notifications were suppressed.

        A unique token for finding these 240 issues in the future: hossversioncleanup20100527

        Lance Norskog added a comment - edited

        Add error-handling. Correctly handles skip, continue and abort.
        Add unit tests for error-handling.
        Rename unit tests for more clarity.

        Still has the flaw that all attributes are evaluated at the beginning.
        It is not thread-safe.

        Includes one non-backwards-compatible change: the 'solr' attribute is now 'url' to maintain consistency with the rest of the DIH.
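
        (A small sketch of how those two changes would surface in configuration, not taken from the patch itself; onError is the standard DIH entity attribute with values abort, skip, or continue, and the URL is a placeholder.)

        <!-- SolrEntityProcessor entity using the renamed 'url' attribute and DIH error handling -->
        <entity name="sep"
                processor="SolrEntityProcessor"
                url="http://localhost:8983/solr"
                query="*:*"
                onError="skip">
          <field column="id"/>
        </entity>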

        Noble Paul added a comment -

        is this entity processor planning to handle the onError flag?

        Erik Hatcher added a comment -

        We're seeing success with this in the field. I'll polish and commit soon, barring objections.

        Erik Hatcher added a comment -

        This patch changes the iterator to throw NoSuchElementException, adjusts DIH, removes @Test annotations (the convention of test* works for me), removes the test-only SolrEntityProcessor(String) ctor as it wasn't needed, fixes an issue with an extra unnecessary request to the source Solr, and adds a root entity bit to the DIH base TestContext.

        Erik Hatcher added a comment -

        One issue, the iteration isn't stopping when it should. Here's how I've set up my environment:

        Launched Solr example the standard way, java -jar start.jar from the example directory. Then java -jar post.jar *.xml from the exampledocs directory.

        Using this configuration:

        <dataConfig>
          <document>
            <entity name="sep" processor="SolrEntityProcessor" solr="http://localhost:8983/solr" query="*:*" transformer="TemplateTransformer">
              <field column="id" template="COPYOF-${sep.id}"/>
            </entity>
          </document>
        </dataConfig>

        Mapped into solrconfig.xml like this:

        <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
        <str name="config">dataimport-solr.xml</str>
        </lst>
        </requestHandler>

        I then launched another Solr (with debugger enabled) like this:
        ant run-example -Dexample.data.dir=example/sep -Dexample.debug=true -Dexample.jetty.port=8888

        Doing a full-import, I see the source Solr log this:

        INFO: [] webapp=/solr path=/select params={wt=javabin&rows=50&start=0&timeAllowed=300000&q=*:*&version=1} hits=19 status=0 QTime=10
        Oct 13, 2009 1:40:45 PM org.apache.solr.core.SolrCore execute
        INFO: [] webapp=/solr path=/select params={wt=javabin&rows=50&start=19&timeAllowed=300000&q=*:*&version=1} hits=19 status=0 QTime=0

        Since there are only 19 documents, a second request shouldn't be made as all documents are in the first 50 originally requested.

        Reporting this for information. I'm working on fixing it now.

        Erik Hatcher added a comment -

        Attached a new patch. It reformats code, removing tabs, and adjusts the hardcoded path (to make tests run in an IDE, set the current working dir to <solr-working-dir>/contrib/dataimporthandler/src/test/resources) so tests run on my machine. Refactored the mock stuff slightly so a setter is used rather than the entity processor knowing about the mock.

        Lance Norskog added a comment -

        Formatting error in first uploaded patch file.

        Erik Hatcher added a comment -

        The use case: a project that uses an intermediate Solr as a data "store". This is then indexed from the store into (another) Solr instance, blending with other entities from other data sources.

        Noble Paul added a comment -

        Lance, do you have a use case for this?

        Lance Norskog added a comment -

        First release of SolrEntityProcessor + three unit tests.


          People

          • Assignee: Unassigned
          • Reporter: Lance Norskog
          • Votes: 8
          • Watchers: 9

            Dates

            • Created:
              Updated:
              Resolved:

              Development