Details

      Description

      Functionality:
      1. Provide a pluggable caching framework for DIH so that users can choose a cache implementation that best suits their data and application.

      2. Provide a means to temporarily cache a child Entity's data without needing to create a special cached implementation of the Entity Processor (such as CachedSqlEntityProcessor).

      3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr. Then provide a way for a subsequent DIH call to use the cache as an Entity input. Also provide the ability to do delta updates on such persistent caches.

      4. Provide the ability to partition data across multiple caches that can then be fed back into DIH and indexed either to varying Solr Shards, or to the same Core in parallel.

      Use Cases:
      1. We needed a flexible & scalable way to temporarily cache child-entity data prior to joining to parent entities.

      • Using SqlEntityProcessor with Child Entities can cause an "n+1 select" problem.
      • CachedSqlEntityProcessor only supports an in-memory HashMap as a Caching mechanism and does not scale.
      • There is no way to cache non-SQL inputs (e.g., flat files, XML, etc.).

      2. We needed the ability to gather data from long-running entities by a process that runs separately from our main indexing process.

      3. We wanted the ability to do a delta import of only the entities that changed.

      • Lucene/Solr requires entire documents to be re-indexed, even if only a few fields changed.
      • Our data comes from 50+ complex SQL queries and/or flat files.
      • We do not want to incur overhead re-gathering all of this data if only 1 entity's data changed.
      • Persistent DIH caches solve this problem.

      4. We want the ability to index several documents in parallel (using 1.4.1, which did not have the "threads" parameter).

      5. In the future, we may need to use Shards, creating a need to easily partition our source data into Shards.

      Implementation Details:
      1. De-couple EntityProcessorBase from caching.

      • Created a new interface, DIHCache, & two implementations (a hedged sketch of such an interface appears after this list):
      • SortedMapBackedCache - An in-memory cache, used as default with CachedSqlEntityProcessor (now deprecated).
      • BerkleyBackedCache - A disk-backed cache, dependent on bdb-je, tested with je-4.1.6.jar
      • NOTE: the existing Lucene Contrib "db" project uses je-3.3.93.jar. I believe this may be incompatible due to its use of generics.
      • NOTE: I did not modify the ant script to automatically get this jar, so to use or evaluate this patch, download bdb-je from http://www.oracle.com/technetwork/database/berkeleydb/downloads/index.html

      2. Allow Entity Processors to take a "cacheImpl" parameter to cause the entity data to be cached (see EntityProcessorBase & DIHCacheProperties).

      3. Partially de-couple SolrWriter from DocBuilder.

      • Created a new interface, DIHWriter, & two implementations:
      • SolrWriter (refactored)
      • DIHCacheWriter (allows DIH to write ultimately to a Cache).

      4. Create a new Entity Processor, DIHCacheProcessor, which reads a persistent Cache as DIH Entity Input.

      5. Support a "partition" parameter with both DIHCacheWriter and DIHCacheProcessor to allow for easy partitioning of source entity data (see the config sketch after this list).

      6. Change the semantics of entity.destroy()

      • Previously, it was being called on each iteration of DocBuilder.buildDocument().
      • Now it performs one-time cleanup tasks (like closing or deleting a disk-backed cache) once the entity processor has completed.
      • The only out-of-the-box entity processor that previously implemented destroy() was LineEntityProcessor, so this is not a very invasive change.
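
      As a rough illustration of item 1, here is a minimal sketch of what the DIHCache contract described above might look like. The method names and signatures are illustrative assumptions, not copied from the patch (the initialization mechanism changed across patch versions, from property maps to the DIH Context object):

        import java.util.Iterator;
        import java.util.Map;

        // Hypothetical sketch of the pluggable DIHCache contract; Context
        // is the existing DIH context class. Names are illustrative only.
        public interface DIHCache extends Iterable<Map<String, Object>> {

          // One-time setup: read impl-specific parameters (cache name,
          // location, readOnly, deletePriorData, etc.) from the Context.
          void open(Context context);

          // Flush pending writes and release resources (e.g. a bdb-je
          // environment) without discarding persistent data.
          void close();

          // Add one row (a field-name-to-value map), keyed on the cache's
          // primary-key field; multiple rows per key allow 1-many joins.
          void add(Map<String, Object> row);

          // Iterate the rows matching a key, e.g. when joining a cached
          // child entity to a parent row.
          Iterator<Map<String, Object>> iterator(Object key);

          // Remove the rows for a key, enabling delta updates on
          // persistent caches.
          void delete(Object key);

          // One-time cleanup, such as deleting a disk-backed cache's files
          // (in line with the revised entity.destroy() semantics in item 6).
          void destroy();
        }

      And a hedged sketch of items 4 and 5 as DIH configuration. The "cacheImpl" and "partition" parameter names come from this description, and "cacheName" from the discussion below; the entity name and attribute values are made-up illustrations, and the exact spellings in the patch may differ:

        <!-- Sketch: after a prior DIH run has written its output to a
             persistent cache via DIHCacheWriter (item 3), a later run reads
             the cache back as entity input, one partition per invocation. -->
        <entity name="itemFromCache"
                processor="DIHCacheProcessor"
                cacheImpl="BerkleyBackedCache"
                cacheName="itemCache"
                partition="3"/>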

      General Notes:
      We are nearing completion of converting our search functionality from a legacy search engine to Solr. However, I found that DIH did not support caching to the level of our prior product's data import utility. In order to get our data into Solr, I created these caching enhancements. Because I believe this has broad application, and because we would like this feature to be supported by the Community, I have front-ported this, enhanced, to Trunk. I have also added unit tests and verified that all existing test cases pass. I believe this patch maintains backwards compatibility and would be a welcome addition to a future version of Solr.

      Attachments

      1. SOLR-2382_3x.patch (86 kB, James Dyer)
      2. SOLR-2382.patch (138 kB, James Dyer)
      3. SOLR-2382.patch (125 kB, James Dyer)
      4. SOLR-2382.patch (125 kB, James Dyer)
      5. SOLR-2382.patch (172 kB, James Dyer)
      6. SOLR-2382.patch (170 kB, James Dyer)
      7. SOLR-2382.patch (170 kB, James Dyer)
      8. SOLR-2382.patch (161 kB, James Dyer)
      9. SOLR-2382.patch (144 kB, James Dyer)
      10. SOLR-2382-dihwriter_standalone.patch (59 kB, James Dyer)
      11. SOLR-2382-dihwriter.patch (58 kB, James Dyer)
      12. SOLR-2382-dihwriter.patch (58 kB, James Dyer)
      13. SOLR-2382-dihwriter.patch (39 kB, James Dyer)
      14. SOLR-2382-dihwriter.patch (37 kB, James Dyer)
      15. SOLR-2382-dihwriter.patch (38 kB, James Dyer)
      16. SOLR-2382-entities.patch (62 kB, Noble Paul)
      17. SOLR-2382-entities.patch (68 kB, James Dyer)
      18. SOLR-2382-entities.patch (68 kB, James Dyer)
      19. SOLR-2382-entities.patch (62 kB, James Dyer)
      20. SOLR-2382-entities.patch (61 kB, James Dyer)
      21. SOLR-2382-entities.patch (59 kB, James Dyer)
      22. SOLR-2382-entities.patch (59 kB, James Dyer)
      23. SOLR-2382-entities.patch (60 kB, James Dyer)
      24. SOLR-2382-properties.patch (15 kB, James Dyer)
      25. SOLR-2382-properties.patch (14 kB, James Dyer)
      26. SOLR-2382-solrwriter.patch (29 kB, James Dyer)
      27. SOLR-2382-solrwriter.patch (30 kB, Noble Paul)
      28. SOLR-2382-solrwriter.patch (30 kB, James Dyer)
      29. SOLR-2382-solrwriter-verbose-fix.patch (7 kB, James Dyer)
      30. TestCachedSqlEntityProcessor.java-break-where-clause.patch (1 kB, Mikhail Khludnev)
      31. TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch (2 kB, Mikhail Khludnev)
      32. TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch (2 kB, Mikhail Khludnev)
      33. TestThreaded.java.patch (4 kB, Mikhail Khludnev)

          Activity

          Lance Norskog added a comment -

          Bravo!

          It's great to see someone else learning the DIH code.

          James Dyer added a comment -

          Updated patch with better delta update capabilities for DIHCacheWriter. Also cleaned a couple things up. All tests pass.

          James Dyer added a comment -

          Updated patch with 2 fixes for things I missed when porting this from 1.4.1 to Trunk. Also added 1 more unit test.

          I think this is ready for someone else to evaluate if anyone has the time & desire. I do believe this would be a nice addition to the DIH product.

          James Dyer added a comment -

          Fix for two bugs in BerkleyBackedCache:

          • If the passed-in fieldNames & fieldTypes have leading or trailing spaces, opening the cache would fail.
          • If the cache was set up for Delta updates, then closed & re-opened, adding documents would cause an NPE.
          Lance Norskog added a comment -

          Have you tested this under threading?

          James Dyer added a comment -

          There is a multi-threaded unit test in "TestDihCacheWriterAndProcessor.java". However, I have not used the "threads" param in a real-world setting.

          James Dyer added a comment -

          This version fixes a bug involving the DIHCacheProcessor in the case of a many-to-[one|many] join between the parent entity and a child entity. If the child entity used a DIHCacheProcessor and the same child joined to consecutive parents, only the first parent would join to the child.

          James Dyer added a comment -

          In light of the recent discussion about the Spatial Contrib, I do wonder if seeking to get this committed is a non-starter because of its dependency on bdb-je. I thought this wouldn't be an issue because we have an existing Lucene contrib (db) with this same dependency, but then I noticed that some of the committers regret the existence of the db contrib for this reason (and others).

          In any case, even if the "BerkleyBackedCache" part of this patch could not be committed, having this framework in place so that developers can write their own persistent cache impls would be a major improvement in my opinion. (I had originally started with a Lucene-backed cache but switched to bdb-je because I couldn't figure out how to achieve acceptable performance for "get"s from the cache.)

          Noble Paul added a comment -

          James, does it make sense to split this issue into two separate issues?

          1) The changes required in DIH to support cache (DIHCache interface etc)
          2) Actual cache implementations

          James Dyer added a comment -

          Noble,

          I appreciate your interest in this issue! I could easily move BerkleyBackedCache to its own issue. This would remove any difficulty in dealing with the Sleepycat License. We would still want to maintain the SortedMapBackedCache, however. Otherwise we would lose all caching ability (it would break CachedSqlEntityProcessor).

          In any case, if your goal is to break this issue into more manageable chunks, just offloading BerkleyBackedCache might not be enough. I had considered breaking this up into possibly 3 parts because I realize this is a huge patch. But the functionality is all designed to work together, and it would have been more work for me, etc.

          Let me know what you want me to do. I would love to see this integrated with a GA release someday. I think this would have broad application and a lot of real-world use cases. (& we depend on it here...)

          Noble Paul added a comment -

          At least the BDB based cache will have to go to a different issue.

          James Dyer added a comment -

          Here's a new patch:

          • Applies cleanly to latest trunk.
          • Better de-coupling of SolrWriter and the PropertiesWriter
          • BerkleyBackedCache removed (will open a separate issue for this).
          • Unit test enhancements to make it easy to unit test new cache impl's as they get added.

          Note that because SortedMapBackedCache does not support persistence, most of the features are untestable (one test was removed; the others just skip for now). Three possible solutions:

          • create a persistence option for SortedMapBackedCache (maybe just use Java serialization; a rough sketch follows this list).
          • create a new cache impl that doesn't depend on a product with an incompatible license (maybe Lucene-backed, although I tried this long ago and didn't get the performance I needed).
          • figure out an acceptable way to include BerkleyBackedCache (we did it for the Lucene db module, so why not this?)
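
          For the first option, here is a minimal sketch of what Java-serialization persistence for a SortedMapBackedCache-style structure could look like. The class and method names are hypothetical, and it assumes every cached key and value implements Serializable:

            import java.io.File;
            import java.io.FileInputStream;
            import java.io.FileOutputStream;
            import java.io.IOException;
            import java.io.ObjectInputStream;
            import java.io.ObjectOutputStream;
            import java.util.List;
            import java.util.Map;
            import java.util.SortedMap;
            import java.util.TreeMap;

            // Hypothetical persistence helper: the cache maps a key to the
            // list of rows (field-name/value maps) sharing that key.
            class MapCachePersistence {

              static void write(SortedMap<Object, List<Map<String, Object>>> cache,
                                File file) throws IOException {
                try (ObjectOutputStream out =
                    new ObjectOutputStream(new FileOutputStream(file))) {
                  // TreeMap is Serializable, so copy into one before writing.
                  out.writeObject(new TreeMap<>(cache));
                }
              }

              @SuppressWarnings("unchecked")
              static SortedMap<Object, List<Map<String, Object>>> read(File file)
                  throws IOException, ClassNotFoundException {
                try (ObjectInputStream in =
                    new ObjectInputStream(new FileInputStream(file))) {
                  return (SortedMap<Object, List<Map<String, Object>>>) in.readObject();
                }
              }
            }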
          Noble Paul added a comment -

          The patch does not apply on trunk

          James Dyer added a comment -

          Noble,

          I just updated to the latest and re-applied this patch, and it worked for me. If you can give me specifics, I'll try to dig more to see what might be going wrong. Also, in case you're not on the very latest: there were some commits from about a week ago that broke the previous versions of this patch (r1135954 & r1136789). This newest patch will only work on code from after those commits.

          James Dyer added a comment -

          Just found a little bug in SortedMapBackedCache. This patch version includes a fix for it.

          James Dyer added a comment -

          Sorry...that last patch included some unrelated code. This one is correct.

          Noble Paul added a comment -

          The patch applies well.

          Suggestions

          The SolrWriter/DIHPropertiesWriter abstraction can be a separate patch and I can commit it right away. It may also have the changes for passing the handler name.

          The DIHCache should take the Context as a param, and the EntityProcessor does not need to make a copy of the attributes.

          James Dyer added a comment -

          The DIHCache should take the Context as a param and the EntityProcessor does not need to make a copy of the attributes

          I started down this road this afternoon hoping to have another patch version for you to look at today. But it turns out to be more complicated than I first anticipated. Any ideas how to get around these difficulties?

          • DIHCacheProcessor enforces "readOnly=false" and "deletePriorData=true". It also modifies the cacheName if the user specifies partitions. Seeing that Context-Entity-Attributes are immutable, should I pass these as Entity-Scope-Session-Attributes?
          • cacheInit() in EntityProcessorBase specifically passes only the parameters that apply to the current situation. This way, if a user supplies something non-applicable, it is safely ignored rather than producing undefined behavior. Just forwarding the context on doesn't give this flexibility. Do you think it's OK to just forward on the context anyway?
          • DocBuilder instantiates DIHCacheWriter, which in turn gets the user-specified Cache Implementation and instantiates that. At this point in time, there doesn't seem to be a Context to pass. So, rather than do this in the constructor, is there a safer place down the road where I should be instantiating the DIHCacheWriter?

          I realize that it's more lines of code to always copy these properties into a property map to send to the cache, but I was looking at the cache as being a layer down in the stack, and maybe it shouldn't have the whole context sent to it. What do you think?

          Noble Paul added a comment (edited) -

          cacheInit() in EntityProcessorBase specifically passes only the parameters that apply to the current situation

          It doesn't matter. It can use any params that are relevant to it. Anyway, you can't define what params are required for a future DIHCache impl. Look at a Transformer implementation: it can read anything it wants. The cache should be initialized the same way.

          Why should the DocBuilder even be aware of DIHCache? Should it not be kept local to the EntityProcessor?

          James Dyer added a comment -

          Here is a version that passes parameters via the Context object, rather than by building maps.

          James Dyer added a comment -

          Why should the DocBuilder be even aware of DIHCache , Should it not be kept local to the EntityProcessor?

          You're right that when the cache is owned by an EntityProcessor, DocBuilder has no knowledge of it. But there is another way these caches can be used, described in the "functionality" section of this issue's description:

          3. Provide a means to write the final (root entity) DIH output to a cache rather than to Solr ... Also provide the ability to do delta updates on such persistent caches.

          In this case, DocBuilder is outputting not to SolrWriter, but to DIHCacheWriter. It is arguable that the DIHCacheWriter should not be instantiated in DocBuilder in this instance, as I currently have it. Perhaps it should happen up the stack in DataImporter, etc. But in any case, whenever DIHCacheWriter gets instantiated, it needs to know which CacheImpl to create and also pass on any parameters that CacheImpl needs.

          James Dyer added a comment -

          Noble,

          Are you still able to work with me on this issue? Is there anything else you are waiting for from me? The patch I submitted on June 24 passes parameters via the Context object as you requested. Also, I previously separated "BerkleyBackedCache" out into a separate issue (SOLR-2613) so we won't run into licensing issues here. Let me know what else you think we need to do. Thanks.

          Noble Paul added a comment -

          my apologies for the delay.

          The problem with the patch is the size/scope. You may not need to open up other issues, but stuff like abstracting DIHWriter, DIHPropertiesWriter, etc. can be given as separate patches in the same issue and I can commit them straight away. Though the issue is about cache improvements, it goes far beyond that scope. Committing it as a whole is difficult.

          James Dyer added a comment -

          This is the properties file writer abstraction taken from the June 24 version of the entire patch.

          This removes property file operations from SolrWriter, paving the way for multiple "DIHWriter" implementations (such as "DIHCacheWriter").

          James Dyer added a comment -

          The previous patch left out a couple of things. Here's a fixed version...

          James Dyer added a comment -

          This is the "DIHWriter" abstraction taken from the June 24 version of the entire patch. This patch depends on "SOLR-2382-properties.patch".

          SolrWriter now implements the new DIHWriter interface. Also, logging operations are removed from SolrWriter. This opens the possibility of creating pluggable DIHWriters so that DIH can write to something other than a Solr index. This in turn opens up the ability to create a Writer that can write to a Persistent Cache for later use.

          Noble Paul added a comment -

          The patch does not apply. The codebase has changed in trunk, but it still does not apply.

          Steve Rowe added a comment -

          The patch does not apply. the codebase has changed in the trunk. But it still does not apply.

          trunk/dev-tools/scripts/SOLR-2452.patch.hack.pl should be able to adjust patches like this one.

          Noble Paul added a comment -

          I've uploaded the patch I modified with the above-mentioned script.

          James Dyer added a comment -

          Steve,

          Thanks for the tip, but I'm a bit confused, because yesterday I imported a fresh check-out of trunk into my company's svn server so that I could start the process of creating all of these separate patches. I should have had the SOLR-2452 changes already, shouldn't I?

          For instance, the 1st line of the "solrwriter" patch has:

          Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestMailEntityProcessor.java

          pre-2452, this path would have started with "solr/contrib/dataimporthandler/src/extras/test/java", right?

          In any case, if anyone still has difficulty in applying these, let me know specifically what the problem is and I'll try to make it right. Thanks.

          Steve Rowe added a comment -

          James,

          As you say, patches against recent checkouts don't need to be adjusted (specifically, checkouts from after SOLR-2452 was committed to trunk in r1144761 on 9 July at 23:01 UTC).

          For instance, the 1st line of the "solrwriter" patch has:

          Index: solr/contrib/dataimporthandler-extras/src/test/org/apache/solr/handler/dataimport/TestMailEntityProcessor.java

          pre-2452, this path would have started with "solr/contrib/dataimporthandler/src/extras/test/java", right?

          Right.

          James Dyer added a comment -

          This is the Entity Processor Caching portion taken from the June 24 version of the entire patch. This patch depends both on the "properties" patch and the "solrwriter" patch.

          This gives users the ability to specify a "cacheImpl" parameter to any Entity Processor, adding caching functionality. CachedSqlEntityProcessor is gutted and deprecated (with no loss of functionality). Instead, users can now use SqlEntityProcessor and specify a "cacheImpl". Also, the caching functionality has been removed from EntityProcessorBase and placed into SortedMapBackedCache, which is the only Cache implementation available here (as BerkleyBackedCache has been moved to SOLR-2613). A before/after config sketch follows below.

          Two new unit tests are included in this patch, one for SortedMapBackedCache and another demonstrating the use of an ephemeral cache to do joins.
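
          As a rough before/after sketch of that change (the query and field names are made up; the where syntax follows existing CachedSqlEntityProcessor usage, while "cacheImpl" and SortedMapBackedCache come from this patch):

            <!-- Before: the special cached processor implementation -->
            <entity name="feature" processor="CachedSqlEntityProcessor"
                    query="SELECT item_id, feature FROM feature"
                    where="item_id=item.id"/>

            <!-- After (sketch): the plain processor plus a pluggable cache -->
            <entity name="feature" processor="SqlEntityProcessor"
                    query="SELECT item_id, feature FROM feature"
                    cacheImpl="SortedMapBackedCache"
                    where="item_id=item.id"/>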

          James Dyer added a comment -

          This is the "DIHCacheWriter" and "DIHCacheProcessor" portion taken from the June 24 version of the entire patch. This patch depends both on the "properties" patch and the "solrwriter" patch.

          DIHCacheWriter is a drop-in replacement for SolrWriter, allowing a DIH run to write its output to a DIHCache rather than to Solr. Users can specify which CacheImpl to use. If the Cache supports persistence, delta updates can be performed on cached data.

          DIHCacheProcessor is an EntityProcessor that takes a DIHCache as its input.

          Unit tests are included. However, most tests skip because SortedMapBackedCache does not support persistence. To get the tests to run (for now), we need to also apply SOLR-2613 (BerkleyBackedCache).

          James Dyer added a comment -

          Noble,

          I now have this original issue split up into 5 separate pieces. The first 4 are included on this jira issue. BerkleyBackedCache is spun off to SOLR-2613. These patches all apply to trunk post-SOLR-2452, as Steve Rowe noted. These patches are based on the June 24 version of the big patch, including the change to use "Context" to pass parameters.

          Here is a summary. The patches need to be applied in this order.

          1. SOLR-2382-properties.patch - Remove property file operations from SolrWriter.

          2. SOLR-2382-solrwriter.patch - Allow multiple DIHWriters that inherit a common interface. Remove logging operations from SolrWriter.

          3. SOLR-2382-entities.patch - Allow any EntityProcessor to specify a "cacheImpl" and cache its data. Deprecate CachedSQLEntityProcessor. Introduce SortedMapBackedCache.

          4. SOLR-2382-dihwriter.patch - Introduce the DIHCacheWriter and DIHCacheProcessor.

          5. SOLR-2613 - Introduce BerkleyBackedCache, a fast cache Impl that supports persistence.

          Let me know whatever else you need me to do on this. I appreciate your time in working with me.

          Noble Paul added a comment -

          Good, James. The patches look fine. I shall review them and commit or suggest modifications.

          Noble Paul added a comment -

          @James: Committed the first one. Take a look and let me know if you want something to be changed.

          James Dyer added a comment -

          Looks good. I agree that DIHPropertiesWriter.isWritable() is preferable over exposing the File entirely.

          James Dyer added a comment -

          DIHPropertiesWriter doesn't need to import java.io.File anymore...

          Noble Paul added a comment -

          @James can you please update the next patch (SOLR-2382-solrwriter.patch) to trunk?

          James Dyer added a comment -

          Here is the "solrwriter" patch, sync'ed to the latest trunk, which now has the first patch ("properties") committed...

          Noble Paul added a comment -

          SOLR-2382-solrwriter.patch is committed. see if SOLR-2382-entities.patch needs an update

          James Dyer added a comment -

          Here is a freshly-sync'ed version of the "entities" patch.

          Alexey Serba added a comment -

          Hmm, after the last Solr upgrade my application stopped working. Previously the DataImportHandler response contained a "verbose-output" part in debug/verbose mode, but not anymore. I see there are some major refactorings going on related to debugLogger, etc. Any clues why the "verbose-output" part is missing?

          Noble Paul added a comment -

          Hi, I shall take a look.

          Noble Paul added a comment -

          This feature was just for debugging. Do you mean that your app was relying on the debug+verbose output?

          Alexey Serba added a comment -

          Yes, I'm using this feature to test whether a DIH configuration is valid and to present a full stack trace to the user if there's any problem with an SQL query, etc. I checked out the source code, and it seems there's still a condition to display "verbose-output", but it isn't working. Something is off after the latest refactorings.

          James Dyer added a comment -

          Alexey, thanks for finding this problem and letting us know!

          The problem is that SolrWriter contains a package-private null reference to "DebugLogger". This variable is always null and never gets initialized. DataImportHandler.handleRequestBody() null-checks it before writing out "verbose-output". Because it is always null, "verbose-output" can never get written.

          In the original refactoring, getDebugLogger() was part of the DIHWriter interface, and calling it on SolrWriter would initialize the "DebugLogger" variable. The final version removed getDebugLogger() from the interface, which seems right to me, as logging isn't a core function of writing DIH data. Unfortunately, now DocBuilder holds the real reference to "DebugLogger" and this isn't exposed to DataImportHandler.handleRequestBody().

          It seems to me that DocBuilder probably needs to be adding the verbose output to the final response, moving this out of DataImportHandler.handleRequestBody(). Noble, what do you think? Let me know if you have time to fix this or if you'd like me to create a new patch.

          Noble Paul added a comment -

          @James it would be helpful if you can give a patch.

          James Dyer added a comment -

          This patch fixes verbose debugging output and can be applied to the current Trunk.

          James Dyer added a comment -

          Here is a slight tweak on the "entities" patch. A convenience method in CachePropertyUtil to get properties as Objects rather than as Strings is added. (This fixes a bug I found in BerkleyBackedCache that was introduced recently by changes we've made...)

          Lance Norskog added a comment -

          Is SOLR-2382-entities.patch/July 29, the latest full patch?

          James Dyer added a comment -

          Lance,

          I apologize that it's getting pretty confusing. If you've got the current trunk, apply:

          first: SOLR-2382-solrwriter-verbose-fix.patch (7/29) (fixes bug reported by Alexey)
          second: SOLR-2382-entities.patch (7/29)
          third: SOLR-2382-dihwriter.patch (7/14)
          fourth: SOLR-2613 (BerkleyBackedCache, if desired)

          Alexey Serba added a comment -

          James, I tested your patch (SOLR-2382-solrwriter-verbose-fix.patch) and verified that verbose output is working ok now. Noble, could you please review and commit this patch?

          Noble Paul added a comment -

          Sure
          Lance Norskog added a comment -

          Hello-

          Are there any benchmark results with this patch? Given, say, two tables with a million elements in a pairwise join, how well does this caching system work?

          James Dyer added a comment -

          Lance,

          I do not have any scientific benchmarks, but I can tell you how we use BerkleyBackedCache and how it performs for us.

          In our main app, we fully re-index all our data every night (13+ million records). It's basically a 2-step process. First we run ~50 DIH handlers, each of which builds a cache from databases, flat files, etc. The caches partition the data 8 ways. Then a "master" DIH script does all the joining, runs transformers on the data, etc. We have all 8 invocations of this same "master" DIH config running simultaneously, indexing to the same Solr core, so each DIH invocation is processing 1.6 million records directly out of caches, doing all the 1-many joins, running transformer code, indexing, etc. This takes 1-1/2 hours, so maybe 250-300 Solr records get added per second. We're using fast local disks configured with RAID-0 on an 8-core, 64 GB server. This app is running Solr 1.4, using the original version of this patch, prior to my front-porting it to trunk. No doubt some of the time is spent contending for the Lucene index, as all 8 DIH invocations are indexing at the same time.

          We also have another app that uses Solr 4.0 with the patch I originally posted back in February, sharing hardware with the main app. This one has about 10 entities and uses a simple 1-dih-handler configuration. The parent entity drives directly off the database, while all the child entities use SqlEntityProcessor with BerkleyBackedCache. There are only 25,000 fairly narrow records, and we can re-index everything in about 10 minutes. This includes database time, indexing, running transformers, etc., in addition to the cache overhead.

          The inspiration for this was that we were converting off of Endeca and we were relying on Endeca's "Forge" program to join & denormalize all of the data. Forge has a very fast disk-backed caching mechanism and I needed to match that performance with DIH. I'm pretty sure what we have here surpasses Forge. And we also get a big bonus in that it lets you persist caches and use them as a subsequent input. With Forge, we had to output the data into huge delimited text files and then use that as input for the next step...

          Hope this information gives you some idea if this will work for your use case.

          Lance Norskog added a comment -

          Thank you for this explanation!

          Here is what seems to be a different use case: we need to find each record in the side table only once. Those foreign keys are in the same order as the primary table. Would we write another kind of DIHCache that loads N foreign keys at a time, and walks through those in order?

          Thanks.

          Lance Norskog added a comment -
          I apologize that it's getting pretty confusing. If you've got the current trunk, apply:
          
          first: SOLR-2382-solrwriter-verbose-fix.patch (7/29) (fixes bug reported by Alexey)
          second: SOLR-2382-entities.patch (7/29)
          third: SOLR-2382-dihwriter.patch (7/14)
          fourth: SOLR-2613 (BerkleyBackedCache, if desired)
          
          • The first patch, SOLR-2382-solrwriter-verbose-fix.patch, appears to have been committed on the trunk without notice. At least, applying this patch to today's trunk (August 4) reports that the patch is a no-op.
          • The second patch is required, but has a mistake in it: SOLR-2382-entities.patch
            • The section for CachePropertyUtil.java has an error:
              --- solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/CachePropertyUtil.java   (revision 0)^M
              +++ solr/contrib/dataimporthandler/src/java/org/apache/solr/handler/dataimport/CachePropertyUtil.java   (revision 0)^M
              @@ -0,0 +1,18 @@^M
              

              The 1,18 should be 1,31 (or 32): in a unified diff, the hunk header "@@ -0,0 +1,18 @@" tells patch to add only 18 lines, so this problem caused the file to end up with only its first 18 lines.

          • Have not tried the third patch yet, or the BDB implementation.

          The unit tests run. And thank you for adding full tests.

          Lance Norskog added a comment -

          One request: the DIH multi-threaded feature "threads" is commonly used with SQL data sources. Please add multi-threaded variants of the unit tests for this feature, and for the BDB plug-in. If possible, "thrash-testing" would be very important for catching regressions.

          Thanks.

          James Dyer added a comment -

          Updating the patches to the current Trunk. To use these, apply "entities" first, then "dihwriter".

          James Dyer added a comment -

          we need to find each record in the side table only once. Those foreign keys are in the same order as the primary table. Would we write another kind of DIHCache that loads N foreign keys at a time, and walks through those in order?

          I don't fully understand what you mean. You may be interested in DIHCacheProperties.CACHE_NO_DUPLICATE_KEYS? This parameter is supported by BerkleyBackedCache and would allow the kind of 1-to-1 join I think you are describing. From there, you can use this just like CachedSqlEntityProcessor works today, and join on whatever field the child entity is keyed on (assuming it is in the parent). The only difference is you can specify a different "cacheImpl", and therefore you have options beyond the in-memory-hashmap cache we've got today.

          Please add multi-threaded variants of the unit tests for this feature, and for the BDB plug-in. If possible, "thrash-testing" would be very important for catching regressions

          Did you see TestDihCacheWriterAndProcessor.testMultiThreaded()? There is also a BDB variant of the test included with SOLR-2613. I know this doesn't exactly "thrash-test" anything, but it's a good start and on par with the other multi-threaded DIH tests. Would you be willing to help with such tests and any subsequent debugging?

          Pulkit Singhal added a comment -

          @jdyer - Awesome work!
          @noble.paul - Seems like the SOLR-2382-entities.patch file is ready to be committed to trunk. What do you think about it?

          James Dyer added a comment -

          Noble,

          I've been meaning to ask you if there is anything I can do to help move this along. But then I got very busy and needed to wait until I could genuinely offer assistance! Things are slowing a bit for me now but I realize you no doubt are busy too.

          In any case, do you have any lingering concerns you'd like to see addressed? Like I mentioned before, we wouldn't be in production with Solr without this functionality. I would imagine many other users would like this also.

          Pulkit Singhal added a comment - edited

          James,

          I'm trying to figure out how the data-config.xml would be written to support this case:

          We needed a flexible & scalable way to temporarily cache child-entity data prior to joining to parent entities.
          

          1) Can you please provide a sample?

          2) Also my guess is that testing this feature would require the application of both entities.patch and dihwriter.patch ... Correct? I am guessing so because I think the dihwriter.patch has the code changes for data to be written to and consumed from a cache. In my use case there is one huge fixed-position child-entity text file which has:

          parentDataID_1 someData1
          parentDataID_1 someData2
          parentDataID_2 someData3
          parentDataID_2 someData4
          

          So I just want to make sure I'm getting the right set of patches and understanding the application of the use case properly. Please advise.

          Noble Paul added a comment -

          Sorry for the delay. I shall take it up soon.

          Noble Paul added a comment - edited

          The DIHCache interface should not have the following methods:

          /**
           * <p>
           * Get the next document in the cache or NULL if the last record has been reached.
           * Use this method to efficiently iterate through the entire cache.
           * </p>
           * @return
           */
          public Map<String, Object> getNext();

          /**
           * <p>Reset the cache's internal iterator so that a subsequent call to getNext() will return the first cached row</p>
           */
          public void resetNext();


          Instead we should have these methods:

          Iterator<Map<String,Object>> getIterator()
          
          Iterator<Map<String,Object>> getIterator(String key)
          

          This means all the state maintained for iteration will go away.

          If we can do that, the following fields will be redundant in SortedMapBackedCache:

          private Iterator<Map.Entry<Object, List<Map<String, Object>>>> theMapIter = null;
          private Object currentKey = null;
          private List<Map<String, Object>> currentKeyResult = null;
          private Iterator<Map<String, Object>> currentKeyResultIter = null;
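
          For concreteness, a minimal sketch of what the iterator-based contract might look like (the method signatures are the ones proposed above; the surrounding interface is illustrative, not the committed API):

          import java.util.Iterator;
          import java.util.Map;

          // Sketch only: iteration state lives in the returned Iterator
          // instances, so the cache itself stays stateless between callers.
          public interface DIHCache {

            // Iterate over every row in the cache.
            Iterator<Map<String, Object>> getIterator();

            // Iterate over only the rows stored under one key, e.g. all
            // child rows matching a single parent id.
            Iterator<Map<String, Object>> getIterator(String key);
          }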
          
          
          James Dyer added a comment -

          Pulkit,

          I take it you have a root entity (possibly from a SQL database) and a child entity from a flat text file, and you need to join the two data sources. There are two ways to do this using this caching feature. In either case you'll need both patches ("entities" & "dihwriter"). Also, if an in-memory cache is not adequate, you will need to get BerkleyBackedCache from SOLR-2613 (it is required by the second, 2-handler approach).

          The simple way uses a temporary (or ephemeral) cache. To do this, create a single DIH Request Handler and add your cached child entity to data-config.xml. DIH will load and cache "child_entity" every time you do an import. When the import is finished, the cache is deleted. This allows you to do joins on flat files, whereas without caching it would not be possible. The downside is that if the flat file changes infrequently, or if you're doing delta updates on your index, it is inefficient to load and cache a large flat file every time. Here's a sample data-config.xml:

          <dataConfig>
           <dataSource name="SQL" ... />
           <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
           <document name="my_doc">
             <entity name="root_entity" rootEntity="true" dataSource="SQL" pk="root_id" query="select root_id, more_data, etc..." >
              <entity
               name="child_entity"
               processor="LineEntityProcessor"
               dataSource="URL"
               transformer="BreakIntoFieldsTransformerIWrote"
               cacheImpl="BerkleyBackedCache"
               cacheBaseDir="temp_location_for_cache"
               cachePk="root_id"
               cacheLookup="root_entity.root_id"
               fieldNames="root_id,  flatfiledata1, etc"
               fieldTypes="BIGDECIMAL, STRING, etc"
              />
             </entity>
           </document>
          </dataConfig>
          

          The second approach is to create a second DIH Request Handler in your solrconfig.xml for the child entity. This request handler has its own data-config.xml (named dih-flatfile.xml). You would run this second request handler to build a persistent cache for the flat file, prior to running the main DIH request handler. Here's an example of this second DIH Request Handler configured in solrconfig.xml:

          <requestHandler name="/dih-flatfile" class="org.apache.solr.handler.dataimport.DataImportHandler">
           <lst name="defaults">
            <str name="config">dih-flatfile.xml</str>
            <str name="cacheDeletePriorData">true</str>
            <str name="fieldNames">root_id,  flatfiledata1, etc</str>
            <str name="fieldTypes">BIGDECIMAL, STRING, etc</str>
            <str name="writerImpl">org.apache.solr.handler.dataimport.DIHCacheWriter</str>
            <str name="cacheImpl">BerkleyBackedCache</str>
            <str name="cacheBaseDir">location_of_persistent_caches</str>
            <str name="cacheName">flatfile_cache_name</str>
            <str name="cachePk">root_id</str>
           </lst>
          </requestHandler>
          

          And here is what "dih-flatfile.xml" would look like:

          <dataConfig>
           <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
           <document name="my_doc_child">
             <entity name="child_entity" processor="LineEntityProcessor" dataSource="URL" transformer="BreakIntoFieldsTransformerIWrote" />
           </document>
          </dataConfig>
          

          Your main "data-config.xml" would look like this:

          <dataConfig>
           <dataSource name="SQL" ... />
           <dataSource name="URL" baseUrl="path_to_flat_file" type="URLDataSource" />
           <document name="my_doc">
             <entity name="root_entity" rootEntity="true" dataSource="SQL" pk="root_id" query="select root_id, more_data, etc..." >
              <entity
               name="child_entity"
               processor="org.apache.solr.handler.dataimport.DIHCacheProcessor"
               cacheImpl="BerkleyBackedCache"
               cachePk="root_id"
               cacheLookup="root_entity.root_id"
               cacheBaseDir="location_of_persistent_caches"
               cacheName="flatfile_cache_name"
              />
             </entity>
           </document>
          </dataConfig>
          

          This second approach offers more flexibility (you can load the persistent cache off-hours, re-use it, do delta updates on it, etc.) but it is significantly more complex. The worst part is creating a scheduler that will run the child entity's DIH request handler, wait until it finishes, then run the main DIH request handler (see the sketch below). But this is moot if you only need to load the child once, or once in a great while.
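
          For illustration, a bare-bones scheduler along these lines might poll the child handler's status before kicking off the main import. This is a hedged sketch only: it assumes the handlers shown above live at "/dih-flatfile" and a main "/dataimport", and that DIH's status response contains "busy" while an import is running.

          import java.io.InputStream;
          import java.net.URL;
          import java.util.Scanner;

          // Rough sketch: build the child-entity cache, poll until the handler
          // goes idle, then kick off the main import.
          public class DihScheduler {
            private static final String BASE = "http://localhost:8983/solr";

            public static void main(String[] args) throws Exception {
              httpGet(BASE + "/dih-flatfile?command=full-import"); // build the cache
              while (httpGet(BASE + "/dih-flatfile?command=status").contains("busy")) {
                Thread.sleep(10000); // wait for the cache build to finish
              }
              httpGet(BASE + "/dataimport?command=full-import");   // main import
            }

            private static String httpGet(String url) throws Exception {
              InputStream in = new URL(url).openStream();
              try {
                Scanner s = new Scanner(in).useDelimiter("\\A");
                return s.hasNext() ? s.next() : "";
              } finally {
                in.close();
              }
            }
          }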

          Should this all get committed, I will eventually create something on the wiki. In the meantime, I hope you find all of this helpful. For more examples, see the xml files these patches add to the "solr/contrib/dataimporthandler/src/test-files/dih/solr/conf" folder, and also the new unit tests that use them.

          James Dyer added a comment -

          Noble,

          Here is a version of the "entities" patch using ".iterator()" methods as you suggest. Let me know if this is what you had in mind and also if there is anything else you'd like to address.

          Noble Paul added a comment -

          A few more points

          Let us avoid methods which are not used. For example, SolrWriter.setDeltaKeys(): it is not implemented and I am not even clear why it is there.

          Now that we have a concept of DIHCache, move all cache-related logic from EntityProcessorBase to another class, probably a base DIHCache.

          Let us put in the minimum amount of changes.

          Remove the DIHCacheProperties class and inline the constants. That is the way it is done everywhere else.

          I don't understand the need for DocBuilder.resetEntity(). In my view, the DataConfig state must not be changed between runs.

          James Dyer added a comment -

          Noble,

          Thanks for the comments. Let me see if I can answer some of your questions, and perhaps you can give further guidance as to how we can address your concerns. In the meantime, I will try to get a patch together that incorporates as much of what you suggest as possible.

          SolrWriter.setDeltaKeys();
          it is not implemented and I am not even clear why it is there

          This implements the new DIHWriter interface, but I see I failed to put the proper annotations in, hence the confusion. This method is required by a DIHWriter that supports both delta updates and duplicate keys (e.g., DIHCacheWriter, in the next patch). SolrWriter does not implement this because Solr does not support duplicate keys. That is, in the case of a delta update to a Solr index, a repeat key is definitely an Update and cannot be an Add. Caches that support duplicate keys, however, need to know up-front whether a duplicate key is an Add or an Update. In the next patch, I will put all the proper annotations in place. Will this satisfy your concern?
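
          To make the distinction concrete, here is a hedged sketch of how a duplicate-key-aware writer might use this method. It is one plausible reading, not the patch's actual code; the delete call stands in for whatever purge operation the cache provides, and the single-pk-field assumption is illustrative.

          import java.util.Map;
          import java.util.Set;

          // Sketch: a cache that allows duplicate keys cannot tell whether an
          // incoming row is a brand-new row or a replacement for an old one,
          // so the writer is told up-front which keys belong to this delta run.
          public class SketchCacheWriter {
            private final DIHCache cache; // the cache being written to

            public SketchCacheWriter(DIHCache cache) {
              this.cache = cache;
            }

            public void setDeltaKeys(Set<Map<String, Object>> deltaKeys) {
              // Purge existing rows for every changed key, so the rows arriving
              // in this run become plain adds rather than extra duplicates.
              for (Map<String, Object> deltaKey : deltaKeys) {
                Object key = deltaKey.values().iterator().next(); // assumes one pk field
                cache.delete(key); // hypothetical purge operation
              }
            }
          }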

          Now that we have a concept of DIHCache, move all cache related logic from EntityprocessorBase to another class. Probably a baseDIHCache.

          I realized back when this was first developed that this would be a good future refactoring, but this is a pretty big project already and I was trying to minimize the changes. But I can do this in the next patch version if you'd like it done now. Sound good?

          remove the DIHCacheProperties class and inline the constants. That is the way it done everywhere else

          It made more sense back when I was developing this to have the constants centralized, because many of them are used by more than one class. But for consistency I can inline them somewhere for the next patch version. Agree?

          I don't understand the need for DocBuilder.resetEntity() According to me the DataCOnfig state must not be changed between runs.

          All this does is recursively set the "initialized" flag on an entity and its children back to "false". (The "initialized" flag ensures that "destroy" is only called on an entity once; see "Implementation Details" #6 in my original description for this issue.) I think I added "resetEntity" as a safety measure because I don't know enough about DIH to be guaranteed that these entity objects never get used again. If you're pretty sure it's impossible the same entity objects would be used again, we can remove "resetEntity". In the meantime, let me see if all the unit tests pass with this removed. If you're sure, and if all unit tests pass without it, then I'd agree we should remove it. Sound like a plan on this one?

          Noble Paul added a comment -

          I realized back when this was first developed that this would be a good future refactoring but this is a pretty big project already and I was trying to minimize the changes

          It may be worth a try, because this issue is about having pluggable caching. If the caching logic lives in the core code, the issue is not really complete without it.

          Either implement SolrWriter.setDeltaKeys() or remove it; either way I'm fine.

          The entities are reused, but it has always been like that. Why do you need that initialized flag? What is initialized?

          James Dyer added a comment -

          The entities are reused, but it has always been like that. Why do you need that initialized flag? What is initialized?

          Perhaps "intialized" is the wrong name for this flag, but let me explain how its used. In "Implementation Details" #6 in this issue's description, I mentioned the need to change the semantics for "entity.destroy()". Previous to this patch, for child entities, both "entity.destroy()" & "entity.init" get called once per parent row. So throughout the course of a DIH import, child entities constantly get their "init" and "destroy" methods called over and over again. But what if we have "init" and "destroy" operations that are meant to be executed only once? "init" copes with this by setting a "firstInit" flag on each entity and having any init steps that get called only once controlled by this flag.

          But there was no such coping mechanism built into "destroy". There was never a need, because in actuality only one of our prepackaged entities implements "destroy()". But entities that use persistent caching require a way to clean up any unneeded caches at the end. Because "destroy()" was largely unused, I decided to change its semantics to handle this end-of-lifecycle cleanup operation. (The one entity that already implements "destroy" is LineEntityProcessor, but prior to this patch we could not use LineEntityProcessor as a child entity and do joins, so the semantic change here doesn't matter.)

          Thus the "entityWrapper.initalized" flag gets set (DocBuilder lines 637-640) the first time a particular entity is encountered. The flag ensures that the entity gets added to the "Destroy-List" only once. When any entity is done being used (its parent is finished), the appropriate "Destroy-List" is looped through, the children are destroyed, and their initialized flags get set back to "false". (DocBuilder lines 617-621). "resetEntity()" sets the flag back, existing in its own method so that it may be done recursively.

          I apologize for this very long explanation, but I hope it is helpful. Obviously I've made design decisions here that you may or may not agree with. Basically I need an "entity.destroy()" that is guaranteed to get called only once, at the time the entity is done executing. If you would like this done differently, let me know what you have in mind and I can try to change it.

          Do you now understand why I am using an "initialized" flag? Is this OK as-is, or if not, how would you like the design changed?

          James Dyer added a comment -

          Noble,

          I have updated the "entities" patch to address the changes we've discussed this week. The only thing I didn't address here was your question about "resetEntity()", which I hope was adequately explained yesterday.

          Is there anything else that needs to be done with "entities" before it can be committed?

          James Dyer added a comment -

          Here is a version of the "dihwriter" patch compatible with today's "entity" patch update.

          James Dyer added a comment -

          Re-attaching "entities" with the "grant license" button selected.

          Noble Paul added a comment -

          With some cleanup.
          I think there is a very big omission: the EntityProcessorBase.transformers field is not used in the latest patch. How does transformation work?

          James Dyer added a comment -

          The EntityProcessorBase.transformers field is not used in the latest patch. How does transformation work?

          This hasn't been in use since SOLR-1120 was applied back in 2009 (r766608). Since SOLR-1120, Transformers are applied in the EntityProcessorWrapper class. We could remove this variable now, as it does nothing and is not used by this class or any prepackaged subclasses. Agree?

          With some clean up.

          What other kinds of things do you have in mind?

          James Dyer added a comment -

          Noble,

          Is there anything else you need from me to help move this forward?

          Noble Paul added a comment -

          I shall commit the latest patch shortly. I'm seeing some exceptions (maybe unrelated). Will investigate a bit further.

          Noble Paul added a comment -

          Committed svn version 121659. Thanks, James.

          Steve Rowe added a comment -

          Hi Noble,

          In DIHCache.java, you used the javadoc tag @solr.experimental, but there is no support in the build system for this tag, so it causes javadoc warnings, which fail the build, e.g.: https://builds.apache.org/job/Lucene-Solr-tests-only-trunk/11327/consoleText (scroll down to the bottom to see the warning):

          [javadoc] [...]/DIHCache.java:14: warning - @solr.experimental is an unknown tag.
          

          Would you mind if I switch @solr.experimental to @lucene.experimental?

          Noble Paul added a comment -

          please go ahead

          Steve Rowe added a comment -

          please go ahead

          Done: r1201784

          James Dyer added a comment -

          Here is an updated version of the "dihwriter" patch, which is the last part of this ticket. This is in-sync with Noble's commit of the "entities" patch.

          This includes:
          1. DIHCacheWriter - lets DIH write to a DIHCache instead of to Solr for future processing. This also supports Delta Updates on caches.
          2. DIHCacheProcessor - lets DIH read from a cache that was previously written by DIHCacheWriter.
          3. MockDIHCache - a bare-bones cache implementation that supports persistence and delta updates. This allows us to run all of the unit tests without also needing to apply SOLR-2613 (BerkleyBackedCache). This also provides a full reference implementation for those who would like to write their own Caches.

          Noble, will you be able to work through this patch also with me?

          Noble Paul added a comment -

          Is this required? Could you please let me know the use cases for this?

          James Dyer added a comment -

          Noble,

          I can't speak for every use case, but these were necessary for one of our applications. The whole idea is it lets you load your caches in advance of indexing (DIHCacheWriter), then read back your caches at a later time when you're ready to index (DIHCacheProcessor).

          • This is especially helpful if you have a lot of different data sources that each contribute a few data elements in each Solr record. (we have at least 40 data sources.)
          • If you have slow data sources, you can run multiple DIH scripts at the same time and build your caches simultaneously (my app builds 12 DIH caches at a time, as we have some slow legacy databases to contend with).
          • If you have some data sources that change infrequently and others that are changing all the time, you can build caches for the infrequently-changing data sources, making it unnecessary to re-acquire this data every time you do a delta update. (This is actually a very common case. Imagine having Solr loaded with product metadata: most of the data would seldom change, but things like prices, availability flags, stock numbers, etc. might change all the time.)
          • The fact that you can do delta imports on caches allows users to optimize the indexing process further. If you have multiple child-entity caches with data that mostly stays the same, but each has churn on a small percentage of the data, being able to just go in and delta update the cache lets you only re-acquire what changed. Otherwise, you have to take every record that had a change in even 1 data source and re-acquire all of the data sources for every record.
          • These last two points relate to the fact that Lucene cannot do an "update" but only a "replace". Being able to store your system-of-record data in caches alleviates the need to re-acquire all of your data sources every time you need to do an "update" on a few fields.
          • Some systems do not have a separate system-of-record as the data being indexed to Solr is ephemeral or changes frequently. Having the data in caches gives you the freedom to delta update the information or easily re-index all data at system upgrades, etc. I could see for some users these caches factoring into their disaster recovery strategy.
          • There is also a feature to partition the data into multiple caches, which would make it easier to subsequently index the data to separate shards. We use this feature to index the data in parallel to the same core (we're using Solr 1.4, which did not have a "threads" parameter), but this would apply to using multiple shards also.

          Is this convincing enough to go ahead and work towards commit?

          Mikhail Khludnev added a comment - edited

          Hello,

          I want to contribute a test for the Parent/Child use case with CachedSqlEntityProcessor in single- and multi-threaded modes:
          TestThreaded.java.patch on r1144761
          Please let me know how you feel about it.

          Actually, I explored this case in 3.4 some time ago, but decided to wait a little until this refactoring made progress.

          • the first issue is testCachedSingleThread_FullImport() failure. It's caused by
            DocBuilder.java line 473
               } finally {
                    entityProcessor.destroy();
                  }
            

            this code, which cleans up the cache, makes sense, but for parent entities only; it causes a failure for the child entities' enumeration when run() is called from line 510. It shouldn't be a big deal to fix.

          • Then, some minor moaning: it looks like where="xid=x.id" is not supported by the new code, which relies on cachePk="xid" and cacheLookup="x.id". For me it's a matter of opinion, backward compatibility, and documentation.
          • The most interesting problem is the failure of testCachedMultiThread_FullImport(). In 3.4 it's caused by concurrent access to the child entities' iteration state. Now it looks like DIHCacheSupport.dataSourceRowCache is accessed by multiple threads from ThreadedEntityProcessorWrapper. I have some ideas (see the sketch after this list), but want to know your opinion.
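
          One possible direction for the shared-state race (purely illustrative, not a proposed patch): confine the per-key iteration state to each worker thread, for example with a ThreadLocal:

          import java.util.Iterator;
          import java.util.List;
          import java.util.Map;

          // Sketch: if each worker thread holds its own iterator over the
          // current key's rows, threads no longer race on a shared field
          // like DIHCacheSupport.dataSourceRowCache.
          class ThreadConfinedRowCache {
            private final ThreadLocal<Iterator<Map<String, Object>>> rows =
                new ThreadLocal<Iterator<Map<String, Object>>>();

            void startKey(List<Map<String, Object>> rowsForKey) {
              rows.set(rowsForKey.iterator());   // this thread's private cursor
            }

            Map<String, Object> nextRow() {
              Iterator<Map<String, Object>> it = rows.get();
              return (it != null && it.hasNext()) ? it.next() : null;
            }
          }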

          Guys, I can handle some of these, let me know how I can help.


          Mikhail

          Noble Paul added a comment -

          @James

          • My question is: is it relevant to a large number of people to have this last patch?
          • Can you achieve your objective by just adding an implementation? I mean, will you need to patch DIH to meet your needs?
          James Dyer added a comment -

          Mikhail,

          Thank you for testing this and providing information and a patch. I have a few questions.

          this code, which cleanups the cache, makes sense, but for parent entities only, and causes a failure for the child entities enumeration

          Currently none of the existing unit tests fail with this change, and I'm not sure exactly how to reproduce the problem you're describing. Could you create a failing unit test for this to clarify what you're experiencing?

          looks like where="xid=x.id" is not supported by new code, which relies on cachePk="xid" and cacheLookup="x.id"

          Do you mean this breaks CachedSqlEntityProcessor? Looking at TestCachedSQLEntityProcessor, it seems like the case you describe is adequately tested and this test still passes. Possibly you mean something different? Once again, a failing unit test would be helpful in knowing how to reproduce the specific problem you've found.

          the most interesting problem is failure of testCachedMultiThread_FullImport(). At 3.4 it's caused by concurrent access...

          This sounds like a valid issue, and just looking at it, it seems like you've got a good unit test for it. But if it's happening in 3.4 then it's not related to SOLR-2382, which is in Trunk/4.0 only. Would you mind opening a new "bug" issue for this?

          James Dyer added a comment -

          Here is an updated "dihwriter" patch, fixing a test bug.

          Mikhail Khludnev added a comment - edited

          James,

          Pls find my replies below

          Currently none of the existing unit tests fail with this change and I'm not sure exactly how to reproduce the problem you're describing. Could you create a failing unit test for this to clarify what you're experiencing?

          I did it yesterday; I attached TestThreaded.java.patch against trunk r1144761.
          Can you see it in the attachments?
          After you apply it, both of the new tests fail. The simplest one, testCachedSingleThread_FullImport(), can be fixed by disabling cache destruction, i.e. commenting out DocBuilder.java line 473: // entityProcessor.destroy();
          I believe it should be properly covered by a separate issue.

          Once again, a failing unit test would be helpful in knowing how to reproduce the specific problem you've found.

          testCachedSingleThread_FullImport() fails again if you remove cachePk="xid" cacheLookup="x.id" from the test config constant at line 109. I briefly looked into TestCachedSQLEntityProcessor, and it seems to me I've found where it is not accurate enough; let me attach my findings soon.

          But if its happening in 3.4 then its not related to SOLR-2382, which is in Trunk/4.0 only

          The problem is the same as in 3.x. My test TestThreaded.java.patch has testCachedMultiThread_FullImport(), which covers this particular race condition. You can see it once you make testCachedSingleThread_FullImport() green somehow (see issue no. 1). I can't spawn it as a new issue because of blocker no. 1.

          James Dyer added a comment -

          > Can you achieve your objective by just adding an implementation? I mean, will you need to patch DIH to meet your needs?

          This patch, "SOLR-2382-dihwriter_standalone.patch" separates the parameter names so that it is entirely stand-alone. So yes, for my purposes, we can put this code in a separate .jar and it will not need any patches in DIH.

          > My question is, is it relevant to a large number of people to have this last patch?

          Anyone who is indexing a lot of data that needs to join from multiple sources would probably consider using this, provided it gets well-documented. In our case, we have different data sources across our enterprise that all contribute a few fields to each Solr document, so having this added flexibility was vital for us.

          In any case, if you do not want to work towards committing this last patch now, I think it would be wise to close SOLR-2382, and I can open a new issue just for this last patch. That will make it easier for someone who wants the functionality in the future to find it in JIRA, or for another committer to pick it up someday. Let me know what you plan to do.

          Mikhail Khludnev added a comment - edited

          James,

          Please find my proof that where="xid=x.id" is not supported: TestCachedSqlEntityProcessor.java-break-where-clause.patch. It looks puzzling; I'm sorry for that. The test was green only because it relied on the key order in the map. Wrapping the map in a sorted map breaks that order and leads to picking the wrong primary-key column. Please find the explanation below.

          From my point of view, the worst part is lines 27-28: when the primary key wasn't properly detected from the attributes, it just picks up the first key from the map as the primary key. This condition hides the problem until you run into it and have to address it.

          The left part of the where clause isn't used here, at lines 45-48, and where="" is ignored again at lines 185-190.

          You can see that the second attachment, TestCachedSqlEntityProcessor.java-fix-where-clause-by-adding-cachePk-and-lookup.patch, fixes the test by adding cachePk and cacheLookup to the attributes.

          My proposals are:

          • Fix it; it's not a big deal to bring the where attribute back.
          • But why are the new attributes cachePk and cacheLookup better than the old where attribute? Depending on the reply, I vote for
            • decommissioning where="", or
            • rolling the new cachePk/cacheLookup attributes back.
          • Can't we add more randomization to AbstractDataImportHandlerTestCase.createMap(Object...) to find more hidden issues like this one? I propose choosing the concrete map behaviour randomly: hash, sorted, sorted-reverse (see the sketch after this list). WDYT?
          • The names withWhereClause() and withKeyAndLookup() should be swapped; their contents contradict their names:
              public void withWhereClause() {
                ...
                "query", q, DIHCacheSupport.CACHE_PRIMARY_KEY, "id", DIHCacheSupport.CACHE_FOREIGN_KEY, "
                ...
              public void withKeyAndLookup() {
                ...
                Map<String, String> entityAttrs = createMap("query", q, "where", "id=x.id",
                ...
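          A minimal sketch of that randomization idea (hypothetical: the method name, the Random parameter, and the plain java.util types are mine; the real helper would likely draw from the test framework's seeded random):

              // Hypothetical variant of createMap() that randomizes the backing Map so
              // hidden key-order assumptions, like the one above, fail fast.
              public static Map<String, Object> createRandomizedMap(Random random, Object... args) {
                Map<String, Object> result;
                switch (random.nextInt(3)) {
                  case 0:  result = new HashMap<String, Object>(); break;  // hash: no order guarantee
                  case 1:  result = new TreeMap<String, Object>(); break;  // sorted
                  default: result = new TreeMap<String, Object>(Collections.<String>reverseOrder()); // sorted-reverse
                }
                for (int i = 0; i < args.length; i += 2) {
                  result.put((String) args[i], args[i + 1]);  // same key/value pairing as createMap()
                }
                return result;
              }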
          James Dyer added a comment -

          Mikhail,

          I looked at "TestCachedSqlEntityProcessor.java-break-where-clause.patch". It looks like a bug is being introduced into the test because we have a Map<Object> with some Strings and some Integers in it. You modified the test case to store these in a TreeMap, but when the TreeMap tries to put these in order, it cannot cast the Strings to Integers or vice versa. I don't mean to say the bug you describe doesn't exist, only that this new test case doesn't seem to properly test for it. Could you improve this test case?

          Mikhail Khludnev added a comment -

          James,

          Your observation contradicts the stack trace:

          org.apache.solr.handler.dataimport.DataImportHandlerException: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
              at org.apache.solr.handler.dataimport.DataImportHandlerException.wrapAndThrow(DataImportHandlerException.java:64)
              at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:240)
              at org.apache.solr.handler.dataimport.TestCachedSqlEntityProcessor.doWhereTest(TestCachedSqlEntityProcessor.java:227)
              at org.apache.solr.handler.dataimport.TestCachedSqlEntityProcessor.withKeyAndLookup(TestCachedSqlEntityProcessor.java:211)
          .....
          Caused by: java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Integer
              at java.lang.Integer.compareTo(Integer.java:37)
              at java.util.TreeMap.getEntry(TreeMap.java:328)
              at java.util.TreeMap.get(TreeMap.java:255)
              at org.apache.solr.handler.dataimport.SortedMapBackedCache.iterator(SortedMapBackedCache.java:106)
              at org.apache.solr.handler.dataimport.DIHCacheSupport.getIdCacheData(DIHCacheSupport.java:160)
              at org.apache.solr.handler.dataimport.DIHCacheSupport.getCacheData(DIHCacheSupport.java:127)
              at org.apache.solr.handler.dataimport.EntityProcessorBase.getNext(EntityProcessorBase.java:131)
              at org.apache.solr.handler.dataimport.SqlEntityProcessor.nextRow(SqlEntityProcessor.java:75)
              at org.apache.solr.handler.dataimport.EntityProcessorWrapper.nextRow(EntityProcessorWrapper.java:237)
              ... 31 more
          

          Using a SortedMap is absolutely valid here. It has Strings as column names, and String and Integer values. The problem is caused by the primary-key detection I described above: the "desc" column was recognized as the PK, and the parent FK is looked up in it. That causes the ClassCastException.
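
          A minimal self-contained illustration of exactly that failure mode (plain java.util; the values are made up):

              // The String-valued "desc" column gets mis-detected as the PK, so the
              // sorted cache ends up keyed by Strings...
              TreeMap<Object, Object> cacheKeys = new TreeMap<Object, Object>();
              cacheKeys.put("one", "row 1");
              // ...and the parent's Integer foreign key is then looked up against them:
              cacheKeys.get(1); // Integer.compareTo -> "String cannot be cast to Integer"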

          Let me reattach the patch with a LinkedHashMap instead of the sorted one; I just need the "desc" column to come first to reproduce the issue.

          Mikhail Khludnev added a comment -

          TestCachedSqlEntityProcessor.java-wrong-pk-detected-due-to-lack-of-where-support.patch breaks withKeyAndLookup() by reordering the entries in the row map with a LinkedHashMap (please have a look at createLinkedMap()).

          The stack trace is pretty much the same as in the comment above. You can check the failed instance of SortedMapBackedCache in a debugger: it has "desc" as its primaryKeyName, and theMap is messed up:

           {another another three=[{id=3, desc=another another three}], another three=[{id=3, desc=another three}], another two=[{id=2, desc=another two}], one=[{desc=one, id=1}], three=[{id=3, desc=three}], two=[{id=2, desc=two}]}
          

          Regards

          Mikhail Khludnev added a comment -

          I spawned subtask SOLR-2933

          Noble Paul added a comment -

          @James
          Yes, create a new issue for the remaining functionality, and let's close this one.

          James Dyer added a comment -

          Noble,

          I have attached a patch with a corrected unit test and a fix on SOLR-2933, addressing one of the problems Mikhail described. Indeed, the "where" parameter was broken by our last commit, and TestCachedSqlEntityProcessor masked the failure and passed anyway. Would you mind looking at my patch and committing it? Thanks.

          James Dyer added a comment -

          Spun off the remaining piece to SOLR-2943, so closing this one.

          Noble, Thank you for all your help with this.

          Mikhail Khludnev added a comment -

          James, Noble,

          Please have a look at the new one: SOLR-2947.

          James Dyer added a comment -

          Re-opening to backport to 3.x. This is being done so that the SOLR-3011 thread fixes can be included in 3.x. (The plan is to give 3.x users a stable-enough "threads" implementation in 3.x, then remove "threads" from trunk.)

          James Dyer added a comment -

          The patch for 3.x includes everything already committed to Trunk, as well as various bug fixes (also already in Trunk). (The patch is here for reference only; the changes were actually moved using svn merge.)

          I will commit to 3.x shortly.

          James Dyer added a comment -

          Committed to 3.x: r1303792 (and r1303822 for the license headers).

