Solr
  1. Solr
  2. SOLR-4086

Refactor DIH - VariableResolver & Evaluator

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.1, 6.0
    • Labels:
      None

      Description

      This simplifies VariableResolver and moves each built-in Evaluator into its own class. Compiler warnings / missing generics are fixed. Also, the Locale bug with DateFormatEvaluator is solved. Instead of using the machine default, the Root Locale is used by default. An optional 3rd parameter allows users to specify whatever locale they want.

      1. SOLR-4086.patch
        108 kB
        James Dyer

        Issue Links

          Activity

          Hide
          James Dyer added a comment -

          This is a big patch, but much of it is from moving code. All tests pass with this applied.

          My plan is to commit this in a few days.

          Show
          James Dyer added a comment - This is a big patch, but much of it is from moving code. All tests pass with this applied. My plan is to commit this in a few days.
          Hide
          Uwe Schindler added a comment -

          Does this fix all default Locale bugs in DIH? If yes, you should remove the DIH class file exclusion from the API violations checker in solr/build.xml.
          I am happy, that you are taking care, thanks!

          Show
          Uwe Schindler added a comment - Does this fix all default Locale bugs in DIH? If yes, you should remove the DIH class file exclusion from the API violations checker in solr/build.xml. I am happy, that you are taking care, thanks!
          Hide
          Robert Muir added a comment -

          I am happy, that you are taking care, thanks!

          +1 to that!

          Show
          Robert Muir added a comment - I am happy, that you are taking care, thanks! +1 to that!
          Hide
          James Dyer added a comment -

          I don't believe it fixes all of SOLR-1916, and certainly doesn't eliminate all the forbidden API's (that API checker is very handy, by the way). But I think this takes a good bite out of it and hopefully soon we can remove the DIH exclusion.

          We might also be able to eventually run DIH tests in parallel also now that SOLR-4051 gives the ability to specify the location of the "dataimport.properties" file. Tests need to be changed to take advantage first though.

          Show
          James Dyer added a comment - I don't believe it fixes all of SOLR-1916 , and certainly doesn't eliminate all the forbidden API's (that API checker is very handy, by the way). But I think this takes a good bite out of it and hopefully soon we can remove the DIH exclusion. We might also be able to eventually run DIH tests in parallel also now that SOLR-4051 gives the ability to specify the location of the "dataimport.properties" file. Tests need to be changed to take advantage first though.
          Hide
          James Dyer added a comment -

          Committed.

          Trunk: r1411276
          4x: r1411280

          Show
          James Dyer added a comment - Committed. Trunk: r1411276 4x: r1411280
          Hide
          Mark Miller added a comment -

          looks like this triggered a jenkins fail due to tabs,nocommit, or author tag.

          Show
          Mark Miller added a comment - looks like this triggered a jenkins fail due to tabs,nocommit, or author tag.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411371

          SOLR-4086: fix residual problems with DateFormatEvaluator

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411371 SOLR-4086 : fix residual problems with DateFormatEvaluator
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411348

          SOLR-4086: fix dateformat evaluator unit test

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411348 SOLR-4086 : fix dateformat evaluator unit test
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411370

          SOLR-4086: fix TestVariableResolver

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411370 SOLR-4086 : fix TestVariableResolver
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411366

          SOLR-4086: fix residual problems with DateFormatEvaluator

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411366 SOLR-4086 : fix residual problems with DateFormatEvaluator
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411357

          SOLR-4086: fix TestVariableResolver

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411357 SOLR-4086 : fix TestVariableResolver
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411334

          SOLR-4086: fix dateformat evaluator unit test

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411334 SOLR-4086 : fix dateformat evaluator unit test
          Hide
          Commit Tag Bot added a comment -

          [trunk commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1413729

          SOLR-4086: fix trivial test mistake

          Show
          Commit Tag Bot added a comment - [trunk commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1413729 SOLR-4086 : fix trivial test mistake
          Hide
          Dominik Siebel added a comment -

          James Dyer
          I noticed a massive decrease in indexing performance when trying a current checkout from branch_4x after this refactoring. Do you have any explanation for that? Looking through the code I could not find any changes that would explain this.

          Some numbers:

          • 1.3M documents
          • 4 DIHs

          before (with my patch from SOLR-2141):

          • DIH_1: documents processed: 325130, time taken: 00:25:44.445
          • DIH_2: documents processed: 207347, time taken: 01:16:04.607
          • DIH_3: documents processed: 184601, time taken: 01:18:00.797
          • DIH_4: documents processed: 618580, time taken: 04:17:38.414

          after:

          • DIH_1: documents processed: 324996, time taken: 01:07:47.186
          • DIH_2: documents processed: 207347, time taken: 03:31:21.345
          • DIH_3: documents processed: 184521, time taken: 03:13:11.313
          • DIH_4: documents processed: 618491, time taken: 06:42:54.384

          Any idea?

          Show
          Dominik Siebel added a comment - James Dyer I noticed a massive decrease in indexing performance when trying a current checkout from branch_4x after this refactoring. Do you have any explanation for that? Looking through the code I could not find any changes that would explain this. Some numbers: 1.3M documents 4 DIHs before (with my patch from SOLR-2141 ): DIH_1: documents processed: 325130, time taken: 00:25:44.445 DIH_2: documents processed: 207347, time taken: 01:16:04.607 DIH_3: documents processed: 184601, time taken: 01:18:00.797 DIH_4: documents processed: 618580, time taken: 04:17:38.414 after : DIH_1: documents processed: 324996, time taken: 01:07:47.186 DIH_2: documents processed: 207347, time taken: 03:31:21.345 DIH_3: documents processed: 184521, time taken: 03:13:11.313 DIH_4: documents processed: 618491, time taken: 06:42:54.384 Any idea?
          Hide
          James Dyer added a comment -

          Dominik,

          Thank you for reporting this! Could you post your data-config.xml so I can get an idea of which features you're using? Are these full or delta imports? Are you using DateFormatEvaluator ( ${dataimporter.functions.formatDate(...) ) ?

          This issue involved extensive rework to the VariableResolver(Impl) class, and to the Evaluator framework, with the aim on making the code easier to understand and to maintain. Whenever this type of change is made, it is always possible that the new implementation will suffer performance-wise. I will look with you on this: We certainly do not want massive performance decreases to get released.

          One thing that jumps out at me is the old version stored a cache of parsed-out template strings (ex: $

          {foo.bar}

          ) in a HashMap. I was worried that this could potentially consume too much memory and changed this to a WeakHashMap. But if your JVM is clearing out these WeakReferences frequently it might require a lot more work to keep re-parsing these strings.

          Another problem could be with DateFormatEvaluator. It used to keep a single instance of SimpleDateFormat (per-thread) and always use that as the "from-date-format" but now that Locales are involved this doesn't work. To alleviate this it only creates one Dateformat per pattern/locale combination and caches these, also in a WeakHashMap. This also might suffer from WeakReferences going away, but more seriously, I think having this map as an instance variable here entirely defeats its purpose if the VariableResolver is creating a new instance each time it is being used. So I need to look into that.

          probably the biggest change is DateFormatEvaluator

          Show
          James Dyer added a comment - Dominik, Thank you for reporting this! Could you post your data-config.xml so I can get an idea of which features you're using? Are these full or delta imports? Are you using DateFormatEvaluator ( ${dataimporter.functions.formatDate(...) ) ? This issue involved extensive rework to the VariableResolver(Impl) class, and to the Evaluator framework, with the aim on making the code easier to understand and to maintain. Whenever this type of change is made, it is always possible that the new implementation will suffer performance-wise. I will look with you on this: We certainly do not want massive performance decreases to get released. One thing that jumps out at me is the old version stored a cache of parsed-out template strings (ex: $ {foo.bar} ) in a HashMap. I was worried that this could potentially consume too much memory and changed this to a WeakHashMap. But if your JVM is clearing out these WeakReferences frequently it might require a lot more work to keep re-parsing these strings. Another problem could be with DateFormatEvaluator. It used to keep a single instance of SimpleDateFormat (per-thread) and always use that as the "from-date-format" but now that Locales are involved this doesn't work. To alleviate this it only creates one Dateformat per pattern/locale combination and caches these, also in a WeakHashMap. This also might suffer from WeakReferences going away, but more seriously, I think having this map as an instance variable here entirely defeats its purpose if the VariableResolver is creating a new instance each time it is being used. So I need to look into that. probably the biggest change is DateFormatEvaluator
          Hide
          Dominik Siebel added a comment -

          Hi James,

          thanks for the quick reply.
          I will have to check if company policy allows to post our data-config xmls in a public issue tracker, but I can say that much:

          our data-configs mainly consist of nested entities using variables to access the parent scope and also using the $

          {dataimporter.function.escapeSql()}

          function to strip unwanted characters. Nothing more ( no DateFormatEvaluator, at least not explicitly ).
          The query results of the root entity are post-processed via ScriptTransformer.

          The JVM is

          • Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_37 20.12-b01)

          Configured with

          • -d64
          • -Xms8g
          • -Xmx8g
          • -XX:+CMSParallelRemarkEnabled
          • -XX:+UseConcMarkSweepGC
          • -XX:+UseParNewGC

          With cores

          • main (~17G)
          Show
          Dominik Siebel added a comment - Hi James, thanks for the quick reply. I will have to check if company policy allows to post our data-config xmls in a public issue tracker, but I can say that much: our data-configs mainly consist of nested entities using variables to access the parent scope and also using the $ {dataimporter.function.escapeSql()} function to strip unwanted characters. Nothing more ( no DateFormatEvaluator, at least not explicitly ). The query results of the root entity are post-processed via ScriptTransformer. The JVM is Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM (1.6.0_37 20.12-b01) Configured with -d64 -Xms8g -Xmx8g -XX:+CMSParallelRemarkEnabled -XX:+UseConcMarkSweepGC -XX:+UseParNewGC With cores main (~17G)
          Hide
          James Dyer added a comment -

          Dominik,

          I've been testing the new DIH code with my data and if anything the newer version seems slightly faster than the older one (I'm comparing the latest snapshot with one from a couple of months ago, but I only updated the DIH jar so as to isolate DIH changes). I then disabled the WeakHashMap in VariableResolver and tried again and it didn't seem to be much slower, if any (makes me wonder if caching here at all is misguided). Now I'm running it with a TemplateTransformer on a child entity that has multiple children per parent and it still doesn't seem to have slowed down. (The changes with this issue could have dramatically slowed TemplateTransformer if I made a mistake...) The data I'm indexing has about 50 child entities so the VariableResolver gets plenty of exercise matching keys with the parent. I also wonder that because you're seeing slowdowns of 3x of what you had before if perhaps something else isn't going on? I doubt DIH's overhead is nearly enough to cause something like that.

          Can you try and narrow the cause down? Here are the steps I would take:

          • revert back to the old Solr & DIH and re-index. Verify you get the old "good" performance back.
          • Just upgrade DIH and not the rest of Solr. Verify the performance is "bad" again, and that the cause is something in DIH.
          • If in DIH, try eliminating 1 feature at a time:
          • Try it without use of TemplateTransformer.
          • Try it without evaluators.
          • Try eliminating child entities to see if one particular child is causing the difficulties

          If this is indeed caused by DIH changes, it is something that you use that I am not testing (properly or at all) on my end.

          Show
          James Dyer added a comment - Dominik, I've been testing the new DIH code with my data and if anything the newer version seems slightly faster than the older one (I'm comparing the latest snapshot with one from a couple of months ago, but I only updated the DIH jar so as to isolate DIH changes). I then disabled the WeakHashMap in VariableResolver and tried again and it didn't seem to be much slower, if any (makes me wonder if caching here at all is misguided). Now I'm running it with a TemplateTransformer on a child entity that has multiple children per parent and it still doesn't seem to have slowed down. (The changes with this issue could have dramatically slowed TemplateTransformer if I made a mistake...) The data I'm indexing has about 50 child entities so the VariableResolver gets plenty of exercise matching keys with the parent. I also wonder that because you're seeing slowdowns of 3x of what you had before if perhaps something else isn't going on? I doubt DIH's overhead is nearly enough to cause something like that. Can you try and narrow the cause down? Here are the steps I would take: revert back to the old Solr & DIH and re-index. Verify you get the old "good" performance back. Just upgrade DIH and not the rest of Solr. Verify the performance is "bad" again, and that the cause is something in DIH. If in DIH, try eliminating 1 feature at a time: Try it without use of TemplateTransformer. Try it without evaluators. Try eliminating child entities to see if one particular child is causing the difficulties If this is indeed caused by DIH changes, it is something that you use that I am not testing (properly or at all) on my end.
          Hide
          Dominik Siebel added a comment -

          Hi James Dyer,

          sorry I haven't looked into this for quite some time now.
          We already decided to postponed the Solr4 update which is why I had to work on other tasks for the last months. I will definitely look into this again when the time has come.

          Show
          Dominik Siebel added a comment - Hi James Dyer , sorry I haven't looked into this for quite some time now. We already decided to postponed the Solr4 update which is why I had to work on other tasks for the last months. I will definitely look into this again when the time has come.
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411371

          SOLR-4086: fix residual problems with DateFormatEvaluator

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411371 SOLR-4086 : fix residual problems with DateFormatEvaluator
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411348

          SOLR-4086: fix dateformat evaluator unit test

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411348 SOLR-4086 : fix dateformat evaluator unit test
          Hide
          Commit Tag Bot added a comment -
          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411305 SOLR-4086 : remove tabs
          Hide
          Commit Tag Bot added a comment -

          [branch_4x commit] James Dyer
          http://svn.apache.org/viewvc?view=revision&revision=1411280

          SOLR-4086: Refactor DIH - VariableResolver & Evaluator

          Show
          Commit Tag Bot added a comment - [branch_4x commit] James Dyer http://svn.apache.org/viewvc?view=revision&revision=1411280 SOLR-4086 : Refactor DIH - VariableResolver & Evaluator
          Hide
          Dominik Siebel added a comment -

          James Dyer just FYI.
          I jut had time to check integration of 4.3 (build from github mirror branch) into our environment and DIH runtimes seem to be back to normal. Good Work!

          Show
          Dominik Siebel added a comment - James Dyer just FYI. I jut had time to check integration of 4.3 (build from github mirror branch) into our environment and DIH runtimes seem to be back to normal. Good Work!

            People

            • Assignee:
              James Dyer
              Reporter:
              James Dyer
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development