SOLR-2646

Integrate Solr benchmarking support into the Benchmark module

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      As part of my Buzzwords Solr perf talk, I did some work to allow some Solr benchmarking with the benchmark module.

      I'll attach a patch with my current work soon - there is still a fair amount to clean up and fix (a couple of hacks, or three), but it's already fairly useful.

      1. chart.jpg
        31 kB
        Mark Miller
      2. Dev-SolrBenchmarkModule.pdf
        61 kB
        Mark Miller
      3. SOLR-2646.patch
        62 kB
        Mark Miller
      4. SOLR-2646.patch
        43 kB
        Mark Miller
      5. SOLR-2646.patch
        110 kB
        Mark Miller
      6. SOLR-2646.patch
        100 kB
        Shalin Shekhar Mangar
      7. SOLR-2646.patch
        101 kB
        Shalin Shekhar Mangar
      8. SOLR-2646.patch
        90 kB
        Mark Miller
      9. SOLR-2646.patch
        89 kB
        Mark Miller
      10. SOLR-2646.patch
        151 kB
        Mark Bennett
      11. SOLR-2646.patch
        88 kB
        Mark Miller
      12. SOLR-2646.patch
        109 kB
        Mark Miller
      13. SOLR-2646.patch
        90 kB
        Mark Miller
      14. SOLR-2646.patch
        90 kB
        Mark Miller
      15. SolrIndexingPerfHistory.pdf
        66 kB
        Mark Miller

        Issue Links

          Activity

          mkhludnev Mikhail Khludnev added a comment -

          Mark Miller, how is it going?

          markrmiller@gmail.com Mark Miller added a comment -

          Thanks a lot Mikhail Khludnev! I'll get this in soon then.

          mkhludnev Mikhail Khludnev added a comment -

          SOLR-9867 is done.

          markrmiller@gmail.com Mark Miller added a comment -

          I'm mainly waiting on SOLR-9867 to commit this.

          markrmiller@gmail.com Mark Miller added a comment -

          Found the issue with precommit. I plan to commit sometime tomorrow.

          markrmiller@gmail.com Mark Miller added a comment -

          This is very close to committable. Precommit seems to be pushed over the edge on memory for source validation with the Groovy task, though.

          markrmiller@gmail.com Mark Miller added a comment -

          I can probably get to a Solr package in the next iteration. The Lucene package was just simpler out of the gate.

          mkhludnev Mikhail Khludnev added a comment -

          +1, aside from:

          • the same package org.apache.lucene.benchmark.byTask.tasks is put into a new module. That's usually fine, but I remember some discussions about issues with either javadoc or Java 9 modules (my memories are vague here);
          • it won't work under Windows, but either no one needs it and/or it's a separate ticket.
          markrmiller@gmail.com Mark Miller added a comment -

          So, toward the goal of committing soon, here is a new patch that integrates this as a new Solr module and doesn't touch the Lucene benchmark module at all.

          Still some work to do I'm sure, but the basics are working.

          markrmiller@gmail.com Mark Miller added a comment -

          Patch against trunk with the latest changes I could find locally.

          It's still a little rough around the edges, but I'd like to clean it up a bit more and commit it. It's harmless, simple, and valuable, and keeping it current will let us benchmark changes over time much more easily and accurately.

          markrmiller@gmail.com Mark Miller added a comment -

          I have a bunch of other improvements, but they're in a pretty old checkout (from before 4x became 5x). I'll try to merge them up to this sometime soon.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Bringing this patch in sync with trunk again.

          1. The start and stop Solr tasks use the bin/solr scripts (a rough sketch of this approach follows below)
          2. There are still plenty of references to SolrServer instead of SolrClient, which need to be cleaned up
          3. Query.toString needs to be removed
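
          For illustration only, a minimal sketch of how a start/stop task could shell out to the bin/solr scripts; the class and method names here are hypothetical and the actual tasks in the patch may differ.

          import java.io.File;
          import java.io.IOException;
          import java.util.ArrayList;
          import java.util.List;

          // Hypothetical helper showing one way benchmark tasks could start and
          // stop Solr through the bin/solr scripts mentioned above.
          public class SolrScriptRunner {

            private final File solrInstallDir; // directory containing bin/solr

            public SolrScriptRunner(File solrInstallDir) {
              this.solrInstallDir = solrInstallDir;
            }

            public void start(int port) throws IOException, InterruptedException {
              run("start", "-p", String.valueOf(port));
            }

            public void stop(int port) throws IOException, InterruptedException {
              run("stop", "-p", String.valueOf(port));
            }

            private void run(String... args) throws IOException, InterruptedException {
              List<String> cmd = new ArrayList<>();
              cmd.add(new File(solrInstallDir, "bin/solr").getAbsolutePath());
              for (String arg : args) {
                cmd.add(arg);
              }
              // Surface the script output so startup failures are visible.
              Process p = new ProcessBuilder(cmd).inheritIO().start();
              if (p.waitFor() != 0) {
                throw new IOException("bin/solr exited with a non-zero status");
              }
            }
          }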
          elyograg Shawn Heisey added a comment -

          I know very little about logging, but shouldn't jcl-over-slf4j-*.jar be used here? I think its job is to intercept commons-logging calls and then slf4j will push them through wherever it's configured to go (log4j by default I think).

          I don't know if the commons-logging classes are actually used (indirectly) by SolrJ ... but if they are, this is a possibly sticky problem because SolrJ itself uses SLF4J. Whether to use the commons-logging or jcl-over-slf4j jar depends on which slf4j binding the final application will use. If the user binds slf4j to jcl with slf4j-jcl, then they need the actual commons-logging jar. If the user chooses any other binding, jcl-over-slf4j is required. We can't make that choice for the user, which I think means that the SolrJ documentation must explain these details.

          shalinmangar Shalin Shekhar Mangar added a comment -

          I've got some more local work on this too - probably a few little things, but most useful is proper support for batch indexing.

          That'd be great. I might have some time to improve this and set up local jobs to run these regularly.

          I know very little about logging, but shouldn't jcl-over-slf4j-*.jar be used here? I think its job is to intercept commons-logging calls and then slf4j will push them through wherever it's configured to go (log4j by default I think)

          Hmm, you might be right. Perhaps the logging is not set up correctly. I'll check again and report back.

          steve_rowe Steve Rowe added a comment -

          I added the classpath of lucene's replicator because it has commons-logging which is required by HttpComponents. I expected that commons-logging (being a transitive dependency) should have been inside the solrj-lib directory but it isn't.

          Isn't dist/solrj-lib supposed to have all dependencies of solrj (including transitive ones)?

          Only transitive dependencies that Solrj will use should be included.

          I know very little about logging, but shouldn't jcl-over-slf4j-*.jar be used here? I think its job is to intercept commons-logging calls and then slf4j will push them through wherever it's configured to go (log4j by default I think).

          markrmiller@gmail.com Mark Miller added a comment -

          I've got some more local work on this too - probably a few little things, but most useful is proper support for batch indexing.

          I'll try and merge the two together if I get a chance.

          shalinmangar Shalin Shekhar Mangar added a comment -

          Here's a patch which brings this in sync with trunk.

          I added the classpath of lucene's replicator because it has commons-logging which is required by HttpComponents. I expected that commons-logging (being a transitive dependency) should have been inside the solrj-lib directory but it isn't.

          Isn't dist/solrj-lib supposed to have all dependencies of solrj (including transitive ones)?

          markrmiller@gmail.com Mark Miller added a comment -

          Here is a patch that brings this up to date with trunk and hacks in some support for CloudSolrServer.

          Because colons are special and don't work in the solr.zkhost property, I hacked it so that | is replaced with a colon, but we should probably add some escaping or something.
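
          For illustration, a minimal sketch of the kind of substitution described above; the solr.zkhost property name comes from this comment, while the class and method names are hypothetical.

          // Hypothetical sketch: alg property values can't contain ':' because the
          // benchmark parser treats colons specially, so '|' is used as a stand-in
          // and swapped back before the value is handed to CloudSolrServer.
          public final class ZkHostProperty {

            private ZkHostProperty() {}

            /** Turns e.g. "host1|2181,host2|2181/solr" into "host1:2181,host2:2181/solr". */
            public static String resolve(String rawZkHost) {
              return rawZkHost.replace('|', ':');
            }
          }

          A real escaping scheme, as suggested above, would be nicer than this blanket replacement.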

          markrmiller@gmail.com Mark Miller added a comment -

          I've attached a patch that is updated to trunk.

          mbennett Mark Bennett added a comment -

          Two other notes:

          • Hard-coded file path in the alg file
            solr-lucene-410-bench/lucene/benchmark/conf/solr/internal/streaming-vs-httpcommon.alg
          • I only pulled in the first 100 docs
            (by using the PDF instructions you can make the records line oriented, and use head -101 to get the first hundred records plus the header line)
          mbennett Mark Bennett added a comment -

          Arguments I used / specific test I got running. If you're running other tests, your mileage may vary.

          Class: org.apache.lucene.benchmark.byTask.Benchmark
          Argument: /Users/username/solr-lucene-410-bench/lucene/benchmark/conf/solr/internal/streaming-vs-httpcommon.alg
          JVM: -Xmx512M

          markrmiller@gmail.com Mark Miller added a comment -

          Of course, for the simple case, just using the Cloud SolrJ server (CloudSolrServer) gets us the load balancing.

          markrmiller@gmail.com Mark Miller added a comment -

          It depends - for real testing against a real cluster, it's probably best to just use the remote URL feature, I think. We might just want to build in some round-robin action or something. For the internal option, we could run something like the solrcloud.sh script in the cloud-dev scripts to start up the VMs, just like the single-node internal option starts the example.

          mbennett Mark Bennett added a comment -

          Thanks Mark, I'll check the alg.

          Any thoughts on getting this beast working with SolrCloud?

          markrmiller@gmail.com Mark Miller added a comment -

          The schema.xml file that gets put into the standard example location is quite a bit different from the normal schema.xml. I think the test should expect a separate directory structure.

          +1

          and not showing the Solr output ... Ideally there'd be a switch to turn it on and off,

          I thought there was an option for this for debugging - the start Solr server call takes some arg that turns on output, if I remember right (as part of the alg file).

          mbennett Mark Bennett added a comment -

          Issues with latest patch:

          • Haven't checked all tests
          • The schema.xml file that gets put into the standard example location is quite a bit different from the normal schema.xml. I think the test should expect a separate directory structure.
          • The default behavior of completely consuming and not showing the Solr output makes it VERY hard to debug. Conversely, producing all that output might impact the test slightly. Ideally there'd be a switch to turn it on and off, and some way to integrate it into the main logging.
          • This doesn't work with SolrCloud yet.
          mbennett Mark Bennett added a comment -

          Draft update to work with Solr 4.1. Will make comments about some of the issues.

          rcmuir Robert Muir added a comment -

          moving all 4.0 issues not touched in a month to 4.1

          rcmuir Robert Muir added a comment -

          rmuir20120906-bulk-40-change

          markrmiller@gmail.com Mark Miller added a comment -

          The python script does this on linux:

          Great! I'll add this to my sh script.

          markrmiller@gmail.com Mark Miller added a comment -

          Are there strategies to keep the disk cache consistent across runs?

          I have a warm phase that basically runs a slightly shorter version of the bench to try to be fair here. I was tossing the first round (there are 2) and the warm phase so that things were on a more even playing field.

          rcmuir Robert Muir added a comment -

          The python script does this on linux:

          echo 3 > /proc/sys/vm/drop_caches
          

          and this on Windows:

          for /R %I in (*) do fsutil file setvaliddata %I %~zI
          
          lancenorskog Lance Norskog added a comment -

          Are there strategies to keep the disk cache consistent across runs? Linux has a feature to clear it (poke a 0 somewhere in /proc).

          markrmiller@gmail.com Mark Miller added a comment -

          It's a constant data set that the test runs on - simply a static dump of Wikipedia articles (a one-doc-per-line file).

          Every checkout the benchmark runs against uses exactly the same Wikipedia docs.

          You can currently compare with Lucene using change over time to some degree, since they both indicate indexing speed.

          I'm sure that we can figure MB/s the same way the Lucene stuff does - but it might be a hack unless you can do it purely in the benchmark package. My current system just extracts info from benchmark result files - so it can extract the result of any benchmark you can make - if that's an MB/s result, that's no problem. I think, though, that the Lucene, Python-driven stuff might even do some external processing on its own? I don't know for sure.

          erickerickson Erick Erickson added a comment -

          Way cool!

          Is there any chance that we could report MB/sec in addition to/instead of docs/sec? I suspect that's a more meaningful number for comparisons. Or perhaps just count the bytes sent to Solr and post that as a footnote? Yeah, yeah, yeah, the analysis chain will change things... but a "doc" is an even more variable thing...

          Actually, I guess that this number could be counted once since the data set doesn't change that rapidly.

          FWIW

          markrmiller@gmail.com Mark Miller added a comment -

          Attached an example generated chart. Would probably end up embedding that in HTML. The Lucene stuff uses a JavaScript charting lib, but I don't really want to deal with JavaScript - I would rather stick to Java when I can.

          markrmiller@gmail.com Mark Miller added a comment -

          I've got a fair amount of this automated now. It's still somewhat hacky, though.

          Because you need to apply the benchmark patch to get things working, I count on that checkout existing and being patched in a specific location. It drives the benchmark, but talks to a running Solr that is started from a checkout. I use git so that it's really cheap to flip through revs and run benchmarks.

          The main driver is an ugly .sh script - it accepts a few params (name of the chart, where to write result files, location of alg file, date range of checkouts to run the alg against, and the interval to try between days).

          For instance, you might say, run the indexing benchmark over the period of 2012-01-04 to 2012-07-15 and do it once for every 5 days.

          This happens and the output of the benchmarks is dumped into a folder.

          Then I have a simple java cmd line app that will process the result folder. It takes a chart name, the location of results folder, and a list of named regexes - each regex pointing to the pertinent data to pull from the results file. The java app pulls out all the data, writes a csv file, and outputs a simple line chart.

          I don't know how cleaned up this will get, so I won't post any of it for now - but I may get to the point of running some stuff locally automatically and pushing to a webserver with the charts etc., à la Lucene.
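
          For illustration, a minimal sketch of the extraction step described above - pulling a value out of each result file with a regex and writing a CSV row. The file layout, result format, and regex here are assumptions, not the actual tool.

          import java.io.File;
          import java.io.IOException;
          import java.io.PrintWriter;
          import java.nio.charset.StandardCharsets;
          import java.nio.file.Files;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          // Hypothetical sketch of the "named regex -> CSV" extraction step.
          public class ResultExtractor {

            public static void main(String[] args) throws IOException {
              File resultsDir = new File(args[0]); // folder of benchmark result files
              // Assumed result-file format; the real reports may differ.
              Pattern docsPerSec = Pattern.compile("rec/s\\s+(\\d+(?:\\.\\d+)?)");

              try (PrintWriter csv = new PrintWriter(new File(resultsDir, "results.csv"), "UTF-8")) {
                csv.println("file,docsPerSec");
                for (File f : resultsDir.listFiles((dir, name) -> name.endsWith(".txt"))) {
                  String content = new String(Files.readAllBytes(f.toPath()), StandardCharsets.UTF_8);
                  Matcher m = docsPerSec.matcher(content);
                  if (m.find()) {
                    csv.println(f.getName() + "," + m.group(1));
                  }
                }
              }
            }
          }

          A charting library (or even a spreadsheet) can then turn the CSV into the simple line chart mentioned above.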

          hossman Hoss Man added a comment -

          Bulk fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment

          markrmiller@gmail.com Mark Miller added a comment -

          Could not tell you. Could be a large variety of things.

          This is a test using the current example configs shipped at each date - which means it's not always apples to apples if the default config changes. Analysis could have changed for our default English text. New defaults for features or ease of use may have been enabled.

          For example, I believe the update log is on by default now for durability and realtime GET, etc.

          Also, some code paths have changed to support various new features.

          Also, Lucene is changing underneath us, so we should probably compare to some similar benchmark there (I know Mike publishes quite a few that could be looked at).

          It's not so easy to dig in after the fact with month resolution.

          At some point, it would be nice to have this automated and published as Lucene is - then we could run it nightly.

          There is some work to do to get there though (I don't know that I'll have time for it in the near future), and we would need a good consistent machine to run it on (I could probably run it at night or something).

          I have not attempted to track anything down other than the broad numbers right now.

          This is simply to start a record that can help as we move forward in evaluating how changes impact performance.

          Obviously the single threaded path has not been affected - so whatever has changed, it's likely mostly around concurrency.

          dsmiley David Smiley added a comment -

          According to the note at the bottom of SolrIndexingPerfHistory.pdf, it appears trunk is slower than 3.6 – how could that be?

          markrmiller@gmail.com Mark Miller added a comment -

          I took a little time and tested Solr indexing performance on trunk over about the past year and a half. I also added some numbers from 3.6 for comparison.

          This benchmark tests both a single indexing thread and 4 threads with the concurrent Solr server.

          I test indexing 10000 Wikipedia docs and do 4 runs (serial, concurrent, serial, concurrent). I toss the first 2 runs and record the second 2 runs. I do this once at the end of each month.

          markrmiller@gmail.com Mark Miller added a comment -

          A patch taking things to trunk.

          markrmiller@gmail.com Mark Miller added a comment -

          Yeah - I guess that is my biggest problem - for example, I hack into the benchmark module to find the Solr jars - which is why you have to run ant dist first (and it uses the Solr example, so you have to build the example too).

          +    	<!-- used to run solr benchmarks -->
          +		<pathelement path="../../solr/dist/apache-solr-solrj-4.0-SNAPSHOT.jar" />
          +		<fileset dir="../../solr/dist/solrj-lib">
          +			<include name="**/*.jar" />
          +		</fileset>    	
          

          It is even hardcoded for 4.0-SNAPSHOT at the moment - that can be wild-carded, but it's still a little nasty.

          There are certainly plenty of other rough edges, but that is the largest hack issue probably.

          rcmuir Robert Muir added a comment -

          What else do you need to get this in... cleaner integration into the build?

          markrmiller@gmail.com Mark Miller added a comment -

          to trunk

          markrmiller@gmail.com Mark Miller added a comment -

          New patch -

          • A variety of little improvements in error handling and messages. Slightly better handling of starting/stopping Solr internally (a lot I'd like to improve still, though).

          • Also adds the log param to StartSolrServer so that you can use StartSolrServer(log) to pump the Solr logs to the console. Very useful when developing an algorithm and making sure it's doing what you think it is.

          • Also now actually points to the correct configs folder in the internal example algs, and doesn't silently use the example config (or the last one used) when it cannot find the specified config file.

          mikemccand Michael McCandless added a comment -

          This is awesome Mark! We badly need to be able to easily benchmark Solr.

          markrmiller@gmail.com Mark Miller added a comment -

          Some of the available settings (top of the alg file) that can be varied per round:

          solr.server=(fully qualified classname)
          solr.streaming.server.queue.size=(int)
          solr.streaming.server.threadcount=(int)
          
          solr.internal.server.xmx=(e.g. 1000M)
          
          solr.configs.home=(path to config files to use)
          solr.schema=(schema.xml filename in solr.configs.home)
          solr.config=(solrconfig.xml filename in solr.configs.home)
          
          solr.field.mappings=(map benchmark field names to Solr schema names, e.g. doctitle>title,docid>id,docdate>date)
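
          For illustration, a hypothetical sketch of how a solr.field.mappings value in the doctitle>title,docid>id,docdate>date form could be parsed into a map; the actual parsing code in the patch may differ.

          import java.util.LinkedHashMap;
          import java.util.Map;

          // Hypothetical sketch: parse "doctitle>title,docid>id,docdate>date" into a
          // benchmark-field -> Solr-schema-field map.
          public final class FieldMappings {

            private FieldMappings() {}

            public static Map<String, String> parse(String mappings) {
              Map<String, String> result = new LinkedHashMap<>();
              if (mappings == null || mappings.isEmpty()) {
                return result;
              }
              for (String pair : mappings.split(",")) {
                String[] parts = pair.split(">", 2);
                if (parts.length != 2) {
                  throw new IllegalArgumentException("Bad field mapping: " + pair);
                }
                result.put(parts[0].trim(), parts[1].trim());
              }
              return result;
            }
          }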
          
          markrmiller@gmail.com Mark Miller added a comment -

          Also, as a reminder to myself - the SolrSearchTask is a bit of a hack right now - Query#toString police alert.

          markrmiller@gmail.com Mark Miller added a comment -

          Attached is a brief rough guide to getting started writing or running an algorithm. Thanks to Martijn Koster for contributing improvements and additional info for it.

          markrmiller@gmail.com Mark Miller added a comment -

          Still some work to do here, but here is what I have at the moment. The larger issues that are left are:

          • cleanly integrate into the build (hack integration now)
          • improve error handling and reporting so that it's easier to create working algorithms.

            People

            • Assignee:
              markrmiller@gmail.com Mark Miller
              Reporter:
              markrmiller@gmail.com Mark Miller
            • Votes:
              6
              Watchers:
              13

              Dates

              • Created:
                Updated:

                Development