SOLR-2646

Integrate Solr benchmarking support into the Benchmark module

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      As part of my Buzzwords Solr perf talk, I did some work to allow some Solr benchmarking with the benchmark module.

      I'll soon attach a patch with the current work I've done - there is still a fair amount to clean up and fix (a couple of hacks or three), but it's already fairly useful.

      1. chart.jpg
        31 kB
        Mark Miller
      2. Dev-SolrBenchmarkModule.pdf
        61 kB
        Mark Miller
      3. SOLR-2646.patch
        90 kB
        Mark Miller
      4. SOLR-2646.patch
        89 kB
        Mark Miller
      5. SOLR-2646.patch
        151 kB
        Mark Bennett
      6. SOLR-2646.patch
        88 kB
        Mark Miller
      7. SOLR-2646.patch
        109 kB
        Mark Miller
      8. SOLR-2646.patch
        90 kB
        Mark Miller
      9. SOLR-2646.patch
        90 kB
        Mark Miller
      10. SolrIndexingPerfHistory.pdf
        66 kB
        Mark Miller

        Activity

        Mark Miller added a comment -

        Still some to do here, but here is what I have at the moment. Larger issues that are left are:

        • cleanly integrate into the build (hack integration now)
        • improve error handling and reporting so that it's easier to create working algorithms.
        Mark Miller added a comment -

        Attached is a brief rough guide to getting started writing or running an algorithm. Thanks to Martijn Koster for contributing improvements and additional info for it.

        Mark Miller added a comment -

        Also, as a reminder to myself - the SolrSearchTask is a bit of a hack right now - Query#toString police alert.

        Mark Miller added a comment -

        Some of the available settings (top of the alg file) that can be varied per round:

        solr.server=(fully qualified classname)
        solr.streaming.server.queue.size=(int)
        solr.streaming.server.threadcount=(int)
        
        solr.internal.server.xmx=(eg 1000M)
        
        solr.configs.home=(path to config files to use)
        solr.schema=(schema.xml filename in solr.configs.home)
        solr.config=(solrconfig.xml filename in solr.configs.home)
        
        solr.field.mappings=(map benchmark field names to Solr schema names eg doctitle>title,docid>id,docdate>date)
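
        For illustration, the top of an alg file using these settings might look like the following sketch (the server class is the SolrJ streaming client of that era; all values here are hypothetical examples, not recommendations):

        solr.server=org.apache.solr.client.solrj.impl.StreamingUpdateSolrServer
        solr.streaming.server.queue.size=100
        solr.streaming.server.threadcount=4
        solr.internal.server.xmx=1000M
        solr.configs.home=/path/to/solr/configs
        solr.schema=schema.xml
        solr.config=solrconfig.xml
        solr.field.mappings=doctitle>title,docid>id,docdate>date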
        
        Michael McCandless added a comment -

        This is awesome, Mark! We badly need to be able to easily benchmark Solr.

        Mark Miller added a comment -

        New patch:

        • A variety of little improvements in error handling and messages. Slightly better handling of starting/stopping Solr internally (a lot I'd like to improve still, though).

        • Adds the log param to StartSolrServer so that you can use StartSolrServer(log) to pump the Solr logs to the console. Very useful when developing an algorithm, to be sure it's doing what you think it is.

        • Now actually points to the correct configs folder in the internal example algs, and doesn't silently use the example config (or the last one used) when it cannot find the specified config file.
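
        In an alg file, enabling that log pass-through is just (a sketch; only the log param is taken from the description above):

        StartSolrServer(log)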

        Mark Miller added a comment -

        Updated to trunk.

        Robert Muir added a comment -

        What else do you need to get this in... cleaner integration into the build?

        Mark Miller added a comment -

        Yeah - I guess that is my biggest problem - for example, I hack into the benchmark module's build to find the Solr jars - which is why you have to run ant dist first (and it uses the Solr example, so you have to run ant example too).

        +    <!-- used to run solr benchmarks -->
        +    <pathelement path="../../solr/dist/apache-solr-solrj-4.0-SNAPSHOT.jar" />
        +    <fileset dir="../../solr/dist/solrj-lib">
        +        <include name="**/*.jar" />
        +    </fileset>
        

        It is even hardcoded for 4.0-SNAPSHOT at the moment - that can be wild-carded, but it's still a little nasty.
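
        For what it's worth, one way to wildcard the version would be a fileset include in place of the fixed pathelement - an untested sketch:

        +    <!-- used to run solr benchmarks -->
        +    <fileset dir="../../solr/dist">
        +        <include name="apache-solr-solrj-*.jar" />
        +    </fileset>
        +    <fileset dir="../../solr/dist/solrj-lib">
        +        <include name="**/*.jar" />
        +    </fileset>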

        There are certainly plenty of other rough edges, but that is the largest hack issue probably.

        Mark Miller added a comment -

        A patch taking things to trunk.

        Mark Miller added a comment -

        I took a little time and tested Solr indexing performance on trunk over about the past year and a half. I also added some numbers from 3.6 for comparison.

        This benchmark tests both a single indexing thread, as well as 4 threads with the concurrent solr server.

        I test indexing 10,000 Wikipedia docs and do 4 runs (serial, concurrent, serial, concurrent). I toss the first 2 runs and record the second 2. I do this once at the end of each month.

        David Smiley added a comment -

        According to the note at the bottom of SolrIndexingPerfHistory.pdf, it appears trunk is slower than 3.6 – how could that be?

        Mark Miller added a comment -

        Could not tell you. Could be a large variety of things.

        This is a test using the current example configs shipped at each date - which means it's not always apples to apples if the default config changes. Analysis could have changed for our default English text. New defaults for features or ease of use may have been enabled.

        For example, I believe the update log is on by default now for durability and realtime GET, etc.

        Also, some code paths have changed to support various new features.

        Also, Lucene is changing underneath us, so we should probably compare to some similar benchmark there (I know Mike publishes quite a few that could be looked at).

        It's not so easy to dig in after the fact with month resolution.

        At some point, it would be nice to have this automated and published as Lucene is - then we could run it nightly.

        There is some work to do to get there though (I don't know that I'll have time for it in the near future), and we would need a good, consistent machine to run it on (I could probably run it at night or something).

        I have not attempted to track anything down other than the broad numbers right now.

        This is simply to start a record that can help as we move forward in evaluating how changes impact performance.

        Obviously the single-threaded path has not been affected - so whatever has changed, it's likely mostly around concurrency.

        Hoss Man added a comment -

        Bulk-fixing the version info for 4.0-ALPHA and 4.0; all affected issues have "hoss20120711-bulk-40-change" in a comment.

        Mark Miller added a comment -

        I've got a fair amount of this automated now. It's still somewhat hacky though.

        Because you need to apply the benchmark patch to get things working, I count on that checkout existing and being patched in a specific location. It drives the benchmark, but talks to a running Solr that is started from a checkout. I use git so that it's really cheap to flip through revs and run benchmarks.

        The main driver is an ugly .sh script - it accepts a few params (name of the chart, where to write result files, location of the alg file, date range of checkouts to run the alg against, and the interval to try between days).

        For instance, you might say: run the indexing benchmark over the period 2012-01-04 to 2012-07-15, once every 5 days.
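
        A hypothetical invocation of such a driver (the script name and flag names are invented for illustration; the script itself was never posted):

        ./solr-bench.sh --chart indexing --results ./results --alg conf/solr/indexing.alg --from 2012-01-04 --to 2012-07-15 --interval 5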

        This happens, and the output of the benchmarks is dumped into a folder.

        Then I have a simple Java command-line app that processes the result folder. It takes a chart name, the location of the results folder, and a list of named regexes - each regex pointing to the pertinent data to pull from the results files. The app pulls out all the data, writes a CSV file, and outputs a simple line chart.
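
        A minimal sketch of that kind of extraction (the class name and the "rec/s" report-line format are assumptions for illustration, not the actual app):

        import java.io.IOException;
        import java.nio.file.DirectoryStream;
        import java.nio.file.Files;
        import java.nio.file.Path;
        import java.nio.file.Paths;
        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Hypothetical extractor: scans a folder of benchmark reports and emits CSV rows.
        public class ExtractResults {
            public static void main(String[] args) throws IOException {
                // Assumed report format: a line containing e.g. "rec/s 1234.5".
                Pattern recsPerSec = Pattern.compile("rec/s\\s+(\\d+(?:\\.\\d+)?)");
                try (DirectoryStream<Path> reports =
                         Files.newDirectoryStream(Paths.get(args[0]), "*.txt")) {
                    for (Path report : reports) {
                        Matcher m = recsPerSec.matcher(Files.readString(report));
                        if (m.find()) {
                            // One CSV row per result file: filename,value
                            System.out.println(report.getFileName() + "," + m.group(1));
                        }
                    }
                }
            }
        }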

        I don't know how cleaned up this will get, so I won't post any of it for now - but I may get to the point of running this locally, automatically, and pushing the charts etc. to a webserver, à la Lucene.

        Mark Miller added a comment -

        Attached an example generated chart. Would probably end up embedding that in HTML. The Lucene stuff uses a JavaScript charting lib, but I don't really want to deal with JavaScript - I would rather stick to Java when I can.

        Erick Erickson added a comment -

        Way cool!

        Is there any chance that we could report MB/sec in addition to (or instead of) docs/sec? I suspect that's a more meaningful number for comparisons. Or perhaps just count the bytes sent to Solr and post that as a footnote? Yeah, yeah, yeah, the analysis chain will change things... but a "doc" is an even more variable thing...

        Actually, I guess that this number could be counted once since the data set doesn't change that rapidly.

        FWIW

        Mark Miller added a comment -

        It's a constant data set that the test runs on - simply a static dump of Wikipedia articles (a one-doc-per-line file).

        Every checkout the benchmark runs against uses exactly the same Wikipedia docs.
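
        In benchmark-module terms, that is typically a LineDocSource pinned to a fixed file - a sketch using the module's standard properties (the file path is an example):

        content.source=org.apache.lucene.benchmark.byTask.feeds.LineDocSource
        docs.file=/data/enwiki-one-doc-per-line.txt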

        You can currently compare with Lucene using change over time to some degree, since they both indicate indexing speed.

        I'm sure that we can figure MB/s the same way the Lucene stuff does - but it might be a hack unless you can do it purely in the benchmark package. My current system just extracts info from benchmark result files - so it can extract the result of any benchmark you can make; if that's an MB/s result, that's no problem. I think, though, that the Lucene Python-driven stuff might even do some external processing on its own? I don't know for sure.

        Lance Norskog added a comment -

        Are there strategies to keep the disk cache consistent across runs? Linux has a feature to clear it (poke a 0 somewhere in /proc).

        Robert Muir added a comment -

        The python script does this on linux:

        echo 3 > /proc/sys/vm/drop_caches
        

        and this on windows:

        for /R %I in (*) do fsutil file setvaliddata %I %~zI
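
        Note that the Linux variant must run as root, and is usually preceded by a sync so dirty pages are flushed before the cache is dropped:

        sync; echo 3 > /proc/sys/vm/drop_caches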
        
        Mark Miller added a comment -

        Are there strategies to keep the disk cache consistent across runs?

        I have a warm phase that basically runs a slightly shorter version of the bench to try and be fair here. I was tossing the first round (there are 2) and the warm phase so that things were on a more even playing field.

        Mark Miller added a comment -

        The python script does this on linux:

        Great! I'll add this to my sh script.

        Robert Muir added a comment -

        rmuir20120906-bulk-40-change

        Robert Muir added a comment -

        moving all 4.0 issues not touched in a month to 4.1

        Mark Bennett added a comment -

        Draft update to work with Solr 4.1. Will make comments about some of the issues.

        Mark Bennett added a comment -

        Issues with latest patch:

        • Haven't checked all tests
        • The schema.xml file that gets put into the standard example location is quite a bit different from the normal schema.xml. I think the test should expect a separate directory structure.
        • The default behavior of completely consuming and not showing the Solr output makes it VERY hard to debug. Conversely, producing all that output might impact the test slightly. Ideally there'd be a switch to turn it on and off, and some way to integrate it into the main logging.
        • This doesn't work with SolrCloud yet.
        Mark Miller added a comment -

        The schema.xml file that gets put into the standard example location is quite a bit different from the normal schema.xml. I think the test should expect a separate directory structure.

        +1

        and not showing the Solr output ... Ideally there'd be a switch to turn it on and off,

        I thought there was an option for this for debugging - the StartSolrServer call takes an arg that turns on output, if I remember right (as part of the alg file).

        Mark Bennett added a comment -

        Thanks Mark, I'll check the alg.

        Any thoughts on getting this beast working with SolrCloud?

        Mark Miller added a comment -

        It depends - for real testing against a real cluster, it's probably just best to use the remote URL feature, I think. We might just want to build in some round-robin action or something. For the internal option, we could run something like the solrcloud.sh script in the cloud-dev scripts to start up the VMs, just like the single-node internal mode starts the example.

        Mark Miller added a comment -

        Of course, for the simple case, just using the CloudSolrServer gets us the load balancing.

        Mark Bennett added a comment -

        Arguments I used for the specific test I got running. If you're running other tests, your mileage may vary.

        Class: org.apache.lucene.benchmark.byTask.Benchmark
        Argument: /Users/username/solr-lucene-410-bench/lucene/benchmark/conf/solr/internal/streaming-vs-httpcommon.alg
        JVM: -Xmx512M
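
        Put together as a command line, that corresponds to something like the following (the classpath is elided; the alg path is taken from the example above):

        java -Xmx512M -cp <benchmark classpath> org.apache.lucene.benchmark.byTask.Benchmark /Users/username/solr-lucene-410-bench/lucene/benchmark/conf/solr/internal/streaming-vs-httpcommon.alg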

        Mark Bennett added a comment -

        Two other notes:

        • Hard-coded file path in the alg file
          solr-lucene-410-bench/lucene/benchmark/conf/solr/internal/streaming-vs-httpcommon.alg
        • I only pulled in the first 100 docs
          (by using the PDF instructions you can make the records line-oriented, then use head -101 to get the first hundred records plus the header line, as in the sketch below)
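
        For example (file names assumed for illustration):

        head -101 enwiki-lines.txt > enwiki-100.txt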
        Mark Miller added a comment -

        I've attached a patch that is updated to trunk.

        Mark Miller added a comment -

        Here is a patch that brings this up to date with trunk and hacks in some support for CloudSolrServer.

        Because colons are special and were not working for the solr.zkhost property, I hacked it so that | is replaced with a colon, but we should probably add some escaping or something.
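
        So in the alg file the property would be written with | standing in for the colon - for example (host and port invented for illustration):

        solr.zkhost=127.0.0.1|9983

        which the patch rewrites to 127.0.0.1:9983 before handing it to CloudSolrServer.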


          People

          • Assignee: Unassigned
          • Reporter: Mark Miller
          • Votes: 4
          • Watchers: 8
