Andrzej Bialecki added a comment - 21/Sep/06 05:18 AM
This is just a starting point for discussion - it's a pretty old file I found lying around, so it may not even compile with modern Lucene. Requires commons-compress.
If you're looking for freely available text in bulk, what about:
Yes, that could be a good additional source. However, IMHO the primary corpus should be widely known and standardized, hence my proposal of the Reuters.
(I mistakenly copy&paste-d the urls in the comment above - of course the corpus they're pointing at is the "20 Newsgroups", not the Reuters one. The correct url for the Reuters corpus is http://www.daviddlewis.com/resources/testcollections/reuters21578/ )
From a strict performance point of view, a standard set is important, but don't forget other languages.
From a tokenization point of view (separate from this issue), perhaps the Gutenberg project would be useful to test correctness of the analysis phase.
It is also interesting to know how much time is consumed assembling an instance of Document from the storage. According to my own tests this is the major reason why InstantiatedIndex is so much faster than a FS/RAMDirectory. I also presume it to be the bottleneck of any RDBMS-, RMI- or any other "proxy"-based storage.
Since this has dependencies, do you think we should put it under contrib? I would be for a Performance directory and we could then organize it from there. Perhaps into packages for quantitative and qualitative performance.
The dependency on commons-compress could be avoided - I used this just to be able to unpack tar.gz files, we can use Ant for that. If you meant the dependency on the corpus - can't Ant download this too as a dependency?
Re: Project Gutenberg - good point, this is a good source for multi-lingual documents. The "Europarl" collection is another, although a bit more hefty, so that could be suitable for running large-scale benchmarks, and texts from Project Gutenberg for running small-scale tests.
Yeah, ANT can do this, I think. Take a look at the DB contrib package, it downloads. I think I can set up the necessary stuff in contrib, if people think that is a good idea. The first contribution will be this file and then we can go from there. I think Otis has run some perf. stuff too, but I am not sure if it can be contributed. I think someone else has really studied query perf. so it would be cool if that was added too.
I still haven't gotten my employer to sign and fax the CCLA, so I'm stuck and can't contribute my search benchmark.
I have a suggestion for a name for this - Lube, for Lucene Benchmark - contrib/lube.
I think this is an incredibly important initiative: with every non-trivial change to Lucene (e.g. lock-less commits) we must verify performance did not get worse. But, as things stand now, it's an ad-hoc thing that each developer needs to do. So (as a consumer of this), I would love to have a ready-to-use
In the mean time I've been using Europarl for my testing.
Also important to realize is there are many dimensions to test. With
In addition to standardizing on the corpus I think we ideally need
A few notes on benchmarks:
First, it is important to realize that no benchmark will ever fully capture all aspects of Lucene performance, particularly since real-world data distributions are so varied. That said, benchmarks are useful tools, especially if they are componentized to measure various aspects of Lucene performance (the narrower the goal of the benchmark, the better a benchmark can be created).
It is rather unrealistic to expect to standardize hardware / OS ... better to compare before/after numbers on a single configuration, rather than comparing the numbers among configurations.
The test process is important, but anything crucial should be built into the test (like the number of iterations; taking the average, etc).
Concerning the specifics of this: Requiring reboots is onerous and not an important criterion (at least for unix systems-
Of course, any scheme has its problems. In general, the most important thing when using benchmarks is being aware of the limitations of the benchmark and methodology used.
My comments are marked by GSI
-----------
In the mean time I've been using Europarl for my testing.
GSI: perhaps you can contribute once this is set up
Also important to realize is there are many dimensions to test. With
GSI: I am planning on taking Andrzej's contribution and refactoring it into components that can be reused, as well as creating a "standard" benchmark which will be easy to run through a simple ant task, i.e. ant run-baseline
GSI: From here, anybody can contribute their own benchmarks (I will provide interfaces to facilitate this), which others can choose to run.
In addition to standardizing on the corpus I think we ideally need
GSI: Not really feasible unless you are proposing to buy us machines
First - I think it's a good initiative. Grant, when you're thinking about the infrastructure, it would be pretty neat to have a way of logging performance so that one could draw charts from it. You know, for the visual folks
Anyway, my other idea is that benchmarking Lucene can be performed on two levels: one is the user level, where the entire operation counts (such as indexing, searching etc). The other is measurement of atomic parts within the big operation, so that you know how much of the whole thing each subpart takes. I once wrote a piece of code that allows measuring times for named operations (per thread) in a recursive way. It looks something like this:
    perfLogger.start("indexing");
    try {
        // ... code being measured; it may call start/stop again for nested operations ...
    } finally {
        perfLogger.stop();
    }
In the output you get something like this:
    indexing: 5 seconds;
Of course everything comes at a price and the above logging costs some CPU cycles (my implementation stored a nesting stack in ThreadLocals). One can always put that code in 'if' clauses attached to final variables and enable logging only for benchmarking targets (the compiler will then get rid of the logging statements). If folks are interested I can dig out that performance logger and maybe adapt it to what Grant comes up with.
I agree: a simple ant-accessible benchmark to enable "before and
after" runs is an awesome step forward. And a standardized HW/SW testing environment is not really realistic now.
> GSI: perhaps you can contribute once this is setup
I will try!
A few things that would be nice to have in this performance package/framework -
() indexing only overall time.
() parametric control over:
Additional points:
OK, I have a preliminary implementation based on adapting Andrzej's approach. The interesting thing about this approach is that it is easy to make it more or less exhaustive (i.e. you choose how many of the parameters the system alters as it runs). Thus, you can have it change the merge factors, max buffered docs, number of documents indexed, number of different queries run, etc. The tradeoff, of course, is the length of time it takes to run these.
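To make the tradeoff concrete, here is a minimal sketch of how the number of runs grows as more parameters are varied. It is not taken from the benchmark code: the class, the `Combo` holder, and the sample values are all hypothetical, and the parameter names only mirror the ones mentioned above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of the benchmark contribution. */
final class ParamSweepSketch {
    /** One combination of indexing parameters (field names are illustrative). */
    static final class Combo {
        final int mergeFactor;
        final int maxBufferedDocs;
        final int docCount;

        Combo(int mergeFactor, int maxBufferedDocs, int docCount) {
            this.mergeFactor = mergeFactor;
            this.maxBufferedDocs = maxBufferedDocs;
            this.docCount = docCount;
        }
    }

    /** Builds every combination of the given values (a full cross product). */
    static List<Combo> grid(int[] mergeFactors, int[] maxBufferedDocs, int[] docCounts) {
        List<Combo> combos = new ArrayList<>();
        for (int mf : mergeFactors) {
            for (int mbd : maxBufferedDocs) {
                for (int n : docCounts) {
                    combos.add(new Combo(mf, mbd, n));
                }
            }
        }
        return combos;
    }
}
```

With three merge factors, two buffer sizes, and two document counts this already yields twelve indexing runs, which is why an exhaustive configuration takes so much longer than a quick one.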
So my question to those interested is: what is a good baseline running time for testing in a standard way? My initial thought is to have something that takes between 15-30 minutes to run, but I am not sure on this. Another approach would be to have three "baselines":
1. quick validation (5 minutes to run...)
2. standard (15-45)
3. exhaustive (1-10 hours)
I know several others have built benchmarking suites for their internal use, what has been your strategy? Thoughts, ideas, insights?
Thanks,
The indexing benchmarking apps I wrote take command line arguments for how many docs and how many reps. My standard test is to do 1000 docs and 6 reps. Within a couple seconds the first rep is done and the app is printing out results. For rapid development, having something that speedy is really handy.
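A docs-and-reps harness of the kind described in the last comment might look roughly like the sketch below. This is an assumption-laden illustration, not the actual indexer benchmark code: `RepsHarnessSketch`, `Workload`, `timeReps`, and `mean` are hypothetical names, and the workload is whatever indexing routine one wants to time.

```java
/** Hypothetical illustration only; not the contributed indexer benchmark. */
final class RepsHarnessSketch {
    /** The operation being timed, e.g. "index this many documents". */
    interface Workload {
        void run(int docs) throws Exception;
    }

    /** Runs the workload 'reps' times and prints each result as soon as it is known. */
    static long[] timeReps(Workload workload, int docs, int reps) throws Exception {
        long[] millis = new long[reps];
        for (int r = 0; r < reps; r++) {
            long start = System.nanoTime();
            workload.run(docs);
            millis[r] = (System.nanoTime() - start) / 1_000_000;
            System.out.println("rep " + (r + 1) + ": " + millis[r] + " ms for " + docs + " docs");
        }
        return millis;
    }

    /** Arithmetic mean of the per-rep times; 0.0 for an empty array. */
    static double mean(long[] values) {
        if (values.length == 0) {
            return 0.0;
        }
        long sum = 0;
        for (long v : values) {
            sum += v;
        }
        return (double) sum / values.length;
    }
}
```

Printing after every rep, rather than only at the end, is what gives the quick feedback mentioned above: the first number appears as soon as the first rep finishes.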
As Marvin points out, quick micro-benchmarks are great to have. But other effects only show up when things get very large. So I think we need at least two baselines: micro and macro.
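For measuring sub-parts within a larger operation, the nested per-thread timer described earlier in this thread (perfLogger.start / perfLogger.stop, with a nesting stack kept in ThreadLocals) could be sketched as follows. This is a guess at the general shape, not the original code: the class name, the `Frame` holder, the `results()` accessor, and the output format are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical per-thread nested timer; not the commenter's actual implementation. */
final class PerfLoggerSketch {
    /** One open measurement: its name and when it started. */
    private static final class Frame {
        final String name;
        final long startNanos;

        Frame(String name, long startNanos) {
            this.name = name;
            this.startNanos = startNanos;
        }
    }

    // Each thread has its own stack of open operations, so no locking is needed.
    private static final ThreadLocal<Deque<Frame>> STACK =
            ThreadLocal.withInitial(ArrayDeque::new);

    // Each thread also collects its own finished measurements.
    private static final ThreadLocal<List<String>> RESULTS =
            ThreadLocal.withInitial(ArrayList::new);

    /** Opens a named measurement nested inside whatever is currently open. */
    static void start(String name) {
        STACK.get().push(new Frame(name, System.nanoTime()));
    }

    /** Closes the most recently opened measurement and records its duration. */
    static void stop() {
        Deque<Frame> stack = STACK.get();
        if (stack.isEmpty()) {
            throw new IllegalStateException("stop() called without a matching start()");
        }
        Frame frame = stack.pop();
        long elapsedMs = (System.nanoTime() - frame.startNanos) / 1_000_000;
        // The number of frames still open is the nesting depth of the one just closed.
        StringBuilder line = new StringBuilder();
        for (int i = 0; i < stack.size(); i++) {
            line.append("  ");
        }
        line.append(frame.name).append(": ").append(elapsedMs).append(" ms");
        RESULTS.get().add(line.toString());
    }

    /** Returns a copy of the lines recorded by the calling thread, in completion order. */
    static List<String> results() {
        return new ArrayList<>(RESULTS.get());
    }
}
```

Inner operations finish first, so they are recorded before the operation that encloses them. Guarding the calls with a `static final boolean` flag, as suggested in that comment, would let the compiler drop them from non-benchmark builds.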
Grant had asked me if he could reuse some code from the indexer benchmarks I wrote. Here are the relevant files, contributed with the expectation they will be cannibalized, not included verbatim.
OK, here is a first crack at a standard benchmark contribution based on Andrzej's original contribution and some updates/changes by me. I wasn't nearly as ambitious as some of the comments attached here, but I think most of them are good things to strive for and will greatly benefit Lucene.
I checked in the basic contrib directory structure, plus some library dependencies, as I wasn't sure how svn diff handles those. I am posting this in patch format to solicit comments first instead of just committing and accepting patches. My plan is to take a round of comments, make updates as warranted, and then make an initial commit.
I am particularly interested in the interface/Driver specification and whether people think this approach is useful or not. My thought behind it was that it might be nice to have a standard way of creating/running benchmarks that could be driven by XML configuration files (some examples are in the conf directory). I am not 100% sold on this and am open to compelling arguments why we should just have each benchmark have its own main() method.
As for the actual Benchmarker, I have created a "standard" version, which runs off the Reuters collection that is downloaded automatically by the ANT task. There are two ANT targets for the two benchmarks: run-micro-standard and run-standard. The micro version takes a few minutes to run on my machine (it indexes 2000 docs), the other one takes a lot longer.
There are several support classes in the stats and util packages. The stats package supports building and maintaining information about benchmarks. The utils package contains one class for extracting information out of the Reuters documents for indexing.
The ReutersQueries class contains a set of queries I created by looking at some of the docs in the collection; they are a mix of term, phrase, span, wildcard and other types of queries. They aren't exhaustive by any means.
It should be stressed that these benchmarks are best used in gathering before and after numbers. Furthermore, these aren't the be-all and end-all of benchmarking for Lucene. I hope the interface nature will encourage others to submit benchmarks for specific areas of Lucene not covered by this version.
Thanks to all who contributed their code/thoughts.
Patch to follow
Initial Benchmark code based on Andrzej's original contribution plus some changes by me to use the Reuters "standard" collection maintained at http://www.daviddlewis.com/resources/testcollections/reuters21578/reuters21578.tar.gz
To run, check out contrib/benchmark and then apply the benchmark.patch in the contrib/benchmark directory.
I tried it and it is working nicely!
The 1st run downloaded the documents from the Web before starting to index. The 2nd run started right off, as the input docs were already in place - great.
Seems the only output is what is printed to stdout, right? I got something like this:
[java] Start Time: Sun Nov 05 22:41:38 PST 2006
[java] # testData id operation runCnt recCnt rec/s avgFreeMem avgTotalMem
I think the "Infinity" and "NaN" values are caused by the op time being too short for the divide-by-seconds.
I very much like the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy.
It seems that all the test logic and some of its data (queries) are coded in Java. I initially thought of a setting where we define tasks/jobs that are parameterized, like:
..and compose a test by an XML that says which of these simple jobs to run, with what params, in which order, serial/parallel, how long/often etc. On the other hand, chances are, I know, that the most useful cases would be those already defined here - standard and micro-standard - so one can ask "why bother changing to define these building blocks". I am not sure here, but thought I'd bring it up.
About using the driver - seems nice and clean to me. I don't know the Digester but it seems to read the config from the XML correctly.
Other comments:
Attached timedata.zip has modified TimeData.java and TestData.java for [1 to 5] above, and for the NaN/Infinity.
> 1st run downloaded the documents from the Web before starting to index. 2nd run started right off - as input docs are already in place - great. Seems the only output is what is printed to stdout, right?
GSI: The Benchmarker interface does return the TimeData, so other implementations, etc. could use the results programmatically.
> I very much like the logic of loading test data from the Web, and the scaleUp and maximumDocumentsToIndex params are handy. It seems that all the test logic and some of its data (queries) are coded in Java. I initially thought of a setting where we define tasks/jobs that are parameterized, like:
GSI: I definitely agree that we want a more flexible one to meet people's benchmarking needs. I wanted at least one test that is "standard" in that you can't change the parameters and test cases, so that we can all be on the same page on a run. Then, when people are having discussions on performance they can say "I ran the standard benchmark before and after and here are the results" and we all know what they are talking about. I think all the components are there for a parameterized version; all it takes is someone to extend the Standard one or implement their own that reads in a config file. I will try to put in a fully parameterized version soon.
GSI: Thanks for the fixes, I will incorporate them into my version and post another patch soon.
I looked at extending the benchmark with:
For this I made lots of changes to the benchmark code, using parts of it and rewriting other parts. I would like to describe how it works to hopefully get early feedback. There are several "basic tasks" defined - all extending an (abstract) class PerfTask:
To further extend the benchmark 'framework', new tasks can be added. Each task must implement the abstract method: doLogic(). For instance, in AddDocTask this method (doLogic) would call indexWriter.addDocument(). A special TaskSequence task contains other tasks. It is either parallel or sequential, which tells if it executes its child tasks serially or in parallel. With these tasks, it is possible to describe a performance test 'algorithm' in a simple syntax. A test invocation takes two parameters:
By convention, for each task class "OpNameTask", the command "OpName" is valid in test.alg.
Adding a single document is done by:
Adding 3 documents:
Or, alternatively:
So, '{' and '}' indicate a serial sequence of (child) tasks.
To fire 100 queries in a row:
Similar, but in parallel:
A sequence task can be named for identifying it in reports:
And there are tasks that create reports.
There are more tasks, and more to tell on the alg syntax, but this post is already long.. I find this quite powerful for perf testing.
I am attaching a sample tiny.* - the .alg and .properties files I currently use - I think they may help to understand how this works.
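The task structure described above (an abstract PerfTask whose subclasses implement doLogic(), plus a TaskSequence that runs its children serially or in parallel) could be sketched roughly as follows. Only the names PerfTask, doLogic, TaskSequence, and AddDocTask come from the description; the "Sketch" classes, the int return value, the `add` method, and `CountingTask` (a stand-in for a real task such as AddDocTask, so the example needs no Lucene classes) are assumptions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical base class: each task implements the work to be measured. */
abstract class PerfTaskSketch {
    /** Performs the task and returns how many records it processed (assumed contract). */
    abstract int doLogic() throws Exception;
}

/** Stand-in for a real task such as AddDocTask: it only bumps a shared counter. */
final class CountingTask extends PerfTaskSketch {
    private final AtomicInteger counter;

    CountingTask(AtomicInteger counter) {
        this.counter = counter;
    }

    @Override
    int doLogic() {
        counter.incrementAndGet();
        return 1;
    }
}

/** A task that contains other tasks and runs them serially or in parallel. */
final class TaskSequenceSketch extends PerfTaskSketch {
    private final List<PerfTaskSketch> children = new ArrayList<PerfTaskSketch>();
    private final boolean parallel;

    TaskSequenceSketch(boolean parallel) {
        this.parallel = parallel;
    }

    TaskSequenceSketch add(PerfTaskSketch child) {
        children.add(child);
        return this;
    }

    @Override
    int doLogic() throws Exception {
        if (!parallel) {
            int total = 0;
            for (PerfTaskSketch child : children) {
                total += child.doLogic();
            }
            return total;
        }
        // Parallel mode: one thread per child, then wait for all of them.
        final AtomicInteger total = new AtomicInteger();
        final List<Exception> failures =
                Collections.synchronizedList(new ArrayList<Exception>());
        List<Thread> threads = new ArrayList<Thread>();
        for (final PerfTaskSketch child : children) {
            Thread thread = new Thread(() -> {
                try {
                    total.addAndGet(child.doLogic());
                } catch (Exception e) {
                    failures.add(e);
                }
            });
            threads.add(thread);
            thread.start();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        if (!failures.isEmpty()) {
            throw failures.get(0);
        }
        return total.get();
    }
}
```

Because a sequence is itself a task, sequences can be nested, which is what lets a small algorithm file describe fairly elaborate serial and parallel test plans.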
OK, how about I commit my changes, then you can add a patch that shows your ideas?
Sounds good.
In this case I will add my stuff under a new package: org.apache.lucene.benchmark2 (this package would have no dependencies on org.apache.lucene.benchmark). I will also add targets in build.xml, and add .alg files under conf. Do you already know when you are going to commit it?
I'm not a big fan of tacking a number on to the end of Java names, as it doesn't let you know much about what's in the file or package. How about ConfigurableBenchmarker or PropertyBasedBenchmarker or something along those lines, since what you are proposing is a property-based one. I think it can just go in the benchmark package, or you could make a sub-package under there that is more descriptive.
I will try to commit tonight or tomorrow morning.
Good point on names with numbers - I'm renaming the package to taskBenchmark, as I think of it as "task sequence" based, more than as properties based.
Would be nice to get some feedback on what I already have at this point for the "task based benchmark framework for Lucene".
So I am packing it as a zip file. I will probably resubmit it as a patch when Grant commits the current benchmark code. To try out taskBenchmark, unzip under contrib/benchmark, on top of Grant's benchmark.patch.
1. replace build.xml - the only change there is adding two targets: run-task-standard and run-task-micro-standard.
2. add 4 new files under conf:
3. add a src package 'taskBenchmark' side by side with the current 'benchmark' package.
To try it out, go to contrib/benchmark and try 'ant run-task-standard' or 'ant run-task-micro-standard'. See inside the .alg files for how a test is specified. The algorithm syntax and the entire package are documented in the package javadoc for taskBenchmark (package.html).
Regards,
Attached taskBenchmark.zip as described earlier.
Committed the benchmark patch plus Doron's update to TestData and TimeData
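For reference, the "Infinity" and "NaN" rates reported earlier in this thread are what Java floating-point division produces when the elapsed time rounds down to zero. The sketch below reproduces the effect and shows one possible guard. It is not the actual TimeData fix; the class and method names are hypothetical, and clamping to one millisecond is just one of several reasonable choices.

```java
/** Hypothetical illustration only; not the TimeData code from the patch. */
final class RateSketch {
    /** Naive rate: dividing by a zero elapsed time gives Infinity, and 0 / 0.0 gives NaN. */
    static double naiveRecsPerSec(long records, long elapsedMillis) {
        return records / (elapsedMillis / 1000.0);
    }

    /** Guarded rate: treats any elapsed time below 1 ms as 1 ms (an assumed policy). */
    static double guardedRecsPerSec(long records, long elapsedMillis) {
        long safeMillis = Math.max(1L, elapsedMillis);
        return records * 1000.0 / safeMillis;
    }
}
```

A clamped rate is an upper bound rather than a true measurement, so a report might still want to flag operations that finished in under a millisecond.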
I am attaching benchmark.byTask.patch - to be applied in the contrib/benchmark directory.
The root package of the byTask classes was changed to org.apache.lucene.benchmark.byTask, along the lines of Grant's suggestion - seems better because it keeps all benchmark classes under
I added a sample .alg under conf and added some documentation. The entry point, documentation-wise, is the package doc for org.apache.lucene.benchmark.byTask.
Thanks for any comments on this!
PS. Before submitting the patch file, I tried to apply it myself on a clean version of the code, just to make sure that it works. But I got errors like this - Could not retrieve revision 0 of "...\byTask\.." - for every file under a new folder. So I am not sure whether it is just my (Windows) svn patch-applying utility, or whether it is really impossible to apply a patch that creates files in (yet) nonexistent directories. I searched the Lucene and SVN mailing lists and went through the SVN book again, but nowhere could I find the expected behavior for applying a patch containing new directories. In fact, "svn diff" would not even show you files that are new (again, this is the Windows svn 1.4.2 version). (I used Tortoise SVN to create the patch.)
This is rather annoying and I might be misunderstanding something basic about SVN, but I thought it'd be better to share this experience here - it might save some time for others trying to apply this patch or other patches...
Doron,
When I apply your patch, I am getting strange errors. It seems to go through cleanly, but then each of the new files (for instance, byTask.stats.Report.java) has its whole content occurring twice, thus causing duplicate class exceptions. This happens for all the files in the byTask package. The changes in the other files apply cleanly. I applied the patch as: patch -p0 -i <patch file> as I always do on a clean version. I suspect that your last comment may be at the root of the issue. Can you try applying this again to a clean version and see if you still have issues or whether it is something I am missing? Can you regenerate this patch, perhaps using a command line tool? Looking at the patch file, I am not sure what the issue is.
Otherwise, based on the documentation, this sounds really interesting and useful. Based on some of your other patches, I assume you are using this to do benchmarking, no?
Thanks,
Grant, thanks for trying this out - I will update the patch shortly.
I am using this for benchmarking - it is quite easy to add new stuff - and in fact I added some stuff lately but did not update here because I wasn't sure if others were interested. I will verify what I have against svn head and pack it here as an updated patch.
Regards, Doron
This update of the byTask package includes:
To apply the patch from the trunk dir: patch -p0 -i <byTask.2.patch.txt>
Grant, I noticed that the patch file contains EOL characters - Unix/DOS thing I guess.
Hey Doron,
Your patch uses JDK 1.5. I am assuming it is safe to use Class.getName in place of Class.getSimpleName, right? I think once I do that plus change the String.contains calls to String.indexOf it should all be fine, right? I have it compiling and running, so that is a good sign. I will look to commit soon.
-Grant
Oops... I had the impression that compiling with compliance level 1.4 is sufficient to prevent this, but I guess I need to read again what that compliance level setting guarantees exactly.
Anyhow, there are 3 things that require 1.5:
Changing Class.getSimpleName() to Class.getName() would not be very nice - the printed queries and task names would be quite ugly. To fix that I added a method simpleName(Class) to byTask.util.Format. I am attaching an updated patch - byTask.jre1.4.patch.txt - that includes this method and removes the Java 1.5 dependency. Thanks for catching this!
Doron,
I have committed your additions. This truly is great stuff. Thank you so much for contributing. The documentation (code and package level) is well done, the output is very readable. The alg language is a bit cryptic and takes a little deciphering, but you do document it quite nicely. I like the extendability factor and I think it will make it easier for people to contribute benchmarking capabilities. I would love to see someone mod the reporting mechanism in the future to allow for printing info to something other than System.out, as I know people have expressed interest in being able to slurp the output into Excel or similar number crunching tools. This could also lead to the possibility of running some of the algorithms nightly and then integrating with JUnitPerf or some other performance unit testing approach. We may want to consider deprecating the other benchmarking stuff, although, I suppose it can't hurt to have multiple opinions in this area. At any rate, this is very much appreciated. I would encourage everyone who is interested in benchmarking to take a look and provide feedback. I'm going to mark this bug as finished for now as I think we have a good baseline for benchmarking at this point. Thanks again, Have committed a baseline benchmarking suite thanks to Doron and Andrzej. Bugs can now be opened specific to the code in the contrib area.
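As an illustration of the Java 1.4 workaround discussed before the commit, a helper along the lines of simpleName(Class) can be written using only Class.getName() and String methods that exist in 1.4. The sketch below is an assumption about how such a helper might look, not the committed byTask.util.Format code; `FormatSketch` and its `contains` method are hypothetical names.

```java
/** Hypothetical illustration only; not the committed byTask.util.Format class. */
final class FormatSketch {
    /** Returns the class name without its package prefix, using only pre-1.5 calls. */
    static String simpleName(Class<?> cls) {
        String name = cls.getName();
        int lastDot = name.lastIndexOf('.');
        return lastDot < 0 ? name : name.substring(lastDot + 1);
    }

    /** A 1.4-compatible replacement for String.contains, based on String.indexOf. */
    static boolean contains(String text, String part) {
        return text.indexOf(part) >= 0;
    }
}
```

One difference from Class.getSimpleName() is worth noting: for a nested class this sketch keeps the outer class and the `$` separator (for example `Map$Entry`), because it only strips the package prefix.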
This has been committed and is available for use. New issues can be opened on specific problems.