Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9, 3.1, 4.0-ALPHA
    • Component/s: modules/benchmark
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      bzip compression can aid the benchmark package by not requiring extracting bzip files (such as enwiki) in order to index them. The plan is to add a config parameter bzip.compression=true/false and in the relevant tasks either decompress the input file or compress the output file using the bzip streams.
      It will add a dependency on ant.jar, which contains two classes similar to GZIPOutputStream and GZIPInputStream that compress/decompress files using the bzip algorithm.

      bzip is known to be superior in its compression performance to the gzip algorithm (~20% better compression), although it does the compression/decompression a bit slower.

      I will post a patch which adds this parameter and implements it in LineDocMaker, EnwikiDocMaker and the WriteLineDoc task. Maybe even add the capability to DocMaker or one of the super classes, so it can be inherited by all sub-classes.
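
      The wiring could be sketched like this; since ant's CBZip2InputStream/CBZip2OutputStream are not on the JDK classpath, JDK gzip streams stand in below to show the same stream-decorator pattern (the helper names and the compression-flag plumbing are illustrative, not the patch's actual code):

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedLineIO {
    // Hypothetical helpers: wrap the raw stream only when compression is
    // enabled. With ant.jar on the classpath, CBZip2OutputStream /
    // CBZip2InputStream would replace the GZIP classes here.
    static OutputStream openOutput(OutputStream raw, boolean compress) throws IOException {
        return compress ? new GZIPOutputStream(new BufferedOutputStream(raw)) : raw;
    }

    static InputStream openInput(InputStream raw, boolean compress) throws IOException {
        return compress ? new GZIPInputStream(new BufferedInputStream(raw)) : raw;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip one "line document" through the compressed streams.
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(openOutput(sink, true), "UTF-8")) {
            w.write("doc1 title\tdoc1 body\n");
        }
        BufferedReader r = new BufferedReader(new InputStreamReader(
                openInput(new ByteArrayInputStream(sink.toByteArray()), true), "UTF-8"));
        String line = r.readLine();
        if (!"doc1 title\tdoc1 body".equals(line)) throw new AssertionError(line);
        System.out.println("round trip ok");
    }
}
```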

      1. LUCENE-1591.patch
        15 kB
        Shai Erera
      2. LUCENE-1591.patch
        20 kB
        Shai Erera
      3. LUCENE-1591.patch
        21 kB
        Shai Erera
      4. LUCENE-1591.patch
        35 kB
        Shai Erera
      5. LUCENE-1591.patch
        45 kB
        Shai Erera
      6. LUCENE-1591.patch
        47 kB
        Shai Erera
      7. LUCENE-1591.patch
        47 kB
        Shai Erera
      8. LUCENE-1591.patch
        2 kB
        Mark Miller
      9. commons-compress-dev20090413.jar
        137 kB
        Uwe Schindler
      10. commons-compress-dev20090413.jar
        137 kB
        Uwe Schindler

        Activity

        Michael McCandless added a comment -

        I'm hitting this, when trying to convert the 20090306 Wikipedia export to a line file:

        Exception in thread "Thread-0" java.lang.ArrayIndexOutOfBoundsException: 2048
        	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
        	at java.lang.Thread.run(Thread.java:619)
        

        From this:

        http://marc.info/?l=xerces-j-user&m=120452263925040&w=2

        It sounds likely that an upgrade to xerces 2.9.1 will fix it. I'm testing it now... if it fixes the issue, I'll commit the upgrade to contrib/benchmark.

        Michael McCandless added a comment -

        So, after upgrading to xerces 2.9.1, I then hit this error:

         
        Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
        	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:101)
        	at java.lang.Thread.run(Thread.java:619)
        Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
        	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
        	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
        	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:77)
        	... 1 more
        

        It appears I'm hitting XERCESJ-1257 which, hideously, is still
        open. Worse, we are already doing the suggested workaround at the
        bottom of the issue. Hmm.

        Michael McCandless added a comment -

        After some iterations on XERCESJ-1257, I managed to apply the original patch on that issue (thank you Robert!), which indeed allows me to process all of Wikipedia's XML export. I'll commit a recompiled xerces 2.9.1 jar with that patch shortly.

        Shai Erera added a comment -

        I wonder why EnwikiDocMaker extends LineDocMaker. The latter assumes the input is given in lines, while the former assumes an XML format ... so why the inheritance?

        This affects EnwikiDocMaker today when LDM.openFile() instantiates a BufferedReader, which is never used by EDM. Is it because of DocState? Perhaps some of the logic in LDM can be pulled up to BasicDocMaker, or a new abstract DocStateDocMaker?
        If there is a good reason, then maybe introduce a protected member useReader and set it to false in EDM? Or override openFile() in EDM and not instantiate the reader?

        Also, somewhat unrelated to this issue, but I found two issues in LDM:

        1. In makeDocument(), if the read line is null, then we first call openFile() and then check 'forever' (and possibly throw a NoMoreDataException). Should we first check forever, and only if it's true call openFile()?
        2. resetInputs() reads the docs.file property and throws an exception if it's not set. Shouldn't this code belong to setConfig?

        I can include those two in the patch as well.
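
        A sketch of the reordered check from point 1, with a toy line source standing in for LineDocMaker's reader (the names here are illustrative, not the actual benchmark API):

```java
import java.util.*;

public class ForeverCheckDemo {
    private final List<String> lines;
    private final boolean forever;
    private int pos = 0;

    ForeverCheckDemo(List<String> lines, boolean forever) {
        this.lines = lines;
        this.forever = forever;
    }

    // Stand-in for openFile(): rewind to the start of the input.
    private void openFile() { pos = 0; }

    // Stand-in for makeDocument(): check 'forever' BEFORE reopening, so a
    // non-forever run signals no-more-data instead of needlessly reopening.
    String nextDoc() {
        if (pos >= lines.size()) {
            if (!forever) throw new NoSuchElementException("no more data");
            openFile();
        }
        return lines.get(pos++);
    }

    public static void main(String[] args) {
        ForeverCheckDemo once = new ForeverCheckDemo(Arrays.asList("a", "b"), false);
        once.nextDoc(); once.nextDoc();
        try { once.nextDoc(); throw new AssertionError("expected no-more-data"); }
        catch (NoSuchElementException expected) { }

        ForeverCheckDemo loop = new ForeverCheckDemo(Arrays.asList("a", "b"), true);
        loop.nextDoc(); loop.nextDoc();
        if (!"a".equals(loop.nextDoc())) throw new AssertionError("should wrap around");
        System.out.println("ok");
    }
}
```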
        Michael McCandless added a comment -

        I wonder why EnwikiDocMaker extends LineDocMaker?

        I'm not sure... I agree it'd be cleaner to not subclass LineDocMaker, and factor out DocState into BasicDocMaker.

        Should we first check forever, and only if it's true call openFile()?

        Yes, let's fix that!

        resetInputs() reads the docs.file property and throws an exception if it's not set. Shouldn't this code belong to setConfig?

        I think it should, but I vaguely remember some odd reason why I put it in resetInputs... try moving it and see?

        Shai Erera added a comment -

        resetInputs() is called from PerfRunData's ctor (as is setConfig), but also from ResetInputsTask. Unless it is possible to change the file name in the middle of execution, I see no reason not to move it to setConfig.

        I'll move it to setConfig and also switch to throwing IllegalArgEx instead of RuntimeEx.

        Another change I'd like to make is to remove the while(true) in makeDoc. All it does is read one line and break, unless that line is null, in which case it reopens the file and reads a line again. I think that in that case, which will happen only after all docs were consumed, and only if forever is set to true, we can just call makeDoc again and avoid the one-instruction loop in every makeDoc call.

        Michael McCandless added a comment -

        OK sounds good!

        Shai Erera added a comment -

        Before I post a patch I wanted to test reading the 20090306 enwiki dump and writing it as a one-line file, all using the bz2 in/out streams. After 9 hours and 2,881,000 documents (!!!), I hit the following exception:

        Exception in thread "Thread-1" java.lang.ArrayIndexOutOfBoundsException
        	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
        	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
        	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
        	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
        	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
        	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
        	at org.apache.lucene.benchmark.byTask.feeds.EnwikiDocMaker$Parser.run(EnwikiDocMaker.java:76)
        	at java.lang.Thread.run(Thread.java:810)
        

        The same exception Mike hit, only from a different method. I'm using the latest xerces jar Mike put up. I'm beginning to think this enwiki dump is jinxed.

        Anyway, I'll post the patch shortly and run on the 20070527 version to verify.

        Shai Erera added a comment -

        The patch touches LineDocMaker, EnwikiDocMaker and WriteLineDocTask.
        Also, it puts ant-1.7.1 in benchmark/lib.

        Uwe Schindler added a comment -

        Do you know http://commons.apache.org/compress/ ?

        It is a commons project that replicates the internals from ANT and other projects for general usage. It is not yet released, but is available as snapshot jars. TIKA uses it, too. It also contains BZIPInputStream. I would prefer this instead of polluting the classpath with a full ant distribution.

        Michael McCandless added a comment -

        Odd – with that patched xerces JAR I was able to parse the full XML. Is it possible your bunzipping code is messing up the XML?

        Shai, why did it take 9 hours to get to that exception? Is bunzip that slow? That seems crazy. (Or are you running tests on a snail-of-a-machine?)

        Can you run only your bunzip code and confirm it produces an XML file that's identical to what bunzip2 from the command line produces? (And measure how long it takes vs the command line).

        Shai Erera added a comment -

        This is how I wrap the FileInputStream (FIS) with the bzip stream:

              if (doBzipCompression) {
                // According to CBZip2InputStream's documentation, we should first
                // consume the first two file header chars ('B' and 'Z'), as well as 
                // wrap the underlying stream with a BufferedInputStream, since CBZip2IS
                // uses the read() method exclusively.
                fileIS = new BufferedInputStream(fileIS, READER_BUFFER_BYTES);
                fileIS.read(); fileIS.read();
                fileIS = new CBZip2InputStream(fileIS);
              }
        

        Is it possible your bunzipping code is messing up the XML?

        I successfully read the file and compressed it with Java's GZIP classes; however, I did not attempt to parse the XML itself. Did you run EnwikiDocMaker on the actual XML or the bz2 archive?
        The 20070527 run should end soon (I hope; it has reached 2.2M documents), so if it doesn't fail, I guess the bzip wrapping is very unlikely to affect the XML parsing.

        Shai, why did it take 9 hours to get to that exception? Is bunzip that slow? That seems crazy.

        I ran the test on my TP 60, which is not a snail-of-a-machine, but definitely not a strong server. You can download the patch and the jar and try it out on your machine.
        But yes, I did notice bzip is very slow compared to gzip; however, it has a better compression ratio. I do want to measure the times though, to give more accurate numbers, but in order to do that I need to finish a successful run first.

        Can you run only your bunzip code and confirm it ...

        I would have done that, but the output XML is 17GB, and doing it twice is not an option on my TP. That's why I wanted this bzip thing in the first place.
        I'll try to do that with the 20070527 version, which hopefully will be ~half the size...

        Michael McCandless added a comment -

        Did you run EnwikiDocMaker on the actual XML or the bz2 archive?

        I downloaded the bz2 2008036 Wikipedia export, ran bunzip2 on the command line, then had to patch Xerces JAR to get it to parse the XML successfully.

        I run the test on my TP 60, which is not a snail-of-a-machine, but definitely not a strong server.

        Hmm – I wonder how long bunzip2 would take on the TP 60. Time to upgrade! Get yourself an X25 SSD!

        I would have done that, but the output XML is 17GB, and doing it twice is not an option on my TP. That's why I wanted this bzip thing in the first place

        Ahh OK

        Shai Erera added a comment -

        I downloaded the bz2 2008036

        I'm almost sure it's a typo, but just to verify - did you download the 20090306 (enwiki-20090306-pages-articles.xml.bz2), or 2008036?

        Anyway, I think I've found a problem. The javadocs document that the InputStream version uses readByte() exclusively, but say nothing about the OutputStream version. I read the code and noticed it always calls write() and never uses the array version.
        So I wrapped the FOS with a BOS (bufSize=64k) and then with the BZOS. I did a short test, reading 2000 records from the 20070527 file, before and after the change:

        Num Docs   Before   After   %tg
        2000       106s     30s     72

        I think that if that improvement is stable, then the 9-hour run should drop to ~3 hours, which seems right. I didn't measure the time to unzip the file using WinRAR (the first time I tried it), but it was a couple of hours.
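
        The effect of the BufferedOutputStream wrap can be seen in isolation with a stream that counts how many calls reach the underlying sink; the counting class below is purely illustrative, standing in for the real FileOutputStream:

```java
import java.io.*;

public class BufferedWriteDemo {
    // Counts calls that reach the underlying ("disk") stream.
    static class CountingStream extends OutputStream {
        int calls = 0;
        public void write(int b) { calls++; }
        public void write(byte[] b, int off, int len) { calls++; }
    }

    public static void main(String[] args) throws IOException {
        int n = 64 * 1024;

        // Byte-at-a-time writes (the pattern the bzip stream uses) hit the
        // sink once per byte when it is unbuffered.
        CountingStream direct = new CountingStream();
        for (int i = 0; i < n; i++) direct.write(i);

        // The same writes through a BufferedOutputStream reach the sink as a
        // handful of large array writes instead.
        CountingStream sink = new CountingStream();
        OutputStream buffered = new BufferedOutputStream(sink, 8192);
        for (int i = 0; i < n; i++) buffered.write(i);
        buffered.flush();

        if (direct.calls != n) throw new AssertionError();
        if (sink.calls > 100) throw new AssertionError("expected few flushes, got " + sink.calls);
        System.out.println("direct=" + direct.calls + " buffered=" + sink.calls);
    }
}
```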

        Once the current run completes, I'll kick off a new one with that code change and note the time difference. I'm eager to see it speed up, but I want to complete a successful run first.

        Shai Erera added a comment -

        Another thing I noticed is that WriteLineDocTask calls flush() after every document it writes. Any reason to do it? We use BufferedWriter, and calling flush() after every document is a bit expensive, I think.
        I quickly measured the same 2000-document run and it finished in 28 seconds, a 7% improvement compared to the 'after' run and a 74% improvement compared to the 'before'.
        So if there's a good reason, we can keep it (the performance gain is not that high), but otherwise I think we should remove it and count on PerfTask.close() being called at the end of the run (perhaps the absence of close() was the reason to call flush() in the first place?).
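
        The flush()-vs-close() behavior is easy to demonstrate with a plain BufferedWriter; the sketch below only illustrates the buffering contract, not the actual WriteLineDocTask code:

```java
import java.io.*;

public class FlushDemo {
    public static void main(String[] args) throws IOException {
        StringWriter sink = new StringWriter();
        BufferedWriter out = new BufferedWriter(sink, 1 << 16);

        // Write many "documents" without a per-document flush(); the buffer
        // absorbs them and nothing reaches the sink yet.
        for (int i = 0; i < 100; i++) out.write("doc" + i + "\n");
        if (sink.toString().length() != 0) throw new AssertionError("flushed too early");

        // close() flushes once at the end of the run, which is the behavior
        // a close() hook on the task would provide.
        out.close();
        if (!sink.toString().startsWith("doc0\n")) throw new AssertionError();
        System.out.println("all docs flushed on close");
    }
}
```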

        Michael McCandless added a comment -

        I'm almost sure it's a typo, but just to verify - did you download the 20090306 (enwiki-20090306-pages-articles.xml.bz2), or 2008036?

        Sorry I meant 20090306.

        I did a short test, reading 2000 records from the 20070527 file, before and after the change:

        Excellent!

        Another thing I noticed is that WriteLineDocTask calls flush() after every document it writes. Any reason to do it?

        Hmm that should not be needed; I'd say remove it? But, implement close() to actually close the stream?

        Shai Erera added a comment -

        But, implement close() to actually close the stream?

        Already did, I had to because otherwise the bzip file wasn't sealed properly (that's why I started the other thread about tracking task resources). It already exists in the attached patch.

        I'm finishing a run with the updated code (wrapping w/ BOS), so once that finishes, I'll post an updated patch and some numbers.

        Michael McCandless added a comment -

        Already did, I had to because otherwise the bzip file wasn't sealed properly (that's why I started the other thread about tracking task resources). It already exists in the attached patch.

        Oh yeah, right, I already forgot. Feels so long ago.

        Shai Erera added a comment -

        Here are some numbers:

        • Reading the enwiki bz2 file with CBZip2InputStream, wrapped as a BufferedReader and reading one line at a time, took 28m. Unzipping with WinRAR took ~30m (this also includes writing the uncompressed data to disk). So in that respect, the code does not fall short of other bunzip tools (at least not WinRAR).
        • Before the change, reading the compressed data, parsing it and writing it to a compressed one-line file took 7h (3.1M documents were read). After the change (wrapping with BOS and removing flush()) it took 2h, so a significant improvement here.

        Overall, I think the performance of the BZIP classes is reasonable. Most of the time spent in the algorithm is in compressing the data, which is usually a process done only once. The result is a 2.5GB enwiki file compressed to a 2.31GB one-line file (8.5GB uncompressed content).

        I compared the time it takes to read 100k lines from the compressed and un-compressed one-line file: compressed - 2.26m, un-compressed - 1.36m (~66% slower when compressed). The difference is significant; however, I'm not sure how much of the overall process (i.e., reading the documents and indexing them) it accounts for. On my machine it would take 1.1 hours to read the data, but I'm sure it will take more to index it, and the indexing time is the same whether we read the data from a bzip archive or not.

        I'll attach the patch shortly, and I think overall this is a good addition. It is off by default, and configurable, so if someone doesn't care about disk space, he can always run the indexing algorithm on an un-compressed one-line file.

        Shai Erera added a comment -

        Patch includes:

        • Wrapping the FileOutputStream with a BufferedOutputStream.
        • Removing the calls to flush().
        • Enhancement to EnwikiDocMaker's startElement and endElement - instead of calling String.equals on the qualified name against 5 different strings, I added a static map from String to Integer and a static method getElementType which returns an int. I then changed those methods to do a 'switch' on the type. I haven't measured the perf. gain, but it's clear it should improve things ...
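The first two bullets above (buffering the output stream and dropping per-line flush() calls) can be sketched with the JDK's GZIP streams standing in for the bzip2 ones, which are not in the JDK; the class and helper names here are hypothetical:

```java
import java.io.*;
import java.util.zip.*;

public class BufferedCompression {
    // Write lines through a compressor. The BufferedOutputStream batches the
    // many small write() calls a Writer issues, and nothing is flushed until
    // close(), so the compressor can work on whole blocks.
    static byte[] writeCompressed(String[] lines) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // GZIPOutputStream stands in for the bzip2 output stream discussed
        // above; the buffering pattern is identical.
        try (Writer w = new OutputStreamWriter(
                new BufferedOutputStream(new GZIPOutputStream(sink), 1 << 16), "UTF-8")) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
                // note: no per-line flush() here -- that was the bottleneck removed above
            }
        }
        return sink.toByteArray();
    }

    // Read the compressed bytes back, one line at a time.
    static String readBack(byte[] data) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(data)), "UTF-8"));
        StringBuilder sb = new StringBuilder();
        String line;
        while ((line = r.readLine()) != null) {
            sb.append(line).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = writeCompressed(new String[] {"doc one", "doc two"});
        System.out.print(readBack(data));
    }
}
```

The key point is that the BufferedOutputStream sits between the Writer and the compressor, so the compressor sees large chunks instead of many tiny writes, and no partial data is forced out mid-stream.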

        There is an open question regarding the ant-1.7.1.jar dependency. Uwe mentioned the commons Compress project, which handles the bzip format (as well as others). I took a look and found no place to download a jar, and this looks like a 'young' project, with very little documentation. This is not to say the code is of low quality or not to be trusted, it's just that I prefer the ant dependency, at least until this project matures enough. And anyway I guess everyone who uses Lucene has Ant on his system, so this doesn't look like a major dependency.

        However, if you think otherwise, then we should get a jar from there (checking out the code and building it manually is the only way I see, but please correct me if I'm wrong) and adapt the code to use it, do perf. measurements again etc.

        Shai Erera added a comment -

        BTW, the enhancements to EnwikiDocMaker yielded another 2% improvement to the process of converting the enwiki file to a one-line file. Just a FYI.
        I'm basically holding off on LUCENE-1595 (refactoring of benchmark) until this one is committed, so the sooner the better

        Michael McCandless added a comment -

        Should we consider using Compress from Apache Commons (from Uwe's comment above) instead of the full ant jar?

        I basically wait with 1595 (refactoring to benchmark) until this one is committed, so the sooner the better

        Does this issue depend on LUCENE-1595?

        Uwe Schindler added a comment -

        The problem is that the project is currently moving to Commons top-level. The SVN paths changed, but the website was not updated, and so on. The snapshot jars are not accessible at the moment.
        I could quickly build a JAR and attach it here. To get the code running, you only have to change the package imports. Ideally one would use the Factory to create the decompressor (and then one does not need to skip the 2 bytes with "BZ").
        Uwe

        Shai Erera added a comment -

        Does this issue depend on LUCENE-1595?

        No, the other way around. Well ... it's not an actual dependency, just that 1595 will touch a lot of files, and I want to minimize the noise of working on two issues that touch the same files (1595 will touch all the files this one touches) simultaneously. It's just a matter of convenience ...

        Besides, I don't see what else can be done as part of this issue. The performance is reasonable, the code is quite simple. The patch includes some more enhancements to those files that are unrelated to bzip per se, but are still required.

        BTW, I successfully executed indexLineFile.alg on the 20070527 one-line bz2 file and the overall indexing process ended in 1h, which seems reasonable to me.

        Regarding Apache Compress, I asked the same question, so it's not fair to return it with a question. I don't think we should decide that now. It can be changed in LUCENE-1595 if we think Compress is the better approach. Personally I prefer the ant jar, even though I realize it's adding a large dependency for just 3-4 classes ...

        Shai Erera added a comment -

        Uwe, if you can attach the jar here, I can make the necessary code changes and run some tests again. We can then decide based on whether it works with the Compress classes or not.

        Uwe Schindler added a comment -

        Here is the latest snapshot build of commons compress. All tests passed during the "mvn install" run.
        About the initial "BZh" bytes: the javadocs still say that they should be read before opening the stream, but the examples on the website and the BZip2 decompressor code read:

        private void init() throws IOException {
            if (null == in) {
                throw new IOException("No InputStream");
            }
            if (in.available() == 0) {
                throw new IOException("Empty InputStream");
            }
            checkMagicChar('B', "first");
            checkMagicChar('Z', "second");
            checkMagicChar('h', "third");
            // ...
        }

        So I think, the reading of the initial two bytes can be left out. If something is wrong, this class should throw an IOException.

        Here is some usage: http://wiki.apache.org/commons/Compress (this shows that decompressing a bzip2 file does not need to skip the header),
        and here are the javadocs: http://commons.apache.org/compress/apidocs/index.html
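For illustration, the magic-char check in the snippet above boils down to verifying the first three bytes of the stream; here is a minimal, dependency-free sketch (hypothetical class and method names, using the JDK only so it runs without commons-compress):

```java
import java.io.*;

public class BzipMagic {
    // Mirrors the checkMagicChar logic quoted above: a bzip2 stream must
    // start with the three bytes 'B', 'Z', 'h' (followed by a block-size digit).
    static boolean hasBzip2Magic(InputStream in) throws IOException {
        return in.read() == 'B' && in.read() == 'Z' && in.read() == 'h';
    }

    public static void main(String[] args) throws IOException {
        byte[] bzip2Header = {'B', 'Z', 'h', '9'};      // "BZh9": magic plus block size
        byte[] gzipHeader = {0x1f, (byte) 0x8b};        // gzip magic, for contrast
        System.out.println(hasBzip2Magic(new ByteArrayInputStream(bzip2Header))); // true
        System.out.println(hasBzip2Magic(new ByteArrayInputStream(gzipHeader)));  // false
    }
}
```

Since the decompressor validates these bytes itself, callers no longer need to skip or pre-read the header before handing over the stream.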

        Shai Erera added a comment -

        Ok I'm convinced. I moved to commons-compress and it works great. The jar is smaller and it is the more logical dependency. Since this project is still young we should expect changes, which is good, since it means we can actually improve the In(Out) compressing streams to use more efficient methods, such as read(byte[]) and write(byte[]).

        Michael McCandless added a comment -

        Patch looks good!

        Could you add a test case that eg writes a bzip'd line file, then reads it back & indexes it, or something along those lines?

        Also: should we make "use bzip" pay attention to suffix when defaulting itself? Ie if I explicitly specify "bzip.compression" then listen to me, but if I didn't specify it and my line file source ends with .bz2, default it to true? (And likewise for WriteLineDoc)?
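The suggested defaulting rule could look roughly like this (a hypothetical sketch; in the real task the values would come through benchmark's Config, and the names here are made up for illustration):

```java
public class BzipDefault {
    // Hypothetical defaulting rule sketched from the comment above: an explicit
    // "bzip.compression" property wins; otherwise infer from the file suffix.
    static boolean useBzip(String explicitProp, String fileName) {
        if (explicitProp != null) {
            return Boolean.parseBoolean(explicitProp);
        }
        return fileName.endsWith(".bz2") || fileName.endsWith(".bzip");
    }

    public static void main(String[] args) {
        System.out.println(useBzip(null, "enwiki.txt.bz2"));    // true: inferred from suffix
        System.out.println(useBzip("false", "enwiki.txt.bz2")); // false: explicit setting wins
        System.out.println(useBzip(null, "enwiki.txt"));        // false: no hint either way
    }
}
```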

        Uwe Schindler added a comment -

        I created my first bug report for Compress handling the inconsistency in javadocs and the compressor part with the Bzip2 header (compression does not add header, decompression needs header): COMPRESS-69

        Shai Erera added a comment -

        argh, you beat me to it - I planned to do so myself
        For some reason their OutputStream has the file-header writing commented out, with a comment saying "this is added by the caller", yet their InputStream reads the headers ... strange. Anyway, once that's fixed and we upgrade to a proper jar, the unit test I am working on now will fail, and it will remind us to remove writing the headers in WriteLineDocTask.

        Mike - I am working on the unit test as well as defaulting by extension. I hope a patch will be available sometime later today.

        Uwe Schindler added a comment -

        It is fixed now, including the JavaDocs: COMPRESS-69

        Shai Erera added a comment -

        I updated the code from SVN, but I still see wrong javadocs. In the class javadocs, for both classes, the first line still says "(without file headers)". Also, Bzip2TestCase has an xtestBzipCreation() - the 'x' prevents this test from running under JUnit - is that intentional? I removed the 'x' and the test passes.

        Uwe Schindler added a comment -

        I added as comment to COMPRESS-69:

        you forgot to enable the test again...

        He disabled the test (he added the de/encode test directly after opening the issue because of my comment about a missing test) because it failed until he had a solution.

        Uwe Schindler added a comment -

        Now it's really fixed: compression and decompression work symmetrically, the test case is enabled, and the javadocs are fixed. That was really fast issue fixing, congratulations to COMPRESS

        Shai Erera added a comment -

        Great!
        Uwe, can you please update the jar in this issue? I will make sure the test passes with it.

        Uwe Schindler added a comment -

        Here it is. I thought you had checked it out too, and created a JAR yourself. I have not done anything else; it's the (renamed) JAR file from the "target" dir after "mvn install".

        Shai Erera added a comment -

        Sorry about that. I didn't know what to do with the pom.xml. Given your comment above, I'll install Maven and use it next time

        Shai Erera added a comment -

        Patch includes:

        • BenchmarkTestCase (currently just sets the working directory, but functionality can be added in the future).
        • LineDocMakerTest
        • WriteLineDocTaskTest
        • Update code according to the latest commons-compress.jar (i.e., not read/write file header chars).
        Michael McCandless added a comment -

        I had some trouble w/ the patch...

        First, I had to edit contrib/benchmark's build.xml to add the compress JAR onto the classpath (things wouldn't compile otherwise).

        Then I see failures in TestPerfTasksParse, eg:

            [junit] java.lang.Exception: Error: cannot understand algorithm!
            [junit] 	at org.apache.lucene.benchmark.byTask.Benchmark.<init>(Benchmark.java:63)
            [junit] 	at org.apache.lucene.benchmark.byTask.TestPerfTasksParse.doTestAllTasksSimpleParse(TestPerfTasksParse.java:171)
            [junit] 	at org.apache.lucene.benchmark.byTask.TestPerfTasksParse.testAllTasksSimpleParse(TestPerfTasksParse.java:140)
            [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
            [junit] 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
            [junit] 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
            [junit] 	at java.lang.reflect.Method.invoke(Method.java:597)
            [junit] 	at junit.framework.TestCase.runTest(TestCase.java:164)
            [junit] 	at junit.framework.TestCase.runBare(TestCase.java:130)
            [junit] 	at junit.framework.TestResult$1.protect(TestResult.java:106)
            [junit] 	at junit.framework.TestResult.runProtected(TestResult.java:124)
            [junit] 	at junit.framework.TestResult.run(TestResult.java:109)
            [junit] 	at junit.framework.TestCase.run(TestCase.java:120)
            [junit] 	at junit.framework.TestSuite.runTest(TestSuite.java:230)
            [junit] 	at junit.framework.TestSuite.run(TestSuite.java:225)
            [junit] 	at org.junit.internal.runners.OldTestClassRunner.run(OldTestClassRunner.java:35)
            [junit] 	at junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:32)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRunner.java:421)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTestRunner.java:912)
            [junit] 	at org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestRunner.java:766)
            [junit] Caused by: java.lang.reflect.InvocationTargetException
            [junit] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
            [junit] 	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
            [junit] 	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
            [junit] 	at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
            [junit] 	at org.apache.lucene.benchmark.byTask.utils.Algorithm.<init>(Algorithm.java:69)
            [junit] 	at org.apache.lucene.benchmark.byTask.Benchmark.<init>(Benchmark.java:61)
            [junit] 	... 19 more
            [junit] Caused by: java.lang.IllegalArgumentException: line.file.out must be set
            [junit] 	at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTask.<init>(WriteLineDocTask.java:73)
            [junit] 	... 25 more
        

        And the new LineDocMakerTest fails with this:

            [junit] Testcase: testBZip2WithBzipCompressionDisabled(org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest):	FAILED
            [junit] expected:<1> but was:<0>
            [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
            [junit] 	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:96)
            [junit] 	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
        

        WriteLineDocTest shows a similar failure. Not sure what's up...

        Shai Erera added a comment -

        That's strange ...
        About the build.xml, I think the problem lies in line 110, where the classpath defines explicit jars. I changed it to:

            <path id="classpath">
                <pathelement path="${common.dir}/build/classes/java"/>
                <pathelement path="${common.dir}/build/classes/demo"/>
                <pathelement path="${common.dir}/build/contrib/highlighter/classes/java"/>
                <fileset dir="lib">
                    <include name="**/*.jar"/>
                </fileset>
            </path>
        

        and it compiled successfully. I think this change is good since it will prevent such problems in the future (in case more dependencies will be added).

        About the test failures - they pass for me in Eclipse but fail in Ant. I believe I know the reason - previously, WriteLineDocTask's ctor logic was in its setUp method. I moved it to the ctor since setUp is called for every document, and the initialization there did not seem right to me. The "line.file.out" property is indeed mandatory, hence the exception.
        The reason it doesn't fail in Eclipse is that this task is not explicitly defined in findTasks(), and I don't have the "tasks.dir" env variable defined. As soon as I add this line:

        tsks.add(  " WriteLineDoc             "  );
        

        to findTasks(), the test fails.

        I see several ways to solve it:

        • Make line.file.out optional, and if not set create a ByteArrayOutputStream instead of a FileOutputStream. This can also help the tests not create unnecessary files.
        • Move the logic back to setUp while checking a boolean if we've been initialized yet. I don't like it very much - I think setUp and tearDown should be reserved for per-doLogic calls.
        • Add INDENT+"line.file.out=test/line.file" + NEW_LINE to TestPerfTasksParse.propPart. I don't like it either, since propPart is reserved for properties that are common to all tasks.

        I like (1) most.
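Option (1) could be sketched like this (a hypothetical helper, not the actual WriteLineDocTask code; the in-memory fallback keeps tests from touching the file system):

```java
import java.io.*;

public class LineFileOut {
    // Sketch of option (1) above: if the "line.file.out" property is unset,
    // fall back to an in-memory buffer instead of failing in the ctor.
    static OutputStream openLineFileOut(String lineFileOut) throws IOException {
        if (lineFileOut == null || lineFileOut.length() == 0) {
            return new ByteArrayOutputStream(); // test-friendly fallback, no file created
        }
        return new BufferedOutputStream(new FileOutputStream(lineFileOut));
    }

    public static void main(String[] args) throws IOException {
        OutputStream out = openLineFileOut(null);
        out.write("title\tdate\tbody\n".getBytes("UTF-8"));
        System.out.println(out instanceof ByteArrayOutputStream); // true: fallback was used
    }
}
```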
        Michael McCandless added a comment -

        I think this change is good since it will prevent such problems in the future (in case more dependencies will be added).

        That sounds good.

        I agree WriteLineDocTask should pull its config in ctor, not setUp.

        But: I don't think WriteLineDocTask should be created when it's not going to be used, ie the TestPerfTasksParse.doTestAllTasksSimpleParse seems wrong?

        Shai Erera added a comment -

        Not sure what you mean. The test doesn't use any Task, it just attempts to parse algorithm texts with those tasks defined. Do you suggest we exclude WriteLineDocTask from the test?
        Perhaps we can wrap the new Benchmark() call with a try-catch on IAE and log such tests but don't fail? That way, if a certain task has mandatory properties, it shouldn't fail the test ...
        Another option is to define for each tested task mandatory properties, in addition to the common ones used for all tasks ...

        Unless I misunderstand you, I don't see why this test is wrong.

        Michael McCandless added a comment -

        The test seems to assume you can take any Task in the source tree, and make an alg that simply creates that task.

        I think that assumption is in fact wrong, because tasks like WriteLineDocTask indeed require certain configuration (line.file.out) be set, and the test can't know that. Other tasks in the future will presumably hit the same issue.

        Also, thinking about the test, I think it doesn't add much value? Elsewhere we heavily test that the .alg parser works properly. And all this test does is take every task, stick it in either "XXX", "[ XXX ] : 2" or "{ XXX } : 3", parse it, and verify it parsed properly.

        I think we should simply turn those three tests off? Or, if that seems too drastic, simply skipping WriteLineDocTask seems OK too?

        Shai Erera added a comment -

        We can turn them off, or wrap new Benchmark with try-catch Exception, logging a failed task. Alternatively, we can add an 'exclude' list which will define tasks that should be discarded by the test, and add WriteLineDocTask to it.

        However, if you think those are useless, i.e. we test .alg parsing elsewhere (and I agree these tests don't add much value), then I agree we should remove them, rather than working hard to mask the test's limitations.

        Michael McCandless added a comment -

        OK, let's just remove them. Can you post new patch? Thanks.

        Shai Erera added a comment -

        All benchmark tests pass. Note: when you apply the patch, make sure you include the latest commons-compress jar Uwe uploaded.

        Michael McCandless added a comment -

        Hmm I'm still hitting some errors, eg:

        [junit] Testcase: testRegularFileWithBZipCompressionEnabled(org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest):	FAILED
        [junit] expected:<3> but was:<1>
        [junit] junit.framework.AssertionFailedError: expected:<3> but was:<1>
        [junit] 	at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.doReadTest(WriteLineDocTaskTest.java:87)
        [junit] 	at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.testRegularFileWithBZipCompressionEnabled(WriteLineDocTaskTest.java:144)
        [junit] 
        

        and

        [junit] Testcase: testBZip2WithBzipCompressionDisabled(org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest):	FAILED
        [junit] expected:<1> but was:<0>
        [junit] junit.framework.AssertionFailedError: expected:<1> but was:<0>
        [junit] 	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:96)
        [junit] 	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
        
        Shai Erera added a comment -

        That's strange ... I did the following:

        • Checked out trunk to a new project.
        • Downloaded the latest commons-compress jar Uwe added.
        • Applied the patch.
        • Ran "ant test".

        The result is: BUILD SUCCESSFUL, and I see those two test cases pass ... I also ran all tests from Eclipse; they pass too.

        testRegularFileWithBZipCompressionEnabled simulates an attempt to read a bz2 file as a regular file. The very first readLine() should throw a MalformedException or something ... that's what the test is counting on. It seems that in your case this line succeeds, reading something, and then fails on String.split(), since it probably didn't read anything meaningful. I don't understand why this would happen though ...
        Can you run this test alone, w/o the rest? Perhaps debug-trace it? The test does not delete the in/output file before and after the test, but relies on the FileOutputStream(String/File) ctor, which is supposed to re-create the file even if it exists. Could it be that in your case it doesn't happen?

        I assume the second exception is thrown for the same reason. Following the steps I've done above to apply the patch, I don't understand why the test fails on your machine ...

        Michael McCandless added a comment -

        So, for LineDocMakerTest.testBZip2WithBzipCompressionDisabled, indeed LineDocMaker opens the binary file, but then no exception is hit: it looks for a tab delimiter, and when it can't find one, sets body/title/date to "" and adds the doc anyway.

        In your case you hit some exception – can you e.printStackTrace(System.out) and post back what exception that is? Maybe somehow your bzip2 is putting a tab in the binary but mine's not?

        Shai Erera added a comment -
        sun.io.MalformedInputException
        	at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java:262)
        	at sun.nio.cs.StreamDecoder$ConverterSD.convertInto(StreamDecoder.java:314)
        	at sun.nio.cs.StreamDecoder$ConverterSD.implRead(StreamDecoder.java:364)
        	at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:250)
        	at java.io.InputStreamReader.read(InputStreamReader.java:212)
        	at java.io.BufferedReader.fill(BufferedReader.java:157)
        	at java.io.BufferedReader.readLine(BufferedReader.java:320)
        	at java.io.BufferedReader.readLine(BufferedReader.java:383)
        	at org.apache.lucene.benchmark.byTask.feeds.LineDocMaker.makeDocument(LineDocMaker.java:187)
        	at org.apache.lucene.benchmark.byTask.tasks.AddDocTask.setup(AddDocTask.java:61)
        	at org.apache.lucene.benchmark.byTask.tasks.PerfTask.runAndMaybeStats(PerfTask.java:92)
        	at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doSerialTasks(TaskSequence.java:148)
        	at org.apache.lucene.benchmark.byTask.tasks.TaskSequence.doLogic(TaskSequence.java:129)
        	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.doIndexAndSearchTest(LineDocMakerTest.java:92)
        	at org.apache.lucene.benchmark.byTask.feeds.LineDocMakerTest.testBZip2WithBzipCompressionDisabled(LineDocMakerTest.java:119)
        	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:79)
        	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        	at java.lang.reflect.Method.invoke(Method.java:618)
        	at junit.framework.TestCase.runTest(TestCase.java:164)
        	at junit.framework.TestCase.runBare(TestCase.java:130)
        	at junit.framework.TestResult$1.protect(TestResult.java:106)
        	at junit.framework.TestResult.runProtected(TestResult.java:124)
        	at junit.framework.TestResult.run(TestResult.java:109)
        	at junit.framework.TestCase.run(TestCase.java:120)
        	at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
        	at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
        	at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
        
        Shai Erera added a comment -

        A long shot - can you please print the line read in makeDocument? Could it be that the line is not null, but 0 length (or contains just whitespaces)? I just thought that we're running on two different OSs (I run on Windows and you on Linux/Mac?) and perhaps on your OS the first readLine() succeeds, reading a blank line or something, and the second will fail, attempting to read the actual information?
        Weird though ...

        Michael McCandless added a comment -

        Here's the line I see, nice and binary (copy/paste lost the exact chars I'm sure...): BZh91AY&SY@9J

        Michael McCandless added a comment -

        Yeah I'm on OS X Leopard. I just tested on a Debian linux derivative and also see the test failing. Weird. Not quite "write once run anywhere"

        Shai Erera added a comment -

        Well ... that worries me ... when I open the bz2 file (with notepad++), I see the same line, but on my machine, readLine() fails with that MIE. It's as if on my machine the readLine() call attempts to fill the buffer of BR, and then hits the exception, while on your machine it just stops in the middle.

        So I wonder how to fix it - LineDocMaker's logic is ok - makeDocument() just reads lines. There's no point adding code which tries to compensate for any OS-specific weirdness. Perhaps we can change the 'else' part (which assigns title, body, date to "") to throw a RuntimeException (or MIE) in that case, since obviously this shouldn't happen and if it does - it's really a bug in the file format?

        Or, I can just remove the test ... but I think the above suggestion makes sense, and will solve it. Mike, if you agree, can you quickly apply that to your env. and note if the test fails? (it must fail, but I just want to be sure).

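        For context on why the two machines disagree: the behavior hinges on how the JVM's UTF-8 decoder handles malformed bytes. The following standalone sketch (class name and the exact trailing bytes are illustrative; they just need to form an incomplete UTF-8 sequence, as the binary bytes after a bzip2 header do) shows that a strict decoder reports the problem as MalformedInputException, while a lenient one silently substitutes replacement characters and lets readLine() "succeed":

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class DecodeCheck {
    public static void main(String[] args) throws Exception {
        // "BZh91AY&SY" followed by bytes forming an incomplete UTF-8 sequence,
        // roughly what the start of a .bz2 file looks like.
        byte[] bz2Header = {'B', 'Z', 'h', '9', '1', 'A', 'Y', '&', 'S', 'Y',
                            (byte) 0xE9, (byte) 0x9A};

        // Lenient decoding: malformed input becomes U+FFFD and decoding "succeeds".
        CharsetDecoder lenient = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPLACE)
            .onUnmappableCharacter(CodingErrorAction.REPLACE);
        System.out.println("lenient: " + lenient.decode(ByteBuffer.wrap(bz2Header)));

        // Strict decoding: the malformed sequence is reported as an exception,
        // like the MalformedInputException in the stack trace above.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT);
        try {
            strict.decode(ByteBuffer.wrap(bz2Header));
            System.out.println("strict: decoded without error");
        } catch (MalformedInputException e) {
            System.out.println("strict: malformed input reported");
        }
    }
}
```

        A JVM whose Reader path is effectively strict would fail on the very first readLine(), while a lenient one would return mojibake and fail later (or not at all) — which would explain the test behaving differently across machines.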
        Michael McCandless added a comment -

        Mike, if you agree, can you quickly apply that to your env. and note if the test fails?

        You mean confirm the test passes on adding the RuntimeException on the else clause, right?

        Yes, indeed the test passes with this change. And I like the change (making LineDocMaker more brittle on receiving a malformed line). So let's go forward with that?

        Shai Erera added a comment -

        Let's try with this one. Changes:

        • Added testInvalidFormat to LineDocMakerTest
        • Changed LineDocMaker to throw RuntimeException in case a line does not have two TABs.
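        The stricter parsing in the second bullet can be sketched as follows (a standalone sketch, not the actual LineDocMaker code; the title TAB date TAB body field order is assumed from the line-file format):

```java
public class LineParse {
    private static final char SEP = '\t';

    // Split a line-file line into { title, date, body }, throwing instead of
    // silently producing empty fields when the two separators are missing.
    static String[] parse(String line) {
        int spot = line.indexOf(SEP);
        int spot2 = spot < 0 ? -1 : line.indexOf(SEP, spot + 1);
        if (spot < 0 || spot2 < 0) {
            throw new RuntimeException(
                "line does not match the title<TAB>date<TAB>body format: " + line);
        }
        return new String[] {
            line.substring(0, spot),         // title
            line.substring(spot + 1, spot2), // date
            line.substring(spot2 + 1)        // body
        };
    }

    public static void main(String[] args) {
        String[] fields = parse("Some Title\t2009-04-13\tSome body text");
        System.out.println(fields[0] + " | " + fields[1] + " | " + fields[2]);
        try {
            parse("BZh91AY&SY"); // a binary-looking line with no tabs
        } catch (RuntimeException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

        With this change, feeding a bz2 file to the plain-text reader fails fast on the first malformed line instead of indexing empty documents.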
        Michael McCandless added a comment -

        I'm still seeing the one WriteLineDocTest failure:

            [junit] Testcase: testRegularFileWithBZipCompressionEnabled(org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest):	FAILED
            [junit] expected:<3> but was:<1>
            [junit] junit.framework.AssertionFailedError: expected:<3> but was:<1>
            [junit] 	at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.doReadTest(WriteLineDocTaskTest.java:87)
            [junit] 	at org.apache.lucene.benchmark.byTask.tasks.WriteLineDocTaskTest.testRegularFileWithBZipCompressionEnabled(WriteLineDocTaskTest.java:144)
        

        I think it's a similar issue – the doReadTest must hit an exception in readline() on your OS, but not mine.

        Shai Erera added a comment -

        I removed this test from WriteLineDocTaskTest, since it doesn't really belong there. It tested that if WLDT created a bz2 file, an attempt to read it as regular would fail. But reading is not part of WLDT's logic, and that test case belongs (and already exists) in LDM test.

        I'm tempted to say "this patch should be fine", but given the history of this issue and the OS weirdness, I'm being careful.

        Michael McCandless added a comment -

        All tests pass! And patch looks good. I'll commit shortly. Thanks Shai!

        Shai Erera added a comment -

        Mike, did you commit the commons-compress jar too?

        Michael McCandless added a comment -

        Mike, did you commit the commons-compress jar too?

        Woops, forgot, and now fixed – thanks for catching that!

        Jason Rutherglen added a comment -

        Related to the new xerces-2.9.1-patched-XERCESJ-1257.jar in
        contrib/benchmark I get a
        "java.lang.UnsupportedClassVersionError: Bad version number in
        .class file" message when building.

        Can you please verify?

        Environment: Java(TM) 2 Runtime Environment, Standard Edition
        (build 1.5.0_16-b06-284) Java HotSpot(TM) Client VM (build
        1.5.0_16-133, mixed mode, sharing)

        Shai Erera added a comment -

        Hmmm ... Mike built that file from the xerces project, after patching it with XERCESJ-1257. I don't know, though, which JRE he used to build it. Can you please post the full stack trace (I'm mostly interested in the .class file with the problem and the major/minor version it reports).

        I use 1.5 as well and don't experience this error. I "cd benchmark" then "ant jar" and it finished successfully.

        Michael McCandless added a comment -

        For the record, here's the patch I had applied to XercesJ 2.9.1 sources:

        --- UTF8Reader.java	2006-11-23 00:36:53.000000000 +0100
        +++ /home/rainman/lucene/xerces-2_9_0/src/org/apache/xerces/impl/io/UTF8Reader.java	2008-04-04 00:40:58.000000000 +0200
        @@ -534,6 +534,16 @@
                             invalidByte(4, 4, b2);
                         }
         
        +                // check if output buffer is large enough to hold 2 surrogate chars
        +                if( out + 1 >= offset + length ){
        +                    fBuffer[0] = (byte)b0;
        +                    fBuffer[1] = (byte)b1;
        +                    fBuffer[2] = (byte)b2;
        +                    fBuffer[3] = (byte)b3;
        +                    fOffset = 4;
        +                    return out - offset;
        +		}
        +
                         // decode bytes into surrogate characters
                         int uuuuu = ((b0 << 2) & 0x001C) | ((b1 >> 4) & 0x0003);
                         if (uuuuu > 0x10) {
        
        Michael McCandless added a comment -

        I just committed a JDK 1.4 build of the patched XercesJ jar (I think I had used 1.5 previously, though I don't understand why Jason was having trouble using it).

        Jason can you try with this new JAR?

        Uwe Schindler added a comment -

        Commons-Compress 1.0 is now released, we should use the official JAR file:
        http://commons.apache.org/compress/download_compress.cgi

        Should I update and test compilation?

        Michael McCandless added a comment -

        Excellent! Yes I think so?

        Uwe Schindler added a comment -

        I replaced the dev version with 1.0 and it compiled fine. All tests pass. I did not test the enwiki task (takes too long), but according to the Compress changelog there were no changes in the bzip code.
        I'll commit shortly.

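        For reference, the stream wiring committed here follows the same wrapping pattern as the JDK's GZIP classes; Commons Compress's BZip2CompressorInputStream / BZip2CompressorOutputStream are drop-in analogues. A stdlib-only sketch of the round trip (GZIP stands in for bzip2 so the example runs without the commons-compress jar; class and variable names are illustrative, not the benchmark code):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class RoundTrip {
    public static void main(String[] args) throws IOException {
        String line = "Title\t2009-04-13\tBody text";
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();

        // Write side (WriteLineDocTask-style): wrap the raw output stream in a
        // compressor, then a Writer; one doc per line.
        try (Writer w = new OutputStreamWriter(
                new GZIPOutputStream(bytes), StandardCharsets.UTF_8)) {
            w.write(line + "\n");
        }

        // Read side (LineDocMaker-style): wrap the raw input stream in a
        // decompressor, then a BufferedReader, and read lines as usual.
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(bytes.toByteArray())),
                StandardCharsets.UTF_8))) {
            System.out.println(r.readLine());
        }
    }
}
```

        The bzip2 variant only changes which compressor streams are wrapped around the file streams; the reader/writer layers stay the same.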
        Uwe Schindler added a comment -

        Committed revision 777458.

        Mark Miller added a comment -

        Some Java 1.5 code got in with this patch.

        Mark Miller added a comment -

        Looks like this spread a little in the docmaker/contentsource breakup issue as well. This patch takes care of both (a few Integer.valueOfs).

        Michael McCandless added a comment -

        Thanks Mark!

        Mark Miller added a comment -

        Committed.

        Michael McCandless added a comment -

        Alas, horribly, I'm hitting this bug again, with the 2.10.0 Xerces JAR currently checked in.

        I downloaded the latest XML dump from Wikipedia (en), enwiki-20110115-pages-articles.xml, and after ~2.8M docs I hit this:

             [java] 592.4 sec --> main Wrote 2807000 line docs
             [java] 592.51 sec --> main Wrote 2808000 line docs
             [java] 592.59 sec --> main Wrote 2809000 line docs
             [java] 592.78 sec --> main Wrote 2810000 line docs
             [java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:197)
             [java] 	at java.lang.Thread.run(Thread.java:619)
             [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
             [java] 	at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
             [java] 	at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
             [java] 	at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:174)
             [java] 	... 1 more
             [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
             [java] 	at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
             [java] 	at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
             [java] 	at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
             [java] 	... 8 more
             [java] ####################
             [java] ###  D O N E !!! ###
             [java] ####################
        

        I went back to the old patched Xerces JAR, and it got past that point just fine...

        Hide
        Michael McCandless added a comment -

        Alas, horribly, I'm hitting this bug again, with the 2.10.0 Xerces JAR currently checked in. I downloaded the latest XML dump from Wikipedia (en), enwiki-20110115-pages-articles.xml, and after ~2.8M docs I hit this:

            [java] 592.4 sec --> main Wrote 2807000 line docs
            [java] 592.51 sec --> main Wrote 2808000 line docs
            [java] 592.59 sec --> main Wrote 2809000 line docs
            [java] 592.78 sec --> main Wrote 2810000 line docs
            [java] Exception in thread "Thread-0" java.lang.RuntimeException: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
            [java]         at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:197)
            [java]         at java.lang.Thread.run(Thread.java:619)
            [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
            [java]         at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
            [java]         at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl$FragmentContentDispatcher.dispatch(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanDocument(Unknown Source)
            [java]         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
            [java]         at org.apache.xerces.parsers.XML11Configuration.parse(Unknown Source)
            [java]         at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
            [java]         at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
            [java]         at org.apache.lucene.benchmark.byTask.feeds.EnwikiContentSource$Parser.run(EnwikiContentSource.java:174)
            [java]         ... 1 more
            [java] Caused by: org.apache.xerces.impl.io.MalformedByteSequenceException: Invalid byte 2 of 4-byte UTF-8 sequence.
            [java]         at org.apache.xerces.impl.io.UTF8Reader.invalidByte(Unknown Source)
            [java]         at org.apache.xerces.impl.io.UTF8Reader.read(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
            [java]         at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
            [java]         ... 8 more
            [java] ####################
            [java] ### D O N E !!! ###
            [java] ####################

        I went back to the old patched Xerces JAR, and it got past that point just fine...
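        The failure mode in the trace above can be reproduced in isolation: a 4-byte UTF-8 lead byte (0xF0) that is not followed by a 10xxxxxx continuation byte makes any conforming SAX parser abort with a fatal error. A minimal sketch, using the JDK's built-in JAXP/SAX parser rather than the standalone Xerces JAR from this issue (the class name InvalidUtf8Demo is invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class InvalidUtf8Demo {
    // Returns true if the SAX parser rejects the given bytes as XML.
    static boolean parseFails(byte[] xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new ByteArrayInputStream(xml), new DefaultHandler());
            return false;
        } catch (Exception e) {
            // Surfaces as a SAXParseException (or a wrapped
            // MalformedByteSequenceException, depending on the parser).
            return true;
        }
    }

    public static void main(String[] args) {
        byte[] xml = "<doc>A</doc>".getBytes(StandardCharsets.US_ASCII);
        // 0xF0 announces a 4-byte UTF-8 sequence; the following '<' (0x3C)
        // is not a continuation byte, so the decoder reports something like
        // "Invalid byte 2 of 4-byte UTF-8 sequence", as in the log above.
        xml[5] = (byte) 0xF0;
        System.out.println("parse failed: " + parseFails(xml)); // prints "parse failed: true"
    }
}
```

        This is why a single corrupt byte deep in a multi-gigabyte dump kills the whole conversion: the decoder error is fatal to the SAX scan, not recoverable per-document.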
        Hide
        Michael McCandless added a comment -

        I think we should just rollback to the old (patched) JAR for 3.1/4.0?

        Hide
        Michael McCandless added a comment - edited

        I also tested the latest Xerces release (2.11) and it hits the same exception as above.

        Feel free to go vote for XERCESJ-1257!

        I'll just revert to our patched JAR (based on Xerces 2.9.1).

        Hide
        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Hide
        Michael McCandless added a comment -

        Note that enwiki-20110115-pages-articles.xml.bz2 also hits XERCESJ-1257 ...

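        The fix this issue delivers — indexing the dump without expanding it on disk — amounts to wrapping the file stream in a (de)compressor before handing it to the content source. A minimal sketch of that stream-wrapping pattern, using the JDK's GZIPInput/OutputStream as stand-ins (the actual patch uses the analogous bzip2 streams from commons-compress, which are not in the JDK; the class and method names here are invented for illustration):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedLineDocs {
    // Write one-doc-per-line text straight into a compressed stream,
    // as WriteLineDocTask would with bzip.compression=true.
    static byte[] writeLines(String... lines) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(bos), StandardCharsets.UTF_8)) {
            for (String line : lines) {
                w.write(line);
                w.write('\n');
            }
        }
        return bos.toByteArray();
    }

    // Read the docs back without ever materializing an uncompressed file.
    static List<String> readLines(byte[] compressed) throws Exception {
        List<String> lines = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(compressed)), StandardCharsets.UTF_8))) {
            for (String line; (line = r.readLine()) != null; ) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = writeLines("title\tdate\tbody one", "title2\tdate2\tbody two");
        System.out.println(readLines(data).size()); // prints 2
    }
}
```

        Because the decompressor is just another InputStream, LineDocMaker and EnwikiDocMaker can stay oblivious to whether the underlying file is compressed; only the stream construction changes.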

          People

          • Assignee: Mark Miller
          • Reporter: Shai Erera
          • Votes: 0
          • Watchers: 0
