Lucene - Core
LUCENE-639

[PATCH] Slight performance improvement for readVInt() of IndexInput

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.0.0
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None

      Description

      By unrolling the loop in readVInt() I was able to get a slight performance improvement, about 1.8%, for this method. The test program invoked the method over 17 million times on each run.
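For reference, the method under discussion decodes Lucene's variable-length "VInt" format: each byte carries 7 payload bits in its low bits, and the high bit flags that another byte follows. Below is a minimal stand-alone sketch of the loop-based decoder, operating on a byte array instead of an IndexInput; the class name and the encoder helper are illustrative, not Lucene code.

```java
import java.util.Arrays;

public class VIntSketch {
    // Decode a VInt starting at data[pos]; mirrors the loop in
    // IndexInput.readVInt(): low 7 bits are payload, high bit means "more".
    static int readVInt(byte[] data, int pos) {
        byte b = data[pos++];
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = data[pos++];
            i |= (b & 0x7F) << shift;
        }
        return i;
    }

    // Encode a non-negative int as a VInt (1-5 bytes), for round-trip tests.
    static byte[] writeVInt(int value) {
        byte[] buf = new byte[5];
        int n = 0;
        while ((value & ~0x7F) != 0) {
            buf[n++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        buf[n++] = (byte) value;
        return Arrays.copyOf(buf, n);
    }

    public static void main(String[] args) {
        for (int v : new int[]{0, 1, 127, 128, 16383, 16384, Integer.MAX_VALUE}) {
            if (readVInt(writeVInt(v), 0) != v) throw new AssertionError("v=" + v);
        }
        System.out.println("ok");
    }
}
```

Unrolling replaces the for loop with up to five explicit reads, trading larger code size for the removal of the loop-carried shift variable.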

      I ran the performance tests on:

      • Windows XP Pro SP2
      • Sun JDK 1.5.0_07
      • YourKit 5.5.4
      • Lucene trunk
      1. readVInt performance results.pdf
        17 kB
        Johan Stuyts
      2. ReadVIntPerformanceMain.java
        12 kB
        Johan Stuyts
      3. Lucene2ReadVIntPerformance.patch
        1.0 kB
        Johan Stuyts

        Activity

        Grant Ingersoll added a comment -

        The testing and discussion seem inconclusive on this and it hasn't been followed up on in a good amount of time, so I am going to mark it as won't fix.

        Yonik Seeley added a comment -

        > I also think that you drew your conclusion too fast.

        My conclusion was that I didn't personally have more time to spend on this right now, given that initial results with a Lucene index were disappointing on a major platform with the latest JVMs. If people want to run this test on other platforms, or create even more realistic tests, that's fine with me... I'll still listen in.

        Johan Stuyts added a comment -

        Yonik, I agree that using artificial data sets instead of real ones might influence the results. Getting my hands on real, large data is going to be difficult.

        I also think that you drew your conclusion too fast. You only ran the tests on a few JVMs. As you can see from the test results that I submitted, the differences between JVMs are significant.

        There is no perfect algorithm which runs better on all JVMs, but I think that algorithms that run better on most JVMs exist.

        Yonik Seeley added a comment -

        Here's the simple TermDoc iteration test I used:
        http://svn.apache.org/viewvc/lucene/java/trunk/src/test/org/apache/lucene/index/TestTermdocPerf.java

        I think our attempts at optimization might be foiled by the larger method size, and by the fact that so many vints are going to be a single byte (the term freqs are interleaved with the doc increments). Hotspot might have an easier time with the small method, the branch predictor will have fewer branches to track, etc. Who knows...

        If anyone else feels like running on a different processor or with different params, please have at it. For me, the "more real lucene" test on the P4 was enough to dissuade me from pursuing this further (yet again).

        Yonik Seeley added a comment -

        I tested with a Lucene index (artificially constructed).
        It contained 10,000 documents with 10% of them containing a term. I then repeatedly iterated over the docs via a TermDocs instance (I think this is generally the bottleneck - iterating over the docs of common terms).

        Preliminary results:

        P4: Java6 -server -Xmx128M Lucene: 19231 VInt2: 19190 XOR: 20374
        P4: Java5 -server -Xmx128M Lucene: 21957 VInt2: 24141 XOR: 22808

        So with this limited single platform test, the original Lucene version is coming in first on balance!

        Yonik Seeley added a comment -

        OK, I had a bug in VInt2a... correcting it makes it slower than VInt2 on Java6 (at least on a P4). But VInt2a is still slightly faster on the Opteron w/ Java5.

        You might want to try this version also:

            class VInt2 extends VInt {
              public int readVInt() throws IOException {
                byte b = readByte();
                if (b >= 0) return b;
                b &= 0x7f;
                byte b2 = readByte();
                if (b2 >= 0) return (b2 << 7) | b;
                b2 &= 0x7f;
                byte b3 = readByte();
                if (b3 >= 0) return (b3 << 14) | (b2 << 7) | b;
                b3 &= 0x7f;
                byte b4 = readByte();
                if (b4 >= 0) return (b4 << 21) | (b3 << 14) | (b2 << 7) | b;
                b4 &= 0x7f;
                byte b5 = readByte();
                return (b5 << 28) | (b4 << 21) | (b3 << 14) | (b2 << 7) | b;
              }
            }

        For a random distribution of vints less than 200 only, times in milliseconds:
        P4, Java6 -server -Xbatch: VInt2=15833 VInt2a=17425 XOR=15583 Lucene=16900
        P4, Java5 -server -Xbatch: VInt2=16063 VInt2a=16060 XOR=18097 Lucene=16108
        Opteron, Java5 -server -Xbatch: VInt2=13441 VInt2a=13123 XOR=13110 Lucene=13626
        Opteron, Java5 -server: VInt2=26589 VInt2a=26050 XOR=26484 Lucene=28718 (double the iterations)

        In general, XOR looks good, except for Java5 on P4 for some reason, where it is 12.5% slower.
        But when -Xbatch isn't used, the results change on the Opteron oddly enough!

        Yonik Seeley added a comment -

        > Yonik, please let me know where you think the test that I provided is flawed.

        Tests are always flawed. If our tests don't exactly agree, it doesn't mean one is incorrect.
        I already had a little test framework from before so I used that... it doesn't mean there is anything wrong with your tests, but more validation is better.

        If there is one thing in your tests that I think could perhaps be improved, it's the distribution of VInts.
        Testing 1 byte vints separately from 2 byte vints, etc, will yield different results than testing a realistic mix because of processor branch prediction.

        I really want to test with a real Lucene index (I've already started coming up with a simple test). That will be a better real-world test IMO, and trump both of our current test frameworks.

        Johan Stuyts added a comment -

        Yonik, please let me know where you think the test that I provided is flawed. Comparing the results from your test to the results from my test is like comparing apples to oranges. So you have to either:

        • provide arguments why my test is flawed and provide a better test, or
        • run my tests on your machine(s) and publish those results.
        Yonik Seeley added a comment -

        Indeed, I had a bug... I'll do some retesting.

        Johan Stuyts added a comment -

        I don't know what your test method is, but a difference of 13-15% is very unlikely. The algorithms are too simple to have such a dramatic effect on performance.

        It is possible that the test that I provided is flawed. Just let me know what should be changed, and I will update it and rerun the tests.

        Yonik Seeley added a comment -

        The XOR algorithm is a clever idea, but based on some quick tests of my own, it seems slower than my "combine on exit" for what I would guess would be an important case of a mix of 1 and 2 byte vints.

        For random numbers less than 128, the algorithms tied (not surprising).
        For random numbers less than 200 (a mix of 1- and 2-byte vints):
        13% slower on AMD Opteron, Java 1.5.0_07 -server -Xbatch
        15% slower on P4, Java 6 beta2 -server -Xbatch

        Johan Stuyts added a comment -

        Results of running the test.

        Johan Stuyts added a comment -

        Source for performance test.

        Johan Stuyts added a comment -

        Yonik, I think it is possible to make a reasonable assumption about the distribution of the vints. Lucene stores deltas between document IDs instead of document IDs, and I guess (no data available) most frequencies will be below 128.

        In the example given below I take into account the bit for the frequency that is stored with the document ID delta.

        Suppose you have 50 million documents. For a term occurring in 1 in 50 documents, 1 million 1-byte vints are needed. For a term occurring in 1 in 2,000 documents, 25,000 2-byte vints are needed. For an occurrence of 1 in 100,000 documents, 500 3-byte vints are needed. Together with my guess about the frequencies, many more 1-byte and 2-byte vints than 3-byte and longer vints have to be read. So the algorithm for readVInt() should at least perform very well for 1-byte and 2-byte vints.
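The byte counts in this example follow directly from the encoding: 7 payload bits per byte mean deltas below 128 take one byte, below 16,384 two bytes, and below 2,097,152 three bytes. A small hypothetical helper (not part of the patch) to check the byte counts for the deltas above:

```java
public class VIntSizes {
    // Bytes needed to VInt-encode a non-negative int: 7 payload bits per byte.
    static int vIntLength(int value) {
        int n = 1;
        while ((value >>> (7 * n)) != 0) n++;
        return n;
    }

    public static void main(String[] args) {
        // Document-ID deltas from the example: terms hitting 1 in 50,
        // 1 in 2,000 and 1 in 100,000 documents.
        System.out.println(vIntLength(50));      // 1 byte
        System.out.println(vIntLength(2000));    // 2 bytes
        System.out.println(vIntLength(100000));  // 3 bytes
    }
}
```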

        To determine whether or not it is worth improving the algorithm, I wrote an isolated test: no I/O, as little GC as possible, etc. See the attached Java file.

        I tested four algorithms:

        • the current one;
        • the unrolled one of this issue;
        • a new one I came up with that uses exclusive or (see below);
        • the one from Yonik.

        The exclusive or algorithm started with the same assumption as Yonik's about the first byte: you can exit after only one read when no more bytes follow. Then I tried to think of a way to reduce the number of bitwise operations. This line of thinking was based on the code of cardinality() of java.util.BitSet, which uses special bit patterns to count the bits in a long as fast as possible. This is the algorithm I ended up with:

            int i = readByte();
            if (i < 0)
            {
              // Second byte present
              i ^= readByte() << 7;
              if (i >= 0)
              {
                // Third byte present
                i ^= readByte() << 14;
                if (i < 0)
                {
                  // Fourth byte present
                  i ^= readByte() << 21;
                  if (i >= 0)
                  {
                    // Fifth byte present
                    i ^= readByte() << 28;
                    return i ^ 0x0FE03F80;
                  }
                  // No fifth byte
                  return i ^ 0xFFE03F80;
                }
                // No fourth byte
                return i ^ 0x00003F80;
              }
              // No third byte
              return i ^ 0xFFFFFF80;
            }
            return i;
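One way to gain confidence in the XOR fix-up constants above is to round-trip values through the decoder and check the results. A hypothetical harness (not the attached test program; readByte() is simulated over a byte array, and the encoder is a plain reference implementation):

```java
import java.util.Arrays;

public class XorVIntCheck {
    static byte[] data;
    static int pos;

    static byte readByte() { return data[pos++]; }

    // The exclusive-or decoder quoted above, unchanged apart from formatting.
    static int readVIntXor() {
        int i = readByte();
        if (i < 0) {
            i ^= readByte() << 7;              // second byte present
            if (i >= 0) {
                i ^= readByte() << 14;         // third byte present
                if (i < 0) {
                    i ^= readByte() << 21;     // fourth byte present
                    if (i >= 0) {
                        i ^= readByte() << 28; // fifth byte present
                        return i ^ 0x0FE03F80;
                    }
                    return i ^ 0xFFE03F80;     // no fifth byte
                }
                return i ^ 0x00003F80;         // no fourth byte
            }
            return i ^ 0xFFFFFF80;             // no third byte
        }
        return i;
    }

    // Reference encoder: 7 payload bits per byte, high bit flags continuation.
    static byte[] encode(int v) {
        byte[] buf = new byte[5];
        int n = 0;
        while ((v & ~0x7F) != 0) {
            buf[n++] = (byte) ((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        buf[n++] = (byte) v;
        return Arrays.copyOf(buf, n);
    }

    static int roundTrip(int v) {
        data = encode(v);
        pos = 0;
        return readVIntXor();
    }

    public static void main(String[] args) {
        for (int v : new int[]{0, 127, 128, 300, 16384, 1 << 20, 1 << 27, Integer.MAX_VALUE}) {
            if (roundTrip(v) != v) throw new AssertionError("v=" + v);
        }
        System.out.println("ok");
    }
}
```

The fix-up constants are the accumulated XOR of the sign-extension and continuation-bit garbage from each signed byte, applied once on exit instead of masking every byte.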
        

        In the tests all values returned by the algorithms are verified to ensure an optimization does not cause invalid values to be returned.

        The statistics of the test are:

        • 6 sets of data with 38.4 million integers per set: random size, 1-byte, 2-byte, 3-byte, 4-byte and 5-byte;
        • 1 warming up and 3 individually measured loops per data set;
        • per loop each algorithm reads the data set 5 times consecutively.

        In total each algorithm reads 38.4 million integers * 6 data sets * 4 loops * 5 reads = over 4 billion integers per test run. Over 18 billion integers are read in the total test.

        The tests are run in a single JVM for all algorithms, data sets, loops, etc.

        I ran the tests on 7 JVMs. Some JVMs have a client and a server variant, resulting in 12 JVM configurations:

        • BEA JDK 1.4.2 client;
        • BEA JDK 1.4.2 server;
        • BEA JDK 1.5.0 client;
        • BEA JDK 1.5.0 server;
        • IBM JRE 1.4.2;
        • IBM JDK 1.5.0;
        • Sun JDK 1.4.2 client;
        • Sun JDK 1.4.2 server;
        • Sun JDK 1.5.0 client;
        • Sun JDK 1.5.0 server;
        • Sun JDK 1.6.0b2 client;
        • Sun JDK 1.6.0b2 server.

        I disconnected all networks on the laptop I ran the tests on. As few services as possible were running, and the firewall was also shut down. The only applications running were Eclipse and the test. The screen saver and energy saving were disabled.

        Here is the configuration used to run the tests:

        • Intel Pentium M 1.73 GHz;
        • 1 GB RAM;
        • Windows XP Pro SP 2;
        • Lucene trunk;
        • Eclipse 3.1.2 Java compiler.

        Please note that during the running of the tests some disk I/O was observed. Even though I believe that no swapping would be needed, swapping might have occurred.

        After running all the tests I:

        • removed the slowest result of each algorithm for each JVM and dataset combination;
        • averaged the two remaining results;
        • divided all results by the time of the current algorithm;
        • selected the 'best' algorithms for each JVM.

        The best algorithm is defined as: ((1-byte time - best 1-byte time) + (2-byte time - best 2-byte time)) < 0.02. This means that the total difference from the best 1-byte and 2-byte times cannot be greater than 2% for an algorithm to be considered among the best.

        Here are the results, best algorithm first. See the attached PDF for more information:

        • exclusive or;
        • combine on exit (Yonik's one);
        • unrolled;
        • current.

        Across all JVMs the exclusive or algorithm saves 3.5-4.8%.

        As Yonik already mentioned at the URL above, a lot depends on the JVM used to run the tests. You should test with the JVM on which you are going to deploy. A colleague of mine also mentioned that the compiler used could influence the results. Unfortunately I do not have the time to test all combinations of compilers and JVMs.

        Yonik Seeley added a comment -

        The problem is that one needs to know the distribution of vints for calls to readVInt() in a real Lucene index/application to judge if it's actually faster or not. I took a shot at a number of versions about 6 months ago:

        http://www.nabble.com/Lucene-performance-bottlenecks-tf659097.html#a1833842

        Nicolas Lalevée added a comment -

        Oh yeah, you're right. In fact I was reading the Lucene index format, which doesn't specify any limit on the integer length.
        Effectively, since a Java int is returned, it is useless to parse more bytes than that!

        Johan Stuyts added a comment -

        The number of times the loop is executed is indeed not known at compile time. Therefore the loop condition is also replicated multiple times: if (b < 0).

        The comment of 'readVInt()' says that between one and five bytes will be read. The unrolling I did also reads at most five bytes. Five bytes provide 35 useful bits for the value, which is more than the number of bits of the return type: int. (The comment also states that negative numbers are not supported, reducing the number of usable bits of the return type to 31.)

        Nicolas Lalevée added a comment -

        The loop you unrolled has no iteration count known at compile time. A VInt is not a limited-length type. Your patch only works if you don't use VInts larger than 268,435,456.

        Johan Stuyts added a comment -

        Loop unrolling in 'readVInt()' of 'IndexInput' for slight performance gain.


          People

          • Assignee:
            Unassigned
            Reporter:
            Johan Stuyts
          • Votes:
            0
            Watchers:
            1
