Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: 4.9, 5.0
    • Component/s: core/codecs
    • Labels: None
    • Lucene Fields: New

      Description

      We could test the implementation of Adaptive Frame Of Reference [1] on the lucene-4.0 branch.
      I am providing the source code of its implementation. Some work needs to be done, as this implementation currently targets the old lucene-1458 branch.
      I will attach a tarball containing a running version (with tests) of the AFOR implementation, as well as the implementations of PFOR and Simple64 (a Simple-family codec working on 64-bit words) that were used in the experiments in [1].

      [1] http://www.deri.ie/fileadmin/documents/deri-tr-afor.pdf

      1. LUCENE-2886.patch
        41 kB
        Robert Muir
      2. LUCENE-2886.patch
        5 kB
        Michael McCandless
      3. LUCENE-2886.patch
        4 kB
        Michael McCandless
      4. LUCENE-2886_simple64_varint.patch
        81 kB
        Michael McCandless
      5. LUCENE-2886_simple64.patch
        50 kB
        Robert Muir
      6. lucene-afor.tar.gz
        2.57 MB
        Renaud Delbru

        Activity

        Renaud Delbru added a comment -

        Tarball containing a Maven project with source code and unit tests for:

        • AFOR1
        • AFOR2
        • FOR
        • PFOR Non Compulsive
        • Simple64
        • a basic tool for debugging IntBlock codecs.

        It also includes the lucene-1458 snapshot dependencies needed to compile the code and run the tests.

        Robert Muir added a comment -

        I pulled out the simple64 implementation here and adapted it to the bulkpostings branch.

        Thanks for uploading this code, Renaud; it's great and easy to work with. I hope to get some more of the codecs you wrote into the branch for testing.

        I changed a few things that helped in benchmarking:

        • the decoder uses relative gets instead of absolute
        • we write #longs in the block header instead of #bytes (as it's always long-aligned, so the numbers are smaller)

        But mainly, for this one I think we should change it to be a VariableIntBlock codec... right now it packs 128 integers into as few longs as possible, but this is wasteful for two reasons: it has to write a per-block byte header, and also wastes bits (e.g. in the case of a block of 128 1's).

        With VariableIntBlock, we could do this differently, e.g. read a fixed number of longs per block (say 4), and our block would then hold anywhere between 4 and 240 integers depending on the data.
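
        A minimal sketch of that idea (not the committed code; Simple64.decode here is a hypothetical helper that unpacks one 64-bit codeword into the destination array starting at the given offset and returns how many integers it held):

        import java.io.IOException;
        import org.apache.lucene.store.IndexInput;

        class VariableBlockSketch {
          static final int LONGS_PER_BLOCK = 4;          // fixed I/O unit per block

          int readBlock(IndexInput in, int[] buffer) throws IOException {
            int count = 0;
            for (int i = 0; i < LONGS_PER_BLOCK; i++) {
              long word = in.readLong();                 // no per-block byte-count header needed
              count += Simple64.decode(word, buffer, count);  // hypothetical decode helper
            }
            return count;                                // anywhere from 4 to 240 ints, depending on the data
          }
        }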

        Michael McCandless added a comment -

        New patch, including Robert's patch, and also adding Simple64 as a VarInt codec. We badly need more testing of the VarInt cases, so it's great Simple64 came along (thanks Renaud!).

        All tests pass w/ Simple64VarInt codec.

        Michael McCandless added a comment -

        OK I committed the two new Simple64 codecs (to bulk branch).

        Robert Muir added a comment -

        We are still not using 2 of the Simple64 selectors, but the Simple64 paper discusses
        using these otherwise wasted selectors to represent runs of 120 and 240 "all ones"... I think we should do this?
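
        A hedged sketch of what that could look like in the decoder, assuming a Simple-8b-style layout (4-bit selector in the low bits, payload above it); the layout and the decodePacked helper are assumptions for illustration, not the format on the branch:

        // decode one 64-bit codeword into dest, returning the number of integers produced
        int decode(long word, int[] dest, int offset) {
          int selector = (int) (word & 0xFL);
          if (selector == 0) {                           // spare selector #1: a run of 240 ones
            java.util.Arrays.fill(dest, offset, offset + 240, 1);
            return 240;
          }
          if (selector == 1) {                           // spare selector #2: a run of 120 ones
            java.util.Arrays.fill(dest, offset, offset + 120, 1);
            return 120;
          }
          // remaining selectors unpack n values of b bits each from the payload
          return decodePacked(selector, word >>> 4, dest, offset);  // hypothetical helper
        }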

        Renaud Delbru added a comment - edited

        Hi Michael, Robert,
        great to hear that the code is useful; looking forward to seeing some benchmarks.
        I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much. In our experiments, we had a version of AFOR allowing frames of size 8, 16 and 32 integers with allOnes and allZeros. The gain was very minimal, on the order of 0.x% index size reduction, because these cases occurred very rarely. But this is still better than nothing. However, in the case of Simple64 we are not talking about small frames (up to 32 integers), but frames of 120 to 240 integers. Therefore, I expect the probability of encountering 120 or 240 consecutive ones to drop. Maybe we could use them for more clever configurations such as

        • inter-leaved sequences of 1-bit and 2-bit integers
        • inter-leaved sequences of 2-bit and 3-bit integers

        or something like this.
        The best approach would be to run some tests to see which new configurations make sense, e.g. how often an allOnes config is selected, and then choose which ones to add. But this can be a tedious task with only a limited benefit.
        Robert Muir added a comment -

        Hi Michael, Robert,

        great to hear that the code is useful, looking forward to see some benchmark.
        I think the VarIntBlock approach is a good idea. Concerning the two unused "frame" codes, it will not cost too much to add them. This might be useful for the frequency inverted lists. However, I am not sure they will be used that much.

        In the case of 240 1's, I was surprised to see this selector used over 2% of the time
        for the gov collection's doc file.

        But still, for the all 1's case I'm not actually thinking about unstructured text so much...
        in this case I am thinking about metadata fields and more structured data?

        Renaud Delbru added a comment -

        In the case of 240 1's, i was surprised to see this selector was used over 2% of the time
        for the gov collection's doc file?

        Our results were obtained on the Wikipedia and blogs datasets. I don't know what our selection rate was; I was just referring to the gain in overall compression rate.

        But still, for the all 1's case I'm not actually thinking about unstructured text so much...
        in this case I am thinking about metadata fields and more structured data?

        Yes, this makes sense. In the context of SIREn (a kind of simple XML-node-based inverted index), which is meant for indexing semi-structured data, the difference was more observable (mainly on the frequency and position files, as well as other structural node files).
        This might also be useful on the document id file for very common terms (maybe for certain types of facets, with very few values covering a large portion of the document collection).

        Renaud Delbru added a comment - edited

        Just an additional comment on semi-structured data indexing. AFOR-2 and AFOR-3 (AFOR-3 refers to AFOR-2 with a special code for allOnes frames) were able to beat Rice on two datasets, and S-64 on one (but were very close to Rice on the others):

        DBpedia dataset (structured version of Wikipedia):

        Method Ent Frq Att Val Pos Total
        AFOR-1 0.246 0.043 0.141 0.065 0.180 0.816
        AFOR-2 0.229 0.039 0.132 0.059 0.167 0.758
        AFOR-3 0.229 0.031 0.131 0.054 0.159 0.736
        FOR 0.315 0.061 0.170 0.117 0.216 1.049
        PFOR 0.317 0.044 0.155 0.070 0.205 0.946
        Rice 0.240 0.029 0.115 0.057 0.152 0.708
        S-64 0.249 0.041 0.133 0.062 0.171 0.791
        VByte 0.264 0.162 0.222 0.222 0.245 1.335

        Geonames Dataset:

        Method Ent Frq Att Val Pos Total
        AFOR-1 0.129 0.023 0.058 0.025 0.025 0.318
        AFOR-2 0.123 0.023 0.057 0.024 0.024 0.307
        AFOR-3 0.114 0.006 0.056 0.016 0.008 0.256
        FOR 0.150 0.021 0.065 0.025 0.023 0.349
        PFOR 0.154 0.019 0.057 0.022 0.023 0.332
        Rice 0.133 0.019 0.063 0.029 0.021 0.327
        S-64 0.147 0.021 0.058 0.023 0.023 0.329
        VByte 0.216 0.142 0.143 0.143 0.143 0.929

        Sindice dataset: a very heterogeneous dataset containing hundreds of thousands of web datasets

        Method Ent Frq Att Val Pos Total
        AFOR-1 2.578 0.395 0.942 0.665 1.014 6.537
        AFOR-2 2.361 0.380 0.908 0.619 0.906 6.082
        AFOR-3 2.297 0.176 0.876 0.530 0.722 5.475
        FOR 3.506 0.506 1.121 0.916 1.440 8.611
        PFOR 3.221 0.374 1.153 0.795 1.227 7.924
        Rice 2.721 0.314 0.958 0.714 0.941 6.605
        S-64 2.581 0.370 0.917 0.621 0.908 6.313
        VByte 3.287 2.106 2.411 2.430 2.488 15.132

        Here, Ent refers to entity id (similar to doc id), Att and Val are structural node ids.

        Robert Muir added a comment -

        Thanks for those numbers Renaud... yes, the cases you see in e.g. Geonames
        are one example of what I was thinking: in general you might say someone
        should be using "omitTFAP" to omit freqs and positions for these fields,
        but they might not be able to do this if they want to support e.g. phrase
        queries like "washington hill". So if we can pack long streams of 1s with
        freqs and positions, I think this is probably useful for a lot of people.

        Additionally, for the .doc file I see it's smaller in the AFOR-3 case too. Is
        your "Ent" basically a measure of doc deltas? I'm confused exactly
        what it is, because I would think if you take e.g. Geonames, the place
        names in the dataset are not in random order but actually "batched" by
        country for example, so you would have long streams of docdelta=1 for
        country=Germany's postings.

        I'm not saying we could rely upon this, but I do think in general lots
        of people's docs aren't in completely random order, and it's probably
        common to see long streams of docdelta=1 in structured data for this reason?

        Renaud Delbru added a comment -

        So if we can pack long streams of 1s with
        freqs and positions I think this is probably useful for a lot of people.

        Yes, if the overhead is minimal, it might not be an issue in certain cases.

        Additionally for the .doc, i see its smaller in the AFOR-3 case too. Is
        your "Ent" basically a measure of doc deltas? I'm confused exactly
        what it is

        Yes, Ent is just a delta representation of the entity id (which can be considered the document id). I simply changed the name of the concept, as SIREn principally manipulates entities rather than documents. In my case, an entity is just a set of attribute-value pairs, similar to a document in Lucene.

        Because I would think if you take e.g. Geonames, the place
        names in the dataset are not in random order but actually "batched" by
        country for example, so you would have long streams of docdelta=1 for
        country=Germany's postings.

        I checked, and the Geonames dataset was alphabetically sorted by URL names:
        http://sws.geonames.org/1/
        http://sws.geonames.org/10/
        ...
        as were the DBpedia and Sindice datasets.

        So, yes, this might have (good) consequences on the docdelta list for certain datasets such as Geonames, and especially when indexing semi-structured data, as the schema of the data in one dataset is generally identical across entities/documents; it is therefore likely to see long runs of 1 for certain terms or schema terms.

        Michael McCandless added a comment -

        I tested the perf of BulkVInt vs Simple64VarInt (Linux, 10M Wikipedia index, NIOFSDir, multi-segment, java -server -Xbatch -Xmx2g -Xms2g):

        Query QPS bulkvint QPS simple64varint Pct diff
        +united +states 19.46 16.47 -15.4%
        "united states" 11.92 10.09 -15.3%
        unit~0.5 17.35 16.96 -2.2%
        "united states"~3 5.70 5.72 0.3%
        united~0.6 8.24 8.31 0.8%
        doctitle:.[Uu]nited. 5.36 5.48 2.1%
        united~0.75 12.45 12.75 2.4%
        unit~0.7 27.37 28.08 2.6%
        spanFirst(unit, 5) 156.48 162.78 4.0%
        united states 15.35 16.07 4.7%
        spanNear([unit, state], 10, true) 42.84 46.11 7.6%
        states 48.13 52.03 8.1%
        unit* 32.80 37.07 13.0%
        doctimesecnum:[10000 TO 60000] 12.38 14.16 14.4%
        uni* 18.66 21.53 15.3%
        u*d 9.66 11.17 15.7%
        un*d 19.00 22.04 16.0%
        +nebraska +states 101.66 141.83 39.5%
        Michael McCandless added a comment -

        Same as last perf test, except MMapDir instead:

        Query QPS bulkvint QPS simple64varint Pct diff
        "united states" 12.81 9.95 -22.3%
        "united states"~3 6.04 5.35 -11.4%
        +united +states 19.18 17.25 -10.1%
        united~0.6 10.35 10.25 -0.9%
        united states 16.33 16.29 -0.2%
        doctimesecnum:[10000 TO 60000] 14.18 14.24 0.4%
        unit* 36.69 37.20 1.4%
        united~0.75 14.71 14.92 1.4%
        unit~0.5 22.15 22.55 1.8%
        uni* 21.27 21.73 2.2%
        spanFirst(unit, 5) 166.18 171.96 3.5%
        doctitle:.[Uu]nited. 6.24 6.46 3.6%
        unit~0.7 34.18 35.49 3.8%
        spanNear([unit, state], 10, true) 47.16 49.53 5.0%
        states 50.71 53.52 5.5%
        un*d 21.51 22.96 6.7%
        u*d 11.18 12.22 9.3%
        +nebraska +states 127.16 161.43 26.9%
        Renaud Delbru added a comment -

        Hi Michael,
        the first results are not that impressive.

        • Could you tell me what BulkVInt is? Is it the simple VInt codec implemented on top of the bulk branch?
        • What is the difference between '+united +states' and '+nebraska +states'? Is nebraska a low-frequency term?
        Robert Muir added a comment -

        Hi Renaud:

        The BulkVInt codec is VInt implemented as a FixedIntBlock codec.
        It reads a single numBytes VInt header, then a byte[], and decodes BLOCKSIZE vints out of it.
        The reason for this is that it performs quite differently from "StandardCodec",
        since StandardCodec has to readByte() readByte() readByte() ...

        You can see the code here: http://svn.apache.org/repos/asf/lucene/dev/branches/bulkpostings/lucene/src/java/org/apache/lucene/index/codecs/bulkvint/BulkVIntCodec.java

        One reason for this is to separate performance improvements in the actual compression
        algorithms from the way they do their underlying I/O... previously various codecs
        looked much faster than VInt, but a lot of that was due to the way VInt
        was implemented...
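
        As a rough illustration of that block decode (illustrative names only; the real code is the BulkVIntCodec linked above; assumes org.apache.lucene.store.IndexInput and java.io.IOException):

        // one VInt length header, one bulk read, then BLOCKSIZE vints decoded from the local byte[]
        void readBlock(IndexInput in, int[] buffer, int blockSize) throws IOException {
          int numBytes = in.readVInt();            // single per-block header
          byte[] bytes = new byte[numBytes];
          in.readBytes(bytes, 0, numBytes);        // one bulk read instead of many readByte() calls
          int pos = 0;
          for (int i = 0; i < blockSize; i++) {
            byte b = bytes[pos++];
            int value = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
              b = bytes[pos++];
              value |= (b & 0x7F) << shift;
            }
            buffer[i] = value;
          }
        }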

        And yes, you are correct, nebraska is a lower-frequency term. +united +states is a more
        "normal" conjunction query, but +nebraska +states is intended to do a lot
        of advance()'ing...

        Robert Muir added a comment -

        Here are my performance results, same setup (-server, etc.)/index as Mike, except
        I'm using 64-bit Windows (1.6.0_23/Vista), and I do not have an SSD, just a normal hard disk.
        Using SimpleFS:

        Query QPS bulkVint QPS simple64Varint Pct diff
        +united +states 10.95 10.32 -5.8%
        "united states"~3 3.78 3.67 -2.7%
        "united states" 6.81 6.71 -1.5%
        doctimesecnum:[10000 TO 60000] 8.76 8.81 0.5%
        spanNear([unit, state], 10, true) 22.42 22.92 2.3%
        uni* 13.75 14.22 3.4%
        unit* 24.89 25.86 3.9%
        united~0.75 7.06 7.36 4.4%
        unit~0.7 14.11 14.73 4.4%
        united~0.6 4.34 4.53 4.4%
        unit~0.5 8.16 8.55 4.7%
        states 29.99 31.85 6.2%
        spanFirst(unit, 5) 88.27 93.78 6.2%
        united states 9.97 10.60 6.4%
        doctitle:.[Uu]nited. 2.34 2.49 6.6%
        u*d 5.54 6.05 9.3%
        un*d 12.36 13.59 9.9%
        +nebraska +states 47.58 62.53 31.4%

        Using MMap:

        Query QPS bulkVint QPS simple64Varint Pct diff
        +united +states 13.88 11.95 -13.9%
        spanFirst(unit, 5) 123.50 106.84 -13.5%
        "united states"~3 4.64 4.17 -10.1%
        "united states" 8.90 8.19 -8.0%
        doctimesecnum:[10000 TO 60000] 10.72 9.98 -7.0%
        spanNear([unit, state], 10, true) 33.70 31.63 -6.1%
        uni* 17.26 16.30 -5.5%
        states 37.14 35.40 -4.7%
        unit* 29.96 28.57 -4.7%
        united states 11.52 11.22 -2.6%
        united~0.75 10.87 10.75 -1.1%
        united~0.6 7.69 7.68 -0.1%
        doctitle:.[Uu]nited. 3.83 3.91 2.2%
        un*d 16.83 17.26 2.6%
        unit~0.5 16.52 17.05 3.2%
        u*d 8.51 9.08 6.7%
        unit~0.7 26.83 28.93 7.8%
        +nebraska +states 83.69 113.19 35.2%
        Robert Muir added a comment -

        Based on these results, I think a good experiment would be for
        us to "group" Simple64, so that we aren't reading blocks of just a single long value.

        With NIOFS/SimpleFS, the reads are buffered so I think some of the overhead
        is taken care of, but it doesn't tend to work so well with MMap.

        So, if we 'group' the long values so that we are e.g. reading say N long values
        at once in a single internal 'block', I think we might get more efficiency
        via the I/O system, and also less overhead from the bulkpostings APIs.

        I would hope this would speed up the other queries, and likely slow down
        the +nebraska +states type thing, but I think that's ok.

        Renaud Delbru added a comment -

        The BulkVInt codec is VInt implemented as a FixedIntBlock codec.

        Yes, I saw the code; it is similar to the VInt implementation we used in our experiments.

        previously various codecs
        looked much faster than Vint but a lot of the reason for this is due to the way Vint
        was implemented...

        This is odd, because we observed the contrary (on the lucene-1458 branch): the standard codec was an order of magnitude faster than any other codec. We discovered that this was due to the IntBlock interface implementation, which:

        • copied the buffer byte array twice (once from disk to the buffer, then again from the buffer to the IntBlock codec);
        • had to perform more work to check each of the buffers (the IntBlock buffer and the IndexInput buffer).

        But this might have been improved since then. Michael told me he worked on a new version of the IntBlock interface which was more performant.

        So, if we 'group' the long values so we are e.g. reading say N long values
        at once in a single internal 'block', I think we might get more efficiency
        via the I/O system, and also less overhead from the bulkpostings apis.

        If I understand correctly, this is similar to increasing the boundaries of the variable block size. Indeed, performing a block read for each Simple64 long word (Simple64 frame) incurs non-negligible overhead, and it might be better to read more than one per block read.

        Paul Elschot added a comment -

        On the facility for allOnes in AFOR-3: one of the reasons why this appears to be of rather little use is that current analyzers do not index stop words. They do this for two reasons, index size and query time, both based on VByte. With an allOnes facility the first reason disappears almost completely, and query times can also be guarded in other ways, for example by checking the document frequency and then trying to fall back on digrams.
        So the message is: please keep this in.

        Michael McCandless added a comment -

        Attached patch, adding block multiplier to Simple64VarInt.

        I haven't tested perf impact yet...

        Michael McCandless added a comment -

        New patch, cuts over to bulk-reading the byte[] and then pulling a LongBuffer from that.
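
        For reference, a hedged sketch of that pattern (illustrative only; the actual patch may differ in details such as block sizing):

        import java.io.IOException;
        import java.nio.ByteBuffer;
        import java.nio.LongBuffer;
        import org.apache.lucene.store.IndexInput;

        class BulkLongRead {
          long[] readLongs(IndexInput in, int numLongs) throws IOException {
            byte[] bytes = new byte[numLongs * 8];
            in.readBytes(bytes, 0, bytes.length);                      // one bulk read from the file
            LongBuffer view = ByteBuffer.wrap(bytes).asLongBuffer();   // view the bytes as big-endian longs, matching readLong()
            long[] words = new long[numLongs];
            view.get(words);                                           // relative bulk get into a plain array
            return words;
          }
        }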

        Robert Muir added a comment -

        I spent some more time on Simple64, took Mike's previous patch and added some minor improvements:

        1. Switched the decoding logic to the "Simple-8-4b" referred to in the paper. This is the same encoding, but we process with ints instead of longs.
        2. Because our buffers are so tiny (for example 32 bytes), the overhead of NIO hurts rather than helps, so I switched to native arrays.

        The performance is looking much more reasonable. Here are my tests on Windows; maybe I can convince Mike to sanity-check it on Linux.

        64-bit SimpleFS

        Query QPS BulkVInt QPS Simple64VarInt4 Pct diff
        "united states"~3 3.79 3.67 -3.3%
        doctitle:.[Uu]nited. 2.45 2.46 0.3%
        spanNear([unit, state], 10, true) 22.13 22.64 2.3%
        uni* 14.04 14.43 2.7%
        united~0.75 6.83 7.04 3.2%
        unit* 25.39 26.21 3.2%
        doctimesecnum:[10000 TO 60000] 8.83 9.16 3.6%
        united~0.6 4.29 4.47 4.2%
        united states 9.35 9.74 4.2%
        un*d 12.88 13.50 4.8%
        "united states" 6.86 7.21 5.1%
        unit~0.7 14.11 14.85 5.3%
        unit~0.5 8.17 8.60 5.3%
        u*d 5.70 6.05 6.1%
        states 30.02 31.90 6.3%
        spanFirst(unit, 5) 86.56 94.15 8.8%
        +united +states 11.10 12.55 13.1%
        +nebraska +states 46.72 57.90 23.9%

        32-bit SimpleFS

        Query QPS BulkVInt QPS Simple64VarInt4 Pct diff
        spanFirst(unit, 5) 95.67 91.02 -4.9%
        "united states" 5.47 5.25 -4.1%
        "united states"~3 3.37 3.32 -1.6%
        unit* 20.45 20.33 -0.6%
        uni* 11.10 11.06 -0.3%
        doctimesecnum:[10000 TO 60000] 7.15 7.16 0.0%
        doctitle:.[Uu]nited. 2.26 2.27 0.4%
        unit~0.5 7.73 7.77 0.5%
        un*d 10.80 10.87 0.6%
        united~0.75 6.77 6.97 2.8%
        unit~0.7 12.97 13.41 3.4%
        united~0.6 4.10 4.26 3.7%
        u*d 4.91 5.10 4.0%
        spanNear([unit, state], 10, true) 20.50 21.72 5.9%
        states 30.00 33.15 10.5%
        +united +states 9.71 10.78 11.1%
        united states 9.65 10.96 13.6%
        +nebraska +states 43.93 54.38 23.8%

        64-bit MMap

        Query QPS BulkVInt QPS Simple64VarInt4 Pct diff
        "united states" 8.99 8.41 -6.4%
        states 38.21 36.16 -5.4%
        spanFirst(unit, 5) 118.11 112.19 -5.0%
        doctimesecnum:[10000 TO 60000] 10.78 10.35 -4.0%
        spanNear([unit, state], 10, true) 33.78 32.51 -3.7%
        "united states"~3 4.68 4.54 -3.0%
        unit* 30.00 29.26 -2.4%
        uni* 17.48 17.06 -2.4%
        united states 11.60 11.35 -2.1%
        +united +states 13.95 14.08 1.0%
        united~0.75 10.76 10.87 1.1%
        united~0.6 7.75 7.88 1.7%
        un*d 17.16 17.66 2.9%
        doctitle:.[Uu]nited. 3.85 3.98 3.3%
        unit~0.7 27.00 28.08 4.0%
        unit~0.5 16.64 17.46 4.9%
        u*d 8.68 9.31 7.2%
        +nebraska +states 83.30 96.53 15.9%
        Michael McCandless added a comment -

        Linux results, NIOFSDir:

        Query QPS bulkvint QPS simple64x4 Pct diff
        "united states" 11.95 10.50 -12.1%
        +united +states 19.45 18.94 -2.6%
        unit~0.5 18.20 17.85 -1.9%
        doctitle:.[Uu]nited. 5.38 5.40 0.4%
        spanNear([unit, state], 10, true) 43.59 44.31 1.7%
        united~0.6 8.48 8.72 2.8%
        "united states"~3 5.75 5.96 3.6%
        united~0.75 12.66 13.14 3.8%
        unit~0.7 27.58 28.84 4.6%
        spanFirst(unit, 5) 157.05 165.43 5.3%
        united states 15.41 16.26 5.6%
        states 48.10 51.95 8.0%
        unit* 32.72 36.24 10.8%
        uni* 18.66 20.97 12.3%
        u*d 9.63 10.86 12.8%
        +nebraska +states 103.95 117.66 13.2%
        un*d 18.91 21.45 13.4%
        doctimesecnum:[10000 TO 60000] 12.33 13.99 13.5%

        And MMapDir:

        Query QPS bulkvint QPS simple64x4 Pct diff
        "united states" 12.86 11.01 -14.4%
        +nebraska +states 129.23 126.21 -2.3%
        "united states"~3 6.05 6.06 0.1%
        united~0.6 10.55 10.72 1.6%
        unit~0.7 38.24 39.00 2.0%
        +united +states 19.29 19.71 2.2%
        united~0.75 15.08 15.52 2.9%
        united states 16.47 16.95 2.9%
        doctitle:.[Uu]nited. 6.23 6.46 3.8%
        spanNear([unit, state], 10, true) 47.26 49.44 4.6%
        unit* 36.28 38.01 4.7%
        doctimesecnum:[10000 TO 60000] 14.02 14.90 6.3%
        uni* 21.08 22.46 6.6%
        spanFirst(unit, 5) 169.47 180.70 6.6%
        unit~0.5 23.03 24.79 7.6%
        un*d 21.40 23.64 10.5%
        states 50.60 56.17 11.0%
        u*d 11.16 12.59 12.8%
        Robert Muir added a comment -

        Thanks Mike... looks improved over your previous results.

        Should we commit this to the bulk branch? I think we should also remove the "fixed-int" version 1.0 of Simple64 we have there...

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Uwe Schindler added a comment -

        Move issue to Lucene 4.9.


          People

          • Assignee: Unassigned
          • Reporter: Renaud Delbru
          • Votes: 0
          • Watchers: 1
