## Description of project goals

Background:

* How the skipper works

For basic postings-list intersection, when the list lengths are m and n, the time cost is T = m + n: we walk through the two lists simultaneously. However, if we record a jump point every k elements on both lists, the two lists can "jump" forward faster, and in the best case the intersection cost is reduced to T = (m + n) / k. A state-of-the-art skip list consists of multiple levels of skip points, where higher levels jump further and thus save more work. An abstract skipper should therefore provide one key method, skipTo(target), which returns the number of skipped elements that are not larger than 'target'. When intersecting two or more lists, the shortest one is iterated, while the other lists call skipTo(heading) to jump to the appropriate positions.

* Skip structure of the current implementation

Let 'skipInterval' be the minimum jump length, and 'skipMultiplier' the jump-length multiplier, so that for skip data on level i the jump length is skipInterval * (skipMultiplier^i). As illustrated in oal.codecs.MultiLevelSkipListReader.java, when trying to skip to element A, the skipper first climbs to the highest level whose current element is not larger than A. It then walks along that level until it would exceed A, falls down one level, and repeats the walk until level 0 is reached. This procedure guarantees a logarithmic search for element A, which is efficient enough in terms of time complexity. However, there are penalties when using the skipper in real cases:

> Data in a skip datum are too mixed. As illustrated in [2], the VInt encoding makes a skip datum impossible to decode partially. Even for a simple AndQuery, we have to decode the position and payload information (e.g. PosFPSkip, PayLength), although the document information alone is sufficient.

> Skip lists are not pruned. The lower-level skip data store redundant information.
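The climb/walk/fall-down search described above can be sketched as a toy in-memory skipper. Everything here (ToySkipList, the array-based levels) is a hypothetical illustration, not Lucene's actual API; the real MultiLevelSkipListReader decodes the same level structure from the index file instead of building it from arrays.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Toy in-memory sketch of the multi-level skip search described above.
 * Assumes skipMultiplier >= 2 and postings sorted ascending.
 */
class ToySkipList {
    private final int[] postings;  // sorted doc ids
    // levels.get(i) = postings indices that carry a level-i skip entry
    private final List<int[]> levels = new ArrayList<>();

    ToySkipList(int[] postings, int skipInterval, int skipMultiplier) {
        this.postings = postings;
        // level i has one skip entry every skipInterval * skipMultiplier^i postings
        for (int step = skipInterval; step <= postings.length; step *= skipMultiplier) {
            int[] lv = new int[postings.length / step];
            for (int j = 0; j < lv.length; j++) {
                lv[j] = (j + 1) * step - 1;
            }
            levels.add(lv);
        }
    }

    /**
     * Returns the number of skipped elements that are not larger than target:
     * start on the highest level, walk while the next skip entry does not
     * exceed target, then fall down a level and repeat until level 0.
     */
    int skipTo(int target) {
        int lastSkip = -1;  // postings index of the last skip point <= target
        for (int level = levels.size() - 1; level >= 0; level--) {
            int[] lv = levels.get(level);
            int i = 0;
            // resume this level from where the higher levels already got us
            while (i < lv.length && lv[i] <= lastSkip) i++;
            // walk along this level until the next entry would exceed target
            while (i < lv.length && postings[lv[i]] <= target) {
                lastSkip = lv[i++];
            }
            // then fall down to the next (finer) level
        }
        return lastSkip + 1;  // elements before this index need not be decoded
    }
}
```

For instance, with skipInterval = skipMultiplier = 2 over doc ids 1..8, skipTo(6) returns 6: the first six postings can be passed over without decoding them.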
For example, suppose skipInterval = skipMultiplier = 2 and postings = [d1, d2, d3, d4, d5, d6, d7, d8]. Then level 0 skip data point to [d2, d4, d6, d8], level 1 to [d4, d8], and level 2 to [d8]. The data structure in the current implementation is shown below:

```
d1 d2 d3 d4 d5 d6 d7 d8
    ^     ^     ^     ^    level 0
          ^           ^    level 1
                      ^    level 2
```

However, since there is always a level 1 skip pointer at position 4, a level 0 skip datum at that position is unnecessary, and the same pruning applies at position 8. The optimized skip data become:

```
d1 d2 d3 d4 d5 d6 d7 d8
    ^           ^          level 0
          ^                level 1
                      ^    level 2
```

The details are proposed in [1], which is discussed below.

> The skip procedure is not cache-friendly. Skip data are stored at the end of each postings list, so every time we skip to an element we must jump to the end of the list to find the skip point and then jump back to decode document ids/freqs. This access pattern works against spatial locality.

* "Compressed Perfect Embedded Skip List" [1]

With the help of [1], we will try to overcome both the pruning problem and the cache problem by using an optimized skip list in which each skip datum is interleaved inside the postings list. The paper also proposes methods to further compress the skip list, e.g. Golomb encoding of skip pointers and delta encoding of bit skips.

* How frequently we really need to skip, and a related experiment

Below is an estimation of skip frequency.
Queries are taken from luceneutil (wikimedium.10M.nostopwords.tasks), and the index is built on WIKI_MEDIUM_ALL with Lucene41Codec (skipInterval = 128, skipMultiplier = 8):

| Task        | #queries | #skips    | #avg_skip_len |
|-------------|---------:|----------:|--------------:|
| AndHighHigh | 500      | 360438700 | 6.16          |
| AndHighMed  | 500      | 182918560 | 25.55         |
| AndHighLow  | 500      | 27702500  | 775.20        |
| HighPhrase  | 42       | 5071380   | 20.17         |
| MedPhrase   | 500      | 90279680  | 32.03         |
| LowPhrase   | 500      | 138730160 | 35.61         |

(We might add MinShouldMatch queries later.)

Project Goals:

* Implementation of the Compressed Perfect Embedded Skip List

A new version of "MultiLevelSkipReader/Writer" will be implemented, and we will change the related codecs to test how performance varies. This includes:

> compression methods for skip pointers (we will mainly use VInt to compare the performance of the current and new versions, but Rice encoding should also be investigated during phase 2);

> refactoring the skip datum, i.e. separating position and payload information from the current skip datum, so that we need not load/decode them for non-proximity queries;

> experiments on skipper settings. Based on the experiment above, we should try other values of skipMultiplier so that queries can skip more efficiently.

* Simplifying the current implementation (a different direction)

It is quite interesting that we do not really skip much on the current dataset: on average, these queries seem to need no level 1 skip pointers at all. So the question is: how much do we actually need higher-level skip data? It would also be interesting to try the simplest kind of skip list: encode each kind of skip data as FOR blocks, decode the docID/docFP blocks only when skipTo() is called, and check how much performance is gained from the FOR encoding.

* Introducing a real dataset to test performance

This summer I have access to some large TREC datasets (e.g. GOV2); I hope this will help the tests match real use cases.

[1] Paolo Boldi et al. Compressed Perfect Embedded Skip Lists for Quick Inverted-Index Lookups. SPIRE 2005.
[2] Skip format: http://lucene.apache.org/core/4_2_1/core/org/apache/lucene/codecs/lucene41/Lucene41PostingsFormat.html#Frequencies