Details

    • Type: Wish
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When queries with many terms use positions, each term in each document causes a seek in the positions, and in large indexes these seeks can be far apart even when the terms occur in the same document.
      The number of (disk) cache misses caused by such position seeks might be reduced by storing the positions for all terms of a document directly behind each other. This should have a noticeable effect when terms are alphabetically close, for example for truncations, and it should also help when a document has few enough positions to fit in a single cache entry (disk page, cache line).
      This might also help the performance of highlighting based on indexed positions.
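
      A rough way to picture the expected effect is to count how many cache pages a multi-term lookup touches for one document under each ordering. The sketch below is a toy model only: the page size, the terms, the uniform two positions per entry, and the helper names are all assumptions made for illustration, and only the page holding the start of each entry is counted.

```python
PAGE_SIZE = 8  # position entries per cache page (hypothetical)
TERMS = ["apple", "apply", "zebra"]  # made-up terms

# Every term occurs in documents 0..9 with two positions each (made-up data).
RECORDS = [(term, doc, [1, 2]) for term in TERMS for doc in range(10)]


def layout(records, key):
    """Lay records out back to back in the given order; return start offsets."""
    offsets = {}
    offset = 0
    for term, doc, positions in sorted(records, key=key):
        offsets[(term, doc)] = offset
        offset += len(positions)
    return offsets


def pages_touched(offsets, terms, doc):
    """Count distinct pages holding the start of each term's entry for doc."""
    return len({offsets[(t, doc)] // PAGE_SIZE for t in terms if (t, doc) in offsets})


# Current order: by term, then by document.
term_major = layout(RECORDS, key=lambda r: (r[0], r[1]))
# Proposed order: by document, then by term.
doc_major = layout(RECORDS, key=lambda r: (r[1], r[0]))

print(pages_touched(term_major, TERMS, 4), pages_touched(doc_major, TERMS, 4))
```

      In this toy data, a three-term lookup on document 4 touches three pages in the term-first order and one page in the document-first order. Real indexes compress positions and have very uneven entry sizes, so this says nothing about actual speedups.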

        Activity

        Paul Elschot added a comment -

        This was more or less suggested in:

        "Compressing Term Positions in Web Indexes", Hao Yan, Shuai Ding, Torsten Suel, SIGIR '09.
        http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.152.4748&rep=rep1&type=pdf

        in sections 7 and 8, and especially the last sentence: "... one could even consider storing the parsed documents themselves in highly compressed form and accessing these during a position data lookup, instead of keeping the positions in inverted lists."

        Paul Elschot added a comment -

        Currently positions are stored in this order (see the index file formats for positions):

        TermPositions are ordered by term (the term is implicit, from the .tis file).
        Positions entries are ordered by increasing document number (the document number is implicit from the .frq file).

        The idea is to change this order such that positions are ordered first by document and then by term.
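
        The reordering amounts to a change of sort key. The snippet below is only an illustration with made-up postings; the record layout and the function names are hypothetical and do not correspond to Lucene's actual file format or API.

```python
# Each record is (term, document number, positions). All values are made up.
records = [
    ("apple", 1, [3, 17]),
    ("apple", 2, [5]),
    ("apply", 1, [4]),
    ("apply", 3, [9, 12]),
    ("zebra", 1, [8]),
]


def term_major(recs):
    """Current order: first by term, then by increasing document number."""
    return sorted(recs, key=lambda r: (r[0], r[1]))


def doc_major(recs):
    """Proposed order: first by document number, then by term."""
    return sorted(recs, key=lambda r: (r[1], r[0]))


for term, doc, positions in doc_major(records):
    print(doc, term, positions)
```

        With the document-first key, the three entries for document 1 become adjacent, whereas in the term-first order they are separated by the entries of other documents.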

        Paul Elschot added a comment -

        Perhaps this also speeds up segment merging, as there is no more need to uninvert the positions.


          People

          • Assignee:
            Unassigned
            Reporter:
            Paul Elschot
          • Votes:
            0
            Watchers:
            0
