Solr
  1. Solr
  2. SOLR-475

multi-valued faceting via un-inverted field

    Details

    • Type: New Feature New Feature
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.4
    • Component/s: None
    • Labels:
      None

      Description

      Facet multi-valued fields via a counting method (like the FieldCache method) on an un-inverted representation of the field. For each doc, look at it's terms and increment a count for that term.

      1. facet_performance.html
        36 kB
        Yonik Seeley
      2. SOLR-475.patch
        16 kB
        Yonik Seeley
      3. SOLR-475.patch
        45 kB
        Yonik Seeley
      4. UnInvertedField.java
        18 kB
        Yonik Seeley
      5. UnInvertedField.java
        18 kB
        Yonik Seeley

        Activity

        Hide
        Yonik Seeley added a comment -

        Prototype attached.
        This is completely untested code, and is still missing the solr interface + caching.
        The approach is described in the comments (cut-n-pasted here).
        Any thoughts or comments on the approach?

        I may not have time to immediately work on this (fix the bugs, add tests, hook up to solr, add caching of un-inverted field, etc), so additional contributions in this direction are welcome!

        /**
         * Final form of the un-inverted field:
         *   Each document points to a list of term numbers that are contained in that document.
         *
         *   Term numbers are in sorted order, and are encoded as variable-length deltas from the
         *   previous term number.  Real term numbers start at 2 since 0 and 1 are reserved.  A
         *   term number of 0 signals the end of the termNumber list.
         *
         *   There is a singe int[maxDoc()] which either contains a pointer into a byte[] for
         *   the termNumber lists, or directly contains the termNumber list if it fits in the 4
         *   bytes of an integer.  If the first byte in the integer is 1, the next 3 bytes
         *   are a pointer into a byte[] where the termNumber list starts.
         *
         *   There are actually 256 byte arrays, to compensate for the fact that the pointers
         *   into the byte arrays are only 3 bytes long.  The correct byte array for a document
         *   is a function of it's id.
         *
         *   To save space and speed up faceting, any term that matches enough documents will
         *   not be un-inverted... it will be skipped while building the un-inverted field structore,
         *   and will use a set intersection method during faceting.
         *
         *   To further save memory, the terms (the actual string values) are not all stored in
         *   memory, but a TermIndex is used to convert term numbers to term values only
         *   for the terms needed after faceting has completed.  Only every 128th term value
         *   is stored, along with it's corresponding term number, and this is used as an
         *   index to find the closest term and iterate until the desired number is hit (very
         *   much like Lucene's own internal term index).
         */
        
        Show
        Yonik Seeley added a comment - Prototype attached. This is completely untested code, and is still missing the solr interface + caching. The approach is described in the comments (cut-n-pasted here). Any thoughts or comments on the approach? I may not have time to immediately work on this (fix the bugs, add tests, hook up to solr, add caching of un-inverted field, etc), so additional contributions in this direction are welcome! /** * Final form of the un-inverted field: * Each document points to a list of term numbers that are contained in that document. * * Term numbers are in sorted order, and are encoded as variable-length deltas from the * previous term number. Real term numbers start at 2 since 0 and 1 are reserved. A * term number of 0 signals the end of the termNumber list. * * There is a singe int [maxDoc()] which either contains a pointer into a byte [] for * the termNumber lists, or directly contains the termNumber list if it fits in the 4 * bytes of an integer. If the first byte in the integer is 1, the next 3 bytes * are a pointer into a byte [] where the termNumber list starts. * * There are actually 256 byte arrays, to compensate for the fact that the pointers * into the byte arrays are only 3 bytes long . The correct byte array for a document * is a function of it's id. * * To save space and speed up faceting, any term that matches enough documents will * not be un-inverted... it will be skipped while building the un-inverted field structore, * and will use a set intersection method during faceting. * * To further save memory, the terms (the actual string values) are not all stored in * memory, but a TermIndex is used to convert term numbers to term values only * for the terms needed after faceting has completed. Only every 128th term value * is stored, along with it's corresponding term number, and this is used as an * index to find the closest term and iterate until the desired number is hit (very * much like Lucene's own internal term index). */
        Hide
        Yonik Seeley added a comment -

        fix single line oops

        Show
        Yonik Seeley added a comment - fix single line oops
        Hide
        David Bowen added a comment -

        We are seeing faceting on multi-valued fields as a significant performance problem, so we'd very much like to see something of this sort.

        Show
        David Bowen added a comment - We are seeing faceting on multi-valued fields as a significant performance problem, so we'd very much like to see something of this sort.
        Hide
        Koji Sekiguchi added a comment -

        Yonik,
        This looks great! I'd like to contribute (unit test, etc.) to move forward.
        Before I write unit tests, I have a couple of questions:

        1. To initialize TermIndex.index, getEnumerator().close() have to be called, right? (but I cannot find close() call)
        2. Can UnInvertedField have searcher instead of reader and remove searcher argument from getCounts()?
        Show
        Koji Sekiguchi added a comment - Yonik, This looks great! I'd like to contribute (unit test, etc.) to move forward. Before I write unit tests, I have a couple of questions: To initialize TermIndex.index, getEnumerator().close() have to be called, right? (but I cannot find close() call) Can UnInvertedField have searcher instead of reader and remove searcher argument from getCounts()?
        Hide
        Yonik Seeley added a comment -

        I've finished this implementation and am cleaning it up for contribution.
        In the meantime, I'm attaching the results of some performance tests.

        Show
        Yonik Seeley added a comment - I've finished this implementation and am cleaning it up for contribution. In the meantime, I'm attaching the results of some performance tests.
        Hide
        Yonik Seeley added a comment -

        Some further results on a bigger index to show some practical limits.
        This table (JIRA markup format) shows the performance and memory characteristics of facet requests on a 50M document index, for different fields and different numbers of documents being counted in the base query.

          f10_100_t f100_10_t f1000_5_t f10000_5_t f100000_5_t f100000_10_t
        field inversion time (sec) 17.2 17.9 69.4 87.8 133.6 388.0
        inverted field size (MB) 68.1 629.6 416.9 479.0 589.9 807.4
        1000 docs facet time (ms) 7 20 13 13 16 17
        10,000 docs 55 428 22 23 29 28
        100,000 docs 54 421 35 36 46 56
        1,000,000 docs 55 431 149 155 249 307
        10,000,000 docs 54 434 625 625 1183 1620

        The "profile" of the faceted field is encoded in it's name. For example, the field f1000_5_t has 1000 unique values across the whole index and between 0 and 5 values per document. It took 35 ms to facet on this field when the base query matched 100,000 documents.

        Test Hardware: Commodity PC
        Processor: AMD Athlon 64 X2 5000+ (2.6GHz dual core)
        Hard Drive: Western Digital Caviar GP WD5000AACS 500GB 5400 to 7200 RPM SATA 3.0Gb/s
        Memory: 8GB DDR2 800 SDRAM (PC2 6400)
        Operating System: Linux - Ubuntu 8.04 desktop, 64 bit version (x86_64)
        Java VM: Sun Java6 (1.6.0_05) 64 bit hotspot (x86_64)

        Show
        Yonik Seeley added a comment - Some further results on a bigger index to show some practical limits. This table (JIRA markup format) shows the performance and memory characteristics of facet requests on a 50M document index, for different fields and different numbers of documents being counted in the base query.   f10_100_t f100_10_t f1000_5_t f10000_5_t f100000_5_t f100000_10_t field inversion time (sec) 17.2 17.9 69.4 87.8 133.6 388.0 inverted field size (MB) 68.1 629.6 416.9 479.0 589.9 807.4 1000 docs facet time (ms) 7 20 13 13 16 17 10,000 docs 55 428 22 23 29 28 100,000 docs 54 421 35 36 46 56 1,000,000 docs 55 431 149 155 249 307 10,000,000 docs 54 434 625 625 1183 1620 The "profile" of the faceted field is encoded in it's name. For example, the field f1000_5_t has 1000 unique values across the whole index and between 0 and 5 values per document. It took 35 ms to facet on this field when the base query matched 100,000 documents. Test Hardware: Commodity PC Processor: AMD Athlon 64 X2 5000+ (2.6GHz dual core) Hard Drive: Western Digital Caviar GP WD5000AACS 500GB 5400 to 7200 RPM SATA 3.0Gb/s Memory: 8GB DDR2 800 SDRAM (PC2 6400) Operating System: Linux - Ubuntu 8.04 desktop, 64 bit version (x86_64) Java VM: Sun Java6 (1.6.0_05) 64 bit hotspot (x86_64)
        Hide
        Koji Sekiguchi added a comment -

        How amazing! facet_performance.html shows great improvement on both memory and qps. Great job, Yonik!

        Show
        Koji Sekiguchi added a comment - How amazing! facet_performance.html shows great improvement on both memory and qps. Great job, Yonik!
        Hide
        Yonik Seeley added a comment -

        Attaching patch with tests.
        This is well tested, so I'll probably commit relatively soon.

        Show
        Yonik Seeley added a comment - Attaching patch with tests. This is well tested, so I'll probably commit relatively soon.
        Hide
        Yonik Seeley added a comment -

        Committed.

        Show
        Yonik Seeley added a comment - Committed.
        Hide
        Yonik Seeley added a comment -

        update:

        • closing of term enumerators (not strictly necessary in Lucene now, so shouldn't have caused any problems).
        • new fieldValueCache in SolrIndexSearcher
        • uninverted field statistics (per field) available with cache statistics
        Show
        Yonik Seeley added a comment - update: closing of term enumerators (not strictly necessary in Lucene now, so shouldn't have caused any problems). new fieldValueCache in SolrIndexSearcher uninverted field statistics (per field) available with cache statistics

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Yonik Seeley
          • Votes:
            4 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development