Issue Details (XML | Word | Printable)

Key: SOLR-475
Type: New Feature New Feature
Status: Closed Closed
Resolution: Fixed
Priority: Major Major
Assignee: Yonik Seeley
Reporter: Yonik Seeley
Votes: 4
Watchers: 4
Operations

If you were logged in you would be able to see more operations.
Solr

multi-valued faceting via un-inverted field

Created: 08/Feb/08 05:35 PM   Updated: 13/Aug/09 02:19 PM
Return to search
Component/s: None
Affects Version/s: None
Fix Version/s: 1.4

Time Tracking:
Not Specified

File Attachments:
  Size
HTML File Licensed for inclusion in ASF works facet_performance.html 2008-11-23 10:11 PM Yonik Seeley 36 kB
Text File Licensed for inclusion in ASF works SOLR-475.patch 2008-12-07 08:13 PM Yonik Seeley 16 kB
Text File Licensed for inclusion in ASF works SOLR-475.patch 2008-11-24 05:09 AM Yonik Seeley 45 kB
Java Source File Licensed for inclusion in ASF works UnInvertedField.java 2008-02-08 08:32 PM Yonik Seeley 18 kB
Java Source File Licensed for inclusion in ASF works UnInvertedField.java 2008-02-08 08:01 PM Yonik Seeley 18 kB

Resolution Date: 25/Nov/08 04:02 AM


 Description  « Hide
Facet multi-valued fields via a counting method (like the FieldCache method) on an un-inverted representation of the field. For each doc, look at it's terms and increment a count for that term.

 All   Comments   Work Log   Change History   Subversion Commits      Sort Order: Ascending order - Click to sort in descending order
Yonik Seeley added a comment - 08/Feb/08 08:01 PM
Prototype attached.
This is completely untested code, and is still missing the solr interface + caching.
The approach is described in the comments (cut-n-pasted here).
Any thoughts or comments on the approach?

I may not have time to immediately work on this (fix the bugs, add tests, hook up to solr, add caching of un-inverted field, etc), so additional contributions in this direction are welcome!

/**
 * Final form of the un-inverted field:
 *   Each document points to a list of term numbers that are contained in that document.
 *
 *   Term numbers are in sorted order, and are encoded as variable-length deltas from the
 *   previous term number.  Real term numbers start at 2 since 0 and 1 are reserved.  A
 *   term number of 0 signals the end of the termNumber list.
 *
 *   There is a singe int[maxDoc()] which either contains a pointer into a byte[] for
 *   the termNumber lists, or directly contains the termNumber list if it fits in the 4
 *   bytes of an integer.  If the first byte in the integer is 1, the next 3 bytes
 *   are a pointer into a byte[] where the termNumber list starts.
 *
 *   There are actually 256 byte arrays, to compensate for the fact that the pointers
 *   into the byte arrays are only 3 bytes long.  The correct byte array for a document
 *   is a function of it's id.
 *
 *   To save space and speed up faceting, any term that matches enough documents will
 *   not be un-inverted... it will be skipped while building the un-inverted field structore,
 *   and will use a set intersection method during faceting.
 *
 *   To further save memory, the terms (the actual string values) are not all stored in
 *   memory, but a TermIndex is used to convert term numbers to term values only
 *   for the terms needed after faceting has completed.  Only every 128th term value
 *   is stored, along with it's corresponding term number, and this is used as an
 *   index to find the closest term and iterate until the desired number is hit (very
 *   much like Lucene's own internal term index).
 */

Yonik Seeley added a comment - 08/Feb/08 08:32 PM
fix single line oops

David Bowen added a comment - 11/Nov/08 11:05 PM
We are seeing faceting on multi-valued fields as a significant performance problem, so we'd very much like to see something of this sort.

Koji Sekiguchi added a comment - 13/Nov/08 04:38 PM
Yonik,
This looks great! I'd like to contribute (unit test, etc.) to move forward.
Before I write unit tests, I have a couple of questions:
  1. To initialize TermIndex.index, getEnumerator().close() have to be called, right? (but I cannot find close() call)
  2. Can UnInvertedField have searcher instead of reader and remove searcher argument from getCounts()?

Yonik Seeley added a comment - 23/Nov/08 10:11 PM
I've finished this implementation and am cleaning it up for contribution.
In the meantime, I'm attaching the results of some performance tests.

Yonik Seeley added a comment - 23/Nov/08 11:02 PM
Some further results on a bigger index to show some practical limits.
This table (JIRA markup format) shows the performance and memory characteristics of facet requests on a 50M document index, for different fields and different numbers of documents being counted in the base query.
  f10_100_t f100_10_t f1000_5_t f10000_5_t f100000_5_t f100000_10_t
field inversion time (sec) 17.2 17.9 69.4 87.8 133.6 388.0
inverted field size (MB) 68.1 629.6 416.9 479.0 589.9 807.4
1000 docs facet time (ms) 7 20 13 13 16 17
10,000 docs 55 428 22 23 29 28
100,000 docs 54 421 35 36 46 56
1,000,000 docs 55 431 149 155 249 307
10,000,000 docs 54 434 625 625 1183 1620

The "profile" of the faceted field is encoded in it's name. For example, the field f1000_5_t has 1000 unique values across the whole index and between 0 and 5 values per document. It took 35 ms to facet on this field when the base query matched 100,000 documents.

Test Hardware: Commodity PC
Processor: AMD Athlon 64 X2 5000+ (2.6GHz dual core)
Hard Drive: Western Digital Caviar GP WD5000AACS 500GB 5400 to 7200 RPM SATA 3.0Gb/s
Memory: 8GB DDR2 800 SDRAM (PC2 6400)
Operating System: Linux - Ubuntu 8.04 desktop, 64 bit version (x86_64)
Java VM: Sun Java6 (1.6.0_05) 64 bit hotspot (x86_64)


Koji Sekiguchi added a comment - 24/Nov/08 02:29 AM
How amazing! facet_performance.html shows great improvement on both memory and qps. Great job, Yonik!

Yonik Seeley added a comment - 24/Nov/08 05:09 AM
Attaching patch with tests.
This is well tested, so I'll probably commit relatively soon.

Yonik Seeley added a comment - 25/Nov/08 04:02 AM
Committed.

Yonik Seeley added a comment - 07/Dec/08 08:13 PM
update:
  • closing of term enumerators (not strictly necessary in Lucene now, so shouldn't have caused any problems).
  • new fieldValueCache in SolrIndexSearcher
  • uninverted field statistics (per field) available with cache statistics