Lucene - Core
LUCENE-1231

Column-stride fields (aka per-document Payloads)

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      This new feature has been proposed and discussed here:
      http://markmail.org/search/?q=per-document+payloads#query:per-document%20payloads+page:1+mid:jq4g5myhlvidw3oc+state:results

      Currently it is possible in Lucene to store data as stored fields or as payloads.
      Stored fields provide good performance if you want to load all fields for one
      document, because this is a sequential I/O operation.

      However, if you want to load the data from one field for a large number of
      documents, stored fields perform quite badly, because many I/O seeks may
      have to be performed.

      A better way to do this is using payloads. By creating a "special" posting list
      that has one posting with a payload for each document, you can "simulate" a
      column-stride field. The performance is significantly better than with stored
      fields, but still not optimal: for each document the freq value, which in this
      particular case is always 1, has to be decoded, and one position value, which
      is always 0, has to be decoded as well.

      As a solution we want to add real column-stride fields to Lucene. A possible
      format for the new data structure could look like this (CSD stands for column-
      stride data, once we decide for a final name for this feature we can change
      this):

      CSDList --> FixedLengthList | <VariableLengthList, SkipList>
      FixedLengthList --> <Payload>^SegSize
      VariableLengthList --> <DocDelta, PayloadLength?, Payload>
      Payload --> Byte^PayloadLength
      PayloadLength --> VInt
      SkipList --> see frq.file

      We distinguish here between the fixed-length and the variable-length cases. To
      allow flexibility, Lucene could automatically pick the "right" data structure.
      This could work as follows: when the DocumentsWriter writes a segment, it checks
      whether all values of a field have the same length. If so, it stores them as a
      FixedLengthList; if not, as a VariableLengthList. When the SegmentMerger
      merges two or more segments, it checks whether all segments have a
      FixedLengthList with the same length for a column-stride field. If not, it
      writes a VariableLengthList to the new segment.
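      The selection logic above can be sketched as follows. This is an illustrative
      sketch only - CsdFormatChooser and its names are hypothetical, not part of any
      proposed Lucene API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the fixed-vs-variable decision: store values as a
// FixedLengthList only when every payload in the segment has the same length;
// otherwise fall back to a VariableLengthList.
final class CsdFormatChooser {
    enum Format { FIXED_LENGTH, VARIABLE_LENGTH }

    static Format choose(List<byte[]> payloads) {
        if (payloads.isEmpty()) {
            return Format.FIXED_LENGTH; // degenerate case: either encoding works
        }
        int length = payloads.get(0).length;
        for (byte[] p : payloads) {
            if (p.length != length) {
                return Format.VARIABLE_LENGTH; // lengths differ: per-value lengths needed
            }
        }
        return Format.FIXED_LENGTH; // all values share one length
    }

    public static void main(String[] args) {
        System.out.println(choose(Arrays.asList(new byte[4], new byte[4]))); // FIXED_LENGTH
        System.out.println(choose(Arrays.asList(new byte[4], new byte[7]))); // VARIABLE_LENGTH
    }
}
```

      At merge time the SegmentMerger would run the same check across segments and
      demote to a VariableLengthList when the fixed lengths disagree.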

      Once this feature is implemented, we should think about making the column-
      stride fields updateable, similar to the norms. This will be a very powerful
      feature that can for example be used for low-latency tagging of documents.

      Other use cases:

      • replace norms
      • allow storing boost values separately from norms
      • serve as input for the FieldCache, thus providing significantly improved
        loading performance (see LUCENE-831)

      Things that need to be done here:

      • Decide on a name for this feature - I think "column-stride fields" was
        liked better than "per-document payloads".
      • Design an API for this feature. We should keep in mind that these
        fields are supposed to be updateable.
      • Define the data structures.

      I would like to get this feature into 2.4. Feedback about the open questions
      is very welcome so that we can finalize the design soon and start
      implementing.

        Issue Links

          Activity

          Doug Cutting added a comment -

          How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

          Michael McCandless added a comment -

           > How would this compare to making the storing of position and freq optional for a field? Then one could have an indexed field with a payload or boost but with no freq (or positions, since freq is required for positions). Would that be equivalent?

          I think this would be very similar, except maybe:

          • This proposal would allow for optional non-sparse, fixed-length
            storage (ie, don't include the docID since all docs have a
            payload, and the payload is always the same length). EG norms are
            like this.
          • [From the thread linked above] would allow for binary storage of
            field values. EG for int fields you would store the 4 bytes per
            value, and populating the cache would be much faster than the
            FieldCache now (which must re-parse Strings -> ints, and, must
            walk the terms to "reconstruct" the forward index).
          • This proposal may allow for updating these values, like we can do
            with norms today. Maybe this can only work if the field is
            non-sparse and perhaps if you've loaded it into the FieldCache?

           This would tie into LUCENE-831, so that you could load these
           fields entirely into RAM, incrementally update them after a reopen, etc.

          Doug Cutting added a comment -

          So there are a number of features these fields would have that differ from other fields:

          • no freq
          • no positions
          • non-sparse representation
          • binary values (is this different from payloads?)
          • updateable

           My question is whether it is best to bundle these together as a new kind of field, to add these as optional features of ordinary fields, or some combination. There are certain bundles that may work well together: e.g., a dense array of fixed-size, updateable binary values w/o freqs or positions. And not all combinations may be sensible or easy to implement. But most of these would also be useful à la carte too, e.g., no-freqs, no-positions and (perhaps) updateable.

          BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.

          Michael McCandless added a comment -

           Sorry, you're right: the payload is the binary data.

           > So there are a number of features these fields would have that differ from other fields:

           Maybe add "stored in its own file" or some such to that list. Ie to
           efficiently update field X I would think you want it stored in its own
           file. We would then fully write a new generation of that file whenever
           it had changes.

           I agree it would be great to implement this as "flexible indexing",
           such that these are simply à-la-carte options for how the field is
           indexed, rather than making a new specialized kind of field that just
           does one of these "combinations". But I haven't wrapped my brain
           around what all this will entail... it's a biggie!

           > BTW, setTermPositions(TermPositions) and setTermDocs(TermDocs) might be a reasonable API for updating sparse fields.

          I like that!

          Michael Busch added a comment -

          I started working on this, but it shouldn't block the 2.4 release.

          Michael Busch added a comment -

          While working on this I got kind of stuck on the ingestion side. I
          think we need to make FieldInfos and Fields more flexible. If I
          introduce a column-stride field, then I also need to add something to
          FieldInfos. So I could add another bit indicating a column-stride
          field. But the other bits don't make much sense in combination with
          this.

          Eventually we need more flexibility to utilize the flexible indexing
          chain anyway. We need to store which codec to use for a field. Then we could
          also just make a new codec for column-stride fields and maybe then we
          do not have to introduce a new Field API.

          So I think we should try for 3.0 to overhaul the
          document/field/fieldinfos APIs. I have some ideas which I started
          hacking during a long flight. I'll try to summarize the ideas/goals
          I'd have for such a new API and send it to java-dev.

          Michael Busch added a comment -

          For the search side we need an API similar to TermDocs and Payloads,
          let's call it ColumnStrideFieldAccessor (CSFA) for now. It should have
          next(), skipTo(), doc(), etc. methods.
           However, the way TermPositions#getPayload() currently works
          is that it always forces you to copy the bytes from the underlying
          IndexInput into the payload byte[] array. Since we usually use a
          BufferedIndexInput, this is then an arraycopy from
          BufferedIndexInput's buffer array into the byte array.

          I think to improve this we could allow users to call methods like
          readVInt() directly on the CSFA. So I was thinking about adding
          DataInput and DataOutput as superclasses of IndexInput and
          IndexOutput. DataIn(Out)put would implement the different read and
          write methods, whereas IndexIn(Out)put would only implement methods
          like close(), seek(), getFilePointer(), length(), flush(), etc.
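           As a rough sketch of that split (simplified and hypothetical - the real
           classes would carry many more methods), the variable-byte decoding can be
           built purely on readByte() at the DataInput level, while IndexInput adds
           only the file-level operations:

```java
import java.io.IOException;

// Sketch of the proposed DataInput/IndexInput split (simplified, hypothetical).
abstract class DataInput {
    public abstract byte readByte() throws IOException;

    // VInt decoding needs nothing but readByte(), so it belongs at this level.
    public int readVInt() throws IOException {
        byte b = readByte();
        int i = b & 0x7F;
        for (int shift = 7; (b & 0x80) != 0; shift += 7) {
            b = readByte();
            i |= (b & 0x7F) << shift;
        }
        return i;
    }
}

// IndexInput keeps only the file-oriented operations.
abstract class IndexInput extends DataInput {
    public abstract void close() throws IOException;
    public abstract long getFilePointer();
    public abstract void seek(long pos) throws IOException;
    public abstract long length();
}

public class VIntDemo {
    public static void main(String[] args) throws IOException {
        final byte[] data = {(byte) 0x96, 0x01}; // VInt encoding of 150
        DataInput in = new DataInput() {
            private int pos = 0;
            public byte readByte() { return data[pos++]; }
        };
        System.out.println(in.readVInt()); // 150
    }
}
```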

          So then CSFA would extend DataInput or alternatively have a
          getDataInput() method. The danger here compared to the current
          payloads API would be that the user might read too few or too many
          bytes of a CSF, which would result in an undefined and possibly hard
          to debug behavior. But we could offer e.g.:

             static ColumnStrideFieldsAccessor getAccessor(ColumnStrideFieldsAccessor in, Mode mode) {
               if (mode == Mode.Fast) {
                 return in;
               } else if (mode == Mode.Safe) {
                 return new SafeAccessor(in);
               }
               throw new IllegalArgumentException("unknown mode: " + mode);
             }


          The SafeAccessor would count for you the number of read bytes and
          throw exceptions if you don't consume the number of bytes you should
          consume. This is of course overhead, but users could use the
          SafeAccessor until they're confident that everything works fine in
          their system, and then switch to the fast accessor for better
          performance.
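           A minimal sketch of that SafeAccessor idea (class and method names are
           illustrative, not a proposed API): track the declared length of the
           current value, count the bytes actually read, and fail fast on a
           mismatch:

```java
// Hypothetical sketch of the SafeAccessor byte-counting idea.
class SafeAccessor {
    private final byte[] current; // bytes of the current field value
    private int pos;              // bytes consumed so far

    SafeAccessor(byte[] current) {
        this.current = current;
    }

    byte readByte() {
        if (pos >= current.length) {
            throw new IllegalStateException("read past end of value");
        }
        return current[pos++];
    }

    // Called when the consumer advances to the next document.
    void checkFullyConsumed() {
        if (pos != current.length) {
            throw new IllegalStateException(
                "consumed " + pos + " of " + current.length + " bytes");
        }
    }

    public static void main(String[] args) {
        SafeAccessor a = new SafeAccessor(new byte[] {1, 2});
        a.readByte();
        a.readByte();
        a.checkFullyConsumed(); // ok: exactly 2 bytes consumed
        System.out.println("ok");
    }
}
```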

          If there are no objections I will open a separate JIRA issue for the
          DataInput/Output patch.

          Michael McCandless added a comment -

           > Eventually we need more flexibility to utilize the flexible indexing
           > chain anyway. We need to store which codec to use for a field. Then we
           > could also just make a new codec for column-stride fields and maybe
           > then we do not have to introduce a new Field API.

           By creating a custom indexing chain you could actually write CSFs
           today.

          But the lack of extensibility of Field needs to be addressed: you need
          some way to store something arbitrary & opaque into a field such that
          your indexing chain could pick it up and act.

           And FieldInfos also needs a "store this opaque thing for me" API.

          One of the big changes in LUCENE-1458 is to strongly separate
          different fields on the read APIs. EG there is a separate FieldsEnum
          from TermsEnum, meaning you first seek to the field you want, then
          request a TermsEnum from that, which can iterate through the terms
          only for that field. It's the codec's job to return the right
          TermsEnum for a given field.

           Not to delay 2.9 further, but... I also wonder, if Lucene had
           NumericField (say), how it would simplify things here. EG, today, if
           I have a field "weight" that is a float, I'm going to have to set
           something to tell the CSF writer (man, the similarity of that to CFS
           is going to cause problems!) to cast-it-and-save-it-as-a-float-array
           to disk; I'm going to have to tell the TrieRangeUtil to do the same,
           etc. It'd be much better if that field stored a float (not a String),
           and if it defaulted "naturally" to using these two special indexers...

           > DataIn(Out)put would implement the different read and
           > write methods, whereas IndexIn(Out)put would only implement methods
           > like close(), seek(), getFilePointer(), length(), flush(), etc.

          What is the fastest way in Java to slurp in a bunch of bytes as an
          int[], short[], float[], etc? Seems that we need to answer that first
          and then work out how to fix our store APIs to enable that. (Maybe
          it's IntBuffer wrapping ByteBuffer, instead of an int[]?).
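           One candidate answer, as a self-contained sketch: wrap the raw bytes
           in a ByteBuffer and view them as an IntBuffer, which gives a bulk
           transfer into an int[] without decoding each value by hand:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.IntBuffer;
import java.util.Arrays;

public class BulkIntRead {
    public static void main(String[] args) {
        // Raw bytes as they might come off disk: three big-endian ints.
        byte[] raw = {0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3};

        // View the byte[] as ints without hand-decoding each value.
        IntBuffer ints = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN).asIntBuffer();

        int[] values = new int[ints.remaining()];
        ints.get(values); // bulk transfer into the int[]
        System.out.println(Arrays.toString(values)); // [1, 2, 3]
    }
}
```

           Whether this beats reading into an int[] directly would still need
           benchmarking, which is exactly the open question above.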

           > The danger here compared to the current
           > payloads API would be that the user might read too few or too many
           > bytes of a CSF, which would result in an undefined and possibly hard
           > to debug behavior.

           I think it's better to have good performance with an added risk of
           danger than forced handholding always.

           > The SafeAccessor would count for you the number of read bytes and
           > throw exceptions if you don't consume the number of bytes you should
           > consume.

           I generally prefer liberal use of asserts to trip bugs like this,
           instead of explicit, strongly divorced code paths / classes / modes
           etc., containing real if statements at production runtime.

          Michael McCandless added a comment -

          One interesting idea, from Earwin's comment here:

          https://issues.apache.org/jira/browse/LUCENE-1584?focusedCommentId=12696583&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12696583

          is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). Ie, when possible, that method should maybe pull from CSFs for values.

          Jason Rutherglen added a comment -

          +1 Making it automatic makes sense.

          Michael Busch added a comment -

           > is for column-stride fields to be a possible source for IndexReader.document() (or whatever it becomes). Ie, when possible, that method should maybe pull from CSFs for values.

           Hmm, I'm not sure about this. The Lucene store is optimized for the case where ALL stored fields of a document need to be loaded. E.g., to render search results you usually need to load multiple fields; then only one disk seek is needed to get to the right document, and loading the fields is just one sequential I/O.

           If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

           My general rule of thumb is to use payloads/CSFs for data you need during hit collection (e.g. a UID field, or the norms, which are essentially a CSF with a fixed size), but ordinary stored fields to load data for rendering search results.

          Michael McCandless added a comment -

           > If you e.g. want to show 5 fields per search result, you need to perform 1 seek per doc using stored fields, but will need 5 seeks with CSFs.

          If the CSFs have been loaded into the FieldCache then it's fast. If they haven't, I agree (one seek per doc).

          Anyway this would just be "sugar" (a single 'retrieve doc' API that'd pull from the FieldCache & from stored fields, and merge). EG if I store body text as stored field, but then have a bunch of small "metadata" fields (price, weight, manufacturer, etc.) stored in FieldCache (loaded via CSF), it'd be convenient to call document() somewhere and get all fields back.

          Earwin Burrfoot added a comment -

          I can share my design for doc loading, if anybody needs it:

           public interface FieldCache {
             DocLoader loader(FieldInfo<?>... fields);
             ....
           }

           public interface DocLoader {
             void load(Doc doc);
             <T> T value(FieldInfo<T> field);
           }

           Doc is my analog of ScoreDoc; for these purposes it is completely identical.
           FieldInfos are constants defined like UserFields.EMAIL; they hold the field's type, its name, its indexing method, whether it is cached or not, and the way it is cached. Two synthetic fields exist - LUCENE_ID and SCORE - which allow using the same API for anything field-related.

          Workflow looks like this:

          // I create a loader. Fields are checked against the cache, for those that aren't cached I create a FieldSelector
          loader = searcher.fieldCache().loader(concat(payloadFields, ID, DOCUMENT_TYPE, sortBy.field));

          // Then for each document I'm going to send in response for search request I select this document
          // an indexReader.document(fieldSelector) happens here if there are any uncached fields
          loader.load(doc);

          // Then I extract the values I need. Cached ones arrive from the cache, uncached are decoded from Document retrieved in previous step
          hit = new Hit(loader.value(ID), loader.value(DOCUMENT_TYPE), loader.value(sortBy.field)) // etc, etc

           Having a single API to retrieve values regardless of the way they are stored/cached is very handy. Loading a mix of stored/column-stride (if I correctly understand what they are) fields is pointless; you're more likely to lose performance than to gain it. Loading a mix of cached/uncached fields is a massive win, and it becomes even more massive if all required fields happen to be cached.

          Marvin Humphrey added a comment -

           FWIW, I think the priority for document fetching should be to optimize for
           search clusters where you have separate index servers and document/excerpt servers.
          Under such a system, it's clear that you'd want to use the standard stored
          fields.

          To my mind, column stride fields are more of a search tool – an extension of
          the field caches and the "flexible indexing" concept – and I don't think the
          design should be encumbered or deployment slowed so that they can perform
          double duty.

          However, I think column-stride fields may be more of an imperative for Lucy
          than Lucene. In Lucy, our sort caches will be either column-stride or
          variable width and will be written out at index-time – mmap a
          one-value-per-doc column-stride file, and voila: instant sort cache.

          > Loading a mix of cached/uncached fields is massive win

          I don't quite understand why that would be the case. Presuming a cold OS
          cache, the big cost is the .fdt file disk seek. Once you're there, how much
          of a difference is it to read the field off disk as opposed to reading it from
          the cache?

          > it becomes even more massive if all required fields happen to be cached.

           I can see that. However, the gain won't be all that significant for systems
           where the index fits into RAM, or when the persistent storage device is an
           SSD. And of course a different caching strategy altogether (popular-document
           caching) is best for dedicated doc servers.

          Michael McCandless added a comment -

          To my mind, column stride fields are more of a search tool - an extension of
          the field caches and the "flexible indexing" concept

          I don't think of CSF's as an extension to FieldCache; I think of them as an alternate (much more efficient than uninversion) underlying store that FieldCache can use to retrieve values. Ie the way you'll access a CSF in Lucene will be through the [new in LUCENE-831] FieldCache API.
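The distinction drawn here, the same per-doc view served by two different backing stores, can be illustrated with a toy sketch. The names and shapes below are assumptions for illustration, not Lucene's actual FieldCache API: uninversion walks term-to-postings data and scatters values into doc order, while a column-stride field already stores the doc-ordered column and can be read directly.

```java
import java.util.Arrays;
import java.util.Map;

public class FieldCacheSketch {
    // Uninversion: walk value -> postings and scatter values into doc order.
    static int[] uninvert(Map<Integer, int[]> postingsByValue, int maxDoc) {
        int[] column = new int[maxDoc];
        for (Map.Entry<Integer, int[]> e : postingsByValue.entrySet())
            for (int doc : e.getValue())
                column[doc] = e.getKey();
        return column;
    }

    public static void main(String[] args) {
        // Inverted view: value 7 occurs in docs 0 and 2; value 3 in doc 1.
        int[] viaUninvert = uninvert(Map.of(7, new int[]{0, 2}, 3, new int[]{1}), 3);
        // A CSF stores the doc-ordered column directly; no uninversion pass needed.
        int[] viaCsf = {7, 3, 7};
        System.out.println(Arrays.equals(viaUninvert, viaCsf));
    }
}
```

Both paths yield the same column; the CSF path just skips the expensive build step, which is the "much more efficient than uninversion" point.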

          However, I think column-stride fields may be more of an imperative for Lucy
          than Lucene. In Lucy, our sort caches will be either column-stride or
          variable width and will be written out at index-time - mmap a
          one-value-per-doc column-stride file, and voila: instant sort cache.

          I think column or row stride, vs fixed or variable width storage, are orthogonal issues?

          EG column-stride fields (in Lucene) could very well be variable width storage (eg for storing String values).
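A variable-width column-stride encoding along the lines of the CSD grammar in the issue description (<DocDelta, PayloadLength?, Payload>, with lengths as VInts) could look like the following sketch. The layout details are assumptions for illustration, not Lucene's final format.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class VarWidthCsd {
    // Lucene-style VInt: 7 bits per byte, high bit set means "more bytes follow".
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int lastDoc = 0;
        int[] docs = {3, 7, 12};                       // docs that have a value
        String[] payloads = {"red", "green", "blue"};  // variable-width values
        for (int i = 0; i < docs.length; i++) {
            byte[] payload = payloads[i].getBytes(StandardCharsets.UTF_8);
            writeVInt(out, docs[i] - lastDoc);    // DocDelta
            writeVInt(out, payload.length);       // PayloadLength --> VInt
            out.write(payload, 0, payload.length); // Payload --> Byte^PayloadLength
            lastDoc = docs[i];
        }
        // 3 one-byte deltas + 3 one-byte lengths + 3+5+4 payload bytes = 18 bytes
        System.out.println(out.size());
    }
}
```

This shows why variable-width storage is orthogonal to the column-stride layout: the column stays one-entry-per-doc, only the per-entry width varies (and a skip list, per the grammar, would restore fast random access).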

          I can see that. However, the gain won't be all that significant for systems
          where the index fits into RAM, or when the persistant storage device is an
          SSD.

          SSDs, as fast as they are, are still orders of magnitude slower than RAM (assuming of course the OS hasn't swapped your RAM out to your SSD).

          Simon Willnauer added a comment -

          This has been implemented in LUCENE-3108, LUCENE-2935, LUCENE-2168 and LUCENE-1231; moving out.


            People

            • Assignee: Simon Willnauer
            • Reporter: Michael Busch
            • Votes: 8
            • Watchers: 13