Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.1
    • Fix Version/s: 4.1
    • Component/s: core/codecs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      The current StoredFieldsFormat is implemented with the assumption that only one type of StoredFieldsFormat is used by the index.
      We would like to be able to configure a StoredFieldsFormat per field, similarly to the PostingsFormat.
      There are a few issues that need to be solved to allow that:
      1) allow configuring a segment suffix on the StoredFieldsFormat
      2) implement an SPI interface in StoredFieldsFormat
      3) create a PerFieldStoredFieldsFormat

      We are proposing to start first with 1) by modifying the signature of StoredFieldsFormat#fieldsReader and StoredFieldsFormat#fieldsWriter so that they use SegmentReadState and SegmentWriteState instead of the current set of parameters.

      Let us know what you think about this idea. If this is of interest, we can contribute a first patch for 1).
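      As a rough illustration of the per-field idea in 2) and 3) — this is not Lucene code, and all names here are hypothetical — the dispatch could mirror PerFieldPostingsFormat: each field maps to a format name, with a fallback to a single default format.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of per-field format dispatch, analogous to
// PerFieldPostingsFormat. PerFieldDispatch, setFormat and resolveFormat are
// illustrative names, not Lucene APIs.
public class PerFieldDispatch {

  private final Map<String, String> formatPerField = new HashMap<>();
  private final String defaultFormat;

  public PerFieldDispatch(String defaultFormat) {
    this.defaultFormat = defaultFormat;
  }

  // Register a non-default format for one field.
  public void setFormat(String fieldName, String formatName) {
    formatPerField.put(fieldName, formatName);
  }

  // Pick the format for a field, falling back to the default.
  public String resolveFormat(String fieldName) {
    return formatPerField.getOrDefault(fieldName, defaultFormat);
  }

  public static void main(String[] args) {
    PerFieldDispatch d = new PerFieldDispatch("Compressing");
    d.setFormat("graph", "MyCustomFormat");
    System.out.println(d.resolveFormat("title")); // Compressing
    System.out.println(d.resolveFormat("graph")); // MyCustomFormat
  }
}
```

      In this sketch, standard fields would keep going through the default (compressing) format, and only the registered fields would be routed elsewhere.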

      1. LUCENE-4591.patch
        5 kB
        Adrien Grand
      2. LUCENE-4591.patch
        3 kB
        Renaud Delbru
      3. LUCENE-4591.patch
        1 kB
        Renaud Delbru
      4. PerFieldStoredFieldsFormat.java
        2 kB
        Renaud Delbru
      5. PerFieldStoredFieldsReader.java
        3 kB
        Renaud Delbru
      6. PerFieldStoredFieldsWriter.java
        6 kB
        Renaud Delbru

        Issue Links

          Activity

          Adrien Grand added a comment -

          Hi Renaud,

          My concern about a per-field StoredFieldsFormat is that it would increase the number of disk seeks required to load all the stored fields of a document (the current default StoredFieldsFormat impl requires only one disk seek to load all the fields of a document). What are you trying to optimize with a per-field StoredFieldsFormat?

          Renaud Delbru added a comment -

          Hi Adrien, yes, I understand the problem. While it is true that, in the extreme case, people could configure a different StoredFieldsFormat for each field, which would lead to a large increase in disk seeks, here we would like to use the CompressingStoredFieldsFormat for all the standard fields but have a different mechanism for specific fields.

          We would like to store certain fields that require a different type of data structure than the one currently supported, i.e., a document is not a simple list of fields but a more complex data structure.

          We could solve the problem by copying and modifying the current CompressingStoredFieldsWriter and CompressingStoredFieldsReader so that they can decide which type of encoding to use based on the field info. However, this is kind of hacky, and we would have to keep our copy in sync with the original implementation. The only alternative we could find is a per-field approach.

          Adrien Grand added a comment -

          We would like to store certain fields that requires a different type of data structure than the one currently supported, i.e., a document is not a simple list of fields, but a more complex data structure.

          How would you do that? As far as I know, the only way to access stored fields is through the StoredFieldVisitor API, which only supports sequences of fields that are necessarily numbers, strings, or byte arrays.

          However, this is kind of hacky, and we would have to keep our copy in sync with the original implementation.

          I think writing a different StoredFieldsFormat impl would be a better option, as you could still require only one disk seek in the worst case. You could reuse some components of CompressingStoredFieldsFormat such as CompressionMode, and maybe we could expose other ones such as CompressingStoredFieldsIndexWriter/Reader (they write/read the stored fields index in a memory-efficient way).

          If you don't care about increasing the number of disk seeks and if you can encode/decode your complex data structure into a small opaque binary blob, maybe DocValues could be an option (Mike and Robert are adding per-field format support to DocValues in LUCENE-4547).

          Robert Muir added a comment -

          I don't think we should implement an SPI unless it's really necessary (I've said this before: SPI is heavy-duty stuff and shouldn't be done on a whim).

          As Adrien describes, StoredFieldsFormat does not really make sense per-field.

          I don't think we can/should use SegmentWriteState: since IndexWriter streams stored fields directly to the codec, some parameters like FieldInfos are not available at the time this writer is initialized.

          But I think all components should get and respect segment suffixes!

          Renaud Delbru added a comment -

          How would you do that? As far as I know, the only way to access stored fields is through the StoredFieldVisitor API

          Yes, but we can pass our own implementation of StoredFieldVisitor with our own specific information about what to retrieve.

          So from Adrien's and Robert's feedback, it looks like you do not really want a per-field StoredFieldsFormat mechanism. That's fine and understandable.
          Could we instead open up or extend the current code base so that we are freer to extend it on our side? For example, by opening CompressingStoredFieldsWriter and CompressingStoredFieldsReader, as well as making them configurable with a segment suffix? That would greatly help.

          Adrien Grand added a comment -

          Could we instead try to open/extend the current code base so that we are more free to extend it on our side?

          Can you list exactly what you need and/or provide us with a patch?

          Robert Muir added a comment -

          As I said before, in my opinion all (or at least most) codec components should take segment suffixes and respect them.

          I actually have a nocommit for this in LUCENE-4547 branch.

          I was thinking we could even have a test for this that writes a segment directly to all the consumers with a suffix and then checks the filenames and so on.

          Renaud Delbru added a comment -

          Given that properly supporting segment suffixes looks more difficult than expected (unless Robert comes up with a patch), the remaining solution for us, in order to avoid copying most of the compressing package, would be to open up CompressingStoredFieldsWriter and CompressingStoredFieldsReader, i.e., make them public and non-final (see patch). I am not sure that would be acceptable to you, but it would be good for expert-level extension.

          Adrien Grand added a comment -

          Why would you need to make them non-final?

          Renaud Delbru added a comment -

          To be able to subclass and extend them. If that is not acceptable to you, then the solution left to us is to copy-paste the original code and extend it on our side (and close this issue).

          Adrien Grand added a comment -

          I had a look at CompressingStoredFieldsWriter and I think that having a different encoding/compression strategy per field would deserve a different StoredFieldsFormat impl (this is a discussion we had in LUCENE-4226, but in that case I think we could open up CompressingStoredFieldsIndexWriter/Reader). However, I was thinking that if you don't mind adding one or two extra random seeks, maybe you could reuse it without extending it, like:

          class MyCustomStoredFieldsWriter extends StoredFieldsWriter {
          
            StoredFieldsWriter defaultSfw; // the default Lucene 4.1 stored fields writer
          
            @Override
            public void writeField(FieldInfo info, StorableField field) throws IOException {
              if (isStandard(field)) {
                defaultSfw.writeField(info, field);
              } else {
                // TODO: custom logic writing non-standard fields to another IndexOutput
              }
            }
          
          }
          

          and similarly for the reader

          class MyCustomStoredFieldsReader extends StoredFieldsReader {
          
            StoredFieldsReader defaultSfr; // the default Lucene 4.1 stored fields reader
          
            @Override
            public void visitDocument(int n, StoredFieldVisitor visitor) throws IOException {
              // visit standard fields
              defaultSfr.visitDocument(n, visitor);
              // TODO: then visit specific fields
            }
          
          }
          

          Would that work for your use case?

          Renaud Delbru added a comment -

          That is similar to the approach we followed (see attached files: PerFieldStoredFieldsFormat, PerFieldStoredFieldsWriter and PerFieldStoredFieldsReader).
          The issue is that the secondary StoredFieldsReader/Writer we are using is, for the moment, a wrapper around an instance of CompressingStoredFieldsReader/Writer (the wrapper approach was another way to extend CompressingStoredFieldsReader/Writer). The wrapper implements our encoding logic and uses the underlying CompressingStoredFieldsWriter to write our data as a binary block. The problem with this approach is that, since we cannot configure the segment suffix of the CompressingStoredFieldsWriter, the two StoredFieldsFormats try to write to files with identical names.
          Since we are using CompressingStoredFieldsReader/Writer as the underlying mechanism to write the stored fields, why not use just one instance to store both the default Lucene fields and our specific fields? The reasons are that it was simpler for our first implementation to leverage CompressingStoredFieldsReader/Writer (as a temporary solution), and that we would like to keep things (code and segment files) more isolated from each other.
          As said previously, we could simply copy-paste the compressing codec on our side to solve the problem, but I thought that by raising the issue we might find a more appropriate solution.

          Renaud Delbru added a comment -

          Another solution would be to extend the constructors of CompressingStoredFieldsWriter/Reader to accept a segment suffix as a parameter. See the new patch attached. This might also pave the way for Robert's work on making codec components respect segment suffixes.
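
          A minimal sketch of why a configurable segment suffix fixes the file-name collision — the naming rule below is an assumption, modeled on Lucene's IndexFileNames.segmentFileName convention, not a copy of it:

```java
// Illustrative sketch (assumed naming convention): a non-empty segment suffix
// is inserted into the file name, so two stored-fields writers configured with
// different suffixes no longer collide on the same segment files.
public class SegmentSuffixDemo {

  static String segmentFileName(String segmentName, String segmentSuffix, String ext) {
    if (segmentSuffix.isEmpty()) {
      return segmentName + "." + ext;
    }
    // Suffix separates this component's files from the default ones.
    return segmentName + "_" + segmentSuffix + "." + ext;
  }

  public static void main(String[] args) {
    // Default stored fields writer vs. a wrapped one with its own suffix:
    System.out.println(segmentFileName("_0", "", "fdt"));      // _0.fdt
    System.out.println(segmentFileName("_0", "myfmt", "fdt")); // _0_myfmt.fdt
  }
}
```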

          Adrien Grand added a comment -

          Adding constructors to support segment suffixes sounds good! I modified your patch to add lucene.experimental tags and make ant precommit pass. If it is fine with you, I'll commit it soon.

          Renaud Delbru added a comment -

          That is fine for me. Thanks for your help Adrien.

          Commit Tag Bot added a comment -

          [trunk commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1419449

          LUCENE-4591: Make CompressingStoredFields{Writer,Reader} accept a segment suffix as a constructor parameter.

          Commit Tag Bot added a comment -

          [branch_4x commit] Adrien Grand
          http://svn.apache.org/viewvc?view=revision&revision=1419483

          LUCENE-4591: Make CompressingStoredFields{Writer,Reader} accept a segment suffix as a constructor parameter (merged from r1419449).


            People

            • Assignee:
              Adrien Grand
              Reporter:
              Renaud Delbru
            • Votes:
              0
              Watchers:
              3

              Dates

              • Created:
                Updated:
                Resolved:

                Development