Lucene - Core
  1. Lucene - Core
  2. LUCENE-3312

Break out StorableField from IndexableField

    Details

    • Lucene Fields:
      New

      Description

      In the field type branch we have strongly decoupled
      Document/Field/FieldType impl from the indexer, by having only a
      narrow API (IndexableField) passed to IndexWriter. This frees apps up
      use their own "documents" instead of the "user-space" impls we provide
      in oal.document.

      Similarly, with LUCENE-3309, we've done the same thing on the
      doc/field retrieval side (from IndexReader), with the
      StoredFieldsVisitor.

      But, maybe we should break out StorableField from IndexableField,
      such that when you index a doc you provide two Iterables – one for the
      IndexableFields and one for the StorableFields. Either can be null.

      One downside is possible perf hit for fields that are both indexed &
      stored (ie, we visit them twice, lookup their name in a hash twice,
      etc.). But the upside is a cleaner separation of concerns in API....

      1. LUCENE-3312-DocumentIterators-uwe.patch
        2 kB
        Uwe Schindler
      2. lucene-3312-patch-01.patch
        44 kB
        Nikola Tankovic
      3. lucene-3312-patch-02.patch
        68 kB
        Nikola Tankovic
      4. lucene-3312-patch-03.patch
        129 kB
        Nikola Tankovic
      5. lucene-3312-patch-04.patch
        126 kB
        Nikola Tankovic
      6. lucene-3312-patch-05.patch
        107 kB
        Nikola Tankovic
      7. lucene-3312-patch-06.patch
        109 kB
        Nikola Tankovic
      8. lucene-3312-patch-07.patch
        95 kB
        Nikola Tankovic
      9. lucene-3312-patch-08.patch
        103 kB
        Nikola Tankovic
      10. lucene-3312-patch-09.patch
        1 kB
        Nikola Tankovic
      11. lucene-3312-patch-10.patch
        9 kB
        Nikola Tankovic
      12. lucene-3312-patch-11.patch
        140 kB
        Nikola Tankovic
      13. lucene-3312-patch-12.patch
        141 kB
        Nikola Tankovic
      14. lucene-3312-patch-12a.patch
        147 kB
        Nikola Tankovic
      15. lucene-3312-patch-13.patch
        6 kB
        Nikola Tankovic
      16. lucene-3312-patch-14.patch
        3 kB
        Nikola Tankovic
      17. LUCENE-3312-reintegration.patch
        312 kB
        Uwe Schindler
      18. LUCENE-3312-reintegration.patch
        308 kB
        Uwe Schindler

        Issue Links

          Activity

          Hide
          Simon Willnauer added a comment -

          once we are on this here we should think about breaking out storing fields from the indexing chain. I think this could easily be a separated process such that stored fields are not written by the indexing chain but once all fields are indexed. This would make the indexing chain somewhat cleaner I think.

          Show
          Simon Willnauer added a comment - once we are on this here we should think about breaking out storing fields from the indexing chain. I think this could easily be a separated process such that stored fields are not written by the indexing chain but once all fields are indexed. This would make the indexing chain somewhat cleaner I think.
          Hide
          Chris Male added a comment -

          With LUCENE-2308 out of the way, I've started looking into this more deeply. Changing the indexer code has not been especially difficult since there is already a clear separation in the handling of indexed and stored fields. The challenges are in the consumer / user code. So I have a couple of questions I'm hoping for some opinions on:

          • Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata?
          • By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality?
          Show
          Chris Male added a comment - With LUCENE-2308 out of the way, I've started looking into this more deeply. Changing the indexer code has not been especially difficult since there is already a clear separation in the handling of indexed and stored fields. The challenges are in the consumer / user code. So I have a couple of questions I'm hoping for some opinions on: Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata? By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality?
          Hide
          Michael McCandless added a comment -

          Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata?

          Right, this has been a long standing problem w/ the Document class you
          load at search time, ie the fields "pretend" to carry over the
          details from indexing. But it's buggy now, eg boost is not carried
          over, and the indexed bit is "global" (comes from field info) while
          the "tokenized" bit used to be per-doc, before LUCENE-2308.

          So I consider this (these indexing details are no longer available
          when you pull the document) a big benefit of cutting over to
          StorableField. Ie, its trappy today since it's buggy, so we'd be
          removing that trap.

          By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality?

          Good question... I think the userland "Field" (oal.document) should
          implement both IndexableField and StorableField? And then
          oal.document.Document holds Field instances?

          Maybe we can name the new class oal.index.Indexable? It's a trivial
          class, just exposing .indexableFieldsIterator and
          .storableFieldsIterator?

          Show
          Michael McCandless added a comment - Due to the fact that FieldInfo is maintained per field name, if an IndexableField and StorableField are added to a Document separately but with the same name, a single FieldInfo will be created noting the field is both indexed and stored. This isn't a problem, however a lot of code used to leverage this fact to get metadata about indexed Fields using searcher.document(docId). They would retrieve all the stored fields and then see which were also indexed (and associated metadata). This seems like a bit of a hack, piggybacking stored fields to find out about their indexing attributes. So I guess it cannot continue to go forward? When you pull the StorableFields, you should only be able to access the stored value metadata? Right, this has been a long standing problem w/ the Document class you load at search time, ie the fields "pretend" to carry over the details from indexing. But it's buggy now, eg boost is not carried over, and the indexed bit is "global" (comes from field info) while the "tokenized" bit used to be per-doc, before LUCENE-2308 . So I consider this (these indexing details are no longer available when you pull the document) a big benefit of cutting over to StorableField. Ie, its trappy today since it's buggy, so we'd be removing that trap. By creating this separation, we will need some notion of a Document in index.* which provides Iterable access to both the IndexableFields and StorableFields. As such, Document itself is becoming more userland. However by letting it store Indexable and StorableFields separately, the functionality it provides (getBinaryValue for example) becomes quite verbose because it must provide an implementations of both kinds of fields. Given that Field is a userland implementation of both Indexable and StorableField, should Document work solely with Fields? or should we allow people to register both kinds of fields separately and just have a verbose set of functionality? Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances? Maybe we can name the new class oal.index.Indexable? It's a trivial class, just exposing .indexableFieldsIterator and .storableFieldsIterator?
          Hide
          Chris Male added a comment -

          Good question... I think the userland "Field" (oal.document) should
          implement both IndexableField and StorableField? And then
          oal.document.Document holds Field instances?

          Yeah I have made Field implement both types. I have also left stored() and indexed() on Field while removing them from the respective interfaces.

          Maybe we can name the new class oal.index.Indexable? It's a trivial
          class, just exposing .indexableFieldsIterator and
          .storableFieldsIterator?

          Absolutely.

          Thanks for your help Mike!

          Show
          Chris Male added a comment - Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances? Yeah I have made Field implement both types. I have also left stored() and indexed() on Field while removing them from the respective interfaces. Maybe we can name the new class oal.index.Indexable? It's a trivial class, just exposing .indexableFieldsIterator and .storableFieldsIterator? Absolutely. Thanks for your help Mike!
          Hide
          Chris Male added a comment -

          I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields?

          Show
          Chris Male added a comment - I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields?
          Hide
          Simon Willnauer added a comment -

          I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields?

          To me it appears that we need some clarification what DocValues are. Actually, when you think about it Stored Fields and DocValues have a lot in common. A Stored Field is basically a DocValues DerefVarBytes type and maybe down the road we should think about merge those two types together. It would be nice to have only one typesafe API that can store whatever you want and based on the codec lucene would decide how to store it on disk ie. if it is a multi field container like Stored Fields are done today or if the values are split appart like DocValues does it today.
          For now we should try to differentiate between and InvertedField and a StoredField ie. everything which is not an InvertedField is a StoredField. The API could basically already reflect that DocValues and StoredFields are the same and simply specify a type like Store.Packed vs. Store.ColumnStride or something like that. If we do that we could also expose loading "Packed" Fields via PerDocValues and have one API for our users.

          Show
          Simon Willnauer added a comment - I'm almost done getting an initial patch for this, just one issue remaining - IndexDocValues. IndexDocValues can be both not indexed and not stored. Therefore when you retrieve the indexed fields and then the stored fields, you can miss some IndexDocValues. It seems to be that we might need a 3rd interface to cover these fields? To me it appears that we need some clarification what DocValues are. Actually, when you think about it Stored Fields and DocValues have a lot in common. A Stored Field is basically a DocValues DerefVarBytes type and maybe down the road we should think about merge those two types together. It would be nice to have only one typesafe API that can store whatever you want and based on the codec lucene would decide how to store it on disk ie. if it is a multi field container like Stored Fields are done today or if the values are split appart like DocValues does it today. For now we should try to differentiate between and InvertedField and a StoredField ie. everything which is not an InvertedField is a StoredField. The API could basically already reflect that DocValues and StoredFields are the same and simply specify a type like Store.Packed vs. Store.ColumnStride or something like that. If we do that we could also expose loading "Packed" Fields via PerDocValues and have one API for our users.
          Hide
          Chris Male added a comment -

          Just for clarification, Packed refers to the notion of a stored field? or am I lost?

          Show
          Chris Male added a comment - Just for clarification, Packed refers to the notion of a stored field? or am I lost?
          Hide
          Simon Willnauer added a comment -

          Just for clarification, Packed refers to the notion of a stored field? or am I lost?

          yes, since we pack all fields together into one location and store only the offset to the first field per document. I just used that term here to differentiate, sorry for the confusion

          Show
          Simon Willnauer added a comment - Just for clarification, Packed refers to the notion of a stored field? or am I lost? yes, since we pack all fields together into one location and store only the offset to the first field per document. I just used that term here to differentiate, sorry for the confusion
          Hide
          Chris Male added a comment -

          Given what you say about the similarities between stored fields and DocValues and the direction we seem to be heading, I think its a good term to start using.

          Show
          Chris Male added a comment - Given what you say about the similarities between stored fields and DocValues and the direction we seem to be heading, I think its a good term to start using.
          Hide
          Chris Male added a comment -

          Good question... I think the userland "Field" (oal.document) should
          implement both IndexableField and StorableField? And then
          oal.document.Document holds Field instances?

          Hm I'm going round in circles on this. For building and indexing a Document, having the class hold Field instances is easiest and the most clean option. However this then means we are once again providing Field instances in the Document returned by reader.document(), meaning we lose:

          So I consider this (these indexing details are no longer available
          when you pull the document) a big benefit of cutting over to
          StorableField. Ie, its trappy today since it's buggy, so we'd be
          removing that trap.

          Thoughts? :/

          Show
          Chris Male added a comment - Good question... I think the userland "Field" (oal.document) should implement both IndexableField and StorableField? And then oal.document.Document holds Field instances? Hm I'm going round in circles on this. For building and indexing a Document, having the class hold Field instances is easiest and the most clean option. However this then means we are once again providing Field instances in the Document returned by reader.document(), meaning we lose: So I consider this (these indexing details are no longer available when you pull the document) a big benefit of cutting over to StorableField. Ie, its trappy today since it's buggy, so we'd be removing that trap. Thoughts? :/
          Hide
          Chris Male added a comment - - edited

          So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields. I like this idea since it gets people over the hump of piggybacking Field, but it is a bw compat break. Any objections?

          Show
          Chris Male added a comment - - edited So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields. I like this idea since it gets people over the hump of piggybacking Field, but it is a bw compat break. Any objections?
          Hide
          Michael McCandless added a comment -

          I agree DocValues seem both indexed and stored, but they are closer to
          stored field so let's put them under StorableField.

          And indeed we could impl stored fields as a DerefVarBytes doc values
          field, but I think we should hold off on unifying this in the APIs we
          are creating here?

          Ie, StorableField should just have a .docValues() method and if that
          returns non-null value the indexer will index those doc values
          (likewise for .stringValue(), binaryValue(), etc.)?

          Show
          Michael McCandless added a comment - I agree DocValues seem both indexed and stored, but they are closer to stored field so let's put them under StorableField. And indeed we could impl stored fields as a DerefVarBytes doc values field, but I think we should hold off on unifying this in the APIs we are creating here? Ie, StorableField should just have a .docValues() method and if that returns non-null value the indexer will index those doc values (likewise for .stringValue(), binaryValue(), etc.)?
          Hide
          Michael McCandless added a comment -

          So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields.

          +1

          Show
          Michael McCandless added a comment - So one thought would be to have a different class being returned by reader.document(), we could call it StoredDocument and it would only make access to StorableFields. +1
          Hide
          Chris Male added a comment - - edited

          Back on this wagon for a bit.

          Just wondering about whether we need a StorableFieldType to accompany StorableField.

          At this stage I've struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages:

          • Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simple initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction.
          • It separates the 'data type' away from the actual value.

          If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.

          Show
          Chris Male added a comment - - edited Back on this wagon for a bit. Just wondering about whether we need a StorableFieldType to accompany StorableField. At this stage I've struggling to identify candidate properties for a StorableFieldType. Options include moving the Numeric.DataType and DocValues' ValueType to the FieldType. While I sort of like this idea, it seems to have a couple of disadvantages: Any FieldTypes passed into NumericField and IndexDocValuesField would have to have these properties set from the beginning. For both of these, this would mean it wouldn't be possible to simple initialize a field and then use one of the setters to define the Data/ValueType - they would need to be known at construction. It separates the 'data type' away from the actual value. If these properties were to stay on StorableField, I can't really see the need for a StorableFieldType.
          Hide
          Nikola Tankovic added a comment -

          Hi folks,

          I think this one could be a nice addition to my last year GSOC and I would like to take on it for this year GSOC.
          Mike also suggested using different class for reader.document() so this means I would also do https://issues.apache.org/jira/browse/LUCENE-3891

          What do you guys think?

          Show
          Nikola Tankovic added a comment - Hi folks, I think this one could be a nice addition to my last year GSOC and I would like to take on it for this year GSOC. Mike also suggested using different class for reader.document() so this means I would also do https://issues.apache.org/jira/browse/LUCENE-3891 What do you guys think?
          Hide
          Michael McCandless added a comment -

          Hi Nikola,

          I think this plus LUCENE-3891 sounds great! The challenge is... we need a mentor for this project... volunteers?

          Show
          Michael McCandless added a comment - Hi Nikola, I think this plus LUCENE-3891 sounds great! The challenge is... we need a mentor for this project... volunteers?
          Hide
          Nikola Tankovic added a comment -

          Hi guys,

          I submited first draft of proposal @ http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/ntankovic/30002#

          I'm welcome for corrections and further proposals.

          Ty!

          Show
          Nikola Tankovic added a comment - Hi guys, I submited first draft of proposal @ http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/ntankovic/30002# I'm welcome for corrections and further proposals. Ty!
          Hide
          Uwe Schindler added a comment -

          Hi Michael,
          I would take the mentorship. We will have an IRC meeting for a short interview tomorrow (Sat, 14) on #lucene-dev at 13:00 UTC.

          Show
          Uwe Schindler added a comment - Hi Michael, I would take the mentorship. We will have an IRC meeting for a short interview tomorrow (Sat, 14) on #lucene-dev at 13:00 UTC.
          Hide
          Uwe Schindler added a comment -
          Show
          Uwe Schindler added a comment - We had the GSoC interview today: http://colabti.org/irclogger/irclogger_log/lucene-dev?date=2012-04-14#l27
          Hide
          Nikola Tankovic added a comment -

          I want to thank you guys for the opportunity to be part of GSOC 2012. Let the coding begin! You rock!

          Show
          Nikola Tankovic added a comment - I want to thank you guys for the opportunity to be part of GSOC 2012. Let the coding begin! You rock!
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          I am glad to work with you this year! I did not get the official confirmation from Google until now, but I will check out later! Once back at home (I am on business trip), we should meet this week maybe in IRC to make a plan for the following weeks!

          Uwe

          Show
          Uwe Schindler added a comment - Hi Nikola, I am glad to work with you this year! I did not get the official confirmation from Google until now, but I will check out later! Once back at home (I am on business trip), we should meet this week maybe in IRC to make a plan for the following weeks! Uwe
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          what's the plan to start with?

          Show
          Uwe Schindler added a comment - Hi Nikola, what's the plan to start with?
          Hide
          Uwe Schindler added a comment -

          While at the Lucene Revolution Conference, Mike and me discussed about "hiding" the internal Lucene document IDs from the public search/TopDocs API. As this is very related to this issue, we may thing about implementing this.

          My idea was to let TopDocs directly return the SearchDocument classes, only addressed by its slot in the TopDocs. This way, we prevent the ongoing problems with users "thinking" that Lucene internal document IDs should be stable (which they aren't).

          Of course in the expert APIs (Scorer, DocIDSet, FoildCache...) we still have the internal DocIDs, but the public facing APIs (executing a query and getting TopDocs) should not use internal DocIds. Robert already opened a new issue to "resurrect the crazy Hits" class (LUCENE-4052, but with it's problems fixed).

          Show
          Uwe Schindler added a comment - While at the Lucene Revolution Conference, Mike and me discussed about "hiding" the internal Lucene document IDs from the public search/TopDocs API. As this is very related to this issue, we may thing about implementing this. My idea was to let TopDocs directly return the SearchDocument classes, only addressed by its slot in the TopDocs. This way, we prevent the ongoing problems with users "thinking" that Lucene internal document IDs should be stable (which they aren't). Of course in the expert APIs (Scorer, DocIDSet, FoildCache...) we still have the internal DocIDs, but the public facing APIs (executing a query and getting TopDocs) should not use internal DocIds. Robert already opened a new issue to "resurrect the crazy Hits" class ( LUCENE-4052 , but with it's problems fixed).
          Hide
          Yonik Seeley added a comment -

          I'd argue that TopDocs is also expert level and core (e.g. see TopFieldCollector.create()) hence we shouldn't hide the IDs as that would seriously be diminishing functionality.

          Show
          Yonik Seeley added a comment - I'd argue that TopDocs is also expert level and core (e.g. see TopFieldCollector.create()) hence we shouldn't hide the IDs as that would seriously be diminishing functionality.
          Hide
          Uwe Schindler added a comment -

          The TopDocs class is what IndexSearcher.search() returns, so why is it expert?

          Show
          Uwe Schindler added a comment - The TopDocs class is what IndexSearcher.search() returns, so why is it expert?
          Hide
          Yonik Seeley added a comment -

          The TopDocs class is what IndexSearcher.search() returns, so why is it expert?

          Because it's not just a friendly wrapper around other expert level classes. Remove access to the internal ID there, and it gets extremely difficult to do some types of searches where you want the IDs.
          For example: the TopDocsCollector (expert level) classes return TopDocs.

          Show
          Yonik Seeley added a comment - The TopDocs class is what IndexSearcher.search() returns, so why is it expert? Because it's not just a friendly wrapper around other expert level classes. Remove access to the internal ID there, and it gets extremely difficult to do some types of searches where you want the IDs. For example: the TopDocsCollector (expert level) classes return TopDocs.
          Hide
          Uwe Schindler added a comment -

          My intention wa snot to remove it completely, the idea was to make TopDocs internally hold the IndexSearcher, too, so it can return the Document instance (or what we have here after the refactoring). Returning the docId would then be "expert", but user-facing code could call "Document TopDocs.getDocument(slot)".

          Show
          Uwe Schindler added a comment - My intention wa snot to remove it completely, the idea was to make TopDocs internally hold the IndexSearcher, too, so it can return the Document instance (or what we have here after the refactoring). Returning the docId would then be "expert", but user-facing code could call "Document TopDocs.getDocument(slot)".
          Hide
          Robert Muir added a comment -

          But its wrong now to put an IndexSearcher in TopDocs, as it can contain results
          from multiple searchers (TopDocs.merge)

          Show
          Robert Muir added a comment - But its wrong now to put an IndexSearcher in TopDocs, as it can contain results from multiple searchers (TopDocs.merge)
          Hide
          Uwe Schindler added a comment -

          Robert, I agree about that! I am just discussing ideas, maybe create separate issue about that.

          Show
          Uwe Schindler added a comment - Robert, I agree about that! I am just discussing ideas, maybe create separate issue about that.
          Hide
          Nikola Tankovic added a comment -

          Patch 01: Introduces StoredDocument,
          Patch 02: Continues along path in patch 01, introduces StorableField

          Show
          Nikola Tankovic added a comment - Patch 01: Introduces StoredDocument, Patch 02: Continues along path in patch 01, introduces StorableField
          Hide
          Nikola Tankovic added a comment -

          Patch 03: Core compiles, and almost all test pass.

          The problem remains: should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments?

          Show
          Nikola Tankovic added a comment - Patch 03: Core compiles, and almost all test pass. The problem remains: should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments?
          Hide
          Michael McCandless added a comment -

          Patch looks great!

          I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I
          think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document
          class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those
          iterators to know what fields to index/store?

          should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments?

          I think ideally we would have only a StorableFieldType accessible in the loaded document? Ie, such
          that, it would in fact be safe to turn around and re-use that StoredFields instance for adding stored fields to
          another document your are about to index?

          Show
          Michael McCandless added a comment - Patch looks great! I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those iterators to know what fields to index/store? should StoredDocument with StorableFields have typeInfos? So we can make new Document for another round of indexing from StoredDocuments? I think ideally we would have only a StorableFieldType accessible in the loaded document? Ie, such that, it would in fact be safe to turn around and re-use that StoredFields instance for adding stored fields to another document your are about to index?
          Hide
          Chris Male added a comment -

          I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I
          think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document
          class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those
          iterators to know what fields to index/store?

          I agree fully with this. We really want to decouple Storable and Indexable concepts as much as possible.

          Show
          Chris Male added a comment - I'm not sure IndexableField should extend StorableField? Shouldn't they be separate classes? I think we should also split out a StorableFieldType from IndexableFieldType? Then, the Document class would expose two iterators (one for indexed fields, one for stored fields) and IW uses those iterators to know what fields to index/store? I agree fully with this. We really want to decouple Storable and Indexable concepts as much as possible.
          Hide
          Nikola Tankovic added a comment -

          OK, I agree with decoupling. Can I make GeneralField interface with methods like "getName" and "getXXXValue"? Then, both Indexable and Storable fields can extend that?

          Show
          Nikola Tankovic added a comment - OK, I agree with decoupling. Can I make GeneralField interface with methods like "getName" and "getXXXValue"? Then, both Indexable and Storable fields can extend that?
          Hide
          Chris Male added a comment -

          I'm not sure that's needed. In the end IndexableField shouldn't have any getXXXValue() methods, it should only return a TokenStream. Only StorableField would need those methods.

          Show
          Chris Male added a comment - I'm not sure that's needed. In the end IndexableField shouldn't have any getXXXValue() methods, it should only return a TokenStream. Only StorableField would need those methods.
          Hide
          Nikola Tankovic added a comment -

          OK, thank you! I was not sure about that.

          Show
          Nikola Tankovic added a comment - OK, thank you! I was not sure about that.
          Hide
          Nikola Tankovic added a comment -

          Patch 04: Status > core compiles.

          This is an attempt to separate IndexableFields and StorebleFields in indexing.

          I introduced oal.index.Document which holds both type of fields.

          I also introduced StorableFieldType interface, StoredFieldType class.

          Let me know what you think!

          Show
          Nikola Tankovic added a comment - Patch 04: Status > core compiles. This is an attempt to separate IndexableFields and StorebleFields in indexing. I introduced oal.index.Document which holds both type of fields. I also introduced StorableFieldType interface, StoredFieldType class. Let me know what you think!
          Hide
          Andrzej Bialecki added a comment -

          Comments to patch 04:

          • index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators...
          • as noted on IRC, this strong decoupling of stored and indexed content poses some interesting questions:
            • since you can add multiple fields with the same name, you can now add an arbitrary sequence of Stored and Indexed fields (all with the same name). This means that you can now store parts of a field that are not indexed, and parts of a field that are indexed but not stored.
            • previously, if a field was flagged as indexed but didn't have a tokenStream, its String or Reader value would be used to create a token stream. Now if you want a value to be stored and indexed you have to add two fields with the same name - one StoredField and the other an IndexedField for which you create a token stream from the value. My assumption is that StoredField-s will never be used anymore as potential sources of token streams?
          • maybe this is a good moment to change all getters that return arrays of fields or values to return List-s, since all the code is doing underneath is collecting them into lists and then converting to arrays?
          • previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields?
          • minor nit: there's a grammar mistake in Field.setTokenStream(..): "TokenStream fields tokenized".
          Show
          Andrzej Bialecki added a comment - Comments to patch 04: index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators... as noted on IRC, this strong decoupling of stored and indexed content poses some interesting questions: since you can add multiple fields with the same name, you can now add an arbitrary sequence of Stored and Indexed fields (all with the same name). This means that you can now store parts of a field that are not indexed, and parts of a field that are indexed but not stored. previously, if a field was flagged as indexed but didn't have a tokenStream, its String or Reader value would be used to create a token stream. Now if you want a value to be stored and indexed you have to add two fields with the same name - one StoredField and the other an IndexedField for which you create a token stream from the value. My assumption is that StoredField-s will never be used anymore as potential sources of token streams? maybe this is a good moment to change all getters that return arrays of fields or values to return List-s, since all the code is doing underneath is collecting them into lists and then converting to arrays? previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields? minor nit: there's a grammar mistake in Field.setTokenStream(..): "TokenStream fields tokenized".
          Hide
          Chris Male added a comment -

          index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators...

          I'm not sure that is such a big deal. But I do think should think about the name here. We already have Document and it's going to become confusing with two different Document classes kind of doing the same thing and with document.Document implementing index.Document as well.

          previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields?

          I think we need to simplify the document.Document API. I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances. It is a userland kind of class and so is Field. We should make it easy for people to add the Fields that they want. If they want to have a Field which is both indexed and stored, then they can create it once and add it to Document. If they want to do it separately, then they can do that too. Since Field implements both IndexableField and StorableField, it can serve the dual purpose.

          That way the API in document.Document is pretty simple and you can add and remove things as done in the past.

          Show
          Chris Male added a comment - index.Document is an interface, I think for better extensibility in the future it could be an abstract class - who knows what we will want to put there in addition to the iterators... I'm not sure that is such a big deal. But I do think should think about the name here. We already have Document and it's going to become confusing with two different Document classes kind of doing the same thing and with document.Document implementing index.Document as well. previously we allowed one to remove fields from document by name, are we going to allow this now separately for indexed and stored fields? I think we need to simplify the document.Document API. I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances. It is a userland kind of class and so is Field. We should make it easy for people to add the Fields that they want. If they want to have a Field which is both indexed and stored, then they can create it once and add it to Document. If they want to do it separately, then they can do that too. Since Field implements both IndexableField and StorableField, it can serve the dual purpose. That way the API in document.Document is pretty simple and you can add and remove things as done in the past.
          Hide
          Eks Dev added a comment -

          My assumption is that StoredField-s will never be used anymore as potential sources of token streams?

          One case where it might make sense are scenarios where a user wants to store analyzed field (not original) and later to to read it as TokenStream. Kind of TermVector without tf. I think I remember seing great patch with indexable-storable field (with serialization and deserialization).

          A user can do it in two passes, but sumetimes it is a not chep to analyze two times

          Show
          Eks Dev added a comment - My assumption is that StoredField-s will never be used anymore as potential sources of token streams? One case where it might make sense are scenarios where a user wants to store analyzed field (not original) and later to to read it as TokenStream. Kind of TermVector without tf. I think I remember seing great patch with indexable-storable field (with serialization and deserialization). A user can do it in two passes, but sumetimes it is a not chep to analyze two times
          Hide
          Andrzej Bialecki added a comment -

          We already have Document and it's going to become confusing with two different Document classes

          +1 to use a better name (LuceneDocument? AbstractDocument?).

          I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances.

          With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField? Well, at least it would allow for expressing the association between consecutive stored/indexed values that we can express now when creating a Document for indexing. But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField.

          I think I remember seing great patch with indexable-storable field (with serialization and deserialization).

          SOLR-1535 .

          Show
          Andrzej Bialecki added a comment - We already have Document and it's going to become confusing with two different Document classes +1 to use a better name (LuceneDocument? AbstractDocument?). I don't think it should hold Indexable/StorableField instances but instead should just hold Field instances. With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField? Well, at least it would allow for expressing the association between consecutive stored/indexed values that we can express now when creating a Document for indexing. But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField. I think I remember seing great patch with indexable-storable field (with serialization and deserialization). SOLR-1535 .
          Hide
          Chris Male added a comment -

          With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField?

          Yes, Nikola has included a StoredDocument class for that. This would prevent users from thinking they can just take a search result and pass it into being indexed. It creates a clear separation between indexing and search results.

          But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField.

          I agree that there are benefits to the decoupling. It's just that one of the important factors in this issue and other work in and around Document & Field is creating a cleaner API for users. I'm not sure bogging the document.Document API down with having to manage both Storable and IndexableField instances is worth it. Field is already basically a parent class with the extensive list of specializations we now have.

          I'm wondering whether expert users who are using their own Storable/IndexableField impls will also want their own 'Document' impls as well, maybe to support direct streaming of fields or something. If we enforce this, then we're have a consistent policy that to use these expert interfaces, you're going to have to provide your own implementations for everything.

          With all that said, I'm open to a clean API in Document that can do everything

          Show
          Chris Male added a comment - With the Field class implementing IndexableField and StorableField, and on retrieval returning a different class that implements only StorableField? Yes, Nikola has included a StoredDocument class for that. This would prevent users from thinking they can just take a search result and pass it into being indexed. It creates a clear separation between indexing and search results. But the strong decoupling of stored/indexed parts of a field has its benefits too (arbitrary sequences of stored/indexed parts of fields)... and if you require a specific implementation at the level of (input) Document then you prevent users from using their own impls. of strongly decoupled sequences of StoredField/IndexedField. I agree that there are benefits to the decoupling. It's just that one of the important factors in this issue and other work in and around Document & Field is creating a cleaner API for users. I'm not sure bogging the document.Document API down with having to manage both Storable and IndexableField instances is worth it. Field is already basically a parent class with the extensive list of specializations we now have. I'm wondering whether expert users who are using their own Storable/IndexableField impls will also want their own 'Document' impls as well, maybe to support direct streaming of fields or something. If we enforce this, then we're have a consistent policy that to use these expert interfaces, you're going to have to provide your own implementations for everything. With all that said, I'm open to a clean API in Document that can do everything
          Hide
          Michael McCandless added a comment -

          I think I like this decoupling ... for normal users I don't think this
          makes the API harder? They still work with TextField, FloatField,
          StoredField, etc.? It's just that, under the hood, these sugar classes
          extend from the right base class (indexed or stored).

          Document.add is just type overloaded, but Document.get* will get
          messier: we'll need getStored and getIndexed? I guess that would be
          simpler if Document could just store Field instances... hmm.

          It would also be less invasive change for migrating from 4.0 -> 5.0
          (assuming this issue is done only for 5.0...) if we didn't do the hard
          split.... else we need a back-compat story...

          We already have Document and it's going to become confusing with two different Document classes

          +1 to use a better name (LuceneDocument? AbstractDocument?).

          Maybe IndexDocument? I think it's OK as an interface if we mark it
          @lucene.internal? This is the raw, super expert low-level that indexer
          uses to consume documents... it has only 2 methods, and I think for
          expert users it could be a hassle if we force the impl to inherit from
          our base class...

          Should StoredDocument (returned from IR.document) be "read only"? Like
          you can iterate its fields, look them up, etc., but not eg remove them?

          We should probably rename document.Field -> document.IndexedField and
          document.Field -> document.IndexedFieldType?

          Also I think we should rename XXXField.TYPE_UNSTORED -> .TYPE, since in
          each case there's only 1 TYPE instance for that sugar field?

          Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED
          from all the sugar fields (TextField, StringField, etc.); expert users
          can always make a custom Indexable/Storable/Field/FieldType that both
          stores & indexes...

          Show
          Michael McCandless added a comment - I think I like this decoupling ... for normal users I don't think this makes the API harder? They still work with TextField, FloatField, StoredField, etc.? It's just that, under the hood, these sugar classes extend from the right base class (indexed or stored). Document.add is just type overloaded, but Document.get* will get messier: we'll need getStored and getIndexed? I guess that would be simpler if Document could just store Field instances... hmm. It would also be less invasive change for migrating from 4.0 -> 5.0 (assuming this issue is done only for 5.0...) if we didn't do the hard split.... else we need a back-compat story... We already have Document and it's going to become confusing with two different Document classes +1 to use a better name (LuceneDocument? AbstractDocument?). Maybe IndexDocument? I think it's OK as an interface if we mark it @lucene.internal? This is the raw, super expert low-level that indexer uses to consume documents... it has only 2 methods, and I think for expert users it could be a hassle if we force the impl to inherit from our base class... Should StoredDocument (returned from IR.document) be "read only"? Like you can iterate its fields, look them up, etc., but not eg remove them? We should probably rename document.Field -> document.IndexedField and document.Field -> document.IndexedFieldType? Also I think we should rename XXXField.TYPE_UNSTORED -> .TYPE, since in each case there's only 1 TYPE instance for that sugar field? Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED from all the sugar fields (TextField, StringField, etc.); expert users can always make a custom Indexable/Storable/Field/FieldType that both stores & indexes...
          Hide
          Chris Male added a comment -

          I am all for the decoupling too, just want to thoroughly kick the tyres on this one I dont want another FieldType like discussion.

          Document.add is just type overloaded, but Document.get* will get
          messier: we'll need getStored and getIndexed? I guess that would be
          simpler if Document could just store Field instances... hmm.

          Perhaps if we just limit the API in Document we can handle this okay. We can provide the overloaded add methods, two get methods and 1 remove method.

          Maybe IndexDocument? I think it's OK as an interface if we mark it
          @lucene.internal? This is the raw, super expert low-level that indexer
          uses to consume documents... it has only 2 methods, and I think for
          expert users it could be a hassle if we force the impl to inherit from
          our base class...

          +1 to both the name and the handling of the interface.

          Should StoredDocument (returned from IR.document) be "read only"? Like
          you can iterate its fields, look them up, etc., but not eg remove them?

          +1 You shouldn't really need to remove fields, you can achieve that by not retrieving them in the first place

          Show
          Chris Male added a comment - I am all for the decoupling too, just want to thoroughly kick the tyres on this one I dont want another FieldType like discussion. Document.add is just type overloaded, but Document.get* will get messier: we'll need getStored and getIndexed? I guess that would be simpler if Document could just store Field instances... hmm. Perhaps if we just limit the API in Document we can handle this okay. We can provide the overloaded add methods, two get methods and 1 remove method. Maybe IndexDocument? I think it's OK as an interface if we mark it @lucene.internal? This is the raw, super expert low-level that indexer uses to consume documents... it has only 2 methods, and I think for expert users it could be a hassle if we force the impl to inherit from our base class... +1 to both the name and the handling of the interface. Should StoredDocument (returned from IR.document) be "read only"? Like you can iterate its fields, look them up, etc., but not eg remove them? +1 You shouldn't really need to remove fields, you can achieve that by not retrieving them in the first place
          Hide
          Michael McCandless added a comment -

          Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED
          from all the sugar fields (TextField, StringField, etc.); expert users
          can always make a custom Indexable/Storable/Field/FieldType that both
          stores & indexes...

          I opened a new issue for this: LUCENE-4101.

          Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts!

          Show
          Michael McCandless added a comment - Separately, I think even for 4.0 we should remove XXXField.TYPE_STORED from all the sugar fields (TextField, StringField, etc.); expert users can always make a custom Indexable/Storable/Field/FieldType that both stores & indexes... I opened a new issue for this: LUCENE-4101 . Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts!
          Hide
          Nikola Tankovic added a comment -

          Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts!

          OK, let me know when you finish!

          Show
          Nikola Tankovic added a comment - Nikola, before you modify any of the tests for this issue you might want to wait for me to commit & forward port to 5.0 else there will be a lot of conflicts! OK, let me know when you finish!
          Hide
          Uwe Schindler added a comment -

          I am back @ office, trying to catch up.

          Show
          Uwe Schindler added a comment - I am back @ office, trying to catch up.
          Hide
          Michael McCandless added a comment -

          I think, given the discussions in LUCENE-4101, that the document.FieldType class should implement both index.IndexableFieldType and index.StorableFieldType, so that users can add a single Field instance that's both stored and indexed ...

          Show
          Michael McCandless added a comment - I think, given the discussions in LUCENE-4101 , that the document.FieldType class should implement both index.IndexableFieldType and index.StorableFieldType, so that users can add a single Field instance that's both stored and indexed ...
          Hide
          Chris Male added a comment -

          Following on from that, I don't think we should expose the IndexableField / StorableField decoupling in document.Document. It should remain an easy to use class for the most common use cases. In which case I think it should only use Field instances. We can then do some work for it to meet the needs of index.IndexDocument. Having it this way means users can choose whether they want to add a single instance for both stored and indexed, or two instances.

          Show
          Chris Male added a comment - Following on from that, I don't think we should expose the IndexableField / StorableField decoupling in document.Document. It should remain an easy to use class for the most common use cases. In which case I think it should only use Field instances. We can then do some work for it to meet the needs of index.IndexDocument. Having it this way means users can choose whether they want to add a single instance for both stored and indexed, or two instances.
          Hide
          Uwe Schindler added a comment -

          I think as we have 2 interfaces, we should only allow the interface StorableField added to StoredDocument and vice versa. The user still can add combined fields, as both interfaces are implemented? Do I miss something?

          Show
          Uwe Schindler added a comment - I think as we have 2 interfaces, we should only allow the interface StorableField added to StoredDocument and vice versa. The user still can add combined fields, as both interfaces are implemented? Do I miss something?
          Hide
          Chris Male added a comment -

          Yeah StoredDocument should only consist of StorableFields.

          Show
          Chris Male added a comment - Yeah StoredDocument should only consist of StorableFields.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          do you have any news or progress? I did not hear anything from you the last two weeks and I am a little bit nervous (because of GSoC).

          Show
          Uwe Schindler added a comment - Hi Nikola, do you have any news or progress? I did not hear anything from you the last two weeks and I am a little bit nervous (because of GSoC).
          Hide
          Nikola Tankovic added a comment -

          Patch 05:

          • Renamed index.Document to index.IndexDocument
          • Made oal.Document consisting only of Field's
          • Made Field implement both Storable and IndexableField
          • Made iterators to extract indexed and stored fields from document separately
          Show
          Nikola Tankovic added a comment - Patch 05: Renamed index.Document to index.IndexDocument Made oal.Document consisting only of Field's Made Field implement both Storable and IndexableField Made iterators to extract indexed and stored fields from document separately
          Hide
          Chris Male added a comment -

          Hey Nikola,

          Just did a quick pass over the patch.

          I have an alternative way to do the Indexable/StorableFieldsIterator in Document (it'll need the policeman's tick though):

          public abstract class SelectiveIterator<T> implements Iterator<T> {
          
            private T next;
            private final List<T> list;
            private int pos;
          
            public SelectiveIterator(List<T> list) {
              this.list = list;
            }
          
            @Override
            public void remove() {
              throw new UnsupportedOperationException();
            }
          
            @Override
            public boolean hasNext() {
              for (; pos < list.size(); pos++) {
                T t = list.get(pos);
                if (isNext(t)) {
                  next = t;
                  return true;
                }
              }
              return false;
            }
          
            @Override
            public T next() {
              return next;
            }
          
            abstract boolean isNext(T t);
          }
          

          I think that'll work. Then you can just create two instances which implement isNext differently.

          I also noticed that you've included import org.apache.commons.lang.NotImplementedException; in Document which will also need to be removed.

          Show
          Chris Male added a comment - Hey Nikola, Just did a quick pass over the patch. I have an alternative way to do the Indexable/StorableFieldsIterator in Document (it'll need the policeman's tick though): public abstract class SelectiveIterator<T> implements Iterator<T> { private T next; private final List<T> list; private int pos; public SelectiveIterator(List<T> list) { this .list = list; } @Override public void remove() { throw new UnsupportedOperationException(); } @Override public boolean hasNext() { for (; pos < list.size(); pos++) { T t = list.get(pos); if (isNext(t)) { next = t; return true ; } } return false ; } @Override public T next() { return next; } abstract boolean isNext(T t); } I think that'll work. Then you can just create two instances which implement isNext differently. I also noticed that you've included import org.apache.commons.lang.NotImplementedException; in Document which will also need to be removed.
          Hide
          Nikola Tankovic added a comment -

          I like the solution, but we will wait for Mr. Policeman

          Sorry about the NotImplementedException, I needed it temporarily, and will remove it!

          Show
          Nikola Tankovic added a comment - I like the solution, but we will wait for Mr. Policeman Sorry about the NotImplementedException, I needed it temporarily, and will remove it!
          Hide
          Uwe Schindler added a comment - - edited

          Chris: The iterator looks generics-wise correct, the big problem is that it only works correct with Lists implementing RandomAccess. To work performant and correct on all Lists, it should use the ListIterator/Iterator of the wrapped List.

          The second problem is that it violates the Iterator pattern: You should be able to call next() without calling hasNext() before and you must also be able to call hasNext() multiple times. So the "iteration" logic must be in next(). Ideally you can do that by coping some code from commons-collections or google-collect. The general pattern how to implement filtering decorators for other Iterators is to use a doNext() method which moves the wrapped iterator forward and is called inside next() after the current value was returned (means the wrapped iterator is already one step further than the outer iterator).

          Uwe

          Show
          Uwe Schindler added a comment - - edited Chris: The iterator looks generics-wise correct, the big problem is that it only works correct with Lists implementing RandomAccess. To work performant and correct on all Lists, it should use the ListIterator/Iterator of the wrapped List. The second problem is that it violates the Iterator pattern: You should be able to call next() without calling hasNext() before and you must also be able to call hasNext() multiple times. So the "iteration" logic must be in next(). Ideally you can do that by coping some code from commons-collections or google-collect. The general pattern how to implement filtering decorators for other Iterators is to use a doNext() method which moves the wrapped iterator forward and is called inside next() after the current value was returned (means the wrapped iterator is already one step further than the outer iterator). Uwe
          Hide
          Chris Male added a comment -

          Yes you're right on all accounts.

          Taking a look at Google Guava I see AbstractIterator might be an example of what we want?

          Show
          Chris Male added a comment - Yes you're right on all accounts. Taking a look at Google Guava I see AbstractIterator might be an example of what we want?
          Show
          Uwe Schindler added a comment - This is an example how to do it correct (just add Generics, please): http://javasourcecode.org/html/open-source/commons-collections/commons-collections-3.2.1/org/apache/commons/collections/iterators/FilterIterator.java.html
          Hide
          Nikola Tankovic added a comment -

          Can I use something like this?

          
            private FilterIterator storedFieldsIterator = new FilterIterator(fields.iterator(), new Predicate() {
              @Override
              public boolean evaluate(Object field) {
                if (field instanceof Field) {
                  Field f = (Field) field;
                  return f.type.stored();
                }
                else return false;
              }
            });
          
          
          Show
          Nikola Tankovic added a comment - Can I use something like this? private FilterIterator storedFieldsIterator = new FilterIterator(fields.iterator(), new Predicate() { @Override public boolean evaluate( Object field) { if (field instanceof Field) { Field f = (Field) field; return f.type.stored(); } else return false ; } });
          Hide
          Chris Male added a comment -

          I think we should make it so you can use appropriate generics and not need that instanceof

          Show
          Chris Male added a comment - I think we should make it so you can use appropriate generics and not need that instanceof
          Hide
          Uwe Schindler added a comment -

          The code I sent was just an example. We have to duplicate the code, as Lucene-core allows no external dependencies. I would also nuke the Predicate class. My proposal was just to fix Chris' code to look like the one from commons-collections (including the generics like Chris proposed). And replace the Predicate class by the abstract method (this saves the stupid object creation).

          Show
          Uwe Schindler added a comment - The code I sent was just an example. We have to duplicate the code, as Lucene-core allows no external dependencies. I would also nuke the Predicate class. My proposal was just to fix Chris' code to look like the one from commons-collections (including the generics like Chris proposed). And replace the Predicate class by the abstract method (this saves the stupid object creation).
          Hide
          Nikola Tankovic added a comment -

          Maybe something like this?

            public static Iterator<StorableField> storedFieldsIterator(final Iterator<Field> in) {
              return new AbstractIterator<StorableField>() {
                protected StorableField computeNext() {
                  while (in.hasNext()) {
                    Field f = in.next();
                    if (f.type.stored()) {
                      return f;
                    }
                  }
                  return endOfData();
                }
              };
            }
          
          Show
          Nikola Tankovic added a comment - Maybe something like this? public static Iterator<StorableField> storedFieldsIterator( final Iterator<Field> in) { return new AbstractIterator<StorableField>() { protected StorableField computeNext() { while (in.hasNext()) { Field f = in.next(); if (f.type.stored()) { return f; } } return endOfData(); } }; }
          Hide
          Nikola Tankovic added a comment -

          OK, so no AbstractIterator then. I will make new generic iterator with abstract method.

          Show
          Nikola Tankovic added a comment - OK, so no AbstractIterator then. I will make new generic iterator with abstract method.
          Hide
          Nikola Tankovic added a comment -

          Let me know if I can further improve this:

          
            public Iterator<StorableField> storedFieldsIterator() {
              return new FieldFilterIterator<StorableField>() {
                @Override
                protected boolean predicateFunction(Field field) {
                  return field.type.stored();
                }
              };
            }
            
            public Iterator<IndexableField> indexedFieldsIterator() {
              return new FieldFilterIterator<IndexableField>() {
                @Override
                protected boolean predicateFunction(Field field) {
                  return field.type.indexed();
                }
              };
            }
            
            private abstract class FieldFilterIterator<T> implements Iterator<T> {
          
              private Iterator<Field> iterator = fields.iterator();
              private T nextField = null;
              private boolean nextIsSet = false;
              
              protected abstract boolean predicateFunction(Field field);
          
              public boolean hasNext() {
                  if (nextIsSet) {
                      return true;
                  } else {
                      return setNext();
                  }
              }
              
              public T next() {
                  if (!nextIsSet) {
                      if (!setNext()) {
                          throw new NoSuchElementException();
                      }
                  }
                  nextIsSet = false;
                  return nextField;
              }
          
              public void remove() {
                  if (nextIsSet) {
                      throw new IllegalStateException("remove() cannot be called");
                  }
                  iterator.remove();
              }
          
              private boolean setNext() {
                  while (iterator.hasNext()) {
                      Field field = iterator.next();
                      if (predicateFunction(field)) {
                          nextField = (T) field;
                          nextIsSet = true;
                          return true;
                      }
                  }
                  return false;
              }
            }
          
          Show
          Nikola Tankovic added a comment - Let me know if I can further improve this: public Iterator<StorableField> storedFieldsIterator() { return new FieldFilterIterator<StorableField>() { @Override protected boolean predicateFunction(Field field) { return field.type.stored(); } }; } public Iterator<IndexableField> indexedFieldsIterator() { return new FieldFilterIterator<IndexableField>() { @Override protected boolean predicateFunction(Field field) { return field.type.indexed(); } }; } private abstract class FieldFilterIterator<T> implements Iterator<T> { private Iterator<Field> iterator = fields.iterator(); private T nextField = null ; private boolean nextIsSet = false ; protected abstract boolean predicateFunction(Field field); public boolean hasNext() { if (nextIsSet) { return true ; } else { return setNext(); } } public T next() { if (!nextIsSet) { if (!setNext()) { throw new NoSuchElementException(); } } nextIsSet = false ; return nextField; } public void remove() { if (nextIsSet) { throw new IllegalStateException( "remove() cannot be called" ); } iterator.remove(); } private boolean setNext() { while (iterator.hasNext()) { Field field = iterator.next(); if (predicateFunction(field)) { nextField = (T) field; nextIsSet = true ; return true ; } } return false ; } }
          Hide
          Uwe Schindler added a comment - - edited

          Looks fine, I would disallow remove() completely like in Chris' iterator.

          One addition, the declaration of the abstract iterator base class should include <T extends Field>, otherweise its not generics correct.

          Show
          Uwe Schindler added a comment - - edited Looks fine, I would disallow remove() completely like in Chris' iterator. One addition, the declaration of the abstract iterator base class should include <T extends Field>, otherweise its not generics correct.
          Hide
          Chris Male added a comment -

          Why not make this a public class in util.* and keep the generic <T> but make the whole thing generic as well (remove Field and just use T). It seems like a useful class.

          Show
          Chris Male added a comment - Why not make this a public class in util.* and keep the generic <T> but make the whole thing generic as well (remove Field and just use T). It seems like a useful class.
          Hide
          Uwe Schindler added a comment -

          +1

          Show
          Uwe Schindler added a comment - +1
          Hide
          Nikola Tankovic added a comment -

          I think this is now OK, I'll put in util.*

            public Iterator<StorableField> storedFieldsIterator() {
              return new FilterIterator<StorableField, Field>(fields.iterator()) {
                @Override
                protected boolean predicateFunction(Field field) {
                  return field.type.stored();
                }
              };
            }
            
            public Iterator<IndexableField> indexedFieldsIterator() {
              return new FilterIterator<IndexableField, Field>(fields.iterator()) {
                @Override
                protected boolean predicateFunction(Field field) {
                  return field.type.indexed();
                }
              };
            }
            
            private abstract class FilterIterator<T, U extends T> implements Iterator<T> {
          
              private Iterator<U> iterator;
              private T next = null;
              private boolean nextIsSet = false;
              
              protected abstract boolean predicateFunction(U field);
          
              public FilterIterator(Iterator<U> baseIterator) {
                this.iterator = baseIterator;
              }
              
              public boolean hasNext() {
                if (nextIsSet) {
                  return true;
                } else {
                  return setNext();
                }
              }
              
              public T next() {
                if (!nextIsSet) {
                  if (!setNext()) { 
                    throw new NoSuchElementException();
                  }
                }
                nextIsSet = false;
                return next;
              }
          
              public void remove() {
                throw new UnsupportedOperationException();
              }
          
              private boolean setNext() {
                while (iterator.hasNext()) {
                  U object = iterator.next();
                  if (predicateFunction(object)) {
                    next = object;
                    nextIsSet = true;
                    return true;
                  }
                }
                return false;
              }
            }
          
          Show
          Nikola Tankovic added a comment - I think this is now OK, I'll put in util.* public Iterator<StorableField> storedFieldsIterator() { return new FilterIterator<StorableField, Field>(fields.iterator()) { @Override protected boolean predicateFunction(Field field) { return field.type.stored(); } }; } public Iterator<IndexableField> indexedFieldsIterator() { return new FilterIterator<IndexableField, Field>(fields.iterator()) { @Override protected boolean predicateFunction(Field field) { return field.type.indexed(); } }; } private abstract class FilterIterator<T, U extends T> implements Iterator<T> { private Iterator<U> iterator; private T next = null ; private boolean nextIsSet = false ; protected abstract boolean predicateFunction(U field); public FilterIterator(Iterator<U> baseIterator) { this .iterator = baseIterator; } public boolean hasNext() { if (nextIsSet) { return true ; } else { return setNext(); } } public T next() { if (!nextIsSet) { if (!setNext()) { throw new NoSuchElementException(); } } nextIsSet = false ; return next; } public void remove() { throw new UnsupportedOperationException(); } private boolean setNext() { while (iterator.hasNext()) { U object = iterator.next(); if (predicateFunction(object)) { next = object; nextIsSet = true ; return true ; } } return false ; } }
          Hide
          Nikola Tankovic added a comment - - edited

          Patch 06: summary of discussions. If we can somewhat agree on this core API then I would like move along to fixing tests, as that is quite a lot of work

          Show
          Nikola Tankovic added a comment - - edited Patch 06: summary of discussions. If we can somewhat agree on this core API then I would like move along to fixing tests, as that is quite a lot of work
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          I will think about the core API and give my comments later.
          As changing tests and solr is really the biggest change, we should create a branch and do it step for step. I would commit the current patch to a branched trunk (5.0) and then you can work with a new checkout from there and I will commit the later steps. This also allows heavy™ commiting™ by other committers. Unfortunately I cannot give you commit access to Apache's SVN.

          Show
          Uwe Schindler added a comment - Hi Nikola, I will think about the core API and give my comments later. As changing tests and solr is really the biggest change, we should create a branch and do it step for step. I would commit the current patch to a branched trunk (5.0) and then you can work with a new checkout from there and I will commit the later steps. This also allows heavy™ commiting™ by other committers. Unfortunately I cannot give you commit access to Apache's SVN.
          Hide
          Nikola Tankovic added a comment -

          Agreed! No problem about commit access, I'll send patches

          Show
          Nikola Tankovic added a comment - Agreed! No problem about commit access, I'll send patches
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,

          I merged your patch to current tunk (removed lots of throws CorruptIndexException) and undid some formatti ng Changes in IndexReader.

          Can you configure your IDE to not change unrelated code? This makes merging extremely hard.

          I committed this in r1357938 to a new branch https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3312 (from trunk r1357904, Lucene 5.0). Please use this branch for further work! I will merge it regularily when bigger changes are in main dev, so please keep it updated. Please provide new patches against this branch.

          Show
          Uwe Schindler added a comment - Hi Nikola, I merged your patch to current tunk (removed lots of throws CorruptIndexException) and undid some formatti ng Changes in IndexReader. Can you configure your IDE to not change unrelated code? This makes merging extremely hard. I committed this in r1357938 to a new branch https://svn.apache.org/repos/asf/lucene/dev/branches/lucene3312 (from trunk r1357904, Lucene 5.0). Please use this branch for further work! I will merge it regularily when bigger changes are in main dev, so please keep it updated. Please provide new patches against this branch.
          Hide
          Uwe Schindler added a comment -

          And as noted before: In Lucene core we dont use extrenal dependencies, so we have a compile failure because of AbstractIterator from Google Collect, we have to put this one into Lucene utils.

          Show
          Uwe Schindler added a comment - And as noted before: In Lucene core we dont use extrenal dependencies, so we have a compile failure because of AbstractIterator from Google Collect, we have to put this one into Lucene utils.
          Hide
          Nikola Tankovic added a comment -

          Uwe,

          I apologize for inconveniences. I don't use AbstractIterator from Google any more, and I did put custom implementation in util, but I forgot to remove "import" declaration. I certainly will configure my IDE to not change unrelated code (apologies once again). I switched to your new branch, and will work on it from now.

          The big question is: can I go to fixing solr and tests or do you think there is some major API change left to do?

          Show
          Nikola Tankovic added a comment - Uwe, I apologize for inconveniences. I don't use AbstractIterator from Google any more, and I did put custom implementation in util, but I forgot to remove "import" declaration. I certainly will configure my IDE to not change unrelated code (apologies once again). I switched to your new branch, and will work on it from now. The big question is: can I go to fixing solr and tests or do you think there is some major API change left to do?
          Hide
          Chris Male added a comment -

          My feeling at least is that we should definitely get going on Solr and tests since they are good ways to see if the API can be consumed. A failing test might reveal something we haven't considered.

          Show
          Chris Male added a comment - My feeling at least is that we should definitely get going on Solr and tests since they are good ways to see if the API can be consumed. A failing test might reveal something we haven't considered.
          Hide
          Uwe Schindler added a comment -

          Fine, thanks. I had no time to do API wise checks, maybe Chris had a closer look. Great work in all cases!

          Uwe.

          Show
          Uwe Schindler added a comment - Fine, thanks. I had no time to do API wise checks, maybe Chris had a closer look. Great work in all cases! Uwe.
          Hide
          Nikola Tankovic added a comment - - edited

          Hi, I'm going over tests and so far so good.

          I am seeing a lot of "id" fields in document that aren't stored, e.g.:

                doc.add(new IntField("id", docCount, Field.Store.NO));
          

          ... and having errors because my returned StoredDocument only shows stored fields. Can I convert these "id" fields into stored in tests? Or did we do something wrong with StoredDocument?

          Nikola

          Show
          Nikola Tankovic added a comment - - edited Hi, I'm going over tests and so far so good. I am seeing a lot of "id" fields in document that aren't stored, e.g.: doc.add( new IntField( "id" , docCount, Field.Store.NO)); ... and having errors because my returned StoredDocument only shows stored fields. Can I convert these "id" fields into stored in tests? Or did we do something wrong with StoredDocument? Nikola
          Hide
          Chris Male added a comment -

          If the test is wanting to retrieve the ID field for a StoredDocument then yes the field will need to be stored.

          Show
          Chris Male added a comment - If the test is wanting to retrieve the ID field for a StoredDocument then yes the field will need to be stored.
          Hide
          Michael McCandless added a comment -

          Hmm ... if the test is running fine today, not storing the id field, then why would it need to start storing it on switching to returning StoredDocument from IR.document...? In theory this should be a rote change?

          Show
          Michael McCandless added a comment - Hmm ... if the test is running fine today, not storing the id field, then why would it need to start storing it on switching to returning StoredDocument from IR.document...? In theory this should be a rote change?
          Hide
          Michael McCandless added a comment -

          Branch looks good!

          But we still seem to have StoredDocument.removeField/s methods? Shouldn't that class be read-only?

          Show
          Michael McCandless added a comment - Branch looks good! But we still seem to have StoredDocument.removeField/s methods? Shouldn't that class be read-only?
          Hide
          Chris Male added a comment -

          Hmm ... if the test is running fine today, not storing the id field, then why would it need to start storing it on switching to returning StoredDocument from IR.document...? In theory this should be a rote change?

          Good point

          But we still seem to have StoredDocument.removeField/s methods? Shouldn't that class be read-only?

          +1

          Show
          Chris Male added a comment - Hmm ... if the test is running fine today, not storing the id field, then why would it need to start storing it on switching to returning StoredDocument from IR.document...? In theory this should be a rote change? Good point But we still seem to have StoredDocument.removeField/s methods? Shouldn't that class be read-only? +1
          Hide
          Nikola Tankovic added a comment -

          Patch 07: Core compiles and core tests looks good (except for one).

          I have only a problem in org.apache.lucene.index.TestIndexableField test. I cannot get it working, any help is appreciated, because I spent days with no success. I'm just missing something

          Regarding 'deleteField' in 'StoredDocument', I cannot remove it 'cause of PersistentSnapshotDeletionPolicy::readSnapshotsInfo function, I guess it needs refactoring.

          Show
          Nikola Tankovic added a comment - Patch 07: Core compiles and core tests looks good (except for one). I have only a problem in org.apache.lucene.index.TestIndexableField test. I cannot get it working, any help is appreciated, because I spent days with no success. I'm just missing something Regarding 'deleteField' in 'StoredDocument', I cannot remove it 'cause of PersistentSnapshotDeletionPolicy::readSnapshotsInfo function, I guess it needs refactoring.
          Hide
          Chris Male added a comment -

          Regarding 'deleteField' in 'StoredDocument', I cannot remove it 'cause of PersistentSnapshotDeletionPolicy::readSnapshotsInfo function, I guess it needs refactoring.

          All that code seems to be doing is removing a specific field from the Document and then iterating over the remaining values in the Document. It seems an easy change to just skip the field during the for loop.

          Show
          Chris Male added a comment - Regarding 'deleteField' in 'StoredDocument', I cannot remove it 'cause of PersistentSnapshotDeletionPolicy::readSnapshotsInfo function, I guess it needs refactoring. All that code seems to be doing is removing a specific field from the Document and then iterating over the remaining values in the Document. It seems an easy change to just skip the field during the for loop.
          Hide
          Uwe Schindler added a comment -

          I applied the patch to the branch: At revision: 1359139

          I will now do svn merge to keep branch up-to-date!

          Show
          Uwe Schindler added a comment - I applied the patch to the branch: At revision: 1359139 I will now do svn merge to keep branch up-to-date!
          Hide
          Uwe Schindler added a comment - - edited

          Done, no new compile failures! At revision: 1359151

          I get 2 unneeded cast warnings in lucene-core, that's all. Lucene Core Tests pass, did you disable the failing tests? (1 test failing as expected)

          Show
          Uwe Schindler added a comment - - edited Done, no new compile failures! At revision: 1359151 I get 2 unneeded cast warnings in lucene-core, that's all. Lucene Core Tests pass, did you disable the failing tests? (1 test failing as expected)
          Hide
          Nikola Tankovic added a comment -

          Hi Uwe,

          how can I see those warnings?

          Thank you!

          Show
          Nikola Tankovic added a comment - Hi Uwe, how can I see those warnings? Thank you!
          Hide
          Uwe Schindler added a comment -

          "ant compile" on command line. Before sending the patch, you should always build & test on command line with ant!

          Show
          Uwe Schindler added a comment - "ant compile" on command line. Before sending the patch, you should always build & test on command line with ant!
          Hide
          Nikola Tankovic added a comment -

          Yes, of course Mike taught me that. The problem is I don't see those warning with "ant compile".

          Show
          Nikola Tankovic added a comment - Yes, of course Mike taught me that. The problem is I don't see those warning with "ant compile".
          Hide
          Uwe Schindler added a comment -

          Maybe you missed to clean before? ANT's javac only compiles files not yet compiled, so warnings only show on first time?

          C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core>ant clean compile
          Buildfile: C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\build.xml
          
          clean:
          
          jflex-uptodate-check:
          
          jflex-notice:
          
          javacc-uptodate-check:
          
          javacc-notice:
          
          ivy-availability-check:
          
          ivy-fail:
          
          ivy-configure:
          [ivy:configure] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ ::
          [ivy:configure] :: loading settings :: file = C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\ivy-settings.xml
          
          resolve:
          
          init:
          
          clover.setup:
          
          clover.info:
               [echo]
               [echo]       Clover not found. Code coverage reports disabled.
               [echo]
          
          clover:
          
          common.compile-core:
              [mkdir] Created dir: C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java
              [javac] Compiling 634 source files to C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java
              [javac] C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\src\java\org\apache\lucene\index\DocFieldProcessor.java:24
          3: warning: [cast] redundant cast to org.apache.lucene.index.StorableField
              [javac]           consumer.add(docState.docID, (StorableField) field);
              [javac]                                        ^
              [javac] C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\src\java\org\apache\lucene\index\NormsConsumerPerField.jav
          a:57: warning: [cast] redundant cast to org.apache.lucene.index.StorableField
              [javac]         consumer.add(docState.docID, (StorableField) field);
              [javac]                                      ^
              [javac] 2 warnings
               [copy] Copying 2 files to C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java
          
          compile-core:
          
          compile:
          
          BUILD SUCCESSFUL
          Total time: 13 seconds
          
          Show
          Uwe Schindler added a comment - Maybe you missed to clean before? ANT's javac only compiles files not yet compiled, so warnings only show on first time? C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core>ant clean compile Buildfile: C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\build.xml clean: jflex-uptodate-check: jflex-notice: javacc-uptodate-check: javacc-notice: ivy-availability-check: ivy-fail: ivy-configure: [ivy:configure] :: Ivy 2.2.0 - 20100923230623 :: http://ant.apache.org/ivy/ :: [ivy:configure] :: loading settings :: file = C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\ivy-settings.xml resolve: init: clover.setup: clover.info: [echo] [echo] Clover not found. Code coverage reports disabled. [echo] clover: common.compile-core: [mkdir] Created dir: C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java [javac] Compiling 634 source files to C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java [javac] C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\src\java\org\apache\lucene\index\DocFieldProcessor.java:24 3: warning: [cast] redundant cast to org.apache.lucene.index.StorableField [javac] consumer.add(docState.docID, (StorableField) field); [javac] ^ [javac] C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\core\src\java\org\apache\lucene\index\NormsConsumerPerField.jav a:57: warning: [cast] redundant cast to org.apache.lucene.index.StorableField [javac] consumer.add(docState.docID, (StorableField) field); [javac] ^ [javac] 2 warnings [copy] Copying 2 files to C:\Users\Uwe Schindler\Projects\lucene\lucene3312\lucene\build\core\classes\java compile-core: compile: BUILD SUCCESSFUL Total time: 13 seconds
          Hide
          Nikola Tankovic added a comment -

          Sorry for my n00b-iness. Thank you!

          Show
          Nikola Tankovic added a comment - Sorry for my n00b-iness. Thank you!
          Hide
          Uwe Schindler added a comment -

          I merged trunk again (because of LUCENE-4199): revision: 1359283

          Show
          Uwe Schindler added a comment - I merged trunk again (because of LUCENE-4199 ): revision: 1359283
          Hide
          Uwe Schindler added a comment -

          Nikola: Did you get the test running now? Otherwise I see no problems with the code at the moment, but I will wait for other comment contributions!

          I still have a question to the iterator again:

          public abstract class FilterIterator<T, U extends T> implements Iterator<T> {
          

          This seems strange U extends T, so the iterator returns a wider type than it was in the original. I would expect it to be the other way round. In general for this FilteredIterator I would make no generics magic and let it return the same as the delegate. If the predicate changes type, then this should be done by the caller (who provides the predicate). Do I miss something?

          Show
          Uwe Schindler added a comment - Nikola: Did you get the test running now? Otherwise I see no problems with the code at the moment, but I will wait for other comment contributions! I still have a question to the iterator again: public abstract class FilterIterator<T, U extends T> implements Iterator<T> { This seems strange U extends T, so the iterator returns a wider type than it was in the original. I would expect it to be the other way round. In general for this FilteredIterator I would make no generics magic and let it return the same as the delegate. If the predicate changes type, then this should be done by the caller (who provides the predicate). Do I miss something?
          Hide
          Nikola Tankovic added a comment -

          Hi Uwe,

          I have the test fixed, and both Lucene and Solr are successfully compiling now

          You are right about the FilterIterator,

          public abstract class FilterIterator<T> implements Iterator<T> {
          

          is enough..

          I am now fixing test on rest of Lucene and Solr.

          Show
          Nikola Tankovic added a comment - Hi Uwe, I have the test fixed, and both Lucene and Solr are successfully compiling now You are right about the FilterIterator, public abstract class FilterIterator<T> implements Iterator<T> { is enough.. I am now fixing test on rest of Lucene and Solr.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          the second half of the GSoC started now. What are the plans for the second part? I expected that the Solr+Test changes proceed now, so we have enough time check the new API in general use and fix issues!

          Show
          Uwe Schindler added a comment - Hi Nikola, the second half of the GSoC started now. What are the plans for the second part? I expected that the Solr+Test changes proceed now, so we have enough time check the new API in general use and fix issues!
          Hide
          Nikola Tankovic added a comment -

          Hi Uwe,

          in a few days I will hopefully finish with Solr+Test part. Then we can do another round of discussion, API checking and modifying.

          Show
          Nikola Tankovic added a comment - Hi Uwe, in a few days I will hopefully finish with Solr+Test part. Then we can do another round of discussion, API checking and modifying.
          Hide
          Uwe Schindler added a comment -

          Is there anything at the moment you need help with?

          Show
          Uwe Schindler added a comment - Is there anything at the moment you need help with?
          Hide
          Nikola Tankovic added a comment -

          Not at the moment, just little more time A lot of code here

          Show
          Nikola Tankovic added a comment - Not at the moment, just little more time A lot of code here
          Hide
          Nikola Tankovic added a comment -

          I keep getting this error:

          [junit4:junit4] Suite: org.apache.lucene.benchmark.quality.TestQualityRun
          [junit4:junit4] FAILURE 27.4s | TestQualityRun.testTrecQuality
          [junit4:junit4]    > Throwable #1: java.lang.AssertionError: avg-p should be perfect: 0.9856606205097583 expected:<1.0> but was:<0.9856606205097583>
          

          Should I worry?

          Show
          Nikola Tankovic added a comment - I keep getting this error: [junit4:junit4] Suite: org.apache.lucene.benchmark.quality.TestQualityRun [junit4:junit4] FAILURE 27.4s | TestQualityRun.testTrecQuality [junit4:junit4] > Throwable #1: java.lang.AssertionError: avg-p should be perfect: 0.9856606205097583 expected:<1.0> but was:<0.9856606205097583> Should I worry?
          Hide
          Robert Muir added a comment -

          Yes. There aren't many tests that test that Lucene's default ranking is correct, but this is one of them.

          This means something is wrong...

          Show
          Robert Muir added a comment - Yes. There aren't many tests that test that Lucene's default ranking is correct, but this is one of them. This means something is wrong...
          Hide
          Uwe Schindler added a comment -

          Hi Nikola, should I merge up the branch to trunk? Do you have anything to commit?

          Show
          Uwe Schindler added a comment - Hi Nikola, should I merge up the branch to trunk? Do you have anything to commit?
          Hide
          Nikola Tankovic added a comment -

          Patch 08: Everything compiles, Lucene tests pass except for 'TestQualityRun' (I'm struggling with this one). Solr tests are yet to be fixed. Uwe, you can merge in the trunk, so I can continue.

          Show
          Nikola Tankovic added a comment - Patch 08: Everything compiles, Lucene tests pass except for 'TestQualityRun' (I'm struggling with this one). Solr tests are yet to be fixed. Uwe, you can merge in the trunk, so I can continue.
          Hide
          Uwe Schindler added a comment -

          Thanks Nikola, I applied your patch in revision: 1366638

          The merging is now running.

          Lucene tests pass except for 'TestQualityRun' (I'm struggling with this one)

          I think Robert Muir may be able to help you, he already had some comments about this. I will ping him!

          Show
          Uwe Schindler added a comment - Thanks Nikola, I applied your patch in revision: 1366638 The merging is now running. Lucene tests pass except for 'TestQualityRun' (I'm struggling with this one) I think Robert Muir may be able to help you, he already had some comments about this. I will ping him!
          Hide
          Uwe Schindler added a comment -

          Merged up to trunk in revision: 1366643

          Show
          Uwe Schindler added a comment - Merged up to trunk in revision: 1366643
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          next week is "pencils down", so we should start to finish this task and do final things like "scrub code, write tests, improve documentation" (official google description). Did you find out whats causing your test failures? I may try to look into it this evening, so I will try to find out.
          Should I merge up to trunk?
          The final week until Fri, 24th should be used to prepare the final branch reintegrate and provide the final patch (that could also be sent to Google).

          Show
          Uwe Schindler added a comment - Hi Nikola, next week is "pencils down", so we should start to finish this task and do final things like "scrub code, write tests, improve documentation" (official google description). Did you find out whats causing your test failures? I may try to look into it this evening, so I will try to find out. Should I merge up to trunk? The final week until Fri, 24th should be used to prepare the final branch reintegrate and provide the final patch (that could also be sent to Google).
          Hide
          Nikola Tankovic added a comment - - edited

          Patch 09: small fix that fixes many Solr test, Solr now compiles and tests all OK! Lucene compiles, every test OK, except for mentioned TestQualityRun.testTrecQuality. I am really struggling with this one, any help is appreciated!

          Some tests throw OutOfMemory errors (but that was also last year), so I think this is one final test to fix.

          Show
          Nikola Tankovic added a comment - - edited Patch 09: small fix that fixes many Solr test, Solr now compiles and tests all OK! Lucene compiles, every test OK, except for mentioned TestQualityRun.testTrecQuality. I am really struggling with this one, any help is appreciated! Some tests throw OutOfMemory errors (but that was also last year), so I think this is one final test to fix.
          Hide
          Uwe Schindler added a comment -

          Some tests throw OutOfMemory errors (but that was also last year), so I think this is one final test to fix.

          They should not do this when ran with "ant test". If they do in eclipse or other GUIs it can happen because the default test -Xmx is 512M for Lucene's build.xml, which is not respected by Eclipse.

          I will apply the patch later!

          Show
          Uwe Schindler added a comment - Some tests throw OutOfMemory errors (but that was also last year), so I think this is one final test to fix. They should not do this when ran with "ant test". If they do in eclipse or other GUIs it can happen because the default test -Xmx is 512M for Lucene's build.xml, which is not respected by Eclipse. I will apply the patch later!
          Hide
          Uwe Schindler added a comment -

          Patch applied revision: 1371131

          Show
          Uwe Schindler added a comment - Patch applied revision: 1371131
          Hide
          Uwe Schindler added a comment -

          I merged the branch up to current trunk (revision: 1371142)

          By this merge, new compile failures in tests occur, mainly caused by new tests added in some commits, using the old API. It would be good to fix those.

          Show
          Uwe Schindler added a comment - I merged the branch up to current trunk (revision: 1371142) By this merge, new compile failures in tests occur, mainly caused by new tests added in some commits, using the old API. It would be good to fix those.
          Hide
          Chris Male added a comment -

          Hey Nikola,

          except for mentioned TestQualityRun.testTrecQuality.

          I'm happy to help work out what is going wrong here, have you done any debugging of the test yourself? What have you worked out so far?

          Show
          Chris Male added a comment - Hey Nikola, except for mentioned TestQualityRun.testTrecQuality. I'm happy to help work out what is going wrong here, have you done any debugging of the test yourself? What have you worked out so far?
          Hide
          Nikola Tankovic added a comment - - edited

          Well some stats in this test are hurt:

          Stat nr.3 => Avp: 1.0, Recall: 1.0
          Stat nr.4 => Avp: 0.9919381323163302, Recall: 1.0
          Stat nr.5 => Avp: 0.9978136200716846, Recall: 1.0
          Stat nr.6 => Avp: 1.0, Recall: 1.0
          Stat nr.7 => Avp: 1.0, Recall: 1.0
          Stat nr.11 => Avp: 0.9978136200716846, Recall: 1.0
          Stat nr.12 => Avp: 0.9989247311827957, Recall: 1.0
          Stat nr.13 => Avp: 0.9856606205097583, Recall: 1.0
          Stat nr.14 => Avp: 1.0, Recall: 1.0
          Stat nr.15 => Avp: 0.9934262275544256, Recall: 1.0
          Stat nr.19 => Avp: 0.9862623857947443, Recall: 1.0
          

          Avp below 0.99 is not allowed, what could be the possible reasons of bad Avg-p?

          Show
          Nikola Tankovic added a comment - - edited Well some stats in this test are hurt: Stat nr.3 => Avp: 1.0, Recall: 1.0 Stat nr.4 => Avp: 0.9919381323163302, Recall: 1.0 Stat nr.5 => Avp: 0.9978136200716846, Recall: 1.0 Stat nr.6 => Avp: 1.0, Recall: 1.0 Stat nr.7 => Avp: 1.0, Recall: 1.0 Stat nr.11 => Avp: 0.9978136200716846, Recall: 1.0 Stat nr.12 => Avp: 0.9989247311827957, Recall: 1.0 Stat nr.13 => Avp: 0.9856606205097583, Recall: 1.0 Stat nr.14 => Avp: 1.0, Recall: 1.0 Stat nr.15 => Avp: 0.9934262275544256, Recall: 1.0 Stat nr.19 => Avp: 0.9862623857947443, Recall: 1.0 Avp below 0.99 is not allowed, what could be the possible reasons of bad Avg-p?
          Hide
          Robert Muir added a comment -

          Something is going wrong with the indexing of the reuters content.

          I ran the test with SimpleText on both branches (adding forceMerge(1) for simplicity) and looked at the resulting index:

          Trunk:

          -rw-rw-r-- 1 rmuir rmuir   13798 Aug  9 09:42 _0_6.len
          -rw-rw-r-- 1 rmuir rmuir 1022509 Aug  9 09:42 _0.fld
          -rw-rw-r-- 1 rmuir rmuir    1310 Aug  9 09:42 _0.inf
          -rw-rw-r-- 1 rmuir rmuir 3345582 Aug  9 09:42 _0.pst
          -rw-rw-r-- 1 rmuir rmuir     513 Aug  9 09:42 _0.si
          -rw-rw-r-- 1 rmuir rmuir      71 Aug  9 09:42 segments_1
          -rw-rw-r-- 1 rmuir rmuir      20 Aug  9 09:42 segments.gen
          

          Branch:

          -rw-rw-r-- 1 rmuir rmuir     13262 Aug  9 09:46 _4_6.len
          -rw-rw-r-- 1 rmuir rmuir 290247032 Aug  9 09:45 _4.fld
          -rw-rw-r-- 1 rmuir rmuir      1310 Aug  9 09:46 _4.inf
          -rw-rw-r-- 1 rmuir rmuir 459164224 Aug  9 09:46 _4.pst
          -rw-rw-r-- 1 rmuir rmuir       593 Aug  9 09:46 _4.si
          -rw-rw-r-- 1 rmuir rmuir        71 Aug  9 09:46 segments_1
          -rw-rw-r-- 1 rmuir rmuir        20 Aug  9 09:46 segments.gen
          

          Looking into the .fld file, I think the problem is obvious:
          on trunk:

          doc 0
            numfields 5
          doc 1
            numfields 5
          doc 2
            numfields 5
          

          on branch:

          doc 0
            numfields 5
          doc 1
            numfields 10
          doc 2
            numfields 15
          

          So there is some bug, where a field is 'accumulating' across documents. The last document has 2890.

          I'm really horrified this is the only test that fails!

          Show
          Robert Muir added a comment - Something is going wrong with the indexing of the reuters content. I ran the test with SimpleText on both branches (adding forceMerge(1) for simplicity) and looked at the resulting index: Trunk: -rw-rw-r-- 1 rmuir rmuir 13798 Aug 9 09:42 _0_6.len -rw-rw-r-- 1 rmuir rmuir 1022509 Aug 9 09:42 _0.fld -rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:42 _0.inf -rw-rw-r-- 1 rmuir rmuir 3345582 Aug 9 09:42 _0.pst -rw-rw-r-- 1 rmuir rmuir 513 Aug 9 09:42 _0.si -rw-rw-r-- 1 rmuir rmuir 71 Aug 9 09:42 segments_1 -rw-rw-r-- 1 rmuir rmuir 20 Aug 9 09:42 segments.gen Branch: -rw-rw-r-- 1 rmuir rmuir 13262 Aug 9 09:46 _4_6.len -rw-rw-r-- 1 rmuir rmuir 290247032 Aug 9 09:45 _4.fld -rw-rw-r-- 1 rmuir rmuir 1310 Aug 9 09:46 _4.inf -rw-rw-r-- 1 rmuir rmuir 459164224 Aug 9 09:46 _4.pst -rw-rw-r-- 1 rmuir rmuir 593 Aug 9 09:46 _4.si -rw-rw-r-- 1 rmuir rmuir 71 Aug 9 09:46 segments_1 -rw-rw-r-- 1 rmuir rmuir 20 Aug 9 09:46 segments.gen Looking into the .fld file, I think the problem is obvious: on trunk: doc 0 numfields 5 doc 1 numfields 5 doc 2 numfields 5 on branch: doc 0 numfields 5 doc 1 numfields 10 doc 2 numfields 15 So there is some bug, where a field is 'accumulating' across documents. The last document has 2890. I'm really horrified this is the only test that fails!
          Hide
          Chris Male added a comment -

          Wow, I have replicated the same behaviour. On the branch the number of fields per doc is... wow.

          Show
          Chris Male added a comment - Wow, I have replicated the same behaviour. On the branch the number of fields per doc is... wow.
          Hide
          Chris Male added a comment -

          Ah I think I found the problem, it's in Document, I'll verify in a few seconds.

          Show
          Chris Male added a comment - Ah I think I found the problem, it's in Document, I'll verify in a few seconds.
          Hide
          Robert Muir added a comment -

          nice: the bug wasn't obvious to me (i glanced thru the diff of the branches),
          but at least SimpleText came to the rescue

          I'm still really really shocked more tests aren't failing for this: I guess
          maybe it only happens in certain circumstances?

          Show
          Robert Muir added a comment - nice: the bug wasn't obvious to me (i glanced thru the diff of the branches), but at least SimpleText came to the rescue I'm still really really shocked more tests aren't failing for this: I guess maybe it only happens in certain circumstances?
          Hide
          Nikola Tankovic added a comment -

          I found the problem, will report in a minute with solution!

          Show
          Nikola Tankovic added a comment - I found the problem, will report in a minute with solution!
          Hide
          Chris Male added a comment -

          Yup found it.

          The problem is in the branch Document#getFields() is creating a new List and inside DocMaker in the benchmark module, it is pulling the Fields and clearing them (using clear()). Since a new List is being created each time, it is the new List that is getting cleared rather than the actual fields. Hence each iteration just adds more fields without having the previous ones cleared.

          Show
          Chris Male added a comment - Yup found it. The problem is in the branch Document#getFields() is creating a new List and inside DocMaker in the benchmark module, it is pulling the Fields and clearing them (using clear() ). Since a new List is being created each time, it is the new List that is getting cleared rather than the actual fields. Hence each iteration just adds more fields without having the previous ones cleared.
          Hide
          Chris Male added a comment -

          Nikola, we should probably move all of Document's methods over to just working with Field (and not IndexableField). I don't mind if we want to make getFields() return an immutable list but we then need to provide a clear() method so people can reuse Document instances.

          Show
          Chris Male added a comment - Nikola, we should probably move all of Document's methods over to just working with Field (and not IndexableField). I don't mind if we want to make getFields() return an immutable list but we then need to provide a clear() method so people can reuse Document instances.
          Hide
          Robert Muir added a comment -

          Can we not return a new list? I don't think we should just work around the problem in DocMaker.

          This would be a serious sneaky bug to introduce to apps that do this.

          Show
          Robert Muir added a comment - Can we not return a new list? I don't think we should just work around the problem in DocMaker. This would be a serious sneaky bug to introduce to apps that do this.
          Hide
          Robert Muir added a comment -

          I don't mind if we want to make getFields() return an immutable list

          Thats an ok solution too, so someone would get exception if they do this?

          Then they would use Document.clear() or whatever else instead? (we should make sure
          they can still remove things or whatever, just safely).

          Show
          Robert Muir added a comment - I don't mind if we want to make getFields() return an immutable list Thats an ok solution too, so someone would get exception if they do this? Then they would use Document.clear() or whatever else instead? (we should make sure they can still remove things or whatever, just safely).
          Hide
          Nikola Tankovic added a comment -

          Yes, that is the problem. clear() meathod was clearing not the fields of Document but a copy. Should I go with immutable list, and Document.clear()?
          Document.getFields().clear() doesn't sound right...

          Show
          Nikola Tankovic added a comment - Yes, that is the problem. clear() meathod was clearing not the fields of Document but a copy. Should I go with immutable list, and Document.clear()? Document.getFields().clear() doesn't sound right...
          Hide
          Chris Male added a comment -

          Yeah we definitely shouldn't return a new list. I think the immutable list and Document.clear() combo will suffice.

          Show
          Chris Male added a comment - Yeah we definitely shouldn't return a new list. I think the immutable list and Document.clear() combo will suffice.
          Hide
          Chris Male added a comment -

          Oh we should also include a unit test that verifies this behaviour.

          Show
          Chris Male added a comment - Oh we should also include a unit test that verifies this behaviour.
          Hide
          Robert Muir added a comment -

          +1, nobody should have to debug TestQualityRun

          Show
          Robert Muir added a comment - +1, nobody should have to debug TestQualityRun
          Hide
          Nikola Tankovic added a comment -

          OK, will do that, and also a Document.clear() test.

          Show
          Nikola Tankovic added a comment - OK, will do that, and also a Document.clear() test.
          Hide
          Chris Male added a comment - - edited

          Nikola,

          On a note totally unrelated to the bug, I noticed that StorableField still returns an IndexableFieldType for type(). This lead me to GeneralField. I don't think we need this. IndexableField should only need name(), tokenStream() and type(). StorableField needs name(), type() and the various xyzValue() accessors. Its type() should be a StorableFieldType and some of the functionality from IndexableFieldType should go there.

          Show
          Chris Male added a comment - - edited Nikola, On a note totally unrelated to the bug, I noticed that StorableField still returns an IndexableFieldType for type(). This lead me to GeneralField. I don't think we need this. IndexableField should only need name(), tokenStream() and type(). StorableField needs name(), type() and the various xyzValue() accessors. Its type() should be a StorableFieldType and some of the functionality from IndexableFieldType should go there.
          Hide
          Nikola Tankovic added a comment - - edited

          I'm sure someone knows a better way to create immutable List than this:

          
           
            public final List<IndexableField> getFields() {
              IndexableField[] immArray = new IndexableField[fields.size()];
              int i=0;
              for (IndexableField field : fields) {
                immArray[i++] = field;
              }
          
              return Arrays.asList(immArray);
            }
          
          

          Any pointers?

          Show
          Nikola Tankovic added a comment - - edited I'm sure someone knows a better way to create immutable List than this: public final List<IndexableField> getFields() { IndexableField[] immArray = new IndexableField[fields.size()]; int i=0; for (IndexableField field : fields) { immArray[i++] = field; } return Arrays.asList(immArray); } Any pointers?
          Hide
          Chris Male added a comment -
          public final List<Field> getFields() {
            return Collections.unmodifiableList(fields);
          }
          
          Show
          Chris Male added a comment - public final List<Field> getFields() { return Collections.unmodifiableList(fields); }
          Hide
          Nikola Tankovic added a comment -

          Chris,

          I tried to go with StorableFieldType but I ended with a whole lot of mess, after this fix I'll try that again and report if I find problems!

          Show
          Nikola Tankovic added a comment - Chris, I tried to go with StorableFieldType but I ended with a whole lot of mess, after this fix I'll try that again and report if I find problems!
          Hide
          Nikola Tankovic added a comment -

          Patch 10: Added Document.clear() method with Unit tests. Fixed all JUnit tests. Everything compiles and all tests pass.

          Show
          Nikola Tankovic added a comment - Patch 10: Added Document.clear() method with Unit tests. Fixed all JUnit tests. Everything compiles and all tests pass.
          Hide
          Michael McCandless added a comment -

          Branch + current patch looks great! I just found some minor things:

          IndexDocument, Document methods (eg the new clear()), GeneralField
          need javadocs.

          Import statements should be under the copyright header (eg
          StoredDocument.java, StorableField.java, GeneralField.java,
          StoredFieldsWriter.java)? Silly IDEs... Emacs does this correctly

          Document's add(IndexableField) and add(StorableField) seem dangerous
          because they secretly cast to oal.document.Field? Ie, I cannot use
          Document to hold my private Storable/IndexableField implementations.
          I think we should remove them, leaving only add(Field)?

          I think StoredDocument should be in oal.index not oal.document? Ie,
          because it's something you've retrieved from the IndexReader. Also,
          it will cause confusion with oal.document.Document which is the
          obvious class you should use to hold all your indexed/stored fields.

          Why does StoredDocument still have removeField/s? Shouldn't it be
          read-only? (I feel like a broken record....).

          Show
          Michael McCandless added a comment - Branch + current patch looks great! I just found some minor things: IndexDocument, Document methods (eg the new clear()), GeneralField need javadocs. Import statements should be under the copyright header (eg StoredDocument.java, StorableField.java, GeneralField.java, StoredFieldsWriter.java)? Silly IDEs... Emacs does this correctly Document's add(IndexableField) and add(StorableField) seem dangerous because they secretly cast to oal.document.Field? Ie, I cannot use Document to hold my private Storable/IndexableField implementations. I think we should remove them, leaving only add(Field)? I think StoredDocument should be in oal.index not oal.document? Ie, because it's something you've retrieved from the IndexReader. Also, it will cause confusion with oal.document.Document which is the obvious class you should use to hold all your indexed/stored fields. Why does StoredDocument still have removeField/s? Shouldn't it be read-only? (I feel like a broken record....).
          Hide
          Robert Muir added a comment -

          I was confused about one thing, Norm.java has a StoredField in the branch.

          This didn't seem intuitive to me, because Norms are an indexing thing.

          But then i looked at it more, it seems Norm just uses this as a "container" (implementation detail)
          to hold its "docvalue".

          Can we add some notes or javadocs about this to this class? I think it would
          prevent future confusion.

          Show
          Robert Muir added a comment - I was confused about one thing, Norm.java has a StoredField in the branch. This didn't seem intuitive to me, because Norms are an indexing thing. But then i looked at it more, it seems Norm just uses this as a "container" (implementation detail) to hold its "docvalue". Can we add some notes or javadocs about this to this class? I think it would prevent future confusion.
          Hide
          Nikola Tankovic added a comment - - edited

          Patch 11: Mike's and Robert's comments applied.

          Show
          Nikola Tankovic added a comment - - edited Patch 11: Mike's and Robert's comments applied.
          Hide
          Uwe Schindler added a comment -

          I applied patch to current branch, but it does not compile anymore.

          Show
          Uwe Schindler added a comment - I applied patch to current branch, but it does not compile anymore.
          Hide
          Nikola Tankovic added a comment -

          Uwe,
          it seems that my SVN client doesn't mark oal.document.StoredDocument for removal. Wierd... Can you please manually remove this class because it moved to oal.index.StoredDocument? Then everything should be OK.

          Show
          Nikola Tankovic added a comment - Uwe, it seems that my SVN client doesn't mark oal.document.StoredDocument for removal. Wierd... Can you please manually remove this class because it moved to oal.index.StoredDocument? Then everything should be OK.
          Hide
          Nikola Tankovic added a comment -

          Patch 12: Fix of compiler warning. You still need to manually remove oal.document.StoredDocument

          Show
          Nikola Tankovic added a comment - Patch 12: Fix of compiler warning. You still need to manually remove oal.document.StoredDocument
          Hide
          Uwe Schindler added a comment -

          Your patch is also missing the replacement file addition. I tried to do it manually, but i have no patch.

          Can you use SVN 1.7.x and use --show-copies-as-adds (this does not work with SVN 1.6)? This simplifies patches a lot!

          Show
          Uwe Schindler added a comment - Your patch is also missing the replacement file addition. I tried to do it manually, but i have no patch. Can you use SVN 1.7.x and use --show-copies-as-adds (this does not work with SVN 1.6)? This simplifies patches a lot!
          Hide
          Nikola Tankovic added a comment -

          Patch 12a: Added missing oal.index.StoredDocument. I'm really sorry about that!

          Show
          Nikola Tankovic added a comment - Patch 12a: Added missing oal.index.StoredDocument. I'm really sorry about that!
          Hide
          Uwe Schindler added a comment -

          Applied patch to branch in revision 1372427.

          Now merging trunk in...

          Show
          Uwe Schindler added a comment - Applied patch to branch in revision 1372427. Now merging trunk in...
          Hide
          Uwe Schindler added a comment -

          Merged up to trunk revision: 1372438

          There was one conflict in some TermVectors test, but they now pass.

          Show
          Uwe Schindler added a comment - Merged up to trunk revision: 1372438 There was one conflict in some TermVectors test, but they now pass.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,

          we should now use the remaining time to do some cleanup and prepare the branch for merging with Lucene trunk. There will be no backport, so this would be the first Lucene 5.x only change, do we all agree with this? I think the change would be too heavy to go into 4.0.

          The final "pencils down" date would be monday next week, the evaluations of GSoC until Thursday, noon UTC next week, so we should hurry up.

          Show
          Uwe Schindler added a comment - Hi Nikola, we should now use the remaining time to do some cleanup and prepare the branch for merging with Lucene trunk. There will be no backport, so this would be the first Lucene 5.x only change, do we all agree with this? I think the change would be too heavy to go into 4.0. The final "pencils down" date would be monday next week, the evaluations of GSoC until Thursday, noon UTC next week, so we should hurry up.
          Hide
          Chris Male added a comment -

          Is it going to be possible to address IndexableFieldType vs StorableFieldType situation resolved before this lands? I can assist if that would help.

          Show
          Chris Male added a comment - Is it going to be possible to address IndexableFieldType vs StorableFieldType situation resolved before this lands? I can assist if that would help.
          Hide
          Uwe Schindler added a comment -

          Hi, I merged up to trunk (to get jenkins config changes in): revision 1373337

          I also created a Jenkins Job on the Policeman build server: http://jenkins.sd-datasolutions.de/job/lucene3312-branch/
          It will send mails to Nikola and myself on failures. It would be good to adress them asap.

          Show
          Uwe Schindler added a comment - Hi, I merged up to trunk (to get jenkins config changes in): revision 1373337 I also created a Jenkins Job on the Policeman build server: http://jenkins.sd-datasolutions.de/job/lucene3312-branch/ It will send mails to Nikola and myself on failures. It would be good to adress them asap.
          Show
          Uwe Schindler added a comment - First build succeeded: http://jenkins.sd-datasolutions.de/job/lucene3312-branch/1/consoleFull
          Hide
          Uwe Schindler added a comment - - edited

          Hi Nikola, the first build was not done using Oracle JDK 6, so Javadocs were not built. The recent one failed, because of invalid Javadocs. Could you send a patch with those corrected? I would recommend to run "ant javadocs" or "ant javadocs-lint) (more thorough) from top-level.

          Here the error message (we fail on javadocs warnings): http://jenkins.sd-datasolutions.de/job/lucene3312-branch/3/console

          Show
          Uwe Schindler added a comment - - edited Hi Nikola, the first build was not done using Oracle JDK 6, so Javadocs were not built. The recent one failed, because of invalid Javadocs. Could you send a patch with those corrected? I would recommend to run "ant javadocs" or "ant javadocs-lint) (more thorough) from top-level. Here the error message (we fail on javadocs warnings): http://jenkins.sd-datasolutions.de/job/lucene3312-branch/3/console
          Hide
          Nikola Tankovic added a comment -

          Patch 13: fixed Javadoc warnings. 'ant javadocs' builds successfully.

          Show
          Nikola Tankovic added a comment - Patch 13: fixed Javadoc warnings. 'ant javadocs' builds successfully.
          Hide
          Uwe Schindler added a comment -

          OK, applied and committed the patch, rev 1373940.

          Show
          Uwe Schindler added a comment - OK, applied and committed the patch, rev 1373940.
          Hide
          Uwe Schindler added a comment -

          How should we proceed with this? I think the IndexableFieldType vs StorableFieldType situation is not yet decided. How should we proceed? We have to stop working next monday and prepare the final GSoC evaluation.

          Show
          Uwe Schindler added a comment - How should we proceed with this? I think the IndexableFieldType vs StorableFieldType situation is not yet decided. How should we proceed? We have to stop working next monday and prepare the final GSoC evaluation.
          Hide
          Chris Male added a comment -

          We definitely need to clean up StorableFieldType situation, but I think we can tackle that afterwards. I think it's best to ensure what we have now works and we're comfortable with the API.

          Show
          Chris Male added a comment - We definitely need to clean up StorableFieldType situation, but I think we can tackle that afterwards. I think it's best to ensure what we have now works and we're comfortable with the API.
          Hide
          Uwe Schindler added a comment -

          Merged up to trunk revision: 1374718

          Show
          Uwe Schindler added a comment - Merged up to trunk revision: 1374718
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          afetr the branch merge, the more picky javadocs checker in Lucene Core found few classes without Javadoc at all. It would be good to add Javadocs for the new StorableField, StorableFieldType, GeneralField,... classes.

          Also please make sure that the ASF License header does not start with /** (which is javadoc), but starts with /* (simple comment). I fixed the ones I found.

          Show
          Uwe Schindler added a comment - Hi Nikola, afetr the branch merge, the more picky javadocs checker in Lucene Core found few classes without Javadoc at all. It would be good to add Javadocs for the new StorableField, StorableFieldType, GeneralField,... classes. Also please make sure that the ASF License header does not start with /** (which is javadoc), but starts with /* (simple comment). I fixed the ones I found.
          Hide
          Nikola Tankovic added a comment -

          Added missing javadoc descriptions, very basic; may need further elaboration.

          Show
          Nikola Tankovic added a comment - Added missing javadoc descriptions, very basic; may need further elaboration.
          Hide
          Uwe Schindler added a comment -

          Applied patch in revision 1375040. Thanks for fixing.

          Show
          Uwe Schindler added a comment - Applied patch in revision 1375040. Thanks for fixing.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          thanks for your work on this GSoC project. The Jenkins job seems to pass, we should now work on reintegrating the branch into trunk.

          Here my questions to the other committers:

          • Apply only to trunk (5.0) - so it has more time to bake? I think this change would be too big for Lucene 4.0 - and too late??
          • Are there any other things to change? One open point is StorableFieldType.

          I would like to integrate it asap, as it gets out of date quite early. I will do a merge from trunk -> branch to keep up-to-date.

          Show
          Uwe Schindler added a comment - Hi Nikola, thanks for your work on this GSoC project. The Jenkins job seems to pass, we should now work on reintegrating the branch into trunk. Here my questions to the other committers: Apply only to trunk (5.0) - so it has more time to bake? I think this change would be too big for Lucene 4.0 - and too late?? Are there any other things to change? One open point is StorableFieldType. I would like to integrate it asap, as it gets out of date quite early. I will do a merge from trunk -> branch to keep up-to-date.
          Hide
          Uwe Schindler added a comment -

          Merged up to trunk rev 1377246.

          Show
          Uwe Schindler added a comment - Merged up to trunk rev 1377246.
          Hide
          Chris Male added a comment -

          Apply only to trunk (5.0) - so it has more time to bake? I think this change would be too big for Lucene 4.0 - and too late??

          +1 to 5.0 only. It's another big change to the Document/Field API that we may want to evolve more as it bakes and earlier adopters begin to use it.

          Are there any other things to change? One open point is StorableFieldType.

          StorableFieldType seems like the only thing at this stage that needs to be addressed.

          Show
          Chris Male added a comment - Apply only to trunk (5.0) - so it has more time to bake? I think this change would be too big for Lucene 4.0 - and too late?? +1 to 5.0 only. It's another big change to the Document/Field API that we may want to evolve more as it bakes and earlier adopters begin to use it. Are there any other things to change? One open point is StorableFieldType. StorableFieldType seems like the only thing at this stage that needs to be addressed.
          Hide
          Uwe Schindler added a comment -

          Attached you will find the "master patch" after running "svn merge --reintegrate".

          Nikola: I think you could use this patch to submit the mandatory code submission to Google. It shows what you did the last months on GSoC.

          Other committers: Are you fine with this patch (see my comments above)?

          Show
          Uwe Schindler added a comment - Attached you will find the "master patch" after running "svn merge --reintegrate". Nikola: I think you could use this patch to submit the mandatory code submission to Google. It shows what you did the last months on GSoC. Other committers: Are you fine with this patch (see my comments above)?
          Hide
          Uwe Schindler added a comment -

          I have one comment about the following methods in Document:

          +  /** Obtains all indexed fields in document */
          +  @Override
          +  public Iterable<? extends IndexableField> indexableFields() {
          +    Iterator<Field> it = indexedFieldsIterator();
          +    
          +    List<IndexableField> result = new ArrayList<IndexableField>();
          +    while(it.hasNext()) {
          +      result.add(it.next());
          +    }
          +    
          +    return result;
          +  }
          +
          +
          +  /** Obtains all stored fields in document. */
          +  @Override
          +  public Iterable<? extends StorableField> storableFields() {
          +    Iterator<Field> it = storedFieldsIterator();
          +    
          +    List<StorableField> result = new ArrayList<StorableField>();
          +    while(it.hasNext()) {
          +      result.add(it.next());
          +    }
          +    
          +    return result;
          +  }
          +
          

          In my opinion, this should not copy to an ArrayList, it shoudl simply return a anonymous Iterable<..> wrapping the iterator:

          public Iterable<? extends StorableField> storableFields() {
           return new Iterable<? extends StorableField>() {
            @Override
            Iterator<? extends StorableField> iterator() {
              return Document.this.storedFieldsIterator();
            }
           }
          }
          

          Also it may not be needed to have <? extends Foo> a simple <Foo> is enough here (comment from Generics Policman)

          Show
          Uwe Schindler added a comment - I have one comment about the following methods in Document: + /** Obtains all indexed fields in document */ + @Override + public Iterable<? extends IndexableField> indexableFields() { + Iterator<Field> it = indexedFieldsIterator(); + + List<IndexableField> result = new ArrayList<IndexableField>(); + while(it.hasNext()) { + result.add(it.next()); + } + + return result; + } + + + /** Obtains all stored fields in document. */ + @Override + public Iterable<? extends StorableField> storableFields() { + Iterator<Field> it = storedFieldsIterator(); + + List<StorableField> result = new ArrayList<StorableField>(); + while(it.hasNext()) { + result.add(it.next()); + } + + return result; + } + In my opinion, this should not copy to an ArrayList, it shoudl simply return a anonymous Iterable<..> wrapping the iterator: public Iterable<? extends StorableField> storableFields() { return new Iterable<? extends StorableField>() { @Override Iterator<? extends StorableField> iterator() { return Document. this .storedFieldsIterator(); } } } Also it may not be needed to have <? extends Foo> a simple <Foo> is enough here (comment from Generics Policman)
          Hide
          Chris Male added a comment -

          +1

          Show
          Chris Male added a comment - +1
          Hide
          Uwe Schindler added a comment -

          I committed attached patch to branch.

          Please proceed with reviewing the reintegration patch. Once all changes are done, I will reintegrate again and commit!

          Show
          Uwe Schindler added a comment - I committed attached patch to branch. Please proceed with reviewing the reintegration patch. Once all changes are done, I will reintegrate again and commit!
          Hide
          Uwe Schindler added a comment -

          I merged in the recent changes in trunk (rev. 1379200). Robert Muir added lots of JavaDocs to the document and index package, so we should check that everything is still correct. We should especially review sentences that contain hints to stored documents on IndexableDocument and vice versa.

          Show
          Uwe Schindler added a comment - I merged in the recent changes in trunk (rev. 1379200). Robert Muir added lots of JavaDocs to the document and index package, so we should check that everything is still correct. We should especially review sentences that contain hints to stored documents on IndexableDocument and vice versa.
          Hide
          Robert Muir added a comment -
           * If you also need to store the value, you should add a
           * separate {@link StoredField} instance.
           ...
           * */
          
          public class ByteDocValuesField extends StoredField {
          

          I opened an issue for this already (LUCENE-4331), but here its really confusing since
          all thse DocValuesField themselves extend StoredField. I think we need to figure this out.

          Show
          Robert Muir added a comment - * If you also need to store the value, you should add a * separate {@link StoredField} instance. ... * */ public class ByteDocValuesField extends StoredField { I opened an issue for this already ( LUCENE-4331 ), but here its really confusing since all thse DocValuesField themselves extend StoredField. I think we need to figure this out.
          Hide
          Robert Muir added a comment -

          I echo Chris on the confusion of StorableField requires IndexableFieldType (since it extends GeneralField).

          To me storing needs no 'type' information at all: But I guess the problem with that is that we need
          DocValues types since DocValues are "stored fields" here.

          But I think this is related to my comment above: I think its confusing that DocValues fields are treated
          as Stored fields at all?

          This basically is the same problem all over again.

          • You make a Document with N StorableFields
          • You call IR.document and get a StorableDocument back, with N-3 StorableFields.
          • You wonder: what happened to the other 3 fields?

          They were DocValues.

          Show
          Robert Muir added a comment - I echo Chris on the confusion of StorableField requires IndexableFieldType (since it extends GeneralField). To me storing needs no 'type' information at all: But I guess the problem with that is that we need DocValues types since DocValues are "stored fields" here. But I think this is related to my comment above: I think its confusing that DocValues fields are treated as Stored fields at all? This basically is the same problem all over again. You make a Document with N StorableFields You call IR.document and get a StorableDocument back, with N-3 StorableFields. You wonder: what happened to the other 3 fields? They were DocValues.
          Hide
          Robert Muir added a comment -

          and some of my javadocs warnings on DocumentStoredFieldVisitor in trunk, that were removed
          during merging should be added back here again as long as StorableField still has IndexableFieldType:

             * @return Document populated with stored fields. Note that only
             *         the stored information in the field instances is valid,
             *         data such as indexing options, term vector options,
             *         etc is not set.
          
          Show
          Robert Muir added a comment - and some of my javadocs warnings on DocumentStoredFieldVisitor in trunk, that were removed during merging should be added back here again as long as StorableField still has IndexableFieldType: * @return Document populated with stored fields. Note that only * the stored information in the field instances is valid, * data such as indexing options, term vector options, * etc is not set.
          Hide
          Robert Muir added a comment -

          By the way, these werent meant to be objections to the issue (just random thoughts while reviewing javadocs).

          Show
          Robert Muir added a comment - By the way, these werent meant to be objections to the issue (just random thoughts while reviewing javadocs).
          Hide
          Uwe Schindler added a comment - - edited

          and some of my javadocs warnings on DocumentStoredFieldVisitor in trunk,

          sorry that was only this one, I did it because at the time of merging the fact that it still implements IndexableField was not in my mind. But this is the same issue like Chris complained about. We should cover that in a second step.

          Robert can you commit your javadoc comments back in or should I do it?

          Show
          Uwe Schindler added a comment - - edited and some of my javadocs warnings on DocumentStoredFieldVisitor in trunk, sorry that was only this one, I did it because at the time of merging the fact that it still implements IndexableField was not in my mind. But this is the same issue like Chris complained about. We should cover that in a second step. Robert can you commit your javadoc comments back in or should I do it?
          Hide
          Chris Male added a comment -

          I've thought about this a little bit.

          To me storing needs no 'type' information at all: But I guess the problem with that is that we need
          DocValues types since DocValues are "stored fields" here.

          We've gone back and forwards about this a lot since the Fields cleanup began but it would be nice to actually have the DocValues Types on the StorableField itself rather than on StorableFieldType. In the end the type is related to the type of the value itself, not disconnected metadata. Having it this way would also alleviate the need for StorableFieldType and make storing values as simple as possible.

          This basically is the same problem all over again.

          • You make a Document with N StorableFields
          • You call IR.document and get a StorableDocument back, with N-3 StorableFields.
          • You wonder: what happened to the other 3 fields?

          They were DocValues.

          What if they were returned? Because you're absolutely right, it seems odd for DocValues Fields to be StorableFields and then not accessible like all other StorableFields. So what if we changed how IR.document worked so you could pull DocValues Fields too. Is that something users might want?

          Show
          Chris Male added a comment - I've thought about this a little bit. To me storing needs no 'type' information at all: But I guess the problem with that is that we need DocValues types since DocValues are "stored fields" here. We've gone back and forwards about this a lot since the Fields cleanup began but it would be nice to actually have the DocValues Types on the StorableField itself rather than on StorableFieldType. In the end the type is related to the type of the value itself, not disconnected metadata. Having it this way would also alleviate the need for StorableFieldType and make storing values as simple as possible. This basically is the same problem all over again. You make a Document with N StorableFields You call IR.document and get a StorableDocument back, with N-3 StorableFields. You wonder: what happened to the other 3 fields? They were DocValues. What if they were returned? Because you're absolutely right, it seems odd for DocValues Fields to be StorableFields and then not accessible like all other StorableFields. So what if we changed how IR.document worked so you could pull DocValues Fields too. Is that something users might want?
          Hide
          Uwe Schindler added a comment -

          What if they were returned? Because you're absolutely right, it seems odd for DocValues Fields to be StorableFields and then not accessible like all other StorableFields. So what if we changed how IR.document worked so you could pull DocValues Fields too. Is that something users might want?

          This could be a large overhead if e.g. the loading of the whole column would be triggered automatically (depends on configuration). Also, IndexReader.document() is in the basic IndexReader class (because stored fields can always be returned, also for composite readers). DocValues is AtomicReader only... This could of course be managed by BaseCompositeReader to use the subindex function to get the correct document, but it is somehow not the thing docvalues are made for. They are there for using them while scoring, filtering, functions...

          Show
          Uwe Schindler added a comment - What if they were returned? Because you're absolutely right, it seems odd for DocValues Fields to be StorableFields and then not accessible like all other StorableFields. So what if we changed how IR.document worked so you could pull DocValues Fields too. Is that something users might want? This could be a large overhead if e.g. the loading of the whole column would be triggered automatically (depends on configuration). Also, IndexReader.document() is in the basic IndexReader class (because stored fields can always be returned, also for composite readers). DocValues is AtomicReader only... This could of course be managed by BaseCompositeReader to use the subindex function to get the correct document, but it is somehow not the thing docvalues are made for. They are there for using them while scoring, filtering, functions...
          Hide
          Robert Muir added a comment -

          This is not really a viable option. its n random seeks to retreive n dv fields for a doc.

          They are not stored fields

          Show
          Robert Muir added a comment - This is not really a viable option. its n random seeks to retreive n dv fields for a doc. They are not stored fields
          Hide
          Uwe Schindler added a comment -

          Yeah right! Every value is a seek.

          Show
          Uwe Schindler added a comment - Yeah right! Every value is a seek.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,

          I am about to reintegrate the branch back o Lucene trunk. We need a new entry for MIGRATE.txt. Can you prepare one, so users of the current Lucene 4.0 API can migrate to the new one? It should give some hints what needs to be changed in the code to make a Lucene 4.0 APP ready for Lucene trunk (5.0)? The migrate.txt is formatted using Markdown syntax, so mostly text-only.

          I will in all cases commit the reintragrated branch, but I want to add a short guide about the changes at a later stage.

          Show
          Uwe Schindler added a comment - Hi Nikola, I am about to reintegrate the branch back o Lucene trunk. We need a new entry for MIGRATE.txt. Can you prepare one, so users of the current Lucene 4.0 API can migrate to the new one? It should give some hints what needs to be changed in the code to make a Lucene 4.0 APP ready for Lucene trunk (5.0)? The migrate.txt is formatted using Markdown syntax, so mostly text-only. I will in all cases commit the reintragrated branch, but I want to add a short guide about the changes at a later stage.
          Hide
          Uwe Schindler added a comment -

          Attached the reintegration patch.

          Show
          Uwe Schindler added a comment - Attached the reintegration patch.
          Hide
          Nikola Tankovic added a comment -

          Hi Uwe,

          it would be most helpful if I could see some similar MIGRATE.txt file from previous migrations to see the level of detail, but if it's a hassle I'll probably manage something without it.

          Show
          Nikola Tankovic added a comment - Hi Uwe, it would be most helpful if I could see some similar MIGRATE.txt file from previous migrations to see the level of detail, but if it's a hassle I'll probably manage something without it.
          Hide
          Uwe Schindler added a comment -

          Hi Nikola,
          just download the 4.0-BETA release of Lucene. There is a MIGRATE.txt (and corresponding Markdown-generated HTML in the docs folder): http://lucene.apache.org/core/4_0_0-BETA/index.html -> http://lucene.apache.org/core/4_0_0-BETA/MIGRATE.html, the source code is here: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/MIGRATE.txt

          Show
          Uwe Schindler added a comment - Hi Nikola, just download the 4.0-BETA release of Lucene. There is a MIGRATE.txt (and corresponding Markdown-generated HTML in the docs folder): http://lucene.apache.org/core/4_0_0-BETA/index.html -> http://lucene.apache.org/core/4_0_0-BETA/MIGRATE.html , the source code is here: https://svn.apache.org/repos/asf/lucene/dev/branches/branch_4x/lucene/MIGRATE.txt
          Hide
          Uwe Schindler added a comment -

          I merged the branch to trunk, rev 1379982. The branch itsself was deleted after reintegration.

          I will now open new issues for the remaining problems.

          Show
          Uwe Schindler added a comment - I merged the branch to trunk, rev 1379982. The branch itsself was deleted after reintegration. I will now open new issues for the remaining problems.
          Hide
          Chris Male added a comment -

          Thanks Uwe and Nikola!

          Show
          Chris Male added a comment - Thanks Uwe and Nikola!
          Hide
          Uwe Schindler added a comment -

          I opened LUCENE-4347 as container issue for later changes.

          Nikola, please attach MIGRATE.txt changes to LUCENE-4348, as patches againt Lucene trunk! Thanks.

          Finally: My thanks also goes to Nikola and Chris for the work on this issue. I want to also mention Robert and Mike for helpful comments.

          Show
          Uwe Schindler added a comment - I opened LUCENE-4347 as container issue for later changes. Nikola, please attach MIGRATE.txt changes to LUCENE-4348 , as patches againt Lucene trunk! Thanks. Finally: My thanks also goes to Nikola and Chris for the work on this issue. I want to also mention Robert and Mike for helpful comments.
          Hide
          David Smiley added a comment -

          Uwe, can you please explain why you changed SpatialStrategy.createIndexableFields to return Field[] instead of IndexableField[]? As its name suggests and as the javadocs go to some lengths to clarify, createIndexableFields is for indexed data and not storing it. Field implements StorableField.

          Show
          David Smiley added a comment - Uwe, can you please explain why you changed SpatialStrategy.createIndexableFields to return Field[] instead of IndexableField[]? As its name suggests and as the javadocs go to some lengths to clarify, createIndexableFields is for indexed data and not storing it. Field implements StorableField.
          Hide
          Uwe Schindler added a comment -

          s/Uwe/Nikola/;

          Show
          Uwe Schindler added a comment - s/Uwe/Nikola/;
          Hide
          Chris Male added a comment -

          David, just at a guess I imagine the branch used in this issue was created before we changed createIndexableFields to not handle storing. To satisfy the conditions at the time (indexing and storing) Nikola changed it to return Field. Lets just fix it and we'll be fine.

          Show
          Chris Male added a comment - David, just at a guess I imagine the branch used in this issue was created before we changed createIndexableFields to not handle storing. To satisfy the conditions at the time (indexing and storing) Nikola changed it to return Field. Lets just fix it and we'll be fine.

            People

            • Assignee:
              Nikola Tankovic
              Reporter:
              Michael McCandless
            • Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development