Solr / SOLR-1131

Allow a single field type to index multiple fields

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.5, 3.1, 4.0-ALPHA
    • Component/s: Schema and Analysis
    • Labels: None

      Description

      In a few special cases, it makes sense for a single "field" (the concept) to be indexed as a set of Lucene Fields. Consider SOLR-773. The concept "point" may be best indexed in a variety of ways:

      • geohash (single Lucene field)
      • lat field, lon field (two double fields)
      • cartesian tiers (a series of fields with tokens to say if it exists within that region)
      Attachments

      1. SOLR-1131-IndexMultipleFields.patch
        5 kB
        Ryan McKinley
      2. SOLR-1131.patch
        6 kB
        Grant Ingersoll
      3. SOLR-1131.patch
        16 kB
        Grant Ingersoll
      4. SOLR-1131.patch
        17 kB
        Yonik Seeley
      5. SOLR-1131.patch
        26 kB
        Yonik Seeley
      6. SOLR-1131.patch
        82 kB
        Grant Ingersoll
      7. SOLR-1131.patch
        70 kB
        Grant Ingersoll
      8. SOLR-1131.patch
        115 kB
        Grant Ingersoll
      9. SOLR-1131.patch
        118 kB
        Grant Ingersoll
      10. SOLR-1131.patch
        119 kB
        Grant Ingersoll
      11. SOLR-1131.patch
        122 kB
        Shalin Shekhar Mangar
      12. SOLR-1131.patch
        124 kB
        Noble Paul
      13. SOLR-1131.patch
        121 kB
        Grant Ingersoll
      14. SOLR-1131.patch
        129 kB
        Yonik Seeley
      15. SOLR-1131.patch
        118 kB
        Grant Ingersoll
      16. SOLR-1131.patch
        121 kB
        Grant Ingersoll
      17. SOLR-1131.Mattmann.121109.patch.txt
        120 kB
        Chris A. Mattmann
      18. SOLR-1131.Mattmann.121009.patch.txt
        68 kB
        Chris A. Mattmann
      19. diff.patch
        91 kB
        Yonik Seeley

        Issue Links

          Activity

          Ryan McKinley added a comment -

          This is a sketch to see how things look if we have SchemaField/FieldType return Field[] rather than Field:

          +  /**
          +   * @deprecated use {@link #createFields(String, float)}
          +   */
             public Field createField(String val, float boost) {
               return type.createField(this,val,boost);
             }
          +  
          +  public Field[] createFields(String val, float boost) {
          +    return type.createFields(this,val,boost);
          +  }
          

          I think this could work – this would let FieldType#createFields() return a list of fields. The issues I see are:

          • indexing each field adds a for loop (maybe not a big deal?)
          • FieldType#toInternal() may or may not relate to the actual indexed value
          • FieldType#getAnalyzer() – I guess the same analyzer would have to apply to every field? I'm not sure what the implication is on this.
          • what about #getRangeQuery()
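          A minimal, self-contained sketch (not actual Solr code) of the createFields() idea above: one logical field value fans out into several indexed fields. The class names, the "_0"/"_1" naming scheme, and the comma-separated point syntax are all illustrative assumptions, not anything from the patch.

          ```java
          import java.util.Arrays;
          import java.util.List;

          public class PolyFieldSketch {

              // Stand-in for a Lucene Field: just a name/value pair.
              public static class IndexedField {
                  public final String name;
                  public final String value;
                  public IndexedField(String name, String value) {
                      this.name = name;
                      this.value = value;
                  }
              }

              // Stand-in for FieldType#createFields(): a point "lat,lon"
              // becomes two indexed fields derived from one schema field.
              public static List<IndexedField> createFields(String fieldName, String externalVal) {
                  String[] parts = externalVal.split(",");
                  return Arrays.asList(
                      new IndexedField(fieldName + "_0", parts[0].trim()),   // latitude component
                      new IndexedField(fieldName + "_1", parts[1].trim()));  // longitude component
              }

              public static void main(String[] args) {
                  for (IndexedField f : createFields("home", "44.2,-93.5")) {
                      System.out.println(f.name + "=" + f.value);
                  }
              }
          }
          ```

          The point of the shape is that the caller loops over whatever array comes back, so a plain FieldType returning a single element and a poly type returning several go through the same code path.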
          Erik Hatcher added a comment -

          Perhaps this is a case where the TeeTokenFilter and friends can come into play in Solr finally?

          Ryan McKinley added a comment -

          TeeTokenFilter looks interesting. How do you imagine it could help with letting a Solr field index multiple fields?

          Using the TeeTokenFilter seems like a matter of plumbing. Perhaps we could add some properties to FieldType that create a SinkTokenizer and then use copyField (or something similar) to use the existing SinkTokenizer.

          I'm not familiar enough with the Sink/Tee stuff to know what we would need – but we should make sure any SchemaField/FieldType changes open the door to this.

          Yonik Seeley added a comment -

          FieldType#getAnalyzer() - I guess the same analyzer would have to apply to every field?

          Fields in Lucene can now be pre-analyzed (wish this had been available when Solr was first being developed!).
          See the Field constructors that take a TokenStream.

          Grant Ingersoll added a comment -

          This brings the patch up to trunk. It still needs test cases; all other tests pass.

          Noble Paul added a comment -

          Is this a good idea? This creates an extra object (Field[]) for every Field. For a corner case, we are introducing overhead for all field types.

          Why don't we add a new interface, MultiValuedFieldType, which extends FieldType for this?

          Grant Ingersoll added a comment -

          Is this a good idea?

          Not sure yet.

          Why don't we add a new interface, MultiValuedFieldType, which extends FieldType for this?

          Aren't we just substituting a very simple construction for an instanceof check?

          I was possibly thinking of a couple of other options, too:
          1. Add a boolean isMultiField on FieldType which returns false by default; then we could check that.
          2. Add a ThreadLocal that stores a preconstructed array of size one, which could simply be set for the single-field case (the most common case).

          My gut, however, says the object is very short lived and is likely to be of negligible cost.
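          Option 2 above can be sketched as follows. This is a hedged illustration, not Solr code: a preallocated one-element array is kept per thread and reused for the common single-field case, so only true multi-field types pay the extra allocation. The names are assumptions, and the caller must consume the returned array before its next call on that thread.

          ```java
          public class SingleFieldArrayReuse {

              // One reusable single-element array per thread.
              private static final ThreadLocal<String[]> SINGLE =
                  ThreadLocal.withInitial(() -> new String[1]);

              // Stand-in for createFields(): poly types allocate a fresh array,
              // simple types reuse the per-thread one.
              public static String[] createFields(String value, boolean isPolyField) {
                  if (isPolyField) {
                      return value.split(",");   // genuinely multiple fields
                  }
                  String[] one = SINGLE.get();   // reused per-thread array
                  one[0] = value;
                  return one;
              }

              public static void main(String[] args) {
                  String[] a = createFields("foo", false);
                  String[] b = createFields("bar", false);
                  System.out.println(a == b);    // same array instance reused
              }
          }
          ```

          The trade-off the thread is debating shows up directly here: the reuse avoids the per-field allocation, but the returned array is shared mutable state, which is exactly why a short-lived fresh array may be the safer choice.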

          Noble Paul added a comment -

          Add a boolean on FT for isMultiField which returns false by default, then we could check that

          not bad

          My gut, however, says the object is very short lived and is likely to be of negligible cost.

          But for a huge ingestion, it means several million extra objects created and that much more GC.

          Grant Ingersoll added a comment - - edited

          I'm also looking for ideas on how to handle the naming of the fields produced by this. I think a FieldType that produces multiple fields should hide the logistics of the naming, which this patch doesn't even begin to scratch the surface of. Also, on the search side, how does one search against just one of the fields?

          Would appreciate thoughts on that.

          Chris Male added a comment -

          My initial feeling is: is searching against just one field something this functionality needs to concern itself with? If someone creates a field of type Point, for example, which behind the scenes is indexed as two fields, from a Solr schema.xml perspective it is just one field, and so it should be the same at the querying level. We are trying to encapsulate the fact that the FieldType results in multiple fields. This then frees us up to choose a naming convention that is easy for us to implement, because we don't have to concern users with the convention.

          If someone does want to be able to search against just one field, such as maybe being able to find documents at a certain x coordinate, rather than an x,y Point, then I think we can simply recommend they index that data in a separate field.

          Grant Ingersoll added a comment -

          I definitely agree, Chris, the interesting part is how that manifests itself in terms of implementation, which is where I am digging in at the moment. It means the Query parsers need to handle it as well as the ResponseWriters, etc.

          Chris Male added a comment -

          Those are definitely big problems.

          The ResponseWriter problem could be simplified if they used SolrDocuments rather than retrieving raw Lucene Documents. When constructing the SolrDocuments, which is done in cooperation with an IndexSchema instance, we have the information needed to bring the multiple fields together as one. I'm not sure of the performance impact of doing this, but it seems like having the ResponseWriters retrieve the data in a single consistent fashion is a good thing in the long run anyway.

          Grant Ingersoll added a comment -

          Starting to add unit tests. Still no support on the search/response side, but groundwork for adding multiple fields per SchemaField/FieldType is now laid. Still need a way to know that a field/fieldtype is going to output multiple fields so that we can detect them when searching, etc.

          Grant Ingersoll added a comment -

          See discussion at http://search.lucidimagination.com/search/document/d24c920ddf05b4f7/solr_1131_multiple_fields_per_field_type
          Yonik Seeley added a comment -

          Here's a completely untested prototype patch along the lines of how I was thinking this would work with geo.

          Yonik Seeley added a comment -

          Second try - forgot to "svn add" the new files.

          Grant Ingersoll added a comment -

          Hey Yonik,

          One of the things I was debating was whether it was worthwhile to keep the single field creation or not. I see in your patch you drop it. I've got a patch that keeps it. I will try to put it up this week.

          Yonik Seeley added a comment -

          One of the things I was debating was whether it was worthwhile to keep the single field creation or not. I see in your patch you drop it.

          No, I kept it. And I borrowed the isPolyField() from your email thread.

          Grant Ingersoll added a comment -

          I've got a patch with search and all the existing tests working. Still trying to work on one test that is failing due to toInternal/external conflicts. In the patch, I hide all the details of the internal fields, thus not requiring the dynamic field stuff.

          Grant Ingersoll added a comment - - edited

          OK, here's my take on this. I took Yonik's patch and merged it with a patch I had in the works. It's not done, but all tests pass, including the new one I added (PolyFieldTest). Yonik's move to put getFieldQuery in FieldType was the key to answering the question of how to generate queries given a FieldType.

          Notes:
          1. I changed the Geo examples to be CoordinateFieldType (representing an abstract coordinate system) and then PointFieldType which represents a point in an n-dimensional space (default 2D). I think from this, we could easily add things like PolygonFieldType, etc. which would allow us to create more sophisticated shapes and do things like intersections, etc. For instance, imagine saying: Does this point lie within this shape? I think that might be able to be expressed as a RangeQuery
          2. I'm not sure I care for the name of the new abstract FieldType that is a base class of CoordinateFieldType called DelegatingFieldType
          3. I'm not sure yet on the properties of the generated fields just yet. Right now, I'm delegating the handling to the sub FieldType except I'm overriding to turn off storage, which I think is pretty cool (could even work as a copy field like functionality)
          4. I'm not thrilled about creating a SchemaField every time in the createFields protected helper method, but SchemaField is final and doesn't have a setName method (which makes sense)

          Questions for Yonik on his patch:
          1. Why is TextField overriding getFieldQuery when it isn't called, except possibly via the FieldQParserPlugin?
          2. I'm not sure I understand the getDistance, getBoundingBox methods on the GeoFieldType. It seems like that precludes one from picking a specific distance calculation (for instance, sometimes you may want a faster approximation and other times a slower, more accurate calculation)

          Needs:
          1. Write up changes.txt
          2. More tests, including performance testing
          3. Patch doesn't support dynamic fields yet, but it should

          Chris A. Mattmann added a comment -

          Patch is looking good! I'm pouring through it right now – I'll try and test this as part of work I'm doing on SOLR-1586 – maybe even update that issue if I get a sec today

          Yonik Seeley added a comment -

          1. Why is TextField overriding getFieldQuery when it isn't called, except possibly via the FieldQParserPlugin?

          As you point out, it is called by FieldQParserPlugin, and it's a move to make things a bit more orthogonal - with a little more work it could even be used by the SolrQueryParser for text field types as well. It also opened the (expert) possibility of creating a new TextField type that handled things a bit differently.

          2. I'm not sure I understand the getDistance, getBoundingBox methods on the GeoFieldType. It seems like that precludes one from picking a specific distance (for instance, some times you may want a faster approx. and others a slower more accurate calculation)

          This decision will often be made for the user by the choice of field-types. End users and app clients should be able to specify something like a bounding box filter and get the most performant implementation w/o having to know if it resolves to range queries, cartesian grids, or whatever.

          fq={!gbox point=110,220 r=1.5}   # specify a point and a radius

          This does not necessarily preclude users from calling exact functions if they know they are supported for that field type.
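          A toy illustration of the point above (this is not Solr's local-params parser): a client-facing filter like {!gbox point=110,220 r=1.5} carries only the point and the radius, and nothing in it commits the field type to range queries, cartesian grids, or any other implementation.

          ```java
          import java.util.HashMap;
          import java.util.Map;
          import java.util.regex.Matcher;
          import java.util.regex.Pattern;

          public class LocalParamsSketch {

              // Pull key=value pairs out of a {!type k=v ...} style filter string.
              public static Map<String, String> parse(String fq) {
                  Map<String, String> params = new HashMap<>();
                  Matcher m = Pattern.compile("(\\w+)=([^\\s}]+)").matcher(fq);
                  while (m.find()) {
                      params.put(m.group(1), m.group(2));
                  }
                  return params;
              }

              public static void main(String[] args) {
                  Map<String, String> p = parse("{!gbox point=110,220 r=1.5}");
                  // The client supplies only point and radius; which query
                  // implementation that resolves to is the field type's business.
                  System.out.println("point=" + p.get("point") + " r=" + p.get("r"));
              }
          }
          ```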

          Yonik Seeley added a comment -

          Some minor nits about tests in general that I've also noticed in the past:
          IMO, unit tests can be too low level. They can also be too fragile.

          The test below is a pain to maintain... it essentially means you can't change the schema at all w/o breaking the test.
          + Map<String,FieldType> polyFs = schema.getPolyFieldTypes();
          + assertNotNull(polyFs);
          + assertTrue(polyFs.size() == 3);

          I also prefer testing low level behavior as opposed to testing low level implementation.
          It would be nice, for example, if testPointFieldType indexed a few documents (with various combinations of stored / indexed) and then queried the index, with our high-level XPath validation code, to test that the field was correctly matched and had stored, or had not stored, the value.

          Redundant null checks, trivial strings, etc.:
          + assertNotNull("topDocs is null", topDocs);
          + assertTrue(topDocs.totalHits + " does not equal: " + 1, topDocs.totalHits == 1)

          Things like the above can be replaced with the much more concise and readable:
          assertEquals(1, topDocs.totalHits)

          But really, stuff like this:
          + TopDocs topDocs = core.getSearcher().get().search(bq, 1);

          Should normally use the higher-level search and XPath validation functionality. The code above actually leads to a refcount leak.

          schema.xml: geo will be core... let's not add a new/different schema file in tests for this and simply add it to the latest schema12

          Chris A. Mattmann added a comment -

          Hi Yonik:

          I agree in general with your points above regarding unit tests. However, there seems to be a contradiction between your last statement and what you proposed above:

          schema.xml: geo will be core... let's not add a new/different schema file in tests for this and simply add it to the latest schema12

          Why have unit tests point at the actual schema? That explicitly ties your unit tests to the shipped ops schema, and then encourages people to write unit tests against it (which could lead to the specific number checks that will break when the schema is updated, as you mentioned). Instead, +1 for having a separate test schema; even if it causes duplication, it insulates against change.

          Cheers,
          Chris

          Yonik Seeley added a comment -

          Please see the DocumentBuilder changes I had added... there was starting to be too much duplicated code and I pulled it out into a utility method.

          Seems like SolrQueryParser should use getFieldQuery for everything (except TextField... but it could even be used for that if we make it such that we could call back to getBooleanQuery, etc). I had this in my patch.

          Yonik Seeley added a comment -

          Regarding polyfields... it's not clear why they are special enough to have to change the IndexSchema? (IndexSchema.isPolyField, getPolyField, getPolyFieldType, getPolyFieldTypeNoEx, etc). Can't we just store them as normal field types?

          Grant Ingersoll added a comment -

          IMO, unit tests can be too low level. They can also be too fragile.

          I guess it all comes down to what you call a unit.

          It would be nice, for example, if testPointFieldType indexed a few documents (with various combinations of stored / indexed) and then queried the index,

          This is done in testIndexing()

          TopDocs topDocs = core.getSearcher().get().search(bq, 1);

          Yeah, see my comment there even! I wanted a way to validate that the correct query is created, but I don't even really need to run a search for that.

          Redundant null checks, trivial strings, etc.:
          + assertNotNull("topDocs is null", topDocs);
          + assertTrue(topDocs.totalHits + " does not equal: " + 1, topDocs.totalHits == 1)

          I need to update my IntelliJ "Live Templates", as I have them set up to spit out a pattern like the one above

          Please see the DocumentBuilder changes I had added...

          Will do.

          Seems like SolrQueryParser should use getFieldQuery for everything (except TextField... but it could even be used for that if we make it such that we could call back to getBooleanQuery, etc). I had this in my patch.

          I thought I captured that, but will look again.

          Grant Ingersoll added a comment -

          Regarding polyfields... it's not clear why they are special enough to have to change the IndexSchema? (IndexSchema.isPolyField, getPolyField, getPolyFieldType, getPolyFieldTypeNoEx, etc). Can't we just store them as normal field types?

          My thinking was that a Query Parser or other things might need to look up this information, but you are right, I don't have a specific use case for them at the moment. At the same time, poly fields feel like a hybrid between regular fields and dynamic fields and thus fit at the same level they do.

          Yonik Seeley added a comment -

          I need to update my IntelliJ "Live Templates", as I have them set up to spit out a pattern like the above

          lol... so that's where all that comes from... I was going to say something like "this looks like it came out of a code generator" but it sounded a bit too harsh on the off chance that it wasn't. I'm very relieved to find you weren't typing out that crap by hand.

          Yonik Seeley added a comment -

          poly fields feel like a hybrid between regular fields and dynamic fields and thus fit at the same level they do.

          The schema needs to know about dynamic fields because it affects field names.

          Actually, I think I just saw why you currently need some support in IndexSchema with the current way you are doing things:
          If you have a dynamicField pt, then I think you use field names like home_pt_0 and home_pt_1?
          So in essence, it's a new type of dynamic field? This seems like it might be hard to actually get right in all of the corner cases.

          What if, instead, dynamic fields are directly used for subfields?
          So for a field name "home", instead of home_0 and home_1 you would use home_0_d and home_1_d
          Not as short, but it avoids having to add new capabilities to the IndexSchema.

          Another alternative: use a prefix for subfields and the suffix for the type. _0_home_d, _1_home_d

          Another thing to keep in mind - not all subfields will always be of the same type.
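          The two subfield naming schemes floated above could be sketched like this (purely illustrative helpers, not part of any patch on this issue; the underscore counts in the actual patch differ):

```java
// Illustrative sketch of the two subfield naming conventions discussed
// above (suffix style vs. prefix style). Not code from the patch.
public class SubFieldNames {
    // Suffix style: base field name, dimension index, dynamic-field suffix,
    // e.g. "home" -> "home_0_d", "home_1_d"
    static String suffixStyle(String field, int dim, String typeSuffix) {
        return field + "_" + dim + typeSuffix;
    }

    // Prefix style: dimension index first, e.g. "_0_home_d", "_1_home_d"
    static String prefixStyle(String field, int dim, String typeSuffix) {
        return "_" + dim + "_" + field + typeSuffix;
    }

    public static void main(String[] args) {
        System.out.println(suffixStyle("home", 0, "_d")); // home_0_d
        System.out.println(prefixStyle("home", 1, "_d")); // _1_home_d
    }
}
```

          Either way the subfield name resolves to an existing `*_d` dynamic field, which is what avoids new lookup machinery in IndexSchema.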

          Grant Ingersoll added a comment -

          What if, instead, dynamic fields are directly used for subfields?

          That then requires those dynamic fields to be present, which I'd rather not have to do. Part of the goal of this issue is to hide the implementation. Having said that, I still don't know whether that means I need to keep the IndexSchema changes. Let me do another iteration.

          Another thing to keep in mind - not all subfields will always be of the same type.

          Agreed, but I don't think this is baked into the generic capabilities, just the Point stuff, where I think it is fine to have the same sub-type.

          Yonik Seeley added a comment -

          That then requires those dynamic fields to be present, which I'd rather not have to do.

          That's sort of a separate question: if one were allowed to register a dynamic field (not sure if this capability is present), then it could be registered if it didn't exist.

          Also, you have subFieldType="double" in the schema... and that requires that the "double" field type be defined. Why not have subFieldSuffix="_d" and require the "_d" dynamic field be defined? Seems like the same complexity level.

          > > Another thing to keep in mind - not all subfields will always be of the same type.
          > Agreed, but I don't think this is baked in to the generic capabilities, just the Point stuff,

          For a specific point implementation, that's fine. But if you use a point type that can do cartesian grid stuff, then you already have different field types. But I guess subFieldType="double" need only apply to some of the subfields (the ones that index the points).

          Grant Ingersoll added a comment -

          Also, you have subFieldType="double" in the schema... and that requires that the "double" field type be defined. Why not have subFieldSuffix="_d" and require the "_d" dynamic field be defined? Seems like the same complexity level

          I think it makes more sense for the subFieldType to be tied to a type than to a Field (subFieldSuffix), as it seems weird for a field type to have a dependency on a Field, whereas it seems fine for a field type to have a dependency on another field type.

          For a specific point implementation, that's fine. But if you use a point type that can do cartesian grid stuff, then you already have different field types. But I guess subFieldType="double" need only apply to some of the subfields (the ones that index the points).

          I'm not sure I see this. If and when we implement CartesianPointType, it will still need to have a type for the sub fields (depending on the tiers specified) but I don't see why the subFieldType wouldn't be the same for all of them. AIUI, they all have the same precision requirements.

          I think part of what's missing is that for some of these attributes, it would be better for them to be field properties and not fieldType properties. For instance for the Cartesian case, you will need to declare what levels to support. If that is specified on the FieldType, then you have a proliferation of Field Type declarations, whereas if it is on the Field, that is a lot cleaner and less verbose. I'm just not sure how that gets implemented just yet, as having to specify startTier and endTier doesn't seem like the same level as multiValued or stored.

          Yonik Seeley added a comment -

          Perhaps we should return to the design level a bit instead of me reading code and maybe making mistakes and trying to infer intent.

          Assume we have this:
          <fieldType name="xy" class="solr.PointType" dimension="2" subFieldType="double"/>
          <field name="home" type="xy" indexed="true" stored="true"/>

          What are the exact field names that are indexed?

          Chris A. Mattmann added a comment -

          Perhaps we should return to the design level a bit instead of me reading code and maybe making mistakes and trying to infer intent.

          Assume we have this:
          <fieldType name="xy" class="solr.PointType" dimension="2" subFieldType="double"/>
          <field name="home" type="xy" indexed="true" stored="true"/>

          What are the exact field names that are indexed?

          Regarding the fieldType subFieldType attribute – a question popped into my mind. How do we handle poly fields where each type is different? I.e., where subFieldType="double,tint" or whatever...

          Grant Ingersoll added a comment - - edited

          <fieldType name="xy" class="solr.PointType" dimension="2" subFieldType="double"/>
          <field name="home" type="xy" indexed="true" stored="true"/>

          Two indexed fields
          home___0
          home___1

          One stored field:
          home

          Yonik Seeley added a comment -

          OK... so the real issue is that this introduces a new mechanism to look up field types... not necessarily a horrible thing, but we should definitely think twice before doing so.

          home__0 and home__1 are not dynamic fields as I understand it (in that there is no ___0 dynamic field). The lookup is done by adding new support to the IndexSchema to strip ___foo off of any field name and use that as its type?

          But... that scheme seems to limit us to a single subField type (in addition to the other downsides of requiring a new lookup mechanism).

          I do want to separate these two issues though:
          1) field lookup mechanism (currently just exact name in schema followed by a dynamic field check)
          2) if and when fields or field types should be explicitly defined in the schema vs being created by the polyField

          Aside: it looks like the code for getFieldOrNull isn't right? Seems like it will return a field with both the wrong type and the wrong name?

             public SchemaField getFieldOrNull(String fieldName) {
                SchemaField f = fields.get(fieldName);
          @@ -1055,25 +1071,28 @@
               for (DynamicField df : dynamicFields) {
                 if (df.matches(fieldName)) return df.makeSchemaField(fieldName);
               }
          -    
          +    int idx = fieldName.indexOf(FieldType.POLY_FIELD_SEPARATOR);
          +    if (idx != -1){
          +      String fn = fieldName.substring(0, idx);
          +      f = getFieldOrNull(fn);
          +    }
               return f;
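          The problem Yonik is pointing at can be seen in a standalone sketch (plain Java, not Solr's actual SchemaField/IndexSchema classes; "___" stands in for FieldType.POLY_FIELD_SEPARATOR): stripping the separator and recursing returns the owning field's type, not the subfield's own type.

```java
import java.util.HashMap;
import java.util.Map;

// Standalone illustration of the lookup issue discussed above.
// "___" stands in for FieldType.POLY_FIELD_SEPARATOR; a String type
// name stands in for SchemaField. Not actual Solr code.
public class PolyLookupSketch {
    static final String SEP = "___";
    static final Map<String, String> fields = new HashMap<>(); // name -> type name

    static String getFieldTypeOrNull(String fieldName) {
        String t = fields.get(fieldName);
        if (t != null) return t;
        int idx = fieldName.indexOf(SEP);
        if (idx != -1) {
            // Falls back to the owning field, so the caller sees the
            // poly type ("xy") for "home___0", not the subfield's
            // actual type ("double") -- the aside Yonik raises above.
            return getFieldTypeOrNull(fieldName.substring(0, idx));
        }
        return null;
    }

    public static void main(String[] args) {
        fields.put("home", "xy");
        System.out.println(getFieldTypeOrNull("home___0")); // xy, not double
    }
}
```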
          
          Grant Ingersoll added a comment -

          OK... so the real issue is that this introduces a new mechanism to look up field types... not necessarily a horrible thing, but we should definitely think twice before doing so.

          Agreed. I'm not wedded to this approach, just want to see the discussion through. I do feel strongly that the goal is such that an app designer should be able to use a FieldType just as they always have, either dynamic or static. How we get to that I don't care so much as long as it works and performs.

          But... that scheme seems to limit us to a single subField type (in addition to the other downsides of requiring a new lookup mechanism).

          I don't follow this. In this particular implementation, I have a single subFieldType, but I don't see why a different implementation couldn't do something like:

          <fieldType name="foo" class="solr.MultiSubPointType" dimension="3" subFieldTypes="double,tdouble,int"/>
          

          Aside: it looks like the code for getFieldOrNull isn't right? Seems like it will return a field with both the wrong type and the wrong name?

          Hmmm, I think it should return the "owning" Schema Field, i.e. the one that exists in the schema.xml file.

          Grant Ingersoll added a comment -

          Note, I don't think the distance function queries will work w/ my patch yet.

          Yonik Seeley added a comment -

          > Aside: it looks like the code for getFieldOrNull isn't right? Seems like it will return a field with both the wrong type and the wrong name?
          > > Hmmm, I think it should return the "owning" Schema Field, i.e. the one that exists in the schema.xml file.

          Those fields probably will be exposed at least internally to other parts of solr, so they should really return the correct field / fieldType.

          Grant Ingersoll added a comment -

          Seems like SolrQueryParser should use getFieldQuery for everything (except TextField... but it could even be used for that if we make it such that we could call back to getBooleanQuery, etc). I had this in my patch.

          Yonik, could you elaborate on this? It seems kind of weird to have that instanceof check in SolrQueryParser.getFieldQuery() to see if we have a TextField or not.

          Grant Ingersoll added a comment -

          This implements Option B as laid out at: http://search.lucidimagination.com/search/document/83a5442ab155686/solr_1131_multiple_fields_per_field_type#a600de441418a798

          Next up: Implement ValueSource support for PointType.

          Yonik Seeley added a comment -

          Next up: Implement ValueSource support for PointType.

          Exactly!

          My thinking was perhaps to add the ability to get lat+lon from DocValues... but for efficiency have DocValues fill in values on an object passed to it (this is inner-loop stuff.... we don't want to be creating objects per doc).

          DocValues getPoint(Point point)
          or perhaps
          DocValues.getPoint(double[] point)
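          The allocation-free pattern being described (the caller supplies the array once and DocValues fills it per document) looks roughly like this; the PointValues interface here is hypothetical, not Lucene's actual DocValues API:

```java
// Hypothetical illustration of the fill-in-place pattern: the caller
// reuses a single double[] across documents instead of allocating a
// point object per doc in the inner loop. Not Lucene's actual API.
interface PointValues {
    void getPoint(int docId, double[] point); // fills point in place
}

public class FillInPlaceSketch {
    public static void main(String[] args) {
        // Toy implementation that derives coordinates from the doc id
        PointValues vals = (doc, pt) -> { pt[0] = doc * 1.0; pt[1] = doc * 2.0; };

        double[] reused = new double[2]; // one allocation for the whole loop
        for (int doc = 0; doc < 3; doc++) {
            vals.getPoint(doc, reused);
            System.out.println(reused[0] + "," + reused[1]);
        }
    }
}
```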

          Yonik Seeley added a comment -

          Yonik, could you elaborate on this? It seems kind of weird to have that instanceof check in SolrQueryParser.getFieldQuery() to see if we have a TextField or not.

          If you look at the impl in TextField, I had to comment out stuff like "newTermQuery" and replace it with "new TermQuery".

          + // Query currentQuery = newTermQuery(new Term(field, term));
          + Query currentQuery = new TermQuery(new Term(field, term));

          To be fully back compatible, all we would need to do is check if the parser was an instance of QueryParser, and if so, delegate to newTermQuery. Then we could use fieldType.getFieldQuery() absolutely everywhere.

          Grant Ingersoll added a comment -

          DocValues.getPoint(double[] point)

          OK, let me see how that plays out.

          See also http://www.lucidimagination.com/search/document/fd804bcd78d7bec1/solr_1131_poly_fields_and_valuesource

          Chris A. Mattmann added a comment - - edited

          Hi All:

          Here's a cut at the patch. Some questions/comments on the existing patch(es):

          1. Why use private DynamicField[] dynamicFields; instead of List<DynamicField> or Collection<DynamicField> in IndexSchema?

          2. There are a bunch of useless whitespace changes (e.g., in IndexSchema, FieldType) in the existing patches. The final patch probably shouldn't include those since it makes it difficult to understand what was actually changed.
          3. IndexSchema:
            • when checking for isDuplicateDynField, if it is, nothing is done. Shouldn't this be where an exception is thrown or a message is logged? In the patch I'm attaching I took the log approach.
          4. IndexSchema:
            • what happens if subs.isEmpty() == true?
            • maybe log message that says, dyn field definition is up to you?
            • I took the approach in my attached patch to log it.
          5. Why does getPolyFieldType(String) throw an exception if the field is not a poly field type – that seems a bit brittle? Also there's the NoEx version anyways (why not just keep that one?). In the patch I've attached, I took the approach of only including a getPolyFieldType that returns null rather than throwing an ex (the NoEx version).
          6. CoordinateFieldType: why process > 1 sub field types and then throw an exception at the end? I cleaned this up to throw the Exception when it occurs.
          7. parsePoint in DistanceUtils, why use ',' as the separator – use ' ' (at least conforms to georss point then). I guess because you are supporting N-dimensional points, right?
          8. parsePoint – instead of complicated isolation loops, why not just use trim()? I've taken that approach in the patch I've attached.

          This patch passes all unit tests as well. This doesn't implement option C that I proposed yet. Hopefully I'll get a chance to put that up later tonight.

          Yonik Seeley added a comment -

          Quick comment based on a spot check of the changes to IndexSchema: rather than make polyField special somehow w.r.t IndexSchema, and add a FieldType.getPolyFieldNames, etc, I had been thinking more along the lines of having an IndexSchema.registerDynamicFieldDefinition - just like the existing registerDynamicCopyField. This would (optionally) allow any field type to add other definitions to the IndexSchema. I continue to think it would be good to stay away from special logic for "polyfields" in the IndexSchema.

          Yonik Seeley added a comment -

          parsePoint in DistanceUtils, why use ',' as the separator

          A comma is more user friendly - spaces are often already used as delimiters in quite a few places.
          Why did you replace more optimized code that was already written in parsePoint with less optimized code?

          Chris A. Mattmann added a comment -

          A comma is more user friendly - spaces are often already used as delimiters in quite a few places.
          Why did you replace more optimized code that was already written in parsePoint with less optimized code?

          Meh, I could go either way on the comma/space issue. It would be nice to be compatible with an existing GeoPoint standard. I know georss uses space as the delimiter – do you know of any that use ","?

          RE: optimized code, can you be explicit? I would argue the code I inserted is more optimized from a readability standpoint. It's a bit easier for your typical CS101 grad to understand. All that was being done in the prior patch is a set of forwards/backwards isolation loops to determine the start/end index to substring out, in case you have:

          34.333 ,100.1 OR
          34.333,100.1 OR
          34.333, 100.1

          At first blush, trying to understand that code was a bit harder than simply tokenizing on the known delimiter, and then trimming each tokenized value.

            out = externalVal.split(",");
          +      if(out.length != dimension){
          +        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incompatible dimension (" + dimension +
          +            ") and values (" + externalVal + ").  Only " + i + " values specified");        
          +      }
          +      for(int j=0; j < out.length; j++){
          +        out[j] = out[j].trim();
          +      }
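
The split-and-trim approach described above can be sketched as a self-contained method (illustrative names; the real patch code lives in PointType/DistanceUtils and uses SolrException):

```java
public class SplitTrimDemo {
    // Sketch of the split-and-trim parsing described above.
    public static String[] parsePoint(String externalVal, int dimension) {
        String[] out = externalVal.split(",");
        if (out.length != dimension) {
            throw new IllegalArgumentException("incompatible dimension (" + dimension
                + ") and values (" + externalVal + "). Only " + out.length + " values specified");
        }
        for (int j = 0; j < out.length; j++) {
            out[j] = out[j].trim();
        }
        return out;
    }

    public static void main(String[] args) {
        // All three spacing variants above normalize to the same tokens.
        String[] inputs = {"34.333 ,100.1", "34.333,100.1", "34.333, 100.1"};
        for (String s : inputs) {
            String[] p = parsePoint(s, 2);
            System.out.println(p[0] + "|" + p[1]); // prints "34.333|100.1" each time
        }
    }
}
```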
          
          Grant Ingersoll added a comment -

          Quick comment based on a spot check of the changes to IndexSchema: rather than make polyField special somehow w.r.t IndexSchema, and add a FieldType.getPolyFieldNames, etc, I had been thinking more along the lines of having an IndexSchema.registerDynamicFieldDefinition - just like the existing registerDynamicCopyField. This would (optionally) allow any field type to add other definitions to the IndexSchema. I continue to think it would be good to stay away from special logic for "polyfields" in the IndexSchema.

          So, then the FieldType would register its Dynamic Fields in its own init() method by calling this method? That can work.

          Why use a Dynamic Field array?

          The array is sorted and array access is much faster and we often have to loop over it to look it up.

          CoordinateFieldType: why process > 1 sub field types and then throw an exception at the end? I cleaned this up to throw the Exception when it occurs.

          OK. Actually, this should just be in the derived class, as it may be the case some other CoordinateFieldType has multiple sub types.

          1. parsePoint in DistanceUtils, why use ',' as the separator - use ' ' (at least conforms to georss point then). I guess because you are supporting N-dimensional points, right?
          2. parsePoint - instead of complicated isolation loops, why not just use trim()? I've taken that approach in the patch I've attached.

          I think comma makes sense. As for the optimization stuff, I agree w/ Yonik, this is code that will be called a lot.

          • when checking for isDuplicateDynField, if it is, nothing is done. Shouldn't this be where an exception is thrown or a message is logged? In the patch I'm attaching I took the log approach.

          It is logged, but for the poly fields, if the dyn field is already defined, that's just fine.
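
The registration mechanism discussed above can be sketched with minimal stand-in types (hypothetical interfaces and separator value; the real Solr classes carry more state):

```java
import java.util.*;

public class SchemaAwareDemo {
    // Minimal stand-ins for the Solr types discussed; signatures are hypothetical.
    interface IndexSchema { void registerDynamicField(String pattern, String typeName); }
    interface SchemaAware { void inform(IndexSchema schema); }

    // A poly field type registers one dynamic field per sub-type from its own
    // callback, rather than IndexSchema holding special polyField logic.
    static class PointType implements SchemaAware {
        static final String POLY_FIELD_SEPARATOR = "___"; // hypothetical value
        final List<String> subTypeNames;
        PointType(List<String> subTypeNames) { this.subTypeNames = subTypeNames; }
        public void inform(IndexSchema schema) {
            for (String sub : subTypeNames) {
                schema.registerDynamicField("*" + POLY_FIELD_SEPARATOR + sub, sub);
            }
        }
    }

    public static void main(String[] args) {
        Map<String, String> dynamicFields = new LinkedHashMap<>();
        new PointType(Arrays.asList("double")).inform(dynamicFields::put);
        System.out.println(dynamicFields); // prints {*___double=double}
    }
}
```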

          Grant Ingersoll added a comment -

          I've got a patch almost ready that brings in the ValueSource stuff.

          Chris A. Mattmann added a comment -

          Hi Grant:

          The array is sorted and array access is much faster and we often have to loop over it to look it up.

          OK, fair enough.

          I think comma makes sense.

          OK, so you think it makes sense – why? Because it's an N-dimensional array and spaces are less "user friendly" as Yonik put it?

          As for the optimization stuff, I agree w/ Yonik, this is code that will be called a lot.

          I'm wondering what there is to agree with since "optimization" was never defined. Are you talking speed? Are you talking memory efficiency? Code readability? Maintainability? Some combination of all of those? There are tradeoffs in everything. You could rewrite some of the provided java runtime methods to squeeze out extra performance, but what's the point of libraries or reusable functions then? The prior code that was in there basically rewrote exactly what split() and trim() do, so why not reuse what's there? If you throw up the performance flag, I would push back on readability and maintainability.

          It is logged, but for the poly fields, if the dyn field is already defined, that's just fine.

          Where is it logged? It wasn't in the most up-to-date patch, provided on 2009-12-10 04:34 PM. Here is the code snippet that was there:

          +    //For each poly field, go through and add the appropriate Dynamic field
          +      for (FieldType fieldType : polyFieldTypes.values()) {
          +        if (fieldType instanceof DelegatingFieldType){
          +          List<FieldType> subs = ((DelegatingFieldType) fieldType).getSubTypes();
          +          if (subs.isEmpty() == false){
          +            //add a new dynamic field for each sub field type
          +            for (FieldType type : subs) {
          +              log.debug("dynamic field creation for sub type: " + type.typeName);
          +              SchemaField df = SchemaField.create("*" + FieldType.POLY_FIELD_SEPARATOR + type.typeName,
          +                      type, type.args);//TODO: is type.args right?
          +              if (isDuplicateDynField(dFields, df) == false){
          +                addDynamicFieldNoDupCheck(dFields, df);
          +              }
                         // NOTE: there is no else here, so I added an else and a log message
          +            }
          +          }
                      // NOTE: there is no else here, so I added an else and a log message
          +        }
          +      }
          +
          

          Here's what I added:

          +    //For each poly field, go through and add the appropriate Dynamic field
          +      for (FieldType fieldType : polyFieldTypes.values()) {
          +        if (fieldType instanceof DelegatingFieldType){
          +          List<FieldType> subs = ((DelegatingFieldType) fieldType).getSubTypes();
          +          if (!subs.isEmpty()){
          +            //add a new dynamic field for each sub field type
          +            for (FieldType type : subs) {
          +              log.debug("dynamic field creation for sub type: " + type.typeName);
          +              SchemaField df = SchemaField.create("*" + FieldType.POLY_FIELD_SEPARATOR + type.typeName,
          +                      type, type.args);//TODO: is type.args right?
          +              if (!isDuplicateDynField(dFields, df)){
          +                addDynamicFieldNoDupCheck(dFields, df);
          +              }
          +              else{
          +                log.debug("dynamic field creation avoided: dynamic field: ["+df.getName()+"] " +
          +                		"already defined in the schema!");
          +              }
          +              
          +            }
          +          }
          +          else{
          +            log.debug("field type: ["+fieldType.getTypeName()+"]: no sub fields defined");
          +          }
          +        }
          +      }
          +
          

          Also, I get that it's fine for the poly fields if the dyn field is already defined (in fact, based on my mailing list comments http://old.nabble.com/SOLR-1131:-disconnect-between-fields-created-by-poly-fields-td26736431.html I think this should always be the case), but whether it's fine or not, it's still worthwhile to log to provide someone more information.

          Grant Ingersoll added a comment -

          I'm wondering what there is to agree with since "optimization" was never defined. Are you talking speed? Are you talking memory efficiency? Code readability? Maintainability? Some combination of all of those?

          Speed and memory.

          As for logging, that code is all going away in the next patch, I think

          Chris A. Mattmann added a comment -

          Speed and memory.

          Unfortunately, it's at the cost of readability and maintainability.

          As for logging, that code is all going away in the next patch, I think

          I'll take a look when you throw it up, thanks.

          Cheers,
          Chris

          Grant Ingersoll added a comment -

          Unfortunately, it's at the cost of readability and maintainability.

          Maybe. It took me all of 30 seconds to figure out what it was doing. I'll put some comments on it. While readability is important, Solr's goal is not to make a product that a CS101 grad can read, it's to build a blazing fast search server. That call could be hit millions of times when indexing points.

          Chris A. Mattmann added a comment -

          Maybe. It took me all of 30 seconds to figure out what it was doing. I'll put some comments on it. While readability is important, Solr's goal is not to make a product that a CS101 grad can read, it's too build a blazing fast search server. That call could hit millions of times when indexing points.

          Sure, maybe the goal isn't for a CS101 grad to be able to easily understand the code, but SOLR's goal should include being open to ideas from the community that involve reusing standard Java library functions and not rewriting based on perception of speed and memory without empirical proof. Where's the evidence that using things like split and trim are so much more costly than rewriting those basic capabilities that they warrant not using them?

          Grant Ingersoll added a comment -

          OK, this is getting a lot closer to ready to commit.

          Changes:

          1. Introduced a MultiValueSource - ValueSource that abstractly represents ValueSources for poly fields, and other things.
          2. Introduced PointValueSource - point(x,y,z) - a MultiValueSource that wraps other value sources (could be called something else, I suppose)
          3. Implemented PointTypeValueSource to represent ValueSource for the PointType class.
          4. Hooked in multivalue callbacks to DocValues. In addition to making functions work with Points (et al.) it should be possible to write functions that work on multivalued fields, but I did not undertake this work.
          5. Add in SchemaAware callback mechanism so that Field Types and other schema stuff can register dynamic fields, etc. after the schema has been created
          6. Updated the example to have spatial information in the docs, etc. See http://wiki.apache.org/solr/SpatialSearch
          7. Modified the distance functions to work with MultiValueSources
          8. cleaned up the tests
          9. Incorporated various comments from Chris and Yonik.
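
For orientation, the example schema changes referenced above look roughly like the following (an illustrative sketch, not copied from the patch; see the SpatialSearch wiki page for the actual syntax):

```xml
<!-- Illustrative sketch only: a 2-dimensional point type whose coordinates
     are indexed into sub-fields of the named sub type -->
<fieldType name="location" class="solr.PointType" dimension="2" subFieldType="double"/>
<field name="store" type="location" indexed="true" stored="true"/>
```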
          Chris A. Mattmann added a comment - - edited

          Hi All,

          Updated patch:

          1. Introduced a MultiValueSource - ValueSource that abstractly represents ValueSources for poly fields, and other things.

          I added javadoc to this and the ASF license header.

          1. Introduced PointValueSource - point(x,y,z) - a MultiValueSource that wraps other value sources (could be called something else, I suppose)

          I put the ASF header before the package decl, to be consistent with the other SOLR java files.

          1. Add in SchemaAware callback mechanism so that Field Types and other schema stuff can register dynamic fields, etc. after the schema has been created

          Added more javadoc here, and ASF license.

          1. Incorporated various comments from Chris and Yonik.

          Thanks, I appreciate it. I'm still -1 on the way this patch deals with the "optimization" issue. I'd like to see evidence that it makes sense to not use split and trim.

          Cheers,
          Chris

          Grant Ingersoll added a comment -

          I'm still -1 on the way this patch deals with the "optimization" issue. I'd like to see evidence that it makes sense to not use split and trim.

          My tests show it to be at least 7 times faster. But this should be obvious from static analysis, too. First of all, String.split() uses a regex which then makes a pass through the underlying character array. Then, trim has to go back through and analyze the char array too, not to mention the extra String creations. The optimized version here makes one pass and deals solely at the char array level and only has to do the substring, which I think can be optimized by the JVM to be a copy on write.

          
            public void testDistPerf() throws Exception {
              String [] input = new String[1000000];
              Random random = new Random();
              for (int i = 0; i < input.length; i++){
                input[i] = random.nextInt() + ", " + random.nextInt();
              }
              String [] out = new String[2];
              long time = 0;
              long start = System.currentTimeMillis();
              for (int j = 0; j < 50; j++) {
                for (int i = 0; i < input.length; i++){
                  split(input[i], out, 2);
                }
              }
              time = (System.currentTimeMillis() - start);
              System.out.println("Time: " + time);
              time = 0;
              start = System.currentTimeMillis();
              for (int j = 0; j < 50; j++) {
                for (int i = 0; i < input.length; i++){
                  DistanceUtils.parsePoint(out, input[i], 2);
                }
              }
              time = (System.currentTimeMillis() - start);
              System.out.println("Time: " + time);
            }
          
            private String[] split(String externalVal, String[] out, int dimension) {
              out = externalVal.split(",");
              if (out.length != dimension) {
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, "incompatible dimension (" + dimension +
                        ") and values (" + externalVal + ").  Only " + out.length + " values specified");
              }
              for (int j = 0; j < out.length; j++) {
                out[j] = out[j].trim();
              }
              return out;
            }
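
For comparison, the one-pass variant being benchmarked against can be reconstructed roughly as follows (a sketch only; the actual DistanceUtils.parsePoint in the patch differs in detail):

```java
public class OnePassParse {
    // Rough reconstruction of the single-pass, char-level parse discussed above;
    // the real DistanceUtils.parsePoint in the patch differs in detail.
    public static String[] parsePoint(String externalVal, int dimension) {
        String[] out = new String[dimension];
        int idx = 0, start = 0;
        while (idx < dimension && start <= externalVal.length()) {
            // skip leading spaces of the current token
            while (start < externalVal.length() && externalVal.charAt(start) == ' ') start++;
            int end = externalVal.indexOf(',', start);
            if (end == -1) end = externalVal.length();
            // back off trailing spaces before the comma
            int last = end;
            while (last > start && externalVal.charAt(last - 1) == ' ') last--;
            out[idx++] = externalVal.substring(start, last);
            start = end + 1;
        }
        if (idx != dimension) {
            throw new IllegalArgumentException("incompatible dimension (" + dimension
                + ") and values (" + externalVal + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        String[] p = parsePoint("34.333 , 100.1", 2);
        System.out.println(p[0] + "|" + p[1]); // prints "34.333|100.1"
    }
}
```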
          
          Grant Ingersoll added a comment -

          I think this is ready to commit. I'd like to do so on Monday or Tuesday of next week, so that should give plenty of time for further review

          Chris A. Mattmann added a comment -

          Hi Grant:

          My tests show it to be at least 7 times faster. But this should be obvious from static analysis, too. First of all, String.split() uses a regex which then makes a pass through the underlying character array. Then, trim has to go back through and analyze the char array too, not to mention the extra String creations. The optimized version here makes one pass and deals solely at the char array level and only has to do the substring, which I think can be optimized by the JVM to be a copy on write.

          Got it. A couple of points:

          1. 7x faster is great, but could end up being noise if x = 2 ms. It matters if x is say 2 minutes, agreed. If it's on the ms end then the expense of more lines of (uncommented) code isn't worth it.
          2. This code is likely to get called heavily on the indexing side, so performance, though still an issue, is not as hugely important as say on the searching side.
          3. If you feel strongly about an optimized version of this magic splitAndTrim function, how about a utility function and a refactor then? I would guess this code could be used elsewhere, and that would help to satisfy my hunger for reusability. I'll even javadoc the function and do the refactor if you'd like.

          Cheers,
          Chris

          Grant Ingersoll added a comment -

          Missing an & in DistanceUtils.parsePoint

          Chris A. Mattmann added a comment -

          Hi Grant:

          Thanks. Your latest patch omits the class-level javadoc I wrote for DelegatingFieldType and for the #inform method in SchemaAware.

          +/**
          + * An interface for {@link FieldType}s that are poly fields, as defined in <a
          + * href="http://issues.apache.org/jira/browse/SOLR-1131">SOLR-1131</a>, so that
          + * poly fields can declare the {@link FieldType}s of their sub-fields.
          + * 
          + * @since SOLR-1131
          + * 
          + **/
          +public interface DelegatingFieldType {
          +
          +  /**
          +   * 
          +   * Returns the {@link FieldType}s of the sub-fields for this poly-field.
          +   * 
          +   * @return A {@link List} of {@link FieldType}s for the sub-fields of a poly
          +   *         field.
          +   */
          +  public List<FieldType> getSubTypes();
          +}
          
          +public interface SchemaAware {
          +
          +  /**
          +   * Informs the {@link IndexSchema} provided by the <code>schema</code>
          +   * parameter of an event (e.g., a new {@link FieldType} was added, etc.).
          +   * 
          +   * @param schema
          +   *          The {@link IndexSchema} instance to inform of the update.
          +   * 
          +   * @since SOLR-1131
          +   */
          +  public void inform(IndexSchema schema);
          +}
          

          Other than that +1. Thanks for seeing this through to a great patch.

          Cheers,
          Chris

          Grant Ingersoll added a comment -

          Added Chris's comments.

          Yonik Seeley added a comment -

          Given the concerns of some people around automatically registering a dynamic field, perhaps we should optionally allow a subFieldSuffix to be passed instead of subFieldTypes.

          minor nit: why is "subFieldTypes" plural? We're only specifying a single type, right?

          Grant Ingersoll added a comment -

          minor nit: why is "subFieldTypes" plural? We're only specifying a single type, right?

          That's inherited from a more general mechanism that would allow multiple field types. So, the PointType only allows one, but other implementations may allow more.

          Given the concerns of some people around automatically registering a dynamic field, perhaps we should optionally allow a subFieldSuffix to be passed instead of subFieldTypes.

          That's fine by me. Given my workload, I will try to do this and commit on Thursday of this week, unless someone wants to take it before then.

          Yonik Seeley added a comment -

          That's inherited from a more general mechanism that would allow multiple field types.

          But multiple sub-field types would be used for different things, and should hence be separate parameters? So if a field type indexed points separately and indexed a field that contains a list of cartesian tiers, I could see the params being

          <fieldType ..., coordinateType="..." cartesianTierType="..."/>
          (or coordinateSuffix / cartesianTierSuffix)
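For illustration, such per-use parameters might look like this in schema.xml. This is a hypothetical sketch only; the attribute and type names (subFieldType, subFieldSuffix, tdouble) follow the patch under discussion rather than a committed API:

```xml
<!-- sketch: coordinates delegated to a named sub-type -->
<fieldType name="point" class="solr.PointType" dimension="2"
           subFieldType="tdouble"/>

<!-- alternative: resolve sub-fields through a dynamic-field suffix,
     so no dynamic field has to be registered automatically -->
<fieldType name="point2" class="solr.PointType" dimension="2"
           subFieldSuffix="_d"/>
<dynamicField name="*_d" type="tdouble" indexed="true" stored="false"/>
```

With the suffix form, the sub-fields of a field named home would be home_0_d and home_1_d, matching the explicitly declared dynamic field.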

Noble Paul added a comment - edited

          in FieldType#createFields(SchemaField field, FieldType delegatedType, String storageVal, boost, String ... externalVals)

String name = field.getName();
Map<String, String> props = new HashMap<String, String>();
//Just set these, delegate everything else to the field type
props.put("indexed", "true");
props.put("stored", "false");
//props.put("omitNorms", "true");
//props.put("tokenized", "false");
if (field.indexed()) {
  for (int j = 0; j < externalVals.length; j++) {
    //SchemaField is final, as is name, so we need to recreate each time
    //put the counter before the separator, b/c dynamic fields can't be asterisks on both the front and the end of the String
    SchemaField sf = SchemaField.create(name + "_" + j + POLY_FIELD_SEPARATOR + delegatedType.typeName, delegatedType, props);
    //QUESTION: should we allow for vectors, etc?  Not sure that it makes sense
    results[j] = delegatedType.createField(sf, externalVals[j], boost);
  }
}
          

          It is not clear as to why can't the 'sf' instance be cached and reused?

          we can also avoid creating the synthetic field name at query time in PointField#.getFieldQuery

          Why do we have a map for flags why not use a bitset?

          Grant Ingersoll added a comment -

          It is not clear as to why can't the 'sf' instance be cached and reused?

          Because there is no way to change the name on a SchemaField w/o changing SchemaField to be non-final. I don't think SchemaField should be non-final.

          Why do we have a map for flags why not use a bitset?

          Yeah, we could add a new method that takes a bitset, b/c I believe that is what is used under the hood anyway.

          Noble Paul added a comment -

          Because there is no way to change the name on a SchemaField w/o changing SchemaField to be non-final. I don't think SchemaField should be non-final.

Even if SchemaField is final we can pre-create and cache the SchemaField objects because the properties of the synthetic fields are known in advance. For instance, if you have a dimension of 2, the PointType instance will always have 2 well-known synthetic names and types that can be created well in advance and reused

          Grant Ingersoll added a comment -

          Even if SchemaField is final we can precreate and cache the SchemaField objects because the properties of the synthetic field is known in advance. For instance, if you have a dimension of 2 ,the PointType instance will always have 2 well known synthetic names and types that can be created well in advance and they can be reused

          True, but you need to also be able to change the name and it needs to be able to rely on the existing createField signature, which uses these values on the SchemaField. Earlier patches had a separate, internal createField() method that took in all the options (thus not requiring the SF at all) but they don't work for the delegation.

          I'm open to ideas, though, so throw up some code.

          Shalin Shekhar Mangar added a comment -

          I guess Noble was referring to something like what is done in this patch.

          1. DelegatingFieldType has a new method:
            public SchemaField[] getSubFields(SchemaField mainField);
            
          2. PointType and PlusMinusField implement this new method. It is not the prettiest way but this is one way to do it.
          3. With this approach, we can get the names from the subFields wherever the name is used (not implemented in this patch).

          The PlusMinusField is actually a field type and not a field so we should probably rename it to PlusMinusFieldType.

          Grant Ingersoll added a comment -

          OK, I see what you mean. I don't think we should add it onto the interface, though. I think it can just be handled by changing the signature of the createField method that takes in the delegatedFieldType.

          Noble Paul added a comment -

          Modified the Query creation to use the cached SchemaField names.

          Yonik Seeley added a comment -

I'm spot-checking multiple different patches at this point... but in general, we should strive to not expose the complexity further up the type hierarchy, and we should not limit what subclasses can do.

          isPolyField() returns true if more than one Fieldable can be returned from createFields()
          createFields() is free to return whatever the heck it likes.
And from SchemaField and FieldType's perspective, that's it. Implementation details are up to subclasses and we shouldn't add assumptions in base classes. There should be no concept of subFieldTypes or whatever baked into anything.

          So, from Noble's patch: we shouldn't try caching subfields in SchemaField... and esp not via "if (type instanceof DelegatingFieldType)"... it really doesn't belong there.
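To make that contract concrete, here is a minimal self-contained sketch using stand-in classes (SimpleField, PointLikeType, and the method signatures are illustrative, not Solr's real SchemaField/Fieldable API): isPolyField() only signals that createFields() may return several fields, and the base class assumes nothing about how a subclass splits the value.

```java
import java.util.ArrayList;
import java.util.List;

public class PolyFieldSketch {

  // Stand-in for a Lucene Fieldable: just a name/value pair.
  static final class SimpleField {
    final String name;
    final String value;
    SimpleField(String name, String value) { this.name = name; this.value = value; }
  }

  // Stand-in for FieldType: the whole poly-field contract is these two methods.
  static abstract class FieldType {
    boolean isPolyField() { return false; }  // default: one field per value
    abstract List<SimpleField> createFields(String fieldName, String externalVal);
  }

  // A point-like poly type splits "x,y" into one synthetic field per dimension.
  static final class PointLikeType extends FieldType {
    @Override boolean isPolyField() { return true; }
    List<SimpleField> createFields(String fieldName, String externalVal) {
      String[] parts = externalVal.split(",");
      List<SimpleField> out = new ArrayList<SimpleField>();
      for (int i = 0; i < parts.length; i++) {
        // one synthetic name per dimension; the naming convention is illustrative
        out.add(new SimpleField(fieldName + "_" + i, parts[i].trim()));
      }
      return out;
    }
  }

  public static void main(String[] args) {
    FieldType point = new PointLikeType();
    List<SimpleField> fields = point.createFields("home", "41.8, -71.4");
    if (!point.isPolyField()) throw new AssertionError("expected a poly field");
    if (fields.size() != 2) throw new AssertionError("expected one field per dimension");
    if (!fields.get(0).name.equals("home_0")) throw new AssertionError("bad synthetic name");
    if (!fields.get(1).value.equals("-71.4")) throw new AssertionError("bad sub-field value");
    System.out.println("ok");
  }
}
```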

          Noble Paul added a comment -

          we shouldn't try caching subfields in SchemaField

I believe the SchemaField is an ideal place to cache the 'synthetic' field info.

          and esp not via "if (type instanceof DelegatingFieldType)"... it really doesn't belong there.

          true. It was a quick and dirty way to demo the idea.

          Grant Ingersoll added a comment -

          I have a new patch in the works that makes creating the SchemaField lighter weight. I agree w/ Yonik, I don't think this can be cached in general. Also, I've done away with the Delegating Field Type.

          Noble Paul added a comment -

          I guess we need to revamp the API.

          The FieldType should act as a factory of SchemaField. And SchemaField does not have to be a final class. Solr Should do all the operations through that SchemaField

          Noble Paul added a comment -

          I have opened an issue for the same SOLR-1664

          Grant Ingersoll added a comment -

          Cleaned up the field creation a bit, more documentation in the example. I think this is ready to go.

          Chris A. Mattmann added a comment -

          Hey Grant:

          Let me give this a quick review. Won't take longer than 20 mins. Thanks for pushing forward on this.

          Cheers,
          Chris

          Chris A. Mattmann added a comment -

          +1. Looks good, Grant. Let's get this sucker committed...

          Thanks!

          Cheers,
          Chris

          Noble Paul added a comment -

          Cleaned up the field creation a bit, more documentation in the example. I think this is ready to go.

          Are you sure that we want to create a new String for every query/field creation?

          Grant Ingersoll added a comment -

          Are you sure that we want to create a new String for every query/field creation?

I don't see any way around it. The caching doesn't work b/c there may be field types where the number of SchemaFields may not be known.

          Noble Paul added a comment -

          it is possible to cache the objects in FieldType as

          private final Map<SchemaField, List<SchemaField>> subFields;
          

We can lazily initialize this Map for each SchemaField.
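A self-contained sketch of that lazy cache, assuming the synthetic sub-field names depend only on the main field's name and the dimension. Stand-in classes only; the cache is keyed by field name here because the stand-in SchemaField (like a simplified version of Solr's) defines no equals/hashCode:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;

public class SubFieldCacheSketch {

  // Stand-in for SchemaField: immutable name, as in Solr.
  static final class SchemaField {
    final String name;
    SchemaField(String name) { this.name = name; }
  }

  static final int DIMENSION = 2;

  // Lazily built cache: main field name -> its synthetic sub-fields.
  private final ConcurrentHashMap<String, List<SchemaField>> subFields =
      new ConcurrentHashMap<String, List<SchemaField>>();

  List<SchemaField> subFieldsOf(SchemaField main) {
    // computeIfAbsent builds the synthetic fields once per main field and
    // reuses them afterwards, avoiding per-document String/object creation.
    return subFields.computeIfAbsent(main.name, name -> {
      List<SchemaField> subs = new ArrayList<SchemaField>(DIMENSION);
      for (int i = 0; i < DIMENSION; i++) {
        subs.add(new SchemaField(name + "_" + i));
      }
      return subs;
    });
  }

  public static void main(String[] args) {
    SubFieldCacheSketch cache = new SubFieldCacheSketch();
    SchemaField home = new SchemaField("home");
    List<SchemaField> first = cache.subFieldsOf(home);
    List<SchemaField> second = cache.subFieldsOf(home);
    if (first != second) throw new AssertionError("expected the cached instance");
    if (!first.get(1).name.equals("home_1")) throw new AssertionError("bad synthetic name");
    System.out.println("ok");
  }
}
```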

          Yonik Seeley added a comment -

          It doesn't look like subFieldSuffix actually works correctly - I've made a lot of little changes already, so I'll just go ahead and take a shot at fixing it.

          Yonik Seeley added a comment -

          Here's an updated patch that fixes a lot of little bugs - hopefully others can use this as a base so we don't lose all the little changes. I also attached a diff of the patch to the previous patch to help people see what's changed (yuck... doesn't seem that readable though). This isn't finished though - I only got to the FieldType / IndexSchema changes, and some of the ValueSource stuff. I didn't get to distance and value source parsing stuff.

          Some of the changes:

          • small javadoc cleanups
          • fix subFieldSuffix so that it actually uses that suffix
          • make any utility methods on FieldType/IndexSchema dealing with "poly" field creation package protected - I don't think we want these public... it's specific for a field that adds only to other fields that it defines (with a specific naming convention) all of the same type. They probably don't even belong on the base classes, but I don't care so much if they aren't public or protected, we can remove later.
• make PointType actually delegate to the subFieldType... before it was assuming things like TermQuery and TermRangeQuery... this would have actually disabled NumericRangeQuery speedups!
          • remove SchemaField creation from PointType - we should get fields from the schema
          • fixed some value sources that didn't weight correctly
          • fix createFields() to return Fieldable instead of Field

          When I fixed up point type, I did so in many places by assuming 2 points (so it will break for other dimensions).

I had been working off the assumption that we wanted a geo-specific base class to delegate some things to (like the most efficient way to get a bounding box, etc). If so, we need to decide what that class will be. Making it point or coordinate already bakes in a lot of implementation details (subType stuff). Do we want geo to just work off of a generic n-dimensional point class, or should we have a 2d lat/lon? It does feel like we're losing something by trying to over-generalize. The PointTypeValueSource is inner-loop stuff, so I did specialize that for lat/lon.

          Grant Ingersoll added a comment -

Do we want geo to just work off of a generic n-dimensional point class, or should we have a 2d lat/lon?

          I think we want generic. Some geo stuff will want 3D (elevation).

          remove SchemaField creation from PointType - we should get fields from the schema

          I thought about that, but why should we have to figure out what the dynamic field is every time when we already know it?

make PointType actually delegate to the subFieldType... before it was assuming things like TermQuery and TermRangeQuery... this would have actually disabled NumericRangeQuery speedups!

          Where do you mean? In createField? Or in the getFieldQuery?

          Grant Ingersoll added a comment -

          Actually, we should probably move the creation of the actual PointType off to SOLR-1586, anyway. The reason I want n-dimensional is b/c this stuff is useful for more than just traditional lat/lon.

          Chris A. Mattmann added a comment -

          Actually, we should probably move the creation of the actual PointType off to SOLR-1586, anyway. The reason I want n-dimensional is b/c this stuff is useful for more than just traditional lat/lon.

          +1, I was going to suggest the same thing. BTW, the geohash field type is ready when you have a chance to take a look.

          Cheers,
          Chris

          Yonik Seeley added a comment -

          I think we want generic. Some geo stuff will want 3D (elevation).

          I imagine 99% of the users of Point will be those wanting straight 2D geo search (limit by simple distance, or sort by simple distance).
          I don't see how elevation would help in isolation since more accurate distance measures would presumably need map info for driving distance.
          I could see people wanting to store 3D points, etc, but they can do that today.

          I thought about that, but why should we have to figure out what the dynamic field is every time when we already know it?

          It's more of an issue of directly creating fields... that's currently up to the schema (or the FieldType if some form of SOLR-1664 goes in).

> make PointType actually delegate to the subFieldType... before it was assuming things like TermQuery and TermRangeQuery... this would have actually disabled NumericRangeQuery speedups!

          Where do you mean? In createField? Or in the getFieldQuery?

          In getFieldQuery and getRangeQuery... I changed to actually delegate to the subFieldType.getFieldQuery and getRangeQuery.
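The delegation idea can be sketched as follows, with stand-in types that only build query descriptions (all names here are illustrative, not Lucene/Solr's API): the poly type hands each coordinate to its sub-type, so a numeric sub-type keeps its optimized range query instead of a hardcoded TermRangeQuery.

```java
import java.util.StringJoiner;

public class DelegationSketch {

  // Stand-in field type: knows how to build a (described) range query.
  interface FieldType {
    String rangeQuery(String field, String lower, String upper);
  }

  // A numeric sub-type with its own optimized range query.
  static final class TrieDoubleLike implements FieldType {
    public String rangeQuery(String field, String lower, String upper) {
      return "NumericRangeQuery(" + field + ":[" + lower + " TO " + upper + "])";
    }
  }

  // The poly type does not assume TermRangeQuery; it delegates each
  // dimension to the sub-type so the sub-type's optimizations survive.
  static final class PointLikeType implements FieldType {
    private final FieldType subType;
    private final int dimension;
    PointLikeType(FieldType subType, int dimension) {
      this.subType = subType;
      this.dimension = dimension;
    }
    public String rangeQuery(String field, String lower, String upper) {
      String[] lo = lower.split(",");
      String[] hi = upper.split(",");
      StringJoiner and = new StringJoiner(" AND ");
      for (int i = 0; i < dimension; i++) {
        and.add(subType.rangeQuery(field + "_" + i, lo[i].trim(), hi[i].trim()));
      }
      return and.toString();
    }
  }

  public static void main(String[] args) {
    FieldType point = new PointLikeType(new TrieDoubleLike(), 2);
    String q = point.rangeQuery("home", "10,20", "30,40");
    if (!q.contains("NumericRangeQuery(home_0:[10 TO 30])"))
      throw new AssertionError("first dimension was not delegated");
    if (!q.contains("NumericRangeQuery(home_1:[20 TO 40])"))
      throw new AssertionError("second dimension was not delegated");
    System.out.println(q);
  }
}
```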

          Grant Ingersoll added a comment -

          I imagine 99% of the users of Point will be those wanting straight 2D geo search (limit by simple distance, or sort by simple distance).
          I don't see how elevation would help in isolation since more accurate distance measures would presumably need map info for driving distance.
          I could see people wanting to store 3D points, etc, but they can do that today.

          But it's not like the code is any faster or more complicated and neither is the interface the user sees (I'm sure most people know what a point is). If I have a generic Point, now I have a Vector. With a Vector, I can do all kinds of interesting things. But rather than fret over it, we can have both. LatLonFieldType and PointFieldType.

          Steve Rowe added a comment -

Maybe Point -> { Point2D, Point3D } ?

          Grant Ingersoll added a comment -

          It's more of an issue of directly creating fields... that's currently up to the schema

          I'm not following. Is the problem in:

          @Override
            public Fieldable[] createFields(SchemaField field, String externalVal, float boost) {
              String[] point = DistanceUtils.parsePoint(null, externalVal, dimension);
              return createFields(field, dynFieldProps, subType, externalVal, boost, point);
            }
          
          Grant Ingersoll added a comment -

          fix createFields() to return Fieldable instead of Field

          This seems a bit weird (even though I understand why) due to the fact that the other createField methods actually return Field and not Fieldable.

          Yonik Seeley added a comment -

          > > It's more of an issue of directly creating fields... that's currently up to the schema
          > I'm not following. Is the problem in:
          [...]
          No, I meant directly creating SchemaFields doesn't seem great. If we put a cache in, it would bypass that too.

          This seems a bit weird (even though I understand why) due to the fact that the other createField methods actually return Field and not Fieldable.

          If it weren't for back compat, we would have already changed createField to return Fieldable (and I think there's a SOLR issue somewhere that either does this or depends on it). Fields are a lot more limiting than Fieldables (but createField was in Solr before there even was a Fieldable).

if SOLR-1664 goes ahead it might be a natural place to make everything in SchemaField deal in Fieldables?

          Noble Paul added a comment -

          > If SOLR-1664 goes ahead, it might be a natural place to make everything in SchemaField deal in Fieldables?

          What stops us from resolving SOLR-1664 ?

          Grant Ingersoll added a comment -

          Updated to address some of Yonik's concerns about the implementation. I didn't pick up his Javadoc changes because it was too hard to determine the differences.

          Other notes:

          1. Removed PlusMinusField and just used the PointType for those tests. This then allowed me to move the registerPoly static method to the CoordinateFieldType
          2. I kept the n-dimensional point. All of our distances work on vectors, so I see no reason not to keep them. Performance-wise, most people w/ dimension of 2 or 3 will see little if any difference between this and specifically calling out a lat/lon field type.
          3. I believe I cleaned up the public method stuff so that helper methods are now package private.
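The point that the distance code can stay generic over dimension, rather than hard-coding lat/lon, can be made concrete with an n-dimensional Euclidean distance. This is an illustrative sketch only, not Solr's actual distance implementation:

```java
public class VectorDistanceSketch {
    // n-dimensional Euclidean distance: the same code serves 2-D points,
    // 3-D points, and beyond, which is why keeping the n-dimensional
    // PointType costs little over a dedicated lat/lon type.
    static double dist(double[] a, double[] b) {
        if (a.length != b.length)
            throw new IllegalArgumentException("dimension mismatch");
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Same code handles 2-D and 3-D points.
        System.out.println(dist(new double[]{0, 0}, new double[]{3, 4}));       // 5.0
        System.out.println(dist(new double[]{1, 2, 2}, new double[]{0, 0, 0})); // 3.0
    }
}
```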
          Grant Ingersoll added a comment - - edited

          Note, I also changed PointValueSource to be ToMultiValueSource, as it really isn't just a point, but I'm not married to that name either.

          Grant Ingersoll added a comment -

          Added some range tests.

          Grant Ingersoll added a comment -

          Committed revision 893746.

          Leaving open for a little while to deal with any side effects.

          Yonik Seeley added a comment - - edited

          OK, I'll try to diff what I had done before and re-make those changes.
          edit: I also see bug fixes I had made that got lost... I'll do a full re-review.
          edit: It appears the subFieldSuffix is broken again too.

          Grant Ingersoll added a comment -

          OK, I thought I had got them all, but feel free to commit as you see fit.

          Grant Ingersoll added a comment -

          Changed toMultiVS to vector(): Committed revision 894183.
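For context, the renamed function is used inside function queries, e.g. combined with dist(); field names x_d and y_d below are hypothetical, so consult the function query documentation for the exact syntax in your Solr version:

```text
dist(2, vector(x_d, y_d), vector(0, 0))
```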

          Hoss Man added a comment -

          Correcting Fix Version based on CHANGES.txt, see this thread for more details...

          http://mail-archives.apache.org/mod_mbox/lucene-dev/201005.mbox/%3Calpine.DEB.1.10.1005251052040.24672@radix.cryptio.net%3E

          Grant Ingersoll added a comment -

          Bulk close for 3.1.0 release


            People

            • Assignee: Grant Ingersoll
            • Reporter: Ryan McKinley
            • Votes: 0
            • Watchers: 4