Details

Type: New Feature

Status: Closed

Priority: Major

Resolution: Fixed

Affects Version/s: None

Fix Version/s: 5.2

Component/s: modules/spatial

Labels:None

Lucene Fields:New
Description
I would like to explore contributing a geo3d package to Lucene. This can be used in conjunction with Lucene search, both for generating geohashes (via spatial4j) for complex geographic shapes, as well as limiting results resulting from those queries to those results within the exact shape in highly performant ways.
The package uses 3d planar geometry to do its magic, which basically limits computation necessary to determine membership (once a shape has been initialized, of course) to only multiplications and additions, which makes it feasible to construct a performant BoostSourcebased filter for geographic shapes. The math is somewhat more involved when generating geohashes, but is still more than fast enough to do a good job.

 geo3d.zip
 52 kB
 Karl Wright

 geo3dtests.zip
 7 kB
 Karl Wright

 LUCENE6196_Geo3d.patch
 237 kB
 David Smiley

 LUCENE6196additions.patch
 48 kB
 Karl Wright

 LUCENE6196fixes.patch
 23 kB
 Karl Wright

 ShapeImpl.java
 5 kB
 Karl Wright
Activity
 All
 Comments
 Work Log
 History
 Activity
 Transitions
Commit 1678060 from David Smiley in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1678060 ]
LUCENE6196: Fix RandomizedShapeTestCase.randomPointIn(Shape)
Commit 1678059 from David Smiley in branch 'dev/trunk'
[ https://svn.apache.org/r1678059 ]
LUCENE6196: Fix RandomizedShapeTestCase.randomPointIn(Shape)
Regarding the test failure earlier today: http://jenkins.thetaphi.de/job/LuceneSolrtrunkMacOSX/2270/ this is a bug in RandomizedShapeTestCase, not Geo3d. (RSTC in turn was copied from Spatial4j and is here temporarily while Geo3d is). The randomPointIn method should be testing if the random point is in the shape, not the bbox. But I'm concerned especially skinny shapes may loop forever... so I think I should address this somehow, like returning null so that the caller can give up. I'll do that.
Committed. Thanks again for your contribution Karl! On the Spatial4j side I'll be adding some changes to get this in with more integration (e.g. tiein to WKT parse & SpatialContext). I'm not sure if it'll be the next release or the one after.
Commit 1678007 from David Smiley in branch 'dev/branches/branch_5x'
[ https://svn.apache.org/r1678007 ]
LUCENE6196: Geo3d API, 3d planar geometry for surface of a sphere.
This merge commit renames a couple test utilities that appeared to be tests but weren't.
Commit 1678005 from David Smiley in branch 'dev/trunk'
[ https://svn.apache.org/r1678005 ]
LUCENE6196: Geo3d API, 3d planar geometry for surface of a sphere.
This merge commit renames a couple test utilities that appeared to be tests but weren't.
+1 for merge to trunk.
I think the Geo3d branch, technically lucene6196, is now ready to merge into trunk, and then the 5x branch. I could generate a patch, but unless there are process reasons (e.g. I "have to"?) or technical reasons I am unaware of, I'll simply merge in the branch. The CHANGES.txt entry I plan to add is as follows:
* LUCENE6196: New Spatial "Geo3d" API with partial Spatial4j integration. It is a set of shapes implemented using 3D planar geometry for calculating spatial relations on the surface of a sphere. Shapes include Point, BBox, Circle, Path (buffered line string), and Polygon. (Karl Wright via David Smiley)
Karl, if you suggest any changes then just let me know. If I don't get another +1 then I'll commit in two days.
Commit 1677670 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677670 ]
LUCENE6196: committing Karl's latest patch
https://reviews.apache.org/r/33811/ (diff #3)
Commit 1677669 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677669 ]
LUCENE6196: Mark @lucene.experimental or @lucene.internal
Commit 1677658 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677658 ]
LUCENE6196: Mark @lucene.experimental or @lucene.internal
Commit 1677656 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677656 ]
LUCENE6196: Fix javadoc issues; ant precommit is happy.
Commit 1677595 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677595 ]
LUCENE6196: Reformat code. Removed System.err & legacy comments in test. Fixed test compile warning.
Commit 1677527 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677527 ]
LUCENE6196: committing Karl's latest patch
https://reviews.apache.org/r/33780/ (diff #1)
Commit 1677205 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1677205 ]
LUCENE6196: committing Karl's latest patch
https://reviews.apache.org/r/33509/ (diff #6)
Commit 1675777 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1675777 ]
LUCENE6196 testRelateWithRectangle shouldn't print last shape relation when we fail due to too few predicate occurrences.
Commit 1675747 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1675747 ]
LUCENE6196: committing Karl's latest patch
https://reviews.apache.org/r/33476/ (diff #2)
Addresses getBoundingBox issues
Commit 1675386 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1675386 ]
LUCENE6196: Geo3dShapeTest > Geo3dShapeRectRelationTest and complete testing of the 4 shape types
Commit 1675385 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1675385 ]
LUCENE6196: RectIntersectionTestHelper: fix to work with testing rectangles, and more clearly test when the shape.getBoundingBox may be faulty
Commit 1675374 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1675374 ]
LUCENE6196: committing Karl's latest patch
https://reviews.apache.org/r/33353/ (diff #9) https://reviews.apache.org/r/33353/diff/raw/
Commit 1674732 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1674732 ]
committing Karl's latest LUCENE6196fixes.patch (plus my test failures)
https://reviews.apache.org/r/33268/diff/raw/
WIP; tests fail
Commit 1674049 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1674049 ]
committing Karl's latest LUCENE6196fixes.patch
https://issues.apache.org/jira/secure/attachment/12725741/LUCENE6196fixes.patch
WIP; tests fail
Final set of tests working.
Yet more fixes, this time for circles
Fixes that include polygon testing for 4 and 5 sided polygons, with a better "illegal polygon" detector.
This revised fixes patch catches IllegalArgumentException and picks a different triangle when it occurs.
Fixes for failing tests
New version of additions with problem fixed
This patch adds support for degenerate cases, and corrects a bug in GeoWideRectangle.
Commit 1673165 from David Smiley in branch 'dev/branches/lucene6196'
[ https://svn.apache.org/r1673165 ]
LUCENE6196: Geo3d initial checkin
Deltas from Karl's first upload: change of package, some hashCode() impls, a few toString() impls, some javadoc formatting. New Geo3dRtTest. Geo3dShape throws an exception if not geo.
Thanks Karl for the precision.
On the javac error, you can fix them by replacing expression like PI/2 > PI/2 by
{@literal PI/2 > PI/2}
or alternatively replace > by >
Hi Martin,
The geo3d package is based wholly on the unit sphere centered on an x,y,z coordinate origin. There is no place where the radius of the earth etc factor into the math. In order to cast the problem in terms of any particular standard, therefore, you need to multiply distances computed by the radius of the earth.
Oblateness is also not considered for the purposes of this package. Its best use is in conjunction with maps which also are projected to a sphere. Luckily this happens to be quite common.
I had a quick look at the new classes. I would like to mention one possibility in case it may be of interest. I saw various classes related to a latitude/longitude bounding box. If a dependency toward the GeoAPI interfaces is considered acceptable (GeoAPI is a translation of some international standards into Java interfaces, and is published by the Open Geospatial Consortium), then those classes could implement the Envelope interface. The getCoordinateReferenceSystem() method could return null for now, but this would keep the door open for connecting the Geo3D bounding box to a map reprojection engine in a future version if desired. For example if a future Geo3D version returns a nonnull value, then the Apache SIS reprojection method could work directly with those Geo3D classes. If a GeoAPI dependency is considered not desirable at this time, just checking that there is no incompatibility between the Geo3D classes and Envelope API may help to keep the door open.
By the way, the Geo3D documentation said that it works with latitude/longitude in radians, but I saw no mention of which geographic system (there is many, with differences up to 3 kilometres between them  ignoring those who do not use Greenwich as their prime meridian). This is the kind of information which is normally provided by Envelope.getCoordinateReferenceSystem(), but otherwise just a note in the documentation may help. I presume that Geo3D implies the WGS84 system?
Hi David,
Your test fails because it is trying to define shapes using lat/lon in degrees. geo3d uses radians for everything.
Also, when I apply the patch to current lucene trunk, I get 10 build errors. Looks like the lucene javadoc settings are blowing up the build:
[javac] C:\wip\lucene\lucene\spatial\src\java\org\apache\lucene\spatial\spatial4j\geo3d\GeoConvexPolygon.java:40: error: bad use of '>' [javac] * Accepts only values in the following ranges: lat: PI/2 > PI/2, lon: PI > PI [javac] ^ [javac] C:\wip\lucene\lucene\spatial\src\java\org\apache\lucene\spatial\spatial4j\geo3d\GeoConvexPolygon.java:40: error: bad use of '>' [javac] * Accepts only values in the following ranges: lat: PI/2 > PI/2, lon: PI > PI [javac] ^ [javac] C:\wip\lucene\lucene\spatial\src\java\org\apache\lucene\spatial\spatial4j\geo3d\GeoConvexPolygon.java:56: error: bad use of '>' [javac] * Accepts only values in the following ranges: lat: PI/2 > PI/2, lon: PI > PI [javac] ^ [javac] C:\wip\lucene\lucene\spatial\src\java\org\apache\lucene\spatial\spatial4j\geo3d\GeoConvexPolygon.java:56: error: bad use of '>' [javac] * Accepts only values in the following ranges: lat: PI/2 > PI/2, lon: PI > PI
Hi David,
I received an invite from the review board. Not sure what to do with it exactly.
I'll look at the test failure this evening and get back to you this evening about that.
Attached is a patch file that includes this code in the spatial module under this package: org.apache.lucene.spatial.spatial4j.geo3d.
I added Geo3dShape to org.apache.lucene.spatial.spatial4j. I also added a Geo3dRptTest class which extends RandomSpatialOpStrategyTestClass configured to use the new CompositeSpatialStrategy, and with random indexed rectangles. Right now to keep things simple, the query shapes are always triangles created via Geo3d. I quickly encountered a failure and I minimized it to a simple test; see testTriangleDisjointRect().
I added some toString()'s to a few classes to aid debugging but not all of them. To those same classes I added hashCode – but again, almost all of these classes should have these things.
I put the code up in ReviewBoard:
https://reviews.apache.org/r/33114/
I've only used that tool a little so I'm not sure if it's possible for you to upload more diff's, Karl, or wether only the creator (me) can. If that's a problem; you could simply recreate. I couldn't add you as a Reviewer because you don't seem to be in the system.
I do like the idea of committing it temporarily for the aforementioned reasons. Still I need to spend some handson time preparing a patch of it integrated with a SpatialContext implementation and with some existing tests using these Shape implementations instead of the default ones. For example, a subclass of RandomSpatialOpFuzzyPrefixTreeTest that provides randomish Geo3d shapes. In doing so, it'll test that the bounding box of a shape is not smaller than it needs to be, and that a Geo3d Shape reports the correct intersection relationship with a rectangle. At least, this is what I plan to do when I start.
Thanks. If this (temporarily at least) resides in lucene, I'll do what is necessary to extend/maintain it.
It sounds like the SIS people have similar concerns about the searchfocus of the package that I did. Still, they're welcome to use it as a basis for the 3d package they intend to develop for the long term.
In the meantime it might not be a bad idea putting this in an org.apache.lucene.spatial.geometry package in the _5x branch to give it some mileage and iterations in the wild until it can find its permanent home.
+1, let's just add it here: progress not perfection!
I had reached out to SIS a few days ago to make them aware of his contribution
Thanks Nicholas Knize! Here's the discussion over in our sister project, Apache SIS: http://markmail.org/message/f2k2ykyqisclrzi3
Hi Nicholas,
I'm afraid that the only project I can contribute workrelated development to would be Lucene/Solr.
In the meantime it might not be a bad idea putting this in an org.apache.lucene.spatial.geometry package in the _5x branch to give it some mileage and iterations in the wild until it can find its permanent home. That guarantees Karl the ability to continue to contribute while David works it into S4J.
Also, since Karl can't contribute if it goes into S4J (anything changed here Karl?) I had reached out to SIS a few days ago to make them aware of his contribution and gauge their interest on collaboration. I'm assuming you can contribute to a sister Apache project? They're diligent on being ISO 19107 and OGC compliant which is a plus but also complicates the integration a bit. A nice integration exposes the functionality to the referencing and projection capabilities (something I'll be getting back to with S4J, David). There's an open dialog on the SIS mailing list tossing around ideas if you guys want to jump in the conversation.
Ok, I will wait to see how the integration progresses before asking the SIS team if they would have an interest.
The body of code involved is obviously highly tuned for the search problem; it's not as good a fit for the SIS mission (seems to me). But I'd rather it went there than nowhere.
It's entry into Spatial4j is somewhat entangled with the incubation process for Spatial4j into LocationTech/Eclipse which is slowly advancing along. Since that's been happening slowly, I'm tempted to integrate it before and not after. I think in a week or so I'll give that a shot, esp. if there's no incubation progress.
Wether Geo3d becomes a part of SIS or not is up to thatproject. I imagine they would be interested but I don't know.
Not sure what happened to this, but if it hasn't landed in either Lucene or Spatial4j, perhaps Apache SIS would be interested in it?
Shalin, see Nick and David's comments above, might help answer your question..
Based on the license included with the contribution, anyone can take the code anywhere they want, provided the license remains intact.
Okay, I missed that part. But still, quoting you:
However, as I have indicated earlier in this thread, unless this code winds up in Lucene or Solr, for the moment I can't contribute to it any further. This may change in the future, but that is the situation at the moment.
You (the original author) can neither contribute it directly to another project nor are you able to contribute to it further unless it lands in Lucene or Solr, at least for now. That, for me, is reason enough to have this code in Lucene/Solr land.
...by whose permission (certainly not the original contributor's)?
Based on the license included with the contribution, anyone can take the code anywhere they want, provided the license remains intact. However, I agree that that still doesn't answer the question as to why it isn't being included in Lucene, as originally proposed.
I plan to incorporate it into Spatial4j, and the first step is an Eclipse "CQ" (Contribution Questionnaire) which is part of the IP process for code > 1000 lines. I've done that and it needs one more PMC vote there (I think).
This is a contribution to Lucene/Solr and it has been made quite clear by the contributor that he can only contribute this to Lucene/Solr. Why are you trying to contribute it to some other project and by whose permission (certainly not the original contributor's)? This is a genuine question, not a complaint. Please help me understand.
Hi Karl,
I plan to incorporate it into Spatial4j, and the first step is an Eclipse "CQ" (Contribution Questionnaire) which is part of the IP process for code > 1000 lines. I've done that and it needs one more PMC vote there (I think).
p.s. I'm on vacation this week so I may not be so responsive.
Any news as to what is going to happen with this contribution?
Fixed an issue with bounding boxes and longitude slices; updated tests
Minor cleanups, and reduction of duplicated code...
I theoretically don't need anything from you for this to have a home in Spatial4j but I'll let you know otherwise â€“ e.g. a simple attestation that you had permission from your employer to license it assuch.
Are you asking me yet again whether I had permission to contribute this?
The permission I have from my employer is a longgranted corporate permission to contribute to Lucene and Solr. My employer is aware that any and all code contributed to this project will be licensed with the Apache 2 license. There are, of course, certain restrictions as to exactly what can be contributed, but I do not believe this geospatial library violates any of those restrictions.
I hope this is sufficient; you are unlikely to get more detail.
However, as I have indicated earlier in this thread, unless this code winds up in Lucene or Solr, for the moment I can't contribute to it any further. This may change in the future, but that is the situation at the moment.
Aha, my hunch was right then. You've already made your contribution even though it hasn't been committed here. I theoretically don't need anything from you for this to have a home in Spatial4j but I'll let you know otherwise – e.g. a simple attestation that you had permission from your employer to license it assuch.
I still want to hook a Lucene spatial test to it so see how it fairs.
Nicholas,
In the long run I doubt I will continue to have the same restrictions in place as to what I can contribute to. But for now, they are what they are, and I have to live within them.
My recommendation: Keep geo3d as a distinct library, within Lucene for the moment, but with the option of eventually spinning it off or integrating with Spatial4j. To that end, it would be ideal if it were packaged in its own jar, etc.
Intent and design of general utility libraries certainly vary. One thing is for sure, they all start somewhere. IMHO this is a good start to the 3D computational geospatial problem space and incubating it within Spatial4J (rc0.4.2 or rc0.5) would give it broader exposure / potential for greater contribution from the geospatial community.
That being said, you're the original author. I suppose its up to the community to decide where the contribution best fits? Taking into consideration, of course, the inability for you to continue contributing to one of the candidate packages.
Where in the design is it restricted to lucene?
I did not claim it was restricted to Lucene. My claim is that the functionality is expressly designed to solve the geographic search problem, which to me consists of three parts: geohash construction, highlyperformant result filtering, and highlyperformant distance scoring functionality. A general package, in my view, is distinct in the following ways:
(1) It tends to try to solve a broader set of geographic problems, i.e. computing a shape's area, intersecting shapes, etc.
(2) There is much less emphasis on the highlyperformant computational requirements mentioned above; general packages by and large don't have the "expensive construction/dirt cheap individual evaluation" requirement that search engines like Lucene would have.
Having said that, I have no objection if you want to use this code in spatial4j. I just cannot contribute to spatial4j at the moment. And I do think that there is a closeenough relationship between the search problem and geo3d that it isn't unreasonable to include geo3d in Lucene.
Spatial4j is totally wired into a latitude/longitude bounding box geometry, and most of the methods required by Shape objects have no implementation in a geo3d world.
David can elaborate, but at the moment that's only a temporary delimitation of spatial4j, not a limitation. Spatial4J is intended to expand on JTS (but w/ an ASF license) for geospatial applications, not intended to remain restricted to a 2D (lat/lon) set of capabilities. That's why integrating this within S4J is an attractive approach.
The geo3d world has exactly what is needed for lucene searching, and no more.
It looks to me that any index/search approach based on a space filling curve would benefit from geo3d. In fact, what you have here can just as easily be applied to an R/R*Tree algorithm. Where in the design is it restricted to lucene?
I know on the dev list you mentioned you only want to contribute this to Lucene but you didn't say the reason. I'm guessing it's so that you needn't reask permission from your employer, whom you may have agreements for a limited number of specific opensource projects.
That is the case.
The license permits it to be incorporated into any other ASL project (e.g. Spatial4j) with ease.
Yes, I know that. Spatial4j, however, seemed at its core to have a much different set of capabilities in mind. Spatial4j is totally wired into a latitude/longitude bounding box geometry, and most of the methods required by Shape objects have no implementation in a geo3d world. The geo3d world has exactly what is needed for lucene searching, and no more. So they are not great fits to one another, IMHO.
There's one thing I want to confirm with you Karl: so these shares are "accurate" to the spherical model (the unit sphere)? That is, if I have say a line string, then is each line segment a greatcircle path between its start & end? If not then can you please explain?
In part this depends on the shape. For GeoConvexPolygons, the surface paths are all great circles. For GeoCircles, it's a circle not a great circle. For GeoPaths, the boundary of the shape consists of a zone of a specific width on either side of a great circle, so the boundary is not (if you think about it) actually a great circle. For GeoBBox shapes (e.g. GeoRectangles), they are great circles in longitude, but horizontal slices in latitude. This matches the standard "rectangle" metaphor that quad trees are built on.
I completely agree with Nick but was waiting to make my Spatial4j pitch Computational geometry has nothing to do with Lucene, but Lucene has a use for tapping into computational geometry libs. I know on the dev list you mentioned you only want to contribute this to Lucene but you didn't say the reason. I'm guessing it's so that you needn't reask permission from your employer, whom you may have agreements for a limited number of specific opensource projects. If that's the case, you've already done your part and complied – the code is attached to this JIRA and has an ASL license. The license permits it to be incorporated into any other ASL project (e.g. Spatial4j) with ease. I didn't want to hinder you getting to this point.
So I went ahead and took a quick look. I do have a question worth thinking about. Is there anything in here I'm missing that's necessitating this be a part of the Lucene spatial code base? Since its essentially a computational geometry package for 3D space using a unit sphere (and not enhancing or changing the core indexing and search capabilities for geo) why not instead add it to spatial4j as is? That way it can benefit from enhancements from spatial4j (e.g., crs projections) and remain decoupled from Lucene capabilities of handling the evolving spatial indexing strategies?
Polygons – awesome! And of course tests are essential. I'll take another look in a couple days or so. What I plan to do is enhance an RPT test (like RandomSpatialOpFuzzyPrefixTreeTest) to actually use these shapes and see if it works from that side of things. As long as these shapes can compute their relationships with a latlon bounding box correctly, then we should be in business.
There's one thing I want to confirm with you Karl: so these shares are "accurate" to the spherical model (the unit sphere)? That is, if I have say a line string, then is each line segment a greatcircle path between its start & end? If not then can you please explain?
Rather than continue to iterate this to perfection (though I do love seeing all this math!), I think it would make sense to commit it and open further issues to improve it? It seems like an awesome contribution already?
Progress not perfection...
Added polygon support. Is there anything else you need before committing this code?
Problems have been fixed, and tests are reasonably comprehensive.
Found some problems with the Bounds computation. I won't have time to look at these until the weekend though.
Had some time today, and finished the implementation. Started writing tests but not done with that yet.
I'm completely out of time to work on this this week, and probably next week as well. Where it is left:
 I have a correct formula for min and max latitude for plane/sphere intersect
 I have what appears to be a workable approach for min/lmax longitude for plane/sphere intersect, but there are at least a half dozen corner cases, and I have yet to come up with a formula that checks out.
Picking this up again sometime mid February...
Unless I made a math error, I get the following for longitude bounds:
ADx0 + BDy0  C^2 = 0
... combined with either:
[C^2 * D^2 * (B^2 + A^2)] * y0^2 + [2*C^4*BD] * y0 + [C^6 + A^2 * C^4 + A^2 * D^2 * C^2] = 0 x0 = (BDy0 + C^2) / AD
... or its counterpart, solving for x0:
[C^2 * D^2 * (B^2 + A^2)] * x0^2 + [2*C^4*BD] * x0 + [C^6 + B^2 * C^4 + B^2 * D^2 * C^2] = 0 y0 = (ADx0 + C^2) / BD
Either way, the minimum or maximum longitude, provided there are any solutions (x0,y0), is given by:
longitude = atan(y0/x0)
So as soon as I confirm this math, I can implement a general bounds solution for an arbitrary plane intersecting the unit sphere, which should allow me to provide bounding boxes for all shapes built on 3d bounding planes.
LOL Absolutely.
The full point formulas for (x,y,z) for min/max z are as follows:
z = DC +/ sqrt(D^2*C^2 + 1  C^2  D^2) y = B[D+Cz] / [A^2 + B^2] x = A[D+Cz] / [A^2 + B^2]
For rectangles, and any shape that uses only planes going through the origin, this is all that I need. But for shapes using other planes, I need a similar formula for min/max longitude. Working on that now.
I note that I"m preaching to the choir... chalk it up as an enhancement.
So for a GEO context (I'm assuming "geo" implies geospatial) "spherical" operations should use the oblate spheroid (WGS84 is fine for keeping it simple). Otherwise its just a normal spherical coordinate system and geospatial operations will be erroneous and misleading to users.
Thus far, Lucene spatial supports a Euclidean model (for "projected" data or nongeospatial uses), and a spherical one. Supporting ellipsoidal or other models is definitely out of scope of this issue.
OK. Cool. So this initial package is for a spherical system not a geodesic system then. I assume followon work (before lucenespatial integration) includes correcting for error introduced by an incorrect datum.
Hi Nicholas,
The rest of the geo3d package uses the unit sphere. Oblate spheres would require quite a number of changes, so I'm not going there for the time being.
You'll want the oblate spheroid not the sphere. So don't forget the semiaxis coefficients (normalized if you're using unit normal vectors) e.g.,
(x^2+y^2)/a^2 + z^2/c^2  1
I found a noncalculus way to do it, yielding:
z = DC +/ sqrt(D^2*C^2 + 1  C^2  D^2)
This should be enough to allow me to solve the problem completely.
Attention math whizzes: So in order to generally solve the problem of bounding box for a shape, this is what we need.
We have a plane with equation Ax + By + Cz + D = 0. We have the unit circle, with equation x^2 + y^2 + z^2 = 1. We want to find the points (x,y,z) satisfying both equations with the maximum z and minimum z values. There should be zero, one, or two of these.
I've tried a number of things; the technique that looks like it has the best chance of success at the moment is the technique of Lagrangian multipliers. See http://en.wikipedia.org/wiki/Lagrange_multiplier . I think this would basically apply with:
g(x,y) = x^2 + y^2 + [(A/C)x  (B/C)y  (D/C)]^2  1 = [(A^2/C^2)x^2 + (B^2/C^2)y^2 + (D^2/C^2) + (2AB/C^2)xy + (2AD/C^2)x + (2BD/C^2)y] + x^2 + y^2  1 = [1+A^2/C^2] x^2 + [1+B^2/C^2] y^2 + [(D^2/C^2)1] + (2AB/C^2)xy + (2AD/C^2)x + (2BD/C^2)y f(x,y) = (A/C)x  (B/C)y  (D/C)
But it's been quite a while since my multivariable calculus days. Anyone want a nice math challenge? E.g. Mike McCandless?
General polygons are trivial to implement in this framework, although I haven't yet done so. Would you prefer that happen first, or finding a solution to the bounding box problem?
I've been trying to work the math; so far I haven't come up with a good general way of calculating bounds for planes intersecting the unit sphere. I will keep at it, but so far I do not have an acceptable solution. I'd be happy to attach a separate breadthfirst method patch; that's much easier actually – it's just code.
The other part missing aside from implementing Shape, to include getBoundingBox, is a SpatialContext subclass that implements the factory methods creating Geo3d variants. That is very easy. At that point; Geo3d becomes very usable with Lucene spatial, and it would be the first release of geo3d into Lucene. This is a first step; at some point work could be done to get shapes for which there are no factory methods (e.g. polygons), which would involve a little refactoring over in Spatial4j. Then we'd have WKT parsing hooked in for other shapes (polygons, etc.).
A breadthfirst feature would be great but I don't have the time right now. I will eventually, I want to do it, but not now. Computing the bounding box may be expensive from your point of view but it is at least a onetime event and could be cached. This is what CircleImpl does.
I don't know that its necessary to determine the expense of computing bounding boxes for irregular shapes (if for only satisfying curiosity). Its a known problem with doing all computation in equirectangular lat/lon (or any single projection, for that matter) a choice is being made to sacrifice accuracy at some level. Its why there exists so many different projections for various parts of the earths surface. Spherical (or ellipsoidal) mathematics using an approximation of the earth is just plain inaccurate since there is no perfect model of the earth's shape. To achieve the best accuracy for large irregular polygon's one will either reproject the data into a more accurate projection, slice the polygon into smaller (more accurate) polygons, introduce something like a bloom filter for catching inaccurate boundary conditions, or.... insert other options here.... All of which are going to be just as, or more, expensive.
I'm all for spending time working up the fastest, most accurate logic (its a responsibility for any mature mapping or GIS utility). But it needs to be well documented and at the choice of the end user. For 95% of private sector users, greatcircle is OK. For publicsector applications on the other hand.
As a side, David's using the bounding box to determine detailLevel for speeding up traversal of the RPT. As an FYI, distance might be a little misleading in that logic as its in units of decimal degrees.
The problem with most arbitrary shapes and "bounding boxes" is twofold:
(1) Bounding boxes, if you actually look at the earth, represent bounds on the top and bottom that are not great circles. So it is very possible to have a shape that is within a bounding box, according to that box's four corners, but which curves outside of the box on the top or bottom. Therefore, computing the actual bounds of a shape is fairly expensive, compared with everything else we do in geo3d. I'll try to work out the planar math to find out exactly how expensive.
(2) The stated reason to use the bounding boxes is to determine resolution. This is fine for regular shapes like circles and rectangles, but it is very wasteful for shapes that are long and thin or irregular.
For this reason, I'd much prefer that a different QuadTree getCells() variant be introduced, which (as I stated) has a cell count bound, and does a breadthfirst descent. If that's not possible, then one alternative is to create a new interface in the geo3d package, which computes the lat/lon bounds for a shape – but then maybe not all shapes would implement it. That would imply two distinct spatial4j implementations as well – one for the kind of shape that does bounding boxes, one for the kind that doesn't. Stay tuned.
Ok, the changes I made should make the geo3d library significantly more general. What I did was basically twofold: rearranged the interfaces to increase their overall ease of applicability, and second, added a missing capability: bounding boxes with longitude ranges greater than 180 degrees. In order to enforce proper usage, I also added argument range checks for all the major shape objects.
More cleanup for the generic case. Also, handle bounding boxes with greater than 180 degrees width properly.
Make the notion of membership be even more generic, and able to be used for relationship determination.
Updating interface structure to have a distinction between the moregeneric "Membership" and the more specific "Distance" capabilities.
I can tell that you've been focused on distance boosting/ranking applications – and in that context I see where you're coming from. But it shouldn't at all be a forlorn conclusion that the application is going to score/rank the results by distance. The query might be for analytics to compare a count with multiple other filters (e.g. time) or it might be a spatialrich data set like "tracks" generating from GPS and other sensors matching thousands of points and once you get into the thousands, I think let alone millions, there would be benefit to avoiding thousands of DocValiues lookups (randomIO) versus reading postings for a couple terms or so known to be within the shape.
Any way; thanks again for releasing geo3d here.
But even if it were theoretically 0cost, what isn't 0cost is getting the points to test in the first place, wether it be from DocValues or turning the terms scanned inprogress into points.
Agreed, although one could argue that the solution to that problem is proper query design. Any query that results in the scoring of millions or hundreds of millions of items hasn't been well thought through. I don't think it makes sense for us to try to solve that problem in Lucene Spatial.
With geo3d, the cost of determining membership in the shape is so low that I think you do better by just doing it, rather than trying to avoid doing it. In a way that was the whole point behind the development of geo3d.
It's awesome the code your contributing does this so fast! But even if it were theoretically 0cost, what isn't 0cost is getting the points to test in the first place, wether it be from DocValues or turning the terms scanned inprogress into points. There's no telling how many points lie within the shape in the general case because the circle (or whatever shape) could be large and there could be a ton of data within it. Granted for your specific application which may have constraints on how big a shape is permitted to be and knowledge of your data density, it could very well be a nonissue for you.
As of last summer the spatial module has had SerializedDVStrategy to generalize the later check. It would further be awesome to have a derivative of RPT that is able to detect which cells are within the query geometry and so don't doublecheck those documents, and likewise for leaf cells containing the query geometry need not be doublechecked either. It's a wishlist feature
LUCENE5579. Ideally the leaf cells would be differentiated as edgeapproximated or within but it's not essential.
With geo3d, the cost of determining membership in the shape is so low that I think you do better by just doing it, rather than trying to avoid doing it. In a way that was the whole point behind the development of geo3d.
For example, have a look at the GeoCircle code. To determine whether a point is within the circle, the evaluation that is needed is only the following:
@Override public boolean isWithin(GeoPoint point) { // Fastest way of determining membership return circlePlane.onCorrectSide(point); }
... which involves three multiplications, three additions, and a comparison. That's cheap by any standard. The most expensive thing here is constructing the GeoPoint from (x,y,z) coordinates, because it does an allocation. Further optimization of the library can avoid even that by having an alternate form of the method: isWithin(double x, double y, double z) .
Right now the traversal done in TreeCellIterator (used with CellTokenStream, via PrefixTreeStrategy) is depthfirst, so it doesn't know how many cells there are going to be as it walks (it's given a target/max depth level apriori). If it was breadthfirst, then it could instead be given a minimum number of cells and then it'll go to the level just beyond if it reaches this threshold midway through tokenizing a level.
That sounds correct to me. All you would need to do is to introduce a new method invocation that accepted a max depth and a min number of tokens, and hook it up to a different iterator that did what was needed.
If the bounding box is used for determining detail level, what I've done in the past is to set bounds for the number of geohash tokens, and descend until either you hit the maximum depth, or you hit the minimum number of desired tokens.
That makes sense. It hasn't been an issue thus far because all of the existing shapes have a bounding box. Right now the traversal done in TreeCellIterator (used with CellTokenStream, via PrefixTreeStrategy) is depthfirst, so it doesn't know how many cells there are going to be as it walks (it's given a target/max depth level apriori). If it was breadthfirst, then it could instead be given a minimum number of cells and then it'll go to the level just beyond if it reaches this threshold midway through tokenizing a level. The rationale behind it being depthfirst is that I intend to reuse this on the TreeCellIterator on the search side in a refactor of AbstractVisitingPrefixTreeFilter (LUCENE5745) which underpins Intersects, Within, and now recently for heatmap & time faceting, which needs to consume the tokens in sorted order to leapfrog with TermsEnum, and so it must be depthfirst.
The concept of using the cells for fast and approximate filtering and then lookup the vector geometry in, say, DocValues totally makes sense. As of last summer the spatial module has had SerializedDVStrategy to generalize the later check. It would further be awesome to have a derivative of RPT that is able to detect which cells are within the query geometry and so don't doublecheck those documents, and likewise for leaf cells containing the query geometry need not be doublechecked either. It's a wishlist feature LUCENE5579. Ideally the leaf cells would be differentiated as edgeapproximated or within but it's not essential. The optimization I'm talking about here wouldn't be appropriate when you want incorporate the distance from a midpoint of a pointradius shape in the ranking since you might as well filter at the collector as you describe.
We've got some WIP for indexing both 3&4D spacetime using a hilbertcurve
Sweet! 2015 is shaping up to be an awesome year for Lucene spatial. I wish I had the ability to contribute at the levels I was supported at a couple years ago.
If the bounding box is used for determining detail level, what I've done in the past is to set bounds for the number of geohash tokens, and descend until either you hit the maximum depth, or you hit the minimum number of desired tokens. That's of course not terribly refined, but yields a superset of the documents actually being described. By filtering at scoring time using the BoostSource technique (namely creating a FunctionQuery with a BoostSource that returns score values that will be rejected by the collector when a document is outside the shape), it's quite workable.
I'm interested in checking this out as well. We've got some WIP for indexing both 3&4D spacetime using a hilbertcurve I'd be interested in comparing notes between the two.
As a side, David's using the bounding box to determine detailLevel for speeding up traversal of the RPT. As an FYI, distance might be a little misleading in that logic as its in units of decimal degrees.
Hi David,
I have an ICLA on file, yes.
I'll look into implementing boundingbox. It makes more sense for some shapes than for others. For some shapes it is darned near useless. It did not seem to get used at all in geohash generation.
I can contribute unit tests as well. Just let me know.
Cool; I'll have to dig into this in the upcoming week. If I'm not mistaken, you are already an ASF committer for ManifoldCF and thus you already have a contributor license agreement onfile?
I took a quick peek at ShapeImpl.java. What would it take to implement the boundingbox? It's not strictly necessary but it helps performance a lot.
Oops, needed apache license...
This is the spatial4j integration.
Here's the library itself. Integration with spatial4j will be attached shortly.
Bulk close for 5.2.0.