Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: parser
    • Labels:
      None

      Description

      I'm working on the automatic description of geospatial resources, and I think it might be interesting to add one or more new parsers to Tika to handle and describe some geo-formats. These geo-formats include files, services and databases.

      If anyone is interested in this issue or wants to collaborate, do not hesitate to contact me. Any help is welcome.

      1. getFDOMetadata.xml
        8 kB
        Arturo Beltran

        Issue Links

          Activity

          Arturo Beltran created issue -
          Chris A. Mattmann added a comment -

          Hi Arturo,

          Thanks for reporting this issue and it sounds awesome! I'm definitely interested in this topic and will be sure to help however I can.

          Cheers,
          Chris

          Mayank Singh added a comment - edited

          Hi Arturo
          I would like to collaborate on this issue. I have also sent you an e-mail regarding the same.
          Thanks and regards
          Mayank

          Arturo Beltran added a comment -

          Hi all,

          I am pleased by the interest the community has shown in my proposal. As I said, any help is welcome.
          I have sent Mayank all the details about my work on this issue. If anyone else is interested in collaborating, or simply in sharing ideas or comments, do not hesitate to contact me.

          Cheers,
          Arturo

          Chris A. Mattmann added a comment -

          Hi Guys,

          Thanks for the effort here. Please try hard to keep the discussions on list as the community will benefit from them and can help provide feedback incrementally.

          Thanks,
          Chris

          Arturo Beltran added a comment -

          You are right, Chris. From now on, I will try to keep the discussions on the list or here.

          I will briefly explain what exactly I'm working on so that you can get involved.
          The first piece is what allows us to access resources: we need a platform that gives the most homogeneous access possible to heterogeneous resources. The best approach I've found so far is FDO (http://fdo.osgeo.org/). In short, FDO is an API for manipulating, defining and analyzing geospatial information regardless of where it is stored.

          So it looks simple: I only have to integrate FDO as a Tika parser and I'm done. The problem appeared when I tried to connect this C++ API with Java. I have tried SWIG and JNI directly, but I have not gotten either to work.
          Finally, as a temporary proof of concept, I implemented a simple HTTP server in .NET that serves resource descriptions using FDO. Now I'm trying to create a dummy parser for Tika that makes calls to that server.

          I hope I have explained this well enough; if not, feel free to ask.

          Greetings and thanks for your interest,
          Arturo

          Mayank Singh added a comment -

          Arturo, I am not very comfortable with C++ and have no knowledge of the .NET platform (I'm a Java guy), so my help will be very limited if you plan on using FDO. However, I looked around for alternatives and found OGDI (http://ogdi.sourceforge.net/), which can act as a middle layer between various data sources and has almost the same capabilities as FDO for disseminating data over the network (more info here: http://www.gisdevelopment.net/technology/gis/techgi0057b.htm).
          So what I am suggesting is that we look into it; once we get the heterogeneous data into the uniform data structure that OGDI supports, we can use Java to integrate it with Tika.
          I'll keep searching for more info. Do tell me your views on this.
          Regards
          Mayank

          Arturo Beltran added a comment -

          Example XML containing the information extracted from a SHP file by my FDO-based proof-of-concept server

          Arturo Beltran made changes -
          Field Original Value New Value
          Attachment getFDOMetadata.xml [ 12447682 ]
          Arturo Beltran added a comment -

          I'm not convinced about using OGDI. From what I understand from reading the documentation, OGDI offers a C API, so we would hit the same problem integrating it with Java. In addition, the project has not been updated since 2008, so newer geographic formats (e.g., KML) are not supported. Also, I think OGDI does not support databases or services.

          However, you could do a proof of concept to see how difficult it would be to integrate with Java and exactly what metadata can be extracted using OGDI. Then we can compare these results with mine and decide.

          As you can see, I've attached a sample XML file (getFDOMetadata.xml) that contains the information extracted from a SHP file by my FDO-based proof-of-concept server. It is the result of a simple HTTP call (http://localhost:12345/getFDOMetadata?source=C:\ExampleData\shp_world_countries\country.shp&provider=SHP).
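
The request above carries a Windows path, with a colon and backslashes, directly in the query string. A Java client such as the dummy Tika parser would normally percent-encode those values. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the proof-of-concept server decodes percent-encoded query parameters, which this thread does not confirm.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of Tika or FDO): builds the request URL
// for the proof-of-concept server described in this comment.
final class FdoMetadataRequest {
    private FdoMetadataRequest() {}

    // baseUrl is e.g. "http://localhost:12345"; source and provider are the
    // two query parameters shown in the example call.
    static String buildUrl(String baseUrl, String source, String provider) {
        return baseUrl + "/getFDOMetadata?source=" + encode(source)
                + "&provider=" + encode(provider);
    }

    // Percent-encodes one query value as UTF-8 (form encoding).
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, StandardCharsets.UTF_8.name());
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:12345",
                "C:\\ExampleData\\shp_world_countries\\country.shp", "SHP"));
    }
}
```

The encoded URL could then be opened with any HTTP client, and the returned XML handed to the parser.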

          For now, I'll keep trying to get my "Hello world" Tika parser running.

          Regards,
          Arturo

          Nick Burch added a comment -

          I was wondering about extracting geo data from JPEG EXIF tags. For this, we'd probably want dedicated metadata properties for lat and long.

          (Other files can have a single lat+long in them too, e.g. HTML pages with the ICBM meta tags.)

          Not sure how well that might integrate with this work though, since shapefiles will typically contain a large number of lats+longs (or similar geographic points).

          Anyone have any ideas about a single created-at position vs. a stream of locations from geo formats?

          Chris A. Mattmann added a comment -

          Hey Nick,

          I think we need to support both cases (single lat/lon per document as well as many lat/lon pairs per document). In the case of the former, it's easy, we have:

          key: Metadata.LATITUDE
          val: some lat

          key: Metadata.LONGITUDE
          val: some lon

          And, in the case of the latter, we have:

          key: Metadata.LATITUDE
          val: some lat, some lat2, some lat3, some lat n...

          key: Metadata.LONGITUDE
          val: some lon, some lon2, some lon3, some lon n...

          Because the values for each key are kept in order in the Metadata object, I think we can make sure they match up and treat single points the same as multiple points. It's also great to support both on a per-Metadata-object basis, since many scientific data formats contain both scenarios (e.g., NetCDF and HDF typically have arrays of lats and lons and, sometimes, single point values as well).
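
A minimal sketch of the matching-by-position idea, under stated assumptions: MultiValueMetadata is a stand-in written for this example, not the real Tika Metadata class, and the key strings are placeholders. It shows that a single point and many points can be read back through the same code path as long as both keys hold the same number of values.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Stand-in for a multi-valued metadata object (assumption: NOT Tika's class).
final class MultiValueMetadata {
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    // Appends a value, so insertion order is preserved per key.
    void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    List<String> getValues(String key) {
        List<String> found = values.get(key);
        return found == null ? Collections.<String>emptyList() : found;
    }
}

final class LatLonPairs {
    static final String LATITUDE = "latitude";   // placeholder key name
    static final String LONGITUDE = "longitude"; // placeholder key name

    private LatLonPairs() {}

    // Pairs the i-th latitude with the i-th longitude; each result is {lat, lon}.
    static List<double[]> read(MultiValueMetadata md) {
        List<String> lats = md.getValues(LATITUDE);
        List<String> lons = md.getValues(LONGITUDE);
        if (lats.size() != lons.size()) {
            throw new IllegalStateException("latitude/longitude counts differ");
        }
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < lats.size(); i++) {
            points.add(new double[] {
                Double.parseDouble(lats.get(i)), Double.parseDouble(lons.get(i)) });
        }
        return points;
    }

    public static void main(String[] args) {
        MultiValueMetadata md = new MultiValueMetadata();
        md.add(LATITUDE, "10.5");
        md.add(LONGITUDE, "20.25");
        md.add(LATITUDE, "-3.0");
        md.add(LONGITUDE, "40.0");
        for (double[] p : read(md)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```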

          The reason we need to support both is that distance computations (point/radius, bounding box, and polygon) require both scenarios. I've been thinking that, once this work is prototyped, we could integrate Tika with the work in SIS to build out a computational spatial library. I think Tika could be used to feed lats/lons into SIS.
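
To make two of the computations named above concrete, here is a rough sketch of a point/radius test and a bounding box over parallel latitude/longitude arrays. It is not code from Tika or SIS. The haversine formula and the mean Earth radius are choices made for this example, the bounding box does not handle data that crosses the antimeridian, and the polygon case is left out.

```java
// Hypothetical helper for illustration only; not part of Tika or SIS.
final class GeoMath {
    // Mean Earth radius in kilometres (spherical approximation).
    static final double EARTH_RADIUS_KM = 6371.0088;

    private GeoMath() {}

    // Great-circle distance between two points in degrees, via haversine.
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    // Point/radius test: is (lat, lon) within radiusKm of the centre?
    static boolean withinRadius(double centerLat, double centerLon,
                                double lat, double lon, double radiusKm) {
        return haversineKm(centerLat, centerLon, lat, lon) <= radiusKm;
    }

    // Returns {minLat, minLon, maxLat, maxLon} for parallel, non-empty arrays.
    static double[] boundingBox(double[] lats, double[] lons) {
        if (lats.length == 0 || lats.length != lons.length) {
            throw new IllegalArgumentException("need equal, non-empty arrays");
        }
        double minLat = lats[0], maxLat = lats[0];
        double minLon = lons[0], maxLon = lons[0];
        for (int i = 1; i < lats.length; i++) {
            minLat = Math.min(minLat, lats[i]);
            maxLat = Math.max(maxLat, lats[i]);
            minLon = Math.min(minLon, lons[i]);
            maxLon = Math.max(maxLon, lons[i]);
        }
        return new double[] { minLat, minLon, maxLat, maxLon };
    }

    public static void main(String[] args) {
        double[] box = boundingBox(new double[] { 10, -5, 20 },
                                   new double[] { 30, 40, -10 });
        System.out.println(box[0] + " " + box[1] + " " + box[2] + " " + box[3]);
        System.out.println(haversineKm(0, 0, 0, 180));
    }
}
```

A document with a single point is simply the length-one case of the same arrays.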

          Cheers,
          Chris

          Nick Burch added a comment -

          I was thinking that making sure you put in the right matching pairs, and remove them again, is a little fiddly, but that's nothing a little wrapper library wouldn't fix. With that in mind, I think your proposed solution is likely to be much better than changing Tika to support composite values, with the problems that would bring.

          Any objections to creating a new Metadata keyspace called Geographic, starting with LATITUDE = geo:latitude and LONGITUDE = geo:longitude? I can think of a few others we might want in future (height, bearing, etc.), which makes me think it deserves its own space.
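
A rough sketch of what the keyspace and the "little wrapper" could look like. The two key strings are the ones proposed in this comment. Everything else (the GeographicKeys and GeoPointWrapper names, the range checks, storing values as strings) is an assumption made for illustration and is not the patch later attached to TIKA-445.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical keyspace holding the two keys proposed in this thread.
interface GeographicKeys {
    String LATITUDE = "geo:latitude";
    String LONGITUDE = "geo:longitude";
}

// Hypothetical wrapper: points are always added and removed as a pair,
// so the two value lists cannot drift out of step.
final class GeoPointWrapper {
    private final List<String> latitudes = new ArrayList<>();
    private final List<String> longitudes = new ArrayList<>();

    void addPoint(double lat, double lon) {
        if (lat < -90 || lat > 90 || lon < -180 || lon > 180) {
            throw new IllegalArgumentException("coordinate out of range");
        }
        latitudes.add(Double.toString(lat));
        longitudes.add(Double.toString(lon));
    }

    // Removes the point at the given position from both lists.
    void removePoint(int index) {
        latitudes.remove(index);   // throws before anything changes if index is invalid
        longitudes.remove(index);
    }

    int size() {
        return latitudes.size();
    }

    // Read-only view of the ordered values stored under one of the two keys.
    List<String> valuesFor(String key) {
        if (GeographicKeys.LATITUDE.equals(key)) {
            return Collections.unmodifiableList(latitudes);
        }
        if (GeographicKeys.LONGITUDE.equals(key)) {
            return Collections.unmodifiableList(longitudes);
        }
        throw new IllegalArgumentException("unknown key: " + key);
    }

    public static void main(String[] args) {
        GeoPointWrapper points = new GeoPointWrapper();
        points.addPoint(48.85, 2.35);
        points.addPoint(-33.5, 151.25);
        points.removePoint(0);
        System.out.println(points.valuesFor(GeographicKeys.LATITUDE));
        System.out.println(points.valuesFor(GeographicKeys.LONGITUDE));
    }
}
```

Keeping the pairing logic in a wrapper leaves the underlying metadata model unchanged, which is the trade-off discussed above against composite values.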

          Chris A. Mattmann added a comment -

          Hey Nick,

          Yep +1 on having the new namespace called "Geographic" with the given 2 fields as a starting point. We should probably track it and commit in a new issue.

          Thanks for your thoughts on this!

          Cheers,
          Chris

          Nick Burch added a comment -

          I've opened TIKA-445 and uploaded a first stab at a patch to implement it. Feedback appreciated!

          Arturo Beltran added a comment -

          As I commented on TIKA-445, I came back from a few days off to a pleasant surprise. Good job.

          Greetings and thanks for your work

          Chris A. Mattmann made changes -
          Link This issue relates to TIKA-605 [ TIKA-605 ]

            People

            • Assignee:
              Unassigned
              Reporter:
              Arturo Beltran
            • Votes:
              0
              Watchers:
              1

              Dates

              • Created:
                Updated:
