Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9
    • Component/s: parser
    • Labels:

      Description

      I'm working on the automatic description of geospatial resources, and I think it would be interesting to add one or more new parsers to Tika to handle and describe some geo-formats. These geo-formats include files, services and databases.

      If anyone is interested in this issue or wants to collaborate, do not hesitate to contact me. Any help is welcome.

      1. getFDOMetadata.xml
        8 kB
        Arturo Beltran

        Issue Links

          Activity

          Transition                Time In Source Status   Execution Times   Last Executer       Last Execution Date
          Open -> In Progress       1589d 20h 8m            1                 Chris A. Mattmann   24/Oct/14 06:34
          In Progress -> Resolved   189d 2h 26m             1                 Chris A. Mattmann   01/May/15 09:01
          Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #657 (See https://builds.apache.org/job/tika-trunk-jdk1.7/657/)
          Fix for TIKA-443 Geographic Information Parser contributed by unknown <gautham.g44@gmail.com> this closes #47. (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1677100)

          • /tika/trunk/CHANGES.txt
          • /tika/trunk/tika-bundle/pom.xml
          • /tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          • /tika/trunk/tika-parsers/pom.xml
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/geoinfo
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/geoinfo/GeographicInformationParserTest.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          Gautham Gowrishankar added a comment -

          You're welcome, Professor Mattmann!

          Chris A. Mattmann made changes -
          Labels: new-parser -> memex new-parser
          Chris A. Mattmann made changes -
          Status: In Progress [ 3 ] -> Resolved [ 5 ]
          Fix Version/s: 1.9 [ 12329574 ]
          Resolution: Fixed [ 1 ]
          Chris A. Mattmann added a comment -

          Committed and closed! Thanks Gautham Gowrishankar!

          [mattmann-0420740:~/tmp/tika1.9] mattmann% svn commit -m "Fix for TIKA-443 Geographic Information Parser contributed by unknown <gautham.g44@gmail.com> this closes #47."
          Sending        CHANGES.txt
          Sending        tika-bundle/pom.xml
          Sending        tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml
          Sending        tika-parsers/pom.xml
          Adding         tika-parsers/src/main/java/org/apache/tika/parser/geoinfo
          Adding         tika-parsers/src/main/java/org/apache/tika/parser/geoinfo/GeographicInformationParser.java
          Sending        tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          Adding         tika-parsers/src/test/java/org/apache/tika/parser/geoinfo
          Adding         tika-parsers/src/test/java/org/apache/tika/parser/geoinfo/GeographicInformationParserTest.java
          Adding         tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          Transmitting file data ........
          Committed revision 1677100.
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          
          ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/tika/pull/47

          Chris A. Mattmann added a comment -

          I fixed it by moving the extractContent call so that it runs after the metadata extraction.

          [mattmann-0420740:~/tmp/tika1.9] mattmann% java -jar tika-app/target/tika-app-1.9-SNAPSHOT.jar -m tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          May 01, 2015 12:57:16 AM org.apache.sis.internal.jaxb.gml.TM_Primitive setTimePeriod
          WARNING: This operation requires the “sis-temporal” module.
          AccessContraints : OTHER_RESTRICTIONS
          CharacterSet: UTF-8
          CitationDate : CREATION-->Mon Dec 16 00:00:00 PST 2013
          CitationDate : modified-->Wed Mar 11 00:00:00 PDT 2015
          CitedResponsiblePartyEMail : hollistr@gvsu.edu
          CitedResponsiblePartyName : Robert Hollister
          CitedResponsiblePartyName : Robert Hollister
          CitedResponsiblePartyRole : Role[POINT_OF_CONTACT]
          CitedResponsiblePartyRole : Role[AUTHOR]
          ContactPartyName-: UCAR/NCAR - CISL - ACADIS
          ContactRole: RESOURCE_PROVIDER
          Content-Length: 19370
          Content-Type: text/iso19139+xml
          DateInfo : CREATION Mon Dec 16 05:26:08 PST 2013
          DistributionFormatSpecificationAlternativeTitle : Other ASCII
          Distributor Contact : RESOURCE_PROVIDER
          Distributor Organization Name : UCAR/NCAR - CISL - ACADIS
          GeographicIdentifierAuthorityAlternativeTitle : Locations
          GeographicIdentifierAuthorityDate : REVISION Thu Aug 28 00:00:00 PDT 2014
          GeographicIdentifierAuthorityTitle : NASA/GCMD Earth Science Keywords
          GeographicIdentifierCode : UNITED STATES OF AMERICA > ALASKA
          IdentificationInfoAbstract : These files contain data representing the periodic plant measures of species within each plot in a text tab delimited format. The data presented are seasonal growth of graminoids (length of leaf and length of inflorescence) and seasonal flowering of all species (number of inflorescences in flower within a plot), collected weekly during the summers of 2012-20XX for a subset of 30 grid plots at two sites (Barrow ARCSS grid and Atqasuk ARCSS grid).
          IdentificationInfoCitationTitle : Barrow Atqasuk ARCSS Plant
          IdentificationInfoLanguage-->: English
          IdentificationInfoStatus : ON_GOING
          IdentificationInfoTopicCategory-->: BIOTA
          Keywords 2: EARTH SCIENCE > BIOSPHERE > TERRESTRIAL ECOSYSTEMS > ALPINE/TUNDRA
          Keywords 3: FIELD SURVEY
          Keywords 4: POINT
          Keywords 5: LESS THAN 1 METER
          Keywords 6: DAILY TO WEEKLY
          KeywordsType 2: THEME
          KeywordsType 3: THEME
          KeywordsType 4: THEME
          KeywordsType 5: THEME
          KeywordsType 6: THEME
          MetaDataIdentifierCode: urn:x-wmo:md:org.aoncadis.www::4c1a919d-6690-11e3-9147-00c0f03d5b7c
          MetaDataResourceScope : DATASET
          MetaDataStandardEdition : ISO 19115:2003(E)
          MetaDataStandardTitle : ISO 19115 Geographic information - Metadata
          OtherConstraints : Access Constraints: No Access Constraints. Use Constraints: No Use Constraints.
          ParentMetaDataTitle: urn:x-wmo:md:org.aoncadis.www::d2e4e808-6830-11df-abb3-00c0f03d5b7c
          ResourceFormatSpecificationAlternativeTitle : Other ASCII
          ThesaurusNameAlternativeTitle 2: [Science and Services Keywords]
          ThesaurusNameAlternativeTitle 3: [Platforms]
          ThesaurusNameAlternativeTitle 4: [Spatial Data Type]
          ThesaurusNameAlternativeTitle 5: [Horizontal Data Resolution]
          ThesaurusNameAlternativeTitle 6: [Temporal Data Resolution]
          ThesaurusNameDate : REVISION-->Wed May 21 00:00:00 PDT 2014
          ThesaurusNameDate : REVISION-->Tue Oct 07 00:00:00 PDT 2014
          ThesaurusNameDate : REVISION-->Tue Oct 07 00:00:00 PDT 2014
          ThesaurusNameDate : REVISION-->Wed May 21 00:00:00 PDT 2014
          ThesaurusNameDate : REVISION-->Wed May 21 00:00:00 PDT 2014
          ThesaurusNameTitle 2: NASA/GCMD Earth Science Keywords
          ThesaurusNameTitle 3: ACADIS Keywords
          ThesaurusNameTitle 4: ACADIS Keywords
          ThesaurusNameTitle 5: NASA/GCMD Earth Science Keywords
          ThesaurusNameTitle 6: NASA/GCMD Earth Science Keywords
          TransferOptionsOnlineDescription : Metadata Link
          TransferOptionsOnlineFunction : DOWNLOAD
          TransferOptionsOnlineLinkage : https://www.aoncadis.org/dataset/id/4c1a919d-6690-11e3-9147-00c0f03d5b7c.html
          TransferOptionsOnlineName : Barrow Atqasuk ARCSS Plant
          TransferOptionsOnlineProfile : browser
          TransferOptionsOnlineProtocol : https
          UserConstraints : OTHER_RESTRICTIONS
          X-Parsed-By: org.apache.tika.parser.DefaultParser
          X-Parsed-By: org.apache.tika.parser.geoinfo.GeographicInformationParser
          resourceName: sampleFile.iso19139
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          

          Works great!
          Committing.
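The thread does not spell out why the order matters. One plausible reading, and it is only an assumption, is that the output handler prints the metadata when the document ends, so anything added to the metadata afterwards never shows up. The self-contained sketch below models that with hypothetical stand-ins (`PrintingHandler`, `extractContent`, `extractMetadata`); none of them are Tika's real classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ParseOrderSketch {

    /** Hypothetical handler that snapshots the metadata when the document ends. */
    static final class PrintingHandler {
        private Map<String, String> printed = new LinkedHashMap<>();

        void endDocument(Map<String, String> metadata) {
            // Copy whatever is in the metadata at this moment.
            printed = new LinkedHashMap<>(metadata);
        }

        Map<String, String> printed() {
            return printed;
        }
    }

    // Stand-in for content extraction: finishing the document triggers the snapshot.
    static void extractContent(Map<String, String> metadata, PrintingHandler handler) {
        handler.endDocument(metadata);
    }

    // Stand-in for metadata extraction; the key and value come from the output above.
    static void extractMetadata(Map<String, String> metadata) {
        metadata.put("IdentificationInfoStatus", "ON_GOING");
    }

    /** Runs both steps in the requested order and returns what the handler printed. */
    static Map<String, String> parse(boolean metadataFirst) {
        Map<String, String> metadata = new LinkedHashMap<>();
        PrintingHandler handler = new PrintingHandler();
        if (metadataFirst) {
            extractMetadata(metadata);
            extractContent(metadata, handler);
        } else {
            extractContent(metadata, handler);
            extractMetadata(metadata);
        }
        return handler.printed();
    }

    public static void main(String[] args) {
        System.out.println("content first:  " + parse(false));
        System.out.println("metadata first: " + parse(true));
    }
}
```

With content extraction first the snapshot is empty; with metadata extraction first it contains the entry, which matches the before and after behaviour reported in this thread.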

          Chris A. Mattmann added a comment -

          OK after combining with my patch, we have success!

          [INFO] Skipping execution for packaging "pom"
          [INFO] 
          [INFO] --- forbiddenapis:1.7:testCheck (default) @ tika ---
          [INFO] Skipping execution for packaging "pom"
          [INFO] 
          [INFO] --- maven-site-plugin:3.0:attach-descriptor (attach-descriptor) @ tika ---
          [INFO] 
          [INFO] --- maven-install-plugin:2.3.1:install (default-install) @ tika ---
          [INFO] Installing /Users/mattmann/tmp/tika1.9/pom.xml to /Users/mattmann/.m2/repository/org/apache/tika/tika/1.9-SNAPSHOT/tika-1.9-SNAPSHOT.pom
          [INFO] ------------------------------------------------------------------------
          [INFO] Reactor Summary:
          [INFO] 
          [INFO] Apache Tika parent ................................. SUCCESS [  1.500 s]
          [INFO] Apache Tika core ................................... SUCCESS [ 19.538 s]
          [INFO] Apache Tika parsers ................................ SUCCESS [02:17 min]
          [INFO] Apache Tika XMP .................................... SUCCESS [  2.995 s]
          [INFO] Apache Tika serialization .......................... SUCCESS [  2.224 s]
          [INFO] Apache Tika batch .................................. SUCCESS [01:58 min]
          [INFO] Apache Tika application ............................ SUCCESS [ 40.534 s]
          [INFO] Apache Tika OSGi bundle ............................ SUCCESS [ 22.864 s]
          [INFO] Apache Tika server ................................. SUCCESS [ 21.619 s]
          [INFO] Apache Tika translate .............................. SUCCESS [  3.870 s]
          [INFO] Apache Tika examples ............................... SUCCESS [  5.872 s]
          [INFO] Apache Tika Java-7 Components ...................... SUCCESS [  2.427 s]
          [INFO] Apache Tika ........................................ SUCCESS [  0.037 s]
          [INFO] ------------------------------------------------------------------------
          [INFO] BUILD SUCCESS
          [INFO] ------------------------------------------------------------------------
          [INFO] Total time: 06:20 min
          [INFO] Finished at: 2015-05-01T00:39:38-07:00
          [INFO] Final Memory: 109M/1658M
          [INFO] ------------------------------------------------------------------------
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          

          Ran a simple test too:

          Detect

          [mattmann-0420740:~/tmp/tika1.9] mattmann% java -jar tika-app/target/tika-app-1.9-SNAPSHOT.jar -d tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          text/iso19139+xml
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          

          Parse Text

          [mattmann-0420740:~/tmp/tika1.9] mattmann% java -jar tika-app/target/tika-app-1.9-SNAPSHOT.jar -t tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          May 01, 2015 12:45:29 AM org.apache.sis.internal.jaxb.gml.TM_Primitive setTimePeriod
          WARNING: This operation requires the “sis-temporal” module.
          Barrow Atqasuk ARCSS Plant
          
          
          CitedResponsiblePartyRole Role[POINT_OF_CONTACT]CitedResponsiblePartyName Robert Hollister
          
          
          CitedResponsiblePartyRole Role[AUTHOR]CitedResponsiblePartyName Robert Hollister
          
          
          IdentificationInfoAbstract These files contain data representing the periodic plant measures of species within each plot in a text tab delimited format. The data presented are seasonal growth of graminoids (length of leaf and length of inflorescence) and seasonal flowering of all species (number of inflorescences in flower within a plot), collected weekly during the summers of 2012-20XX for a subset of 30 grid plots at two sites (Barrow ARCSS grid and Atqasuk ARCSS grid).
          
          	GeographicElementWestBoundLatitude	-157.24
          	GeographicElementEastBoundLatitude	-156.4
          	GeographicElementNorthBoundLatitude	71.18
          	GeographicElementSouthBoundLatitude	70.27
          
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          

          Parse Metadata

          [mattmann-0420740:~/tmp/tika1.9] mattmann% java -jar tika-app/target/tika-app-1.9-SNAPSHOT.jar -m tika-parsers/src/test/resources/test-documents/sampleFile.iso19139
          May 01, 2015 12:46:25 AM org.apache.sis.internal.jaxb.gml.TM_Primitive setTimePeriod
          WARNING: This operation requires the “sis-temporal” module.
          Content-Length: 19370
          Content-Type: text/iso19139+xml
          X-Parsed-By: org.apache.tika.parser.DefaultParser
          X-Parsed-By: org.apache.tika.parser.geoinfo.GeographicInformationParser
          resourceName: sampleFile.iso19139
          [mattmann-0420740:~/tmp/tika1.9] mattmann% 
          

          Something is odd here: the metadata is not getting added. Going to commit and investigate.
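A quick way to spot this kind of problem is to read the `-m` output into a multimap and check whether any name beyond the generic ones is present. The sketch below is a minimal illustration that assumes each metadata line has the form `name: value`; the class and method names are hypothetical and it does not use the Tika API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MetadataOutputSketch {

    /** Names that appear even in the run where the geo metadata was missing. */
    static final Set<String> GENERIC = new HashSet<>(Arrays.asList(
            "Content-Length", "Content-Type", "X-Parsed-By", "resourceName"));

    /** Reads "name: value" lines into a multimap; repeated names keep every value. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String line : lines) {
            int sep = line.indexOf(": ");
            if (sep < 0) {
                continue; // not a "name: value" line, e.g. a shell prompt
            }
            String name = line.substring(0, sep).trim();
            String value = line.substring(sep + 2).trim();
            result.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return result;
    }

    /** True when at least one name is not in the generic set. */
    static boolean hasParserSpecificKeys(Map<String, List<String>> metadata) {
        for (String name : metadata.keySet()) {
            if (!GENERIC.contains(name)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Lines copied from the output above, where only generic names were printed.
        List<String> lines = Arrays.asList(
                "Content-Length: 19370",
                "Content-Type: text/iso19139+xml",
                "X-Parsed-By: org.apache.tika.parser.DefaultParser",
                "X-Parsed-By: org.apache.tika.parser.geoinfo.GeographicInformationParser",
                "resourceName: sampleFile.iso19139");
        Map<String, List<String>> metadata = parse(lines);
        System.out.println(metadata.get("X-Parsed-By"));
        System.out.println("parser-specific keys present: " + hasParserSpecificKeys(metadata));
    }
}
```

Log lines such as the `WARNING:` message would also match the `name: value` shape, so the sketch assumes only the metadata lines are passed in.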

          Chris A. Mattmann added a comment -

          OK I also had to grab: https://github.com/gautham4/GeographicDR/commit/e04a7824ab3d9fb8517479007b545d7e8fcee704.patch since the tika-bundle stuff I helped you with wasn't part of your PR. Re-testing.

          Chris A. Mattmann added a comment -

          OK, thanks Gautham Gowrishankar. Going to test this out now.

          ASF GitHub Bot added a comment -

          GitHub user gautham4 opened a pull request:

          https://github.com/apache/tika/pull/47

          PULL REQUEST for TIKA-443

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/gautham4/tika TIKA-443

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/tika/pull/47.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #47


          commit 6bfdbcd869455bbae7a4547b738e5a0b249053e8
          Author: unknown <gautham.g44@gmail.com>
          Date: 2015-04-22T04:35:03Z

          fix for TIKA-443 contributed by gautham4

          commit 66ba03ee85946d7babf9815b9734f0ee83b4767f
          Author: unknown <gautham.g44@gmail.com>
          Date: 2015-05-01T05:34:38Z

          fix for TIKA-443 contributed by gautham.g44@gmail.com


          Tyler Palsulich made changes -
          Labels new-parser
          Martin Desruisseaux added a comment -

          For Tika to ISO 19115, I see these choices:

          • Some core Tika classes could implement some org.opengis.metadata interfaces. For example, if there is a Tika class somewhere that holds the (latitude, longitude) coordinates of a rectangle, that class could implement the GeographicBoundingBox interface. All the org.opengis.metadata interfaces follow the ISO 19115 model, so this is not a purely arbitrary API.
          • Alternatively, if Tika prefers not to modify its core classes, the data could be copied from the Tika class to a separate GeographicBoundingBox implementation just before marshalling. That separate implementation could be the SIS one, or another one if the Tika group prefers. However, using the SIS one would avoid another copy, since SIS will need to copy the data into its own implementation before marshalling anyway (because of the way JAXB works).

          Once Tika has identified the information of interest (GeographicBoundingBox, maybe DataIdentification, etc.), that data needs to be put together into an org.opengis.metadata.Metadata implementation, which is usually the root of the ISO 19115 hierarchy. Again, it can be either a core Tika class implementing Metadata or a separate implementation like the SIS one, at your choice.

          Once you have a Metadata instance, the easiest way to marshal it is with org.apache.sis.xml.XML. This convenience class provides several marshal methods, so you can pick the most convenient one. An easy one for testing purposes is:

          System.out.println(XML.marshal(metadata));
          

          For the reverse operation (ISO 19115 to Tika), the starting point could be:

          Metadata md = (Metadata) XML.unmarshal(inputStream);
          

          but the next issue is how to use that Metadata information. Again I see two choices:

          • Tika may copy the information into its own internal structure.
          • Or alternatively, some Tika API may be designed to accept Metadata, GeographicBoundingBox, etc. arguments. Again, these are GeoAPI interfaces, so not necessarily SIS implementations. If Tika implemented those interfaces as a result of the above discussion, the modified API would work with Tika classes.
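The two choices for producing a bounding box can be sketched without any library on the classpath. In the minimal example below, `BoundingBox` is a local stand-in whose getter names mirror GeoAPI's GeographicBoundingBox, while `SimpleBoundingBox`, `PlainRectangle`, `copyOf` and `describe` are hypothetical names introduced only for illustration; `describe` stands in for the marshalling step. The coordinates are the ones printed by the parser earlier in this thread.

```java
import java.util.Locale;

public class BoundingBoxSketch {

    /** Local stand-in that mirrors the getters of GeoAPI's GeographicBoundingBox. */
    interface BoundingBox {
        double getWestBoundLongitude();
        double getEastBoundLongitude();
        double getSouthBoundLatitude();
        double getNorthBoundLatitude();
    }

    /**
     * First choice: a class that implements the interface itself.
     * Under the second choice, the same class serves as the separate copy target.
     */
    static final class SimpleBoundingBox implements BoundingBox {
        private final double west, east, south, north;

        SimpleBoundingBox(double west, double east, double south, double north) {
            this.west = west;
            this.east = east;
            this.south = south;
            this.north = north;
        }

        @Override public double getWestBoundLongitude() { return west; }
        @Override public double getEastBoundLongitude() { return east; }
        @Override public double getSouthBoundLatitude() { return south; }
        @Override public double getNorthBoundLatitude() { return north; }
    }

    /** Second choice, input side: a hypothetical class left unmodified. */
    static final class PlainRectangle {
        final double minLon, maxLon, minLat, maxLat;

        PlainRectangle(double minLon, double maxLon, double minLat, double maxLat) {
            this.minLon = minLon;
            this.maxLon = maxLon;
            this.minLat = minLat;
            this.maxLat = maxLat;
        }
    }

    /** Second choice: copy the values just before the marshalling step. */
    static BoundingBox copyOf(PlainRectangle r) {
        return new SimpleBoundingBox(r.minLon, r.maxLon, r.minLat, r.maxLat);
    }

    /** Stand-in for marshalling: it only depends on the interface. */
    static String describe(BoundingBox box) {
        return String.format(Locale.ROOT, "west=%.2f east=%.2f south=%.2f north=%.2f",
                box.getWestBoundLongitude(), box.getEastBoundLongitude(),
                box.getSouthBoundLatitude(), box.getNorthBoundLatitude());
    }

    public static void main(String[] args) {
        BoundingBox direct = new SimpleBoundingBox(-157.24, -156.4, 70.27, 71.18);
        BoundingBox copied = copyOf(new PlainRectangle(-157.24, -156.4, 70.27, 71.18));
        System.out.println(describe(direct));
        System.out.println(describe(copied));
    }
}
```

Because `describe` accepts the interface, it behaves the same whichever choice produced the object, which is the property the discussion above relies on.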
          Chris A. Mattmann added a comment -

          Thanks Martin. I think a great use case here would be something like:

          tika < geofile (e.g., ISO-19115) > Tika XHTML
          tika -m < geofile (e.g., ISO-19115) > ISO-19115 metadata

          Any thoughts on easy ways of accomplishing the above?

          Martin Desruisseaux added a comment - - edited

          A note just in case: Tika does not need a strong dependency on SIS if you prefer to avoid it. The ISO 19115 metadata are defined by interfaces in a separate JAR file (geoapi-3.0.0.jar), which is in turn implemented by SIS. But the Tika project could decide to implement for itself the subset of those interfaces considered most pertinent to Tika's needs (e.g. GeographicBoundingBox, DataIdentification, etc.), which should allow Tika to switch between its own implementation and the SIS implementation transparently. For example, Tika could offer basic geographic information support as a standalone application, and delegate to SIS only for more advanced needs if the user wishes.

          I'm just mentioning that as one possible strategy.
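
A minimal Java sketch of that strategy, offered only as an illustration. To stay self-contained it declares a local stand-in interface, `BoundingBox`, whose four accessor names mirror those of GeoAPI's `GeographicBoundingBox`; in a real build the class would implement the GeoAPI interface directly. `SimpleBoundingBox` and `describe` are hypothetical names, not existing Tika or SIS classes.

```java
import java.util.Locale;

public class BoundingBoxSketch {

    /** Local stand-in for GeoAPI's GeographicBoundingBox (assumption: only the four bounds are needed). */
    public interface BoundingBox {
        double getWestBoundLongitude();
        double getEastBoundLongitude();
        double getSouthBoundLatitude();
        double getNorthBoundLatitude();
    }

    /** Hypothetical lightweight implementation that needs no SIS classes on the classpath. */
    public static final class SimpleBoundingBox implements BoundingBox {
        private final double west, east, south, north;

        public SimpleBoundingBox(double west, double east, double south, double north) {
            if (south > north) {
                throw new IllegalArgumentException("south bound must not exceed north bound");
            }
            // west > east is allowed: it can describe a box crossing the antimeridian.
            this.west = west;
            this.east = east;
            this.south = south;
            this.north = north;
        }

        @Override public double getWestBoundLongitude() { return west; }
        @Override public double getEastBoundLongitude() { return east; }
        @Override public double getSouthBoundLatitude() { return south; }
        @Override public double getNorthBoundLatitude() { return north; }
    }

    /** Consumer code depends only on the interface, so either implementation could be passed in. */
    public static String describe(BoundingBox box) {
        return String.format(Locale.ROOT, "lon [%.1f, %.1f], lat [%.1f, %.1f]",
                box.getWestBoundLongitude(), box.getEastBoundLongitude(),
                box.getSouthBoundLatitude(), box.getNorthBoundLatitude());
    }

    public static void main(String[] args) {
        System.out.println(describe(new SimpleBoundingBox(-10.0, 5.0, 35.0, 44.0)));
    }
}
```

Because `describe` only sees the interface, swapping the lightweight class for a fuller implementation would not require changes on the calling side.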

          Chris A. Mattmann added a comment -

          Guys, I wonder if we should (now 4 years later) standardize on Apache SIS (http://sis.apache.org/) and incorporate its support for parsing ISO 19115 metadata. It seems to have the same types of properties that FDO metadata XML has.

          I'm going to give a whirl at creating a GeoParser that extracts information from ISO 19115 XML files. Martin Desruisseaux FYI Adam Estrada FYI.

          Chris A. Mattmann made changes -
          Status Open [ 1 ] In Progress [ 3 ]
          Chris A. Mattmann made changes -
          Assignee Chris A. Mattmann [ chrismattmann ]
          Chris A. Mattmann made changes -
          Link This issue relates to TIKA-605 [ TIKA-605 ]
          Arturo Beltran added a comment -

          As I commented in the issue TIKA-445, after a few days off I found a pleasant surprise. Good job.

          Greetings and thanks for your work

          Nick Burch added a comment -

          I've opened TIKA-445 and uploaded a first stab at a patch to implement it. Feedback appreciated!

          Chris A. Mattmann added a comment -

          Hey Nick,

          Yep +1 on having the new namespace called "Geographic" with the given 2 fields as a starting point. We should probably track it and commit in a new issue.

          Thanks for your thoughts on this!

          Cheers,
          Chris

          Nick Burch added a comment -

          I was thinking that making sure you put in the right matching pairs, and remove them again, is a little fiddly, but that's nothing a little wrapper library wouldn't fix for you. With that in mind, I think your proposed solution is likely to be much better than changing Tika to support composite values, with the problems that would bring.

          Any objections to creating a new Metadata keyspace called Geographic, starting with LATITUDE = geo:latitude & LONGITUDE = geo:longitude? I can think of a few others we might want in future (height, bearing, etc.), which makes me think its own space might make sense.
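
As a rough illustration of the "little wrapper" idea, the Java sketch below always adds and removes latitude and longitude together so the two value lists stay aligned. The key strings are the ones proposed in this comment; the class name `GeographicSketch` and the plain `Map` used in place of Tika's Metadata object are assumptions made to keep the example self-contained.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GeographicSketch {

    // Key names as proposed in the comment above.
    public static final String LATITUDE = "geo:latitude";
    public static final String LONGITUDE = "geo:longitude";

    // Stand-in for a multi-valued metadata store (assumption: values keep insertion order).
    private final Map<String, List<String>> values = new HashMap<>();

    private List<String> list(String key) {
        return values.computeIfAbsent(key, k -> new ArrayList<>());
    }

    /** Adds latitude and longitude in one call so the two lists never get out of step. */
    public void addPoint(double lat, double lon) {
        list(LATITUDE).add(Double.toString(lat));
        list(LONGITUDE).add(Double.toString(lon));
    }

    /** Removes the pair at the given position from both lists. */
    public void removePoint(int index) {
        list(LATITUDE).remove(index);
        list(LONGITUDE).remove(index);
    }

    /** Returns a copy of the values stored under a key. */
    public List<String> getValues(String key) {
        return new ArrayList<>(list(key));
    }

    public static void main(String[] args) {
        GeographicSketch geo = new GeographicSketch();
        geo.addPoint(39.99, -0.07);
        geo.addPoint(34.2, -118.17);
        geo.removePoint(0);
        System.out.println(geo.getValues(LATITUDE) + " " + geo.getValues(LONGITUDE));
    }
}
```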

          Chris A. Mattmann added a comment -

          Hey Nick,

          I think we need to support both cases (single lat/lon per document as well as many lat/lon pairs per document). In the case of the former, it's easy, we have:

          key: Metadata.LATITUDE
          val: some lat

          key: Metadata.LONGITUDE
          val: some lon

          And, in the case of the latter, we have:

          key: Metadata.LATITUDE
          val: some lat, some lat2, some lat3, some lat n...

          key: Metadata.LONGITUDE
          val: some lon, some lon2, some lon3, some lon n...

          Because the keys are ordered in the Metadata object, I think that we can make sure they match up and treat single points the same as multiple points. It's great to have support for both on a per-Metadata-object basis too, since many scientific data formats have both scenarios in them (e.g., NetCDF and HDF typically have arrays of lats and lons, and sometimes single point values as well).

          The reason we need to support both is that distance computation (point/radius, bounding box, and polygon) would require both scenarios to be supported. Once this work is prototyped, I've been thinking of integrating Tika with the work in SIS to build out a computational spatial library. I think Tika could be used to feed lats/lons into SIS.
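
To make the pairing concrete, here is a small Java sketch that matches latitude and longitude values by position and derives a bounding box from them. It assumes the values arrive as two equally long, ordered lists of strings; `LatLonPairing` and `boundingBox` are hypothetical names. The simple min/max approach ignores boxes that cross the antimeridian.

```java
import java.util.Arrays;
import java.util.List;

public class LatLonPairing {

    /** Returns {south, west, north, east} for latitude/longitude lists paired by position. */
    public static double[] boundingBox(List<String> lats, List<String> lons) {
        if (lats.isEmpty() || lats.size() != lons.size()) {
            throw new IllegalArgumentException("need the same non-zero number of lats and lons");
        }
        double south = Double.POSITIVE_INFINITY, north = Double.NEGATIVE_INFINITY;
        double west = Double.POSITIVE_INFINITY, east = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < lats.size(); i++) {
            // The i-th latitude belongs with the i-th longitude.
            double lat = Double.parseDouble(lats.get(i));
            double lon = Double.parseDouble(lons.get(i));
            south = Math.min(south, lat);
            north = Math.max(north, lat);
            west = Math.min(west, lon);
            east = Math.max(east, lon);
        }
        return new double[] {south, west, north, east};
    }

    public static void main(String[] args) {
        List<String> lats = Arrays.asList("10.5", "-3.0", "42.0");
        List<String> lons = Arrays.asList("20.0", "15.5", "-7.25");
        // Prints [-3.0, -7.25, 42.0, 20.0]
        System.out.println(Arrays.toString(boundingBox(lats, lons)));
    }
}
```

A document with a single point goes through the same code path and simply yields a box whose opposite bounds are equal.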

          Cheers,
          Chris

          Nick Burch added a comment -

          I was wondering about extracting geo data from JPEG EXIF tags. For this, we'd probably want dedicated metadata properties for lat and long.

          (Other files can have a single lat+long in them too, e.g. HTML pages with the ICBM meta tags.)

          Not sure how well that might integrate with this work though, since shapefiles will typically contain a large number of lats+longs (or similar geographic points).

          Anyone have any ideas about a single created-at position vs stream of locations from geo formats?

          Arturo Beltran added a comment -

          I'm not convinced about using OGDI. From what I understand from reading the documentation, OGDI offers an API in C, so we would encounter the same problem integrating it with Java. In addition, the project has not been updated since 2008, so newer geographic formats (e.g., KML) are not supported. Also, I think OGDI does not support databases or services.

          However, you can do some proof of concept to see if it would be very difficult to integrate with Java and see exactly what metadata can be extracted using OGDI. Then we can compare these results with mine and decide.

          As you can see, I've attached a sample XML file (getFDOMetadata.xml) that contains the information extracted from a SHP by my proof-of-concept server based on FDO. This is the result of a simple HTTP call (http://localhost:12345/getFDOMetadata?source=C:\ExampleData\shp_world_countries\country.shp&provider=SHP)
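
For anyone scripting such a call, a Windows path contains characters (`:` and `\`) that should be percent-encoded in a query string. The Java sketch below builds the request URL with the standard `URLEncoder`; the endpoint and parameter names are taken from the call above, while the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class FdoRequestUrl {

    /** Builds the query URL, percent-encoding both parameter values. */
    public static String build(String baseUrl, String source, String provider) {
        try {
            return baseUrl
                    + "?source=" + URLEncoder.encode(source, "UTF-8")
                    + "&provider=" + URLEncoder.encode(provider, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String url = build("http://localhost:12345/getFDOMetadata",
                "C:\\ExampleData\\shp_world_countries\\country.shp", "SHP");
        // Prints http://localhost:12345/getFDOMetadata?source=C%3A%5CExampleData%5Cshp_world_countries%5Ccountry.shp&provider=SHP
        System.out.println(url);
    }
}
```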

          For now, I'll keep trying to run my "Hello world" Tika parser.

          Regards,
          Arturo

          Arturo Beltran made changes -
          Field Original Value New Value
          Attachment getFDOMetadata.xml [ 12447682 ]
          Arturo Beltran added a comment -

          XML example that contains the information extracted from a SHP by my proof-of-concept server based on FDO.

          Mayank Singh added a comment -

          Arturo, I am not very comfortable with C++ and have no knowledge of the .NET platform (I'm a Java guy), so my help in this matter will be very limited if you plan on using FDO. However, I was looking around for alternatives and found OGDI (http://ogdi.sourceforge.net/), which can act as a middle layer between various data sources and has almost the same capabilities for data dissemination over the network as FDO (more info here: http://www.gisdevelopment.net/technology/gis/techgi0057b.htm).
          So what I am suggesting is that we look into it; once we get the heterogeneous data into the uniform data structure that OGDI supports, we can use Java to integrate it with Tika.
          I'll keep searching for more info. Do tell me your views on this.
          Regards
          Mayank

          Arturo Beltran added a comment -

          You are right, Chris. From now on, I will try to keep the discussions on the list or here.

          I will try to explain briefly what exactly I'm working on so that you can get involved.
          The first piece is what allows us to access resources: we need a platform that accesses heterogeneous resources in the most homogeneous way possible. The best approach I've found is FDO (http://fdo.osgeo.org/). In short, FDO is an API for manipulating, defining and analyzing geospatial information regardless of where it is stored.

          So it looks simple: I only have to integrate FDO as a Tika parser and I'm done. The problem appeared when I tried to connect this C++ API with Java. I have worked with SWIG and directly with JNI, but I have not gotten it to work.
          Finally, as a temporary proof of concept, I implemented a simple HTTP server in .NET that offers resource descriptions using FDO. Now I'm trying to create a dummy parser for Tika that makes calls to that server.

          I hope I explained this well enough to be understood; otherwise, feel free to ask again.

          Greetings and thanks for your interest,
          Arturo

          Chris A. Mattmann added a comment -

          Hi Guys,

          Thanks for the effort here. Please try hard to keep the discussions on list as the community will benefit from them and can help provide feedback incrementally.

          Thanks,
          Chris

          Arturo Beltran added a comment -

          Hi all,

          I am pleased by the interest shown by the community on my proposal. As I said, any help is welcome.
          I have sent Mayank all the details about my work on this issue. If anyone else is interested in collaborating or simply providing their ideas/comments, do not hesitate to contact me.

          Cheers,
          Arturo

          Mayank Singh added a comment - - edited

          Hi Arturo
          I would like to collaborate on this issue. I have also sent you an e-mail regarding the same.
          Thanks and regards
          Mayank

          Chris A. Mattmann added a comment -

          Hi Arturo,

          Thanks for reporting this issue; it sounds awesome! I'm definitely interested in this topic and will be sure to help however I can.

          Cheers,
          Chris

          Arturo Beltran created issue -

            People

            • Assignee:
              Chris A. Mattmann
              Reporter:
              Arturo Beltran
            • Votes:
              0
              Watchers:
              8
