Details

      Description

      The Geospatial Data Abstraction Library (GDAL, http://gdal.org/) is the Swiss Army knife for dealing with geospatial formats. Incorporating GDAL and wrapping it as a Tika external parser will let Tika understand geospatial formats and help index them.

      1. TIKA-605.Mattmann.101114.patch.2.txt (21 kB, Chris A. Mattmann)
      2. TIKA-605.Mattmann.100914.2.patch.txt (17 kB, Chris A. Mattmann)
      3. TIKA-605.Mattmann.100914.1.patch.txt (14 kB, Chris A. Mattmann)
      4. TIKA-605.Mattmann.092511.patch.txt (5 kB, Chris A. Mattmann)
      5. 0001-TIKA-605-Tika-GDAL-parser.patch (6 kB, Jukka Zitting)

          Activity

          Chris A. Mattmann added a comment -
          • totally incomplete patch, but attaching so I can clean my local workspace, and get back to this later. Need to get GDAL bindings jar up on Maven Central too.
          Chris A. Mattmann added a comment -

          The other tricky thing about this is that GDAL seems to have its own MIME identification system that is based on file name or glob pattern. So when I used TikaInputStream.getFile(), which returns a temp file with a temp file name, GDAL complained that it didn't understand that file type. I think I need to specifically request a file extension for the temp file, or if I can't, then I'll update TikaInputStream.getFile() to allow this.

          Chris A. Mattmann added a comment -
          • i'll try and get this in for 1.0
          Jukka Zitting added a comment -

          I guess ideally we should ask the GDAL toolkit to support parsing just an InputStream.

          But until that happens, the attached patch implements a simple mechanism by which a parser can provide a default file name suffix for TikaInputStream.getFile() to use. The relevant parser code would be something like this:

          File file = tis.getFile(metadata.get(Metadata.RESOURCE_NAME_KEY));
          

          or:

          File file = tis.getFile("pattern.pdf");
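For illustration only, the suffix idea can be sketched with the plain JDK, without touching TikaInputStream. The class SuffixedTempFile and its methods suffixOf and spool are hypothetical names invented for this sketch and are not part of Tika or of the attached patch. The sketch assumes the goal is simply a temp file whose name ends in the same extension as the original resource, so that a name-based tool such as GDAL can recognize it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper (not Tika API): spools a stream to a temp file with a chosen suffix. */
final class SuffixedTempFile {

    /** Returns the extension of a resource name including the dot, or ".tmp" if there is none. */
    static String suffixOf(String resourceName) {
        if (resourceName == null) {
            return ".tmp";
        }
        // Drop any directory part so a dot in a folder name is not taken for an extension.
        int slash = Math.max(resourceName.lastIndexOf('/'), resourceName.lastIndexOf('\\'));
        String name = resourceName.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        // Require at least one character on each side of the dot.
        return (dot > 0 && dot < name.length() - 1) ? name.substring(dot) : ".tmp";
    }

    /** Copies the stream into a new temp file whose name ends with the resource's extension. */
    static Path spool(InputStream in, String resourceName) throws IOException {
        Path tmp = Files.createTempFile("tika-gdal-", suffixOf(resourceName));
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder bytes; a real caller would pass the document stream.
        Path tmp = spool(new ByteArrayInputStream(new byte[] {1, 2, 3}), "scene.tif");
        System.out.println(tmp.getFileName());
        Files.delete(tmp);
    }
}
```

Falling back to ".tmp" when no extension is available mirrors the default behavior of a temp file with no meaningful suffix; whether GDAL accepts such a file is outside the scope of this sketch.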
          
          Chris A. Mattmann added a comment -

          Thanks Jukka that really helps!

          Chris A. Mattmann added a comment -
          • push out to 1.1: prep for 1.0.
          Chris A. Mattmann added a comment -
          • push out to 1.2
          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -
          • push to 1.3
          Chris A. Mattmann added a comment -

          Martin Desruisseaux contributed a generic Envelope and simple Envelope implementation in Apache SIS:

          http://s.apache.org/Xc4

          We can leverage this to handle the Geometry portions of the GDAL parser.

          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push out to 1.4
          Chris A. Mattmann added a comment -
          • push to 1.5, get ready for 1.4 RC #1.
          Dave Meikle added a comment -

          Pushed out to 1.6, preparing for 1.5 RC

          Chris A. Mattmann added a comment -

          OK I'm working on this again. First step:

          $ brew install gdal --complete
          

          Note: if you encounter errors here after upgrading to Mavericks, first run:

          $ brew rm $(join <(brew leaves) <(brew deps gdal --complete))
          

          Then re-install it.

          Thanks to http://stackoverflow.com/questions/19548011/cannot-install-lxml-on-mac-os-x-10-9 for the info.

          Chris A. Mattmann added a comment -
          • ok here is my patch - it requires TIKA-1441 so apply that first. Also be sure to install gdal first.
          Chris A. Mattmann added a comment -
          • ok here is a fully working complete test. Unit tests pass. System.out.printlns removed, and it handles all metadata now. I had to change the invocation command b/c the ExternalParser cannot both extract Metadata and XHTML output from the same stream. Instead, I carried forward the ExternalParser's applyPatterns strategy, and am simply calling that locally (since inheritance was blocked by private methods), and I'm simply using ExternalParser to set up the command invocation and parsing both the output and the metadata from this myself. Give it a whirl!
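The pattern-based extraction described above can be sketched roughly as follows. This is not the committed GDALParser code: the class OutputPatterns, the method apply, the two regular expressions, and the sample command output are all assumptions made for illustration. The idea shown is only that each line of the external command's output is matched against a set of regular expressions, and the first capture group becomes the value of the associated metadata key.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: turn lines of command output into metadata via regex patterns. */
final class OutputPatterns {

    /** Matches each line against every pattern; capture group 1 becomes the value for that key. */
    static Map<String, String> apply(Map<Pattern, String> patterns, String output) {
        Map<String, String> metadata = new LinkedHashMap<>();
        // \R matches any line break, so Unix and Windows output are both handled.
        for (String line : output.split("\\R")) {
            for (Map.Entry<Pattern, String> entry : patterns.entrySet()) {
                Matcher m = entry.getKey().matcher(line);
                if (m.find()) {
                    metadata.put(entry.getValue(), m.group(1).trim());
                }
            }
        }
        return metadata;
    }

    public static void main(String[] args) {
        // Illustrative patterns and sample text only; real command output may differ.
        Map<Pattern, String> patterns = new LinkedHashMap<>();
        patterns.put(Pattern.compile("^Driver:\\s*(.+)$"), "Driver");
        patterns.put(Pattern.compile("^Size is\\s*(.+)$"), "Size");
        String sample = "Driver: FITS/Flexible Image Transport System\nSize is 200, 200\n";
        System.out.println(apply(patterns, sample));
    }
}
```

Keeping the patterns in a map from pattern to key name makes it easy to add new fields without changing the matching loop.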
          Chris A. Mattmann added a comment - https://reviews.apache.org/r/26542
          Chris A. Mattmann added a comment -
          • this patch fully works but I had to drop direct support for the ExternalParser (see the method comments) and bring in some of that functionality directly into this class. This is due to the ExternalParser not really handling the case where I need to get Metadata and text output from the external command output, and I need the metadata first before I call the handler.
          • added in a test for a FITS file as well.
          • will be adding docs on the wiki for this soon. Hope to get this committed in the next few hours.

          FITS file located here: http://fits.gsfc.nasa.gov/samples/WFPC2u5780205r_c0fx.fits
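The ordering constraint mentioned above (metadata first, handler second) might look like this in a minimal sketch. The class MetadataThenText and its emit method are hypothetical, the "Key: value" line format is an assumption, and the code uses only the SAX ContentHandler interface that ships with the JDK rather than any Tika class. It buffers the whole command output, fills the metadata map, and only then sends the text to the handler.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical sketch: populate metadata from buffered output before calling the handler. */
final class MetadataThenText {

    static void emit(String output, Map<String, String> metadata, ContentHandler handler)
            throws SAXException {
        // Pass 1: read "Key: value" lines into metadata; the handler is not touched yet.
        for (String line : output.split("\\R")) {
            int colon = line.indexOf(':');
            if (colon > 0 && colon < line.length() - 1) {
                metadata.put(line.substring(0, colon).trim(), line.substring(colon + 1).trim());
            }
        }
        // Pass 2: metadata is complete, so the text can now go to the handler.
        char[] chars = output.toCharArray();
        handler.startDocument();
        handler.characters(chars, 0, chars.length);
        handler.endDocument();
    }

    public static void main(String[] args) throws SAXException {
        Map<String, String> metadata = new LinkedHashMap<>();
        // Illustrative input only; real command output may differ.
        emit("Driver: GTiff/GeoTIFF\nBand 1", metadata, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                System.out.println("metadata ready: " + metadata.containsKey("Driver"));
            }
        });
    }
}
```

Buffering the full output costs memory proportional to its size, which seems acceptable for textual command output but is a design choice of this sketch, not something stated in the issue.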

          Chris A. Mattmann added a comment -

          Docs are now added here: https://wiki.apache.org/tika/TikaGDAL

          Chris A. Mattmann added a comment -
          • committed in r1631073. Thanks for all the help everyone!
          Tyler Palsulich added a comment -

          See my comments on the RB from a few minutes ago.

          Mattmann, Chris A (388J) added a comment -

          Great +1 please update in SVN


          Hudson added a comment -

          UNSTABLE: Integrated in tika-trunk-jdk1.7 #256 (See https://builds.apache.org/job/tika-trunk-jdk1.7/256/)
          Update for TIKA-605 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631074)

          • /tika/trunk/CHANGES.txt

          fix for TIKA-605: GDAL Parser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631073)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/WFPC2u5780205r_c0fx.fits
          Hudson added a comment -

          UNSTABLE: Integrated in tika-trunk-jdk1.6 #235 (See https://builds.apache.org/job/tika-trunk-jdk1.6/235/)
          Update for TIKA-605 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631074)

          • /tika/trunk/CHANGES.txt

          fix for TIKA-605: GDAL Parser (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631073)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal
          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
          • /tika/trunk/tika-parsers/src/test/resources/test-documents/WFPC2u5780205r_c0fx.fits
          Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #257 (See https://builds.apache.org/job/tika-trunk-jdk1.7/257/)
          TIKA-605: fix remainder of tpalsulich comments from https://reviews.apache.org/r/26542 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631149)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
          Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.6 #236 (See https://builds.apache.org/job/tika-trunk-jdk1.6/236/)
          TIKA-605: fix remainder of tpalsulich comments from https://reviews.apache.org/r/26542 (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631149)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
          Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.6 #238 (See https://builds.apache.org/job/tika-trunk-jdk1.6/238/)
          TIKA-605: deal with heading boundaries; add associated unit tests to expose and prove fixed for regression (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631191)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java
          Hudson added a comment -

          SUCCESS: Integrated in tika-trunk-jdk1.7 #259 (See https://builds.apache.org/job/tika-trunk-jdk1.7/259/)
          TIKA-605: deal with heading boundaries; add associated unit tests to expose and prove fixed for regression (mattmann: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1631191)

          • /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/gdal/GDALParser.java
          • /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/gdal/TestGDALParser.java

            People

            • Assignee:
              Chris A. Mattmann
              Reporter:
              Chris A. Mattmann
            • Votes:
              0
              Watchers:
              7
