Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8
    • Component/s: parser
    • Labels:
      None
    • Environment:

      indep. of env.

      Description

      Along with TIKA-399, netCDF is also a widely used scientific data format. I'm going to throw up a Tika parser that can deal with netCDF.

          Activity

          Chris A. Mattmann added a comment -
          • Basic support added in r931037. Room for extension. Currently I have to load the whole netCDF file into memory to work around a limitation of the netCDF Java API from NCAR, which doesn't handle streams (perhaps it's even a limitation of the netCDF API itself, which is random-access-file based, according to the docs). I've included basic unit tests for now. So, we've got a start, extensions welcome!
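
          The workaround described above (buffering the whole stream before handing it to a library that cannot read streams) might look roughly like the sketch below. This is a JDK-only illustration, not the code committed in r931037; the class name `StreamBuffer` and the method `readFully` are hypothetical, and the call that hands the bytes to the NetCDF library is deliberately left out.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: drains an InputStream into memory so the bytes can be
// passed to an API that needs the complete content up front.
final class StreamBuffer {

    /** Reads the stream to its end and returns everything as one byte array. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        // read() returns -1 once the stream is exhausted
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = readFully(new ByteArrayInputStream(new byte[] {1, 2, 3}));
        System.out.println(data.length + " bytes buffered");
        // The array would then go to the library's in-memory open call
        // (not shown; check the NetCDF Java docs for the exact method).
    }
}
```

          The obvious cost is that memory use grows with file size, which matters for large scientific data sets.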
          Jukka Zitting added a comment -

          It's bad practice (see [1]) to reference external repositories in POMs meant to be released to Maven Central. Can we ask NetCDF to consider uploading their jars to Maven Central?

          We should also update the license files in tika-app and tika-bundle to include the NetCDF license terms.

          [1] http://www.sonatype.com/people/2010/03/why-external-repos-are-being-phased-out-of-central/

          Chris A. Mattmann added a comment -

          Hey Jukka:

          Interesting – makes sense, but oddly, why does Maven allow external repos to be referenced at all then if they want you to just put all your artifacts into central anyways?

          I'll send an email to the NetCDF'ers asking if they would be willing to upload their jars to central, and copy tika-dev@, so stay tuned. In the meantime, FWIW, can we resolve this issue and then open a new one to track updating the Tika POM to reference the new central NetCDF jars, assuming the NetCDF'ers are cool with uploading? I'm a fan of just creating new issues and linking them.

          Cool?

          Cheers,
          Chris

          Jukka Zitting added a comment -

          > why does Maven allow external repos to be referenced at all then if they want you to just put all your artifacts into central anyways?

          It's useful, for example, for companies that have private repositories and never plan to release their code to Central. But if you're an open source project, then a separate repository just makes life more difficult for downstream users.

          > can we resolve this issue and then open a new one

          Sure, re-resolving.

          Jukka Zitting added a comment -

          BTW, do we need the commons-httpclient dependency for this? If possible, it would be good if the parsing process didn't try to access external resources.

          Chris A. Mattmann added a comment -

          Hey Jukka:

          I think we do, since the NetCDF lib relies on it. I agree with you about not accessing external resources. The problem is that this NetCDF library (which seems to be the most used and best maintained from a Java perspective) expects to be responsible for how content is delivered to it, too. In fact, NetCDF and HDF concern themselves not only with obtaining data from a particular stream/content, but also with how that content is represented: because the data volumes are so large, they have to optimize how the data is extracted and represented for access.

          So, I actually ran into something similar here: the core abstraction for opening up a NetcdfFile in the lib takes only a File as input, so it's really hard to pass it a stream, which is what Tika expects. Arg! Very frustrating indeed. I'll look around and see if there is another ASL-friendly NetCDF Java library (does anyone else know of one?)

          Cheers,
          Chris
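
          One generic way to bridge a stream-based caller and a file-only API, as an alternative to buffering everything in memory, is to spool the stream to a temporary file first. The sketch below uses only the JDK; the class name `TempFileSpool`, the method `spool`, and the `netcdf-`/`.nc` temp-file naming are hypothetical, and nothing here is taken from the Tika or NetCDF code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies a stream to a temp file so that a file-based
// (random access) API can open it by path.
final class TempFileSpool {

    /** Writes the whole stream to a new temp file and returns its path. */
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("netcdf-", ".nc");
        // createTempFile already created the file, so allow overwriting it
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = spool(new ByteArrayInputStream(new byte[] {1, 2, 3, 4}));
        try {
            System.out.println("spooled " + Files.size(tmp) + " bytes to " + tmp);
            // tmp.toString() or tmp.toFile() could be handed to a file-only API here
        } finally {
            // the caller owns the temp file and must clean it up
            Files.deleteIfExists(tmp);
        }
    }
}
```

          This trades memory for disk I/O and leaves cleanup to the caller, so whether it beats in-memory buffering depends on file sizes and the deployment environment.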


            People

            • Assignee:
              Chris A. Mattmann
              Reporter:
              Chris A. Mattmann
            • Votes:
              0
              Watchers:
              0
