Couple of comments.
Thanks for doing the test. I know this already because I hit it, too. It's caused by Tika's dependencies: the NetCDF (http://www.unidata.ucar.edu/software/netcdf/) parser is compiled for Java 1.6, while Tika itself targets Java 1.5, so this is a Tika bug.
In Tika, I wouldn't classify this as a bug, since our parser jar dependencies can be excluded in various ways. It's simply a requirement for folks who are interested in all of the features that the NetCDF library provides; if you don't care about parsing those types of files, you can simply omit that parser and exclude the jar file dependency.
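As a sketch of what "exclude the jar file dependency" can look like with Maven (the `edu.ucar`/`netcdf` coordinates and the `1.x` version are assumptions; check your Tika version's actual dependency tree before copying this):

```xml
<!-- Hypothetical sketch: pull in tika-parsers but exclude the NetCDF
     library if you don't need scientific data formats. Artifact
     coordinates are assumptions; verify against your dependency tree. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <artifactId>tika-parsers</artifactId>
  <version>1.x</version>
  <exclusions>
    <exclusion>
      <groupId>edu.ucar</groupId>
      <artifactId>netcdf</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

With the jar excluded, Tika simply won't be able to parse NetCDF files, but everything else keeps working.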
It's obscure, indeed, especially for people outside the climate community.
Obscure? Sorry, not meaning to argue here, but that's patently untrue. All data formats are at some level obscure, depending on the community you work in. The "climate" community that you are talking about includes a broad range of folks dealing with remote sensing, climate modeling, decision making, etc., at some of the highest levels of government, funding, and other areas, both in the U.S. and internationally. NetCDF, HDF, OPeNDAP, and other formats are pretty broadly accepted standards. The use of data from NetCDF, for example, resulted in over 2,000 publications generated as part of the last Intergovernmental Panel on Climate Change (IPCC) and its 4th assessment report. So, I'm not sure it's obscure.
The UCAR netcdf library is, on the other hand, not able to handle streaming file input, so TIKA loads the whole file into memory
Yep, that's more an issue of the underlying data file format than of the library itself: the format doesn't support random access, and yes, the current code I had to bake into Tika unfortunately must work around that by loading the whole file into memory. Jukka and I have discussed better support for this, including temporary file support in Tika, and we're working on improving it, but we're not there yet.
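The "temporary file support" idea can be sketched in plain Java: spool the incoming stream to a temp file on disk, then hand that file to the random-access-only library instead of buffering the whole thing in memory. This is a minimal sketch of the technique, not Tika's actual implementation; the class and method names here are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SpoolToTempFile {

    // Copy a non-seekable stream to a temporary file so that a
    // random-access-only reader (like the UCAR NetCDF library) can
    // open it, without holding the whole file in heap memory.
    static Path spool(InputStream in) throws IOException {
        Path tmp = Files.createTempFile("netcdf-", ".nc");
        tmp.toFile().deleteOnExit(); // best-effort cleanup
        Files.copy(in, tmp, StandardCopyOption.REPLACE_EXISTING);
        return tmp;
    }

    public static void main(String[] args) throws IOException {
        // Fake payload standing in for a NetCDF file ('C','D','F' is
        // the classic NetCDF magic number).
        byte[] fake = new byte[] {'C', 'D', 'F', 1};
        Path p = spool(new ByteArrayInputStream(fake));
        System.out.println(Files.size(p)); // size of the spooled copy
        Files.deleteIfExists(p);
    }
}
```

The spooled path could then be passed to whatever open-by-filename API the parsing library exposes, keeping memory usage bounded by the copy buffer rather than the file size.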
I don't really see the use-case for support in Solr
It's up to you guys. If you want to tell users of Solr, "hey, you can drop a scientific data file format onto Solr and magically its metadata will be indexed," then it might be important. We do this in OODT quite often, and it's one of the core use cases (we even use Lucene and Solr for the metadata catalogs).
Loading a 500 Megabyte file into memory just to get the header
A lot of times that header contains the key parameters (spatial and temporal bounds) required to decide what to do with the file, as well as other metadata fields, including the remote sensing or climate variables being measured, valid units, links to publications, etc. So it's far from useless information.
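To make that concrete, here is a hypothetical header dump in the style of `ncdump -h` (the file name, variable, and attribute values are invented; the attribute names follow common NetCDF/CF conventions). Everything shown lives in the header, so it is exactly the kind of searchable metadata you get without reading the data payload:

```
netcdf sample_sst {
dimensions:
        time = 12 ; lat = 180 ; lon = 360 ;
variables:
        float sst(time, lat, lon) ;
                sst:long_name = "sea surface temperature" ;
                sst:units = "K" ;
// global attributes:
                :geospatial_lat_min = -90.f ;
                :geospatial_lat_max = 90.f ;
                :time_coverage_start = "2009-01-01" ;
                :time_coverage_end = "2009-12-31" ;
}
```

Spatial bounds, time coverage, variable names, and units are all right there in the header, which is why indexing just the header is already valuable.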
Right, but how many people have these gigabyte climate data files
Depends on who is using it. Like I said, this is pretty much all of the files that I deal with, but to each their own. Disabling it in Solr isn't really going to affect me (or others) much, since OODT pretty much does this anyway, but meh.