Thanks. Replies below:
- loses info by removing newlines
Only does this when …, and actually adds functionality in doing so (without doing this, you can't load the data into Excel; see my comments above and in the code).
- always encapsulates with quotes - not as readable
See the CSV spec, via the Wikipedia links in the code. Doing so reduces ambiguity and clearly delineates where the value starts and where it stops.
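As a sketch of what the spec-style encapsulation discussed here looks like (the class and method names are mine, not from the patch), a field may always be wrapped in double quotes, with any embedded quote doubled so the boundaries stay unambiguous:

```java
// Sketch only: illustrates always-quoting per the CSV conventions referenced
// above (Wikipedia / RFC 4180). Not the patch's actual code.
public class CsvQuote {
    static String quote(String value) {
        // Doubling embedded quotes keeps the field boundaries unambiguous,
        // and a quoted field may safely contain the separator.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        System.out.println(quote("plain"));          // "plain"
        System.out.println(quote("has \"quotes\"")); // "has ""quotes"""
        System.out.println(quote("a,b"));            // "a,b"
    }
}
```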
- doesn't escape encapsulator in values
Is there a need to do this? I don't think so...
- doesn't escape separator in multi-valued fields
Same as above: no need, really.
- isn't really nested CSV, so it's not compatible with the CSVLoader
What do you mean by "not compatible with the CSVLoader"?
- uses System.getProperty("line.separator")... we should avoid different behavior on different platforms
Hmm, I've never been dinged before for writing platform-independent code. That's why they put the property in there: line.separator means the same thing, programming-construct-wise, across platforms. So I don't really get your ding here.
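For what it's worth, the behavior being flagged is just this (a standalone sketch, nothing from the patch): the property resolves to the platform's native line ending, so dump files differ byte-for-byte across operating systems.

```java
// line.separator is "\n" on Unix-like systems, "\r\n" on Windows,
// so files written with it are not byte-identical across platforms.
public class LineSep {
    public static void main(String[] args) {
        String sep = System.getProperty("line.separator");
        System.out.println(sep.equals("\n") ? "unix-style" : "windows-or-other");
        // A writer that wants identical output everywhere would hard-code a
        // separator (e.g. "\r\n", which the CSV spec uses) instead of reading
        // the property.
    }
}
```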
- doesn't stream documents (dumping your entire index will be one use case)
I actually implemented both the streaming method (#writeDoc) and the aggregate method (#writeAllDocs). I set #isStreaming to false because it makes for clean CSV header writing, rather than hacky code in #writeDoc to deal with the (potential) non-uniformity of fields across documents. Additionally, I'm using this in production right now, on the solr-1.5 branch with an index of over 1M documents, and the write performance is quite fast.
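To illustrate the header point: the method name below mirrors the patch's #writeAllDocs, but the body is my own sketch, not the real code. With the aggregate path you can take the union of field names across all docs before emitting a single row, so every row lines up under one header:

```java
import java.util.*;

// Sketch of why the aggregate path gives clean header writing: two passes,
// first to collect the union of field names, then to emit uniform rows.
// Illustration only; not the patch's actual implementation.
public class CsvDump {
    static String writeAllDocs(List<Map<String, String>> docs) {
        Set<String> header = new LinkedHashSet<>();
        for (Map<String, String> doc : docs) header.addAll(doc.keySet());
        StringBuilder out = new StringBuilder(String.join(",", header)).append("\n");
        for (Map<String, String> doc : docs) {
            List<String> row = new ArrayList<>();
            for (String field : header) row.add(doc.getOrDefault(field, ""));
            out.append(String.join(",", row)).append("\n");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> d1 = new LinkedHashMap<>();
        d1.put("id", "1"); d1.put("name", "a");
        Map<String, String> d2 = new LinkedHashMap<>();
        d2.put("id", "2"); d2.put("price", "9");
        // Header becomes "id,name,price"; missing fields are blank cells.
        System.out.print(writeAllDocs(List.of(d1, d2)));
    }
}
```

A streaming #writeDoc, by contrast, has to commit to a header before seeing later docs, which is exactly the non-uniformity hack mentioned above.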
- performance: patterns shouldn't be compiled per-doc
This only matters when …, and I think the performance hit isn't really an issue. If you feel strongly about it, though, we could always compile the pattern above the loop and reuse it...