I am using Apache Camel to process a lot of large CSV files, and relying on Bindy to assist with unmarshalling them into POJOs.
An upstream data bug causes one of our records to contain the Unicode character NEL (U+0085). While we work through the cause of that, I got curious about what Bindy actually does with it. We rely on the unmarshal process to perform a batch insert, and because our POJO is missing certain fields, we started observing that the
Bindy relies on Scanner to read lines from a large file. However, Scanner's line handling treats NEL as a line terminator, so a record that contains it is split into two lines. The modern Files API does not do this; it splits only on the traditional line terminators (\n, \r, or \r\n).
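To make the difference concrete, here is a minimal sketch that feeds the same input to Scanner and to BufferedReader (used here as an in-memory stand-in for the Files API, since its readLine() recognizes the same three terminators). The class name, method names, and sample record are made up for illustration; this is not Bindy's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class NelLineSplitDemo {

    // Hypothetical sample: the first record has a stray NEL (U+0085) inside a field.
    static final String SAMPLE = "1,foo\u0085bar,x\n2,baz,y\n";

    // Reads lines with Scanner.nextLine(), which also treats NEL as a line separator.
    static List<String> linesViaScanner(String input) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(input)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // Reads lines with BufferedReader.readLine(), which only splits on \n, \r, or \r\n.
    static List<String> linesViaBufferedReader(String input) {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Scanner breaks the first record at the NEL, so it reports one extra line.
        System.out.println(linesViaScanner(SAMPLE).size());        // prints 3
        System.out.println(linesViaBufferedReader(SAMPLE).size()); // prints 2
    }
}
```

With Scanner, the first record arrives as two partial lines, which matches the symptom of a POJO with missing fields.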
There are two ways to fix this from what I've been able to smoke test:
- Change the Scanner implementation to use a delimiter made up of only the traditional newline characters
- Use Java 8's Files API and stream the file in
I would personally prefer the Files API here, since it's more robust and capable of higher performance, but I'll explore both approaches and see where I end up.
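For reference, both approaches might look roughly like the sketch below. It is a standalone illustration, not a patch against Bindy: the class, the method names, and the temp-file helper are all hypothetical, and UTF-8 is assumed as the file encoding.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class NelSafeLineReaders {

    // Approach 1: keep Scanner, but tokenize on \r\n, \r, or \n only,
    // so NEL stays inside the record instead of ending it.
    static List<String> readWithScannerDelimiter(Path file) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(file, "UTF-8")) {
            scanner.useDelimiter("\r\n|[\r\n]");
            while (scanner.hasNext()) {
                lines.add(scanner.next());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    // Approach 2: stream the file with Files.lines. Lines are collected here
    // only to keep the demo short; real code could process the stream lazily.
    static List<String> readWithFilesLines(Path file) {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.UTF_8)) {
            return lines.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical helper: writes sample content to a temp file for the demo.
    static Path writeTempFile(String content) {
        try {
            Path tmp = Files.createTempFile("nel-sample", ".csv");
            tmp.toFile().deleteOnExit();
            Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
            return tmp;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = writeTempFile("1,foo\u0085bar,x\n2,baz,y\n");
        System.out.println(readWithScannerDelimiter(file).size()); // prints 2
        System.out.println(readWithFilesLines(file).size());       // prints 2
    }
}
```

One caveat with the delimiter approach: hasNext()/next() tokenizing is not identical to hasNextLine()/nextLine(), so edge cases such as blank lines are worth testing separately before relying on it.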