[CAMEL-12698] Unmarshaling a CSV file with the NEL (next line) character will cause Bindy to misread the entire file - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 2.22.0
Fix Version/s: 2.23.0
Component/s: camel-bindy
Labels: None
Estimated Complexity: Unknown

Description

I am using Apache Camel to process a lot of large CSV files, and relying on Bindy to assist with unmarshalling them into POJOs.

We have an upstream data bug that causes one of our records to contain the Unicode character NEL (U+0085). While we work through the cause of that, I was curious about what Bindy actually does with it. We rely on the unmarshal process to perform a batch insert, and because our POJO is missing certain fields, we started observing that the

Bindy relies on Scanner to read lines from a large file; however, Scanner applies its own line-splitting rules and treats the NEL character as a line terminator. The modern Files API does not do this and splits lines only on the conventional terminators (i.e., \n, \r, or \r\n).
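To illustrate the difference in line splitting, here is a minimal sketch that reads the same in-memory text once with Scanner and once with BufferedReader (which follows the same line-terminator rules as the Files API). The class name, method names, and sample record are hypothetical and are not taken from Bindy's source.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;

// Hypothetical demo class; not part of camel-bindy.
class NelLineSplitDemo {

    // Reads lines the way a Scanner-based reader would: nextLine() also
    // treats NEL (U+0085) as a line separator.
    static List<String> linesWithScanner(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // Reads lines with BufferedReader, which only splits on \n, \r, or \r\n.
    static List<String> linesWithBufferedReader(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines().collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Two CSV records; the first one contains a stray NEL character.
        String csv = "1,Alice\u0085Smith\n2,Bob\n";
        System.out.println(linesWithScanner(csv).size());        // 3
        System.out.println(linesWithBufferedReader(csv).size()); // 2
    }
}
```

With Scanner, the first record is cut in two at the NEL character, so a record-per-line parser would see three malformed or partial records instead of two.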

There are two ways to fix this from what I've been able to smoke test:

Change the Scanner implementation to use a delimiter made up of only the traditional newline characters
Use Java 8's Files API to stream the file in
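The two options above could look roughly like the following. This is a sketch under assumptions, not the change that was merged for this issue: the class, the method names, the temp-file helper, and the UTF-8 charset are all hypothetical choices made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of both approaches; not the actual camel-bindy code.
class NewlineOnlyReaders {

    // Option 1: keep Scanner, but tokenize on an explicit delimiter that
    // contains only \r\n, \n, and \r, so NEL stays inside the record.
    static List<String> readWithScannerDelimiter(String text) {
        List<String> records = new ArrayList<>();
        try (Scanner scanner = new Scanner(text).useDelimiter("\\r\\n|\\n|\\r")) {
            while (scanner.hasNext()) {
                records.add(scanner.next());
            }
        }
        return records;
    }

    // Option 2: stream the file with Files.lines, which splits only on
    // \n, \r, or \r\n. UTF-8 is an assumption about the file encoding.
    static List<String> readWithFilesApi(Path csvFile) {
        try (Stream<String> lines = Files.lines(csvFile, StandardCharsets.UTF_8)) {
            return lines.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper for the demo only: writes the sample text to a temporary file.
    static Path writeTempCsv(String content) {
        try {
            Path file = Files.createTempFile("nel-demo", ".csv");
            Files.write(file, content.getBytes(StandardCharsets.UTF_8));
            file.toFile().deleteOnExit();
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String csv = "1,Alice\u0085Smith\n2,Bob\n";
        System.out.println(readWithScannerDelimiter(csv).size());
        System.out.println(readWithFilesApi(writeTempCsv(csv)).size());
    }
}
```

One difference worth noting: with a single-terminator delimiter, Scanner may return empty tokens for consecutive blank lines, so the two options are not guaranteed to behave identically on every input.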

I would personally want to use the Files API to handle this since it's more robust and capable of higher performance, but I'll explore both approaches and see where I end up.

Issue Links

links to

GitHub Pull Request #2454

People

Assignee: Onder Sezgin
Reporter: Jason Black
Votes: 1
Watchers: 6

Dates

Created: 30/Jul/18 15:23
Updated: 05/Oct/18 22:56
Resolved: 05/Oct/18 22:56