[CSV-277] Review Lexer simpleToken for Performance - ASF JIRA

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

While running the Apache ORC benchmarks, which have commons-csv as a dependency, I noticed that the bulk of the running time is spent in commons-csv.

I attached the VisualVM output and here is my test setup:

JVM: OpenJDK 64-Bit Server VM (25.292-b10, mixed mode)
Java: version 1.8.0_292, vendor Private Build
Java Home: /usr/lib/jvm/java-8-openjdk-amd64/jre
JVM Flags: <none>

I suspect this is partly because ExtendedBufferedReader extends BufferedReader. BufferedReader's read methods synchronize on an internal lock, which means that every call to read() has to acquire that lock. Usually that's not an issue, but for commons-csv it adds a lot of overhead because the lexer reads the input one character at a time. So even though the input is buffered, it has to go through a synchronization step for each character read. It also has to perform a "jump" into the parent class for each character.
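To make the suspicion concrete, here is a minimal sketch of a buffered character source that takes no lock per read. It is not the commons-csv implementation or a proposed patch: the class name UnsyncCharReader and its read()/peek() methods are hypothetical, and the sketch assumes the parser is only ever used from a single thread. Whether this actually removes the hot spot would need to be confirmed against the benchmark.

```java
import java.io.IOException;
import java.io.Reader;

/**
 * Hypothetical sketch: buffers characters from an underlying Reader without
 * extending BufferedReader and without locking in read().
 * Not thread-safe by design.
 */
final class UnsyncCharReader implements AutoCloseable {
    private final Reader in;
    private final char[] buf;
    private int pos;   // index of the next char to return
    private int limit; // number of valid chars currently in buf

    UnsyncCharReader(Reader in, int bufferSize) {
        if (bufferSize <= 0) {
            throw new IllegalArgumentException("bufferSize must be positive");
        }
        this.in = in;
        this.buf = new char[bufferSize];
    }

    /** Returns the next character, or -1 at end of stream. */
    int read() throws IOException {
        if (pos >= limit && !fill()) {
            return -1;
        }
        return buf[pos++];
    }

    /** Returns the next character without consuming it, or -1 at end of stream. */
    int peek() throws IOException {
        if (pos >= limit && !fill()) {
            return -1;
        }
        return buf[pos];
    }

    /** Refills the buffer with one bulk read; returns false at end of stream. */
    private boolean fill() throws IOException {
        int n;
        do {
            n = in.read(buf, 0, buf.length);
        } while (n == 0);
        if (n < 0) {
            return false;
        }
        pos = 0;
        limit = n;
        return true;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```

The underlying Reader is still called in bulk, so any locking it does is paid once per buffer refill rather than once per character. The trade-off is that the class is no longer safe to share between threads.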

Nothing else stands out to me as being "slow."