Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- 1.11.0
Description
The `toString()` method in `SimpleGroup` uses `+=` for String concatenation, which is a known performance problem in Java: every `+=` copies the entire string built so far, so the total cost grows quadratically with the number of strings that are added.
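As a rough illustration of the difference, the sketch below builds the same string both ways. The class and method names (`ConcatDemo`, `concatWithPlus`, `concatWithBuilder`) and the sample strings are made up for this example and are not part of Parquet. Timings depend heavily on the JVM and hardware, so none are quoted here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class for illustration only; not part of Parquet.
public class ConcatDemo {

    // Each += allocates a new String and copies everything accumulated so far,
    // so the total number of characters copied grows quadratically.
    public static String concatWithPlus(List<String> parts) {
        String result = "";
        for (String part : parts) {
            result += part;
        }
        return result;
    }

    // StringBuilder appends into a growable buffer, so the total work is
    // roughly linear in the length of the output.
    public static String concatWithBuilder(List<String> parts) {
        StringBuilder result = new StringBuilder();
        for (String part : parts) {
            result.append(part);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // 31,000 entries, matching the order of magnitude reported in this issue.
        List<String> parts = new ArrayList<>();
        for (int i = 0; i < 31_000; i++) {
            parts.add("element: value-" + i + "\n");
        }

        long t0 = System.nanoTime();
        String slow = concatWithPlus(parts);
        long t1 = System.nanoTime();
        String fast = concatWithBuilder(parts);
        long t2 = System.nanoTime();

        System.out.println("same output: " + slow.equals(fast));
        System.out.println("+= took " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("StringBuilder took " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

Both methods return identical output; only the amount of copying differs.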
We ran into a performance problem whereby a single column in a Parquet file was defined as a group:
    optional group customer_ids (LIST) {
      repeated group list {
        optional binary element (STRING);
      }
    }
and had over 31,000 values. Reading this single column took over 8 minutes due to time spent in the `toString()` method. Using a different implementation that uses `StringBuffer` like this:
    StringBuffer result = new StringBuffer();
    int i = 0;
    for (Type field : schema.getFields()) {
      String name = field.getName();
      List<Object> values = data[i];
      ++i;
      if (values != null) {
        if (values.size() > 0) {
          for (Object value : values) {
            result.append(indent);
            result.append(name);
            if (value == null) {
              result.append(": NULL\n");
            } else if (value instanceof Group) {
              result.append("\n");
              result.append(betterToString((SimpleGroup) value, indent + " "));
            } else {
              result.append(": ");
              result.append(value.toString());
              result.append("\n");
            }
          }
        }
      }
    }
    return result.toString();
reduced that time to less than 500 milliseconds.
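One possible refinement, sketched below under stated assumptions, is to pass a single builder down through the recursion so that nested groups append directly to it instead of returning intermediate strings. `NestedPrinter` and `Node` are hypothetical simplified stand-ins and not Parquet's `SimpleGroup` API. The sketch also uses `StringBuilder` rather than `StringBuffer`, on the assumption that the buffer is never shared between threads.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simplified stand-in for a nested group; not Parquet's SimpleGroup.
public class NestedPrinter {

    public static final class Node {
        public final String name;
        public final Object value; // used only when the node has no children
        public final List<Node> children = new ArrayList<>();

        public Node(String name, Object value) {
            this.name = name;
            this.value = value;
        }
    }

    // Entry point: one builder is created here and shared by the whole traversal.
    public static String render(Node root) {
        StringBuilder out = new StringBuilder();
        appendTo(out, root, "");
        return out.toString();
    }

    // Appends this node, then its children, to the shared builder.
    private static void appendTo(StringBuilder out, Node node, String indent) {
        out.append(indent).append(node.name);
        if (!node.children.isEmpty()) {
            out.append('\n');
            for (Node child : node.children) {
                appendTo(out, child, indent + "  ");
            }
        } else if (node.value == null) {
            out.append(": NULL\n");
        } else {
            out.append(": ").append(node.value).append('\n');
        }
    }
}
```

With this shape, each character is written into the output buffer once, regardless of nesting depth.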
The existing implementation exhibits a well-known Java string concatenation performance problem and should be fixed.
This was a significant problem for us, but we were able to work around it, so I am marking this issue as "Minor".