Affects Version/s: None
Fix Version/s: 0.11
Currently Pig doesn't support lazy ser/de, so as one of the best practices we recommend that users not declare types in the schema, so that Pig guesses the types and casts them lazily. However, if Pig guesses the wrong type, in particular if it mistakes a double field for an integer field, the casting overhead is tremendous because of exception handling.
See Utf8StorageConverter#bytesToInteger. It first tries to cast the bytes to Integer with Integer.parseInt(); if that throws an exception, it falls back to Double.parseDouble() and converts the result back to Integer. The problem is that the exception handling can be about 10x slower than the actual casting. bytesToLong has the same problem. Below are the results of a mini-benchmark:
|Expression||Time (seconds)|
|Integer.parseInt(i + ".0");||118|
|Integer.parseInt(i + "");||13|
|Double.parseDouble(i + "");||16|
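We can see that Integer.parseInt(i + ".0") is roughly 9x slower than Integer.parseInt(i + "") (118 vs. 13) and about 7x slower than Double.parseDouble(i + "") (118 vs. 16). The difference comes from the exception handling, since parsing a string such as "5.0" as an integer always throws a NumberFormatException.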
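The parse-then-fallback pattern and its cost can be sketched roughly as follows. This is an illustrative sketch, not the code that produced the numbers above and not the actual Utf8StorageConverter source: the class name, the helper names `parseWithFallback` and `timeMillis`, the iteration count, and the choice to return null on failure are all assumptions.

```java
public class CastFallbackSketch {
    // Hypothetical stand-in for the pattern described above:
    // try Integer.parseInt first, fall back to Double.parseDouble on failure.
    static Integer parseWithFallback(String s) {
        try {
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            try {
                // Narrowing cast truncates toward zero.
                return (int) Double.parseDouble(s);
            } catch (NumberFormatException e2) {
                return null; // assumption: unparseable input yields null
            }
        }
    }

    // Times n conversions of strings of the form i + suffix; returns elapsed milliseconds.
    static long timeMillis(int n, String suffix) {
        long sink = 0; // accumulate results so the loop is not optimized away
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            Integer v = parseWithFallback(i + suffix);
            if (v != null) {
                sink += v;
            }
        }
        long elapsed = (System.nanoTime() - start) / 1_000_000;
        if (sink == -1) {
            System.out.println(sink);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        int n = 1_000_000; // arbitrary iteration count
        System.out.println("integer-looking input: " + timeMillis(n, "") + " ms");
        System.out.println("double-looking input:  " + timeMillis(n, ".0") + " ms");
    }
}
```

With inputs ending in ".0", every call takes the exception path, which is the case a wrongly guessed integer type triggers for each record of a double field. Absolute timings will vary with the JVM and hardware.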
This issue was found when I benchmarked TPC-H Query 1, for which Pig was far slower than Hive.
After declaring three double fields as double, the performance improved significantly:
|pig without types||pig with three doubles||hive|
|76 min||34 min||16 min|
Besides recommending that users declare actual double fields as double, we can also improve the casting so that this does not happen. Maybe the easiest way is to remove the Integer.parseInt() call, use only Double.parseDouble(), and convert the result back to Integer. In the mini-benchmark above, Double.parseDouble() + range checking + Integer.valueOf(Double.intValue()) takes about 17 seconds. I think the small extra overhead (a few seconds compared to Integer.parseInt()) is acceptable, as it won't be the dominant bottleneck?
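A minimal sketch of the proposed double-only conversion is shown below. It is not a patch against Utf8StorageConverter: the class name, the helper name `toInteger`, and the decision to return null for out-of-range or unparseable input are assumptions made for illustration.

```java
public class DoubleFirstCastSketch {
    // Hypothetical helper: parse once as a double, range-check, then narrow to Integer.
    // No exception is thrown for inputs such as "5.0", so the common
    // "double mistaken for integer" case stays on the fast path.
    static Integer toInteger(String s) {
        if (s == null) {
            return null;
        }
        try {
            double d = Double.parseDouble(s);
            // Reject NaN and values that do not fit in an int.
            if (Double.isNaN(d) || d < Integer.MIN_VALUE || d > Integer.MAX_VALUE) {
                return null; // assumption: out-of-range input yields null
            }
            return Integer.valueOf((int) d); // truncates toward zero
        } catch (NumberFormatException e) {
            return null; // assumption: unparseable input yields null
        }
    }

    public static void main(String[] args) {
        System.out.println(toInteger("42"));
        System.out.println(toInteger("42.0"));
        System.out.println(toInteger("1e10"));
        System.out.println(toInteger("abc"));
    }
}
```

Two behavioral differences are worth weighing before adopting this. Double.parseDouble() accepts more input forms than Integer.parseInt(), for example "1e3" or strings with surrounding whitespace, so some values that previously took the fallback path are now parsed directly. For bytesToLong, a double cannot represent every long above 2^53 exactly, so a double-only path would lose precision for large values and would likely need a different approach.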