Spark / SPARK-19843

UTF8String => (int / long) conversion expensive for invalid inputs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

  In case of invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception, and exception handling is expensive because it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that, at an aggregate level, this hurts query performance in comparison with Hive.

  The exception only indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it.
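  For illustration, the costly pattern described above looks roughly like the following. This is a minimal standalone sketch of parse-via-exception, not Spark's actual UTF8String code; the class and method names are hypothetical:

```java
// Illustrative sketch of the costly pattern: return null on invalid
// input by catching the parse exception. On bad inputs, constructing
// the NumberFormatException (string conversion plus the native
// stack-trace fill) dominates the cost, even though the exception is
// never shown to the user. Hypothetical names, not Spark's code.
public class ParseViaException {
    public static Long toLongOrNull(String s) {
        try {
            return Long.parseLong(s);      // throws on invalid input
        } catch (NumberFormatException e) {
            return null;                   // exception cost already paid
        }
    }
}
```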

  A couple of options:

  • Return Integer / Long (instead of primitive types), which could be set to null when the conversion fails. This introduces boxing, which is very bad for performance, so it is a non-starter.
  • Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps catch typical bad inputs, e.g. an empty string or "null".

      [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
      [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
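  The second option could be sketched as a cheap byte-level pre-check in the spirit of Hive's LazyUtils [1]. The names here (`NumberPrecheck`, `mightBeLong`) are hypothetical, and this is a simplified illustration under the assumption that only ASCII digits and an optional leading sign are valid, not Spark's or Hive's actual implementation:

```java
// Sketch of a cheap pre-check: reject obviously invalid byte
// sequences before attempting the full parse, so common bad inputs
// (empty string, "null", stray text) never reach the
// exception-throwing path. Illustrative only.
public class NumberPrecheck {

    /** Returns true only if the bytes could plausibly parse as a long. */
    public static boolean mightBeLong(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return false;                  // empty input can never parse
        }
        int i = 0;
        if (bytes[0] == '+' || bytes[0] == '-') {
            i = 1;
            if (bytes.length == 1) {
                return false;              // a lone sign is invalid
            }
        }
        for (; i < bytes.length; i++) {
            byte b = bytes[i];
            if (b < '0' || b > '9') {
                return false;              // non-digit: cannot parse
            }
        }
        return true;  // digits only; the full parse may still overflow
    }
}
```

  Note this is deliberately a "maybe" check: it filters the common failure cases without throwing, while inputs that pass (e.g. values that overflow a long) still fall through to the real conversion.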


          People

            Assignee: tejasp Tejas Patil
            Reporter: tejasp Tejas Patil
            Votes: 0
            Watchers: 3
