Spark / SPARK-19843

UTF8String => (int / long) conversion expensive for invalid inputs


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: SQL
    • Labels: None

    Description

  In case of invalid inputs, converting a UTF8String to int or long returns null. This comes at a cost: the conversion method (e.g. [0]) throws an exception, and exception handling is expensive because it converts the UTF8String into a Java string and populates the stack trace (a native call). While migrating workloads from Hive to Spark, I see that, at an aggregate level, this hurts query performance in comparison with Hive.

  The exception only indicates that the conversion failed; it is not propagated to users, so it would be good to avoid it.
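  For illustration, the costly pattern described above looks roughly like the following. This is a minimal standalone sketch of parse-via-exception, not Spark's actual UTF8String code; the class and method names are hypothetical:

```java
// Illustrative sketch of the costly pattern: return null on invalid
// input by catching the parse exception. On bad inputs, constructing
// the NumberFormatException (string conversion plus the native
// stack-trace fill) dominates the cost, even though the exception is
// never shown to the user. Hypothetical names, not Spark's code.
public class ParseViaException {
    public static Long toLongOrNull(String s) {
        try {
            return Long.parseLong(s);      // throws on invalid input
        } catch (NumberFormatException e) {
            return null;                   // exception cost already paid
        }
    }
}
```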

  A couple of options:

  • Return Integer / Long (instead of primitive types), which could be set to null when the conversion fails. This introduces boxing, which is very bad for performance, so it is a non-starter.
  • Hive has a pre-check [1] for this. It is not a perfect safety net, but it helps catch typical bad inputs, e.g. an empty string or "null".

      [0] : https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
      [1] : https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90
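  The second option could be sketched as a cheap byte-level pre-check in the spirit of Hive's LazyUtils [1]. The names here (`NumberPrecheck`, `mightBeLong`) are hypothetical, and this is a simplified illustration under the assumption that only ASCII digits and an optional leading sign are valid, not Spark's or Hive's actual implementation:

```java
// Sketch of a cheap pre-check: reject obviously invalid byte
// sequences before attempting the full parse, so common bad inputs
// (empty string, "null", stray text) never reach the
// exception-throwing path. Illustrative only.
public class NumberPrecheck {

    /** Returns true only if the bytes could plausibly parse as a long. */
    public static boolean mightBeLong(byte[] bytes) {
        if (bytes == null || bytes.length == 0) {
            return false;                  // empty input can never parse
        }
        int i = 0;
        if (bytes[0] == '+' || bytes[0] == '-') {
            i = 1;
            if (bytes.length == 1) {
                return false;              // a lone sign is invalid
            }
        }
        for (; i < bytes.length; i++) {
            byte b = bytes[i];
            if (b < '0' || b > '9') {
                return false;              // non-digit: cannot parse
            }
        }
        return true;  // digits only; the full parse may still overflow
    }
}
```

  Note this is deliberately a "maybe" check: it filters the common failure cases without throwing, while inputs that pass (e.g. values that overflow a long) still fall through to the real conversion.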


          People

            Assignee: tejasp Tejas Patil
            Reporter: tejasp Tejas Patil
            Votes: 0
            Watchers: 3
