  1. Spark
  2. SPARK-9213

Improve regular expression performance (via joni)

Details

    • Type: Umbrella
    • Status: Resolved
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: SQL

    Description

      I'm creating an umbrella ticket to improve regular expression performance for string expressions. Right now our use of regular expressions is inefficient for two reasons:

      1. Java's built-in regex engine (java.util.regex) is slow in general.
      2. We have to convert everything from UTF-8 encoded bytes into a Java String, run the regex on it, and then convert the result back to UTF-8 bytes.
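
      To make the second point concrete, here is a minimal sketch of the decode, match, re-encode round trip using only the JDK. The class RegexRoundTrip and the helper extractFirst are hypothetical names chosen for illustration; this is not Spark's actual code path, only an approximation of the pattern described above.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not Spark code: shows the conversions that
// surround a java.util.regex call when the data is stored as UTF-8 bytes.
class RegexRoundTrip {

    // Returns the first match of `pattern` in `utf8Input` as UTF-8 bytes,
    // or null when the pattern does not match.
    static byte[] extractFirst(byte[] utf8Input, Pattern pattern) {
        // Conversion 1: decode the UTF-8 bytes into a Java String
        // (allocates a new String and copies the data).
        String decoded = new String(utf8Input, StandardCharsets.UTF_8);

        // The regex itself runs on the decoded String, not on the bytes.
        Matcher matcher = pattern.matcher(decoded);
        if (!matcher.find()) {
            return null;
        }

        // Conversion 2: encode the matched substring back into UTF-8 bytes
        // (a second allocation and copy).
        return matcher.group().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        Pattern digits = Pattern.compile("\\d+");
        byte[] input = "order-12345-é".getBytes(StandardCharsets.UTF_8);
        byte[] match = extractFirst(input, digits);
        System.out.println(new String(match, StandardCharsets.UTF_8)); // prints "12345"
    }
}
```

      A regex engine that matches directly on the UTF-8 bytes would avoid both conversions, which is the motivation for the libraries mentioned next.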

      There are Java libraries that provide regex support directly on UTF-8 encoded bytes, which would avoid these conversions. One prominent example is joni, the regex engine used in JRuby.

      Note: all regex functions are in https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringOperations.scala

            People

              Assignee: Unassigned
              Reporter: Reynold Xin (rxin)
              Votes: 0
              Watchers: 11
