  1. Spark
  2. SPARK-25807

Mitigate 1-based substr() confusion

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0, 2.3.2, 2.4.0, 3.0.0
    • Fix Version/s: None
    • Component/s: Java API, PySpark
    • Labels: None

    Description

      The method Column.substr() is 1-based, consistent with SQL's and Hive's SUBSTRING but unlike Python's string slicing and Java's String.substring(), which are zero-based. Both PySpark users and Java API users often naturally expect a 0-based substr(). Adding to the confusion, substr() currently accepts a startPos value of 0, which returns the same result as startPos == 1.
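
To make the mismatch concrete, here is a minimal pure-Python sketch of the behavior described above. It is only a model of the semantics reported in this ticket, not Spark's implementation; the helper name `substr_1based` is hypothetical, and negative start positions are deliberately left out of scope.

```python
def substr_1based(s: str, start_pos: int, length: int) -> str:
    """Hypothetical model of the 1-based substr() semantics described in this ticket."""
    if start_pos < 0:
        # Negative positions are not covered by this sketch.
        raise ValueError("negative start_pos is out of scope for this model")
    # As reported in the ticket, start_pos == 0 is treated the same as start_pos == 1.
    begin = max(start_pos, 1) - 1
    return s[begin:begin + length]


word = "Spark"

# 1-based model: positions 0 and 1 both start at the first character.
print(substr_1based(word, 1, 3))  # Spa
print(substr_1based(word, 0, 3))  # Spa

# Zero-based Python slicing: index 1 is the second character.
print(word[0:3])  # Spa
print(word[1:4])  # par
```

A user who passes 0 expecting zero-based behavior gets the answer they wanted by accident, and only sees a surprising result once they pass 1 or more.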

      Since changing substr() to 0-based is probably NOT a reasonable option here, I suggest making one or more of the following changes:

      1. Adding a method substr0, which would be zero-based
      2. Renaming substr to substr1
      3. Making the existing substr() throw an exception on startPos==0, which should catch and alert most users who expect zero-based behavior.
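
As a rough illustration of options 1 and 3, the following pure-Python sketch models a strict 1-based function that rejects a start position of 0, plus a zero-based wrapper built on top of it. The names `substr1_strict` and `substr0` and the error message are assumptions for illustration only; they operate on plain strings and are not part of Spark's API.

```python
def substr1_strict(s: str, start_pos: int, length: int) -> str:
    """Hypothetical strict 1-based substring: start_pos == 0 is an error (option 3)."""
    if start_pos == 0:
        # Fail loudly so callers expecting zero-based indexing notice immediately.
        raise ValueError("substr is 1-based; start_pos must not be 0")
    if start_pos < 0:
        # Negative positions are not covered by this sketch.
        raise ValueError("negative start_pos is out of scope for this model")
    begin = start_pos - 1
    return s[begin:begin + length]


def substr0(s: str, start_pos: int, length: int) -> str:
    """Hypothetical zero-based variant (option 1): shift by one and delegate."""
    if start_pos < 0:
        raise ValueError("negative start_pos is out of scope for this model")
    return substr1_strict(s, start_pos + 1, length)


print(substr0("Spark", 0, 3))         # Spa
print(substr1_strict("Spark", 1, 3))  # Spa

try:
    substr1_strict("Spark", 0, 3)
except ValueError as err:
    print(err)  # substr is 1-based; start_pos must not be 0
```

In PySpark, a user-side equivalent of the zero-based wrapper could presumably be written as `col.substr(start + 1, length)` on a Column, without any change to Spark itself.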

      This is my first discussion on this project, apologies for any faux pas.

          People

            Assignee: Unassigned
            Reporter: Oron Navon (oron.navon)