[SPARK-25807] Mitigate 1-based substr() confusion - ASF JIRA

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Won't Fix
Affects Version/s: 1.3.0, 2.3.2, 2.4.0, 3.0.0
Fix Version/s: None
Component/s: Java API, PySpark
Labels: None

Description

The method Column.substr() is 1-based, conforming with SQL and Hive's SUBSTRING, and contradicting both Python's string slicing and Java's String.substring(), which are zero-based. Both PySpark users and Java API users often naturally expect a 0-based substr(). Adding to the confusion, substr() currently accepts a startPos value of 0, which returns the same result as startPos==1.
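
To illustrate the mismatch, here is a minimal pure-Python sketch. It does not call Spark: the helper substr_1_based is a hypothetical name, and it only models the behavior described above for non-negative arguments, so treat it as an approximation rather than Spark's implementation.

```python
def substr_1_based(s: str, start_pos: int, length: int) -> str:
    """Hypothetical model of the 1-based behavior described in this issue."""
    if start_pos < 0 or length < 0:
        raise ValueError("this sketch only models non-negative arguments")
    # Per the description, startPos == 0 gives the same result as startPos == 1.
    begin = max(start_pos, 1) - 1
    return s[begin:begin + length]


text = "Spark"
one_based = substr_1_based(text, 1, 3)   # first three characters
zero_start = substr_1_based(text, 0, 3)  # silently treated like start_pos == 1
python_slice = text[1:1 + 3]             # zero-based slicing skips the first character

print(one_based, zero_start, python_slice)
```

A user who writes a start position of 1 while expecting zero-based behavior gets a different substring than the equivalent Python slice, and a start position of 0 gives no hint that anything is off.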

Since changing substr() to 0-based is probably NOT a reasonable option here, I suggest making one or more of the following changes:

Adding a method substr0, which would be zero-based
Renaming substr to substr1
Making the existing substr() throw an exception on startPos==0, which should catch and alert most users who expect zero-based behavior.
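
A rough pure-Python sketch of the first and third options follows. The names substr0 and substr1_strict are hypothetical, nothing here exists in Spark, and the choice of ValueError is an assumption made only for illustration.

```python
def substr0(s: str, start_pos: int, length: int) -> str:
    """Hypothetical zero-based variant (first option)."""
    if start_pos < 0 or length < 0:
        raise ValueError("this sketch only models non-negative arguments")
    return s[start_pos:start_pos + length]


def substr1_strict(s: str, start_pos: int, length: int) -> str:
    """Hypothetical 1-based variant that rejects startPos == 0 (third option)."""
    if start_pos == 0:
        # Failing loudly alerts callers who assumed zero-based indexing.
        raise ValueError("substr() is 1-based; startPos == 0 is not valid")
    if start_pos < 0 or length < 0:
        raise ValueError("this sketch only models non-negative arguments")
    begin = start_pos - 1
    return s[begin:begin + length]


print(substr0("Spark", 0, 3))         # zero-based start
print(substr1_strict("Spark", 1, 3))  # one-based start, same characters
```

With the strict variant, a zero-based caller fails on the first call with startPos == 0 instead of receiving a plausible-looking result.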

This is my first discussion on this project; apologies for any faux pas.

People

Assignee: Unassigned

Reporter: Oron Navon

Votes: 0

Watchers: 2

Dates

Created: 23/Oct/18 08:21

Updated: 12/Dec/22 18:10

Resolved: 25/Oct/18 13:28