Spark / SPARK-29748

Remove sorting of fields in PySpark SQL Row creation



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.0
    • Component/s: PySpark, SQL


      Currently, when a PySpark Row is created with keyword arguments, the fields are sorted alphabetically. This has caused a lot of confusion among users because, although it is stated in the pydocs, the sorting is not obvious. An error can then occur later when a schema is applied and the field order does not match.
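
      The mismatch described above can be sketched in plain Python. This is only an illustration: make_sorted_row is a hypothetical stand-in that mimics the alphabetical sorting, not PySpark's Row, and the positional pairing with schema_order is an assumed simplification of how a schema gets applied.

```python
def make_sorted_row(**kwargs):
    """Hypothetical stand-in for the current behavior: sort fields by name."""
    names = sorted(kwargs)
    return names, tuple(kwargs[n] for n in names)


# Field order the user wrote, and expects the schema to follow.
schema_order = ["name", "age"]

names, values = make_sorted_row(name="Alice", age=11)
# The fields come back sorted, not in the order they were entered.
print(names, values)

# Pairing the schema with the values by position now mixes up the columns.
paired = dict(zip(schema_order, values))
print(paired)
```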

      Fields were originally sorted because kwargs in Python < 3.6 are not guaranteed to keep the order in which they were entered; sorting alphabetically ensured a consistent order. Matters are further complicated by the flag __from_dict__, which allows the Row fields to be referenced by name when the Row is made from kwargs. This flag is not serialized with the Row, which leads to inconsistent behavior.
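
      A rough sketch of how a construction-time flag can get lost in a serialization round trip. SketchRow, its from_dict attribute, and _rebuild are hypothetical names for illustration and are not PySpark's implementation; the sketch only assumes that serialization carries the field names and values and nothing else.

```python
class SketchRow(tuple):
    """Hypothetical stand-in for a Row built from kwargs."""

    def __new__(cls, **kwargs):
        names = sorted(kwargs)  # mimic the alphabetical sorting
        row = super().__new__(cls, (kwargs[n] for n in names))
        row.fields = names
        row.from_dict = True  # flag set only on the kwargs construction path
        return row

    def __reduce__(self):
        # Only field names and values are recorded; the flag is left out.
        return (_rebuild, (self.fields, tuple(self)))


def _rebuild(fields, values):
    """Recreate a row from its serialized parts (no flag available here)."""
    row = tuple.__new__(SketchRow, values)
    row.fields = fields
    return row


row = SketchRow(name="Alice", age=11)
func, args = row.__reduce__()  # what a pickler would record
restored = func(*args)         # what unpickling would call

print(hasattr(row, "from_dict"), hasattr(restored, "from_dict"))
```

      The two objects compare equal as tuples, yet only the original carries the flag, so any code path that branches on it would treat them differently.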

      This JIRA proposes that all sorting of the fields be removed. Users with Python 3.6+ who create Rows with kwargs can continue to do so, since Python ensures the order is the same as entered. Users with Python < 3.6 will have to create Rows with an OrderedDict or by using the Row class as a factory (explained in the pydoc). If kwargs are used on Python < 3.6, an error will be raised or, based on a conf setting, Row creation can fall back to a LegacyRow that sorts the fields as before. This LegacyRow will be deprecated immediately and removed once support for Python < 3.6 is dropped.
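
      The proposed rules could be sketched roughly as follows. The helper make_fields and its parameters python_version and legacy_fallback are hypothetical names chosen for this illustration; the actual conf setting and the LegacyRow class are not modeled here, only the ordering decision.

```python
import sys
from collections import OrderedDict


def make_fields(mapping, python_version=sys.version_info[:2], legacy_fallback=False):
    """Return (names, values) following the proposal's ordering rules."""
    if python_version >= (3, 6) or isinstance(mapping, OrderedDict):
        # Entry order is trustworthy: keep it, no sorting.
        names = list(mapping)
    elif legacy_fallback:
        # Assumed conf-driven fallback: sort the fields as before.
        names = sorted(mapping)
    else:
        raise ValueError(
            "field order is not guaranteed on Python < 3.6; "
            "use an OrderedDict or enable the legacy fallback"
        )
    return names, tuple(mapping[n] for n in names)


# Python 3.6+: the order is kept as entered.
new_names, _ = make_fields({"name": "Alice", "age": 11}, python_version=(3, 8))

# Simulated Python 3.5 with an OrderedDict: the order is kept as well.
od_names, _ = make_fields(
    OrderedDict([("name", "Alice"), ("age", 11)]), python_version=(3, 5)
)

# Simulated Python 3.5 with the fallback enabled: the fields are sorted.
legacy_names, _ = make_fields(
    {"name": "Alice", "age": 11}, python_version=(3, 5), legacy_fallback=True
)
print(new_names, od_names, legacy_names)
```

      Passing python_version explicitly is only a device for exercising each branch on a modern interpreter.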


              bryanc Bryan Cutler