Spark / SPARK-32098

Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.4.6, 3.0.0
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: PySpark
    • Labels:

      Description

      When a pandas DataFrame has a float index, createDataFrame with Arrow produces wrong results:

      >>> import pandas as pd
      >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
      +---+
      |  a|
      +---+
      |  1|
      |  1|
      |  2|
      +---+
      

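      This is because direct slicing treats the slice bounds as index labels, not positions, when the index contains floats. iloc always slices by position, and direct slicing on an integer index is positional as well: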

      >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
           a
      2.0  1
      3.0  2
      4.0  3
      >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
           a
      4.0  3
      >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
         a
      4  3
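
      A minimal sketch of positional batching with iloc, in the spirit of the fix named in the title. The helper name split_into_batches and the step parameter are illustrative assumptions, not names taken from the Spark code base:

```python
import pandas as pd


def split_into_batches(pdf, step):
    """Split a pandas DataFrame into consecutive chunks of at most `step` rows.

    Hypothetical helper for illustration only. iloc always slices by
    position, so the chunks do not depend on the dtype of the index.
    """
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]


# Same float-indexed frame as in the report above.
pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
batches = split_into_batches(pdf, 2)
print([batch['a'].tolist() for batch in batches])  # [[1, 2], [3]]
```

      Every row lands in exactly one chunk, whatever values the index holds.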
      

            People

            • Assignee:
              hyukjin.kwon Hyukjin Kwon
              Reporter:
              hyukjin.kwon Hyukjin Kwon
