Spark / SPARK-32098

Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.4.6, 3.0.0
    • Fix Version/s: 2.4.7, 3.0.1, 3.1.0
    • Component/s: PySpark

    Description

      When a pandas DataFrame uses floats as its index, createDataFrame with Arrow produces wrong results:

      >>> import pandas as pd
      >>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
      +---+
      |  a|
      +---+
      |  1|
      |  1|
      |  2|
      +---+
      

      This is because direct slicing treats the slice bounds as index values (labels) rather than positions when the index contains floats, whereas iloc always slices by position:

      >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
           a
      2.0  1
      3.0  2
      4.0  3
      >>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
           a
      4.0  3
      >>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
         a
      4  3
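
The difference can be sketched with a small helper that splits a pandas DataFrame into chunks by row position. This is only an illustration: split_by_position and num_slices are hypothetical names, and it assumes the Arrow path cuts the DataFrame into roughly equal positional chunks, which is not Spark's actual code. Using iloc keeps the chunks correct for any index dtype, while loc is shown for contrast as an explicitly label-based slice.

```python
import pandas as pd


def split_by_position(pdf, num_slices):
    """Split pdf into at most num_slices chunks by row position (hypothetical helper)."""
    if len(pdf) == 0:
        return [pdf]
    # Ceiling division so that every row lands in some chunk.
    step = -(-len(pdf) // num_slices)
    # iloc always slices by position, whatever the index dtype is.
    return [pdf.iloc[start:start + step] for start in range(0, len(pdf), step)]


pdf = pd.DataFrame({'a': [1, 2, 3]}, index=[2., 3., 4.])
chunks = split_by_position(pdf, num_slices=3)
values = [chunk['a'].tolist() for chunk in chunks]
print(values)  # [[1], [2], [3]]

# loc is explicitly label-based: every row whose label is >= 2 is returned.
print(pdf.loc[2:]['a'].tolist())  # [1, 2, 3]
```

Each row appears exactly once across the chunks, which is the property the float-indexed example in the description loses.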
      

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)