Spark / SPARK-20012

spark.read.csv schemas effectively ignore headers

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Not A Problem
    • Affects Version/s: 2.1.0
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels: None
    • Environment: pyspark

    Description

      I am new to Spark, so please direct me elsewhere if there is a better place for this kind of discussion.

      To my understanding, a schema is an ordered, named structure; however, the names do not seem to be used when reading files with headers.

      I had a quick look at the DataFrameReader code and it seems like it might not be too hard to:
      a) let the schema set the "global" order of the columns
      b) per file, map the columns by name to the schema ordering and apply the types on load.

      Put simply, the schema is an ordered dictionary, while each file with a header only defines an unordered dictionary.
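
A minimal sketch of that idea in plain Python (not Spark code, and not how DataFrameReader is implemented). The `SCHEMA` mapping, the `read_by_name` helper and the two inline files are hypothetical names made up for illustration; the point is only that each row is looked up by header name and then emitted in schema order with the schema's types.

```python
import csv
import io

# Hypothetical schema for illustration: an ordered mapping of column name -> type.
SCHEMA = {"a": str, "b": float, "c": float}


def read_by_name(text, schema):
    """Parse CSV text with a header row, returning tuples in schema order."""
    rows = []
    # DictReader keys each row by the header names, so the file's own
    # column order no longer matters.
    for record in csv.DictReader(io.StringIO(text)):
        rows.append(tuple(cast(record[name]) for name, cast in schema.items()))
    return rows


file_1 = "a,c,b\ntest,0.5,-0.9\n"  # header order differs from the schema
file_2 = "a,b,c\n0,-1.7,-1.3\n"    # header order matches the schema

# Both files come out in schema order (a, b, c) regardless of header order.
rows = read_by_name(file_1, SCHEMA) + read_by_name(file_2, SCHEMA)
```

Here the schema alone decides the output order and types, and the header only says which field is which within one file.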

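      A typical example showing what I think are the implications of this problem. The two files hold the same columns in a different header order (a, b, e, d, c versus a, b, c, d, e). pd.concat aligns them by name (d below), whereas reading both files in a single spark.read.csv call keeps the values of test.csv.gz in file order under the other header (df below), so its c and e values end up swapped: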

      In [248]: a = spark.read.csv('./data/test.csv.gz', header=True, inferSchema=True).toPandas()
      
      In [249]: b = spark.read.csv('./data/0.csv.gz', header=True, inferSchema=True).toPandas()
      
      In [250]: d = pd.concat([a, b])
      
      In [251]: df = spark.read.csv('./data/{test,0}.csv.gz', header=True, inferSchema=True).toPandas()
      
      In [252]: df[['b', 'c', 'd', 'e']] = df[['b', 'c', 'd', 'e']].astype(float)
      
      In [253]: a
      Out[253]:
            a         b         e         d         c
      0  test -0.874197  0.168660 -0.948726  0.479723
      1  test  1.124383  0.620870  0.159186  0.993676
      2  test -1.429108 -0.048814 -0.057273 -1.331702
      
      In [254]: b
      Out[254]:
         a         b         c         d         e
      0  0 -1.671828 -1.259530  0.905029  0.487244
      1  0 -0.024553 -1.750904  0.004466  1.978049
      2  0  1.686806  0.175431  0.677609 -0.851670
      
      In [255]: d
      Out[255]:
            a         b         c         d         e
      0  test -0.874197  0.479723 -0.948726  0.168660
      1  test  1.124383  0.993676  0.159186  0.620870
      2  test -1.429108 -1.331702 -0.057273 -0.048814
      0     0 -1.671828 -1.259530  0.905029  0.487244
      1     0 -0.024553 -1.750904  0.004466  1.978049
      2     0  1.686806  0.175431  0.677609 -0.851670
      
      In [256]: df
      Out[256]:
            a         b         c         d         e
      0  test -0.874197  0.168660 -0.948726  0.479723
      1  test  1.124383  0.620870  0.159186  0.993676
      2  test -1.429108 -0.048814 -0.057273 -1.331702
      3     0 -1.671828 -1.259530  0.905029  0.487244
      4     0 -0.024553 -1.750904  0.004466  1.978049
      5     0  1.686806  0.175431  0.677609 -0.851670
      

      Example also posted here: http://stackoverflow.com/questions/42637497/pyspark-2-1-0-spark-read-csv-scrambles-columns
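
A possible workaround, sketched under assumptions rather than offered as the recommended fix: read each file on its own, reorder its columns by name with `select`, and only then stack the frames with `union`, which matches columns by position. The helper name `read_csvs_by_name` and its `paths` and `columns` arguments are hypothetical; `spark` is assumed to be an existing SparkSession. Every column is read as a string here, so any casting is left to the caller.

```python
from functools import reduce


def read_csvs_by_name(spark, paths, columns):
    """Read each CSV separately, reorder columns by name, then stack the rows.

    `spark` is an existing SparkSession; `paths` and `columns` are supplied
    by the caller (hypothetical helper, not part of the Spark API).
    """
    frames = [
        # select() resolves columns by name, so every frame comes out in the
        # same order whatever that file's header order was.
        spark.read.csv(path, header=True).select(*columns)
        for path in paths
    ]
    # union() matches columns by position, which is safe only because every
    # frame has already been reordered above.
    return reduce(lambda left, right: left.union(right), frames)
```

With the files from the example, this could be called as `read_csvs_by_name(spark, ['./data/test.csv.gz', './data/0.csv.gz'], ['a', 'b', 'c', 'd', 'e'])`. Later Spark releases (2.3 and up) also provide `DataFrame.unionByName`, which does the by-name alignment directly.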

          People

            Assignee: Unassigned
            Reporter: david cottrell