SPARK-13455

Periods in DataFrame column names break df.drop(<string>)

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 1.6.0
    • Fix Version/s: None
    • Component/s: PySpark, SQL
    • Labels: None
    • Environment: Spark 1.6.0 installed via homebrew

    Description

      Calling the .drop method with a string on a DataFrame that contains a column whose name includes a period raises an AnalysisException, even when the column being dropped is a different one (below, dropping 'a' fails on 'b.no'). This doesn't happen when dropping using the column object itself.

      >>> import json
      >>> ds = {'a': "test", "b.no": "testagain"}
      >>> df = sqlContext.jsonRDD(sc.parallelize([json.dumps(ds)]))
      >>> df.drop('a')
      

      yields

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/dataframe.py", line 1347, in drop
          jdf = self._jdf.drop(col)
        File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
        File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.py", line 51, in deco
          raise AnalysisException(s.split(': ', 1)[1], stackTrace)
      pyspark.sql.utils.AnalysisException: u"cannot resolve 'b.no' given input columns a, b.no;"
      

      whereas this works,

      >>> df.drop(df.a)
      DataFrame[b.no: string]
      

      A current workaround, if you want to drop a column using a string, is to use

      >>> df.drop(df.select("a")[0])
      DataFrame[b.no: string]
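
Another possible approach, sketched here under assumptions rather than tested against 1.6.0, is to skip .drop entirely and select every remaining column with its name wrapped in backticks, which Spark SQL treats as a literal column name instead of a nested field reference. The helper name quoted_columns_except is hypothetical, and it assumes no column name itself contains a backtick.

```python
def quoted_columns_except(columns, to_drop):
    """Return backtick-quoted names of every column except `to_drop`.

    Hypothetical helper: assumes column names contain no backticks.
    """
    return ["`{}`".format(name) for name in columns if name != to_drop]


# Intended use with the DataFrame from the report (needs a running Spark
# session, so it is left commented out here):
# df.select(*quoted_columns_except(df.columns, "a"))
```

This keeps the call site string-based, at the cost of rebuilding the full column list on every drop.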
      

            People

              Assignee: Unassigned
              Reporter: Jason Piper (jpiper)