[SPARK-13455] Periods in dataframe column names breaks df.drop(<string>) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Duplicate
Affects Version/s: 1.6.0
Fix Version/s: None
Component/s: PySpark, SQL
Labels: None
Environment: Spark 1.6.0 installed via Homebrew

Description

Calling the .drop method with a string on a DataFrame that contains a column name with a period in it raises an AnalysisException, even when the column being dropped is not the one with the period. This doesn't happen when dropping with the Column object itself.

>>> import json
>>> ds = {'a': "test", "b.no": "testagain"}
>>> df = sqlContext.jsonRDD(sc.parallelize([json.dumps(ds)]))
>>> df.drop('a')

yields

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/dataframe.py", line 1347, in drop
    jdf = self._jdf.drop(col)
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py", line 813, in __call__
  File "/usr/local/Cellar/apache-spark/1.6.0/libexec/python/pyspark/sql/utils.py", line 51, in deco
    raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: u"cannot resolve 'b.no' given input columns a, b.no;"

whereas this works,

>>> df.drop(df.a)
DataFrame[b.no: string]

The current workaround, if you want to drop a column using a string, is:

>>> df.drop(df.select("a")[0])
DataFrame[b.no: string]
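
Another string-based approach, sketched here under assumptions, is to select the columns to keep instead of dropping the unwanted one, quoting each name in backticks so that a period is read as part of the name rather than as nested-field access. The helper names `quote_column` and `columns_after_drop` are hypothetical, the sketch assumes column names contain no backticks, and it has not been checked against 1.6.0 specifically.

```python
def quote_column(name):
    """Wrap a column name in backticks so a period stays part of the name.

    Assumption: the name itself contains no backtick characters.
    """
    return "`" + name + "`"


def columns_after_drop(columns, to_drop):
    """Return backtick-quoted names of every column not listed in to_drop."""
    to_drop = set(to_drop)
    return [quote_column(c) for c in columns if c not in to_drop]


# Plain-Python check using the column names from the report above.
print(columns_after_drop(["a", "b.no"], ["a"]))  # ['`b.no`']
```

With a live DataFrame the result would be passed to select, for example df.select(*columns_after_drop(df.columns, ['a'])).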

Issue Links

duplicates: SPARK-12988 Can't drop columns that contain dots (Resolved)

People

Assignee: Unassigned

Reporter: Jason Piper

Dates

Created: 23/Feb/16 14:42

Updated: 23/Feb/16 15:08

Resolved: 23/Feb/16 15:01