Spark / SPARK-39654

Parameters quotechar and escapechar need to be limited to a single character in the pyspark pandas read_csv function


Details

    • Type: Bug
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.3.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels: None
    • Environment: pyspark pandas: master

      OS: Ubuntu 1804

      Python version: 3.8.14

      pandas version: 1.4.2

    Description

pyspark pandas behaves differently from pandas when quotechar or escapechar is set to a single blank (space) string. pyspark pandas should keep the same behavior as pandas, even though the backend Spark DataFrame reader accepts this input.

       

      test case(test3.csv):

      "column1","column2", "column3", "column4", "column5", "column6"
      "AM", 7, "1", "SD", "SD", "CR"
      "AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD"
      "AM", 1, "2", "SD", "SD", "SD"

       

      For quotechar

      pandas:

      >>> pd.read_csv('/home/spark/test3.csv', quotechar=' ')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
          return func(*args, **kwargs)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
          return _read(filepath_or_buffer, kwds)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
          return parser.read(nrows)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
          index, columns, col_dict = self._engine.read(nrows)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
          chunks = self._reader.read_low_memory(nrows)
        File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
        File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
        File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
        File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
      pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 2
      >>> 
       

      pyspark:

      >>> sp.read_csv('/home/spark/test3.csv', quotechar=' ')
      /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
        warnings.warn(message, PandasAPIOnSparkAdviceWarning)
        "column1" "column2"  "column3", "column4"  "column5", "column6"
      0      "AM"    7, "1"            "SD", "SD"                  "CR"
      1      "AM"     8, "1                    2                     3"
      2      "AM"    1, "2"            "SD", "SD"                  "SD"
       

       

      For escapechar

      pandas:

      >>> pd.read_csv('/home/spark/test3.csv', escapechar=' ')
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
          return func(*args, **kwargs)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
          return _read(filepath_or_buffer, kwds)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
          return parser.read(nrows)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
          index, columns, col_dict = self._engine.read(nrows)
        File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
          chunks = self._reader.read_low_memory(nrows)
        File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
        File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
        File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
        File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
      pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11

      pyspark:

      >>> sp.read_csv('/home/spark/test3.csv', escapechar=' ')
      /home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
        warnings.warn(message, PandasAPIOnSparkAdviceWarning)
        column1  column2  "column3"  "column4"  "column5"  "column6"
      0      AM      7.0        "1"       "SD"       "SD"       "CR"
      1      AM      8.0         "1         2          3"        "PR
      2      AM      1.0        "2"       "SD"       "SD"       "SD"
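
      The ticket title asks for these parameters to be restricted to a single character before they are handed to the backend reader. A minimal sketch of such a check (the helper name and wording are hypothetical, not the actual pyspark.pandas implementation):

      ```python
      def _check_char_param(name, value):
          """Reject quotechar/escapechar values that are not a single
          one-character string. Hypothetical validation helper, not the
          actual pyspark.pandas code."""
          if value is not None and (not isinstance(value, str) or len(value) != 1):
              raise ValueError(f"{name} must be a 1-character string, got {value!r}")
          return value

      # A single character passes through unchanged.
      _check_char_param("quotechar", '"')

      # Multi-character or empty strings are rejected up front,
      # instead of producing a silently mis-parsed DataFrame.
      try:
          _check_char_param("escapechar", "ab")
      except ValueError as e:
          print(e)
      ```

      Note that a blank space ' ' is still a 1-character string, so matching pandas for that case would additionally require mirroring the pandas parser's runtime error rather than rejecting the value at validation time.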
       

      People

        Assignee: Unassigned
        Reporter: bzhaoop (bo zhao)