Description
pyspark and pandas behave differently when quotechar or escapechar is a single blank string (' '). We need to match the pandas behavior, even though the backend DataFrame supports this input.
Test case (test3.csv):
"column1","column2", "column3", "column4", "column5", "column6"
"AM", 7, "1", "SD", "SD", "CR"
"AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD"
"AM", 1, "2", "SD", "SD", "SD"
For quotechar
pandas:
>>> pd.read_csv('/home/spark/test3.csv', quotechar=' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at row 2
>>>
pyspark:
>>> sp.read_csv('/home/spark/test3.csv', quotechar=' ')
/home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
  "column1" "column2" "column3", "column4" "column5", "column6"
0 "AM" 7, "1" "SD", "SD" "CR"
1 "AM" 8, "1 2 3"
2 "AM" 1, "2" "SD", "SD" "SD"
For escapechar
pandas:
>>> pd.read_csv('/home/spark/test3.csv', escapechar=' ')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/util/_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 581, in _read
    return parser.read(nrows)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/readers.py", line 1254, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "/home/spark/.pyenv/versions/3.8.13/lib/python3.8/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas/_libs/parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas/_libs/parsers.pyx", line 861, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 847, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 1960, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
pyspark:
>>> sp.read_csv('/home/spark/test3.csv', escapechar=' ')
/home/spark/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: If `index_col` is not specified for `read_csv`, the default index is attached which can cause additional overhead.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
  column1 column2 "column3" "column4" "column5" "column6"
0 AM 7.0 "1" "SD" "SD" "CR"
1 AM 8.0 "1 2 3" "PR
2 AM 1.0 "2" "SD" "SD" "SD"
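To reproduce the sessions above, test3.csv first has to be recreated. The following is a minimal sketch using only the Python standard library. The helper names `write_test_csv` and `field_counts` and the temporary directory are assumptions for illustration, not part of the report. The layout (one header line and three data lines) is inferred from the pandas error, which expects 6 fields per line. The sanity check parses the file with the stdlib `csv` reader, not with pandas or pyspark.

```python
import csv
import tempfile
from pathlib import Path

# Fixture contents as given in the report; note the blanks after most commas,
# which is why a blank quotechar/escapechar changes how the file is tokenized.
TEST3_CSV = '''"column1","column2", "column3", "column4", "column5", "column6"
"AM", 7, "1", "SD", "SD", "CR"
"AM", 8, "1,2 ,3", "PR, SD,SD", "PR ; , SD,SD", "PR , ,, SD ,SD"
"AM", 1, "2", "SD", "SD", "SD"
'''


def write_test_csv(directory):
    """Write the fixture to <directory>/test3.csv and return its path.

    Hypothetical helper; the report itself uses /home/spark/test3.csv.
    """
    path = Path(directory) / "test3.csv"
    path.write_text(TEST3_CSV, encoding="utf-8")
    return path


def field_counts(path):
    """Count fields per line with the stdlib csv reader.

    skipinitialspace=True ignores the blank after each delimiter, so the
    double-quoted fields are recognized as quoted.
    """
    with open(path, newline="", encoding="utf-8") as fh:
        return [len(row) for row in csv.reader(fh, skipinitialspace=True)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = write_test_csv(tmp)
        # Sanity check that every line of the fixture carries six fields.
        print(field_counts(csv_path))
```

The returned path can then be passed to `pd.read_csv(path, quotechar=' ')` or `pd.read_csv(path, escapechar=' ')`, and to the corresponding pandas-on-Spark `read_csv` calls, to compare the two behaviors.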