Details
Description
PySpark's accumulator server expects a secure py4j connection between Python and the JVM. Spark will normally create a secure connection itself, but there is a public API that lets you pass in your own py4j connection (Zeppelin, at least, does this). When that happens, you get an error like:
pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
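For context, the public API in question is the "gateway" parameter of SparkContext. Below is a minimal sketch of how an external tool like Zeppelin might use it; the port number and the pre-existing JVM-side py4j GatewayServer are assumptions for illustration only:

from py4j.java_gateway import JavaGateway, GatewayParameters
from pyspark import SparkContext

# Assumption: a JVM process (e.g. Zeppelin's interpreter) is already
# running a py4j GatewayServer on this port, configured without an
# auth token.
gateway = JavaGateway(
    gateway_parameters=GatewayParameters(port=25333, auto_convert=True))

# gateway.gateway_parameters.auth_token is None here, which is what the
# accumulator server later trips over.
sc = SparkContext(gateway=gateway)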
We should change PySpark to:
1) warn loudly if a user passes in an insecure connection
1a) I'd suggest we even error out, unless the user actively opts in with a config like "spark.python.allowInsecurePy4j=true"
2) change the accumulator server to allow insecure connections (sketched below)
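A minimal sketch of what (1a) and (2) could look like. This is not the actual patch: the function names and the dict-style conf accessor are assumptions, and the config key is the one suggested above.

import warnings

def check_gateway_security(gateway, conf):
    # (1)/(1a): a gateway created without an auth token is "insecure";
    # refuse it unless the user explicitly opted in.
    if gateway.gateway_parameters.auth_token is None:
        if conf.get("spark.python.allowInsecurePy4j", "false") != "true":
            raise RuntimeError(
                "An insecure py4j gateway was passed to SparkContext; set "
                "spark.python.allowInsecurePy4j=true to allow it anyway.")
        warnings.warn("Using an insecure py4j gateway; the accumulator "
                      "server will skip authentication.")

def read_auth_token_safely(rfile, auth_token):
    # (2): the accumulator server should tolerate a missing token instead
    # of calling len(None) on it.
    if auth_token is None:
        return None  # insecure connection: skip the token handshake
    return rfile.read(len(auth_token))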
Note that SPARK-26349 will disallow insecure connections completely in 3.0.
More info on how this occurs:
Exception happened during processing of request from ('127.0.0.1', 43418)
----------------------------------------
Traceback (most recent call last):
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
    self.finish_request(request, client_address)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
    self.handle()
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
    poll(authenticate_and_accum_updates)
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
    if func():
  File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
    received_token = self.rfile.read(len(auth_token))
TypeError: object of type 'NoneType' has no len()
The error happens here:
https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
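That line reads the client's token back as exactly len(auth_token) bytes. Here is a runnable paraphrase of the failure, with io.BytesIO standing in for the socket's rfile:

import io

auth_token = None                # what an insecure gateway leaves behind
rfile = io.BytesIO(b"whatever")  # stand-in for the request's rfile
# The handler does exactly this read, so with auth_token = None it raises
# "TypeError: object of type 'NoneType' has no len()".
received_token = rfile.read(len(auth_token))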
The PySpark code was just running a simple pipeline of
binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
and then converting it to a DataFrame and running a count on it.
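For completeness, a hypothetical end-to-end version of that pipeline; the file path, record length, and map function are placeholders, since the original lambda was elided:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

full_file_path = "/data/records.bin"  # placeholder: any fixed-length binary file
record_length = 128                   # placeholder: record size in bytes

binary_rdd = sc.binaryRecords(full_file_path, record_length) \
               .map(lambda rec: (len(rec),))  # placeholder transform

# Converting to a DataFrame and counting drives the task-completion
# accumulator updates that hit the error above.
df = binary_rdd.toDF(["record_size"])
print(df.count())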
The error seems to be flaky; on the next rerun it didn't happen. (But accumulators don't actually work.)
Attachments
Issue Links
- duplicates: SPARK-26113 "TypeError: object of type 'NoneType' has no len() in authenticate_and_accum_updates of pyspark/accumulators.py" (Resolved)
- relates to: SPARK-26349 "Pyspark should not accept insecure py4j gateways" (Resolved)