Spark / SPARK-26019

pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1
    • Component/s: PySpark
    • Labels:
      None

      Description

      PySpark's accumulator server expects a secure py4j connection between Python and the JVM. Spark normally creates a secure connection itself, but a public API lets you pass in your own py4j connection (Zeppelin, at least, uses it). When that connection is not secure, you get an error like:

      pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
      

      We should change PySpark to:
      1) Warn loudly if a user passes in an insecure connection.
      1a) I'd suggest we even raise an error unless the user actively opts in with a config like "spark.python.allowInsecurePy4j=true".
      2) Allow insecure connections in the accumulator server.

      Note that SPARK-26349 will disallow insecure connections completely in 3.0.
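
Proposals 1) and 1a) could look roughly like the sketch below. This is only an illustration, not PySpark's code: the function name check_gateway_security and the plain dict standing in for the Spark configuration are hypothetical, and the config key is the one suggested in this issue, not a confirmed Spark setting.

```python
import warnings

# Config key as suggested in this issue; an assumption, not a confirmed Spark setting.
ALLOW_INSECURE_KEY = "spark.python.allowInsecurePy4j"


def check_gateway_security(auth_token, conf):
    """Return True for a secure gateway, False for an allowed insecure one.

    auth_token: the gateway's auth token, or None if the connection is insecure.
    conf: a plain dict standing in for the Spark configuration (hypothetical).
    Raises ValueError if the gateway is insecure and the user has not opted in.
    """
    if auth_token is not None:
        return True

    # Treat the setting as a string flag, defaulting to "false".
    opted_in = str(conf.get(ALLOW_INSECURE_KEY, "false")).lower() == "true"
    if not opted_in:
        raise ValueError(
            "Insecure py4j connection passed in; set %s=true to allow it"
            % ALLOW_INSECURE_KEY
        )

    # The user opted in: proceed, but warn loudly.
    warnings.warn("Running PySpark over an insecure py4j connection")
    return False
```

With this shape, an insecure connection fails fast at startup instead of surfacing later as a TypeError in the accumulator server.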
       
      More info on how this occurs:

      Exception happened during processing of request from ('127.0.0.1', 43418)
      ----------------------------------------
      Traceback (most recent call last):
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
          self.process_request(request, client_address)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
          self.finish_request(request, client_address)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
          self.RequestHandlerClass(request, client_address, self)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
          self.handle()
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
          poll(authenticate_and_accum_updates)
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
          if func():
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
          received_token = self.rfile.read(len(auth_token))
      TypeError: object of type 'NoneType' has no len()
       
      

       
      The error happens here:
      https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
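
The failing line calls len(auth_token) while auth_token is None. A minimal sketch of proposal 2), assuming the handler simply skips the token check when no token is configured, might look like this. The function name authenticate is hypothetical, and this is not necessarily how the fix was implemented in Spark.

```python
import io


def authenticate(rfile, auth_token):
    """Return True if the client should be accepted.

    rfile: a binary file-like object for the request (as in a socketserver handler).
    auth_token: the expected token, or None when the py4j connection is insecure.
    """
    if auth_token is None:
        # Insecure connection: no token was configured, so there is nothing
        # to read or compare. Calling len(None) here is what raised TypeError.
        return True

    if isinstance(auth_token, str):
        auth_token = auth_token.encode("utf-8")

    # Read exactly as many bytes as the expected token and compare.
    received_token = rfile.read(len(auth_token))
    return received_token == auth_token


# Usage: a BytesIO object stands in for the socket's read side.
print(authenticate(io.BytesIO(b"secret"), "secret"))
print(authenticate(io.BytesIO(b""), None))
```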

      The PySpark code was just running a simple pipeline of
      binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )
      and then converting it to a dataframe and running a count on it.

      The error seems to be flaky: it did not happen on the next rerun. (But accumulators don't actually work.)


              People

              • Assignee:
                irashid Imran Rashid
                Reporter:
                Tagar Ruslan Dautkhanov
              • Votes:
                0
                Watchers:
                9
