Spark / SPARK-26019

pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.2, 2.4.0
    • Fix Version/s: 2.3.3, 2.4.1
    • Component/s: PySpark
    • Labels: None

    Description

      PySpark's accumulator server expects a secure py4j connection between Python and the JVM. Spark will normally create a secure connection, but there is a public API which allows you to pass in your own py4j connection. (This is used by Zeppelin, at least.) When this happens, you get an error like:

      pyspark/accumulators.py: "TypeError: object of type 'NoneType' has no len()" in authenticate_and_accum_updates()
      

      We should change PySpark to:
      1) Warn loudly if a user passes in an insecure connection.
      1a) I'd like to suggest that we even error out, unless the user actively opts in with a config like "spark.python.allowInsecurePy4j=true".
      2) Change the accumulator server to allow insecure connections.
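      The guarded handshake proposed above could be sketched as follows. This is illustrative only, not the actual Spark patch: the function name `authenticate`, the `allow_insecure` flag, and the bytes-based token are assumptions; the real code lives in `authenticate_and_accum_updates()` in pyspark/accumulators.py.

      ```python
      def authenticate(rfile, auth_token, allow_insecure=False):
          """Read the client's token from a file-like socket reader and check it.

          auth_token is None when the JVM-side py4j gateway was created without
          authentication (e.g. a user-supplied gateway, as Zeppelin does).
          """
          if auth_token is None:
              if allow_insecure:
                  # Opt-in escape hatch, analogous to the proposed
                  # spark.python.allowInsecurePy4j=true config.
                  return True
              # Fail loudly instead of crashing later on len(None).
              raise RuntimeError(
                  "Insecure py4j gateway: the accumulator server requires an "
                  "auth token. Opt in explicitly to allow insecure connections.")
          # The original failing line: len(auth_token) raises TypeError
          # when auth_token is None, which the guard above now prevents.
          received_token = rfile.read(len(auth_token))
          return received_token == auth_token
      ```

      With the guard in place, an unauthenticated gateway either errors out immediately with an actionable message or, if the user opted in, skips the token check instead of silently breaking accumulators.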

      Note that SPARK-26349 will disallow insecure connections completely in 3.0.
       
      More info on how this occurs:

      Exception happened during processing of request from ('127.0.0.1', 43418)
      ----------------------------------------
      Traceback (most recent call last):
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 290, in _handle_request_noblock
          self.process_request(request, client_address)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 318, in process_request
          self.finish_request(request, client_address)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 331, in finish_request
          self.RequestHandlerClass(request, client_address, self)
        File "/opt/cloudera/parcels/Anaconda/lib/python2.7/SocketServer.py", line 652, in __init__
          self.handle()
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 263, in handle
          poll(authenticate_and_accum_updates)
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 238, in poll
          if func():
        File "/opt/cloudera/parcels/SPARK2-2.3.0.cloudera4-1.cdh5.13.3.p0.611179/lib/spark2/python/lib/pyspark.zip/pyspark/accumulators.py", line 251, in authenticate_and_accum_updates
          received_token = self.rfile.read(len(auth_token))
      TypeError: object of type 'NoneType' has no len()
       
      

       
      Error happens here:
      https://github.com/apache/spark/blob/cb90617f894fd51a092710271823ec7d1cd3a668/python/pyspark/accumulators.py#L254
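      The root cause is mechanical: with a user-supplied, unauthenticated gateway, the server's auth token is None, and Python's len() does not accept None. A minimal reproduction of the failing expression (standalone, no Spark required):

      ```python
      # Simulates the state inside authenticate_and_accum_updates() when the
      # py4j gateway was created without authentication.
      auth_token = None

      try:
          # Same shape as the failing line: self.rfile.read(len(auth_token))
          length = len(auth_token)
      except TypeError as exc:
          print(exc)  # object of type 'NoneType' has no len()
      ```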

      The PySpark code was just running a simple pipeline:

          binary_rdd = sc.binaryRecords(full_file_path, record_length).map(lambda .. )

      and then converting it to a DataFrame and running a count on it.

      The error seems flaky: on the next rerun it didn't happen. (But accumulators still don't actually work.)

          People

            Assignee: irashid Imran Rashid
            Reporter: Tagar Ruslan Dautkhanov
            Votes: 0
            Watchers: 9

            Dates

              Created:
              Updated:
              Resolved:
