Spark / SPARK-12980

pyspark crash for large dataset - clone


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Invalid
    • Affects Version/s: 1.5.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None
    • Environment: Windows

    Description

      I installed Spark 1.6 on many different computers.

      On Windows, PySpark's textFile method, followed by take(1), does not work on a 13 MB file.
      If I set the number of partitions to 2000 or use a smaller file, the method works well.
      PySpark is given all of the computer's RAM via --conf spark.driver.memory=5g in local mode.

      On Mac OS, I am able to run the exact same PySpark program with 16 GB of RAM on a much bigger file of 5 GB. Memory is correctly allocated, freed, etc.

      On Ubuntu, no trouble; I can also launch a cluster: http://christopher5106.github.io/big/data/2016/01/19/computation-power-as-you-need-with-EMR-auto-termination-cluster-example-random-forest-python.html

      The error message on Windows is: java.net.SocketException: Connection reset by peer: socket write error
      Configuration: Java 8 64-bit, Python 2.7.11, Windows 7 Enterprise SP1 v2.42.01
      What could be the reason for the Windows Spark textFile method failing?
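
      For reference, the reported scenario can be sketched as a minimal PySpark script. The file name and contents here are placeholders (the report used a ~13 MB file), and the report passed --conf spark.driver.memory=5g on the command line rather than in code:

      ```python
      from pyspark import SparkContext

      # Stand-in input file; the report used a ~13 MB text file.
      with open("sample.txt", "w") as f:
          f.write("line one\nline two\n")

      # Local mode, as in the report.
      sc = SparkContext(master="local[*]", appName="textfile-repro")

      # The call that reportedly fails on Windows:
      first = sc.textFile("sample.txt").take(1)

      # The reported workaround: request a high partition count
      # (textFile's parameter is named minPartitions).
      first_partitioned = sc.textFile("sample.txt", minPartitions=2000).take(1)

      sc.stop()
      ```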


            People

              Assignee: Unassigned
              Reporter: Christopher Bourez (christopher5106)
