Spark / SPARK-14103

Python DataFrame CSV load on large file is writing to console in IPython

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: PySpark
    • Environment:

      Ubuntu, Python 2.7.11, Anaconda 2.5.0, Spark from Master branch

      Description

      I am using Spark built from the master branch. When I run the following command on a large tab-separated file, the contents of the file are written to stderr:

      df = sqlContext.read.load("temp.txt", format="csv", header="false", inferSchema="true", delimiter="\t")
      

      Here is a sample of the output:

      ^M[Stage 1:>                                                          (0 + 2) / 2]16/03/23 14:01:02 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
      com.univocity.parsers.common.TextParsingException: Error processing input: Length of parsed input (1000001) exceeds the maximum number of characters defined in your parser settings (1000000). Identified line separator characters in the parsed content. This may be the cause of the error. The line separator in your parser settings is set to '\n'. Parsed content:
              Privacy-shake",: a haptic interface for managing privacy settings in mobile location sharing applications       privacy shake a haptic interface for managing privacy settings in mobile location sharing applications  2010    2010/09/07              international conference on human computer interaction  interact                43331058        19371[\n]        3D4F6CA1        Between the Profiles: Another such Bias. Technology Acceptance Studies on Social Network Services       between the profiles another such bias technology acceptance studies on social network services 2015    2015/08/02      10.1007/978-3-319-21383-5_12    international conference on human-computer interaction  interact                43331058        19502[\n]
      
      .......
      
      .........
      
      web snippets    2008    2008/05/04      10.1007/978-3-642-01344-7_13    international conference on web information systems and technologies    webist          44F29802        19489
      06FA3FFA        Interactive 3D User Interfaces for Neuroanatomy Exploration     interactive 3d user interfaces for neuroanatomy exploration     2009                    internationa]
              at com.univocity.parsers.common.AbstractParser.handleException(AbstractParser.java:241)
              at com.univocity.parsers.common.AbstractParser.parseNext(AbstractParser.java:356)
              at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:137)
              at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.next(CSVParser.scala:120)
              at scala.collection.Iterator$class.foreach(Iterator.scala:742)
              at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foreach(CSVParser.scala:120)
              at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:155)
              at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.foldLeft(CSVParser.scala:120)
              at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:212)
              at org.apache.spark.sql.execution.datasources.csv.BulkCsvReader.aggregate(CSVParser.scala:120)
              at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
              at org.apache.spark.rdd.RDD$$anonfun$aggregate$1$$anonfun$22.apply(RDD.scala:1058)
              at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
              at org.apache.spark.SparkContext$$anonfun$35.apply(SparkContext.scala:1827)
              at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:69)
              at org.apache.spark.scheduler.Task.run(Task.scala:82)
              at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:231)
              at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
              at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
              at java.lang.Thread.run(Thread.java:745)
      Caused by: java.lang.ArrayIndexOutOfBoundsException
      16/03/23 14:01:03 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
      ^M[Stage 1:>                                                          (0 + 1) / 2]
      
      
      

      For a small sample of the data (<10,000 lines), I do not get any error. But as soon as I go above 100,000 lines, I start getting the error.
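The error message above notes that line separators were found inside a single parsed value. One way that can happen, assumed here rather than confirmed for this file, is a field that opens with a quote character that is never closed, so the parser keeps reading across rows. That would also fit the observation that a small sample loads cleanly if the offending row is not in it. The sketch below uses only Python's standard `csv` module (not Spark or univocity, whose behavior may differ) on a hypothetical three-row sample; `parse_rows` and `lines_with_odd_quotes` are illustrative names.

```python
import csv
import io

# Hypothetical tab-separated sample. The second row opens a quoted field
# that is never closed (an assumption about the kind of data involved).
SAMPLE = (
    "0001\tfirst title\t2010\n"
    "0002\t\"unterminated title\t2011\n"
    "0003\tthird title\t2012\n"
)


def parse_rows(text, quotechar='"'):
    """Parse tab-separated text; quotechar=None disables quote handling."""
    buffer = io.StringIO(text)
    if quotechar is None:
        reader = csv.reader(buffer, delimiter="\t", quoting=csv.QUOTE_NONE)
    else:
        reader = csv.reader(buffer, delimiter="\t", quotechar=quotechar)
    return list(reader)


def lines_with_odd_quotes(text, quotechar='"'):
    """Return 1-based numbers of lines holding an odd count of quotechar."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.count(quotechar) % 2 == 1
    ]


# With quote handling on, the open quote swallows the rest of the input
# into one field, newlines included.
quoted = parse_rows(SAMPLE)

# With quote handling off, every physical line is one row again.
unquoted = parse_rows(SAMPLE, quotechar=None)

# A cheap scan that points at rows worth inspecting by hand.
suspects = lines_with_odd_quotes(SAMPLE)
```

An odd quote count is only a heuristic: a legitimately quoted multi-line value would also trip it. Spark's CSV reader exposes a `quote` option as well, but whether changing it helps with this particular file is not established in the report.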

      I don't think Spark should ever write the actual data to stderr, as it makes the console output much harder to read.
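To make the request concrete, here is a minimal sketch of an error message that reports the sizes involved but only a bounded preview of the parsed content. It is a generic illustration in plain Python, not the change that was made in Spark; `abbreviate`, `parsing_error_message`, and the 80-character limit are hypothetical.

```python
def abbreviate(content, limit=80, marker="..."):
    """Shorten content to at most `limit` characters for use in a log line."""
    if limit < len(marker):
        raise ValueError("limit must be at least len(marker)")
    if len(content) <= limit:
        return content
    return content[: limit - len(marker)] + marker


def parsing_error_message(parsed_length, max_length, content, limit=80):
    """Build an error line with the sizes plus a bounded content preview."""
    # Make embedded newlines visible so the message stays on one line.
    preview = abbreviate(content.replace("\n", "[\\n]"), limit)
    return (
        "Length of parsed input (%d) exceeds the maximum number of characters "
        "(%d). Parsed content (first %d characters at most): %s"
        % (parsed_length, max_length, limit, preview)
    )


# A value one character over the limit yields a short message instead of
# a megabyte of data on stderr.
message = parsing_error_message(1000001, 1000000, "x" * 1000001)
```

The preview still shows enough to locate the offending row, while the full value stays out of the console.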

            People

            • Assignee:
              hyukjin.kwon Hyukjin Kwon
              Reporter:
              shubhanshumishra@gmail.com Shubhanshu Mishra