Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-1019

pyspark RDD take() throws NPE

Attach filesAttach ScreenshotVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 0.8.1, 0.9.0
    • 0.9.1, 1.0.0
    • PySpark
    • None

    Description

      I'm getting sporadic NPEs from pyspark that I can't narrow down, but I'm able to reproduce it consistently from the attached data file. If I delete any single line from the file, it works; but the file as is or larger (it's a snippet of a much larger log file) causes the problem.

      Printing the lines read works fine...but afterwards, I get the exception.

      Problem occurs with vanilla pyspark and IPython, but NOT with the scala spark shell.

      sc.textFile("testlog13").take(5) (does not seem to matter how many lines I take).

      In [32]: sc.textFile("testlog16").take(5)
      14/01/09 14:16:45 INFO MemoryStore: ensureFreeSpace(69808) called with curMem=1185875, maxMem=342526525
      14/01/09 14:16:45 INFO MemoryStore: Block broadcast_16 stored as values to memory (estimated size 68.2 KB, free 325.5 MB)
      14/01/09 14:16:45 INFO FileInputFormat: Total input paths to process : 1
      14/01/09 14:16:45 INFO SparkContext: Starting job: runJob at PythonRDD.scala:288
      14/01/09 14:16:45 INFO DAGScheduler: Got job 23 (runJob at PythonRDD.scala:288) with 1 output partitions (allowLocal=true)
      14/01/09 14:16:45 INFO DAGScheduler: Final stage: Stage 23 (runJob at PythonRDD.scala:288)
      14/01/09 14:16:45 INFO DAGScheduler: Parents of final stage: List()
      14/01/09 14:16:45 INFO DAGScheduler: Missing parents: List()
      14/01/09 14:16:45 INFO DAGScheduler: Computing the requested partition locally
      14/01/09 14:16:45 INFO HadoopRDD: Input split: file:/home/training/testlog16:0+61124
      14/01/09 14:16:45 INFO PythonRDD: Times: total = 14, boot = 1, init = 3, finish = 10
      14/01/09 14:16:45 INFO SparkContext: Job finished: runJob at PythonRDD.scala:288, took 0.021146108 s
      Out[32]:
      [u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
      u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
      u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET theme.css HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
      u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET KBDOC-00076.html HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ',
      u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET theme.css HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ']

      In [33]: Exception in thread "stdin writer for python" java.lang.NullPointerException
      at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:54)
      at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:60)
      at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:246)
      at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:275)
      at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:227)
      at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:195)
      at java.io.DataInputStream.read(DataInputStream.java:100)
      at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
      at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
      at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
      at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:167)
      at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:150)
      at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
      at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
      at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
      at scala.collection.Iterator$class.foreach(Iterator.scala:772)
      at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
      at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:98)

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            joshrosen Josh Rosen
            dcarroll@cloudera.com Diana Carroll
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Slack

                Issue deployment