Description
I'm getting sporadic NPEs from PySpark that I can't narrow down, but I can reproduce them consistently with the attached data file. If I delete any single line from the file, it works; the file as-is, or a larger one (it's a snippet of a much larger log file), triggers the problem.
Printing the lines that were read works fine, but afterwards I get the exception.
The problem occurs with vanilla PySpark and with IPython, but NOT with the Scala Spark shell.
The triggering call is sc.textFile("testlog13").take(5); it does not seem to matter how many lines I take. A full IPython session follows:
In [32]: sc.textFile("testlog16").take(5)
14/01/09 14:16:45 INFO MemoryStore: ensureFreeSpace(69808) called with curMem=1185875, maxMem=342526525
14/01/09 14:16:45 INFO MemoryStore: Block broadcast_16 stored as values to memory (estimated size 68.2 KB, free 325.5 MB)
14/01/09 14:16:45 INFO FileInputFormat: Total input paths to process : 1
14/01/09 14:16:45 INFO SparkContext: Starting job: runJob at PythonRDD.scala:288
14/01/09 14:16:45 INFO DAGScheduler: Got job 23 (runJob at PythonRDD.scala:288) with 1 output partitions (allowLocal=true)
14/01/09 14:16:45 INFO DAGScheduler: Final stage: Stage 23 (runJob at PythonRDD.scala:288)
14/01/09 14:16:45 INFO DAGScheduler: Parents of final stage: List()
14/01/09 14:16:45 INFO DAGScheduler: Missing parents: List()
14/01/09 14:16:45 INFO DAGScheduler: Computing the requested partition locally
14/01/09 14:16:45 INFO HadoopRDD: Input split: file:/home/training/testlog16:0+61124
14/01/09 14:16:45 INFO PythonRDD: Times: total = 14, boot = 1, init = 3, finish = 10
14/01/09 14:16:45 INFO SparkContext: Job finished: runJob at PythonRDD.scala:288, took 0.021146108 s
Out[32]:
[u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET theme.css HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET KBDOC-00076.html HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET theme.css HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ']
In [33]: Exception in thread "stdin writer for python" java.lang.NullPointerException
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:54)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:60)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:246)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:275)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:227)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:195)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:167)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:150)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:98)
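For reference, here is a minimal standalone sketch of the reproduction, not taken from the session above: it assumes the attached data file is saved locally as testlog16 (the filename from the session) and uses a plain local SparkContext.

from pyspark import SparkContext

sc = SparkContext("local", "npe-repro")

# Path is an assumption; any local copy of the attached file should do.
lines = sc.textFile("testlog16")

# take() returns and prints the lines successfully...
print(lines.take(5))

# ...then the "stdin writer for python" thread throws the NPE shortly
# afterwards, from BufferedFSInputStream.getPos as in the trace above.

One plausible reading of the trace is that the "stdin writer for python" thread is still iterating over the HadoopRDD input after take() has returned, consistent with the underlying stream having been closed out from under it; the linked SPARK-1579 concerns distinguishing such expected exceptions in the worker.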
Issue Links
- relates to SPARK-1579: PySpark should distinguish expected IOExceptions from unexpected ones in the worker (Resolved)