Description
I'm getting sporadic NPEs from PySpark that I can't narrow down, but I can reproduce them consistently with the attached data file. If I delete any single line from the file, it works; the file as-is, or a larger one (it's a snippet of a much larger log file), triggers the problem.
Printing the lines that were read works fine, but afterwards I get the exception.
The problem occurs with vanilla PySpark and with IPython, but NOT with the Scala Spark shell.
The triggering call is sc.textFile("testlog13").take(5); it does not seem to matter how many lines I take. A full IPython session follows:
In [32]: sc.textFile("testlog16").take(5)
14/01/09 14:16:45 INFO MemoryStore: ensureFreeSpace(69808) called with curMem=1185875, maxMem=342526525
14/01/09 14:16:45 INFO MemoryStore: Block broadcast_16 stored as values to memory (estimated size 68.2 KB, free 325.5 MB)
14/01/09 14:16:45 INFO FileInputFormat: Total input paths to process : 1
14/01/09 14:16:45 INFO SparkContext: Starting job: runJob at PythonRDD.scala:288
14/01/09 14:16:45 INFO DAGScheduler: Got job 23 (runJob at PythonRDD.scala:288) with 1 output partitions (allowLocal=true)
14/01/09 14:16:45 INFO DAGScheduler: Final stage: Stage 23 (runJob at PythonRDD.scala:288)
14/01/09 14:16:45 INFO DAGScheduler: Parents of final stage: List()
14/01/09 14:16:45 INFO DAGScheduler: Missing parents: List()
14/01/09 14:16:45 INFO DAGScheduler: Computing the requested partition locally
14/01/09 14:16:45 INFO HadoopRDD: Input split: file:/home/training/testlog16:0+61124
14/01/09 14:16:45 INFO PythonRDD: Times: total = 14, boot = 1, init = 3, finish = 10
14/01/09 14:16:45 INFO SparkContext: Job finished: runJob at PythonRDD.scala:288, took 0.021146108 s
Out[32]:
[u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET KBDOC-00087.html HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'100.219.90.44 - 102 [15/Sep/2013:23:58:51 +0100] "GET theme.css HTTP/1.0" 200 8681 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET KBDOC-00076.html HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ',
u'182.4.148.56 - 173 [15/Sep/2013:23:58:30 +0100] "GET theme.css HTTP/1.0" 200 17546 "http://www.loudacre.com" "Loudacre CSR Browser" ']
In [33]: Exception in thread "stdin writer for python" java.lang.NullPointerException
at org.apache.hadoop.fs.BufferedFSInputStream.getPos(BufferedFSInputStream.java:54)
at org.apache.hadoop.fs.FSDataInputStream.getPos(FSDataInputStream.java:60)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:246)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:275)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:227)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:195)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:211)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:174)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:206)
at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:45)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:167)
at org.apache.spark.rdd.HadoopRDD$$anon$1.getNext(HadoopRDD.scala:150)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:71)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:27)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
at scala.collection.Iterator$class.foreach(Iterator.scala:772)
at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
at org.apache.spark.api.python.PythonRDD$$anon$2.run(PythonRDD.scala:98)
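For reference, here is a minimal standalone sketch of the reproduction, not taken from the session above: it assumes the attached data file is saved locally as testlog16 (the filename from the session) and uses a plain local SparkContext.

from pyspark import SparkContext

sc = SparkContext("local", "npe-repro")

# Path is an assumption; any local copy of the attached file should do.
lines = sc.textFile("testlog16")

# take() returns and prints the lines successfully...
print(lines.take(5))

# ...then the "stdin writer for python" thread throws the NPE shortly
# afterwards, from BufferedFSInputStream.getPos as in the trace above.

One plausible reading of the trace is that the "stdin writer for python" thread is still iterating over the HadoopRDD input after take() has returned, consistent with the underlying stream having been closed out from under it; the linked SPARK-1579 concerns distinguishing such expected exceptions in the worker.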
Issue Links
- relates to SPARK-1579: PySpark should distinguish expected IOExceptions from unexpected ones in the worker (Resolved)