The data is a table with 300K rows and 16 columns covering a single year, so there are 12 months and 365 days, each with a roughly similar number of rows (each row is a scheduled flight).
Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", :
Unable to retrieve JDBC result set for SELECT `year`, `month`, `flights`
FROM (select `year`, `month`, sum(`flights`) as `flights`
from (select `year`, `month`, `day`, count(*) as `flights`
group by `year`, `month`, `day`) as `_w21`
group by `year`, `month`) AS `_w22`
LIMIT 10 (org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 237.0 failed 1 times, most recent failure: Lost task 0.0 in stage 237.0 (TID 8634, localhost): java.io.FileNotFoundException: /user/hive/warehouse/flights/file11ce460c958e (Too many open files)
at java.io.FileInputStream.open0(Native Method)
As you can see, the query is not something one would easily write by hand, because it is computer generated, but it makes perfect sense: it is a count of flights by month. It could be done without the nested query, but that's not the point.
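To make the "count of flights by month" reading concrete, here is a minimal sketch that runs the nested form and a flat form against a tiny in-memory SQLite table and compares the results. The table name `flights` is an assumption taken from the warehouse path in the error (the innermost FROM clause is missing from the log), the alias `daily` is mine, and the sample rows, including the year, are made up for illustration. SQLite stands in for Spark here, so this only shows that the two queries are logically equivalent, not how Spark executes them.

```python
import sqlite3

# Hypothetical miniature stand-in for the flights table:
# one (year, month, day) row per scheduled flight. Values are invented.
rows = [
    (2013, 1, 1), (2013, 1, 1), (2013, 1, 2),
    (2013, 2, 1), (2013, 2, 3), (2013, 2, 3), (2013, 2, 3),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (year INTEGER, month INTEGER, day INTEGER)")
conn.executemany("INSERT INTO flights VALUES (?, ?, ?)", rows)

# Same shape as the generated query: count per day, then sum the daily counts per month.
nested = conn.execute("""
    SELECT year, month, SUM(flights) AS flights
    FROM (SELECT year, month, day, COUNT(*) AS flights
          FROM flights
          GROUP BY year, month, day) AS daily
    GROUP BY year, month
    ORDER BY year, month
""").fetchall()

# Hand-written equivalent without the nested query.
flat = conn.execute("""
    SELECT year, month, COUNT(*) AS flights
    FROM flights
    GROUP BY year, month
    ORDER BY year, month
""").fetchall()

print(nested == flat)
```

Both forms return one row per month with the number of flights in that month.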
This query used to work on Spark 1.4 and fails on 1.5. There has also been an OS upgrade to Yosemite in the meantime, so it's hard to separate the effects of the two. Following suggestions that the default system limits for open files are too low for Spark to work properly, I increased the hard and soft limits to 32k. For some reason, the error happens when Java has about 10250 open files, as reported by lsof; it is not clear to me where that limit is coming from. The total number of open files is 16k. If this is not a bug, I would like to ask what a safe number of allowed open files is and whether there are other configurations that need to be tuned.
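For anyone trying to reproduce this, a small sketch for checking which open-file limits a process actually inherits, using Python's standard `resource` module. The helper names `fd_limits` and `open_fd_count` are made up for this example. It reports on the Python process itself, not the JVM, so it only confirms what a process launched from the same shell sees; the JVM may apply its own cap on top of that.

```python
import os
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def open_fd_count():
    """Approximate number of descriptors open in this process, or None if unknown.

    Listing the directory briefly opens one descriptor itself, so the
    count can be off by one.
    """
    for path in ("/dev/fd", "/proc/self/fd"):
        try:
            return len(os.listdir(path))
        except OSError:
            continue
    return None


soft, hard = fd_limits()
print("soft limit:", soft)
print("hard limit:", hard)
print("open descriptors:", open_fd_count())
```

If the soft limit printed here is lower than the value that was configured, the new setting has not reached the shell that launches Spark.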