Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.8.2
Fix Version/s: None
Component/s: None
Labels: None
Environment:
scalding 0.15-SNAPSHOT per https://github.com/twitter/scalding/pull/1446
cascading 3.1.0-wip-54
tez-0.8.2
OpenJDK 8 on AMD64
Hadoop 2.6.0 (YARN, HDFS); Apache distribution
Debian Linux 8
8 * Intel Core i7-3770K
Description
A (fairly complex) scalding DAG that ran fine under tez-0.6.2 now takes extremely long to run under tez-0.8.2.
Setting "tez.runtime.sorter.class" to "LEGACY" restored proper behaviour.
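For anyone who needs the same workaround, here is a minimal sketch of applying it from the job side. The property name and the "LEGACY" value are the ones quoted above; the `SorterWorkaround` object and `withLegacySorter` helper are hypothetical names made up for illustration, and the usage shown in the comment assumes the job extends scalding's `Job` and overrides its `config` map.

```scala
object SorterWorkaround {
  // Property and value as quoted in this report.
  val SorterClassKey = "tez.runtime.sorter.class"
  val LegacySorter   = "LEGACY"

  /** Returns a copy of `base` with the sorter forced to LEGACY (hypothetical helper). */
  def withLegacySorter(base: Map[AnyRef, AnyRef]): Map[AnyRef, AnyRef] =
    base + (SorterClassKey -> LegacySorter)
}

// Assumed usage inside a scalding Job (not compiled here, needs scalding on the classpath):
//   override def config: Map[AnyRef, AnyRef] =
//     SorterWorkaround.withLegacySorter(super.config)
```

Any other way of getting the property into the job configuration should have the same effect; this is only one way to spell it.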
The problem can be traced to code of this shape:
val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset
x.group
  .mapValues(x => 1L)
  .sum
  .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
where the incoming data contains a very large number of distinct keys. The observed behaviour of PipelinedSorter is that several hundred thousand different files are put flat in the same per-TezChild local temporary directories, and things become very slow (not alleging any causality).
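To check whether another node shows the same symptom, a small helper like the one below can count the files sitting directly in a given directory. It is only a sketch: `CountFlatFiles` is a hypothetical name, and the directory to inspect has to be supplied by the caller, since the exact location of the per-TezChild temporary directories depends on the local YARN setup.

```scala
import java.nio.file.{Files, Path, Paths}
import scala.collection.JavaConverters._

object CountFlatFiles {
  /** Counts regular files directly inside `dir` (non-recursive; subdirectories are skipped). */
  def countFlatFiles(dir: Path): Int = {
    val stream = Files.newDirectoryStream(dir)
    try stream.iterator().asScala.count(p => Files.isRegularFile(p))
    finally stream.close()
  }

  def main(args: Array[String]): Unit = {
    // The directory is passed as the first argument; defaults to the current directory.
    val dir = Paths.get(args.headOption.getOrElse("."))
    println(s"$dir: ${countFlatFiles(dir)} files")
  }
}
```

Running it against the same directory at intervals during the job would show how quickly the file count grows.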