[TEZ-3113] massive increase of run time using PipelinedSorter rather than DefaultSorter - ASF JIRA

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 0.8.2
Fix Version/s: None
Component/s: None
Labels: None
Environment:

scalding 0.15-SNAPSHOT per https://github.com/twitter/scalding/pull/1446
cascading 3.1.0-wip-54
tez-0.8.2
OpenJDK 8 on AMD64
Hadoop 2.6.0 (YARN, HDFS); Apache distribution
Debian Linux 8
8 * Intel Core i7-3770K

Description

A (fairly complex) scalding DAG that ran fine under tez-0.6.2 suddenly takes an extremely long time to run under tez-0.8.2.

Setting "tez.runtime.sorter.class" -> "LEGACY" (that is, reverting from PipelinedSorter to DefaultSorter) restored proper behaviour.
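For illustration, a minimal sketch of that workaround as a plain Scala helper. Only the property name and the "LEGACY" value come from this report; the object name `SorterOverride` and the helper `withLegacySorter` are hypothetical, and how the resulting properties reach the job configuration depends on the setup and is not shown.

```scala
object SorterOverride {
  // Property name and value exactly as quoted in this report.
  val SorterKey = "tez.runtime.sorter.class"
  val LegacySorter = "LEGACY"

  // Hypothetical helper: returns a copy of the given job properties
  // with the sorter property forced to LEGACY, overriding any prior value.
  def withLegacySorter(props: Map[String, String]): Map[String, String] =
    props + (SorterKey -> LegacySorter)
}
```

Any existing value for the key is overwritten, and all other properties are left untouched.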

The slowdown can be traced to code of this shape:

val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset 

x
  .group
  .mapValues(_ => 1L) // the value is ignored; each record counts as 1
  .sum
  .write(TypedTsvHeader("foo.tsv", ('key, 'count)))

where the incoming data contains a very large number of distinct keys. The observed behaviour of PipelinedSorter is that several hundred thousand different files are put flat in the same per-TezChild local temporary directories, and things become very slow (not alleging any causality).
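To check the file-count observation on another cluster, a small sketch along these lines could count the entries sitting directly in a given directory. The object name `DirEntryCount` is hypothetical, and the path of the per-TezChild temporary directory is not given in this report, so it has to be supplied by the caller.

```scala
import java.nio.file.{Files, Path}

object DirEntryCount {
  // Counts the direct children of `dir` (no recursion).
  // Files.list returns a lazily populated stream that holds the directory
  // open, so it is closed explicitly once the count is done.
  def countEntries(dir: Path): Long = {
    val entries = Files.list(dir)
    try entries.count() finally entries.close()
  }
}
```

For example, `DirEntryCount.countEntries(java.nio.file.Paths.get(args(0)))` reports the count for a directory passed on the command line.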

People

Assignee: Unassigned
Reporter: Cyrille Chépélov
Votes: 0
Watchers: 12

Dates

Created: 11/Feb/16 14:27
Updated: 24/Aug/16 17:15