  1. Apache Tez
  2. TEZ-3113

Massive increase in run time when using PipelinedSorter rather than DefaultSorter


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.8.2
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      A (fairly complex) Scalding DAG that ran fine under tez-0.6.2 now takes dramatically longer to run under tez-0.8.2.

      Reverting "tez.runtime.sorter.class" -> "LEGACY" restored proper behaviour.
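
      For readers who want to try the same workaround, here is a minimal sketch of the override, assuming the job is launched through a runner that honours Hadoop-style generic options (-Dkey=value). Only the property name and the "LEGACY" value come from this report; SorterOverride and asCliArgs are hypothetical names used for illustration.

```scala
// Sketch only: SorterOverride and asCliArgs are hypothetical helper names.
object SorterOverride {
  // Property name and value exactly as quoted in this report.
  val overrides: Map[String, String] =
    Map("tez.runtime.sorter.class" -> "LEGACY")

  // Render each override as a Hadoop-style generic option (-Dkey=value).
  def asCliArgs(props: Map[String, String]): Seq[String] =
    props.toSeq.sortBy(_._1).map { case (k, v) => s"-D$k=$v" }

  def main(args: Array[String]): Unit =
    println(asCliArgs(overrides).mkString(" "))
}
```

      The resulting argument would be placed on the job's command line; whether it is picked up depends on how the job is launched.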

      Difficulties can be traced to this shape of code:

      val x: TypedPipe[(String, String)] = ??? // get *LARGE* dataset 
      
      x
        .group
        .mapValues(_ => 1L) // value is ignored; each record counts as 1
        .sum
        .write(TypedTsvHeader("foo.tsv", ('key, 'count)))
      

      where the incoming data contains a very large number of distinct keys. The observed behaviour with PipelinedSorter is that several hundred thousand files are placed flat in the same per-TezChild local temporary directory, and things become very slow (not alleging any causality).
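
      To make the shape of the computation concrete, the following sketch expresses the same count-per-key logic on plain Scala collections. It is not Scalding code and does not exercise Tez or either sorter; CountPerKey and countPerKey are hypothetical names, and the sample data is made up.

```scala
// Sketch only: in-memory equivalent of group / mapValues(_ => 1L) / sum.
object CountPerKey {
  // Group the pairs by key, then count the records in each group.
  def countPerKey(pairs: Seq[(String, String)]): Map[String, Long] =
    pairs.groupBy(_._1).map { case (key, group) => key -> group.size.toLong }

  def main(args: Array[String]): Unit = {
    // Made-up sample input: key "a" appears twice, key "b" once.
    val sample = Seq(("a", "x"), ("b", "y"), ("a", "z"))
    println(countPerKey(sample))
  }
}
```

      In the real job the number of distinct keys is very large, so the grouping step implies a shuffle and sort over many keys, which is where the choice of sorter comes into play.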


          People

            Assignee: Unassigned
            Reporter: Cyrille Chépélov (cchepelov)
            Votes: 0
            Watchers: 12
