Hive / HIVE-15527

Memory usage is unbound in SortByShuffler for Spark

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Spark
    • Labels: None

    Description

      In SortByShuffler.java, an ArrayList backs the iterator over the values that share a key in the shuffled result produced by the Spark transformation sortByKey. Every value of a key group is buffered in that list before the group is returned, so a single large key group can exhaust memory.

                  @Override
                  public Tuple2<HiveKey, Iterable<BytesWritable>> next() {
                    // TODO: implement this by accumulating rows with the same key into a list.
                    // Note that this list needs to be improved to prevent excessive memory usage, but this
                    // can be done in a later phase.
                    while (it.hasNext()) {
                      Tuple2<HiveKey, BytesWritable> pair = it.next();
                      if (curKey != null && !curKey.equals(pair._1())) {
                        HiveKey key = curKey;
                        List<BytesWritable> values = curValues;
                        curKey = pair._1();
                        curValues = new ArrayList<BytesWritable>();
                        curValues.add(pair._2());
                        return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, values);
                      }
                      curKey = pair._1();
                      curValues.add(pair._2());
                    }
                    if (curKey == null) {
                      throw new NoSuchElementException();
                    }
                    // if we get here, this should be the last element we have
                    HiveKey key = curKey;
                    curKey = null;
                    return new Tuple2<HiveKey, Iterable<BytesWritable>>(key, curValues);
                  }
      

      Since the output of sortByKey is already sorted by key, the value iterable can instead be backed by the same input iterator, so that the values of a group are streamed rather than copied into a list.
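
      The idea above can be sketched as follows. This is a minimal, hypothetical illustration and not the code from any of the attached patches: the class name LazyGroupingIterator is invented, Map.Entry stands in for Tuple2, and the generic K and V stand in for HiveKey and BytesWritable. It assumes non-null keys and an input that is already sorted by key.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Iterator;
import java.util.Map;
import java.util.NoSuchElementException;

/**
 * Hypothetical sketch: groups a key-sorted stream of (key, value) pairs
 * without buffering a whole key group in memory.
 */
final class LazyGroupingIterator<K, V> implements Iterator<Map.Entry<K, Iterable<V>>> {
  private final Iterator<Map.Entry<K, V>> input; // assumed to be sorted by key
  private Map.Entry<K, V> lookahead;             // next unconsumed pair, null at end of input
  private K currentKey;                          // key of the most recently returned group

  LazyGroupingIterator(Iterator<Map.Entry<K, V>> input) {
    this.input = input;
    advance();
  }

  // Pull the next pair from the shared input iterator.
  private void advance() {
    lookahead = input.hasNext() ? input.next() : null;
  }

  // True while the next unconsumed pair still belongs to the given key.
  private boolean inGroup(K key) {
    return lookahead != null && lookahead.getKey().equals(key);
  }

  @Override
  public boolean hasNext() {
    // Skip any values of the current group that the caller did not consume.
    while (inGroup(currentKey)) {
      advance();
    }
    return lookahead != null;
  }

  @Override
  public Map.Entry<K, Iterable<V>> next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    final K key = lookahead.getKey();
    currentKey = key;
    // Single-pass view: the values are read straight from the input iterator.
    Iterable<V> values = () -> new Iterator<V>() {
      @Override
      public boolean hasNext() {
        return inGroup(key);
      }

      @Override
      public V next() {
        if (!hasNext()) {
          throw new NoSuchElementException();
        }
        V value = lookahead.getValue();
        advance();
        return value;
      }
    };
    return new SimpleImmutableEntry<K, Iterable<V>>(key, values);
  }
}
```

      The trade-off in this sketch is that each value iterable is single-pass and only valid until the outer iterator is advanced: calling hasNext() or next() on the outer iterator discards whatever remains of the current group. Memory use is then bounded by one lookahead pair rather than by the size of the largest key group.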

      Attachments

        1. HIVE-15527.8.patch (14 kB, Chao Sun)
        2. HIVE-15527.7.patch (6 kB, Chao Sun)
        3. HIVE-15527.0.patch (6 kB, Xuefu Zhang)
        4. HIVE-15527.0.patch (5 kB, Xuefu Zhang)
        5. HIVE-15527.6.patch (10 kB, Chao Sun)
        6. HIVE-15527.5.patch (10 kB, Chao Sun)
        7. HIVE-15527.4.patch (10 kB, Chao Sun)
        8. HIVE-15527.3.patch (8 kB, Xuefu Zhang)
        9. HIVE-15527.2.patch (8 kB, Xuefu Zhang)
        10. HIVE-15527.1.patch (3 kB, Xuefu Zhang)
        11. HIVE-15527.patch (4 kB, Xuefu Zhang)

            People

              csun Chao Sun
              xuefuz Xuefu Zhang
              Votes: 0
              Watchers: 7