  1. Spark
  2. SPARK-24356

Duplicate strings in File.path managed by FileSegmentManagedBuffer


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: Shuffle, Spark Core
    • Labels: None

    Description

      I recently analyzed a heap dump of a YARN NodeManager that was suffering from high GC pressure due to high object churn. The analysis was done with the JXRay tool (www.jxray.com), which checks a heap dump for a number of well-known memory issues. One problem it found in this dump is that 19.5% of memory is wasted on duplicate strings. More than half of these duplicates come from FileInputStream.path and File.path. All the FileInputStream objects that JXRay shows are garbage: it looks like they are used for a very short period and then discarded (whether that's a good pattern is, I guess, a separate question). The File instances, however, are traceable to the org.apache.spark.network.buffer.FileSegmentManagedBuffer.file field. Here is the full reference chain:
       

      ↖java.io.File.path
      ↖org.apache.spark.network.buffer.FileSegmentManagedBuffer.file
      ↖{j.u.ArrayList}
      ↖j.u.ArrayList$Itr.this$0
      ↖org.apache.spark.network.server.OneForOneStreamManager$StreamState.buffers
      ↖{java.util.concurrent.ConcurrentHashMap}.values
      ↖org.apache.spark.network.server.OneForOneStreamManager.streams
      ↖org.apache.spark.network.shuffle.ExternalShuffleBlockHandler.streamManager
      ↖org.apache.spark.network.yarn.YarnShuffleService.blockHandler
      ↖Java Static org.apache.spark.network.yarn.YarnShuffleService.instance
      

       
      Values of these File.path's and FileInputStream.path's look very similar, so I think the FileInputStreams are generated by the FileSegmentManagedBuffer code. Instances of File, in turn, likely come from
      https://github.com/apache/spark/blob/master/common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockResolver.java#L258-L263
       
      To avoid duplicate strings in File.path in this case, I suggest that the code linked above create the File from a complete, normalized pathname that has already been interned. The code inside java.io.File then has no reason to modify this string, so it will use the interned copy and pass it on to FileInputStream. Essentially, the current line

      return new File(new File(localDir, String.format("%02x", subDirId)), filename);

      should be replaced with something like

      String pathname = localDir + File.separator + String.format(...) + File.separator + filename;
      pathname = fileSystem.normalize(pathname).intern();
      return new File(pathname);
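
The suggestion above can be sketched as a small standalone program. This is only an illustration under stated assumptions, not the change that went into Spark: the class and method names (InternedPathSketch, normalizedInternedPathname, getFile) and the directory and file names are hypothetical, and because java.io.File's internal normalization is not a public API, the sketch approximates it by collapsing repeated separators and dropping a trailing one. Windows UNC prefixes are not handled.

```java
import java.io.File;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class for illustration; not part of Spark.
class InternedPathSketch {

  // One or more consecutive file separators, e.g. "//" on Unix-like systems.
  private static final Pattern MULTIPLE_SEPARATORS =
      Pattern.compile("(?:" + Pattern.quote(File.separator) + ")+");

  // Builds "<localDir>/<xx>/<filename>" as a single string, collapses
  // repeated separators, drops a trailing separator, and interns the result
  // so that equal pathnames share one String instance.
  static String normalizedInternedPathname(String localDir, int subDirId, String filename) {
    String sep = File.separator;
    String raw = localDir + sep + String.format("%02x", subDirId) + sep + filename;
    String pathname =
        MULTIPLE_SEPARATORS.matcher(raw).replaceAll(Matcher.quoteReplacement(sep));
    if (pathname.length() > 1 && pathname.endsWith(sep)) {
      pathname = pathname.substring(0, pathname.length() - 1);
    }
    return pathname.intern();
  }

  // Creates the File from the already normalized, interned pathname.
  static File getFile(String localDir, int subDirId, String filename) {
    return new File(normalizedInternedPathname(localDir, subDirId, filename));
  }

  public static void main(String[] args) {
    // Two lookups of the same block file; the second localDir has a trailing separator.
    File a = getFile("/tmp/spark-local", 10, "shuffle_0_0_0.data");
    File b = getFile("/tmp/spark-local/", 10, "shuffle_0_0_0.data");
    System.out.println(a.getPath());
    // Whether File keeps the very same String instance is a JDK
    // implementation detail, so the result is printed rather than assumed.
    System.out.println(a.getPath() == b.getPath());
  }
}
```

String.intern() guarantees that the two pathnames returned by the helper are the same object. Whether java.io.File then stores that object unchanged depends on the JDK's internal normalization, which is why the pathname is normalized before interning.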

       

          People

            Assignee: Misha Dmitriev (misha@cloudera.com)
            Reporter: Misha Dmitriev (misha@cloudera.com)
            Votes: 2
            Watchers: 8
