Spark / SPARK-19556

Broadcast data is not encrypted when I/O encryption is on



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: Spark Core
    • Labels: None


      TorrentBroadcast uses a couple of "back doors" into the block manager to write and read data:

            // Writing a piece:
            if (!blockManager.putBytes(pieceId, bytes, MEMORY_AND_DISK_SER, tellMaster = true)) {
              throw new SparkException(s"Failed to store $pieceId of $broadcastId in local BlockManager")
            }

            // Reading a piece:
            bm.getLocalBytes(pieceId) match {
              case Some(block) =>
                blocks(pid) = block
              case None =>
                bm.getRemoteBytes(pieceId) match {
                  case Some(b) =>
                    if (checksumEnabled) {
                      val sum = calcChecksum(b.chunks(0))
                      if (sum != checksums(pid)) {
                        throw new SparkException(s"corrupt remote block $pieceId of $broadcastId:" +
                          s" $sum != ${checksums(pid)}")
                      }
                    }
                    // We found the block from remote executors/driver's BlockManager, so put the block
                    // in this executor's BlockManager.
                    if (!bm.putBytes(pieceId, b, StorageLevel.MEMORY_AND_DISK_SER, tellMaster = true)) {
                      throw new SparkException(
                        s"Failed to store $pieceId of $broadcastId in local BlockManager")
                    }
                    blocks(pid) = b
                  case None =>
                    throw new SparkException(s"Failed to get $pieceId of $broadcastId")
                }
            }

      What these block manager methods have in common is that they bypass the encryption code. As a result, broadcast data is stored unencrypted in the block manager, and it is written to disk unencrypted if those blocks need to be evicted from memory.

      The correct fix here is actually not to change TorrentBroadcast, but to fix the block manager so that:

      • data stored in memory is not encrypted
      • data written to disk is encrypted

      This would simplify the code paths that use BlockManager / SerializerManager APIs (e.g. see SPARK-19520). It requires some tricky changes inside the BlockManager, though: the disk path must still be able to use file channels, so that whole blocks do not have to be read back into memory just to be decrypted.
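
      The proposed split (plaintext in memory, ciphertext on disk, streamed decryption on read) can be sketched with a toy store. This is only an illustration under stated assumptions: ToyBlockStore, putBytes, evictToDisk, and openStream here are hypothetical names and are not Spark's BlockManager API, and the use of AES in CTR mode with a per-file IV prefix is an assumption made for the example, not a description of what Spark does.

```scala
import java.io.{ByteArrayInputStream, DataInputStream, InputStream}
import java.nio.file.{Files, Path}
import java.security.SecureRandom
import javax.crypto.{Cipher, CipherInputStream, CipherOutputStream, KeyGenerator, SecretKey}
import javax.crypto.spec.IvParameterSpec
import scala.collection.mutable

// Hypothetical toy store, NOT Spark's BlockManager.
// Blocks are kept as plaintext in memory and encrypted only when spilled to disk.
class ToyBlockStore(key: SecretKey, dir: Path) {
  private val memory = mutable.Map.empty[String, Array[Byte]]
  private val ivLength = 16 // AES block size in bytes

  private def newCipher(mode: Int, iv: Array[Byte]): Cipher = {
    val cipher = Cipher.getInstance("AES/CTR/NoPadding")
    cipher.init(mode, key, new IvParameterSpec(iv))
    cipher
  }

  // In-memory path: no encryption.
  def putBytes(id: String, bytes: Array[Byte]): Unit = {
    memory.update(id, bytes)
  }

  // Eviction path: write a fresh IV followed by the encrypted block.
  def evictToDisk(id: String): Path = {
    val bytes = memory.remove(id).getOrElse(throw new NoSuchElementException(id))
    val iv = new Array[Byte](ivLength)
    new SecureRandom().nextBytes(iv)
    val file = dir.resolve(id)
    val raw = Files.newOutputStream(file)
    try {
      raw.write(iv)
      val encrypted = new CipherOutputStream(raw, newCipher(Cipher.ENCRYPT_MODE, iv))
      encrypted.write(bytes)
      encrypted.close() // flushes the cipher and closes the file
    } finally {
      raw.close()
    }
    file
  }

  // Read path: memory blocks are returned as-is; disk blocks are decrypted
  // lazily as the caller reads, so the whole block is never buffered here.
  def openStream(id: String): InputStream = memory.get(id) match {
    case Some(bytes) => new ByteArrayInputStream(bytes)
    case None =>
      val raw = Files.newInputStream(dir.resolve(id))
      val iv = new Array[Byte](ivLength)
      new DataInputStream(raw).readFully(iv)
      new CipherInputStream(raw, newCipher(Cipher.DECRYPT_MODE, iv))
  }
}

object ToyBlockStore {
  // Convenience helper for the example: a random 128-bit AES key.
  def newKey(): SecretKey = {
    val generator = KeyGenerator.getInstance("AES")
    generator.init(128)
    generator.generateKey()
  }
}
```

      With this shape, callers such as the broadcast code above would not need to know whether encryption is enabled: they hand plaintext to the store and read plaintext back, and only the spill path touches the cipher.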




            Assignee: vanzin Marcelo Masiero Vanzin
            Reporter: vanzin Marcelo Masiero Vanzin