SPARK-38062
Parent: SPARK-41550 Dynamic Allocation on K8S GA

FallbackStorage shouldn't attempt to resolve arbitrary "remote" hostname


Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.3.0
    • Fix Version/s: 3.3.0
    • Component/s: Spark Core
    • Labels: None

    Description

      FallbackStorage uses a placeholder block manager ID:

      private[spark] object FallbackStorage extends Logging {
        /** We use one block manager id as a place holder. */
        val FALLBACK_BLOCK_MANAGER_ID: BlockManagerId = BlockManagerId("fallback", "remote", 7337)
        ...
      }

      That second argument is normally interpreted as a hostname, but is passed as the string "remote" in this case.
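
      To make the consequence concrete, here is a minimal sketch in plain Scala (not Spark source; the 1-second timeout is arbitrary) of what it means for downstream code to treat that field as a connect target:

        import java.net.{InetSocketAddress, Socket}
        import scala.util.Try

        // Values copied from FALLBACK_BLOCK_MANAGER_ID above.
        val placeholderHost = "remote"
        val placeholderPort = 7337

        // Whether this fails fast, hangs, or even succeeds depends entirely on the
        // local DNS/network configuration -- which is the crux of this report.
        val attempt = Try {
          val socket = new Socket()
          socket.connect(new InetSocketAddress(placeholderHost, placeholderPort), 1000)
          socket.close()
          "connected"
        }
        println(attempt) // Failure(java.net.UnknownHostException: remote) on most networks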

      BlockManager will, in some cases, report this placeholder as one of its peers:

      BlockManager.scala
        private[storage] def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = {
          peerFetchLock.synchronized {
            ...
            if (cachedPeers.isEmpty &&
                conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) {
              Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)
            } else {
              cachedPeers
            }
          }
        }
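
      In other words, when a decommissioning executor has no live peers and the fallback path is configured, the only "peer" it can see is the placeholder. A hedged sketch of what the migration code then observes (values taken directly from the definition above; FallbackStorage is private[spark], so this only compiles inside Spark itself, e.g. in a test under org.apache.spark.storage):

        // Sketch only: `peers` stands in for the result of getPeers(false) on an
        // executor with no other registered block managers.
        val peers = Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID)
        peers.foreach { peer =>
          // These are exactly the values handed to uploadBlockSync below.
          assert(peer.executorId == "fallback")
          assert(peer.host == "remote" && peer.port == 7337)
        }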
      

      BlockManagerDecommissioner.ShuffleMigrationRunnable will then attempt to perform an upload to this placeholder ID:

                  try {
                    blocks.foreach { case (blockId, buffer) =>
                      logDebug(s"Migrating sub-block ${blockId}")
                      bm.blockTransferService.uploadBlockSync(
                        peer.host,
                        peer.port,
                        peer.executorId,
                        blockId,
                        buffer,
                        StorageLevel.DISK_ONLY,
                        null) // class tag, we don't need for shuffle
                      logDebug(s"Migrated sub-block $blockId")
                    }
                    logInfo(s"Migrated $shuffleBlockInfo to $peer")
                  } catch {
                    case e: IOException =>
                      ...
                      if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) {
                        logWarning(s"Skipping block $shuffleBlockInfo, block deleted.")
                      } else if (fallbackStorage.isDefined) {
                        fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
                      } else {
                        logError(s"Error occurred during migrating $shuffleBlockInfo", e)
                        keepRunning = false
                      }
                  }

      Since "remote" is not expected to be a resolvable hostname, the upload fails with an IOException and fallbackStorage is used. But we shouldn't try to resolve this name at all. First, it's unnecessary and confusing to treat the placeholder ID as a resolvable hostname and rely on an exception to realize that we need to use the fallbackStorage.

      To make matters worse, in some network environments "remote" may be a resolvable hostname, which breaks this functionality entirely. In the particular environment I use for running automated tests, there is a DNS entry for "remote" which, when you attempt to connect to it, hangs for a long period of time. This essentially hangs the executor decommission process, and in the case of unit tests, breaks FallbackStorageSuite as it exceeds its timeouts. I'm not sure, but it's possible this is related to SPARK-35584 as well (if, in the GA environment, it sometimes takes a long time for the OS to decide that "remote" is not a valid hostname).
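
      A quick diagnostic (not part of any proposed change) for whether an environment is affected is simply to check whether "remote" resolves at all:

        import java.net.InetAddress
        import scala.util.{Failure, Success, Try}

        // On a typical setup this fails immediately with UnknownHostException; in the
        // environment described above it resolves (or takes a long time to fail),
        // which is what stalls decommissioning and FallbackStorageSuite.
        Try(InetAddress.getByName("remote")) match {
          case Success(addr) => println(s"'remote' resolves to $addr -- this environment is affected")
          case Failure(e)    => println(s"'remote' does not resolve ($e) -- the IOException fallback kicks in")
        }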

      We shouldn't attempt to treat this placeholder ID as a real hostname.
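
      One possible direction (a sketch only, not a finished patch): have the migration runnable recognize the placeholder explicitly and hand the blocks to fallbackStorage without touching the network, e.g.:

        // Hedged sketch, reusing the names already in scope in ShuffleMigrationRunnable
        // (peer, blocks, fallbackStorage, shuffleBlockInfo, bm):
        if (fallbackStorage.isDefined && peer == FallbackStorage.FALLBACK_BLOCK_MANAGER_ID) {
          // The placeholder is not a real executor, so skip uploadBlockSync entirely
          // and copy straight to the configured fallback path.
          fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm))
        } else {
          blocks.foreach { case (blockId, buffer) =>
            bm.blockTransferService.uploadBlockSync(
              peer.host, peer.port, peer.executorId, blockId, buffer, StorageLevel.DISK_ONLY, null)
          }
        }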

          People

            Assignee: xkrogen Erik Krogen
            Reporter: xkrogen Erik Krogen
            Votes: 0
            Watchers: 4
