Details
-
Sub-task
-
Status: Resolved
-
Major
-
Resolution: Fixed
-
3.3.0
-
None
Description
FallbackStorage uses a placeholder block manager ID:
private[spark] object FallbackStorage extends Logging { /** We use one block manager id as a place holder. */ val FALLBACK_BLOCK_MANAGER_ID: BlockManagerId = BlockManagerId("fallback", "remote", 7337)
That second argument is normally interpreted as a hostname, but is passed as the string "remote" in this case.
BlockManager will consider this placeholder as one of the peers in some cases:
private[storage] def getPeers(forceFetch: Boolean): Seq[BlockManagerId] = { peerFetchLock.synchronized { ... if (cachedPeers.isEmpty && conf.get(config.STORAGE_DECOMMISSION_FALLBACK_STORAGE_PATH).isDefined) { Seq(FallbackStorage.FALLBACK_BLOCK_MANAGER_ID) } else { cachedPeers } } }
BlockManagerDecommissioner.ShuffleMigrationRunnable will then attempt to perform an upload to this placeholder ID:
try { blocks.foreach { case (blockId, buffer) => logDebug(s"Migrating sub-block ${blockId}") bm.blockTransferService.uploadBlockSync( peer.host, peer.port, peer.executorId, blockId, buffer, StorageLevel.DISK_ONLY, null) // class tag, we don't need for shuffle logDebug(s"Migrated sub-block $blockId") } logInfo(s"Migrated $shuffleBlockInfo to $peer") } catch { case e: IOException => ... if (bm.migratableResolver.getMigrationBlocks(shuffleBlockInfo).size < blocks.size) { logWarning(s"Skipping block $shuffleBlockInfo, block deleted.") } else if (fallbackStorage.isDefined) { fallbackStorage.foreach(_.copy(shuffleBlockInfo, bm)) } else { logError(s"Error occurred during migrating $shuffleBlockInfo", e) keepRunning = false }
Since "remote" is not expected to be a resolvable hostname, an IOException occurs, and fallbackStorage is used. But, we shouldn't try to resolve this. First off, it's completely unnecessary and strange to be treating the placeholder ID as a resolvable hostname, relying on an exception to realize that we need to use the fallbackStorage.
To make matters worse, in some network environments, "remote" may be a resolvable hostname, completely breaking this functionality. In the particular environment that I use for running automated tests, there is a DNS entry for "remote" which, when you attempt to connect to it, will hang for a long period of time. This essentially hangs the executor decommission process, and in the case of unit tests, breaks FallbackStorageSuite as it exceeds its timeouts. I'm not sure, but it's possible this is related to SPARK-35584 as well (if sometimes in the GA environment, it takes a long time for the OS to decide that "remote" is not a valid hostname).
We shouldn't attempt to treat this placeholder ID as a real hostname.