Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-27851

Allow for custom BroadcastMode#transform return values

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Invalid
    • 2.4.3
    • None
    • Optimizer, SQL
    • None

    Description

      According to the BroadcastMode API the BroadcastMode#transform methods are allows to return a result object of an arbitrary type:

      /**
       * Marker trait to identify the shape in which tuples are broadcasted. Typical examples of this are
       * identity (tuples remain unchanged) or hashed (tuples are converted into some hash index).
       */
      trait BroadcastMode {
        def transform(rows: Array[InternalRow]): Any
      
        def transform(rows: Iterator[InternalRow], sizeHint: Option[Long]): Any
      
        def canonicalized: BroadcastMode
      }
      

      When looking at the code which later uses the instantiated BroadcastMode objects in BroadcastExchangeExec it becomes that this is not really the base.

      The following lines in BroadcastExchangeExec suggest that only objects of type HashedRelation and Array[InternalRow] are allowed as a result for the BroadcastMode#transform methods:

      // Construct the relation.
      val relation = mode.transform(input, Some(numRows))
      
      val dataSize = relation match {
          case map: HashedRelation =>
              map.estimatedSize
          case arr: Array[InternalRow] =>
              arr.map(_.asInstanceOf[UnsafeRow].getSizeInBytes.toLong).sum
          case _ =>
              throw new SparkException("[BUG] BroadcastMode.transform returned unexpected type: " +
                  relation.getClass.getName)
      }
      

      I believe that this is the only occurrence in the code where the result of the BroadcastMode#transform method must be either of type HashedRelation or Array[InternalRow]. Therefore to allow for broader custom implementations of the BroadcastMode I believe it would be a good idea to solve the calculation of the data size of the broadcast value in an independent way of the used BroadcastMode implemented.

      One way this could be done is by providing an additional BroadcastMode#calculateDataSize method, which needs to be implemented by the BroadcastMode implementations. This methods could then return the required number of bytes for a given broadcast value.

      Attachments

        Activity

          People

            Unassigned Unassigned
            madoar Marc Arndt
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: