Description
Using the input_file_name() function prevents column projection (pruning) from being applied to the physical plan of the query.
When this function is selected as a new column, the physical plan for column-oriented formats such as Parquet and ORC includes all columns of the file.
Without it, only the selected columns are read.
In my case, the performance difference is a factor of 30.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object TestSize {
  def main(args: Array[String]): Unit = {
    implicit val spark: SparkSession = SparkSession.builder()
      .master("local")
      .config("spark.sql.shuffle.partitions", "5")
      .getOrCreate()
    import spark.implicits._

    // Selects two columns plus input_file_name()
    val query1 = spark.read.parquet(
        "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
      )
      .select($"app_id", $"idfa", input_file_name().as("fileName"))
      .distinct()
      .count()

    // Selects the same two columns without input_file_name()
    val query2 = spark.read.parquet(
        "s3a://part-00040-a19f0d20-eab3-48ef-be5a-602c7f9a8e58.c000.gz.parquet"
      )
      .select($"app_id", $"idfa")
      .distinct()
      .count()

    // Keep the application alive so the plans can be inspected in the Spark UI
    Thread.sleep(10000000000L)
  }
}
`query1` has all columns in the physical plan, while `query2` has only two.
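The difference can also be checked without the Spark UI by printing both physical plans and comparing the columns listed by the file scan node. The following is only a minimal sketch: the `ComparePlans` object, the `queries` helper, and taking the Parquet path from the first program argument are assumptions made for illustration, and the exact plan text depends on the Spark version.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, input_file_name}

// Hypothetical helper object, not part of the original reproduction.
object ComparePlans {
  // The two columns selected in the report.
  val selectedColumns: Seq[String] = Seq("app_id", "idfa")

  // Builds both query variants, stopping before count() so the plans can be printed.
  def queries(df: DataFrame): (DataFrame, DataFrame) = {
    val cols: Seq[Column] = selectedColumns.map(col)
    val colsWithFileName: Seq[Column] = cols :+ input_file_name().as("fileName")
    val withFileName = df.select(colsWithFileName: _*).distinct()
    val withoutFileName = df.select(cols: _*).distinct()
    (withFileName, withoutFileName)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").getOrCreate()

    // Assumption: the Parquet path is passed as the first program argument.
    val (withFileName, withoutFileName) = queries(spark.read.parquet(args(0)))

    println("With input_file_name():")
    withFileName.explain() // prints the physical plan to stdout

    println("Without input_file_name():")
    withoutFileName.explain()

    spark.stop()
  }
}
```

In the printed plans, the columns read from the file appear in the scan node, so the two outputs can be compared side by side.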