Spark / SPARK-3528

Reading data from file:/// should be called NODE_LOCAL not PROCESS_LOCAL

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Labels: None

      Description

      Note that reading from file:///.../pom.xml is labeled a PROCESS_LOCAL task, even though it should be NODE_LOCAL:

      spark> sc.textFile("pom.xml").count
      ...
      14/09/15 00:59:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, PROCESS_LOCAL, 1191 bytes)
      14/09/15 00:59:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, PROCESS_LOCAL, 1191 bytes)
      14/09/15 00:59:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
      14/09/15 00:59:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
      14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:20862+20863
      14/09/15 00:59:13 INFO HadoopRDD: Input split: file:/Users/aash/git/spark/pom.xml:0+20862
      

      There is an outstanding TODO in HadoopRDD.scala that may be related:

        override def getPreferredLocations(split: Partition): Seq[String] = {
          // TODO: Filtering out "localhost" in case of file:// URLs
          val hadoopSplit = split.asInstanceOf[HadoopPartition]
          hadoopSplit.inputSplit.value.getLocations.filter(_ != "localhost")
        }
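
      The effect of that filter can be sketched as follows. This is a minimal, self-contained illustration (the object and method names here are hypothetical, not Spark's actual scheduler code): for a file:// input split, Hadoop reports only "localhost" as the location, so the filter leaves an empty preferred-location list, and the scheduler then treats the task as having no locality preference and logs it as PROCESS_LOCAL.

      ```scala
      // Hypothetical sketch of the filter in getPreferredLocations above.
      object LocalityFilterSketch {
        // Mirrors the filter applied to a split's reported locations.
        def preferredLocations(splitLocations: Seq[String]): Seq[String] =
          splitLocations.filter(_ != "localhost")

        def main(args: Array[String]): Unit = {
          // A file:///... split typically reports only "localhost",
          // so the filtered result is empty -- i.e. "no preference",
          // which the scheduler then reports as PROCESS_LOCAL.
          println(preferredLocations(Seq("localhost")).isEmpty)
          // An HDFS split reporting real hostnames keeps them.
          println(preferredLocations(Seq("host1", "localhost")))
        }
      }
      ```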
      


            People

            • Assignee: Unassigned
            • Reporter: aash Andrew Ash
            • Votes: 0
            • Watchers: 11
