Spark / SPARK-25126

avoid creating OrcFile.Reader for all orc files


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.3.1
    • Fix Version/s: 2.4.0
    • Component/s: Input/Output
    • Labels:
      None

      Description

We have a Spark job that starts by reading ORC files under an S3 directory, and we noticed that the job consumes a lot of memory when both the number of ORC files and their sizes are large. The memory bloat went away with the following workaround.

1) Create a Dataset<Row> from a single ORC file.

      Dataset<Row> rowsForFirstFile = spark.read().format("orc").load(oneFile);

2) When creating a Dataset<Row> from all files under the directory, reuse the schema from the previous Dataset.

      Dataset<Row> rows = spark.read().schema(rowsForFirstFile.schema()).format("orc").load(path);

I believe the issue is that, in order to infer the schema, a reader (OrcFile.Reader) is created for each ORC file under the directory even though only the first one is used. Creating a reader loads the file's metadata, so memory consumption is very high when there are many files under the directory.
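The wasteful pattern can be illustrated outside Spark with a small, self-contained Java sketch. Everything here is a hypothetical stand-in: `readHeader` plays the role of OrcFile.Reader's metadata load (using a file's first line in place of ORC metadata), and the two `inferSchema*` helpers contrast reading every file's header with reading only the first.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class SchemaInferenceSketch {
    static int headerReads = 0; // counts how many file headers get loaded

    // Hypothetical stand-in for OrcFile.Reader: loads a file's "metadata"
    // (here, simply its first line) into memory.
    static String readHeader(Path file) throws IOException {
        headerReads++;
        try (Stream<String> lines = Files.lines(file)) {
            return lines.findFirst().orElse("");
        }
    }

    // Eager variant: builds a reader for EVERY file, then keeps only the
    // first result -- analogous to the behavior reported in this issue.
    static String inferSchemaEagerly(List<Path> files) throws IOException {
        String schema = null;
        for (Path f : files) {
            String header = readHeader(f); // metadata loaded for all files
            if (schema == null) schema = header;
        }
        return schema;
    }

    // Frugal variant: reads metadata from the first file only, which is
    // what the workaround above achieves at the Dataset level.
    static String inferSchemaLazily(List<Path> files) throws IOException {
        return files.isEmpty() ? null : readHeader(files.get(0));
    }

    public static void main(String[] args) throws IOException {
        // Create a directory of 100 small files, mimicking many ORC parts.
        Path dir = Files.createTempDirectory("orc-sketch");
        for (int i = 0; i < 100; i++) {
            Files.write(dir.resolve("part-" + i + ".txt"),
                        List.of("schema: struct<a:int>", "row data..."));
        }
        List<Path> files;
        try (Stream<Path> s = Files.list(dir)) {
            files = s.sorted().collect(Collectors.toList());
        }

        headerReads = 0;
        inferSchemaEagerly(files);
        System.out.println("eager header reads: " + headerReads); // 100

        headerReads = 0;
        inferSchemaLazily(files);
        System.out.println("lazy header reads: " + headerReads); // 1
    }
}
```

The sketch only models the open-and-read-metadata cost; the real OrcFile.Reader also buffers ORC stripe and footer structures, which is where the reported memory bloat comes from.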

      The issue exists in both 2.0 and HEAD.

      In 2.0, OrcFileOperator.readSchema is used.

      https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileOperator.scala#L95

      In HEAD, OrcUtils.readSchema is used.

      https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcUtils.scala#L82



People

• Assignee: rfu (Rao Fu)
• Reporter: rfu (Rao Fu)
• Votes: 0
• Watchers: 5
