  1. Apache Drill
  2. DRILL-5273

CompliantTextReader exhausts 4 GB memory when reading 5000 small files



      A test case was created that consists of 5000 text files, each containing a single line that holds the file number: 1 to 5001. Each file therefore has a single record of at most 4 characters.

      Run the following query:

      SELECT * FROM `dfs.data`.`5000files/text`

      The query fails with an out-of-memory (OOM) error in the scan batch at around record 3700 on a Mac with 4 GB of direct memory.

      The code that reads records in the scan batch is complex. The following appears to occur:

      • Iterate over the record readers for each file.
      • For each reader, call setup().

      The setup code is:

        public void setup(OperatorContext context, OutputMutator outputMutator) throws ExecutionSetupException {
          oContext = context;
          readBuffer = context.getManagedBuffer(READ_BUFFER);
          whitespaceBuffer = context.getManagedBuffer(WHITE_SPACE_BUFFER);

      The two buffers are in direct memory. There is no code that releases the buffers.

      The sizes are:

        private static final int READ_BUFFER = 1024*1024;
        private static final int WHITE_SPACE_BUFFER = 64*1024;
      Total: 1,048,576 + 65,536 = 1,114,112 bytes

      This is exactly the amount of memory that accumulates each time ScanBatch.next() moves on to a new reader:

      Ctor: 0  -- Initial memory in constructor
      Init setup: 1114112  -- After call to first record reader setup
      Entry Memory: 1114112  -- first next() call, returns one record
      Entry Memory: 1114112  -- second next(), eof and start second reader
      Entry Memory: 2228224 -- third next(), second reader returns EOF
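
      The pattern in the trace can be reproduced with a small stand-alone sketch. It does not use Drill's classes: ToyAllocator, ToyReader and LeakDemo are hypothetical names invented for illustration, and the allocator only counts bytes. The sketch assumes that each reader allocates the two buffers in setup() and that a leaking reader never returns them when it is closed.

```java
/** Toy stand-in for an operator's allocator: it only tracks outstanding bytes. */
class ToyAllocator {
    private long allocated;

    long allocate(long bytes) {
        allocated += bytes;
        return bytes;
    }

    void release(long bytes) {
        allocated -= bytes;
    }

    long allocated() {
        return allocated;
    }
}

/** Hypothetical reader that mirrors the two allocations made in setup(). */
class ToyReader implements AutoCloseable {
    static final int READ_BUFFER = 1024 * 1024;
    static final int WHITE_SPACE_BUFFER = 64 * 1024;

    private final ToyAllocator allocator;
    private final boolean releaseOnClose;
    private long held;

    ToyReader(ToyAllocator allocator, boolean releaseOnClose) {
        this.allocator = allocator;
        this.releaseOnClose = releaseOnClose;
    }

    void setup() {
        held += allocator.allocate(READ_BUFFER);
        held += allocator.allocate(WHITE_SPACE_BUFFER);
    }

    @Override
    public void close() {
        // A leaking reader skips this step, so its buffers stay allocated.
        if (releaseOnClose) {
            allocator.release(held);
            held = 0;
        }
    }
}

class LeakDemo {
    /** Runs one reader per file, in sequence, and returns the bytes still allocated. */
    static long scan(int files, boolean releaseOnClose) {
        ToyAllocator allocator = new ToyAllocator();
        for (int i = 0; i < files; i++) {
            ToyReader reader = new ToyReader(allocator, releaseOnClose);
            reader.setup();
            reader.close();
        }
        return allocator.allocated();
    }

    public static void main(String[] args) {
        System.out.println("leaking, 2 files:   " + LeakDemo.scan(2, false));
        System.out.println("releasing, 2 files: " + LeakDemo.scan(2, true));
    }
}
```

      With two files the leaking variant ends with 2,228,224 bytes outstanding, matching the third line of the trace, while the releasing variant ends at zero.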

      If we leak a little over 1 MB per file, then with 5000 files we would leak more than 5 GB of memory, which would explain the OOM when given only 4 GB.
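
      The arithmetic can be checked with a short calculation. This is a rough estimate only: LeakEstimate is a hypothetical helper, and it assumes that "4 GB" means 4 GiB (4 × 1024³ bytes) and that nothing else consumes direct memory.

```java
class LeakEstimate {
    static final long READ_BUFFER = 1024L * 1024L;
    static final long WHITE_SPACE_BUFFER = 64L * 1024L;
    static final long PER_FILE = READ_BUFFER + WHITE_SPACE_BUFFER; // 1,114,112 bytes

    /** Total bytes leaked after reading the given number of files. */
    static long totalLeakBytes(int files) {
        return PER_FILE * files;
    }

    /** Number of files whose leaked buffers fit entirely within the limit. */
    static long filesUntilExhausted(long limitBytes) {
        return limitBytes / PER_FILE;
    }

    public static void main(String[] args) {
        long fourGiB = 4L * 1024L * 1024L * 1024L; // assumption: "4 GB" is 4 GiB
        System.out.println("Leak for 5000 files: " + totalLeakBytes(5000) + " bytes");
        System.out.println("Files that fit in 4 GiB: " + filesUntilExhausted(fourGiB));
    }
}
```

      Under these assumptions the leak for 5000 files is 5,570,560,000 bytes, and memory runs out after 3855 files. The observed failure at around record 3700 comes slightly earlier, which is plausible if other allocations share the same pool.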


