Apache Drill / DRILL-5275

Sort spill serialization is slow due to repeated buffer allocations


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.10.0
    • Fix Version/s: 1.10.0
    • None

    Description

      Drill provides a sort operator that spills to disk. The spill and read operations use the serialization code in the VectorAccessibleSerializable class. This code, in turn, uses the DrillBuf.getBytes() method to write to an output stream. (Yes, the "get" method writes, and the "write" method reads...)

      The DrillBuf method, in turn, calls the UDLE (UnsafeDirectLittleEndian) method, which does:

                  byte[] tmp = new byte[length];
                  PlatformDependent.copyMemory(addr(index), tmp, 0, length);
                  out.write(tmp);
      

      That is, the code allocates a new heap buffer for each write. Since Drill buffers can be quite large (4, 8, 16 MB or larger), these allocations rapidly fill the heap and trigger frequent garbage collection (GC).

      The result is slow performance. On a Mac with an SSD that can do 700 MB/s of I/O, we get only about 40 MB/s, very likely because of excessive CPU cost and GC.

      The solution is to allocate a single read or write buffer, then reuse that same buffer for every read or write. This must be done in VectorAccessibleSerializable, as it is a per-thread class that has visibility to all the buffers to be written.
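
      The reuse idea can be sketched roughly as follows. This is only an illustration, not the actual Drill fix: the class name ReusableSpillBuffer and its methods are hypothetical, and a java.nio.ByteBuffer stands in for DrillBuf so the example needs nothing beyond the JDK. One scratch array is allocated up front, and data is then copied through it in chunks, so a 16 MB vector no longer forces a 16 MB heap allocation on every write.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not part of Drill): copies between a buffer and a
 * stream through one scratch array that is allocated once and then reused.
 * Not thread-safe; the assumption is one instance per reader/writer thread.
 */
final class ReusableSpillBuffer {
  private final byte[] scratch;

  ReusableSpillBuffer(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    this.scratch = new byte[chunkSize]; // the only heap allocation
  }

  /** Writes the remaining bytes of src to out, one chunk at a time. */
  void write(ByteBuffer src, OutputStream out) throws IOException {
    // Work on a duplicate so the caller's position and limit are untouched.
    ByteBuffer view = src.duplicate();
    while (view.hasRemaining()) {
      int n = Math.min(view.remaining(), scratch.length);
      view.get(scratch, 0, n);
      out.write(scratch, 0, n);
    }
  }

  /** Reads exactly length bytes from in and appends them to dest. */
  void read(InputStream in, ByteBuffer dest, int length) throws IOException {
    int remaining = length;
    while (remaining > 0) {
      int n = in.read(scratch, 0, Math.min(remaining, scratch.length));
      if (n < 0) {
        throw new EOFException("stream ended with " + remaining + " bytes missing");
      }
      dest.put(scratch, 0, n);
      remaining -= n;
    }
  }
}
```

      The chunk size is a tuning choice that the description above does not specify. A modest fixed size keeps the scratch array small regardless of how large the individual Drill buffers are.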

            People

              Assignee: Paul Rogers (paul-rogers)
              Reporter: Paul Rogers (paul-rogers)
              Boaz Ben-Zvi