Apache Drill / DRILL-8096

format-excel reader: support different Shared String implementations

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Execution - Data Types
    • Labels: None

    Description

      Handling the Shared Strings Table is one of the biggest consumers of memory and processing time when reading Excel files.

      excel-streaming-reader v3.3.0 supports three Shared Strings Table implementations.

      I suggest that Drill use the ReadOnlySharedStringTable as the default.

      Drill currently uses the full-featured Apache POI SharedStringTable by default, which requires more memory and parsing effort.

      There is also a TempFileSharedStringTable, which uses a temp file to keep the data out of heap memory. This is still pretty fast because it is implemented using an H2 database MVMap.

      If letting users configure which implementation they want sounds useful, I can do a PR.
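
      To make the configuration idea concrete, here is a minimal sketch of how a user-supplied option string might be resolved to one of the three implementations. Everything in it is an assumption for illustration: the enum name SharedStringsMode, the option values (read_only, full, temp_file) and the fromConfig helper are hypothetical and are not existing Drill or excel-streaming-reader identifiers. How the chosen value would then be handed to the reader is left out on purpose.

```java
import java.util.Locale;

/**
 * Hypothetical option values, one per implementation named in this issue.
 * These names are illustrative only; they are not part of Drill or
 * excel-streaming-reader.
 */
enum SharedStringsMode {
    READ_ONLY,  // ReadOnlySharedStringTable: the proposed default
    FULL,       // Apache POI SharedStringTable: what Drill uses today
    TEMP_FILE;  // TempFileSharedStringTable: keeps strings out of heap memory

    /**
     * Resolves a user-supplied config string to a mode.
     * A missing or blank value falls back to the proposed default.
     */
    static SharedStringsMode fromConfig(String value) {
        if (value == null || value.trim().isEmpty()) {
            return READ_ONLY;
        }
        // Accept "temp-file", "Temp_File", " TEMP_FILE " and similar spellings.
        String normalized = value.trim().toUpperCase(Locale.ROOT).replace('-', '_');
        try {
            return valueOf(normalized);
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException(
                "Unknown shared strings implementation '" + value
                    + "'; expected one of read_only, full, temp_file", e);
        }
    }
}
```

      With this sketch, SharedStringsMode.fromConfig(null) yields READ_ONLY, fromConfig("temp-file") yields TEMP_FILE, and an unrecognized value fails fast with a message listing the accepted options rather than silently falling back.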

       

          People

            Assignee: Unassigned
            Reporter: PJ Fanning (pj.fanning)
            Votes: 1
            Watchers: 3
