Parquet / PARQUET-580

Potentially unnecessary creation of large int[] in IntList for columns that aren't used


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.9.0, 1.8.2
    • Component/s: None
    • Labels: None

    Description

      While importing a dataset with a lot of columns (a few thousand), most of which weren't being used, we noticed that we ended up allocating a lot of unnecessary int arrays (each 64K in size), because an IntList object is created for every column. The heap footprint of all those int[]s turned out to be around 2GB, which caused some jobs to OOM. This allocation seems unnecessary for columns that may never be used.
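
      A minimal sketch of the kind of lazy allocation that could avoid this (the class and field names here are illustrative, not the actual IntList internals):

      {code:java}
      // Illustrative sketch only: defer allocating the backing array until the
      // first value is written, so unused columns never pay the up-front cost.
      public class LazyIntList {
        private static final int SLAB_SIZE = 64 * 1024; // matches the current slab size

        private int[] slab; // stays null until the first add()
        private int count;

        public void add(int value) {
          if (slab == null) {
            slab = new int[SLAB_SIZE]; // allocated lazily, on first use
          }
          slab[count++] = value;
          // (a real implementation would chain further slabs once this one fills)
        }

        public int size() {
          return count;
        }
      }
      {code}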

      We're also wondering whether 64K is the right size to start off with. A potential improvement would be to allocate these int[]s in IntList in a way that slowly ramps up their size: rather than creating 64K arrays from the start (which is potentially wasteful if only a few hundred values are ever stored), we could create, say, a 4K int[], then an 8K one when that fills up, and so on until we reach 64K (at which point the behavior is the same as the current implementation).
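
      A minimal sketch of that ramp-up scheme (the 4K starting size and all names are illustrative assumptions, not the actual implementation):

      {code:java}
      import java.util.ArrayList;
      import java.util.List;

      // Illustrative sketch only: each newly allocated slab doubles in size,
      // starting small and capping at the current 64K behavior.
      public class RampedIntList {
        private static final int INITIAL_SLAB_SIZE = 4 * 1024;
        private static final int MAX_SLAB_SIZE = 64 * 1024;

        private final List<int[]> slabs = new ArrayList<>();
        private int[] current;   // slab currently being filled
        private int index;       // next free position in 'current'
        private int nextSize = INITIAL_SLAB_SIZE;

        public void add(int value) {
          if (current == null || index == current.length) {
            current = new int[nextSize];
            slabs.add(current);
            index = 0;
            // Ramp up: 4K -> 8K -> 16K -> 32K -> 64K, then stay at 64K.
            nextSize = Math.min(nextSize * 2, MAX_SLAB_SIZE);
          }
          current[index++] = value;
        }
      }
      {code}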


          People

            Assignee: pnarang (Piyush Narang)
            Reporter: pnarang (Piyush Narang)
            Votes: 0
            Watchers: 2

