Spark / SPARK-24133

Reading Parquet files containing large strings can fail with java.lang.ArrayIndexOutOfBoundsException


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      ColumnVectors store string data in one big byte array. Since the array size is capped at just under Integer.MAX_VALUE, a single ColumnVector cannot store more than 2GB of string data.

      However, since Parquet files commonly contain large blobs stored as strings, and ColumnVectors hold 4096 values by default, it is entirely possible to go past that limit.
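
      As a rough illustration (the per-value size below is an assumed figure, not taken from this report), 4096 values of roughly 600 KB each already push the total byte count past Integer.MAX_VALUE, so the computed size wraps around to a negative int:

          // Hypothetical sizes, for illustration only.
          public class OverflowSketch {
            public static void main(String[] args) {
              int batchSize = 4096;          // default ColumnarBatch row count
              int bytesPerValue = 600_000;   // assumed ~600 KB per string value
              int required = batchSize * bytesPerValue;
              System.out.println(required);  // prints -1837367296: the requested size went negative
            }
          }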

      In such cases a negative capacity is requested from WritableColumnVector.reserve(). The call succeeds (the requested capacity appears smaller than the capacity already allocated), and consequently a java.lang.ArrayIndexOutOfBoundsException is thrown when the reader actually attempts to put the data into the array.

      This behavior is hard for users to troubleshoot. Spark should instead check for a negative requested capacity in WritableColumnVector.reserve() and throw a more informative error, instructing the user to reduce the ColumnarBatch size.
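
      A minimal sketch of the suggested check, using a simplified stand-in for WritableColumnVector.reserve() (the error message and the spark.sql.parquet.columnarReaderBatchSize workaround it mentions are assumptions, not the committed fix):

          // Standalone sketch, not the actual Spark class.
          public class ReserveCheckSketch {
            private int capacity = 4096;

            void reserve(int requiredCapacity) {
              if (requiredCapacity < 0) {
                // A negative request means the accumulated byte size overflowed an int,
                // i.e. the batch holds more than ~2GB of string data.
                throw new RuntimeException(
                    "Cannot reserve more than Integer.MAX_VALUE bytes in a single ColumnVector; " +
                    "try reducing the columnar batch size, e.g. spark.sql.parquet.columnarReaderBatchSize.");
              }
              if (requiredCapacity > capacity) {
                capacity = (int) Math.min(Integer.MAX_VALUE - 15L, requiredCapacity * 2L);
                // ... grow the backing byte array to the new capacity ...
              }
            }

            public static void main(String[] args) {
              new ReserveCheckSketch().reserve(4096 * 600_000);  // negative request -> informative error
            }
          }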


          People

            Assignee: Ala Luszczak
            Reporter: Ala Luszczak