Apache Hudi / HUDI-7267

CSI can cause data loss during SQL queries

Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: index

    Description

      As the attached picture shows, the column stats index (CSI) uses the Parquet column chunk metadata to calculate min/max values and saves them to the metadata table (MDT) column stats. For complex columns, such as *info array<struct<name: string, age: int>>*, the Parquet metadata contains only the leaf columns `info.array.name` and `info.array.age`, but Hudi only looks up the `info` column, so the stats stored for it in the MDT are null.

      If a SQL expression then contains `IsNotNull(info)`, every file is skipped, so the query returns no rows from them.
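
The failure mode described above can be sketched with a small, self-contained model. This is not Hudi code: the dictionary layout and the names `build_index_row` and `naive_is_not_null` are hypothetical, and the sketch assumes that `IsNotNull(col)` is evaluated as a null-count comparison against the indexed stats, with SQL semantics where a comparison against NULL is not true.

```python
# Hypothetical, simplified model of stats-based file skipping.
# Leaf column paths as they might appear in a Parquet footer for
#   info array<struct<name: string, age: int>>
LEAF_STATS = {
    "id": {"min": 1, "max": 9, "null_count": 0},
    "info.array.name": {"min": "a", "max": "z", "null_count": 0},
    "info.array.age": {"min": 18, "max": 60, "null_count": 0},
}
VALUE_COUNT = 9  # assumed number of rows in the file


def build_index_row(column, leaf_stats, value_count):
    """Build one index row by looking up the top-level column name only."""
    stats = leaf_stats.get(column)
    if stats is None:
        # No footer entry is keyed by "info", so every stat ends up null.
        return {"min": None, "max": None, "null_count": None,
                "value_count": value_count}
    return {**stats, "value_count": value_count}


def naive_is_not_null(row):
    """Keep the file only if null_count < value_count is strictly true."""
    null_count = row["null_count"]
    if null_count is None:
        # Comparison against NULL is not true, so the file is dropped.
        return False
    return null_count < row["value_count"]


id_row = build_index_row("id", LEAF_STATS, VALUE_COUNT)
info_row = build_index_row("info", LEAF_STATS, VALUE_COUNT)

print(naive_is_not_null(id_row))    # True: the file is read
print(naive_is_not_null(info_row))  # False: the file is wrongly skipped
```

In this model the file clearly holds non-null `info` values, yet it is pruned because the index row for `info` carries no stats.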

      The same applies to ordinary columns that are added in the future: old files will not contain such a column, which may cause similar problems. So, to keep the code logic clean, check for null before evaluating the values: min/max/nullValue.
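
A minimal sketch of the proposed null check, again as a standalone model and not the actual Hudi change. The function names and the stats dictionary layout are assumptions made for illustration. The idea is that missing stats mean "unknown", so the file has to be kept as a candidate and cannot be pruned.

```python
from typing import Any, Optional


def may_match_is_not_null(stats: Optional[dict]) -> bool:
    """Return True if the file may contain non-null values for the column."""
    if stats is None:
        # No index row at all, e.g. an old file written before the column existed.
        return True
    null_count = stats.get("null_count")
    value_count = stats.get("value_count")
    if null_count is None or value_count is None:
        # Stats were never collected (e.g. complex column): do not skip.
        return True
    return null_count < value_count


def may_match_greater_than(stats: Optional[dict], literal: Any) -> bool:
    """Return True if the file may contain a value greater than `literal`."""
    if stats is None or stats.get("max") is None:
        # Unknown max: the file cannot be ruled out.
        return True
    return stats["max"] > literal


missing = {"min": None, "max": None, "null_count": None, "value_count": 9}
all_null = {"min": None, "max": None, "null_count": 9, "value_count": 9}

print(may_match_is_not_null(missing))   # True: stats unknown, keep the file
print(may_match_is_not_null(all_null))  # False: every value is null, safe to skip
```

Skipping only happens when the stats positively prove that no row can match. That is the conservative choice here, since an unnecessary read costs time while a wrong skip loses rows.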

          People

            Assignee: Unassigned
            Reporter: KnightChess
