Project: Apache Arrow
Parent: ARROW-8421 [Rust] [Parquet] Implement parquet writer
Issue: ARROW-9728

[Rust] [Parquet] Compute nested definition and repetition for structs

Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.0.0
    • Fix Version/s: 3.0.0
    • Component/s: Rust
    Description

      When computing definition levels for deeply nested arrays that include lists, the definition levels are correctly calculated, but they are not translated into correct indexes for the eventual primitive arrays.

      For example, an int32 array could have no null values but be a child of a list that has null values. If, say, the first 5 values of the int32 array are members of the first list item (i.e. list_array[0] = [1,2,3,4,5]), and that list is itself a child of a struct that is null at that index, all 5 values of the int32 array should be skipped. Further, the list's definition and repetition levels will be represented by 1 slot instead of 5.
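
      The scenario above can be sketched in plain Rust without the arrow or parquet crates. This is a minimal illustration, not the writer's actual implementation: `LeafLevels`, `compute_levels` and `example_from_issue` are hypothetical names, and the schema is assumed to be a nullable struct containing a nullable list of non-null int32 values, so the maximum definition level is 3.

```rust
/// Levels for the int32 leaf of a nullable `struct<nullable list<int32>>` column.
/// Hypothetical type for illustration; not part of the parquet crate.
#[derive(Debug, Default, PartialEq)]
pub struct LeafLevels {
    pub def_levels: Vec<i16>,
    pub rep_levels: Vec<i16>,
    /// Indices into the child int32 array of the values that must be written.
    pub value_indices: Vec<usize>,
}

/// Walks the parent validity and list offsets row by row.
/// Assumed definition levels: 0 = null struct, 1 = null list,
/// 2 = empty list, 3 = value present.
pub fn compute_levels(
    struct_valid: &[bool],
    list_valid: &[bool],
    offsets: &[usize],
) -> LeafLevels {
    assert_eq!(struct_valid.len(), list_valid.len());
    assert_eq!(offsets.len(), struct_valid.len() + 1);

    let mut out = LeafLevels::default();
    for row in 0..struct_valid.len() {
        let (start, end) = (offsets[row], offsets[row + 1]);
        if !struct_valid[row] || !list_valid[row] || start == end {
            // A null or empty parent takes exactly one level slot, and any
            // child values in start..end are skipped rather than written.
            let def = if !struct_valid[row] {
                0
            } else if !list_valid[row] {
                1
            } else {
                2
            };
            out.def_levels.push(def);
            out.rep_levels.push(0);
            continue;
        }
        for idx in start..end {
            out.def_levels.push(3);
            // The first element starts a new record; the rest repeat the list.
            out.rep_levels.push(if idx == start { 0 } else { 1 });
            out.value_indices.push(idx);
        }
    }
    out
}

/// Row 0: null struct whose list slot still spans child values 0..5.
/// Row 1: list with two values (child indices 5 and 6). Row 2: empty list.
pub fn example_from_issue() -> LeafLevels {
    compute_levels(&[false, true, true], &[true, true, true], &[0, 5, 7, 7])
}
```

      In `example_from_issue`, the five child values under the null struct contribute a single slot with definition level 0, and only child indices 5 and 6 are selected for writing. Tracking the indices separately from the levels is the point of the sketch: slicing the child array by level count alone would pick up some of the skipped values.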

      The current logic does not handle this case and can slice the int32 array incorrectly (sometimes including some of those first 5 values).

      This Jira is for the work necessary to compute the index into the eventual leaf arrays correctly.

      I started doing it as part of the initial writer PR, but it's complex and is blocking progress.

            People

              Assignee: nevi_me Neville Dipale
              Reporter: nevi_me Neville Dipale
              Votes: 1
              Watchers: 5

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 3h 20m