Apache Arrow / ARROW-11409

[Integration] Enable Arrow to read Parquet files from Spark 2.x with illegal nulls

Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: Integration
    • Labels: None

    Description

      While running integration tests with Arrow and Spark, I observed that Spark 2.x can, in some circumstances, write Parquet files that contain nulls in non-nullable columns, which is illegal. (This appears to have been fixed in Spark 3.0.) Arrow throws an "Unexpected end of stream" error when it attempts to read such files.

      The attached Parquet file, written by Spark 2.0.0, can be used to reproduce this behavior. It contains a single column, a non-nullable integer named x, with three records:

      +-----+
      |    x|
      +-----+
      |    1|
      | null|
      |    3|
      +-----+ 
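
To make the violated invariant concrete, here is a minimal pure-Python sketch that flags nulls in columns declared non-nullable, using the three records shown above. It is not Arrow or Spark code and does not read the attached file. The function name `find_illegal_nulls` and the dict-based schema (column name mapped to a nullable flag) are hypothetical and chosen only for illustration.

```python
from typing import Any, Dict, List, Tuple


def find_illegal_nulls(
    rows: List[Dict[str, Any]],
    nullable: Dict[str, bool],
) -> List[Tuple[int, str]]:
    """Return (row_index, column_name) for each null in a non-nullable column."""
    violations = []
    for index, row in enumerate(rows):
        for column, is_nullable in nullable.items():
            # A missing key is treated the same as an explicit None.
            if not is_nullable and row.get(column) is None:
                violations.append((index, column))
    return violations


# The three records from the table above; column x is declared non-nullable.
rows = [{"x": 1}, {"x": None}, {"x": 3}]
schema = {"x": False}  # hypothetical schema form: column name -> nullable flag

violations = find_illegal_nulls(rows, schema)
print(violations)  # [(1, 'x')]
```

A check of this kind, run before writing, would reject the second record rather than produce a file that readers such as Arrow cannot decode.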
      

      This issue is for awareness only. I expect it to be closed as "won't fix".

          People

            Assignee: Unassigned
            Reporter: Ian Cook (icook)