  1. Apache Drill
  2. DRILL-4070

Files written with versions of Drill before v1.3 record metadata that is indistinguishable from bad metadata from other Parquet creators


Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Won't Fix
    • Affects Version/s: 1.3.0
    • Fix Version/s: 1.3.0
    • Component/s: Metadata
    • Labels: None

    Description

      Drill uses the parquet-mr library to write Parquet files. The metadata signature produced by Drill 1.2 and earlier is indistinguishable from that of older footers written by other tools (such as Pig and Hive). Those tools had a known bug that caused the statistics they wrote to be incorrect. To work around it, the parquet-mr library adopted the behavior of ignoring statistics from the old form of the Parquet footer.
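      The kind of check described above can be sketched as follows. This is a minimal illustration, not the parquet-mr implementation: the function name, the version cutoff, the "<application> version <x.y.z>" signature layout, and the sample strings are all assumptions made for the example, and the sketch ignores column types. It shows why a file written by an older Drill and a file written by an affected tool receive the same verdict when their signatures match.

```python
import re
from typing import Optional

# Hypothetical cutoff: writer versions below this are treated as producing
# unreliable statistics. The real threshold is defined by parquet-mr.
ASSUMED_FIXED_VERSION = (1, 8, 0)

# Assumed signature layout: "<application> version <major>.<minor>.<patch> ..."
_CREATED_BY = re.compile(r"^(?P<app>\S+) version (?P<ver>\d+\.\d+\.\d+)")


def should_ignore_statistics(created_by: Optional[str]) -> bool:
    """Return True when footer statistics should not be trusted."""
    if not created_by:
        return True  # no writer signature: be conservative
    match = _CREATED_BY.match(created_by)
    if match is None:
        return True  # unparseable signature: be conservative
    if match.group("app") != "parquet-mr":
        return False  # another writer; this sketch assumes it is unaffected
    version = tuple(int(part) for part in match.group("ver").split("."))
    return version < ASSUMED_FIXED_VERSION


# Illustrative signatures only. The point is that both files carry the same
# string, so the check cannot tell a good writer from a buggy one.
old_drill_file = "parquet-mr version 1.6.0"
old_hive_file = "parquet-mr version 1.6.0"
print(should_ignore_statistics(old_drill_file))
print(should_ignore_statistics(old_hive_file))
```

      Because the decision depends only on the signature string, correct statistics are discarded together with the corrupt ones unless the writer adds something to the footer that sets its files apart.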

      With 1.3, Drill upgraded to the latest version of parquet-mr and now ignores these statistics as well. This ensures correct results but causes performance regressions (compared to Drill 1.1 and 1.2) when querying partitioned Parquet files generated by Drill 1.1 and 1.2.
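      One way to picture the regression is statistics-based file pruning. The report does not spell out the mechanism, so the following is an assumption: a reader that knows each file's minimum and maximum value for a partition column can skip files whose range excludes the filter value, and loses that ability once the statistics are ignored. The `FileStats` type, the `files_to_scan` function, and the file names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    path: str
    min_value: Optional[str]  # None when the statistics are ignored
    max_value: Optional[str]


def files_to_scan(files: List[FileStats], wanted: str) -> List[str]:
    """Return the paths that must be read for the filter column = wanted."""
    selected = []
    for f in files:
        if f.min_value is None or f.max_value is None:
            selected.append(f.path)  # no usable statistics: cannot prune
        elif f.min_value <= wanted <= f.max_value:
            selected.append(f.path)  # value may be present in this file
    return selected


with_stats = [FileStats("0_0_1.parquet", "CA", "CA"),
              FileStats("0_0_2.parquet", "NY", "NY")]
without_stats = [FileStats(f.path, None, None) for f in with_stats]

print(files_to_scan(with_stats, "NY"))     # only the matching file
print(files_to_scan(without_stats, "NY"))  # every file must be read
```

      Under this assumption the query result is the same in both cases, but the second call reads every file, which matches the trade-off described above: correctness is preserved at the cost of performance.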

      Attachments

        1. cache.txt
          44 kB
          Rahul Kumar Challapalli
        2. fewtypes_varcharpartition.tar.tgz
          5 kB
          Rahul Kumar Challapalli

          People

            parthc Parth Chandra
            rkins Rahul Kumar Challapalli
            Votes: 0
            Watchers: 9

