  1. Apache Drill
  2. DRILL-3866

Parquet schema details not being utilised for metadata information

Details

    Description

      To access Parquet files from Tableau, Drill must be configured with an individual view for each Parquet schema, with every column cast to a specific data type, before Tableau can read the data correctly or even see the list of available tables.
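
      The workaround described above can be sketched as a small helper that generates one view definition per Parquet schema, casting each column explicitly. This is only an illustration: the helper `build_view_ddl`, the view name, the table path, and the column names and types are all hypothetical, and the target workspace is assumed to be writable so that Drill can store the view.

```python
from typing import Mapping


def build_view_ddl(view_name: str, table_path: str, columns: Mapping[str, str]) -> str:
    """Return a Drill CREATE VIEW statement that casts every column explicitly."""
    if not columns:
        raise ValueError("at least one column is required")
    # One CAST per column, keeping the original column name as the alias.
    select_list = ",\n".join(
        f"  CAST(`{name}` AS {sql_type}) AS `{name}`"
        for name, sql_type in columns.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n{select_list}\n"
        f"FROM {table_path}"
    )


# Hypothetical schema: names, path and types are illustrative only.
ddl = build_view_ddl(
    "dfs.tmp.`orders_view`",
    "dfs.`/data/orders`",
    {"order_id": "BIGINT", "customer": "VARCHAR(64)", "amount": "DOUBLE"},
)
print(ddl)
```

      A statement like this has to be written and maintained for every Parquet schema, which is the overhead this issue asks Drill to remove by reading the types from the file footers.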

      Understandably, this is necessary for other file formats that do not persist schema information, since Drill does not know the data type of any field until the query is executed. But why is it needed for Parquet files?

      We defined Avro schemas for each Parquet file in the AvroParquetWriter phase, and the Parquet files store the schema as part of the data. Couldn't Drill read this information from the footers and make it available to reporting tools?

      Also, as part of these investigations, some Parquet files were created using CTAS. The directory is created and the files contain the data, but the tables are not listed when we run a SHOW TABLES command. Shouldn't the metadata also be available for these tables?

      I understand that with the new REFRESH TABLE METADATA feature Drill collects all the information from the Parquet footers and stores it in a cache file, but even in this case Drill does not seem to use this information to provide metadata to reporting tools such as Tableau.

      I know this has been discussed in the past, but I could not find a Jira for this specific use case.

      My thanks to Rahul Challapalli of MapR Technologies for his help here.

          People

            Assignee: Unassigned
            Reporter: Chris Mathews (cmathews)