Apache Drill / DRILL-5471

Provide better documentation around Parquet, Options and Integration with Arrow

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.10.0
    • Fix Version/s: None
    • Component/s: Documentation
    • Labels: None

    Description

      Apache Drill makes heavy use of the Apache Parquet file format, which is a great thing. However, with the advent of Apache Arrow and JIRAs like https://issues.apache.org/jira/browse/DRILL-4455, understanding how Drill integrates with the projects that are important to it (Parquet, Arrow) is both important and very opaque to end users.

      What do I mean by this? That Arrow JIRA is interesting: it looks like there is a benefit to getting Drill and Arrow on the same path. Yet asking the community "Is there interest in this?" is a very difficult proposition. I would love to chime in on this topic, but I don't understand enough of what is happening to make an informed comment. This is true of Arrow, and it's true of Parquet.

      For Parquet, there are two readers included in Apache Drill. There are a number of encoding options in the writer, and there are settings for row group sizes, compression, etc. How do these all play out? For end users who are trying to read Parquet files created with older versions of Parquet, or with the versions of Parquet used by Spark, Impala, Hive, etc., how can we better give them things to try in order to get better performance or troubleshoot query errors?

      Yes, there are lots of JIRAs and code comments around these projects. However, it would help to have better documentation of where we are now with some of these critical projects (Calcite as well?). Are we using releases of those projects? Have we written Drill's own version (like a Parquet reader)? Are we on forks of other projects? Do we have project goals? For example, do we believe it would be a good project goal to work toward using a standardized Parquet writer while still using our own reader? What about the Arrow integration? What benefits would an end user see?

      For some of these major components, describing what the current challenges are, what the potential future states could be, and what those future states could bring the end user could help users generate interest, or even contribute to moving the future state forward. In addition, a page or pages on roadmaps, features, tweaks, etc. on the documentation website could link to relevant JIRAs and provide a way to track progress.

          People

            Assignee: Unassigned
            Reporter: John Omernik (mandoskippy)
            Votes: 0
            Watchers: 3
