Apache Drill
DRILL-315

Reading only select columns from a parquet file

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels: None

      Activity

      Jason Altekruse added a comment - https://reviews.apache.org/r/15859/
      Timothy Chen added a comment -

      Looking at your patch, it seems to also be aiming to provide projection and predicate push-down for all scans too.
      I tried clicking on your reviewboard link but it says it was marked private?

      Jason Altekruse added a comment -

      I do apologize, I had thought I finished posting to Review Board, but apparently it did not go through. It should now be available publicly for comments. I did put in syntax for predicate push-down while I was adding the plan syntax for limit and column selection. I did not plan on implementing it right away for parquet, because we likely will not get that great a performance gain from evaluating the predicates on individual values before copying them into VVs. The problem is that right now we only seem to be able to get objects or byte arrays out of the interface the way we are using it. If we copy out individual values to evaluate a predicate, I believe we'll get worse performance in a lot of cases than just copying out all of the data and sending it along to the filter operation. This is definitely worth implementing for JSON though, as the files will likely be smaller, and we have to copy/translate individual values anyway.

      Timothy Chen added a comment -

      We definitely shouldn't do a per-object filter, but we could use metadata such as statistics/indexes when they are available in Parquet.
      Perhaps remove the parts that aren't related to the patch? I see a byte[200] read that isn't related to anything as well.

      Jacques Nadeau added a comment -

      I agree, let's only include the column lists for now in this patch. How does one just pick all the columns?

      I've built on top of your stuff to try to reduce copies and improve read performance. Take my changes, merge them into a single patch, and then let's discuss. You can find my additions at: https://github.com/jacques-n/incubator-drill/tree/parquet-updates

      Jason Altekruse added a comment -

      Jacques,
      I actually just realized what that buffer was being used for. I was using it for debugging; there was no permanent use for it. I was getting a Thrift error after I had made the changes, so I made that temporary buffer to see the contents of the stream and compared them against what I was supposed to be seeing, by reverting to the old code and inserting the same debugging hack. It was just a stupid way to get around not being able to see the bytes in the buffer within the debugging environment before Thrift tried reading them and threw an exception. That also explains why I re-created the buffer at the beginning: once the bytes were read into my debugging buffer they could not be re-read out of the stream by Thrift, as there is no seek-backwards method for that type of stream.

      This is more than needed to be said, because Jacques fixed it, but we were looking at the code together and I couldn't remember for the life of me why it was there.

      I will create a modified patch with only the related changes.

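The trouble Jason describes, that bytes read out for inspection cannot be handed back because the stream cannot seek backwards, is a general property of forward-only streams. A small Python illustration (nothing Drill- or Thrift-specific; `ForwardOnly` is a made-up stand-in for such a stream):

```python
import io

class ForwardOnly(io.RawIOBase):
    """A forward-only stream, standing in for the non-seekable input in question."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
    def readable(self):
        return True
    def readinto(self, b):
        return self._buf.readinto(b)
    def seekable(self):
        return False

# Reading bytes for inspection consumes them: whoever reads next no longer
# sees them, and there is no way to seek back -- hence re-creating the stream.
stream = ForwardOnly(b"PAR1 more bytes")
debug_peek = stream.read(4)      # b'PAR1'
next_bytes = stream.read(4)      # b' mor' -- the inspected header is gone

# io.BufferedReader.peek() inspects without consuming, avoiding the
# re-create-the-stream workaround.
buffered = io.BufferedReader(ForwardOnly(b"PAR1 more bytes"))
peeked = buffered.peek(4)[:4]    # b'PAR1'
consumed = buffered.read(4)      # b'PAR1' -- still available after the peek
```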
      Jacques Nadeau added a comment -

      Hey Jason, have you been able to update this so we can merge?

      Jason Altekruse added a comment - https://reviews.apache.org/r/16620/
      Timothy Chen added a comment -

      Is this merged?

      Jacques Nadeau added a comment -

      Added in 2dfadc9


  People

    • Assignee: Jason Altekruse
    • Reporter: Jason Altekruse
    • Votes: 0
    • Watchers: 3
