Apache Drill: DRILL-1072

Drill is very slow when we have a large number of text files

      Description

      git.commit.id.abbrev=efa3274
      Build# 26178

      As the total number of files under the directory below increases, Drill becomes very slow. Compare the timings of the query below at different file counts.

      Each file contains a single number and has a '.tbl' extension.
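
To reproduce a comparable data set, a minimal sketch along these lines may help. The helper name `make_tbl_files`, the file naming scheme, and the choice of writing the file index as the single number are assumptions for illustration; the issue does not say how the original files were generated.

```python
import tempfile
from pathlib import Path


def make_tbl_files(directory: Path, count: int) -> list[Path]:
    """Create `count` files named <i>.tbl, each holding one number.

    Assumption: the number written is simply the file index; the issue
    only states that every file contains a single number.
    """
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = directory / f"{i}.tbl"
        path.write_text(f"{i}\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical local target; the issue queried /drill/testdata/morefiles.
    target = Path(tempfile.mkdtemp()) / "morefiles"
    created = make_tbl_files(target, 100)
    print(f"wrote {len(created)} files under {target}")
```

Rerunning with 250, 500, 1000 and 5000 as the count would give directories matching the file counts measured in this report.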

      select count(*) from dfs.`/drill/testdata/morefiles`;

      100 files: 5.183 seconds
      250 files: 15.021 seconds
      500 files: 26.846 seconds
      1000 files: 69.835 seconds
      5000 files: 1573.589 seconds
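
One way to read these timings is to divide each by its file count. The sketch below does only that arithmetic on the numbers reported above; the names `TIMINGS` and `seconds_per_file` are illustrative and not part of Drill.

```python
# Timings copied from the issue description: (file count, seconds).
TIMINGS = [
    (100, 5.183),
    (250, 15.021),
    (500, 26.846),
    (1000, 69.835),
    (5000, 1573.589),
]


def seconds_per_file(timings):
    """Return (file count, seconds per file) pairs."""
    return [(count, seconds / count) for count, seconds in timings]


if __name__ == "__main__":
    for count, per_file in seconds_per_file(TIMINGS):
        print(f"{count:>5} files: {per_file:.4f} s/file")
```

Up to 1000 files the cost stays at about 0.05 to 0.07 seconds per file, while at 5000 files it rises to about 0.31 seconds per file, so the slowdown at the largest count is worse than linear.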

      The logs contain these messages repeatedly when executing against 5000 files:

      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.819 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:22.840 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.864 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:23.035 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.060 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
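
To quantify how often these messages repeat, a small script could pull the timestamp and batch size out of each "text scan batch size" line. This is a sketch that assumes the log layout shown in the excerpt above; the names `PATTERN`, `parse_batches` and `summarize` are hypothetical, and `SAMPLE` holds only the five batch-size lines quoted in this issue.

```python
import re
from datetime import datetime

# The five "text scan batch size" lines from the excerpt in this issue.
SAMPLE = """\
22:02:22.819 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
22:02:22.864 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
22:02:23.060 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
"""

# Captures the HH:MM:SS.mmm timestamp and the trailing batch size.
PATTERN = re.compile(
    r"^\s*(\d{2}:\d{2}:\d{2}\.\d{3})\s.*text scan batch size (\d+)\s*$"
)


def parse_batches(lines):
    """Return (timestamp, batch size) for every matching log line."""
    batches = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            batches.append((stamp, int(match.group(2))))
    return batches


def summarize(batches):
    """Count batches, empty batches, total rows, and the mean gap in seconds."""
    sizes = [size for _, size in batches]
    gaps = [
        (later[0] - earlier[0]).total_seconds()
        for earlier, later in zip(batches, batches[1:])
    ]
    return {
        "batches": len(sizes),
        "empty": sizes.count(0),
        "rows": sum(sizes),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }


if __name__ == "__main__":
    print(summarize(parse_batches(SAMPLE.splitlines())))
```

Pointing `parse_batches` at the lines of a full Drillbit log instead of `SAMPLE` would show how many batches were empty over the whole query. The timestamps carry no date, so gaps are only meaningful within a single day.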

          Activity

          Rahul Challapalli created issue -
          Rahul Challapalli added a comment -

          FYI: count queries against a table of 5000 Parquet files run very fast. However, the query below took 152.711 seconds and returned 0 rows (as expected). Each Parquet file contains only one number.

          select * from dfs.`/drill/testdata/morefiles/parquet` where idx > 1;

          Jacques Nadeau made changes -
          Field Original Value New Value
          Fix Version/s 0.5.0 [ 12324880 ]
          Jacques Nadeau made changes -
          Assignee Steven Phillips [ sphillips ]
          Steven Phillips added a comment -

          Is the 1573 seconds shown in the bug description a typo? You report 152.7 seconds for the same table in the first comment.

          Sudheesh Katkam made changes -
          Due Date 15/Aug/14
          Jacques Nadeau made changes -
          Fix Version/s 0.6.0 [ 12327472 ]
          Fix Version/s 0.5.0 [ 12324880 ]
          Parth Chandra made changes -
          Fix Version/s 0.8.0 [ 12328812 ]
          Fix Version/s 0.6.0 [ 12327472 ]
          Jacques Nadeau made changes -
          Priority Major [ 3 ] Minor [ 4 ]
          Jacques Nadeau made changes -
          Fix Version/s 0.9.0 [ 12328813 ]
          Fix Version/s 0.8.0 [ 12328812 ]
          Tony Stevenson made changes -
          Workflow no-reopen-closed, patch-avail, testing [ 12871822 ] Drill workflow [ 12935132 ]
          Steven Phillips added a comment -

          There have been some improvements to query planning with lots of files.
          Rahul Challapalli, could you please run this test again to see where we stand?

          Rahul Challapalli added a comment -

          git.commit.id.abbrev=f1b59ed

          The numbers do look a lot better now.

          100 files: 1.145 seconds
          250 files: 2.678 seconds
          500 files: 5.263 seconds
          1000 files: 12.614 seconds
          5000 files: 55.8 seconds
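
The improvement can be expressed as a speedup factor per file count. The sketch below assumes both runs used the same query and data layout, which the thread implies but does not state outright; `BEFORE`, `AFTER` and `speedups` are illustrative names.

```python
# Seconds per run, keyed by file count, as reported in this issue.
BEFORE = {100: 5.183, 250: 15.021, 500: 26.846, 1000: 69.835, 5000: 1573.589}  # efa3274
AFTER = {100: 1.145, 250: 2.678, 500: 5.263, 1000: 12.614, 5000: 55.8}  # f1b59ed


def speedups(before, after):
    """Return before/after ratios for file counts present in both runs."""
    return {n: before[n] / after[n] for n in sorted(before) if n in after}


if __name__ == "__main__":
    for count, factor in speedups(BEFORE, AFTER).items():
        per_file = AFTER[count] / count
        print(f"{count:>5} files: {factor:5.1f}x faster, {per_file:.4f} s/file")
```

On these numbers the newer build is roughly 4.5 to 5.6 times faster up to 1000 files and about 28 times faster at 5000 files, with the per-file cost staying near 0.01 seconds across all counts.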

          Chris Westin made changes -
          Fix Version/s 1.0.0 [ 12325568 ]
          Fix Version/s 0.9.0 [ 12328813 ]
          Chris Westin made changes -
          Link This issue is duplicated by DRILL-1681 [ DRILL-1681 ]
          Chris Westin made changes -
          Fix Version/s 1.1.0 [ 12329689 ]
          Fix Version/s 1.0.0 [ 12325568 ]
          Chris Westin made changes -
          Fix Version/s 1.2.0 [ 12332042 ]
          Fix Version/s 1.1.0 [ 12329689 ]
          Parth Chandra made changes -
          Fix Version/s 1.4.0 [ 12332947 ]
          Fix Version/s 1.2.0 [ 12332042 ]
          Jacques Nadeau made changes -
          Fix Version/s Future [ 12326743 ]
          Fix Version/s 1.4.0 [ 12332947 ]
          Assignee Steven Phillips [ sphillips ]
          Rahul Challapalli made changes -
          Status Open [ 1 ] Closed [ 6 ]
          Resolution Fixed [ 1 ]
          Transition: Open to Closed
          Time In Source Status: 470d 6h 36m
          Execution Times: 1
          Last Executer: Rahul Challapalli
          Last Execution Date: 09/Oct/15 00:54

            People

            • Assignee: Unassigned
            • Reporter: Rahul Challapalli
            • Votes: 0
            • Watchers: 3
