Apache Drill / DRILL-1072

Drill is very slow when we have a large number of text files

    Details

      Description

      git.commit.id.abbrev=efa3274
      Build# 26178

      As the total number of files under the directory below increases, Drill becomes very slow. The results below show the query time for different file counts.

      Each file contains a single number and has a '.tbl' extension.

      select count(*) from dfs.`/drill/testdata/morefiles`;

      100 files — 5.183 seconds
      250 files — 15.021 seconds
      500 files — 26.846 seconds
      1000 files — 69.835 seconds
      5000 files — 1573.589 seconds
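
      To make the scaling easier to read, the timings above can be divided by the file count. The following is a minimal sketch using only the numbers reported here; the helper name `seconds_per_file` is chosen for illustration. It suggests a roughly flat cost of about 52 to 70 ms per file up to 1000 files, rising to about 315 ms per file at 5000 files, so the slowdown is worse than linear.

```python
# File counts and elapsed seconds copied from the measurements above.
timings = {100: 5.183, 250: 15.021, 500: 26.846, 1000: 69.835, 5000: 1573.589}


def seconds_per_file(results):
    """Return {file_count: elapsed_seconds / file_count}."""
    return {count: elapsed / count for count, elapsed in results.items()}


per_file = seconds_per_file(timings)
for count, cost in per_file.items():
    # Report the average cost of one file in milliseconds.
    print(f"{count:>5} files: {cost * 1000:7.1f} ms per file")
```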

      The logs contain these messages repeatedly when executing against 5000 files:

      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.819 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:22.840 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.864 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:23.035 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.060 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
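
      One way to summarize these messages is to extract the "text scan batch size" lines and measure the time between them. The sketch below does this for the five such lines in the excerpt above; the function names and the regular expression are illustrative assumptions, not part of Drill. On this excerpt the gaps are roughly 20 ms, with one gap of 172 ms.

```python
import re
from datetime import datetime

# Matches a timestamp at the start of the line and the reported batch size at the end.
LINE = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) .* text scan batch size (\d+)$")


def batch_events(lines):
    """Return (timestamp, batch_size) for every 'text scan batch size' line."""
    found = []
    for line in lines:
        match = LINE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            found.append((stamp, int(match.group(2))))
    return found


def gaps_ms(found):
    """Return the gap in milliseconds between consecutive events."""
    return [round((b[0] - a[0]).total_seconds() * 1000) for a, b in zip(found, found[1:])]


# The batch-size lines from the log excerpt above.
PREFIX = "[b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - "
sample = [
    f"22:02:22.819 {PREFIX}text scan batch size 5",
    f"22:02:22.841 {PREFIX}text scan batch size 0",
    f"22:02:22.864 {PREFIX}text scan batch size 5",
    f"22:02:23.036 {PREFIX}text scan batch size 0",
    f"22:02:23.060 {PREFIX}text scan batch size 5",
]

events = batch_events(sample)
print([size for _, size in events])  # batch sizes in log order
print(gaps_ms(events))               # milliseconds between consecutive messages
```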

          Activity

          Rahul Challapalli added a comment -

          FYI: count queries on a table with 5000 Parquet files appear to run very fast. However, the query below took 152.711 seconds and returned 0 rows (as expected). Each Parquet file contains only one number.

          select * from dfs.`/drill/testdata/morefiles/parquet` where idx > 1;

          Steven Phillips added a comment -

          Is the 1573 seconds shown in the bug description a typo? The first comment reports 152.7 seconds for the same table.

          Steven Phillips added a comment -

          There have been some improvements to query planning with large numbers of files.
          Rahul Challapalli, could you please run this test again to see where we stand?

          Rahul Challapalli added a comment -

          git.commit.id.abbrev=f1b59ed

          The numbers do look a lot better now.

          100 files: 1.145 seconds
          250 files: 2.678 seconds
          500 files: 5.263 seconds
          1000 files: 12.614 seconds
          5000 files: 55.8 seconds
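
          For comparison, the two sets of measurements can be set side by side. This is a minimal sketch using only the timings reported in this issue for builds efa3274 and f1b59ed; the helper name `speedups` is illustrative. It suggests a speedup of about 4.5x to 5.6x up to 1000 files and about 28x at 5000 files, with the newer build staying near 11 ms per file across all file counts.

```python
# Elapsed seconds per file count, copied from the two sets of measurements in this issue.
before = {100: 5.183, 250: 15.021, 500: 26.846, 1000: 69.835, 5000: 1573.589}  # efa3274
after = {100: 1.145, 250: 2.678, 500: 5.263, 1000: 12.614, 5000: 55.8}  # f1b59ed


def speedups(old, new):
    """Return {file_count: old_seconds / new_seconds} for counts present in both runs."""
    return {count: old[count] / new[count] for count in old if count in new}


ratio = speedups(before, after)
for count in sorted(ratio):
    ms_per_file = after[count] / count * 1000
    print(f"{count:>5} files: {ratio[count]:5.1f}x faster, {ms_per_file:5.1f} ms per file")
```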


            People

            • Assignee: Steven Phillips
            • Reporter: Rahul Challapalli