Apache Drill
  DRILL-1072

Drill is very slow when we have a large number of text files

    Details

      Description

      git.commit.id.abbrev=efa3274
      Build# 26178

      As the total number of files under the directory below increases, Drill becomes very slow. Compare the results of the query below for different file counts.

      Each file contains a single number and has a '.tbl' extension.
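A minimal sketch of how a comparable set of test files could be generated, in case it helps with reproduction. The function name, the file naming scheme, the stored value (the file index), and the temporary directory are all assumptions made for illustration; the report does not say how the original files were created.

```python
import tempfile
from pathlib import Path


def make_tbl_files(directory: Path, count: int) -> list:
    """Create `count` files named 0.tbl, 1.tbl, ... each holding one number."""
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for i in range(count):
        path = directory / f"{i}.tbl"
        # Assumption: the single number stored in each file is its index.
        path.write_text(f"{i}\n")
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical local stand-in for /drill/testdata/morefiles.
    with tempfile.TemporaryDirectory() as tmp:
        created = make_tbl_files(Path(tmp) / "morefiles", 100)
        print(len(created), "files created")
```

Changing the `count` argument to 250, 500, 1000, or 5000 would give the other file counts listed in this report.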

      select count(*) from dfs.`/drill/testdata/morefiles`;

      100 files — 5.183 seconds
      250 files — 15.021 seconds
      500 files — 26.846 seconds
      1000 files — 69.835 seconds
      5000 files — 1573.589 seconds
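To make the scaling easier to see, the timings above can be divided by the file count. The sketch below only does that arithmetic on the reported numbers; the variable and function names are illustrative. The cost works out to roughly 52 ms per file at 100 files and roughly 315 ms per file at 5000 files, so the per-file cost itself grows about sixfold rather than staying constant.

```python
# (file count, elapsed seconds) pairs copied from the timings reported above.
timings = [
    (100, 5.183),
    (250, 15.021),
    (500, 26.846),
    (1000, 69.835),
    (5000, 1573.589),
]


def seconds_per_file(files: int, seconds: float) -> float:
    """Average elapsed time attributed to one file."""
    return seconds / files


per_file = [seconds_per_file(n, s) for n, s in timings]

if __name__ == "__main__":
    for (files, _), cost in zip(timings, per_file):
        print(f"{files:>5} files: {cost * 1000:7.1f} ms per file")
```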

      The logs contain these messages repeatedly when executing against 5000 files:

      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.818 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.819 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:22.840 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:22.863 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:22.864 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
      22:02:23.035 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector value capacity 65536
      22:02:23.059 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - vector byte capacity 32767500
      22:02:23.060 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
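One way to quantify the repetition is to extract the "text scan batch size" lines and measure the time between them. The sketch below does this for the five such lines in the excerpt above; the regular expression and the helper names are assumptions for illustration, and the sketch makes no claim about what each batch corresponds to inside Drill.

```python
import re
from datetime import datetime, timedelta

# The five "text scan batch size" lines from the excerpt above.
LOG = """\
22:02:22.819 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
22:02:22.841 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
22:02:22.864 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
22:02:23.036 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 0
22:02:23.060 [b5a7fdd3-f788-4a40-9fd7-bf525bad09e3:frag:0:0] DEBUG o.a.d.e.s.text.DrillTextRecordReader - text scan batch size 5
"""

PATTERN = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3}) .* text scan batch size (\d+)$")


def batch_events(log: str) -> list:
    """Return (timestamp, batch size) for every matching log line."""
    events = []
    for line in log.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            events.append((stamp, int(match.group(2))))
    return events


def gaps_ms(events: list) -> list:
    """Whole milliseconds between consecutive events."""
    stamps = [stamp for stamp, _ in events]
    return [(b - a) // timedelta(milliseconds=1) for a, b in zip(stamps, stamps[1:])]


events = batch_events(LOG)
sizes = [size for _, size in events]
gaps = gaps_ms(events)

if __name__ == "__main__":
    print("batch sizes:", sizes)  # [5, 0, 5, 0, 5]
    print("gaps (ms):", gaps)     # [22, 23, 172, 24]
```

On this excerpt the batch sizes alternate between 5 and 0, and the gaps between entries are about 22 to 24 ms with one 172 ms outlier. Running the same parser over the full log would show whether that pattern holds across all 5000 files.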

        Activity

        Rahul Challapalli added a comment -

        FYI: count queries on a table with 5000 Parquet files run very fast. However, the query below took 152.711 seconds and returned 0 rows (as expected). Each Parquet file contained only one number.

        select * from dfs.`/drill/testdata/morefiles/parquet` where idx > 1;

        Steven Phillips added a comment -

        Is the 1573 seconds shown in the bug description a typo? You report 152.7 seconds for the same table in the first comment.


          People

          Assignee: Steven Phillips
          Reporter: Rahul Challapalli