Uploaded image for project: 'CarbonData'
  1. CarbonData
  2. CARBONDATA-4177

performence issue with Query

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: core
    • Labels:
      None

      Description

      Hi Team,Hi Team,
      We are working on a POC using carbondata 2.0.1 and have come across parformance issue.Below are the details:
      1.Table creation query:

      ======================
      spark.sql("create table Flow_TS_1day_stats_16042021(start_time timestamp,end_time timestamp,source_ip_address string,destintion_ip_address string,appname string,protocol_name string,source_tos smallint,in_interface smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask smallint,destination_mask smallint, dst_tos smallint,input_pkt bigint,input_byt bigint,output_pkt bigint,output_byt bigint,source_port int,destination_port int) stored as carbondata TBLPROPERTIES ('local_dictionary_enable'='false')").show()

      TWO MVs are there on this table, Below are the queries for those MVs

      :===========================================================
      1. Network MV

      ================
      spark.sql("create materialized view Network_Level_Agg_10min_MV_with_ip_15042021_again as select timeseries(end_time,'ten_minute') as end_time,source_ip_address, destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_15042021_again group by timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

      2. Interfae MV:

      ==================Interface :==============
      spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as select timeseries(end_time,'ten_minute') as end_time, source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 group by timeseries(end_time,'ten_minute'), source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

      We are firing below query for fethcing data which is taking almost 10 seconds:

      Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= '2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by appname,input_byt order by input_byt desc LIMIT 10

       

      The above query is only fetching 10 records but it is taking almost 10 seconds to complete.
      Could you please review above schemas and help us to understand how can we get some improvement in the qury execution time. We are expectingt he response should be in subseconds.

      Table Name : RAW Table (1 Day - 300K/Sec)#Records : 25920000000

      RegardsSuyash Yadav                          

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              imsuyash suyash yadav
            • Votes:
              1 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated: