[CARBONDATA-4177] performence issue with Query - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: 2.0.1
Fix Version/s: 2.0.1
Component/s: core
Labels:
None

Description

Hi Team,Hi Team,
We are working on a POC using carbondata 2.0.1 and have come across parformance issue.Below are the details:
1.Table creation query:

======================
spark.sql("create table Flow_TS_1day_stats_16042021(start_time timestamp,end_time timestamp,source_ip_address string,destintion_ip_address string,appname string,protocol_name string,source_tos smallint,in_interface smallint,out_interface smallint,src_as bigint,dst_as bigint,source_mask smallint,destination_mask smallint, dst_tos smallint,input_pkt bigint,input_byt bigint,output_pkt bigint,output_byt bigint,source_port int,destination_port int) stored as carbondata TBLPROPERTIES ('local_dictionary_enable'='false')").show()

TWO MVs are there on this table, Below are the queries for those MVs

:===========================================================
1. Network MV

================
spark.sql("create materialized view Network_Level_Agg_10min_MV_with_ip_15042021_again as select timeseries(end_time,'ten_minute') as end_time,source_ip_address, destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_15042021_again group by timeseries(end_time,'ten_minute'),source_ip_address,destintion_ip_address, appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

2. Interfae MV:

==================Interface :==============
spark.sql("create materialized view Interface_Level_Agg_10min_MV_16042021 as select timeseries(end_time,'ten_minute') as end_time, source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface,sum(input_pkt) as input_pkt,sum(input_byt) as input_byt,sum(output_pkt) as output_pkt,sum(output_byt) as output_byt from Flow_TS_1day_stats_16042021 group by timeseries(end_time,'ten_minute'), source_ip_address,destintion_ip_address,appname,protocol_name,source_port,destination_port,source_tos,src_as,dst_as,in_interface,out_interface order by input_pkt,input_byt,output_pkt,output_byt desc").show(false)

We are firing below query for fethcing data which is taking almost 10 seconds:

Select appname,input_byt from Flow_TS_1day_stats_16042021 where end_time >= '2021-03-02 00:00:00' and end_time < '2021-03-03 00:00:00' group by appname,input_byt order by input_byt desc LIMIT 10

The above query is only fetching 10 records but it is taking almost 10 seconds to complete.
Could you please review above schemas and help us to understand how can we get some improvement in the qury execution time. We are expectingt he response should be in subseconds.

Table Name : RAW Table (1 Day - 300K/Sec)#Records : 25920000000

RegardsSuyash Yadav

Attachments

Activity

People

Assignee:: Unassigned

Reporter:: suyash yadav

Votes:: 1 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 07/May/21 03:50

Updated:: 03/Aug/21 08:42