Hi Team,
We have been running a POC with CarbonData 2.1.0: we wrote wrapper code around Carbon and deployed it as a Docker container.
Data is being loaded concurrently into many tables.
Our objective is to get optimal performance for aggregate queries by using materialized views (MVs).
Our observation is that after the MVs are created, data loading becomes slow and cannot keep up with the pace of incoming data.
The process also consumes a lot of memory once the MVs are created.
Data arrives continuously and the MVs are refreshed, which increases the load time.
Ideally, the MVs should perform only an incremental refresh, since old data does not need to be recalculated.
But it seems that a full refresh is happening, which causes the high memory usage and the increased load time.
Testing involved loading data without MVs for 6 hours, then creating the MVs and loading data for another 4 hours.
Load time increased with the MVs, creating a backlog of data (only about one fifth of the expected number of rows was loaded).
Below are the major bottlenecks observed:
1. High memory consumption after creating MVs
2. MVs doing a full refresh
Please find attached the testing details, including the list of tables.
Below is the definition of the table:
create table if not exists fact_365_1_eutrancell_1 (
  ts timestamp,
  metric STRING,
  tags_id STRING,
  value DOUBLE,
  epoch bigint
)
partitioned by (ts2 timestamp)
STORED AS carbondata
TBLPROPERTIES ('SORT_COLUMNS'='metric')
Below is the definition of the MV:
create materialized view if not exists fact_365_1_eutrancell_1_hour as
select tags_id, metric, timeseries(ts,'hour') as ts,
       sum(value), avg(value), min(value), max(value)
from fact_365_1_eutrancell_1
group by metric, tags_id, timeseries(ts,'hour')
Can you suggest why MV creation slows down ingestion so much, and what can be done to improve it?
Is there any way to have an incremental refresh of the MV, i.e. to refresh only the hour for which we are loading data?
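
For reference, here is a minimal sketch of one variation that may be worth testing: decoupling the MV refresh from the load path. It assumes that the WITH DEFERRED REFRESH clause and the REFRESH MATERIALIZED VIEW command described in the CarbonData MV guide are available and behave as documented in 2.1.0. Whether such a manual refresh is incremental or full for this timeseries MV is exactly the open question above, so this is not a confirmed fix.

```sql
-- Sketch only: assumes the deferred-refresh syntax from the CarbonData MV guide
-- applies to 2.1.0. The existing MV has to be dropped first, because
-- "if not exists" would otherwise leave the current definition in place.
drop materialized view if exists fact_365_1_eutrancell_1_hour;

-- Same query as the original MV, but refresh is no longer triggered by each load.
create materialized view if not exists fact_365_1_eutrancell_1_hour
with deferred refresh
as
select tags_id, metric, timeseries(ts,'hour') as ts,
       sum(value), avg(value), min(value), max(value)
from fact_365_1_eutrancell_1
group by metric, tags_id, timeseries(ts,'hour');

-- Run on a schedule (for example once per hour) instead of on every load.
refresh materialized view fact_365_1_eutrancell_1_hour;
```

With this setup the MV lags behind the fact table until the next refresh, so queries issued in between may not be served from the MV.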