Affects Version/s: 1.5.1
Fix Version/s: None
Environment: 3-node cluster, 32 GB and 8 cores per machine. Installed Spark 2.3.2, Hadoop, and Hive with MySQL.
With the report published on the site, it's exciting to use CarbonData in our projects.
We ran a TPC-DS test on 100 GB of data for both Parquet and CarbonData, but the results are not up to the mark: on average, CarbonData is slower than Parquet when we use getOrCreateCarbonSession. We used:
SparkSession spark = SparkSession.builder().config(sparkConf).appName("WritetocarbonData").enableHiveSupport().getOrCreate();
SparkSession.Builder builder = SparkSession.builder().config(sparkConf).master("local").appName("WritetocarbonData");
SparkSession carbon = new CarbonSession.CarbonBuilder(builder).getOrCreateCarbonSession("/home/ec2-user/efs/mysql");
We don't see CarbonData performing better than Parquet at the query level, or any significant difference at all.
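For reference, a query-level comparison like this could be timed along the following lines. This is only a minimal sketch: the QueryTimer class and its method names are hypothetical helpers, not part of Spark or CarbonData, and the warm-up and run counts are arbitrary assumptions. It runs a query a few times to warm up, then reports the median wall-clock time of the measured runs.

```java
import java.util.Arrays;

// Hypothetical helper for timing a query; not part of Spark or CarbonData.
class QueryTimer {

    // Median of the samples; for an even count, the (integer) mean of the two middle values.
    static long median(long[] samples) {
        if (samples.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samples.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        if (sorted.length % 2 == 1) {
            return sorted[mid];
        }
        return (sorted[mid - 1] + sorted[mid]) / 2;
    }

    // Runs the query `warmups` times untimed, then `runs` times timed,
    // and returns the median elapsed time in milliseconds.
    static long medianMillis(Runnable query, int warmups, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmups; i++) {
            query.run(); // warm-up: result and timing are discarded
        }
        long[] elapsed = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            query.run();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000L;
        }
        return median(elapsed);
    }
}
```

Assuming the spark and carbon sessions above and a query string q, the same query could be timed on each with `QueryTimer.medianMillis(() -> spark.sql(q).count(), 2, 5)` and `QueryTimer.medianMillis(() -> carbon.sql(q).count(), 2, 5)`; count() is used here only to force full execution of the query.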
I would like to know how you performed the benchmarking and obtained results better than Parquet.
The latest slides presented by Huawei at a conference in China showed CarbonData to be 10x to 20x faster.
Can anyone share the detailed benchmarking steps and code?