Hello everyone

With reference to the description of this issue, I am working on step 3, which involves creating a sampling job and executing the naive cube computation algorithm over the sample dataset. The requirement for this sampling job is that I should be able to select a sample size proportional to the size of the input data. The sampling job is needed to identify large groups and partition them, so that no single reducer gets overloaded with large groups.

One thing I am stuck on is dynamically choosing the sample size. In the current implementation I am using the sample operator to load a fixed-size sample (10% of the data). Since the sample size is not chosen dynamically, this fixed rate will result in oversampling for large datasets. To choose the sample size dynamically, we need to know the total number of tuples in the input dataset, but finding that count is not trivial. One way is to divide the total input size by the size of one tuple in memory. The problem with this approach is that, since a tuple is a List<Object>, the reported in-memory size of a tuple is much larger than the actual size of a row in bytes. To verify this I tested with a simple dataset:

Input file size: 319 bytes
Actual number of rows: 13
Number of dimensions: 5
Schema: int, chararray, chararray, chararray, int
Actual row size: 319/13 ~= 25 bytes
In-memory tuple size reported: 264 bytes (~10x the actual row size)

Since the in-memory tuple size is so much higher, we cannot make a good estimate of the total number of rows in the dataset, and hence of the sample size.
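For reference, this is roughly how the inflated size shows up (a minimal sketch using Pig's TupleFactory and Tuple.getMemorySize(); the field values are made up to match the schema above):

```java
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class TupleSizeCheck {
    public static void main(String[] args) {
        // Schema from the test: int, chararray, chararray, chararray, int
        Tuple t = TupleFactory.getInstance().newTuple();
        t.append(1);
        t.append("aaaa");
        t.append("bbbb");
        t.append("cccc");
        t.append(2);

        // getMemorySize() accounts for object headers, references and
        // the backing List<Object>, which is why the reported value is
        // roughly 10x larger than the on-disk row size.
        System.out.println("reported in-memory size = " + t.getMemorySize());
    }
}
```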

Other approaches:

I looked into how PoissonSampleLoader and RandomSampleLoader work. Each takes a different approach to loading the sample dataset. PoissonSampleLoader uses the distribution of the skewed keys to generate sample rows that best represent the underlying data, and it appends a special marker tuple at the end carrying the number of rows in the dataset. Since this loader is specifically meant for handling skewed keys, I cannot use it for generating my sample dataset.
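The marker-tuple idea itself is easy to mimic. A conceptual sketch (not PoissonSampleLoader's actual code; the marker string and tuple layout below are made up for illustration):

```java
import org.apache.pig.backend.executionengine.ExecException;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

public class MarkerTupleSketch {
    // Hypothetical marker value, for illustration only.
    private static final String NUM_ROWS_MARKER = "__num_rows_marker__";

    // Emitted as the very last tuple of the sample so the consuming
    // job can recover the total row count of the original input.
    public static Tuple makeMarkerTuple(long totalRows) throws ExecException {
        Tuple t = TupleFactory.getInstance().newTuple(2);
        t.set(0, NUM_ROWS_MARKER);
        t.set(1, totalRows);
        return t;
    }
}
```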

To use RandomSampleLoader, we need to specify the number of samples beforehand so that the loader stops after loading that many tuples. Since the sample size must be fixed before loading, there is no way to dynamically scale the sample to datasets of varying size.
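For completeness, a minimal instantiation (assuming the constructor of org.apache.pig.impl.builtin.RandomSampleLoader, which takes the wrapped loader's class name and the sample count, both as strings):

```java
import org.apache.pig.LoadFunc;
import org.apache.pig.impl.builtin.RandomSampleLoader;
import org.apache.pig.impl.io.InterStorage;

public class RandomSampleLoaderUsage {
    public static void main(String[] args) {
        // The count ("100" here, arbitrary) is baked in before loading
        // starts; nothing in the API lets it scale with the input size.
        LoadFunc loader = new RandomSampleLoader(
                InterStorage.class.getName(), "100");
        System.out.println("created " + loader.getClass().getSimpleName());
    }
}
```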

Also, to use either of these two loaders we need to copy the entire dataset to a temp file and then load the sample from that temp file, which costs an additional map-only job. I was not sure at first why the dataset needs to be copied out and read back; from what I can understand from the source, it is because these loader classes can only read data in the InterStorage format.

I have listed below a few pros and cons of the different approaches.

1) Using the sample operator

**Pros:**

1 less map job compared to other loaders

**Cons:**

Reads the entire dataset to generate the sample, because the sample operator is implemented as a filter with a RANDOM UDF and a less-than expression on the sample rate, applied after projecting the input columns (see the sketch after this list)

May result in oversampling for larger datasets

2) RandomSampleLoader

**Pros:**

Fixed sample size (the paper linked in the description mentions that a 2M-tuple sample is good enough to represent 20B tuples, and 100K is good enough for 2B tuples; please refer to page 6 of the paper)

Stops reading after the sample size is reached (useful for large datasets) - NOT sure about this!! Please correct me if I am wrong.

**Cons:**

1 additional map job required (including post-processing there will be 4 MR jobs, 2 of them map-only)

Since a fixed sample size is used, this method does not scale with the input

3) PoissonSampleLoader

**Pros:**

Dynamically determines sample size

Can determine the number of rows in the dataset using the special marker tuple

**Cons:**

1 additional map job required (including post-processing there will be 4 MR jobs, 2 of them map-only)

Not suitable for my use case, since the sample size generated is not proportional to the input size
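To make the con of approach 1 concrete: the sample operator boils down to a per-tuple filter, so the whole input is scanned even though only ~10% of the tuples survive. An illustrative sketch (not Pig's actual physical-plan code):

```java
import java.util.Random;

public class SampleAsFilter {
    public static void main(String[] args) {
        double sampleRate = 0.10;   // fixed 10%, not input-size aware
        Random rand = new Random();
        String[] rows = {"r1", "r2", "r3", "r4", "r5"};
        for (String row : rows) {            // every row is still read...
            if (rand.nextDouble() < sampleRate) {
                System.out.println(row);     // ...but only ~10% pass
            }
        }
    }
}
```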

I think what I need is a hybrid loader (combining ideas from the random and Poisson loaders) that dynamically chooses the number of sample tuples based on the input dataset size.

Any thoughts on how I can generate a sample size proportional to the input data size? Or is there any way I can find the number of rows in a dataset? Am I missing any other ideas for finding/estimating the row count?
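One direction I am considering, as a rough sketch only: take the total input size from HDFS, estimate the average serialized row size from a small prefix of the file (avoiding the inflated in-memory tuple size entirely), and derive a proportional sample count from the estimated row count. estimateSampleSize, prefixRows and sampleFraction are hypothetical names, not existing Pig APIs, and the byte counting assumes single-byte characters:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ProportionalSampleEstimate {
    public static long estimateSampleSize(Path input, Configuration conf,
            int prefixRows, double sampleFraction) throws IOException {
        FileSystem fs = input.getFileSystem(conf);
        long totalBytes = fs.getFileStatus(input).getLen();

        // Average on-disk row size from the first prefixRows lines;
        // +1 accounts for the newline separator.
        long prefixBytes = 0;
        int rowsRead = 0;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(fs.open(input)))) {
            String line;
            while (rowsRead < prefixRows && (line = r.readLine()) != null) {
                prefixBytes += line.length() + 1;
                rowsRead++;
            }
        }
        if (rowsRead == 0) return 0;

        // Estimated rows = total size / average row size, then take a
        // fixed fraction of that as the sample count.
        long estimatedRows = totalBytes / Math.max(1, prefixBytes / rowsRead);
        return (long) Math.ceil(estimatedRows * sampleFraction);
    }
}
```

The obvious weakness is that the prefix may not be representative of row sizes across the whole file, but it avoids both a full scan and the in-memory size problem described above.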

Updated the patch with the following changes:

1) The partition factor algorithm is tweaked to better distribute the reducer workload.

2) The partition factor in the PartitionLargeGroups UDF is now initialized to 0 (earlier it was 1), which generates many smaller bags (depending on the cardinality of the algebraic attribute). The earlier initialization to 1 generated a few large bags.

The above changes also reduced the number of records/bags spilled during the full cube materialization job. In a test experiment with 3M tuples and rollup on 3 dimensions, the following improvements were observed:

PROACTIVE_SPILL_COUNT_RECS improved by ~34% (from 5206793 to 3440694)

PROACTIVE_SPILL_COUNT_BAGS improved by ~54% (from 22 to 10)