[BIGTOP-1366] Updated, Richer Model for Generating Data for BigPetStore - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: backlog
Fix Version/s: 1.0.0
Component/s: blueprints
Labels:
- bigpetstore

Description

BigPetStore uses synthetic data as the basis for its workflow. BPS's current model for generating customer data is sufficient for basic testing of the Hadoop ecosystem, *but the model is very basic and lacks sufficient complexity for embedding interesting patterns into the data*.

As a result, *more complex, scalable testing such as testing clustering algorithms in Mahout on non-trivial data or multidimensional data with factors influencing it* is not currently possible.

Efforts are currently underway to incrementally improve the current model (see ~~BIGTOP-1271~~ and ~~BIGTOP-1272~~).

To create a model that can that incorporate *realistic, non-hierarchichal patterns* and input data to generate rich customer/transaction data with interesting correlations will require a re-imagining of the current model and its framework.

To support the improvements to the model in BigPetStore, I have been working on an *alternative ab initio model, developed from scratch*. Since the development of a new model involves substantial R&D work with more specialized tools (mathematical and plotting libraries), I'm doing the current work outside of BPS using the iPython Notebook environment. Due to the long time frame, the model will be developed on a separate timeline to prevent slowing the development of BPS.

Once the model has stabilized, I will begin incorporating the model into BPS itself. One option is to implement the model in using Scala for clean integration with *spark* which is likely to play an increasingly important role in the hadoop ecosystem, and thus will be an important part of bigpetstore as a test/blueprint app.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

BIGTOP-1366.patch
18/Nov/14 19:19
341 kB
R J Nowling

Issue Links

blocks

BIGTOP-1535 Add Spark ETL script to BigPetStore

Resolved

BIGTOP-1537 [BigPetStore] Add Spark Product Recommender example

Resolved

BIGTOP-1538 Add customer clustering example to BPS Spark

Open

BIGTOP-1539 Add Spark analytics example for distribution of time between customer purchases

Open

BIGTOP-1540 Add Spark analytics example for distribution of customer-store and transaction customer-store distances

Open

BIGTOP-1414 Add Apache Spark implementation to BigPetStore

Closed

BIGTOP-1536 BPS Spark : Unit test upgrade + Basic Sales Analytics Example.

Closed

is blocked by

BIGTOP-1471 BigPetStore: Cleanup sources

Open

is related to

HIVE-8792 add simple test data generator to Hive

Patch Available

(2 blocks, 1 is blocked by, 1 is related to)

Activity

People

Assignee:: R J Nowling

Reporter:: R J Nowling

Votes:: 0 Vote for this issue

Watchers:: 7 Start watching this issue

Dates

Created:: 07/Jul/14 14:08

Updated:: 18/Mar/15 22:47

Resolved:: 19/Nov/14 23:44

Time Tracking

Estimated:

8,736h

Remaining:

8,736h

Logged:

Not Specified