Mahout / MAHOUT-1233

Problem processing datasets as a single chunk vs. many chunks in Hadoop mode in most of the clustering algorithms


Details

    • Type: Question
    • Status: Closed
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 0.7, 0.8
    • Fix Version/s: 0.8
    • Component/s: classic
    • Labels: None

    Description

      I am trying to process a dataset in two ways.
      First I feed it as a single chunk (the whole dataset), and second as many smaller chunks in order to increase the throughput of my machine.
      The problem is that when I perform the single-chunk computation the results are fine,
      and by fine I mean that if I have 1000 vectors in the input I get 1000 vector ids with their cluster ids in the output (I have tried canopy, kmeans and fuzzy kmeans).
      However, when I split the dataset in order to speed up the computation, strange phenomena occur.
      For instance, with the same dataset of 1000 vectors split into, say, 10 files, the output contains more vector ids (e.g. 1100 vector ids with their corresponding cluster ids).
      The question is: am I doing something wrong in the process?
      Is there a problem in clusterdump and seqdumper when the input is in many files?
      I have observed, while Mahout is performing the computations, that the screen reports the correct number of vectors processed.
      Am I missing something?
      I use as input the Weka vectors transformed to mvc format.
      I have tried this with v0.7 and the v0.8 snapshot.
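
      For reference, this is roughly how I double-check the number of clustered records by reading the part files under the output directly, independently of clusterdump/seqdumper (a minimal sketch; the output/clusteredPoints path and the part-* naming are assumptions based on the default kmeans output layout):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.SequenceFile;
      import org.apache.hadoop.io.Writable;
      import org.apache.hadoop.util.ReflectionUtils;

      /**
       * Counts the records in every part-* file of a clustering output directory
       * (e.g. output/clusteredPoints) so the total can be compared with the number
       * of input vectors, without going through clusterdump or seqdumper.
       */
      public class CountClusteredPoints {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Assumed default output location of the -cl (classification) step.
          Path dir = new Path(args.length > 0 ? args[0] : "output/clusteredPoints");
          FileStatus[] parts = fs.globStatus(new Path(dir, "part-*"));
          long total = 0;
          if (parts != null) {
            for (FileStatus status : parts) {
              SequenceFile.Reader reader = new SequenceFile.Reader(fs, status.getPath(), conf);
              try {
                // Key/value classes are taken from the file itself, so this works
                // regardless of whether the values are WeightedVectorWritable etc.
                Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
                Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
                while (reader.next(key, value)) {
                  total++;
                }
              } finally {
                reader.close();
              }
            }
          }
          System.out.println("clustered records under " + dir + ": " + total);
        }
      }

      With the single-chunk input this prints the expected 1000; with the split input it shows whether the extra ids are already present in the sequence files themselves or only appear at the dumping step.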

      Thank you in advance for your time.


          People

            Assignee: yannis_at (yannis ats)
            Reporter: yannis_at (yannis ats)
            Votes: 0
            Watchers: 2
