Mahout / MAHOUT-1233

Problem processing datasets as a single chunk vs. many chunks in Hadoop mode in most of the clustering algorithms



    • Type: Question
    • Status: Closed
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 0.7, 0.8
    • Fix Version/s: 0.8
    • Component/s: Clustering
    • Labels:


      I am trying to process a dataset in two ways: first as a single chunk (the whole dataset), and second as many smaller chunks, in order to increase the throughput of my machine.
      When I perform the single-chunk computation the results are fine, by which I mean that if the input contains 1000 vectors, the output contains 1000 vector IDs with their cluster IDs (I have tried Canopy, k-means, and fuzzy k-means).
      However, when I split the dataset to speed up the computation, strange phenomena occur.
      For instance, when the same dataset of 1000 vectors is split into, say, 10 files, the output contains more vector IDs (e.g. 1100 vector IDs with their corresponding cluster IDs).
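The issue does not say how the chunk files were produced, but one hypothetical way to end up with more output vector IDs than input vectors is chunk files whose boundaries overlap, so that boundary vectors are clustered (and dumped) more than once. A minimal Python sketch of that failure mode (not Mahout code; the function and parameters are made up for illustration):

```python
def split_with_overlap(vectors, n_chunks, overlap):
    """Split `vectors` into `n_chunks` files whose boundaries overlap
    by `overlap` records, so boundary vectors appear in two chunks."""
    size = len(vectors) // n_chunks
    chunks = []
    for c in range(n_chunks):
        # BUG being illustrated: each chunk (after the first) starts
        # `overlap` records before its true boundary.
        start = max(0, c * size - overlap)
        chunks.append(vectors[start:(c + 1) * size])
    return chunks

vectors = list(range(1000))
chunks = split_with_overlap(vectors, n_chunks=10, overlap=11)
total = sum(len(c) for c in chunks)
distinct = len({v for c in chunks for v in c})
# total == 1099 (1000 distinct vectors + 9 overlapping boundaries
# duplicating 11 records each), distinct == 1000 -- roughly the
# "1100 vector IDs from 1000 inputs" symptom described above.
```

If the chunks were produced this way, the clustering jobs themselves behave correctly; they simply see duplicate records.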
      The question is: am I doing something wrong in the process?
      Is there a problem in clusterdump and seqdumper when the input is spread over many files?
      While Mahout is performing the computations, the screen output says that it processed the correct number of vectors.
      Am I missing something?
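One quick check for the questions above is whether the extra IDs in the dump are duplicates of existing keys rather than new vectors. A small Python sketch that counts total vs. distinct keys in a seqdumper-style text dump (the `Key: <key>: Value: <value>` line format is an assumption; adjust the parsing to match the actual dump):

```python
from collections import Counter

def key_counts(dump_lines):
    """Return (total key lines, Counter of key occurrences) for a
    seqdumper-style text dump."""
    keys = [line.split(":", 2)[1].strip()
            for line in dump_lines if line.startswith("Key:")]
    return len(keys), Counter(keys)

# Toy dump with one repeated key, standing in for a real seqdumper output.
lines = ["Key: 5: Value: ...", "Key: 7: Value: ...", "Key: 5: Value: ..."]
total, counts = key_counts(lines)
duplicates = {k: n for k, n in counts.items() if n > 1}
# total == 3, duplicates == {"5": 2}: key 5 was emitted twice.
```

If `total` exceeds the number of distinct keys on the real dump, the extra 100 entries are repeated vector IDs, which points at the chunking step rather than at clusterdump or seqdumper.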
      As input I use the Weka vectors transformed to mvc.
      I have tried this with v0.7 and the v0.8 snapshot.

      Thank you in advance for your time.




            • Assignee:
              yannis_at yannis ats
            • Votes: 0
            • Watchers: 2


              • Created: