  1. Apache MADlib
  2. MADLIB-1380

Select number of centroids in k-means

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • Fix Version: v1.17
    • None

    Description

      kmeans_random( rel_source,
                     expr_point,
                     k,                      -- a single value (as at present) or an array of k values
                     fn_dist,                -- optional
                     agg_centroid,           -- optional
                     max_num_iterations,     -- optional
                     min_frac_reassigned,    -- optional
                     k_selection_algorithm   -- optional; only applies if 'k' is an array with multiple k values
                   )
      
      kmeanspp( rel_source,
                expr_point,
                k,                      -- a single value (as at present) or an array of k values
                fn_dist,                -- optional
                agg_centroid,           -- optional
                max_num_iterations,     -- optional
                min_frac_reassigned,    -- optional
                seeding_sample_ratio,   -- optional
                k_selection_algorithm   -- optional; only applies if 'k' is an array with multiple k values
              )
      
      k
      INTEGER or INTEGER[]. The number of centroids to calculate. Can be a single value
      or an array of k values to explore. If an array of k values is given, the parameter
      'k_selection_algorithm' determines how the candidate values are evaluated.
      
      k_selection_algorithm (optional)
      TEXT, default: 'elbow'. Method used to evaluate the number of centroids k.
      Only applies if the parameter 'k' is an array with multiple k values.
      Two approaches are currently supported: 'elbow' and 'silhouette'.
      The text can be any abbreviation of these strings; e.g., 'silh' selects the silhouette method.
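
The two ideas in this parameter description (abbreviated method names and elbow-based selection) can be illustrated with a minimal Python sketch. It is not MADlib's implementation: `resolve_method` and `elbow_k` are hypothetical helper names, prefix matching is an assumption about how abbreviations such as 'silh' are resolved, and the discrete second difference is only one common way to score an elbow. The objective values are made up for illustration.

```python
def resolve_method(text, methods=("elbow", "silhouette")):
    """Map an abbreviation such as 'silh' to a full method name.

    Assumption: an abbreviation is a case-insensitive prefix that
    matches exactly one supported method.
    """
    matches = [m for m in methods if text and m.startswith(text.lower())]
    if len(matches) != 1:
        raise ValueError(f"unknown or ambiguous k_selection_algorithm: {text!r}")
    return matches[0]


def elbow_k(objective_by_k):
    """Pick the k at the sharpest bend of the objective curve.

    objective_by_k maps each candidate k to the clustering objective
    for that k (e.g. total squared distance to the closest centroid).
    The bend is scored with a discrete second difference, so the
    smallest and largest candidate can never be selected.
    """
    ks = sorted(objective_by_k)
    if len(ks) < 3:
        raise ValueError("need at least three k values to locate an elbow")
    best_k, best_score = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        # Slopes on each side of k, normalised by the spacing of the k values.
        slope_before = (objective_by_k[k] - objective_by_k[prev_k]) / (k - prev_k)
        slope_after = (objective_by_k[next_k] - objective_by_k[k]) / (next_k - k)
        score = slope_after - slope_before  # large where the curve flattens out
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Made-up objective values for the candidate array [2, 4, 6, 8, 10].
objectives = {2: 1000.0, 4: 400.0, 6: 350.0, 8: 320.0, 10: 300.0}
method = resolve_method("elb")
chosen_k = elbow_k(objectives)
```

With these made-up values the objective drops steeply from k = 2 to k = 4 and only slowly afterwards, so the sketch selects k = 4.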
      

      e.g.,

      SELECT * FROM madlib.kmeanspp(
          'km_sample',                    -- rel_source
          'points',                       -- expr_point
          ARRAY[2, 4, 6, 8, 10],          -- k (an INTEGER[] value, so not quoted)
          'madlib.squared_dist_norm2',    -- fn_dist
          'madlib.avg',                   -- agg_centroid
          20,                             -- max_num_iterations
          0.001,                          -- min_frac_reassigned
          1.0,                            -- seeding_sample_ratio (assumed value; precedes k_selection_algorithm in the signature)
          'elbow'                         -- k_selection_algorithm
      );
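
The call above selects k with 'elbow'. For comparison, the 'silhouette' alternative can be sketched in Python as follows. This uses the standard textbook silhouette coefficient and is not necessarily the variant MADlib computes; `mean_silhouette` is a hypothetical helper name and the sample points are made up.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette coefficient of a clustering (textbook definition)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes 0
        # a: mean distance from p to the other members of its own cluster
        # (p's zero distance to itself is in the sum, hence len(own) - 1).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(points)


# Two well-separated pairs of points, labelled sensibly and badly.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = mean_silhouette(points, [0, 0, 1, 1])
bad = mean_silhouette(points, [0, 1, 0, 1])
```

Under this method, each candidate k would be clustered, scored this way, and the k with the highest mean silhouette kept. In the toy data the sensible labelling scores close to 1 and the bad labelling scores below 0.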
      

          People

            Assignee: Unassigned
            Reporter: Frank McQuillan (fmcquillan)