  1. Apache MADlib
  2. MADLIB-1239

Columns to Vector

Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.15
    • Component/s: Module: Utilities
    • Labels: None

    Description

      Related to https://issues.apache.org/jira/browse/MADLIB-1240

      Columns to Vector

      Converts features from multiple columns of an input table into a feature array in a single column.
      This process can be reversed using the function vec2cols.

      cols2vec(
          source_table,
          out_table,
          list_of_features,
          list_of_features_to_exclude,
          cols_to_output
          )
      
      source_table
      TEXT. Name of the table containing the source data.
      
      out_table
      TEXT. Name of the generated table containing the output. If a table with the same name already exists, an error will be returned. 
      
      list_of_features
      TEXT. Comma-separated string of column names or expressions to put into the feature array. Can also be '*', meaning that all columns are put into the feature array except those listed in the next argument (the exclusion list). Array columns in the source table are not supported in 'list_of_features'.
      
      PostgreSQL arrays only allow elements of the same type.  If multiple numeric types are present in the 'list_of_features', they will be cast to the largest type.  For example, if there are INTEGER and DOUBLE PRECISION columns in the feature list, the feature array will be of type DOUBLE PRECISION[].  Invalid combinations like TEXT and INTEGER will result in an error.
      
      list_of_features_to_exclude (optional)
      TEXT, default NULL. Comma-separated string of column names to exclude from the feature array.  Use only when 'list_of_features' is '*'.
      
      cols_to_output (optional)
      TEXT, default NULL. Comma-separated string of column names from the source table to keep in the output table, in addition to the feature array.  To keep all columns from the source table, use '*'.
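The way the three column-list arguments interact can be sketched in plain Python. This is only a model of the rules described above, not MADlib's implementation. The helper names `split_list`, `resolve_features` and `resolve_output_cols` and the sample column names are hypothetical. The sketch assumes plain column names (a naive comma split would break on expressions that contain commas), and raising an error when exclusions are given without '*' is a choice made here, since the description does not say how the real function reacts.

```python
def split_list(csv):
    """Split a comma-separated string into trimmed, non-empty names."""
    if csv is None:
        return []
    return [name.strip() for name in csv.split(",") if name.strip()]


def resolve_features(table_columns, list_of_features, list_of_features_to_exclude=None):
    """Return the ordered feature names that would go into the feature array."""
    excluded = split_list(list_of_features_to_exclude)
    if list_of_features.strip() == "*":
        # '*' means every source column except the listed exclusions.
        return [c for c in table_columns if c not in excluded]
    if excluded:
        # Assumption of this sketch: reject exclusions unless the feature list is '*'.
        raise ValueError("list_of_features_to_exclude requires list_of_features = '*'")
    return split_list(list_of_features)


def resolve_output_cols(table_columns, cols_to_output=None):
    """Return the source columns kept alongside the feature array."""
    if cols_to_output is None:
        return []
    if cols_to_output.strip() == "*":
        return list(table_columns)
    return split_list(cols_to_output)


columns = ["id", "height", "weight", "label"]  # hypothetical source table
features = resolve_features(columns, "*", "id, label")
kept = resolve_output_cols(columns, "id")
print(features)  # ['height', 'weight']
print(kept)      # ['id']
```

With these inputs the output table would hold the kept column `id` followed by a feature array built from `height` and `weight`.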
      
      
      Output
      
      The output table produced by the cols2vec function contains the following columns:
      
      <...>
      Columns from source table, depending on which ones are kept (if any).
      
      feature_vector
      Array of features.  Array type will depend on feature type in the source table.
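The casting rule for mixed numeric features can be modeled as picking the widest type present. The sketch below is an illustration under assumptions, not the rule PostgreSQL or MADlib actually applies: the `NUMERIC_RANK` ordering covers only a few common types and is an assumed ordering, the function name `feature_array_type` is hypothetical, and accepting a list made up of one non-numeric type (for example all TEXT) is also an assumption.

```python
# Assumed widening order for a few common numeric types.
# Narrower types are cast to the widest one present in the feature list.
NUMERIC_RANK = {
    "SMALLINT": 0,
    "INTEGER": 1,
    "BIGINT": 2,
    "REAL": 3,
    "DOUBLE PRECISION": 4,
}


def feature_array_type(column_types):
    """Return the array type for the given feature column types, or raise on an invalid mix."""
    if not column_types:
        raise ValueError("at least one feature column is required")
    normalized = [t.strip().upper() for t in column_types]
    if all(t in NUMERIC_RANK for t in normalized):
        widest = max(normalized, key=NUMERIC_RANK.__getitem__)
        return widest + "[]"
    if len(set(normalized)) == 1:
        # A single non-numeric type needs no casting (assumption of this sketch).
        return normalized[0] + "[]"
    raise TypeError("incompatible feature types: " + ", ".join(sorted(set(normalized))))


print(feature_array_type(["INTEGER", "DOUBLE PRECISION"]))  # DOUBLE PRECISION[]
```

A mix such as TEXT and INTEGER raises `TypeError` in this model, mirroring the error described for invalid combinations.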
      
      
      A summary table named <out_table>_summary is also created at the same time, which has the following columns:
      
      source_table                   TEXT. Source table name.
      list_of_features               Input list of features.
      list_of_features_to_exclude    Input list of features to exclude.
      feature_names                  TEXT[]. Array of feature names, which serves as a dictionary for the 'feature_vector'.
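Because 'feature_names' lists the features in the same order as the elements of 'feature_vector', the two can be paired position by position. A minimal Python sketch of that lookup follows. The values are made up for illustration and `label_features` is a hypothetical helper, not part of MADlib.

```python
def label_features(names, vector):
    """Pair each feature name with the element at the same position in the vector."""
    if len(names) != len(vector):
        raise ValueError("feature_names and feature_vector must have the same length")
    return dict(zip(names, vector))


feature_names = ["height", "weight"]  # as read from <out_table>_summary (made-up values)
feature_vector = [1.72, 68.5]         # as read from one output row (made-up values)

labeled = label_features(feature_names, feature_vector)
print(labeled)  # {'height': 1.72, 'weight': 68.5}
```

This is the same pairing that lets the conversion be reversed, as mentioned for vec2cols.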
      

      Notes

      (1)
      The PDL Tools function at
      http://pivotalsoftware.github.io/PDLTools/group__grp__array__utilities.html#cols2vec_example
      is similar, but the proposed MADlib function has more options. To get the equivalent of the PDL Tools function in MADlib, you would call:

      cols2vec(
          table_name,
          output_table,
          '*',
          exclude_columns,
          '*'
          )
      

      (2)
      Please put the feature vector on the right side of the output table, i.e., it will be the last column on the right.

          People

            hpandey Himanshu Pandey
            fmcquillan Frank McQuillan