  1. Apache MADlib
  2. MADLIB-909

Sessionization - Phase 1

Details

    Description

      Story

      As a data scientist, I want to perform session reconstruction on my data set, so that I can prepare for input into other algorithms like path functions, or predictive analytics algorithms.

      Details

      1) The PDL Tools sessionization module [1] is one example implementation. Source code is located at [2]. Also see [7].

      2) How to sessionize. PDL Tools uses time-based session reconstruction, which defines a session as a sequence of events by a particular user where no more than n seconds have elapsed between successive events. That is, if we don't see an event from a user for n seconds, the next event starts a new session. The requirement for MADlib is similar, with the following addition:

      • generalize partition expression
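
The time-based rule in 2) can be illustrated with a minimal Python sketch for a single user. It is not the PDL Tools or MADlib implementation; the function name `assign_session_ids` and the sample time stamps are made up for illustration. A gap strictly greater than the timeout starts a new session, matching "no more than n seconds".

```python
from datetime import datetime, timedelta


def assign_session_ids(timestamps, max_time):
    """Return a session id (1, 2, ...) for each time stamp, in sorted order.

    Hypothetical helper: a new session starts whenever the gap to the
    previous event is strictly greater than max_time.
    """
    session_ids = []
    current = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > max_time:
            current += 1  # first event, or timeout exceeded
        session_ids.append(current)
        previous = ts
    return session_ids


# Illustrative events for one user: gaps of 1 min, 7 min and 1.5 min.
events = [
    datetime(2015, 4, 15, 2, 17, 0),
    datetime(2015, 4, 15, 2, 18, 0),
    datetime(2015, 4, 15, 2, 25, 0),
    datetime(2015, 4, 15, 2, 26, 30),
]
print(assign_session_ids(events, timedelta(minutes=3)))
```

With a 3 minute timeout, only the 7 minute gap opens a second session.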

      3) Proposed interface:

      sessionize (
         source_table,
         output_table,
         partition_expr,
         time_stamp,
         max_time)
      

      where

      output_table
      The source_table with 2 new columns added, session_id and new_session:

      • session_id = 1, 2, ..., n, where n is the number of sessions in the partition

      partition_expr
      VARCHAR. The 'partition_expr' can be a single column or a comma-separated list of columns/expressions that divides all rows into groups, or partitions. Sessionization is applied across the rows that fall into the same partition. This can be NULL or '' to indicate that sessionization is to be applied to the whole table.

      time_stamp
      Name of the column containing the time used for the sessionization calculation. This must be a plain column name; it cannot be a PostgreSQL ORDER BY expression.

      max_time
      Maximum time delta between successive events within a session, i.e., the session timeout.
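
To make the proposed parameters concrete, here is a hedged in-memory analogue in Python. It is only a sketch of the semantics described above, not MADlib code: rows are dicts standing in for source_table, the partition expression is simplified to a tuple of column names, and the meaning of new_session (True on the first row of each session) is an assumption, since the description does not define it. The column names `user_id` and `event_ts` are hypothetical.

```python
from datetime import datetime, timedelta
from itertools import groupby


def sessionize(rows, partition_cols, time_stamp, max_time):
    """Hypothetical in-memory analogue of the proposed interface.

    rows           -- list of dicts (stands in for source_table)
    partition_cols -- tuple of column names; () means the whole table
    time_stamp     -- name of the time column
    max_time       -- timedelta used as the session timeout
    Returns a new list of dicts (stands in for output_table).
    """
    def partition_key(row):
        return tuple(row[col] for col in partition_cols)

    # Order by partition, then by time within each partition.
    ordered = sorted(rows, key=lambda r: (partition_key(r), r[time_stamp]))
    output = []
    for _, group in groupby(ordered, key=partition_key):
        session_id = 0
        previous = None
        for row in group:
            # Assumption: new_session flags the first row of a session.
            new_session = (previous is None
                           or row[time_stamp] - previous > max_time)
            if new_session:
                session_id += 1
            output.append({**row,
                           "session_id": session_id,
                           "new_session": new_session})
            previous = row[time_stamp]
    return output


rows = [
    {"user_id": "a", "event_ts": datetime(2015, 4, 15, 2, 17)},
    {"user_id": "b", "event_ts": datetime(2015, 4, 15, 2, 18)},
    {"user_id": "a", "event_ts": datetime(2015, 4, 15, 2, 19)},
    {"user_id": "a", "event_ts": datetime(2015, 4, 15, 2, 30)},
]
for out in sessionize(rows, ("user_id",), "event_ts", timedelta(minutes=3)):
    print(out["user_id"], out["event_ts"], out["session_id"], out["new_session"])
```

Passing an empty tuple for `partition_cols` plays the role of a NULL or '' partition_expr: all rows are sessionized as one partition.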

      Questions

      1) Q: Do we need separate 'order_expr' and 'time_stamp' columns? Aster does it this way.
      A: No, we can't come up with a reason why a user would need this. If we want to add later, we can add as an optional parameter.

      2) Q: What should we do if delta_t between events is negative?
      A: Do not include the event in the session, and output a warning message.
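
One possible reading of the answer to question 2 is sketched below in Python. It assumes events are processed in an externally fixed order rather than sorted by time_stamp, which is the only case where a negative delta_t can appear. Leaving the offending event with a session id of None, and not letting it update the "previous event" time, are both assumptions; the function name is hypothetical.

```python
import warnings
from datetime import datetime, timedelta


def session_ids_in_given_order(timestamps, max_time):
    """Assign session ids in the order given, skipping negative deltas."""
    session_ids = []
    current = 0
    previous = None
    for ts in timestamps:
        if previous is not None and ts < previous:
            # Negative delta_t: warn and leave the event out of any session.
            warnings.warn(f"negative delta_t at {ts}; event not sessionized")
            session_ids.append(None)
            continue
        if previous is None or ts - previous > max_time:
            current += 1
        session_ids.append(current)
        previous = ts
    return session_ids


out_of_order = [
    datetime(2015, 4, 15, 2, 17),
    datetime(2015, 4, 15, 2, 18),
    datetime(2015, 4, 15, 2, 16),  # earlier than the previous event
    datetime(2015, 4, 15, 2, 19),
]
print(session_ids_in_given_order(out_of_order, timedelta(minutes=3)))
```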

      References

      [1] PDL Tools sessionization module
      http://pivotalsoftware.github.io/PDLTools/group__grp__sessionization.html

      [2] PDL Tools source code
      https://github.com/pivotalsoftware/PDLTools

      [3] Blog on bot signatures from Akamai
      https://blogs.akamai.com/2013/06/identifying-and-mitigating-unwanted-bot-traffic.html

      [4] Aster Analytics users guide, see "sessionize" function
      http://www.info.teradata.com/edownload.cfm?itemid=143450001
      http://www.info.teradata.com/templates/eSrchResults.cfm?txtpid=&txtrelno=&prodline=all&frmdt=&txtsrchstring=aster%20analytics&srtord=Desc&todt=&rdSort=Date
      https://www.youtube.com/watch?v=C760M9ttK9Q

      [5] General information on sessionization
      https://en.wikipedia.org/wiki/Session_(web_analytics)

      [6] See path function for partition and order by params
      http://madlib.incubator.apache.org/docs/latest/group__grp__path.html

      [7] SQL sessionization example from blog
      https://blog.pivotal.io/pivotal/products/time-series-analysis-1-introduction-to-window-functions

      [8] Postgres example of SQL based sessionization
      http://randyzwitch.com/sessionizing-log-data-sql/

            People

              njayaram Nandish Jayaram
              fmcquillan Frank McQuillan