  1. Spot (Retired)
  2. SPOT-153

[ML] Feature improvement on proxy and elimination of quantiles.

Details

    • Improvement
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • None
    • 1.0

    Description

      There are currently 7 features that are considered to form the words in the proxy case.

      Domain belongs to Alexa - Heuristic
      Time of day - Binned into Deciles
      Request Method - String (e.g. “GET”, “POST”, etc.)
      String Entropy of URI - Binned into Quintiles
      Top level content type - String (e.g. “image”, “binary”)
      Frequency of user agent type - Binned into Quintiles
      Response code - First digit

      However, there are several problems or areas for improvement with this:

      1. Quantiles (quintiles, deciles, etc.) are expensive to compute because the data must first be sorted and each value then assigned to a bin or bucket. Depending on the sorting algorithm, the time complexity is O(n) in the best case, O(n^2) in the worst case and most commonly O(n log n). Another disadvantage of quantiles is that they are data dependent: exactly the same instance appearing in a slightly different data set could end up in a different bin. This is problematic because the same instance can then be assigned a different word, with a different probability associated with it, so the results and behavior become unstable.
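
      The data dependence described in point 1 can be illustrated with a minimal sketch. The helper names (quantile_cuts, assign_bin) and the sample entropy values are hypothetical and are not taken from the Spot code base; the cut points use a simple nearest-rank rule, which is only one of several ways to define quantiles.

```python
from bisect import bisect_right

def quantile_cuts(values, n_bins):
    """Return n_bins - 1 cut points read off the sorted data.

    The sort is the expensive step (typically O(n log n)).
    """
    ordered = sorted(values)
    return [ordered[(len(ordered) * k) // n_bins] for k in range(1, n_bins)]

def assign_bin(value, cuts):
    """Return the index of the bin that `value` falls into."""
    return bisect_right(cuts, value)

# Two hypothetical data sets of URI entropy values (made up for illustration).
day_one = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
day_two = [2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 3.8, 4.0, 4.2, 4.4]

entropy = 3.0  # exactly the same instance in both data sets
bin_day_one = assign_bin(entropy, quantile_cuts(day_one, 5))
bin_day_two = assign_bin(entropy, quantile_cuts(day_two, 5))

print(bin_day_one, bin_day_two)  # the two quintile indices differ
```

      Because the cut points are recomputed from each data set, the same entropy value lands in a different quintile, and therefore contributes to a different word, on each day.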

      2. Time of day is a feature that carries some logical meaning about the nature of our data, but quantile bin boundaries have no logical meaning, so similar times can be placed in different bins. (For example, 10:40 and 11:15 can be in the same bin and 11:45 and 12:03 in another one.) Furthermore, the instability explained in the previous point is also present here: following the same example, 11:15 can be assigned to one bin on one day and to a different bin on a different day or data set.
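
      One data-independent alternative is to use fixed-width bins for time of day. The sketch below assumes one-hour bins and an "HH:MM" input format; both are assumptions made for illustration and may differ from what the attached document proposes. The function name time_bin is hypothetical.

```python
def time_bin(hhmm, width_minutes=60):
    """Map an "HH:MM" string to a fixed-width bin index.

    The bin depends only on the time itself, never on the rest of the data,
    so the same time always yields the same bin.
    """
    hours, minutes = map(int, hhmm.split(":"))
    return (hours * 60 + minutes) // width_minutes

for t in ["10:40", "11:15", "11:45", "12:03"]:
    print(t, time_bin(t))
```

      With fixed boundaries no sorting is needed, and 11:15 maps to the same bin on every day and every data set. The trade-off is that bins are no longer equally populated.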

      3. Response code currently uses only the first digit instead of all three, so a lot of information is lost. Many response codes are assigned to the same category even though they may be totally unrelated.
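
      The information loss in point 3 can be seen in a small sketch. The two function names are hypothetical, and using the full three-digit code is shown only as one possible alternative, not necessarily the one proposed in the attachment.

```python
def first_digit_feature(status_code):
    """Current behavior as described above: keep only the first digit."""
    return str(status_code)[0]

def full_code_feature(status_code):
    """Possible alternative: keep all three digits."""
    return str(status_code)

# 401 (Unauthorized) and 404 (Not Found) have different meanings,
# but the first-digit feature cannot tell them apart.
for code in (401, 404):
    print(code, first_digit_feature(code), full_code_feature(code))
```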

      See the Word document attached below for the suggested improvements and a more detailed explanation.

      Thoughts?

      Attachments

        1. spot-ml-improvement.docx
          121 kB
          Gustavo Lujan

        Issue Links

          Activity

            People

              lujangus Gustavo Lujan
              lujangus Gustavo Lujan
              Votes:
              0
              Watchers:
              3

              Dates

                Created:
                Updated:
                Resolved:

                Time Tracking

                  Original Estimate:
                  168h
                  Remaining Estimate:
                  168h
                  Time Spent:
                  Not Specified