Details
Improvement
Status: Resolved
Major
Resolution: Fixed
None
Important
Description
Seven features are currently combined to form the words in the proxy case:
Domain belongs to Alexa - Heuristic
Time of day - Binned into Deciles
Request Method - String (e.g. “GET”, “POST”, etc.)
String Entropy of URI - Binned into Quintiles
Top level content type - String (e.g. “image”, “binary”)
Frequency of user agent type - Binned into Quintiles
Response code - First digit
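As a rough illustration of how such a word could be assembled, here is a minimal sketch. It assumes that the string entropy is the Shannon entropy of the URI's characters and that the features are joined with an underscore in the order listed above. The function names, the separator, and the ordering are hypothetical and are not taken from the actual implementation; the binned features are assumed to be computed elsewhere.

```python
import math
from collections import Counter


def shannon_entropy(text: str) -> float:
    """Shannon entropy (in bits) of the character distribution of text."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def make_word(in_alexa, time_bin, method, entropy_bin,
              content_type, agent_bin, response_code):
    """Join the seven features into one word.

    The separator and the ordering are assumptions made for illustration.
    time_bin, entropy_bin and agent_bin are precomputed quantile bin indices.
    """
    parts = [
        str(int(in_alexa)),       # 1 if the domain is in the Alexa list, else 0
        str(time_bin),            # decile of the time of day
        method,                   # e.g. "GET"
        str(entropy_bin),         # quintile of the URI entropy
        content_type,             # top level content type, e.g. "image"
        str(agent_bin),           # quintile of the user agent frequency
        str(response_code)[0],    # first digit of the response code
    ]
    return "_".join(parts)
```

For example, `make_word(True, 3, "GET", 2, "image", 4, 404)` produces a single token in which the 404 response contributes only its leading `4`.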
However, this approach has several problems or areas for improvement:
1. Quantiles (quintiles, deciles, etc.) are expensive to compute because the data must first be sorted before each value can be assigned to a bin. Depending on the sorting algorithm, the time complexity is O(n) in the best case, O(n^2) in the worst case, and most likely O(n log n). Quantiles are also data dependent: exactly the same instance appearing in a slightly different data set could end up in a different bin. This is problematic because the instance is then assigned a different word, which can have a different probability associated with it, so the results become unstable.
2. Time of day carries logical meaning about the nature of our data, but quantile binning can place similar times in different bins, with bin boundaries that have no logical meaning. (For example, 10:40 and 11:15 can fall in the same bin while 11:45 and 12:03 fall in another.) The problem described in the previous item also applies here: following the same example, 11:15 can be assigned to one bin on one day and to a different bin on another day or data set.
3. Response code currently uses only the first digit instead of all three, so a lot of information is lost. Many response codes are assigned to the same category even though they may be totally unrelated.
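The instability described in the points above can be reproduced with a small sketch. It uses only the Python standard library; the two "days" of request times are made-up values. The fixed-width hourly bins and the full three-digit response code are shown purely as one possible contrast and are an assumption here, not necessarily what the attached document proposes.

```python
import bisect
import statistics


def quantile_bin(value, data, n):
    """Bin index of value using n-quantile cut points estimated from data."""
    cuts = statistics.quantiles(data, n=n)  # n - 1 cut points, derived from data
    return bisect.bisect_right(cuts, value)


def fixed_time_bin(minutes_since_midnight, width_minutes=60):
    """Bin index from fixed-width intervals; does not depend on the data set."""
    return minutes_since_midnight // width_minutes


def response_feature(code, digits=1):
    """First `digits` characters of the response code as a category."""
    return str(code)[:digits]


# 11:15 expressed as minutes since midnight
t = 11 * 60 + 15

# Two hypothetical days of request times (minutes since midnight)
day_a = [60, 120, 400, 500, 600, 650, 700, 800, 900, 1300]
day_b = [600, 640, 660, 670, 680, 690, 700, 710, 720, 730]

# Same instant, different decile depending on the surrounding data
bin_a = quantile_bin(t, day_a, 10)
bin_b = quantile_bin(t, day_b, 10)
print("decile on day A:", bin_a, "decile on day B:", bin_b)

# Fixed bins give the same answer regardless of the data set
print("fixed hourly bin:", fixed_time_bin(t))

# First-digit truncation merges unrelated codes; three digits keep them apart
print(response_feature(401), response_feature(404))
print(response_feature(401, 3), response_feature(404, 3))
```

Note that fixed-width bins still separate nearby times at their edges (11:45 and 12:03 land in different hourly bins), but the edges are the same for every data set, so a given instance always maps to the same word.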
See the attached Word document for a more detailed explanation of the suggested improvements.
Thoughts?
Attachments