Unsupervised Learning — Super Interesting Real-life use cases
The internet is filled with articles about unsupervised learning algorithms, but understanding real-life use cases gives a deeper perspective.
As we know, data is the new oil and user data is pure gold. However, tracking users across the internet is extremely difficult, especially when there are no clear dependent variables to measure.
Imagine a company like Google, or a comparison site with access to its partner sites, that wants to track users across hundreds of its domains, where there is limited data about users.
Let's assume that we have just these parameters: Event ID (an ID generated by any event, such as filling in a form or clicking a "learn more" button), Email, Phone no., IP Address, Cookie ID, User Agent, and Timestamp.
Given just these parameters, how do we go about tracking the user journey? More fundamentally, how do we create an identity mechanism for users?
Solution
As you can see, with only these features available, tracking a single user across multiple websites is a pretty challenging problem.
The goal of the exercise is to come up with clusters of event IDs, where each cluster indicates a unique user, along with a confidence score (0-100% probability) for each cluster.
Feature engineering & Ideas based on different features
Since an email address is close to unique per user, and collisions caused by typos (i.e. two different users ending up with the same address) are very rare, we can use email as the anchor for our clustering and start our seed clusters from it. However, there will be instances where the email is missing; those are dealt with later.
Now, say the total number of users is 1 million, and the distinct email count is about 1.1 million owing to typos.
We sort the emails alphabetically and compute the Levenshtein distance between each email and the next row in the database. As a working assumption, we can treat emails within a Levenshtein distance of 1 as the same address, which cuts down the number of distinct instances.
However, the issue of missing emails is still unsolved and needs our attention.
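The sort-and-compare step can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `levenshtein` and `merge_adjacent_emails` are made up for this example, the emails are held in a plain list rather than a database, and lower-casing the addresses first is an extra normalization step not mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def merge_adjacent_emails(emails, max_distance=1):
    """Map every email to a canonical email by comparing sorted neighbours."""
    canonical = {}
    last = None
    # Assumption: lower-casing and trimming is a safe normalization here.
    for email in sorted({e.strip().lower() for e in emails if e}):
        if last is not None and levenshtein(email, last) <= max_distance:
            # Close enough to the previous row: reuse its canonical form.
            canonical[email] = canonical[last]
        else:
            canonical[email] = email
        last = email
    return canonical


emails = ["john@example.com", "john@example.con", "mary@example.com"]
mapping = merge_adjacent_emails(emails)
print(len(set(mapping.values())))  # number of distinct users after merging
```

Two caveats about this sketch: comparing only neighbouring rows misses typos in the first few characters (they sort far apart), and a distance of 1 can also merge genuinely different addresses such as `john1@...` and `john2@...`, so the threshold should feed into the confidence score rather than be treated as certain.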
Phone number
Phone numbers are the second most reliable item after email, so when we have an exact phone number match and the email is missing, we can attach such events to the cluster where the phone number matches.
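A rough sketch of this fallback, assuming events are plain dictionaries with hypothetical `event_id`, `email` and `phone` keys, and that `email_cluster` is an already-built mapping from email to cluster ID. Stripping non-digits before matching is an added assumption so that differently formatted numbers still count as an exact match.

```python
import re


def normalize_phone(raw):
    """Keep digits only so '+1 (555) 010-0000' and '15550100000' compare equal."""
    digits = re.sub(r"\D", "", raw or "")
    return digits or None


def attach_by_phone(events, email_cluster):
    """Assign each event to a cluster: by email if present, else by phone."""
    # First pass: learn which cluster each phone number belongs to.
    phone_to_cluster = {}
    for ev in events:
        phone = normalize_phone(ev.get("phone"))
        email = ev.get("email")
        if phone and email in email_cluster:
            phone_to_cluster.setdefault(phone, email_cluster[email])

    # Second pass: events without a known email fall back to the phone match.
    assignment = {}
    for ev in events:
        email = ev.get("email")
        if email in email_cluster:
            assignment[ev["event_id"]] = email_cluster[email]
        else:
            phone = normalize_phone(ev.get("phone"))
            assignment[ev["event_id"]] = phone_to_cluster.get(phone)  # None if unmatched
    return assignment


events = [
    {"event_id": "e1", "email": "a@x.com", "phone": "+1 (555) 010-0000"},
    {"event_id": "e2", "email": None, "phone": "15550100000"},
    {"event_id": "e3", "email": None, "phone": "999"},
]
print(attach_by_phone(events, {"a@x.com": 0}))
```

Events that match neither an email nor a phone number stay unassigned (`None`) and are left for the weaker signals below.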
Timestamp
The timestamp needs to be converted into 10-minute windows, numbered from 0 for the earliest timestamp in the data. This makes any time-based comparison later very easy.
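As a minimal sketch of the bucketing, assuming the timestamps are already parsed into Python `datetime` objects (the function name `to_time_buckets` is just for illustration):

```python
from datetime import datetime


def to_time_buckets(timestamps, window_minutes=10):
    """Convert timestamps to window indices, with 0 for the earliest window."""
    window_seconds = window_minutes * 60
    start = min(timestamps)
    return [int((ts - start).total_seconds() // window_seconds)
            for ts in timestamps]


stamps = [
    datetime(2020, 1, 1, 0, 0, 0),
    datetime(2020, 1, 1, 0, 9, 59),
    datetime(2020, 1, 1, 0, 10, 0),
    datetime(2020, 1, 1, 1, 5, 0),
]
print(to_time_buckets(stamps))  # [0, 0, 1, 6]
```

With integer buckets, "did these two events happen close together?" becomes a simple comparison of two small numbers.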
IP Address
An IP address tends to change every 3 months or so, or when you change your router (which is much less frequent), so it is a reasonably good indicator of personhood and a fairly accurate signal that can be used for identification.
Cookie ID
Cookie IDs have a shelf life of 1 day to 1 year, depending on the website and on user behavior (such as clearing cookies), so cookie IDs are difficult to rely on for identifying an individual.
User agent
User-agent info is usually unreliable, as millions of computers can have the same user-agent settings, unless the browser is set to a custom size that is kept constant.
As you can see, real-world data has no hard-and-fast rules and there are always exceptions; problems like this are far more interesting because of that.
We should use graph algorithms like ‘Connected Components’ to further identify the patterns here and to better understand the relationships between events.
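One way to picture this: treat every event as a node and connect two events whenever they share an identifier. Each connected component is then a candidate user. The sketch below uses a small union-find structure; the dictionary keys (`event_id`, `email`, `phone`, `cookie_id`) and the choice of which identifiers create an edge are assumptions made for the example, not fixed rules.

```python
from collections import defaultdict


def connected_components(events, keys=("email", "phone", "cookie_id")):
    """Group event IDs that are linked, directly or indirectly, by a shared value."""
    parent = {ev["event_id"]: ev["event_id"] for ev in events}

    def find(x):
        # Walk up to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link each event to the first event seen with the same (key, value) pair.
    first_seen = {}
    for ev in events:
        for key in keys:
            value = ev.get(key)
            if not value:
                continue  # missing identifiers create no edges
            marker = (key, value)
            if marker in first_seen:
                union(first_seen[marker], ev["event_id"])
            else:
                first_seen[marker] = ev["event_id"]

    clusters = defaultdict(set)
    for ev in events:
        clusters[find(ev["event_id"])].add(ev["event_id"])
    return list(clusters.values())


events = [
    {"event_id": "e1", "email": "a@x.com"},
    {"event_id": "e2", "email": "a@x.com", "phone": "15550100000"},
    {"event_id": "e3", "phone": "15550100000"},
    {"event_id": "e4", "cookie_id": "c-42"},
]
print(connected_components(events))
```

Here `e1` and `e3` never share a value directly, yet they land in the same cluster through `e2`. That transitive linking is what makes the approach useful, and also why weak identifiers such as the user agent are best kept out of the edge-building step.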
Other interesting use cases include document classification, customer segmentation, crime profiling, call record analysis, and so on.
P.S. Let me know your comments below.