This blog post walks you through a machine learning classification problem at a high level: from the first steps of defining a data science problem and extracting meaning from the data to the later steps of model evaluation. Part I focuses on turning a business problem into a data science question. We will cover extracting and cleaning data, then building KPIs that will eventually serve as machine learning features. Part II (stay tuned) will tackle machine learning directly. It will discuss how to choose among the variety of classification algorithms available, what it means to train a model, and how to evaluate the success of the model. Throughout the discussion, we will follow a real-world business use of machine learning classification. The task at hand is to categorize the type of network traffic received by a large telecom company. For simplicity, we propose a binary solution: video data vs. nonvideo data.
From Business to Data Science:
A large telecom company wanted to classify the type of traffic they were seeing on their network. The business incentives for doing this were twofold. First, they wanted an accurate report of how much video data their customers were consuming, to better understand user behavior and user intentions. Second, they knew that they could throttle video packets without slowing down their customers’ streaming experiences. Most customers were streaming video on their mobile devices at low resolutions, at or around 480p. When customers attempted to watch at higher resolutions, they experienced slow connection speeds and frequent buffering. Knowing which domains supplied video traffic would reduce costs and resource usage as well as provide the company with a wealth of user insights.
The first step in building a data science solution for the client is to understand the nature of their telecom data, how to extract it, and which fields provide meaningful information that might serve as features for a machine learning model. We were told that telecom data is exchanged with mobile devices in flows. For example, if a user wanted to watch a video, flows would be sent to and received from the device at regular intervals. We would expect video streaming events to involve more flows and longer session durations. The flows could be grouped by SNI (Server Name Indication, the hostname a client sends during the TLS handshake, used here as a stand-in for the app or domain name) and start time to give a chronological picture of a user’s behavior. In addition to SNI, start time and end time, the number of upload bytes, download bytes, upload packets and download packets could be extracted per flow. Whether or not the company charged the flow as a video instance was also extracted. Our dataset was therefore labeled, although many flows were marked as unknown content.
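To make the shape of the data concrete, here is a minimal sketch of a flow record and the grouping step described above. The class name, field names, time unit (seconds) and the use of `None` for unknown content are all assumptions made for illustration; the client's actual schema is not shown in this post.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Flow:
    """One flow record; field names are illustrative, not the client's schema."""
    sni: str                  # app or domain name
    start_time: float         # seconds (assumed unit)
    end_time: float           # seconds (assumed unit)
    upload_bytes: int
    download_bytes: int
    upload_packets: int
    download_packets: int
    charged_as_video: Optional[bool] = None  # None stands for "unknown content"


def group_flows_by_sni(flows: List[Flow]) -> Dict[str, List[Flow]]:
    """Group one user's flows by SNI, with each group sorted by start time."""
    grouped: Dict[str, List[Flow]] = defaultdict(list)
    for flow in flows:
        grouped[flow.sni].append(flow)
    for sni_flows in grouped.values():
        sni_flows.sort(key=lambda f: f.start_time)
    return dict(grouped)
```

Calling `group_flows_by_sni` on one user's flows returns a dictionary keyed by SNI, where each value is that SNI's flows in chronological order.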
To engineer features for our model, we first needed to calculate straightforward KPIs and analyze their distributions to see how well they separate the two classes. To do this, we needed to divide each user’s daily telecom data into discrete sessions, a choice that is necessarily somewhat arbitrary. We decided that all flows belonging to the same SNI without breaks of 30 seconds or more would belong to the same session. For each KPI under consideration, the number of sessions was then plotted as a function of the KPI value to help visualize how separable the data was. As an example, the number of sessions as a function of byte ratio is plotted below:
Looking at byte ratio alone, the video data appears quite separable from the nonvideo data. We see many more nonvideo sessions with low byte ratios. This could be due to sessions with high upload bytes (e.g. uploading content to social media) or low-volume nonvideo traffic (e.g. push notifications). By contrast, labeled video sessions require high download bytes in order to fill the read-ahead buffer while streaming.
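The 30-second session rule and the per-session KPIs can be sketched as follows. This is an illustration under stated assumptions, not the code used on the project: the field names are hypothetical, a "break" is taken to be the time between the latest end time seen in a session and the next flow's start time, and byte ratio is assumed to mean download bytes divided by upload bytes (the post does not define it explicitly).

```python
from typing import Dict, List, NamedTuple, Optional

SESSION_GAP_SECONDS = 30.0  # a break this long or longer starts a new session


class Flow(NamedTuple):
    """Minimal flow record; field names are hypothetical."""
    sni: str
    start_time: float   # seconds
    end_time: float     # seconds
    upload_bytes: int
    download_bytes: int


def sessionize(flows: List[Flow], gap: float = SESSION_GAP_SECONDS) -> List[List[Flow]]:
    """Split one user's flows into sessions per SNI.

    A flow joins the open session of its SNI when it starts less than
    `gap` seconds after the latest end time seen in that session.
    """
    sessions: List[List[Flow]] = []
    open_session: Dict[str, List[Flow]] = {}  # SNI -> session still being built
    last_end: Dict[str, float] = {}           # SNI -> latest end time in that session
    for flow in sorted(flows, key=lambda f: f.start_time):
        current = open_session.get(flow.sni)
        if current is not None and flow.start_time - last_end[flow.sni] < gap:
            current.append(flow)
            last_end[flow.sni] = max(last_end[flow.sni], flow.end_time)
        else:
            current = [flow]
            sessions.append(current)
            open_session[flow.sni] = current
            last_end[flow.sni] = flow.end_time
    return sessions


def session_kpis(session: List[Flow]) -> Dict[str, Optional[float]]:
    """Compute simple KPIs for one session."""
    up = sum(f.upload_bytes for f in session)
    down = sum(f.download_bytes for f in session)
    start = min(f.start_time for f in session)
    end = max(f.end_time for f in session)
    return {
        "flow_count": len(session),
        "duration_s": end - start,
        "upload_bytes": up,
        "download_bytes": down,
        # Assumed definition: download bytes per upload byte; None if nothing was uploaded.
        "byte_ratio": down / up if up > 0 else None,
    }
```

Running `session_kpis` over every session returned by `sessionize` gives one row of KPIs per session, which can then be histogrammed per label to inspect separability.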
In Part II of this blog series, we will dive deeper into applying machine learning. Stay tuned for a look at model selection, training and evaluation.