Agile Methodologies in Data Science

What problem were we trying to solve?

At ProCogia, our goal is to provide the highest level of data science support we possibly can. With our larger clients, we operate as long-term support for several data science initiatives. This type of engagement requires a greater level of project management than single-project or short-term engagements. On one of our current client sites, we maintain a staff of five data scientists who support many of their engineers. Before long, we identified several inefficiencies that hindered our ability to develop the high-quality data science solutions we pride ourselves on producing. The two main pain points were a lack of clarity in project requirements and a propensity for projects to extend far beyond their initially expected duration. To address these problems, organize our efforts, and maximize our output, we adopted an agile project management process that we customized to meet the needs of our data science team.

 

What is Agile?

The original Manifesto for Agile Software Development was an effort by 17 veteran software developers who identified a few key features of successful development teams. Their primary values are:

 

Individuals and Interactions over processes and tools

Working Software over comprehensive documentation

Customer Collaboration over contract negotiation

Responding to Change over following a plan

 

Additionally, the manifesto identifies 12 principles that further define best practices for software development. As data scientists, we recognized that not all of these principles apply to our work, but many of them do, and we set out to organize our teams and projects in an agile way.

 

Our Experience:

As consultants we often find ourselves taking on projects from our business clients that are not fully defined. After all, that’s part of the fun of data science: the science of it. Form a hypothesis, design and carry out an experiment to test it, evaluate the results, and iterate from there. We use a scrum process to manage our team and ensure that maximum value is delivered to our clients. Our process follows the standard scrum framework shown below.

[Figure: the standard scrum framework]

Within this framework, we develop our projects in two-week cycles called sprints.

By holding ourselves to two-week cycles and communicating our work to our clients throughout, we provide a much-needed lens into our development lifecycle. It is all too easy for clients on the business end to overlook the difficulties we may face in gathering, cleaning, and pre-processing their data. Instead of our team working in a silo for two months before delivering what we think the client wants, we deliver smaller pieces more often, allowing them time to digest our work and provide feedback along the way. The greatest benefit we have noticed since implementing this framework has been our ability to iterate more quickly and reach a solution that meets our client’s needs.

 

We began our agile implementation four months ago, and although the improvements have been evident, some aspects of the scrum process have been difficult for our team to enact. We are currently focusing on the sprint retrospective phase, gathering input from our team as well as from our clients to tweak our process. One important change we have made is in the project estimation we do during sprint planning. We set out to perform team-wide level-of-effort estimation at the beginning of each sprint cycle. What we learned after a few tries was that, given our breadth of active projects, usually only one team member was familiar enough with a project’s overall complexities to estimate the level of effort accurately. The rest of the team was providing little input, and our time spent estimating was not useful. Now, we allow individual contributors to estimate their own level of effort and reserve team estimation for new projects coming into our backlog.

 

Overall, our implementation of agile management processes has made a positive change in our team’s level of communication and productivity. The adoption has streamlined our project intake process and eliminated much of the uncertainty we used to face when taking on complex data science projects for our clients. Reducing the unknowns helps us reach a solution faster, and improving communication helps us adapt to changing requirements. The most important lesson we have learned, though, is that there is always room to improve. As ProCogia grows and takes on new and varied clients, the only way to keep up with the complexities is to be agile.

Domopalooza 2019

Domo has found a way to connect all the data you could possibly want access to and turn it into a user-friendly platform that translates that data into visualizations anyone in your company can understand. Even if you are technologically challenged, you will still be able to create incredible visualizations. Domo also allows the platform to be customized to client needs, so a partner like ProCogia can come in and build intricate connectors for Domo clients. The capabilities of this platform are astounding.

Domopalooza was an action-packed week with all the knowledge your head could hold shared at general sessions, break-out sessions and at their genius lab, an abundance of delicious food, and activities such as concerts in the evenings and skiing at world-renowned Snowbird on the last day.

The Power of the Platform was the theme of the conference. Domo made some big announcements, such as new data science, AI, and machine learning tiles that will expand Domo into a more capable all-in-one platform. They also announced their Domo for Good program, which will help non-profits use the platform to make decisions that directly affect communities and those in need.

Derek White explained how Domo has changed the game for BBVA by allowing them to see data in real time, which directly affects their business. One after another, speakers shared stories of how Domo has affected their work and their companies. The recurring theme was that Domo made it simple for companies to understand what their data was telling them. This led to collaboration between departments based on real-time data, and they were able to make informed decisions based on data they already had available.

Domo wrote a quick recap post that you can access here for more information on Domopalooza.

We came to Domopalooza as a sponsor and spent most of our time near the ProCogia booth, talking with anyone who came by wanting to know more about what we do. We are one of the few partners capable of implementing the data science capabilities within Domo. We were pleasantly surprised to see how many people from across the globe made it a priority to be at Domopalooza this year. Domopalooza was the perfect meeting place for everyone to come together, find support in their implementations, and learn about the new capabilities that Domo is releasing!

ProCogia recently partnered with Domo because we have seen what the platform is capable of and we want to be a part of it. ProCogia is also certified in Domo, and we can hit the ground running to implement any part of Domo for your company.

Please reach out to us at outreach@procogia.com or give us a call at 425-624-7532 if you want information on how we can help your company succeed.


ProCogia at RStudio::conf 2019

As a designated enterprise partner and cherished sponsor of RStudio’s annual conference, ProCogia brought a strong contingent of data scientists and sales executives to this year’s conference in Austin. It was an honor to witness the tremendous growth and popularity of this conference, which saw over 1700 participants this year. The two-day main conference was preceded by two additional days of workshops and certifications. Some of our own ProCogia data scientists got certified in implementing RStudio products at an enterprise level.

It would be a monumental task to recap the entire conference, which had multiple tracks and themes, but here is a humble recap of this year’s highlights:

R is production ready

This was the dominant theme of this year’s edition, and we could not agree more. RStudio’s CTO and inventor of Shiny, Joe Cheng, kicked off the conference with his Shiny in production keynote. He led the audience through a journey of how Shiny started as a data scientist’s visualization tool and gradually became a production-ready product that can be used at an organizational level. There were some obvious technical and cultural challenges: software engineers’ reluctance to use Shiny apps because they were built from a data scientist’s perspective, and broader pushback from organizational IT regarding scaling Shiny apps. He provided plenty of examples of how Shiny has evolved to become more robust and ready for widespread use. The conference also provided introductions to RStudio’s emerging products, RStudio Package Manager and RStudio Connect: Past, present, and future, which can be used to create an enterprise-level data science environment (something ProCogia can help with).

Sharing is caring

David Robinson from DataCamp presented his keynote, The Unreasonable Effectiveness of Public Work, emphasizing how important it is for data scientists to evangelize their work within their organization and the broader community. Data science doesn’t operate in a silo, and neither should a data scientist. He suggested using social media, authoring blogs and articles, and even writing a book (with practice and experience) to engage with the community. We at ProCogia also believe in giving back to the local R programming community, and we do so by organizing Bellevue’s useR group meetup every other month at our office.

Learning, Learning, Learning!

Keeping an open mind about learning new concepts and techniques is key to professional growth. Noted education researcher Felienne’s keynote explored how code can be taught in a far-from-trivial way, drawing on her research into teaching programming to school children. It provided a very interesting way to look at programming in general and how you can master it using a practice-based approach.


Of course, this blog post can’t cover all the talks and workshops from the conference, but hopefully it provides enough of a summary to get you excited about R and RStudio! Please feel free to reach out to us at outreach@procogia.com if you want to adopt RStudio products at your enterprise, give a talk at our Seattle meetup, or are just curious about R in general.

Stay tuned for our coverage of the next conference from San Francisco!

Image Processing in Python

Deep learning is a widely used technique that is renowned for its high accuracy. It can be applied in various fields such as regression/classification, image processing, and natural language processing. The downside of deep learning is that it requires a large amount of data and high computational power to tune its parameters. Thus, it may not be an effective method for solving simple problems or building models on a small data set. One way to take advantage of the power of deep learning on a small data set is to use pre-trained models built by companies or research groups. VGG16, which is used in this post, was developed and trained by Oxford’s Visual Geometry Group (VGG) to classify images into 1000 categories. Besides classification, VGG16 can also be used for other image processing applications by changing the last layer of the model.

I have a 2-month-old daughter, so my wife and I had to prepare some clothes, accessories, and furniture before her arrival. My wife asked me to build a model to find onesies similar to the ones she wanted to buy, and a pre-trained deep learning model is a perfect fit for the task.

To explain how a pre-trained deep learning model can be used in this situation, I collected a total of 20 images from the internet: ten short-sleeve baby onesies, eight other baby clothes, one adult t-shirt, and one pair of baby shoes. I included the other baby clothes to check whether the model can distinguish them, and the images of the t-shirt and shoes are included as outliers. This is a great example of how pre-trained models can be utilized on a small data set.

[Figure: the 20 collected images of baby clothes, an adult t-shirt, and baby shoes]

Keras is a neural network library written in Python that runs TensorFlow as its backend. The code below walks through the details of how to utilize VGG16 using Keras in Python.

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import Image, display     

VGG16 requires the input image size to be 224 by 224. The function below pre-processes each image and converts it into an array.

def preprocess(file):
    # Load an image, resize it to 224x224, and convert it into a VGG16-ready array
    try:
        img = image.load_img(file, target_size=(224, 224))
        feature = image.img_to_array(img)
        feature = np.expand_dims(feature, axis=0)   # add a batch dimension
        feature = preprocess_input(feature)         # VGG16-specific channel preprocessing
    except Exception as err:
        print('Error:', file, err)
        raise
    return feature[0]


imgs = [preprocess('C:/Users/pc-procogia/Desktop/ProBlogia/'+str(i)+'.jpg') for i in range(1,21)]
X_pics = np.array(imgs)     

The last layer of a neural network determines the output of the model. VGG16 is primarily built for classification, so the final layers need to be modified to extract raw image features from the data set. In this post, no parameters will be trained; only the pre-trained layers will be used.

from keras.models import Model   # used to cut the network at the fc1 layer

def feature_extraction(images):
    # Load VGG16 with its ImageNet weights, including the fully connected top layers
    base_model = VGG16(weights='imagenet', include_top=True, input_shape=(224, 224, 3))

    # Freeze every layer: we only use the pre-trained weights, nothing is trained
    for layer in base_model.layers:
        layer.trainable = False

    # Drop the fc2 and prediction layers by taking the output of fc1 (4096 features)
    feature_model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)
    feature_model.summary()

    pic_features = feature_model.predict(images)
    return pic_features


pic_features = feature_extraction(X_pics)

[Figure: model summary showing the fc1 layer with 4096 output features, zero trainable parameters, and over 100 million non-trainable parameters]

As you can see from the model summary above, the fc1 layer shows that 4096 features were extracted from each image, and the number of trainable parameters is zero since nothing was trained. The number of non-trainable parameters is more than 100 million, which means it would take a very long time to tune them on a local machine. Because not everyone has access to a high-performance computing system, a pre-trained model comes in handy here.

Now that features have been extracted from each clothing image, we can compare the images directly. Cosine similarity is a good metric for finding similar clothes because every image is now represented as a vector. The code below calculates the cosine similarities and shows the most similar clothes.

# Pairwise cosine similarities between the image feature vectors
dists = cosine_similarity(pic_features)
dists = pd.DataFrame(dists)

def get_similar(dists):
    L_list = []
    for item in range(len(dists)):
        # Rank all images by similarity to image `item`, most similar first;
        # adding 1 converts the 0-based index into the 1-based image file name
        L = [i[0] + 1 for i in sorted(enumerate(dists[item]), key=lambda x: x[1], reverse=True)]
        L_list.append(L)
        print('=====================')
        print('=== Your Favorite ===', item)
        display(Image(filename=str(item + 1) + '.jpg', width=200, height=200))
        print('--- Similar Clothes ---')
        for i in L[1:6]:   # skip L[0], which is the image itself
            display(Image(filename=str(i) + '.jpg', width=200, height=200))
    return L_list

If you use the image of a onesie as the input, the following five images are the similar clothes that the deep learning model chose. The model successfully picked the short-sleeve onesies for us.

[Figure: the input onesie and the five most similar items returned by the model]

In the case below, the search results include onesies but also the adult t-shirt. It seems that color was one of the features that pulled the adult t-shirt into the results.

[Figure: a second input onesie and its results, which include the adult t-shirt]

The least similar image to the short-sleeve onesies was the pair of baby shoes. The shape of the shoes is obviously different from that of a onesie, and the simple feature extraction was able to distinguish them even though the model was not specifically tuned for this case.

[Figure: the baby shoes, the least similar image to the onesies]

To increase model accuracy or to focus on certain features of clothes such as color, shape, or pattern, you can add those features to the model and train them on top of the pre-trained layers. You then do not have to train more than 100 million parameters; instead, you train only the small number of additional parameters that your local machine can handle to achieve the results you want.

Learning how to add additional features to the pre-trained model will be an interesting topic for the next post. I hope you enjoyed this post on deep learning and feel motivated to start your own projects using pre-trained models.

Data Visualization in R: a ggplot2 primer

Visualization is one of the most efficient techniques to present results. In the realm of statistical analysis, R is a popular programming language used to perform initial exploratory analysis and statistical modelling. ggplot2 is the most popular package in R’s tidyverse and is primarily used to build supporting visualizations. This blog post provides a quick guide on how to get started with the package.

The “gg” in ggplot2 stands for “grammar of graphics”. Think of it as the semantics of a visualization language: the package is built so that constructing a plot resembles constructing a meaningful sentence. As in a sentence, you combine different types of elements to create any kind of graphic display. Let’s look at the most prominent building blocks of a ggplot2 plot:

1. Data

The goal of any visualization is to present data in an effective manner, and there is a feedback loop: to have an effective visualization, it’s essential to have data in a well-formatted and concise form. The dataset needs to be in a ‘tidy’ state (avid R users will get the reference). What do I mean by tidy data? Each row should contain all of the information you want to plot for a particular point. Design decisions directly drive the formatting of the data – should the dataset be wide or long? Is scaling required? Would faceting or grouping certain columns make a better input for the visualization?

Building a data manipulation pipeline is highly recommended, and I could dedicate a whole other blog post to it – stay tuned! For the current exercise I will provide the R commands used to manipulate the data for each plot being discussed.

2. Aesthetic mapping

In the ggplot2 universe, an aesthetic means “something you can see”. This building block covers the primary elements of a visualization: what goes on the x and y axes? What colors should the marks be? What is the shape of each data point? In summary, here are some aesthetic elements of a ggplot2 command:

·       Position (i.e., on the x and y axes)

·       Color (“outside” color)

·       Fill (“inside” color)

·       Shape (of points)

·       Line type

·       Size

 

3. Geometric objects

After you have finalized the data and decided on the aesthetic elements, it’s time to put them together using a geometric object. These are the actual marks that define what kind of plot will be produced. If you have some background in visualization, you will recognize the ggplot2 geometric elements very easily:

·       Points (geom_point, for scatter plots, dot plots, etc.),

·       Lines (geom_line, for time series, trend lines, etc.),

·       Boxplot (geom_boxplot)

·       And so much more…

Now that you have a basic understanding of how ggplot2 works, it’s time to start making plots!

We will be using the popular nycflights13 dataset, which contains on-time airline data for all flights departing NYC in 2013. Also included is useful metadata on airlines, airports, weather, and planes.
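
To follow along, here is a minimal setup sketch; it assumes the ggplot2, dplyr, and nycflights13 packages are installed:

# Load the plotting and data manipulation packages plus the flight data
library(ggplot2)
library(dplyr)
library(nycflights13)

# The flights, airlines, airports, weather, and planes tables are now available
glimpse(flights)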

[Figure: a preview of the nycflights13 flights data]

Scatterplot:

We will use a scatterplot to identify any relationship between the arrival and departure delays of Alaska Airlines flights. Here is the initial data preparation required to do that:

[Figure: data preparation code and the resulting all_alaska_flights data frame]
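
A sketch of that preparation: keep Alaska Airlines flights (carrier code "AS") that have both delays recorded, in a data frame named all_alaska_flights (the name used in the command below):

all_alaska_flights <- flights %>%
  filter(carrier == "AS") %>%
  filter(!is.na(dep_delay) & !is.na(arr_delay))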

And here is the scatterplot command:

[Figure: the scatterplot command]
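
A minimal sketch of the command, using the components described below:

ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point()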

Let us understand what is happening in the above command. Within the ggplot() function call, we specify two of the components of the grammar:

  • The data frame to be used is all_alaska_flights, set with data = all_alaska_flights

  • The aesthetic mapping, set with aes(x = dep_delay, y = arr_delay)

We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric objects are points, specified with geom_point().

Some notes on layers:

·       The + sign comes at the end of lines, not at the beginning. You'll get an error in R if you put it at the beginning.

·       When adding layers to a plot, you are encouraged to hit Return on your keyboard after entering the + so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.

 And here is the output of the command:

[Figure: scatterplot of departure delay versus arrival delay for Alaska Airlines flights]

It is very evident that there is some over-plotting happening in our plot. The first way of relieving over-plotting is by changing the alpha argument in geom_point() which controls the transparency of the points. By default, this value is set to 1. We can change this to any value between 0 and 1 where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque.

[Figure: the scatterplot command with a lower alpha value and the resulting plot]
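
For example, a sketch with an arbitrarily chosen alpha of 0.2:

ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay)) +
  geom_point(alpha = 0.2)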

Another popular method is to add a color component. In this dataset, we can use the origin airport as a differentiating dimension, mapping it to the color of the points in the scatterplot.

[Figure: the scatterplot colored by origin airport]
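
A sketch of that variation, mapping the color aesthetic to the origin airport:

ggplot(data = all_alaska_flights, aes(x = dep_delay, y = arr_delay, color = origin)) +
  geom_point()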

Line Chart:

Let’s plot a line chart to understand variations in early January (first 15 days) weather at the Newark airport.

As usual, we need to do some data prep for the graphic:

[Figure: data preparation for the early January weather at Newark]
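
A sketch of the preparation, keeping hourly observations at Newark (EWR) for the first 15 days of January in a data frame named early_january_weather (the name is a choice for this example):

early_january_weather <- weather %>%
  filter(origin == "EWR", month == 1, day <= 15)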

Here is what the ggplot command would look like:

[Figure: the geom_line command and the resulting line chart]
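
A sketch of the command, with temperature chosen here as the weather variable to plot:

ggplot(data = early_january_weather, aes(x = time_hour, y = temp)) +
  geom_line()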

Box Plot:

We will use ggplot2 to create a box plot of temperature variations over the year.

[Figure: box plot produced with a numerical x variable]
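
A sketch of this first attempt, mapping the numerical month variable directly to x:

ggplot(data = weather, aes(x = month, y = temp)) +
  geom_boxplot()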

Do you see something amiss with the plot above? We are plotting a numerical, not a categorical, variable on the x-axis. This gives us one overall boxplot without any other groupings. We can get around this by introducing the factor() function for our x variable:

[Figure: box plots of temperature by month using factor(month)]
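
A sketch of the corrected command, converting month to a factor:

ggplot(data = weather, aes(x = factor(month), y = temp)) +
  geom_boxplot()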

Bar Chart:

Next, we will plot a bar chart to compare the number of flights from each airline at the New York airports.

[Figure: the geom_bar command and the resulting bar chart of flights per carrier]
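
A sketch of such a bar chart, counting flights by carrier:

ggplot(data = flights, aes(x = carrier)) +
  geom_bar()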

We can modify the aesthetic elements of the bar chart to plot different variations of the same dataset. We can also use bar charts to compare two categorical variables; this variation is called a stacked bar chart.

[Figure: bar chart comparing two categorical variables]

Here is the new ggplot2 bar chart command:

[Figure: the stacked bar chart command and its output]
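
A sketch of the stacked variation; using the origin airport as the second categorical variable is an assumption for this example:

ggplot(data = flights, aes(x = carrier, fill = origin)) +
  geom_bar()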

Another variation on the stacked bar chart is the side-by-side bar chart, also called a dodged bar chart.

[Figure: the dodged bar chart command and its output]
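
A sketch of the dodged version, which only changes the position argument of geom_bar():

ggplot(data = flights, aes(x = carrier, fill = origin)) +
  geom_bar(position = "dodge")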

We can also create a faceted bar chart, which draws a separate bar plot for each level of a categorical variable and creates a view that makes it easy to compare values across a common variable.

[Figure: the faceted bar chart command and its output]
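
A sketch of the faceted version, drawing one panel per origin airport:

ggplot(data = flights, aes(x = carrier)) +
  geom_bar() +
  facet_wrap(~ origin)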

Heat map:

We will build a heat map of departure delays for all days of the week at the JFK airport, using the color aspect of the heat map to compare delays for every hour of each day of the week. It sounds intense, but we can achieve it easily with a heat map.

Let’s do the data preparation first:

[Figure: data preparation for the JFK departure-delay heat map]
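
A sketch of that preparation: average the departure delay for every weekday/hour combination at JFK, using lubridate to extract the weekday (jfk_delays is a name chosen for this example):

library(lubridate)

jfk_delays <- flights %>%
  filter(origin == "JFK", !is.na(dep_delay)) %>%
  mutate(weekday = wday(time_hour, label = TRUE)) %>%
  group_by(weekday, hour) %>%
  summarise(avg_delay = mean(dep_delay))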

Here is the command:
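
A minimal sketch, using geom_tile() with the average delay mapped to the fill aesthetic:

ggplot(data = jfk_delays, aes(x = hour, y = weekday, fill = avg_delay)) +
  geom_tile()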

Again, as with the other plots, there are ample ways you can manipulate the heat map above. Hopefully this blog post has given you the skills to start exploring and using ggplot2 in your work.

Do you have any questions about the commands above? Or do you want to connect and learn more about how to use ggplot2 in your work? Please comment on this post and I will reach out to you.

All content in this post has been sourced from the eastside chapter of the Seattle useR group, of which ProCogia is a co-organizer. Here is the group’s GitHub page: https://github.com/SeattleRUser/2018-08-23_Data_Visualization_in_R.

Tweet Clustering with word2vec and K-means

Most of the data we encounter in the real world is unstructured. A perfect example of unstructured data, text contains a vast amount of information that isn’t structured in a way a computer can easily process. Translating unstructured data into something that a computer can act upon is a major focus of artificial intelligence. The field of natural language processing (NLP) has grown tremendously in the past several years, and products such as smart assistants have developed out of this work. With our clients at ProCogia, we find that an important application of data science is to bring structure to large unstructured datasets like tweets, customer reviews, or survey responses.

Let’s look at some real-world textual data and see how applying machine learning can lead to potentially valuable insights.

For this short analysis we’ll look at tweets. FiveThirtyEight, the online news organization best known for political polling analysis, published a dataset of tweets linked to Russian trolls. We’ll explore this dataset and use K-means, a relatively simple machine learning algorithm, to extract topics from similar tweets. Finally, we’ll look at when some of these topics were popular in relation to news stories during the 2016 election.

The published data is available in 13 csv files and amounts to nearly three million tweets. For each tweet we know the author’s Twitter handle, tweet content, published date, language, and region. To help focus our attention, let’s examine only English language tweets in the United States. Applying these filters to the dataset results in just under two million tweets. To get started, let’s plot the number of tweets per month from December 2015 to May 2018.

[Figure: number of tweets per month, December 2015 to May 2018]

Right away, we can see clear patterns between tweet activity and major developments in the presidential campaigns.

Next, let’s clean up the text of the tweets a bit. I remove all punctuation, make all words lowercase, and remove stop words (common words like “the”) and non-words such as internet links and emojis. Here’s an example of the tweet cleaning process:

“Veteran strategist Paul Manafort becomes Trump's campaign chairman https://t.co/8c9twG9smG”

Cleaned up, this tweet is:

“veteran strategist paul manafort becomes trumps campaign chairman”

Although the cleaning process goes a long way toward rendering the text understandable to a computer, at the end of the day the computer must deal with numbers, not words. The process of converting words to numbers is called “vectorizing” and there are many techniques available to accomplish this task. I use a Python library called Gensim to train a shallow neural network according to the word2vec algorithm developed by researchers at Google to vectorize the tweet words.

The nice thing about word2vec is that similar words, or words that are used in sentences in similar ways, are close to each other in vector space. For example, if we ask the word2vec model about the most similar words to “america,” we get “country,” “usa,” and “nation,” among others. Conveniently, the word2vec model has already grouped similar word topics for us, but we desire to group whole tweets. To accomplish this, we need to translate the collection of words in an individual tweet into a single vector representation. I have chosen to simply take the average of each word vector in the tweet.

With each tweet now represented by an average word vector, we can use any unsupervised clustering technique to group similar tweets. For simplicity, I have used K-means, an algorithm that iteratively updates a predetermined number of cluster centers based on the Euclidean distance between the centers and the data points nearest them. In the end, any single tweet will fall into one of k clusters, where k is the user-defined number of expected clusters. Best practices exist for determining the optimal value of k, but in this case I have simply chosen a large number—50. I then inspected the clusters manually to combine similar clusters and identify the most distinct ones.

Here is the breakdown of cluster size (it’s worth noting that the cluster label number is completely random—there is no larger meaning behind any pattern that may arise in such a distribution):

[Figure: number of tweets in each of the 50 clusters]

We see that some clusters have many more tweets than others. This could mean one of two things—first, that there simply were many tweets posted with a similar topic, or that those clusters are defined by vague words and are not unique. The latter seems to be true of cluster 8, while the former is true of cluster 37 where a couple of popular, generic quotes were highly shared. By inspecting a few randomly chosen tweets in each cluster, I found the following interesting groups:

  • Clusters 0, 5, 21, 36, and 39 were all related to holiday celebrations. This isn’t so interesting by itself, but we see an expected increase in these tweets around Thanksgiving, Christmas, and New Year’s Day.

  • Clusters 1, 10, 12, 29, 41, 42, and 46 are related to politics in one way or another. These tweets show a sharp increase in frequency following candidate announcements along with an increase leading up to election day.

  • Clusters 4, 15, 19, 25, and 33 were related to crime and or terrorism. Overall, these were the most frequent tweets to be posted, with a slight uptick as the election grew nearer.

  • Clusters 32 and 39 were tweets that contained political news stories. These stories remained constant throughout the campaigns.

  • Finally, clusters 12 and 30 are two unique cases. These tweets were related to the Black Lives Matter movement. They started off as infrequent tweets but became more frequent as the campaigns heated up.

[Figure: tweet frequency over time for the selected clusters]

Using machine learning techniques, we were able to extract a few insights from a large dataset of tweets associated with Russian trolls during the 2016 Presidential campaign. When working with clients, I often find that dealing with unstructured data is a pain point for them. But with the techniques showcased in this blog, we can extract actionable insights from social media messages, online reviews, survey responses, and even advertisements. Unstructured data sources become less intimidating and more valuable to analyze. Our clients can turn these insights into action through targeted marketing campaigns or new product features and gain an advantage in their respective markets.

A Note from Our Advisor

It has been almost four years since I joined ProCogia as an advisor. Today, the values of ProCogia’s founder Mehar Singh still inspire me. At ProCogia, we strive for open communication, teamwork, continuous learning, and career development for all our employees. Our data scientists embody these values as they exercise their deep analytic and mathematical skills, working together to bring new insights to our clients. That culture of open, respectful dialogue mixed with challenging new learning enriches my own experience with the team.

I regularly meet with all team members individually and discuss both their growth in the company and their growth as professionals. Our conversations include everything from specific technical skills to relevant business topics in Seattle-area industries. As we talk, we’re shaping our understanding of our customers’ business drivers as well as pinpointing technologies that are important for us to learn. I find these interchanges help bring us together and make us a stronger team. Not only that, I’m learning a lot from the team about the direction of data science as they practice it.

My work with Mehar is exciting as we discuss the maturation of data science and how to best apply it for our customers in the Seattle area and beyond. It’s rewarding to spend time sharing what we have learned from each team member and investigate new ways to help them grow professionally. I am thankful for having joined this team and look forward to our next set of opportunities.