Domopalooza 2019

Domo has found a way to connect virtually all the data you could want access to and turn it into a user-friendly platform that translates that data into visualizations anyone in your company can understand. Even if you are technologically challenged, you will still be able to create impressive visualizations. The platform can also be customized to client needs, which is where a partner like ProCogia can come in and build intricate connectors for Domo clients. The capabilities of this platform are astounding.

Domopalooza was an action-packed week: all the knowledge your head could hold, shared at general sessions, break-out sessions, and the Genius Lab; an abundance of delicious food; and activities such as evening concerts and skiing at world-renowned Snowbird on the last day.

The Power of the Platform was the theme of the conference. Domo made some big announcements, such as new data science, AI, and machine learning tiles that will make Domo a more capable all-in-one platform. They also announced their Domo for Good program, which will support non-profits in using the platform to make decisions that directly affect communities and those in need.

Derek White explained how Domo has changed the game for BBVA by letting them see data in real time, which directly affects their business. One after another, speakers shared stories of how Domo has affected their work and their companies. The recurring theme was that Domo made it simple for companies to understand what their data was telling them. This led to collaboration between departments based on real-time data, and to informed decisions based on data they already had available.

Domo wrote a quick recap post that you can access here for more information on Domopalooza.

We came to Domopalooza as a sponsor and spent most of our time near the ProCogia booth, talking with anyone who came by wanting to know more about what we do. We are one of the few partners capable of implementing the data science capabilities within Domo. We were pleasantly surprised to see how many people from across the globe made it a priority to be at Domopalooza this year. It was the perfect meeting place for everyone to come together, find support in their implementations, and learn about the new capabilities that Domo is releasing!

ProCogia recently partnered with Domo because we have seen what the platform is capable of and want to be a part of it. ProCogia is also Domo-certified, so we can hit the ground running and implement any part of Domo for your company.

Please reach out to us at outreach@procogia.com or give us a call at 425-624-7532 if you want information on how we can help your company succeed.


ProCogia at RStudio::conf 2019

As a designated enterprise partner and proud sponsor of RStudio’s annual conference, ProCogia brought a strong contingent of data scientists and sales executives to this year’s conference in Austin. It was an honor to witness the tremendous growth and popularity of this conference, which drew over 1,700 participants this year. The two-day main conference was preceded by two additional days of workshops and certifications, and some of our own ProCogia data scientists got certified in implementing RStudio products at an enterprise level.

It would be a monumental job to recap the entire conference, which had multiple tracks and themes, but here is a humble recap of this year’s highlights:

R is production ready

This was the dominant theme of this year’s edition, and we could not agree more. RStudio’s CTO and the inventor of Shiny, Joe Cheng, kicked off the conference with his Shiny in Production keynote. He led the audience through a journey of how Shiny started as a data scientist’s visualization tool and gradually became a production-ready product that can be used at an organizational level. There were some obvious technical and cultural challenges along the way: software engineers were reluctant to adopt Shiny apps built from a data scientist’s perspective, and organizational IT pushed back on scaling them. He provided plenty of examples of how Shiny has evolved to become more robust and ready for widespread use. The conference also introduced RStudio’s emerging products, RStudio Package Manager and RStudio Connect (covered in the talk “RStudio Connect: Past, present, and future”), which can be used to create an enterprise-level data science environment (something ProCogia can help with).

Sharing is caring

David Robinson from DataCamp presented his keynote around The Unreasonable Effectiveness of Public Work, emphasizing how important it is for a data scientist to evangelize their work within their organization and the broader community. Data science doesn’t operate in a silo and neither should a data scientist. He suggested using social media, authoring blogs and articles, even writing a book (with practice and experience) to engage with the community. We, at ProCogia, also believe in giving back to the local R programming community and do so through organizing Bellevue’s useR group meetup every other month in our office location.

Learning, Learning, Learning!

Keeping an open mind about learning new concepts and techniques is key to professional growth. Noted education researcher Felienne’s keynote explored how code can be taught in a non-trivial way, drawing on her research into teaching programming to school kids. It offered a very interesting way to look at programming in general and how you can master it through a practice-based approach.

Other notable talks that caught our attention were:

Of course, this blog post can’t cover all the talks and workshops from the conference, but hopefully it provides enough of a summary to get you excited about R and RStudio! Please feel free to reach out to us at outreach@procogia.com if you want to adopt RStudio products at your enterprise, give a talk at our Seattle meetup, or are just curious about R in general.

Stay tuned for our coverage of the next conference from San Francisco!

Image Processing in Python

Deep learning is a widely used technique that is renowned for its high accuracy. It can be used in various fields such as regression/classification, image processing, and natural language processing. The downside of deep learning is that it requires a large amount of data and high computational power to tune the parameters. Thus, it may not be an effective method for solving simple problems or building models on a small data set. One way to take advantage of the power of deep learning on a small data set is to use pre-trained models built by companies or research groups. VGG16, which is used in this post, was developed and trained by Oxford’s Visual Geometry Group (VGG) to classify images into 1000 categories. Besides classification, VGG16 can also be used for other image-processing applications by changing the last layer of the model.

I have a 2-month-old daughter, so my wife and I had to prepare some clothes, accessories, and furniture before her arrival. My wife asked me to build a model to find onesies similar to the ones she wanted to buy. I can utilize the pre-trained deep learning model to help my wife find similar onesies.

To explain how a pre-trained deep learning model can be used for this situation, I collected a total of 20 images from the internet: 10 short sleeve baby onesies, 8 other baby clothes, 1 adult t-shirt, and 1 pair of baby shoes. I included the other baby clothes to check whether the model can distinguish them, and the t-shirt and shoes are included as outliers. This is a great example of how pre-trained models can be used on a small data set.

[Image: the 20 clothing images collected for this example]

Keras is a neural network library written in Python that runs TensorFlow as its backend. The code below will guide you through the details of how to use VGG16 with Keras in Python.

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
from keras.models import Model          # used below to build the feature-extraction model
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import Image, display

VGG16 requires the input image size to be 224 by 224. The function below pre-processes the images and converts them into arrays.

def preprocess(file):
    # Load the image at the 224x224 size VGG16 expects and convert it to a
    # pre-processed array (with the batch dimension stripped off again).
    try:
        img = image.load_img(file, target_size=(224, 224))
        feature = image.img_to_array(img)
        feature = np.expand_dims(feature, axis=0)
        feature = preprocess_input(feature)
        return feature[0]
    except Exception as e:
        print('Error:', file, e)
        raise


imgs = [preprocess('C:/Users/pc-procogia/Desktop/ProBlogia/'+str(i)+'.jpg') for i in range(1,21)]
X_pics = np.array(imgs)     

The last layer of a neural network determines the output of the model. VGG16 is primarily built for classification, so the last layer needs to be modified to extract raw image features from the data set. In this post, no parameters will be trained; only the pre-trained layers will be used.

def feature_extraction(images):
    # Load VGG16 with its fully connected layers (include_top=True) and ImageNet weights.
    base_model = VGG16(weights='imagenet', include_top=True, input_shape=(224, 224, 3))

    # Freeze every layer: no parameters will be trained.
    for layer in base_model.layers:
        layer.trainable = False

    # Instead of the 1000-class prediction layer, use the first fully connected
    # layer (fc1) as the output, so the model returns 4096 raw features per image.
    feature_model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc1').output)
    feature_model.summary()

    pic_features = feature_model.predict(images)
    return pic_features


pic_features = feature_extraction(X_pics)     

[Image: model summary, with the fc1 layer (4,096 features) highlighted in a red box]

As you can see from the results above, fc1 in the red box indicates that a total of 4,096 features were extracted from each image, and the number of trainable parameters is zero since none of them were trained. The number of non-trainable parameters is more than 100 million, which means it would take a long time to tune them on a local machine. Because not everyone has access to a high-performance computing system, the pre-trained model comes in handy here.

Now features have been extracted from each clothing image. Cosine similarity is a good metric for finding similar clothes because every image is now represented as a vector. The code below calculates the cosine similarities and shows the most similar clothes.

dists = cosine_similarity(pic_features)
dists = pd.DataFrame(dists)

def get_similar(dists):
    L_list = []
    for item in range(len(dists)):
        # Rank image numbers (1-based) from most to least similar to this item.
        L = [i[0] + 1 for i in sorted(enumerate(dists[item]), key=lambda x: x[1], reverse=True)]
        L_list.append(L)
        print('=====================')
        print('=== Your Favorite ===', item)
        display(Image(filename=str(item + 1) + '.jpg', width=200, height=200))
        print('--- Similar Clothes ---')
        for i in L[1:6]:   # skip L[0], which is the image itself
            display(Image(filename=str(i) + '.jpg', width=200, height=200))
    return L_list
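
As a quick usage sketch (assuming the numbered .jpg files sit in the working directory, as above), calling the helper displays each image with its five closest matches and also returns the ranked image numbers so you can work with them directly:

# Display every image alongside its five most similar clothes and keep the rankings.
similar_lists = get_similar(dists)

# similar_lists[0] ranks all image numbers by similarity to 1.jpg;
# entries 1-5 are its five closest matches (entry 0 is the image itself).
print(similar_lists[0][1:6])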

Using the image of a onesie as the input, the following five images are the similar clothes the deep learning model chose. The model successfully picked short sleeve onesies for us.

[Images: the input onesie (1.jpg) and the five most similar clothes chosen by the model]

In the case below, the search results include onesies but also an adult t-shirt. It seems that color was one of the features that caused the adult t-shirt to appear in the results.

[Images: the input onesie (5.jpg) and search results that include an adult t-shirt]

The least similar image to the short sleeve onesies was the baby shoes. The shape of the shoes is obviously different from that of onesies, and the simple feature extraction was able to distinguish them even though the model was not specifically tuned for this case.

[Image: the baby shoes (19.jpg), the least similar item]

To increase model accuracy or to focus on certain characteristics of the clothes such as color, shape, or pattern, you can add layers on top of the pre-trained model and train only those. That way, you do not have to train more than 100 million parameters; you only train the small number of additional parameters your local machine can handle to achieve the results you want.
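
As a rough sketch of that idea (not the exact code for the next post), here is how you might stack a small trainable classifier on top of the frozen VGG16 base in Keras; the layer sizes and the three example classes are illustrative assumptions:

from keras.applications.vgg16 import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

# Frozen convolutional base; include_top=False drops the original fully connected layers.
base = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

# Small trainable head on top of the frozen features (sizes chosen for illustration).
x = Flatten()(base.output)
x = Dense(128, activation='relu')(x)
out = Dense(3, activation='softmax')(x)   # e.g. onesie / other baby clothes / outlier

model = Model(inputs=base.input, outputs=out)
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(X_pics, y_labels, epochs=10, batch_size=8)   # y_labels would be one-hot class labels you provide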

Learning how to add additional features to the pre-trained model will be an interesting topic for the next post. I hope you enjoyed this post on deep learning and feel motivated to start your own projects using pre-trained models.

Data Visualization in R: a ggplot2 primer

Visualization is one of the most effective ways to present results. In the realm of statistical analysis, R is a popular programming language for initial exploratory analysis and statistical modelling. ggplot2, part of R’s tidyverse, is the most popular package for building the supporting visualizations. This blog post provides a quick guide on how to get started with the package.

The “gg” in ggplot2 stands for “grammar of graphics”. Think of it as the semantics of a visualization language; the package is built around a technique similar to constructing a meaningful sentence. As in a sentence, you combine different types of elements to create any kind of graphic display. Let’s look at the most prominent building blocks for a plot in ggplot2:

1. Data

The goal of any visualization is to present data effectively. There is a feedback loop: to have an effective visualization, it is essential to have data in a well-formatted, concise form. The dataset needs to be in a ‘tidy’ state (avid R users will get the reference). What do I mean by tidy data? Each row should contain all of the information you want to plot for a particular point. Design decisions directly drive the formatting of the data: should the dataset be wide or long? Is scaling required? Would faceting or grouping certain columns make a better input for the visualization?

Building a data manipulation pipeline is highly recommended, and I could dedicate a whole other blog post to it, so stay tuned! For the current exercise I will provide the R commands used to manipulate the data for each plot being discussed.

2. Aesthetic mapping

In the ggplot2 universe, an aesthetic means “something you can see”. This building block covers the primary elements of a visualization: what are the x and y axes? What colors should be used? What is the shape of the data points? In summary, here are some aesthetic elements of a ggplot2 command:

·       Position (i.e., on the x and y axes)

·       Color (“outside” color)

·       Fill (“inside” color)

·       Shape (of points)

·       Line type

·       Size

 

3. Geometric objects

After you have finalized the data and decided on the aesthetic elements, it’s time to put them together using a geometric object. These are the actual marks that determine what kind of plot is produced. If you have some background in visualization, you will recognize ggplot2’s geometric elements very easily:

·       Points (geom_point, for scatter plots, dot plots, etc.),

·       Lines (geom_line, for time series, trend lines, etc.),

·       Boxplot (geom_boxplot)

·       And so much more…

Now that you have a basic understanding of how ggplot2 works, it’s time to start making plots!

We will be using the popular nycflights13 dataset, which contains on-time airline data for all flights departing NYC in 2013, along with useful metadata on airlines, airports, weather, and planes.

[Image: a first look at the nycflights13 data]

Scatterplot:

We will use a scatterplot to identify any relationship between the arrival and departure delays of Alaska Airlines flights. Here is the initial data preparation required to do that:

[Images: data preparation code and the resulting data frame]

And here is the scatterplot command:

[Image: ggplot scatterplot command]

Let us understand what is happening in the above command. Within the ggplot() function call, we specify two of the components of the grammar:

  • The data frame to be all_alaska_flights by setting data = all_alaska_flights

  • Aesthetic mapping by setting aes(x = dep_delay, y = arr_delay)

We add a layer to the ggplot() function call using the + sign. The layer in question specifies the third component of the grammar: the geometric object. In this case the geometric objects are points, set by specifying geom_point().

Some notes on layers:

·       The + sign comes at the end of lines, not at the beginning. You'll get an error in R if you put it at the beginning.

·       When adding layers to a plot, you are encouraged to hit Return on your keyboard after entering the + so that the code for each layer is on a new line. As we add more and more layers to plots, you'll see this will greatly improve the legibility of your code.

 And here is the output of the command:

[Image: scatterplot of departure vs. arrival delays]

It is very evident that there is some over-plotting happening in our plot. The first way of relieving over-plotting is by changing the alpha argument in geom_point() which controls the transparency of the points. By default, this value is set to 1. We can change this to any value between 0 and 1 where 0 sets the points to be 100% transparent and 1 sets the points to be 100% opaque.

[Images: scatterplot command with a lower alpha value and the resulting plot]

Another popular method is to add a color component. In the given dataset, we can use the origin airports as a differentiating dimension for color to add to the scatterplot.

[Image: scatterplot colored by origin airport]

Line Chart:

Let’s plot a line chart to understand variations in early January (first 15 days) weather at the Newark airport.

As usual, we need to do some data prep for the graphic:

[Images: data preparation code and the resulting data frame]

Here is what the ggplot command looks like:

[Images: line chart command and the resulting plot]

Box Plot:

We will use ggplot2 to create a box plot of temperature variations over the year.

[Image: box plot command and the resulting plot]

Do you see something amiss with the plot above? We are plotting a numerical, not a categorical, variable on the x-axis. This gives us one overall boxplot without any groupings. We can get around this by applying the factor() function to our x variable:

[Image: box plot command using factor() on the x variable, and the resulting plot]

Bar Chart:

Next, we will plot a bar chart to compare the number of flights from each airline at the New York airports.

[Images: bar chart command and the resulting plot]

We can modify the aesthetic elements of the bar chart to plot different variations of the same dataset. We can also use a bar chart to compare two categorical variables; this variation is called a stacked bar chart.

[Image]

Here is the new ggplot2 bar chart command:

[Images: stacked bar chart command and the resulting plot]

Another variation on the stacked bar chart is the side-by-side bar chart also called a dodged bar chart.

[Images: dodged bar chart command and the resulting plot]

We can also create a faceted bar chart, which draws a separate bar plot for each level of a categorical variable, making it easier to compare values of a common variable across those levels.

[Images: faceted bar chart command and the resulting plot]

Heat map:

We will build a heat map of departure delays at the JFK airport, using color to compare delays for every hour of each day of the week. It sounds intense, but it is easy to achieve with a heat map.

Let’s do the data preparation first:

[Image: data preparation for the heat map]

Here is the command:

Again, as with the other plots, there are ample ways you can tweak the heat map above. Hopefully this blog post has given you the skills to start exploring and using ggplot2 in your own work.

Do you have any questions about the commands above? Or do you want to connect and learn more about how to use ggplot2 in your work? Please comment on this post and I will reach out to you.

All content in this post has been sourced from the eastside chapter of the Seattle useR group, of which ProCogia is a co-organizer. Here is the group’s GitHub page: https://github.com/SeattleRUser/2018-08-23_Data_Visualization_in_R.

Tweet Clustering with word2vec and K-means

Most of the data we encounter in the real world is unstructured. A perfect example of unstructured data, text contains a vast amount of information that isn’t structured in a way a computer can easily process. Translating unstructured data into something that a computer can act upon is a major focus of artificial intelligence. The field of natural language processing (NLP) has grown tremendously in the past several years, and products such as smart assistants have developed out of this work. With our clients at ProCogia, we find that an important application of data science is to bring structure to large unstructured datasets like tweets, customer reviews, or survey responses.

Let’s look at some real-world textual data and see how applying machine learning can lead to potentially valuable insights.

For this short analysis we’ll look at tweets. FiveThirtyEight, the online news organization best known for political polling analysis, published a dataset of tweets linked to Russian trolls. We’ll explore this dataset and use K-means, a relatively simple machine learning algorithm, to extract topics from similar tweets. Finally, we’ll look at when some of these topics were popular in relation to news stories during the 2016 election.

The published data is available in 13 CSV files and amounts to nearly three million tweets. For each tweet we know the author’s Twitter handle, tweet content, published date, language, and region. To help focus our attention, let’s examine only English-language tweets in the United States. Applying these filters to the dataset results in just under two million tweets. To get started, let’s plot the number of tweets per month from December 2015 to May 2018.
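
For readers who want to follow along, here is a hedged sketch of the kind of pandas code that could do this filtering and produce the monthly counts; the file path is an assumption, and the column names (language, region, publish_date, content) follow FiveThirtyEight’s published schema:

import glob
import pandas as pd
import matplotlib.pyplot as plt

# Combine the 13 published CSV files (path assumed).
files = glob.glob('russian-troll-tweets/IRAhandle_tweets_*.csv')
tweets = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)

# Keep English-language tweets from the United States.
tweets = tweets[(tweets['language'] == 'English') & (tweets['region'] == 'United States')]

# Count tweets per month and plot December 2015 through May 2018.
tweets['publish_date'] = pd.to_datetime(tweets['publish_date'])
monthly = tweets.set_index('publish_date').resample('M')['content'].count()
monthly.loc['2015-12':'2018-05'].plot()
plt.ylabel('Tweets per month')
plt.show()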

[Image: number of tweets per month, December 2015 to May 2018]

Right away, we can see clear patterns between tweet activity and major developments in the presidential campaigns.

Next, let’s clean up the text of the tweets a bit. I remove all punctuation, make all words lowercase, and strip out stop words (common words like “the”) as well as non-words such as internet links and emojis. Here’s an example of the tweet cleaning process:

“Veteran strategist Paul Manafort becomes Trump's campaign chairman https://t.co/8c9twG9smG”

Cleaned up, this tweet is:

“veteran strategist paul manafort becomes trumps campaign chairman”
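
A minimal sketch of that cleaning step might look like the following; the NLTK stop-word list is an assumption rather than the exact list used here:

import re
import string
from nltk.corpus import stopwords   # assumes the NLTK stopwords corpus has been downloaded

STOP_WORDS = set(stopwords.words('english'))

def clean_tweet(text):
    text = text.lower()
    text = re.sub(r'http\S+', '', text)                                # drop links
    text = text.translate(str.maketrans('', '', string.punctuation))   # drop punctuation
    words = [w for w in text.split() if w.isalpha() and w not in STOP_WORDS]
    return ' '.join(words)

print(clean_tweet("Veteran strategist Paul Manafort becomes Trump's campaign chairman https://t.co/8c9twG9smG"))
# veteran strategist paul manafort becomes trumps campaign chairman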

Although the cleaning process goes a long way toward rendering the text understandable to a computer, at the end of the day the computer must deal with numbers, not words. The process of converting words to numbers is called “vectorizing” and there are many techniques available to accomplish this task. I use a Python library called Gensim to train a shallow neural network according to the word2vec algorithm developed by researchers at Google to vectorize the tweet words.
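
Here is a sketch of how that training step could look with Gensim (the hyperparameters are illustrative, and the vector_size argument shown is the Gensim 4.x name; older versions call it size):

from gensim.models import Word2Vec

# Tokenize the cleaned tweets: each tweet becomes a list of words.
tokenized_tweets = [clean_tweet(t).split() for t in tweets['content']]

# Train a shallow word2vec model on the tweet corpus.
w2v = Word2Vec(sentences=tokenized_tweets, vector_size=100, window=5, min_count=5, workers=4)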

The nice thing about word2vec is that similar words, or words that are used in sentences in similar ways, are close to each other in vector space. For example, if we ask the word2vec model about the most similar words to “america,” we get “country,” “usa,” and “nation,” among others. Conveniently, the word2vec model has already grouped similar word topics for us, but we desire to group whole tweets. To accomplish this, we need to translate the collection of words in an individual tweet into a single vector representation. I have chosen to simply take the average of each word vector in the tweet.
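
Continuing the sketch with the hypothetical w2v model above, the similarity check and the tweet-averaging step could look like this:

import numpy as np

# Words used in similar contexts sit close together in vector space.
print(w2v.wv.most_similar('america', topn=5))

def tweet_vector(tokens, model):
    # Average the word vectors of a tweet; fall back to zeros if no token is in the vocabulary.
    vecs = [model.wv[w] for w in tokens if w in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

tweet_vectors = np.array([tweet_vector(tokens, w2v) for tokens in tokenized_tweets])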

With each tweet now represented by an average word vector, we can use any unsupervised clustering technique to group similar tweets. For simplicity, I have used K-means, an algorithm that iteratively updates a predetermined number of cluster centers based on the Euclidean distance between the centers and the data points nearest them. In the end, any single tweet will fall into one of k clusters, where k is the user-defined number of expected clusters. Best practices exist for determining the optimal value of k, but in this case I have simply chosen a large number—50. I then inspected the clusters manually to combine similar clusters and identify the most distinctive ones.
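
The clustering step itself is short with scikit-learn; k = 50 matches the choice described above:

from sklearn.cluster import KMeans

# Group the averaged tweet vectors into 50 clusters.
kmeans = KMeans(n_clusters=50, random_state=42)
cluster_labels = kmeans.fit_predict(tweet_vectors)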

Here is the breakdown of cluster size (it’s worth noting that the cluster label number is completely random—there is no larger meaning behind any pattern that may arise in such a distribution):

[Image: number of tweets in each of the 50 clusters]

We see that some clusters have many more tweets than others. This could mean one of two things—first, that there simply were many tweets posted with a similar topic, or that those clusters are defined by vague words and are not unique. The latter seems to be true of cluster 8, while the former is true of cluster 37 where a couple of popular, generic quotes were highly shared. By inspecting a few randomly chosen tweets in each cluster, I found the following interesting groups:

  • Clusters 0, 5, 21, 36, and 39 were all related to holiday celebrations. This isn’t so interesting by itself, but we see an expected increase in these tweets around Thanksgiving, Christmas, and New Year’s Day.

  • Clusters 1, 10, 12, 29, 41, 42, and 46 are related to politics in one way or another. These tweets show a sharp increase in frequency following candidate announcements along with an increase leading up to election day.

  • Clusters 4, 15, 19, 25, and 33 were related to crime and/or terrorism. Overall, these were the most frequent tweets to be posted, with a slight uptick as the election grew nearer.

  • Clusters 32 and 39 were tweets that contained political news stories. These stories remained constant throughout the campaigns.

  • Finally, clusters 12 and 30 are two unique cases. These tweets were related to the Black Lives Matter movement. They started off as infrequent tweets but became more frequent as the campaigns heated up.
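
The manual inspection mentioned above can be as simple as sampling a few tweets per cluster, sketched here with the hypothetical cluster_labels from the earlier snippet:

# Attach the cluster label to each tweet and eyeball a few examples per cluster.
tweets = tweets.reset_index(drop=True)
tweets['cluster'] = cluster_labels

for k in [0, 8, 12, 37]:   # a few of the clusters discussed above
    print('--- cluster', k, '---')
    print(tweets.loc[tweets['cluster'] == k, 'content'].sample(3, random_state=1).tolist())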

[Image: tweet frequency over time for the cluster groups described above]

Using machine learning techniques, we were able to extract a few insights from a large dataset of tweets associated with Russian trolls during the 2016 Presidential campaign. When working with clients, I often find that dealing with unstructured data is a pain point for them. But with the techniques showcased in this blog, we can extract actionable insights from social media messages, online reviews, survey responses, and even advertisements. Unstructured data sources become less intimidating and more valuable to analyze. Our clients can turn these insights into action through targeted marketing campaigns or new product features and gain an advantage in their respective markets.

A Note from Our Advisor

It has been almost four years since I joined ProCogia as an advisor. Today, the values of ProCogia’s founder Mehar Singh still inspire me. At ProCogia, we strive for open communication, teamwork, continuous learning, and career development for all our employees. Our data scientists embody these values as they exercise their deep analytic and mathematical skills, working together to bring new insights to our clients. That culture of open, respectful dialogue mixed with challenging new learning enriches my own experience with the team.

I regularly meet with all team members individually and discuss both their growth in the company and their growth as professionals. Our conversations include everything from specific technical skills to relevant business topics in Seattle-area industries. As we talk, we’re shaping our understanding of our customers’ business drivers as well as pinpointing technologies that are important for us to learn. I find these interchanges help bring us together and make us a stronger team. Not only that, I’m learning a lot from the team about the direction of data science as they practice it.

My work with Mehar is exciting as we discuss the maturation of data science and how to best apply it for our customers in the Seattle area and beyond. It’s rewarding to spend time sharing what we have learned from each team member and investigate new ways to help them grow professionally. I am thankful for having joined this team and look forward to our next set of opportunities.

Guessing the State of Jeopardy!

I am a big Jeopardy! fan and when I found out about the Jeopardy! data set, I had to play around with it. After taking a look, I decided to see if I could, in a small way, replicate the famous quiz-playing machine Watson developed by IBM.

For this blog post, I'll investigate how well I can build a simple classifier to answer Jeopardy! questions about states in the U.S. I'll use previous Jeopardy! answers and questions to train my model.

Let's get started!

Step #1: Setup and Import Packages

# import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
# set up display properties
%matplotlib inline
# use this command to set the display wider.
pd.set_option('max_colwidth', 300)  
# packages used in modeling
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
from sklearn.metrics import confusion_matrix

Step #2: Read in data and explore a little

# read in data set
df = pd.read_csv('JEOPARDY_CSV.csv')
# how big is the data set
df.shape

(216930, 7)

# take a look at the first few rows of the dataframe
df.head()

Show_Number Air_Date Round Category Value Question Answer
0 4680 12/31/04 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory Copernicus
1 4680 12/31/04 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves Jim Thorpe
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
3 4680 12/31/04 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", this company served its billionth burger McDonald's
4 4680 12/31/04 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States John Adams

Side note:

So, Jeopardy! is a bit unusual in its naming of answers and questions, e.g., I just said "answers and questions" instead of "questions and answers." Jeopardy! requires contestants to state their responses in the form of a question. For example, for row 0 shown above, Alex Trebek (the game show host) would say, "Answer: For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory." Then a contestant would have to answer with "Who is Copernicus?" in order to be awarded $200. If they instead said "Copernicus," then the buzzer would sound and they would lose $200. For sanity's sake, I'll reverse the standard Jeopardy! naming and use the column names shown in the dataframe above.

# what are the most common Jeopardy! answers?
df['Answer'].value_counts().head()

China        216
Australia    215
Japan        196
Chicago      194
France       193
Name: Answer, dtype: int64

# what about for the Jeopardy! vs Double Jeopardy! rounds?
df[df['Round']=='Jeopardy!']['Answer'].value_counts().head()

China         117
California    115
Chicago       114
Australia     109
Japan         106
Name: Answer, dtype: int64

# what about for the Jeopardy! vs Double Jeopardy! rounds?
df[df['Round']=='Double Jeopardy!']['Answer'].value_counts().head()

Australia    102
China         95
India         93
Paris         89
Mexico        89
Name: Answer, dtype: int64

Step #3: Build dataframe of only U.S. states

Geography is big in Jeopardy! in both rounds. Let's select only answers that are U.S. states.

Map of the United States of America (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8884010)

# build list of U.S. states
state_list = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina',  'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',  'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',  'West Virginia', 'Wisconsin', 'Wyoming']
# verify the length is 50
len(state_list)

50

# create dataset that contains only answers from state_list.
df_states = df[df['Answer'].isin(state_list)].copy(deep= True)
# check the size
df_states.shape

(3893, 7)

Our original data set had about 200,000 rows and selecting only rows where the answer is a U.S. state drops it down to almost 4,000. Let's explore this data set a bit.

# what does the data look like?
df_states.head()
Show_Number Air_Date Round Category Value Question Answer
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
95 5957 7/6/10 Double Jeopardy! SEE & SAY $800 Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859 Oregon
108 5957 7/6/10 Double Jeopardy! NEWS TO ME $1,600 In a surprise, Ted Kennedy's old Senate seat in this state went to a Republican in a January 2010 election Massachusetts
299 5690 5/8/09 Jeopardy! A STATE OF COLLEGE-NESS $200 Baylor, Stephen F. Austin, Rice Texas
# what are the most popular answers about U.S. states?
df_states['Answer'].value_counts().head()

California    180
Alaska        161
Hawaii        157
Texas         153
Florida       140
Name: Answer, dtype: int64

# what percentage of questions are about the top states?
df_states['Answer'].value_counts(normalize = True).head() *100

California    4.623684
Alaska        4.135628
Hawaii        4.032880
Texas         3.930131
Florida       3.596198
Name: Answer, dtype: float64

The top three answers each appear about 4% of the time. If the states were uniformly sampled, each should appear about 2% of the time (1/50 x 100%). So California, Alaska, and Hawaii are over-represented in the Jeopardy! clues.

The goal of my exploration is to build a model that can predict the correct answer to a Jeopardy! question. I'm going to simplify this a bit because there are 50 states, which is a lot to work with at once. So I'll focus on only the top 3 states (California, Alaska, and Hawaii) plus 3 of the states where I've lived (Arizona, Washington, and Texas). Let's see where Washington and Arizona rank in popularity.

df_states['Answer'].value_counts(normalize = True)['Arizona'] *100

2.003596198304649

df_states['Answer'].value_counts(normalize = True)['Washington']*100

2.440277421012073

# build a dataframe with only 6 states
states_6 = ['California', 'Alaska', 'Hawaii', 'Arizona', 'Washington', 'Texas']
df_states_6 = df_states[df_states['Answer'].isin(states_6)].copy(deep= True)
# check the size
df_states_6.shape

(824, 7)

# take a look 
df_states_6.head()
Show_Number Air_Date Round Category Value Question Answer
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
299 5690 5/8/09 Jeopardy! A STATE OF COLLEGE-NESS $200 Baylor, Stephen F. Austin, Rice Texas
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
477 5243 5/30/07 Jeopardy! STATE SUPERLATIVES $200 A valley at 282 feet below sea level in this state is the lowest point in the Western Hemisphere California
# 10 questions about Washington
df_states_6[df_states_6['Answer']=='Washington'].head(10)
Show_Number Air_Date Round Category Value Question Answer
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
501 5243 5/30/07 Jeopardy! STATE SUPERLATIVES $1,000 Its Boeing manufacturing plant in Everett is the world's largest building by volume Washington
3088 4487 2/24/04 Double Jeopardy! THE PRODUCERS $1,600 It leads the states in apple production Washington
3250 5084 10/19/06 Jeopardy! AMERICAN COUNTIES $1,000 While many states have counties named Lincoln, this is the only state that has one named Snohomish Washington
5099 4842 10/4/05 Jeopardy! YOUNG ABE LINCOLN $400 Among the books read by Lincoln as a youngster were "Robinson Crusoe", "Aesop's Fables", & Mason Weems' "Life of" this man Washington
6787 4306 4/28/03 Double Jeopardy! STATE: THE OBVIOUS $800 Florida's in the southeast corner of the 48 contiguous states; this state is in the northwest corner Washington
9033 3409 6/3/99 Jeopardy! BILL GATES' 50 BILLION $200 In this state where he lives, Bill could pay the governor's salary for 413,000 years Washington
11369 5279 7/19/07 Jeopardy! WASHINGTON, LINCOLN OR GEORGE W. BUSH $200 Was born a British subject Washington
12308 6230 10/21/11 Jeopardy! STATE OF THE NOVEL $200 "Twilight" Washington

Washington: State or President?

Taking a look at questions where "Washington" is the answer shows that the question could refer to either Washington the state or Washington the president. This might play a role in how accurate our model can be on questions about Washington versus questions about Arizona.

Step 4: Feature Engineering

Now that we've decided that we want to predict the correct state name given a question, we have to figure out what sort of features a question has that can inform us about how to pick the correct answer. What patterns emerge for each state? Looking at the previous output showing the clues about Washington, a common word that appears is "Snohomish." Does this word appear in questions about any of the other states?

df_states_6[df_states_6['Question'].str.contains("Snohomish")].head(10)
Show_Number Air_Date Round Category Value Question Answer
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
3250 5084 10/19/06 Jeopardy! AMERICAN COUNTIES $1,000 While many states have counties named Lincoln, this is the only state that has one named Snohomish Washington

Nope, the only two questions containing the word "Snohomish" have "Washington" as their answer. What about "Evergreen"? After all, Washington is the "Evergreen State."

df_states_6[df_states_6['Question'].str.contains("Evergreen")].head(10)
Show_Number Air_Date Round Category Value Question Answer
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
106424 3345 3/5/99 Double Jeopardy! STATE LICENSE PLATES $800 301-JRJ Evergreen State Washington
108152 3631 5/22/00 Jeopardy! ASIAN-AMERICAN ACHIEVERS $100 Gary Locke of this "Evergreen State" is the first Chinese-American governor in U.S. history Washington
164029 5122 12/12/06 Jeopardy! STATE THE STATE $600 Shots like the one seen <a href="http://www.j-archive.com/media/2006-12-12_J_13.jpg" target="_blank">here</a> show you why it's the Evergreen State Washington

Yes! Evergreen is a great indicator that a question is about the state of Washington.

Using this technique of finding words that are popular in questions about specific states, we will build a model that counts the words contained in the questions for each state and uses those counts to predict the answer when we present the model with a never-before-seen question. The test of the model is to see how accurate it is on questions where we know the answer but the model doesn't. We will train the model on 80% of the data and then test it by asking it to predict the answer for the remaining 20%.

# transform dataframe columns containing question and answers to a list
list_docs = df_states_6['Question'].tolist()
list_labels = df_states_6['Answer'].tolist()
# split the data into a training and test set. 
X_train, X_test, y_train, y_test = train_test_split(list_docs, list_labels, test_size = 0.2, random_state = 42)

Step #5: Building the model

Now, I haven't yet counted all the words in my data set; I'm going to do this at the same time as I build my model. Python's scikit-learn package has a really great tool called Pipeline. It lets you pass your data through a pipeline of feature engineering and modeling all in one step. Here, I'll build a pipeline that uses CountVectorizer and then passes the data to a logistic regression model.

# build the pipeline
pipeline_LR = Pipeline([('vect', CountVectorizer(decode_error='ignore')), ('clf', LogisticRegression())])
# pass the data to the pipeline and build a model using the training data set
pipeline_LR = pipeline_LR.fit(X_train, y_train)
# pass the test data to the pipeline and get the predicted values
y_test_predicted = pipeline_LR.predict(X_test)
# calculate the accuracy of the model using the test data and the predicted value of the test data
accuracy = accuracy_score(y_test, y_test_predicted)
accuracy

0.5818181818181818

Hmmm... the accuracy of the model is 58%. Is that good or is that bad? Well, I'm trying to predict the answer to a question where there are 6 possible answers (California, Alaska, Hawaii, Arizona, Washington, or Texas). If I were to guess randomly, I would be correct 1/6th of the time, or ~17% of the time. So an accuracy of 58% is actually pretty good. It's more than 3 times better than randomly guessing.

Now, there's a lot more I could do to improve the model. I could try a few other classification models, like decision trees or a support vector classifier (a quick sketch of that follows below), adjust the models' default parameters, or clean up my data a bit more by removing stop words and punctuation. For now, I'm pretty happy with an accuracy that is better than random.
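
As a sketch of that kind of experiment, swapping in a different classifier changes only one element of the pipeline; a linear support vector classifier is one reasonable choice to try:

from sklearn.svm import LinearSVC

# Same vectorizer, different classifier.
pipeline_SVC = Pipeline([('vect', CountVectorizer(decode_error='ignore')), ('clf', LinearSVC())])
pipeline_SVC = pipeline_SVC.fit(X_train, y_train)
print(accuracy_score(y_test, pipeline_SVC.predict(X_test)))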

Step #6: Investigating the model

Where did the model do well and where did it perform poorly? A confusion matrix is a great visual representation of the answer to this question. A good-looking confusion matrix will have a strongly defined diagonal line. When the true label is California, then most of the predicted labels should fall under California as well.

CM = confusion_matrix(y_test, list(y_test_predicted),labels = states_6)
ax = sns.heatmap(CM, linewidths=.5,  xticklabels = states_6, yticklabels = states_6, cmap="YlGnBu", annot = True, fmt = "d")
plt.ylabel('True label')
plt.xlabel('Predicted label');

Confusion Matrix for Six States

I had a theory that Washington might be tricky to get right because of possible confusion with the president. It looks like I might be right.
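
One way to quantify this is to read per-state accuracy straight off the confusion matrix; here is a small sketch using the CM array computed above:

# Per-class accuracy: correct predictions on the diagonal divided by the row totals (true labels).
per_state_accuracy = CM.diagonal() / CM.sum(axis=1)
for state, acc in zip(states_6, per_state_accuracy):
    print(f'{state}: {acc:.0%}')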

Arizona and Washington appear in the data set in about the same numbers. However, in the test set Washington is correctly predicted 6 out of 23 times (i.e., 26% of the time), while Arizona is correctly predicted at a slightly higher rate (5/17, or 29%). Let's take a look at some of the cases where Washington is correctly predicted.

# build data frame containing test questions, answers, and predicted answers.
df_states_6_test = pd.DataFrame(
    data = {"Question":X_test,"Answer_true": y_test,"Answer_predicted": list(y_test_predicted)})
# correctly predicting Washington
df_states_6_test[(df_states_6_test['Answer_true']=='Washington') & 
                        (df_states_6_test['Answer_predicted']=='Washington') ]
Answer_predicted Answer_true Question
58 Washington Washington Joseph J. Ellis' biography of this 18th century man is called "His Excellency"
84 Washington Washington Gary Locke of this "Evergreen State" is the first Chinese-American governor in U.S. history
104 Washington Washington Title destination of Mr. Smith, Billy Jack & The Happy Hooker
124 Washington Washington His widow Martha once gave the O.K. for him to be re-entombed in the U.S. Capitol; a nephew later nixed the idea
135 Washington Washington Edmund Randolph helped draft & ratify the Constitution before becoming this man's Attorney General
140 Washington Washington SAW NOTHING

About half of the questions where Washington is the answer have to do with the president and not the state (#58, #124, #135). It looks like the algorithm is able to pull out information about Washington, whether the question is asking about the state or the president.

Conclusion

I hope you enjoyed this post and are inspired to learn more about natural language processing and classification problems. This post can also be found on GitHub as a Jupyter notebook. Some of the inspiration for this post came from this notebook by Emmanuel Ameisen. Also, you can learn more about pipelines from the Python Data Science Handbook by Jake VanderPlas.