A Note from Our Advisor

It has been almost four years since I joined ProCogia as an advisor. Today, the values of ProCogia’s founder Mehar Singh still inspire me. At ProCogia, we strive for open communication, teamwork, continuous learning, and career development for all our employees. Our data scientists embody these values as they exercise their deep analytic and mathematical skills, working together to bring new insights to our clients. That culture of open, respectful dialogue mixed with challenging new learning enriches my own experience with the team.

I regularly meet with all team members individually and discuss both their growth in the company and their growth as professionals. Our conversations include everything from specific technical skills to relevant business topics in Seattle-area industries. As we talk, we’re shaping our understanding of our customers’ business drivers as well as pinpointing technologies that are important for us to learn. I find these interchanges help bring us together and make us a stronger team. Not only that, I’m learning a lot from the team about the direction of data science as they practice it.

My work with Mehar is exciting as we discuss the maturation of data science and how best to apply it for our customers in the Seattle area and beyond. It’s rewarding to spend time sharing what we have learned from each team member and investigating new ways to help them grow professionally. I am thankful for having joined this team and look forward to our next set of opportunities.

Guessing the State of Jeopardy!

I am a big Jeopardy! fan, and when I found out about the Jeopardy! data set, I had to play around with it. After taking a look, I decided to see if I could, in a small way, replicate Watson, the famous quiz-playing machine developed by IBM.

For this blog post, I'll investigate how well I can build a simple classifier to answer Jeopardy! questions about states in the U.S. I'll use previous Jeopardy! answers and questions to train my model.

Let's get started!

Step #1: Setup and Import Packages

# import general packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# packages used in modeling
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report
from sklearn.metrics import confusion_matrix
# set up display properties
%matplotlib inline
# widen the column display so the full question text is visible
pd.set_option('display.max_colwidth', 300)

Step #2: Read in data and explore a little

# read in data set
df = pd.read_csv('JEOPARDY_CSV.csv')
# how big is the data set
df.shape

(216930, 7)

# take a look at the first few rows of the dataframe
df.head()

Show_Number Air_Date Round Category Value Question Answer
0 4680 12/31/04 Jeopardy! HISTORY $200 For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory Copernicus
1 4680 12/31/04 Jeopardy! ESPN's TOP 10 ALL-TIME ATHLETES $200 No. 2: 1912 Olympian; football star at Carlisle Indian School; 6 MLB seasons with the Reds, Giants & Braves Jim Thorpe
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
3 4680 12/31/04 Jeopardy! THE COMPANY LINE $200 In 1963, live on "The Art Linkletter Show", this company served its billionth burger McDonald's
4 4680 12/31/04 Jeopardy! EPITAPHS & TRIBUTES $200 Signer of the Dec. of Indep., framer of the Constitution of Mass., second President of the United States John Adams

Side note:

So, Jeopardy! is a bit unusual in its naming of answers and questions, e.g., I just said "answers and questions" instead of "questions and answers." Jeopardy! requires contestants to state their responses in the form of a question. For example, for row 0 shown above, Alex Trebek (the game show host) would say, "Answer: For the last 8 years of his life, Galileo was under house arrest for espousing this man's theory." A contestant would then have to respond with "Who is Copernicus?" in order to be awarded $200. If they instead said just "Copernicus," the buzzer would sound and they would lose $200. For sanity's sake, I'll reverse the standard Jeopardy! naming and use the column names shown in the dataframe above.

# what are the most common Jeopardy! answers?
df['Answer'].value_counts().head()

China        216
Australia    215
Japan        196
Chicago      194
France       193
Name: Answer, dtype: int64

# what about for just the Jeopardy! round?
df[df['Round']=='Jeopardy!']['Answer'].value_counts().head()

China         117
California    115
Chicago       114
Australia     109
Japan         106
Name: Answer, dtype: int64

# and for the Double Jeopardy! round?
df[df['Round']=='Double Jeopardy!']['Answer'].value_counts().head()

Australia    102
China         95
India         93
Paris         89
Mexico        89
Name: Answer, dtype: int64

Step #3: Build dataframe of only U.S. states

Geography is big in Jeopardy! in both rounds. Let's select only answers that are U.S. states.

Map of the United States of America (CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=8884010)

# build list of U.S. states
state_list = ['Alabama', 'Alaska', 'Arizona', 'Arkansas', 'California', 'Colorado', 'Connecticut', 'Delaware', 'Florida', 'Georgia', 'Hawaii', 'Idaho', 'Illinois', 'Indiana', 'Iowa', 'Kansas', 'Kentucky', 'Louisiana', 'Maine', 'Maryland', 'Massachusetts', 'Michigan', 'Minnesota', 'Mississippi', 'Missouri', 'Montana', 'Nebraska', 'Nevada', 'New Hampshire', 'New Jersey', 'New Mexico', 'New York', 'North Carolina',  'North Dakota', 'Ohio', 'Oklahoma', 'Oregon', 'Pennsylvania',  'Rhode Island', 'South Carolina', 'South Dakota', 'Tennessee', 'Texas', 'Utah', 'Vermont', 'Virginia', 'Washington',  'West Virginia', 'Wisconsin', 'Wyoming']
# verify the length is 50
len(state_list)

50

# create dataset that contains only answers from state_list.
df_states = df[df['Answer'].isin(state_list)].copy(deep= True)
# check the size
df_states.shape

(3893, 7)

Our original data set had about 217,000 rows; selecting only the rows where the answer is a U.S. state drops that down to just under 4,000. Let's explore this smaller data set a bit.

# what does the data look like?
df_states.head()
Show_Number Air_Date Round Category Value Question Answer
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
95 5957 7/6/10 Double Jeopardy! SEE & SAY $800 Say <a href="http://www.j-archive.com/media/2010-07-06_DJ_27.jpg" target="_blank">this</a> state that was admitted to the Union in 1859 Oregon
108 5957 7/6/10 Double Jeopardy! NEWS TO ME $1,600 In a surprise, Ted Kennedy's old Senate seat in this state went to a Republican in a January 2010 election Massachusetts
299 5690 5/8/09 Jeopardy! A STATE OF COLLEGE-NESS $200 Baylor, Stephen F. Austin, Rice Texas
# what are the most popular answers about U.S. states?
df_states['Answer'].value_counts().head()

California    180
Alaska        161
Hawaii        157
Texas         153
Florida       140
Name: Answer, dtype: int64

# what percentage of questions are about the top states?
df_states['Answer'].value_counts(normalize = True).head() *100

California    4.623684
Alaska        4.135628
Hawaii        4.032880
Texas         3.930131
Florida       3.596198
Name: Answer, dtype: float64

Each of the top three answers appears about 4% of the time. If the states were uniformly sampled, each should appear about 2% of the time (1/50 x 100%). So California, Alaska, and Hawaii are over-represented in the Jeopardy! clues.

The goal of my exploration is to build a model that can predict the correct answer to a Jeopardy! question. I'm going to simplify this a bit more, because 50 states is a lot to work with at once. So, I'll focus on only the top 3 states (California, Alaska, and Hawaii) plus 3 of the states where I've lived (Arizona, Washington, and Texas). Let's see where Washington and Arizona rank in popularity.

df_states['Answer'].value_counts(normalize = True)['Arizona'] *100

2.003596198304649

df_states['Answer'].value_counts(normalize = True)['Washington']*100

2.440277421012073

# build a dataframe with only 6 states
states_6 = ['California', 'Alaska', 'Hawaii', 'Arizona', 'Washington', 'Texas']
df_states_6 = df_states[df_states['Answer'].isin(states_6)].copy(deep= True)
# check the size
df_states_6.shape

(824, 7)

# take a look 
df_states_6.head()
Show_Number Air_Date Round Category Value Question Answer
2 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $200 The city of Yuma in this state has a record average of 4,055 hours of sunshine each year Arizona
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
299 5690 5/8/09 Jeopardy! A STATE OF COLLEGE-NESS $200 Baylor, Stephen F. Austin, Rice Texas
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
477 5243 5/30/07 Jeopardy! STATE SUPERLATIVES $200 A valley at 282 feet below sea level in this state is the lowest point in the Western Hemisphere California
# 10 questions about Washington
df_states_6[df_states_6['Answer']=='Washington'].head(10)
Show_Number Air_Date Round Category Value Question Answer
8 4680 12/31/04 Jeopardy! EVERYBODY TALKS ABOUT IT... $400 In the winter of 1971-72, a record 1,122 inches of snow fell at Rainier Paradise Ranger Station in this state Washington
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
501 5243 5/30/07 Jeopardy! STATE SUPERLATIVES $1,000 Its Boeing manufacturing plant in Everett is the world's largest building by volume Washington
3088 4487 2/24/04 Double Jeopardy! THE PRODUCERS $1,600 It leads the states in apple production Washington
3250 5084 10/19/06 Jeopardy! AMERICAN COUNTIES $1,000 While many states have counties named Lincoln, this is the only state that has one named Snohomish Washington
5099 4842 10/4/05 Jeopardy! YOUNG ABE LINCOLN $400 Among the books read by Lincoln as a youngster were "Robinson Crusoe", "Aesop's Fables", & Mason Weems' "Life of" this man Washington
6787 4306 4/28/03 Double Jeopardy! STATE: THE OBVIOUS $800 Florida's in the southeast corner of the 48 contiguous states; this state is in the northwest corner Washington
9033 3409 6/3/99 Jeopardy! BILL GATES' 50 BILLION $200 In this state where he lives, Bill could pay the governor's salary for 413,000 years Washington
11369 5279 7/19/07 Jeopardy! WASHINGTON, LINCOLN OR GEORGE W. BUSH $200 Was born a British subject Washington
12308 6230 10/21/11 Jeopardy! STATE OF THE NOVEL $200 "Twilight" Washington

Washington: State or President?

Taking a look at questions where "Washington" is the answer shows that a question could be referring to Washington the state or Washington the president. This may play a role in how accurate our model can be on questions about Washington versus questions about Arizona.

Step #4: Feature Engineering

Now that we've decided that we want to predict the correct state name given a question, we have to figure out what sort of features a question has that can inform us about how to pick the correct answer. What patterns emerge for each state? Looking at the previous output showing the clues about Washington, a common word that appears is "Snohomish." Does this word appear in questions about any of the other states?

df_states_6[df_states_6['Question'].str.contains("Snohomish")].head(10)
Show_Number Air_Date Round Category Value Question Answer
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
3250 5084 10/19/06 Jeopardy! AMERICAN COUNTIES $1,000 While many states have counties named Lincoln, this is the only state that has one named Snohomish Washington

Nope, the only two questions containing the word "Snohomish" have "Washington" as their answer. What about "Evergreen"? After all, Washington is the "Evergreen State."

df_states_6[df_states_6['Question'].str.contains("Evergreen")].head(10)
Show_Number Air_Date Round Category Value Question Answer
368 2825 12/6/96 Jeopardy! ANNUAL EVENTS $200 Monroe, near Snohomish in this state, is the site of the annual Evergreen State Fair Washington
106424 3345 3/5/99 Double Jeopardy! STATE LICENSE PLATES $800 301-JRJ Evergreen State Washington
108152 3631 5/22/00 Jeopardy! ASIAN-AMERICAN ACHIEVERS $100 Gary Locke of this "Evergreen State" is the first Chinese-American governor in U.S. history Washington
164029 5122 12/12/06 Jeopardy! STATE THE STATE $600 Shots like the one seen <a href="http://www.j-archive.com/media/2006-12-12_J_13.jpg" target="_blank">here</a> show you why it's the Evergreen State Washington

Yes! Evergreen is a great indicator that a question is about the state of Washington.

Using this technique of finding words that are popular in questions about specific states, we will build a model that counts the words contained in the questions for each state and uses those counts to predict the answer when presented with a never-before-seen question. To test the model, we check its accuracy on questions where we know the answer but the model doesn't. We will train the model on 80% of the data and then test it by asking it to predict the answer on the remaining 20%.

# transform dataframe columns containing question and answers to a list
list_docs = df_states_6['Question'].tolist()
list_labels = df_states_6['Answer'].tolist()
# split the data into a training and test set. 
X_train, X_test, y_train, y_test = train_test_split(list_docs, list_labels, test_size = 0.2, random_state = 42)
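
Before building the model in the next step, here's a quick, optional sketch of what "counting words" produces: CountVectorizer (imported in Step #1) turns each clue into a vector of word counts. The two clues below are taken from the Washington table above; the real counting over the whole training set happens inside the pipeline in Step #5.

# quick illustration of bag-of-words features, using two clues from the Washington table above
toy_clues = ["Monroe, near Snohomish in this state",
             "It leads the states in apple production"]
toy_vect = CountVectorizer()
toy_counts = toy_vect.fit_transform(toy_clues)
print(sorted(toy_vect.vocabulary_))  # the vocabulary learned from the two clues
print(toy_counts.toarray())          # one row of word counts per clue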

Step #5: Building the model

Now, I haven't yet counted all the words in my data set; I'm going to do this at the same time as I build my model. Python's scikit-learn package has a really great tool called Pipeline. It lets you chain feature engineering and modeling together in one step. Here, I'll build a pipeline that uses CountVectorizer and then passes the data to a logistic regression model.

# build the pipeline
pipeline_LR = Pipeline([('vect', CountVectorizer(decode_error='ignore')), ('clf', LogisticRegression())])
# pass the data to the pipeline and build a model using the training data set
pipeline_LR = pipeline_LR.fit(X_train, y_train)
# pass the test data to the pipeline and get the predicted values
y_test_predicted = pipeline_LR.predict(X_test)
# calculate the accuracy of the model using the test data and the predicted value of the test data
accuracy = accuracy_score(y_test, y_test_predicted)
accuracy

0.5818181818181818

Hmmm... the accuracy of the model is 58%. Is that good or bad? Well, I'm trying to predict the answer to a question where there are 6 possible answers (California, Alaska, Hawaii, Arizona, Washington, or Texas). If I were to guess randomly, I would be correct 1/6th of the time, or ~17% of the time. So, an accuracy of 58% is actually pretty good; it's more than 3 times better than randomly guessing.
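
Another useful sanity check (an extra step, not part of the analysis above) is a trivial baseline that always guesses the most common state in the training data. scikit-learn's DummyClassifier, dropped into the same pipeline, gives that number:

# baseline sketch: always predict the most frequent state seen in training
from sklearn.dummy import DummyClassifier
# the vectorizer is ignored by the dummy model, but keeping it makes the pipeline shape identical
pipeline_dummy = Pipeline([('vect', CountVectorizer(decode_error='ignore')), ('clf', DummyClassifier(strategy='most_frequent'))])
pipeline_dummy = pipeline_dummy.fit(X_train, y_train)
accuracy_score(y_test, pipeline_dummy.predict(X_test))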

Now, there's a lot more I could do to improve the model. I could try a few other classification models, like decision trees or a support-vector classifier. I could also adjust the default parameters of the model, or clean up my data a bit more by removing stop words and punctuation. For now, I'm pretty happy with an accuracy that is better than random.
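
As a rough, untuned sketch of one such tweak, here is the same pipeline with English stop words removed in CountVectorizer and a linear support-vector classifier swapped in for logistic regression; whether it actually beats 58% would need to be checked on the test set.

# sketch: remove English stop words and try a linear support-vector classifier
from sklearn.svm import LinearSVC
pipeline_SVC = Pipeline([('vect', CountVectorizer(decode_error='ignore', stop_words='english')), ('clf', LinearSVC())])
pipeline_SVC = pipeline_SVC.fit(X_train, y_train)
accuracy_score(y_test, pipeline_SVC.predict(X_test))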

Step #6: Investigating the model

Where did the model do well and where did it perform poorly? A confusion matrix is a great visual way to answer this question. A good-looking confusion matrix has a strongly defined diagonal: when the true label is California, most of the predicted labels should be California as well.

CM = confusion_matrix(y_test, list(y_test_predicted),labels = states_6)
ax = sns.heatmap(CM, linewidths=.5,  xticklabels = states_6, yticklabels = states_6, cmap="YlGnBu", annot = True, fmt = "d")
plt.ylabel('True label')
plt.xlabel('Predicted label');
Confusion Matrix for Six States

I had a theory that Washington might be tricky to get right because of possible confusion with the president. It looks like I might be right.

Both Arizona and Washington appear in the data set about the same number of times. However, in the test data set, Washington is correctly predicted 6 out of 23 times (i.e., 26% of the time), while Arizona is correctly predicted at a slightly higher rate (5/17, or 29%). Let's take a look at some of the cases where Washington is correctly predicted.

# build data frame containing test questions, answers, and predicted answers.
df_states_6_test = pd.DataFrame(
    data = {"Question":X_test,"Answer_true": y_test,"Answer_predicted": list(y_test_predicted)})
# correctly predicting Washington
df_states_6_test[(df_states_6_test['Answer_true']=='Washington') & 
                        (df_states_6_test['Answer_predicted']=='Washington') ]
Answer_predicted Answer_true Question
58 Washington Washington Joseph J. Ellis' biography of this 18th century man is called "His Excellency"
84 Washington Washington Gary Locke of this "Evergreen State" is the first Chinese-American governor in U.S. history
104 Washington Washington Title destination of Mr. Smith, Billy Jack & The Happy Hooker
124 Washington Washington His widow Martha once gave the O.K. for him to be re-entombed in the U.S. Capitol; a nephew later nixed the idea
135 Washington Washington Edmund Randolph helped draft & ratify the Constitution before becoming this man's Attorney General
140 Washington Washington SAW NOTHING

About half of the questions where Washington is the answer have to do with the president rather than the state (#58, #124, #135). It looks like the algorithm is able to pull out information about Washington whether the question is asking about the state or the president.
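
As a side note, the per-class rates quoted above can be read directly from classification_report, which was imported in Step #1 but hasn't been used yet; one call prints precision and recall for each of the six states.

# per-class precision and recall for the six states
print(classification_report(y_test, y_test_predicted, labels=states_6))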

Conclusion

I hope you enjoyed this post and are inspired to learn more about natural language processing and classification problems. This post can also be found on GitHub as a Jupyter notebook. Some of the inspiration for this post came from this notebook by Emmanuel Ameisen. You can also learn more about pipelines in the Python Data Science Handbook by Jake VanderPlas.

Yellowstone: A Case Study of Data Science

Recently, I visited Yellowstone National Park, which is well known for its geyser basins, lakes, scenic mountains, and variety of wildlife. As a data scientist, I was amazed by the park officials’ ability to predict the time and duration of geyser eruptions. My extreme curiosity drove me to explore how they did it. After reviewing a few articles, I learned about their prediction process and the data science behind it.

I specifically investigated the prediction of the most famous geyser, Old Faithful. While playing with the data, I found, to my surprise, that R ships with a built-in data set of Old Faithful eruptions (the faithful data set). Below is a sample of the data set.

Sample of the Old Faithful (faithful) data set: eruption duration and waiting time

Old Faithful is currently bimodal: it has two typical eruption durations, one lasting more than four minutes and a second, rarer one lasting about two-and-a-half minutes. The objective is to figure out whether a given eruption will be long or short, which is a clear classification problem.

Eruption duration versus waiting time

After plotting the data, I saw a linear relationship between the waiting time and the duration, which implies that as the waiting time increases, the duration of the eruption increases. For this reason, I chose to employ a simple machine learning algorithm: linear regression. Although this is a classification problem, we can apply regression and still obtain accurate results; to convert the predictions into two classes, we enforce a threshold condition on the regression results. Since the data set ships with R, I performed the predictions in R.

Visualization provides a unique perspective and a better understanding of the data set. It is clear from the box plot below that the waiting times range from roughly 40 to 100 minutes. Interpreting and understanding the quartiles, the median, and the range is important here.

Box plot of the waiting times between eruptions

Once the model is chosen, we split the data into training and testing sets. The lm function in R fits the linear model to the data and gives us the estimates, i.e., the values of the slope and the intercept.

y = mx + c

ExpectedDuration = m (waiting time) + c

Now, let’s assume the waiting time between eruptions is 75 minutes. Plugging this into the equation above:

ExpectedDuration = 0.073901 (75) - 1.792739 = 3.749836

Similarly, the duration for any new waiting time would be predicted in the same way. Based on the results, we can conclude that as the waiting time increases, the predicted eruption duration also increases, so the eruption falls into the long-duration class. Conversely, a shorter waiting time puts the predicted eruption into the short-duration class.
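
The analysis above was done in R, but the resulting decision rule is simple enough to sketch in a few lines of Python. The slope and intercept below are the lm estimates quoted above; the three-minute cutoff separating short from long eruptions is an assumption made for illustration, chosen to sit between the two duration modes.

# minimal Python sketch of the prediction rule (not the original R code)
SLOPE = 0.073901       # lm slope estimate quoted above
INTERCEPT = -1.792739  # lm intercept estimate quoted above

def predict_duration(waiting_minutes):
    # predicted eruption duration in minutes for a given waiting time
    return SLOPE * waiting_minutes + INTERCEPT

def classify_eruption(waiting_minutes, cutoff=3.0):
    # assumed cutoff: label the predicted eruption "long" or "short"
    return "long" if predict_duration(waiting_minutes) >= cutoff else "short"

print(predict_duration(75))   # about 3.75 minutes, matching the worked example above
print(classify_eruption(75))  # 'long'
print(classify_eruption(55))  # 'short' (predicted duration about 2.27 minutes)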

In summary, we don't always need a complex machine learning or deep learning algorithm to solve a data science problem. My interest in knowing how the geyser eruptions were predicted led me to play around with the data, and deriving these conclusions was organic and came out of sheer curiosity. A simple linear regression model achieved good results. Applying a complicated machine learning algorithm does not always equate to better results; sometimes a simple algorithm, such as linear regression, is the more effective choice.

The Elements of Great Data Visualizations

Among the effects of the rapid evolution of information technology is the exponential growth in the volume of data related to our daily lives. This presents companies with unprecedented opportunities to more deeply understand their customers, anticipate their needs, and make informed decisions about product innovations and competitive strategy.

Many of the world’s leading brands are making significant investments in data analytics. To reap the maximum return, it’s essential to understand the role of data visualization in helping to identify and articulate data in ways that inspire game-changing insights and better decisions.  

As a business intelligence analyst, I am frequently asked “What does a good data visualization look like?” There is not one simple, correct answer, but here are several fundamental features that characterize visualizations that are appealing, accessible and effective.

 

1. A good visualization reveals problems and inspires solutions.

For exploratory data visualizations, the critical factor is not what is on the diagram, but rather how people respond to it. Here is a classic example.

This map visualization was created by Dr. John Snow in mid-19th-century London during a devastating cholera outbreak. He mapped the 13 public wells and all known cholera deaths around the city and noted a spatial clustering of cases around one particular public drinking well.

This pattern provided evidence that contaminated water was the cause of cholera. With this insight, the London government built a sewage system to control and stop the spread of cholera. In an age of scant business-intelligence tools, Dr. Snow’s visualization effectively revealed a root cause and inspired the right solution.

2. Explanatory visualizations should be straightforward and easy to understand.

Here are two creative examples that I admire:

This is a snapshot of an interactive dashboard created by Siroros Roongdonsai. Clicking on the fish type gives you the sustainable shopping guide on the right side. This particular visualization provides a simple and efficient way to integrate a vast amount of information in a single-page dashboard. The creator ensured that the intended audience would be able to easily understand the visualization.

This visualization describes the daily routines of some famous celebrities by dividing their routines into six categories, marked by six different colors. It makes it easy for readers to spot the patterns of when the celebrities eat, sleep, or work. This visualization method can also be very helpful for conducting customer-level analysis within a defined period.

Today’s business intelligence software provides analysts with powerful options for creating robust visualizations. But the most important component of an effective visualization is the analyst’s desire to facilitate understanding and insight by conveying the data with clarity and relevance.

ProCogia: Who are we?

We’re excited to be kicking off our new blog series, ProBlogia, that will feature contributions by ProCogians on a variety of topics across technology, business, society, and culture.

But let’s start with education, specifically with two of our most Frequently Asked Questions:

How do you pronounce ProCogia?

Phonetically, it’s prəˈkäɡ’ēə or “Pro-cawg-ee-ah.”

What does ProCogia mean?

ProCogia…is that a coastal town in Italy? A French dessert?  A shampoo that increases your IQ?

The closest word in popular culture to ProCogia is “Precog,” short for the precognitives who could see events in the future in “Minority Report,” the 2002 sci-fi hit directed by Steven Spielberg.

“Cog” obviously refers to cognition—conscious mental activities, such as thinking, understanding, learning, and remembering. Our spin on the word replaces “pre” with “pro,” which has several meaningful connotations for a data analytics company, including: affirmation, advance, positive, prior to, favoring, supporting, championing, in favor of, and in front of.

While we don’t claim to be able to divine the future, our mission is to help our clients chart the journey ahead by starting with the clearest possible assessment of the present. And so, the most succinct definition of ProCogia is also our promise:

Higher intelligence. Deeper insights. Smarter Decisions.