Diagnosing Diabetes using Machine Learning

Diabetes is a chronic disease that occurs when a patient’s blood glucose is higher than the normal level. Elevated glucose can be caused by defective insulin secretion, impaired insulin action, or both. Projections suggest that by 2040 over 600 million patients will be diagnosed with diabetes, meaning roughly one in ten adults will suffer from the disease. There is no doubt that this alarming figure deserves great attention. With the rapid development of machine learning, let’s see how we can apply modern techniques to diagnose diabetes.

I will use a real-world dataset on the Pima Indian population, provided by Kaggle. Let’s see how we can apply a classification algorithm to identify whether or not a person is diabetic.

The dataset consists of 8 features measured for 332 patients. Below is a description of each feature:

Diabetes Table 1.PNG

I start by splitting the data into two distinct subsets: one I will use to train my models and the other to test their accuracy. I choose a training-to-testing ratio of 80:20, a common practice when training machine learning models. Here is some R code for reading in my data and creating the training and testing sets:

Diabetes Code 1.PNG
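The splitting logic itself is language-agnostic; for readers coming from Python, here is a minimal NumPy sketch of the same 80:20 split (the feature matrix below is a synthetic stand-in, not the actual Pima file):

```python
import numpy as np

def split_train_test(X, y, test_ratio=0.2, seed=42):
    """Shuffle row indices, then carve off test_ratio of rows for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(len(X) * test_ratio)
    return X[idx[n_test:]], X[idx[:n_test]], y[idx[n_test:]], y[idx[:n_test]]

rng = np.random.default_rng(0)
X = rng.random((332, 8))          # synthetic stand-in for the 8 Pima features
y = rng.integers(0, 2, size=332)  # 0 = non-diabetic, 1 = diabetic
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 266 66
```

Fixing the seed makes the split reproducible, the same role `set.seed()` plays in R.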

After the data is split, let’s train a logistic regression classifier using the training data.

With the model trained, let’s also look at the summary using the following code:

Diabetes Code 2.PNG

This summary shows a lot of information that is helpful for understanding the internal workings of a logistic regression model.

Let’s break it down into the details for a better understanding:

The coefficients, in a nutshell, list all the features and signify how important they are. The stars (*) on the extreme right of each feature show its statistical significance. For example, the p-value for the feature glu is very close to 0. Naively, we can say that glu, ped, and bmi are, in order, the three most important features.
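The pattern those stars formalize can be seen in a from-scratch sketch: fit a logistic regression by gradient ascent on synthetic data where only one feature actually matters, and the informative coefficient dominates the rest (all names and effect sizes below are invented for illustration, not the Pima fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 patients, 3 features; only the first drives the outcome
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, 0.0, 0.0])
y = (1 / (1 + np.exp(-(X @ true_beta))) > rng.random(200)).astype(float)

# Maximize the log-likelihood by plain gradient ascent
beta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-(X @ beta)))
    beta += 0.1 * X.T @ (y - p) / len(y)

for name, b in zip(["glu-like", "noise1", "noise2"], beta):
    print(f"{name}: {b:+.2f}")   # the first coefficient dwarfs the others
```

R’s summary goes further and attaches a standard error and p-value to each coefficient, which is what the significance stars condense.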

Let’s check the accuracy of this model by constructing a confusion matrix from the testing data. We can see that 51 cases were correctly classified as non-diabetic, while 15 cases were correctly classified as diabetic. This gives an accuracy of 78%, which isn’t too bad for a first attempt. It is concerning, however, that nearly as many non-diabetic patients would be incorrectly diagnosed as diabetic under this model, a false positive rate that would be unacceptable in practice.

Diabetes Code 3.PNG
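The accuracy number comes straight from the confusion matrix; here is a minimal Python sketch of that computation (the prediction vectors are made up for illustration, not the real test-set results):

```python
import numpy as np

def confusion_matrix(actual, predicted):
    """2x2 matrix: rows = actual class, columns = predicted class."""
    m = np.zeros((2, 2), dtype=int)
    for a, p in zip(actual, predicted):
        m[a, p] += 1
    return m

# Illustrative vectors only
actual    = np.array([0, 0, 1, 1, 1, 0])
predicted = np.array([0, 1, 1, 1, 0, 0])
cm = confusion_matrix(actual, predicted)
accuracy = np.trace(cm) / cm.sum()   # correct predictions sit on the diagonal
print(cm)
print(accuracy)
```

Accuracy is just the diagonal of the matrix divided by the total count, which is why it can hide a bad false-positive rate.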

Model Optimization:

Model optimization is an important aspect of choosing the best parameters for machine learning models. At ProCogia, we commonly spend a significant amount of time and effort ensuring that model parameters are appropriately tuned. Since every case is different, each presents a unique set of challenges during optimization.

Before we begin, I should define the concepts of Null deviance, Residual deviance and AIC:

Null Deviance: Null deviance shows how well the response variable is predicted by a model that includes only the intercept.

Residual Deviance: Residual deviance shows how well the response variable is predicted with the inclusion of the independent variables. The lower the residual deviance, the better the fit. We need to make sure that the residual deviance does not increase while performing model optimization.

AIC: The Akaike information criterion (AIC) is an estimator of the relative quality of statistical models for a given set of data. This should decrease while performing optimization.
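For concreteness, AIC is computed directly from a model’s maximized log-likelihood: AIC = 2k − 2·ln(L̂), where k is the number of parameters. A tiny sketch (the log-likelihood values below are invented):

```python
def aic(log_likelihood, k):
    # AIC = 2k - 2*ln(L_hat), where k is the number of parameters
    return 2 * k - 2 * log_likelihood

# Two hypothetical models with identical fit but different sizes:
# the smaller model gets the lower (better) AIC
print(aic(-50.0, 9))   # 118.0
print(aic(-50.0, 7))   # 114.0
```

This is why AIC rewards removing a feature that contributes nothing: the log-likelihood barely changes while k drops by one.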

From the summary of the model, we concluded that there are three important features: glu, bmi, and ped. It is possible that one or more of the other features is hurting our model. That said, we cannot simply remove the unimportant features outright. We have to check whether the residual deviance and AIC are impacted by removing a feature: if their values increase, we should not remove the feature from the model.

Also, for reference, let’s keep a note of our current model’s metrics.

Diabetes Code 4.PNG

Let’s start by removing the unimportant features and making sure that the Residual Deviance and AIC do not increase.

Removing Age:

Diabetes Code 5.PNG
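The same check can be sketched as a loop in Python: drop each feature in turn, refit, and compare AIC. Everything below (the data, the feature names, and the bare-bones fitting routine) is an illustrative stand-in for the R workflow, not the actual Pima model:

```python
import numpy as np

def fit_loglik(X, y, steps=3000, lr=0.1):
    """Fit logistic regression by gradient ascent; return the max log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ beta)))
        beta += lr * X.T @ (y - p) / len(y)
    p = np.clip(1 / (1 + np.exp(-(X @ beta))), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def aic(loglik, k):
    return 2 * k - 2 * loglik

rng = np.random.default_rng(1)
names = ["glu", "bmi", "ped", "age", "skin"]          # stand-in labels
X = rng.normal(size=(300, 5))
# Only the first two columns actually drive the synthetic outcome
y = (1 / (1 + np.exp(-(2 * X[:, 0] + X[:, 1]))) > rng.random(300)).astype(float)

full = aic(fit_loglik(X, y), X.shape[1])
for j, name in enumerate(names):
    reduced = aic(fit_loglik(np.delete(X, j, axis=1), y), X.shape[1] - 1)
    verdict = "keep" if reduced > full else "drop candidate"
    print(f"without {name}: AIC {reduced:.1f} vs full {full:.1f} -> {verdict}")
```

Dropping an informative column sends AIC sharply up; dropping a noise column typically nudges it down, which is exactly the decision rule used in the post.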

Performing the same process for the other three features, we find that the only other feature that can be removed is skin. Therefore, a new model was trained without the age and skin features. Let’s check if the accuracy has increased from 78%:

Diabetes Code 6.PNG

It has, and we were able to correctly classify three patients who were previously showing up as false positives. In this way, we can test and tune any other parameter of our model.

In conclusion, machine learning methods are widely used in predicting diabetes, and early detection of diabetes is crucial. Model optimization is an essential part of accurately predicting diabetes; here we improved the accuracy from 78% to 82%. As data scientists, we must constantly look for ways to refine our models to obtain the most accurate outcomes.

Who is Going to Leave Your Team?

Employee attrition means a loss of valuable staff and talent, and a high turnover rate is costly. An organization invests time and money in recruiting and training employees, and departing employees take all the skills and qualifications they developed with them when they leave. The organization bears the costs of termination and of the hiring process, and the remaining employees need to cover the workload of the unoccupied position. There are various reasons for employees to leave an organization. We can analyze employee HR data to estimate their reasons for leaving and help organizations retain valuable employees.

IBM provides SAMPLE DATA: HR Employee Attrition and performance data on their community webpage. This is fictional sample data, but it can give us good ideas on how to analyze HR data and build classification models to predict attrition. This post will show you the step-by-step analytics process in R.

Exploratory Data Analysis

Let’s start by importing the dataset into an R dataframe and exploring the data’s shape and the attrition distribution.

   df <- read.csv("../data/WA_Fn-UseC_-HR-Employee-Attrition.csv")




HR Attrition 1.png
HR Attrition 2.png

The data have 34 HR features plus the attrition label. About a fifth of the employees left the company.


It is important to have a good understanding of the dataset. Building a machine learning model in R is not difficult; it can take only a couple of lines of code. However, tuning the model to achieve your goals requires preprocessing the dataset, and a good understanding of the data helps us choose a proper preprocessing method. Missing values need to be taken care of, imbalanced data sometimes needs to be balanced, and highly correlated features need to be removed.

The code below organizes information about the data in one place: the number of missing values in each column, means, standard deviations, percentiles, and distributions. This minimizes the time we spend exploring missing values and plotting histograms.

library(dplyr)   # provides %>%
library(skimr)   # skim()
library(knitr)   # kable()

df %>% skim() %>% kable()
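skimr’s output has a rough analogue in pandas, for readers coming from Python (the tiny DataFrame below is invented for illustration):

```python
import pandas as pd

# A tiny made-up slice of HR-style data
df = pd.DataFrame({
    "Age": [41, 49, 37, 33, 27],
    "Attrition": ["Yes", "No", "Yes", "No", "No"],
    "MonthlyIncome": [5993, 5130, 2090, 2909, 3468],
})

summary = df.describe(include="all").T   # counts, means, percentiles, ...
summary["n_missing"] = df.isna().sum()   # missing values per column
print(summary[["count", "n_missing"]])
```

One table per dataset, rather than one `summary()` call per question, is the point of this style of profiling.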


The code below shows the correlation plot for the numerical features. Viewing the correlations helps us understand multicollinearity. The values range from -1 to 1, with positive values indicating a direct relationship and negative values an inverse relationship; the closer to 1 or -1, the stronger the correlation between features. The correlation plot allows us to find correlated features quickly. For instance, we see that ‘Job level’ is highly correlated with ‘Monthly income’ and ‘Total working years’, and ‘Percent salary hike’ is correlated with ‘Performance rating’. You can go through the plot and decide which features to include and which to minimize or eliminate.

   library(corrplot)

   num_fea <- which(sapply(df, is.numeric))
   corrplot(cor(df[num_fea]), type = "upper")

HR Attrition 4.png
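The same idea can be sketched in Python: build a correlation matrix and flag pairs above a threshold. The three columns below are invented stand-ins for HR features, with one pair deliberately correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
# Hypothetical stand-ins: job level drives monthly income; the third column is noise
job_level = rng.integers(1, 6, size=n).astype(float)
monthly_income = 2000 * job_level + rng.normal(0, 500, size=n)
distance = rng.normal(size=n)

data = np.column_stack([job_level, monthly_income, distance])
corr = np.corrcoef(data, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds 0.8
high = [(i, j) for i in range(3) for j in range(i + 1, 3) if abs(corr[i, j]) > 0.8]
print(high)   # the (job_level, monthly_income) pair
```

A list like `high` is the programmatic version of eyeballing the corrplot for dark dots.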

Machine Learning Modeling

Now we are ready to move on to model building. We randomly select 70% of the data to train the model, and the rest will be used to validate it. A random forest classification model will predict the attrition. Accuracy, the confusion matrix, and the AUC (area under the curve) score are chosen as metrics to evaluate the model.

   library(randomForest)
   library(caret)   # confusionMatrix()
   library(pROC)    # roc()

   set.seed(42)     # make the random split reproducible
   idx <- sample(nrow(df), nrow(df) * .70)
   train <- df[idx,]
   test <- df[-idx,]

   rf <- randomForest(Attrition ~ ., train, importance = TRUE, ntree = 1000)
   prd <- predict(rf, newdata = test)
   confusionMatrix(test$Attrition, prd)
   rocgbm <- roc(as.numeric(test$Attrition), as.numeric(prd))


HR Attrition 5.png

The accuracy of the model is 0.83, which seems good; however, the confusion matrix indicates many false positives, reflected in the low specificity compared to the high sensitivity. The data is imbalanced: as we found, less than 20% of people left the company. Even though the accuracy of the model is high, the AUC score is as low as 0.52. We can balance the data to avoid this bias. There are multiple ways to balance data; we will use SMOTE for majority down-sampling and minority over-sampling, and then retrain the random forest model.
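To see why accuracy alone misleads on imbalanced data, consider a hypothetical model that predicts “stays” for nearly everyone (all counts below are invented):

```python
# Made-up confusion-matrix counts for 100 employees, 15 of whom actually left.
# "Positive" here means the employee left.
tp, fn = 1, 14    # leavers: 1 caught, 14 missed
tn, fp = 84, 1    # stayers: almost all correctly labelled

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # looks healthy
sensitivity = tp / (tp + fn)                   # fraction of leavers caught
specificity = tn / (tn + fp)                   # fraction of stayers kept
print(accuracy, round(sensitivity, 3), round(specificity, 3))  # 0.85 0.067 0.988
```

An 85% accurate model that catches one leaver in fifteen is useless for retention, which is exactly what the low AUC exposes. (Note that caret labels its classes the other way around, treating “No” as the positive class, so its sensitivity/specificity are swapped relative to this convention.)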

   library(DMwR)   # provides SMOTE()

   df_balanced <- SMOTE(Attrition ~ ., df, perc.over = 200, k = 5, perc.under = 100)
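Under the hood, SMOTE creates synthetic minority rows by interpolating between a minority sample and one of its k nearest neighbours. A stripped-down NumPy sketch of that core idea (not the full DMwR implementation, which also down-samples the majority class):

```python
import numpy as np

def smote_like(minority, n_new, k=5, seed=0):
    """Create synthetic minority samples by interpolating between a point
    and one of its k nearest neighbours (the core idea of SMOTE)."""
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        d = np.linalg.norm(minority - minority[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        gap = rng.random()                    # position along the segment
        out.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(out)

minority = np.random.default_rng(1).normal(size=(20, 4))  # 20 fake "leavers"
synthetic = smote_like(minority, n_new=40)
print(synthetic.shape)   # 40 new rows, same 4 features
```

Because each synthetic row lies on a segment between two real minority rows, the new samples stay inside the minority class’s region of feature space rather than being naive duplicates.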

HR Attrition 6.png

   set.seed(42)
   idx <- sample(nrow(df_balanced), nrow(df_balanced) * .70)
   train <- df_balanced[idx,]
   test <- df_balanced[-idx,]

   rf <- randomForest(Attrition ~ ., train, importance = TRUE, ntree = 1000)
   prd <- predict(rf, newdata = test)
   confusionMatrix(test$Attrition, prd)
   rocgbm <- roc(as.numeric(test$Attrition), as.numeric(prd))


HR Attrition 7.png

Post Analysis

We can see that balancing the data led to an improvement in our model. The confusion matrix shows that false negatives increased slightly, but false positives decreased significantly. The accuracy of our model increased by 7%, and the AUC score increased from 0.5288 to 0.9002. The model’s sensitivity is similar, but specificity increased from 0.625 to 0.9417. Now that we are satisfied with our model, we can investigate which features are important for predicting attrition. Our random forest model provides variable importance through the output below. Features toward the top are more important, and a rough measure of each feature’s relative importance is provided. The most important feature is “stock option level”, and we can now analyze its distribution for remaining employees versus employees that left to understand how awarding stock options can help retain employees.


HR Attrition 8.png


Individual feature analysis related to attrition can be done without building a machine learning model. You can plot each feature against attrition and find their relationship. However, if there are 1,000 features, it is difficult to go through the whole feature set to analyze each one. A machine learning model can give us an enhanced understanding of the data. At ProCogia, our data scientists help our clients enhance their data understanding by deploying state-of-the-art machine learning models.

Cascadia R Conference 2019

The third annual Cascadia R Conference came to the Seattle area for the first time this year. It was a wonderful opportunity for those who love to use R to meet others like them and listen to thought-provoking presentations from R users across all industries. Hosted in Bellevue, the space was intimate and provided a welcoming environment for the R professionals who came for the conference that Saturday morning.

Our team had the privilege of being part of the organizing committee, and ProCogia was the leading sponsor for the conference! As a team, we have always been focused on our local community and what we can do to be engaged around our tech hub, Seattle. This was another incredible opportunity for us to be involved in, and we loved every minute of it! ProCogia hosted two trainings the day before the conference - one in Basic R and the second in R Shiny.

Raphael Gottardo - the head of the Translational Data Science Integrated Research Center at Fred Hutchinson Cancer Research Center - kicked off the conference by sharing how Fred Hutch is changing the world of cancer and disease research using data science. David Smith, Microsoft’s Cloud Developer Advocate, came up after Raphael and shared a personal note: thanks to the research Fred Hutch has done, his partner will survive a cancer diagnosis. There was not a soul in the room that was not moved by the personal impact that data science had on someone standing directly in front of them.

The conference then split into two tracks, Community and Applications of R, each with several excellent speakers who were well-versed in their topics. I sat captivated as each speaker came forward to present. The lunch was intentionally longer than most people were used to, providing plenty of time for people to meet and get to know each other – one of the main focuses of the R community. As part of both the committee and the leading sponsor, I hope that the people who came to the conference created lasting bonds that will support both their personal lives and professional careers. The day continued with two more track options, Machine Learning and R in Production – it was extremely difficult to choose because the lineup was just too good in both sessions. Thankfully, we did not have to choose during the lightning talks. Everyone came back together to listen to quick 5-minute presentations (which everyone managed to stay within somehow – WOW – I know I would struggle with that) from another group of excellent presenters with the most interesting information. If you would like to see the speakers, their topics, and what you missed out on, you can follow this link.

The conference closed with the one and only Gabriela de Queiroz speaking on how R-Ladies began just a few short years ago and how its close-knit groups have grown exponentially across the globe, with chapters opening in cities everywhere – surpassing the wildest expectations. She inspired each of us to pursue the dreams that seem out of reach. As a woman in the data science field, I often find myself looking for other women I can connect with. Hearing Gabriela discuss her personal struggle as a college student to meet other women in data science, I was encouraged that there are others out there, no matter how it may look on a day-to-day basis.

David Smith wrapped up the conference with, “I came for the language and stayed for the community,” and I could not have agreed more. The day was filled with people from all walks of life getting together. The R community is truly one of the most welcoming communities I have known, and from everything I have heard, this has been an intentional effort by leaders of the R community.

As I wrap up this blog post, I want to say THANK YOU. First of all to the incredible R community and each member that comprises the community. Secondly, to the leaders of the R community that have developed a safe and welcoming place for people to be who they are and excel in a pretty great programming language! The conference could not have come together without the committee members putting in time outside of their full-time jobs to put this on. Another big THANK YOU to the sponsors who made this conference possible and each person that showed up to attend this conference. And last but most certainly not least, the speakers that spent time putting together and presenting their information. It takes a lot of work to put together great, relevant content and present it to a group of strangers.  

The third annual Cascadia R Conference was oh so memorable. Thank you again to all the great people that made this conference possible. We are looking forward to the conference next year in Eugene, Oregon! You can see more details from the conference and follow the upcoming conference at www.cascadiarconf.com!

If you are looking for ways to be involved in the local community, feel free to reach out to us. ProCogia offers data science consulting services and on-site training for your team. Send us an email at outreach@procogia.com for more information on these services.

Agile Methodologies in Data Science

What problem were we trying to solve?

At ProCogia, our goal is to provide the highest level of data science support we possibly can. With our larger clients, we operate as long-term support for several data science initiatives. This type of engagement requires a greater level of project management than single project or short-term engagements. On one of our current client sites, we maintain a staff of five data scientists who support many of their engineers. Before long we identified several inefficiencies that hindered our ability to develop the high-quality data science solutions that we pride ourselves on producing. The two main pain points we identified were a lack of clarity in the project requirements, and a propensity for projects to extend far beyond their initial expected duration. To address these problems, organize our efforts, and maximize our output, we adopted an agile project management process that we customized to meet the needs of our data science team.


What is Agile:

The original Manifesto for Agile Software Development was an effort by 17 veteran software developers who identified a few key features of successful development teams. Its primary values are:


Individuals and Interactions over processes and tools

Working Software over comprehensive documentation

Customer Collaboration over contract negotiation

Responding to Change over following a plan


Additionally, the manifesto identifies 12 principles that further define best practices for software development. As data scientists, we recognized that not all of these principles would apply to our work, but many of them do, and we set out to organize our teams and projects in an agile way.


Our Experience:

As consultants, we often find ourselves taking on projects from our business clients that are not fully defined. After all, that’s part of the fun of data science, the science of it: form a hypothesis, design and carry out an experiment to test it, evaluate the results, and iterate from there. We use a scrum process to manage our team and ensure that maximum value is delivered to our clients. Our process follows the standard scrum framework shown below.

Johns Q1 2019 Blog.png

Within this framework, we develop our projects in two-week cycles called sprints.

By holding ourselves to two-week cycles and communicating our work to our clients during this time, we provide a much-needed lens into our development lifecycle. It is all too easy for clients on the business end to neglect to account for the difficulties we may face in gathering, cleaning, and pre-processing their data. Instead of our team working in a silo for two months before delivering what we think the client wants, we deliver smaller pieces more often to allow them time to digest our work and provide feedback along the way. The greatest benefit we have noticed after implementing this framework has been our ability to iterate more quickly and reach a solution that meets our client’s needs.


We began our agile implementation four months ago and although the improvements have been evident, some aspects of the scrum process have been difficult for our team to enact. We are currently focusing on the sprint retrospective phase by gathering input from our team, as well as from our clients, to tweak our process. One important change we have made is in the project estimation that we do during our sprint planning. We set out to engage in a team wide level of effort estimation at the beginning of each sprint cycle. What we learned after a few tries was that our breadth of active projects meant that usually only one team member was fully aware of the overall complexities of a project to accurately estimate the level of effort. This meant that the rest of the team was providing little input to the estimation and our time spent estimating was not useful. Now, we allow individual contributors to estimate their own level of effort and reserve the time for team estimation for new projects coming into our backlog.


Overall, our implementation of agile management processes has made positive change in our team’s level of communication and productivity. This adoption has streamlined our project intake process and eliminated much of the uncertainty we used to deal with when taking on complex data science projects for our clients. Reducing the unknowns helps us reach a solution faster and improving communication helps us adapt to changing requirements. The most important lesson we have learned though is that there is always room to improve. As ProCogia grows and takes on new and varied clients, the only way to keep up with the complexities is to be agile.

Domopalooza 2019

Domo has found a way to connect all the data that you could possibly want access to and made it into a user-friendly platform that translates the data into visualizations that anyone in your company can understand. Domo has made their platform incredibly user-friendly. If you are technologically challenged, you will still be able to create incredible visualizations. They also allow the platform to be customized based on client needs where a partner like ProCogia can come in and build intricate connectors for Domo clients. The capabilities of this platform are astounding.

Domopalooza was an action-packed week: all the knowledge your head could hold, shared at general sessions, break-out sessions, and the Genius Lab; an abundance of delicious food; and activities such as evening concerts and, on the last day, skiing at world-renowned Snowbird.

The Power of the Platform was the theme of the conference. Domo made some big announcements at the conference like new data science, AI and machine learning tiles that will expand Domo to become a more capable all-in-one platform. They also made an announcement of their Domo for Good program that will support non-profits to use the platform to make decisions that will directly affect communities and those in need.

Derek White explained how Domo has changed the game for BBVA by allowing them to see data in real-time which directly affects their business. One after another, you could hear stories from speakers explaining how Domo has affected their work and their company. The recurring theme was that Domo made it simple for companies to understand the information that their data was telling them. This led to collaboration between various departments based on real-time data. They were able to make informed decisions based on the data that they already had available.

Domo wrote a quick recap post that you can access here for more information on Domopalooza.

We came to Domopalooza as a sponsor and spent most of our time near the ProCogia booth talking with anyone that came by who wanted to know more about what we do. We are one of the few partners that are capable of implementing the data science capabilities within Domo. We were pleasantly surprised to see how many people from across the globe made it a priority to be at Domopalooza this year. Domopalooza was the perfect meeting place for everyone to come together, find support in their implementation, and learn about the new capabilities that Domo is releasing!

ProCogia recently partnered with Domo because we have seen what their platform is capable of and we want to be a part of it. ProCogia is also certified in Domo, and we can hit the ground running to implement any part of Domo for companies.

Please reach out to us at outreach@procogia.com or give us a call at 425-624-7532 if you want information on how we can help your company succeed.


ProCogia at RStudio::conf 2019

As a designated enterprise partner and cherished sponsor of RStudio’s annual conference, ProCogia brought a strong contingent of Data Scientists and Sales executives to this year’s conference in Austin. It was an honor to witness the tremendous growth and popularity of this conference, which saw over 1700 participants this year. The 2-day main conference was preceded by 2 additional days of workshops and certifications. Some of our own ProCogia data scientists got certified in implementing RStudio products at an enterprise level.

It would be a mountainous job to recap the entirety of the conference, which had multiple tracks and themes, but here is a humble recap of what became the highlights of this year:

R is production ready

This was the dominant theme of this year’s edition, and we could not agree more. RStudio’s CTO and inventor of Shiny, Joe Cheng, kicked off the conference with his Shiny in Production keynote. He led the audience through a journey of how Shiny started as a data scientist’s visualization tool and gradually became a production-ready product that can be used at an organizational level. There were some obvious technical and cultural challenges: software engineers’ reluctance to use Shiny apps because they were built from a data scientist’s perspective, and broader pushback from organizational IT regarding scaling Shiny apps. He provided plenty of examples to emphasize how Shiny has evolved to become more robust and ready for widespread use. The conference also provided introductions to RStudio’s emerging products, RStudio Package Manager and RStudio Connect: Past, present, and future, which can be used to create an enterprise-level data science environment (something ProCogia can help with).

Sharing is caring

David Robinson from DataCamp presented his keynote around The Unreasonable Effectiveness of Public Work, emphasizing how important it is for a data scientist to evangelize their work within their organization and the broader community. Data science doesn’t operate in a silo and neither should a data scientist. He suggested using social media, authoring blogs and articles, even writing a book (with practice and experience) to engage with the community. We, at ProCogia, also believe in giving back to the local R programming community and do so through organizing Bellevue’s useR group meetup every other month in our office location.

Learning, Learning, Learning!

Keeping an open mind around learning new concepts and techniques is key to professional growth. Noted education researcher Felienne’s keynote was about understanding how code can be taught in a very non-trivial manner, drawing on her research into teaching programming to school kids. It provided a very interesting way to look at programming in general and at how you can master it using a practice-based approach.

Other notable talks that caught our attention were:

Of course, this blog post can’t cover all the talks and workshops from the conference, but hopefully it provides enough of a summary for you to get excited about R and RStudio! Please feel free to reach out to us at outreach@procogia.com if you want to bring RStudio products to your enterprise, give a talk at our Seattle meetup, or are just inquisitive about R in general.

Stay tuned for our coverage of the next conference from San Francisco!

Image Processing in Python

Deep learning is a widely used technique that is renowned for its high accuracy. It can be used in various fields such as regression/classification, image processing, and natural language processing. The downside of deep learning is that it requires a large amount of data and high computational power to tune the parameters. Thus, it may not be an effective method for solving simple problems or building models on a small data set. One way to take advantage of the power of deep learning on a small data set is to use pre-trained models built by companies or research groups. VGG16, which is used in this post, was developed and trained by Oxford’s Visual Geometry Group (VGG) to classify images into 1000 categories. Besides classification, VGG16 can also be used for other image processing applications by changing the last layer of the model.

I have a 2-month-old daughter, so my wife and I had to prepare some clothes, accessories, and furniture before her arrival. My wife asked me to build a model to find onesies similar to the ones she wanted to buy, so I decided to utilize a pre-trained deep learning model to help her find similar onesies.

To explain how a pre-trained deep learning model can be used in this situation, I collected a total of 20 images from the internet: 10 short-sleeve baby onesies, eight other baby clothes, one adult t-shirt, and one pair of baby shoes. I included the other baby clothes to validate whether the model can distinguish them, and the images of the t-shirt and shoes are included as outliers. This makes a great example of how pre-trained models can be utilized on a small data set.


Keras is a neural network library written in Python that runs TensorFlow as its backend. The code below will guide you through the details of how to utilize VGG16 using Keras in Python.

from keras.applications.vgg16 import VGG16, preprocess_input
from keras.preprocessing import image
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from IPython.display import Image, display     

VGG16 requires the input image size to be 224 by 224. The function below pre-processes the images and converts them into arrays.

def preprocess(file):
    img = image.load_img(file, target_size=(224, 224))
    feature = image.img_to_array(img)
    feature = np.expand_dims(feature, axis=0)   # add the batch dimension
    feature = preprocess_input(feature)         # VGG16-specific scaling
    return feature[0]

imgs = [preprocess('C:/Users/pc-procogia/Desktop/ProBlogia/'+str(i)+'.jpg') for i in range(1,21)]
X_pics = np.array(imgs)     

The last layer of a neural network determines the output of the model. VGG16 is primarily built for classification, so the last layer needs to be removed to extract raw image features from the data set. In this post, no parameters will be trained; only the pre-trained layers will be used.

from keras.models import Model

def feature_extraction(images):
    base_model = VGG16(weights='imagenet', include_top=True, input_shape=(224, 224, 3))

    for layer in base_model.layers:
        layer.trainable = False   # use the pre-trained weights as-is

    # Drop the final 1000-way classification layer and output the
    # 4096-dimensional fully connected features instead
    model = Model(inputs=base_model.input, outputs=base_model.layers[-2].output)
    model.summary()

    return model.predict(images)

pic_features = feature_extraction(X_pics)


As you can see from the results above, fc1 in the red box indicates that a total of 4096 features were extracted from each image, and the number of trainable parameters is zero since none of them were trained. The number of non-trainable parameters is more than 100 million, which means it would take a long time to tune them all on a local machine. Because not everyone has access to a high-powered computing system, a pre-trained model comes in handy here.

Now all features have been extracted from each baby clothing image. Cosine similarity is a good metric for finding similar clothes because all the images are now represented as vectors. The code below will calculate the cosine similarities and show the most similar clothes.

dists = cosine_similarity(pic_features)
dists = pd.DataFrame(dists)
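Cosine similarity compares the angle between two feature vectors rather than their magnitude; a from-scratch version of what `cosine_similarity` computes for each pair of images:

```python
import numpy as np

def cosine_sim(a, b):
    """cos(theta) = a.b / (|a| * |b|)"""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0, 1.0])
print(cosine_sim(a, a))                           # identical direction -> 1.0
print(cosine_sim(a, np.array([0.0, 1.0, 0.0])))   # orthogonal -> 0.0
print(cosine_sim(a, 2 * a))                       # scale-invariant -> still 1.0
```

The scale invariance is why it works well here: two onesies photographed at different sizes or brightness levels can still produce feature vectors pointing in nearly the same direction.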

def get_similar(dists):
    for item in range(len(dists)):
        # image indices ranked by similarity to `item`, most similar first
        ranked = [i[0] for i in sorted(enumerate(dists[item]), key=lambda x: x[1], reverse=True)]
        print('=== Your Favorite ===', item)
        display(Image(filename=str(item + 1) + '.jpg', width=200, height=200))
        print('--- Similar Clothes ---')
        for i in ranked[1:6]:          # skip the image itself
            display(Image(filename=str(i + 1) + '.jpg', width=200, height=200))

get_similar(dists)

If you use the image of a onesie as the input, the following five images are the similar clothes that the deep learning model chose. The model successfully picked the short-sleeve onesies for us.


In the case below, the search results include the onesies but also an adult t-shirt. It seems that color was one of the features that caused the adult t-shirt to appear in the results.


The least similar image to the short-sleeve onesies was the baby shoes. The shape of the shoes is obviously different from the onesies. The simple feature extraction was able to distinguish them even though the model was not specifically tuned for this case.


To increase model accuracy or to focus on certain features of clothes, such as color, shape, or pattern, you can include those features in the model and train them on top of the pre-trained model. Then you do not have to train 100 million parameters; you can train just the additional layers your local machine can handle and still achieve good results.

Learning how to add additional features to the pre-trained model will be an interesting topic for the next post. I hope you enjoyed this post on deep learning and feel motivated to start your own projects using pre-trained models.