Below is a list of projects I've recently completed. Most are data science projects, but some involve mapping software and consulting. They tend to draw on the following skills:
Python, Pandas, Scikit-learn, Machine Learning, Statistical Modeling, Scala, Spark, TSQL, HTML, CSS, Quick Base, MS Excel (VBA), Project Management, Research Design, Google Adwords, Google Analytics, Zapier, GoFormz, Workato, MS Suite
FEATURED
UN General Assembly: A Cluster Analysis
Despite all the recent bellicose criticism of the United Nations (some of it deserved), I happen to think the United Nations is one of the world's most important institutions. In this project I wanted to gain some insight into the General Assembly's voting patterns for each year from its inception in 1946 through 2017.
I use HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise) to identify clusters of nation-states for each year. For the same period I use Principal Component Analysis (PCA) to visualize the clusters in two dimensions. In this case PCA component 1 carries over 100 times the weight of component 2, which lets me identify the resolutions responsible for the greatest variance. The end result is a slider widget for selecting a year; it outputs a cluster visualization, silhouette, a list of the countries in each cluster, and descriptions and classifications of the resolutions with the greatest impact on clustering.
Right now it's just the presentation, but stay tuned for the following additions:
- Interactive notebook online (so you can play with the slider)
- Geographic visualization on world map
- Natural language processing to identify categories of resolutions
- More and more analysis of other data sets connected to yearly clusters
Will my Reddit post go viral?
I'm guessing interesting content is the main predictor. That aside, it looks like we can predict which posts will be popular to a surprisingly high degree. In this project I look at how we can use the "subreddit" and the words in a post's title to predict "hotness" (that's the scientific term).
I look at whether the words "Dog" and "Cat" affect hotness, then check the importance of all the other words in the title. We'll look at how useful this is in a K-nearest neighbors model and then compare its performance to a random forest model.
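To give a flavor of the model comparison, here is a minimal sketch and not the project's actual pipeline. It assumes "hotness" is a binary label and that each post has a `title` and a `subreddit`; the twelve posts, the column names, and the hyperparameters are all invented for illustration. Title words become bag-of-words features, the subreddit is one-hot encoded, and both models are scored with cross-validation.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Invented toy data; a real run would use scraped Reddit posts.
posts = pd.DataFrame({
    "title": [
        "My dog learned a new trick", "Cat knocks over the lamp again",
        "Dog meets cat for the first time", "This cat loves the snow",
        "Rescue dog finds a home", "Sleepy cat on the keyboard",
        "Question about tax forms", "Weekly discussion thread",
        "Need help with my router", "Looking for a new keyboard",
        "Update on the road closure", "Which tax software is best",
    ],
    "subreddit": ["aww"] * 6 + ["personalfinance", "news", "techsupport",
                                "techsupport", "news", "personalfinance"],
    "hot": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
})

# Bag-of-words for the title, one-hot encoding for the subreddit.
features = ColumnTransformer([
    ("title_words", CountVectorizer(binary=True), "title"),
    ("subreddit", OneHotEncoder(handle_unknown="ignore"), ["subreddit"]),
])

models = {
    "knn": KNeighborsClassifier(n_neighbors=3),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

X = posts[["title", "subreddit"]]
y = posts["hot"]

# Mean cross-validated accuracy for each model on the same features.
scores = {
    name: cross_val_score(make_pipeline(features, model), X, y, cv=3).mean()
    for name, model in models.items()
}
print(scores)
```

With data this small the scores mean little; the point is only that both models share one feature pipeline, so any difference in accuracy comes from the model and not the preprocessing.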