Hi, I’m Nick!
Data Scientist

I am a data scientist who studies applied machine learning and artificial intelligence. I am in the process of finishing my doctoral dissertation at the University of Pennsylvania. In my dissertation, I develop and test automated annotation technology using generative AI. Specifically, I investigate the potential of generative large language models (LLMs) to automate manual annotation procedures often used in natural language processing (NLP). To conduct my experiments, I developed an easy-to-use Python package to implement automated annotation.

I have presented this research at numerous computational social science conferences and on expert panels on AI. My research has also received press coverage. Along with Princeton’s Arthur Spirling, I am coordinating the inaugural Princeton-Penn artificial intelligence conference in early September 2024. The conference is titled “Language Models in Social Science” (LMiSS).

In addition to my doctoral studies, I received a Master's in Data Science and Statistics from The Wharton School and have applied industry experience developing predictive algorithms and integrating cutting-edge AI technology into products. I’ve previously worked at Theta Equity Partners and DataCamp.

Contact Download My Resume

Data Science Portfolio

“Automated Annotation with Generative AI Requires Validation” (with Sam Wolken and Neil Fasching)

Generative large language models (LLMs) can be powerful tools for augmenting text annotation procedures. Using GPT-4, we replicated 27 annotation tasks from articles in high-impact journals and found that LLM performance is promising but contingent on the task and dataset. As a result, we argue that any automated annotation process using an LLM must validate the LLM’s performance against labels generated by humans. To ensure effective use of LLMs for annotation, we’re releasing easy-to-use Python code designed to streamline LLM deployment and validation procedures.
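To illustrate what validating against human labels can look like, here is a minimal sketch in plain Python. It is not the code from the released package: the function name `validation_report` and the toy label lists are assumptions made up for this example, and it only covers a binary task.

```python
from collections import Counter


def validation_report(human_labels, llm_labels, positive=1):
    """Compare LLM labels with human labels for a binary annotation task.

    Hypothetical helper for illustration only; human labels are treated
    as the ground truth.
    """
    if not human_labels or len(human_labels) != len(llm_labels):
        raise ValueError("label lists must be non-empty and the same length")

    # Count each (human label, LLM label) combination.
    pairs = Counter(zip(human_labels, llm_labels))
    tp = pairs[(positive, positive)]
    fp = sum(n for (h, m), n in pairs.items() if h != positive and m == positive)
    fn = sum(n for (h, m), n in pairs.items() if h == positive and m != positive)
    correct = sum(n for (h, m), n in pairs.items() if h == m)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(human_labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy data (made up): 1 = text belongs to the category, 0 = it does not.
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 0, 1, 0, 1, 1, 0]
report = validation_report(human, llm)
print(report)
```

In practice, the human-labeled subset would be a random sample of the texts being annotated, and low scores on it would be a signal to revise the prompt or fall back to manual annotation.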

Link to Full Project and Open-Source Python Package

“Classifying Local News Television Transcripts Using RoBERTa” (with Sam Wolken and Chloe Ahn)

We scraped a novel dataset of approximately 18,000 closed captioning transcripts from local television news programs across three cities from 2014 to 2018 and created a series of RoBERTa classifiers to identify news topics in these transcripts over time. Across the seven selected topics, the RoBERTa models achieved an average precision score of 0.85, an average recall score of 0.876, and an average F1 score of 0.859.

View findings through interactive dashboard

“Rallying for Change: The Effect of Protests on Political Fundraising in the U.S.” (with Daniel Gillion)

Protests in the United States have become a common method for citizens to express their concerns about various social and political issues. This study examines the causal impact of protests on individuals' willingness to donate money to American political campaigns. We find a substantial causal relationship between protests and political donations using a staggered difference-in-differences (DiD) design with county and temporal fixed effects. The results indicate that a one-percent increase in protest activities within a county leads to a 0.76% increase in the number of donations and a 1.13% increase in the total donation amount in the immediate days following the protest.
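For readers unfamiliar with difference-in-differences, the core idea can be sketched with the simplest two-group, two-period case. This is a deliberate simplification and not the staggered design with county and temporal fixed effects used in the study; the function name `did_estimate` and all numbers are made up for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Two-group, two-period difference-in-differences estimate.

    Change in the treated group minus change in the control group.
    Hypothetical helper; assumes parallel trends absent treatment.
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up daily donation counts for counties with and without a protest.
effect = did_estimate(
    treated_pre=[10, 12],
    treated_post=[15, 17],
    control_pre=[9, 11],
    control_post=[10, 12],
)
print(effect)
```

The control group's change stands in for what would have happened in the treated counties without a protest, so subtracting it removes shared time trends from the estimate.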

Link to Full Project

“Fleeing For Their Lives: Reconsidering How Americans View Immigrants’ Reasons for Migrating”

Using two nationally representative survey experiments, I test the hypothesis that, when the threats and risks migrants face in their home country are equalized, natives will not penalize people who immigrate due to economic threats relative to people who immigrate due to the threat of violence. After accounting for threat in the migrant’s home country, I find that the penalty natives attach to economic migration is erased. My results indicate that policymakers should reconsider existing immigration and refugee policies related to economic migration and poverty.

Link to Full Project

“Who Donates? Using Machine Learning to Predict Federal Donation Behavior”

Using governmental administrative data and socio-demographic data, I show that LASSO logistic regression and random forest are effective at predicting individual-level donation behavior. LASSO logistic regression correctly classifies 82.7% of test cases (61.9% of positive cases), and random forest correctly classifies 92.8% of test cases (99.9% of positive cases). Although both of these accuracy scores are notably higher than the 74.1% no-information rate, random forest proves to be the superior model by far.
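The no-information rate is the accuracy a model would get by always predicting the most common class, which is why it serves as the baseline here. A minimal sketch of that comparison, using made-up labels rather than the project's data (the function names are assumptions for this example):

```python
from collections import Counter


def no_information_rate(labels):
    """Accuracy of always predicting the most frequent class."""
    if not labels:
        raise ValueError("labels must be non-empty")
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (made up): 1 = donated, 0 = did not donate.
y_true = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]

nir = no_information_rate(y_true)
acc = accuracy(y_true, y_pred)
print(f"no-information rate: {nir:.2f}, model accuracy: {acc:.2f}")
```

A model is only informative to the extent that its accuracy exceeds this baseline, which is also why the share of positive cases classified correctly is reported alongside overall accuracy.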

Link to Full Project

September 2020 - May 2024

Quantitative Researcher

University of Pennsylvania

I have presented research at five peer-reviewed conferences on topics including artificial intelligence, causal inference, experimentation, and time-series analysis. I have taught 'Introduction to Data Science,' participated in various research labs, and served as a paid research assistant. I am also coordinating the first annual Princeton-Penn Artificial Intelligence in the Social Sciences Conference with Arthur Spirling.

June 2023 - Sep 2023

PhD Data Science Intern

Theta Equity Partners, Inc

Developed an internal R package that combined Theta’s proprietary probability models with various cutting-edge machine learning methodologies to improve customer lifetime value (CLV) prediction. Tested the ensemble methodology on three real-world customer transaction datasets and demonstrated notable improvements in model accuracy.

June 2022 - Sep 2022

Machine Learning Intern

DataCamp

Using several large language models (LLMs), I developed an internal linking tool that created over 80,000 internal connections between DataCamp web pages based on semantic similarity, a task that would have taken at least 1,300 hours of labor if done manually. This tool increased web traffic and, most importantly, enhanced the DataCamp user experience.
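As a rough illustration of linking pages by semantic similarity, the sketch below compares page embeddings with cosine similarity and suggests a link when the score clears a threshold. It is not the tool built at DataCamp: the page names, the three-dimensional vectors, the 0.8 threshold, and the function names are all assumptions for this example, and real embeddings would come from a language model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def suggest_links(embeddings, threshold=0.8):
    """Return (page_a, page_b, similarity) for pairs at or above threshold.

    `embeddings` maps a page name to its vector. Hypothetical helper;
    pairs are returned from most to least similar.
    """
    pages = sorted(embeddings)
    links = []
    for i, page_a in enumerate(pages):
        for page_b in pages[i + 1:]:
            sim = cosine_similarity(embeddings[page_a], embeddings[page_b])
            if sim >= threshold:
                links.append((page_a, page_b, sim))
    return sorted(links, key=lambda link: -link[2])


# Made-up embeddings for three hypothetical pages.
toy_embeddings = {
    "intro-to-python": [1.0, 0.0, 0.0],
    "python-basics": [0.9, 0.1, 0.0],
    "sql-joins": [0.0, 0.0, 1.0],
}
links = suggest_links(toy_embeddings)
print(links)
```

Comparing every pair of pages is quadratic in the number of pages, so a site with many pages would likely need an approximate nearest-neighbor index instead of this brute-force loop.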

Experience

3+ years of experience as a quantitative researcher and data scientist.

LinkedIn Download My Resume
