I am a data scientist who studies applied machine learning and artificial intelligence. I am in the process of finishing my doctoral dissertation at the University of Pennsylvania. In my dissertation, I develop and test automated annotation technology using generative AI. Specifically, I investigate the potential of generative large language models (LLMs) to automate manual annotation procedures often used in natural language processing (NLP). To conduct my experiments, I developed an easy-to-use Python package to implement automated annotation.
I have presented this research at numerous computational social science conferences and on expert panels on AI. My research has also received press coverage. I am coordinating, along with Princeton’s Arthur Spirling, an inaugural Princeton-Penn artificial intelligence conference in early September 2024. The conference is titled “Language Models in Social Science” (LMiSS).
In addition to my doctoral studies, I received a Master's in Data Science and Statistics from The Wharton School and have applied industry experience developing predictive algorithms and integrating cutting-edge AI technology into products. I’ve previously worked at Theta Equity Partners and DataCamp.
Generative large language models (LLMs) can be powerful tools for augmenting text annotation procedures. Using GPT-4, we replicated 27 annotation tasks from articles in high-impact journals and show that LLM performance is promising but contingent on the task and dataset. As a result, we argue that any automated annotation process using an LLM must validate the LLM’s performance against labels generated by humans. To ensure effective use of LLMs for annotation, we’re releasing easy-to-use Python code designed to streamline LLM deployment and validation procedures.
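The validation step argued for above can be sketched in a few lines. This is a minimal illustration that assumes a single binary annotation task. The function name `validation_metrics` and the toy labels are hypothetical and are not taken from the released package.

```python
from collections import Counter


def validation_metrics(human_labels, llm_labels, positive=1):
    """Score LLM labels against human labels for one binary annotation task."""
    if not human_labels or len(human_labels) != len(llm_labels):
        raise ValueError("need two non-empty label lists of equal length")

    # Tally each (human label, LLM label) pair.
    pairs = Counter(zip(human_labels, llm_labels))
    tp = pairs[(positive, positive)]
    fp = sum(n for (h, m), n in pairs.items() if h != positive and m == positive)
    fn = sum(n for (h, m), n in pairs.items() if h == positive and m != positive)
    agree = sum(n for (h, m), n in pairs.items() if h == m)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": agree / len(human_labels),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical toy labels, for illustration only (not data from the study).
human = [1, 0, 1, 1, 0, 0, 1, 0]
llm = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = validation_metrics(human, llm)
print(metrics)
```

Running the same comparison separately for each task and dataset is what would reveal the task-to-task variation in LLM performance described above.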
We scraped a novel dataset of approximately 18,000 closed captioning transcripts from local television news programs across three cities from 2014 to 2018 and created a series of RoBERTa classifiers to identify news topics in these transcripts over time. Across the seven selected topics, the RoBERTa models achieved an average precision score of 0.85, an average recall score of 0.876, and an average F1 score of 0.859.
Protests in the United States have become a common method for citizens to express their concerns about various social and political issues. This study examines the causal impact of protests on individuals' willingness to donate money to American political campaigns. We find a substantial causal relationship between protests and political donations using a staggered difference-in-differences (DiD) design with county and temporal fixed effects. The results indicate that a one-percent increase in protest activities within a county leads to a 0.76% increase in the number of donations and a 1.13% increase in the total donation amount in the immediate days following the protest.
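To make the role of the county and temporal fixed effects concrete, here is a minimal sketch of a plain two-way fixed effects regression on a balanced toy panel. It is not the staggered DiD estimator used in the study. It assumes both variables are in logs, so the slope reads as an elasticity, and the function `twfe_slope`, the county names, and all numbers are hypothetical.

```python
from statistics import fmean


def twfe_slope(panel):
    """Slope of y on x after removing unit and period fixed effects.

    `panel` maps (unit, period) -> (x, y) and must be balanced.
    """
    units = sorted({u for u, _ in panel})
    periods = sorted({t for _, t in panel})
    if len(panel) != len(units) * len(periods):
        raise ValueError("panel must be balanced")

    def demean(idx):
        # Two-way within transformation: v - unit mean - period mean + grand mean.
        unit_mean = {u: fmean(panel[u, t][idx] for t in periods) for u in units}
        period_mean = {t: fmean(panel[u, t][idx] for u in units) for t in periods}
        grand = fmean(v[idx] for v in panel.values())
        return {k: panel[k][idx] - unit_mean[k[0]] - period_mean[k[1]] + grand
                for k in panel}

    x_dm, y_dm = demean(0), demean(1)
    sxx = sum(v * v for v in x_dm.values())
    if sxx == 0:
        raise ValueError("x has no variation left after demeaning")
    return sum(x_dm[k] * y_dm[k] for k in panel) / sxx


# Hypothetical toy panel: log protests (x) and log donations (y).
county_effect = {"A": 1.0, "B": 2.0, "C": 3.0}
period_effect = [0.0, 0.1, 0.2, 0.3]
log_protests = {
    "A": [0.0, 1.0, 1.0, 2.0],
    "B": [0.0, 0.0, 1.5, 1.5],
    "C": [0.5, 0.5, 0.5, 3.0],
}
true_elasticity = 0.5  # chosen for the toy data, not an estimate from the study

panel = {
    (c, t): (x, county_effect[c] + period_effect[t] + true_elasticity * x)
    for c, xs in log_protests.items()
    for t, x in enumerate(xs)
}
beta = twfe_slope(panel)
print(round(beta, 6))
```

Because the toy outcome is built without noise, the estimator recovers the chosen elasticity. Under the log-log assumption, a slope of 0.5 would mean a one-percent increase in protests goes with a 0.5% increase in donations.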
Using two nationally representative survey experiments, I test the hypothesis that, when the threats and risks migrants face in their home country are equalized, natives will not penalize people who immigrate due to economic threats relative to people who immigrate due to threats of violence. After accounting for threat in the migrant’s home country, I find that natives’ special penalty associated with economic types of migration is erased. My results indicate that policymakers should reconsider existing immigration and refugee policies related to economic migration and poverty.
Using governmental administrative data and socio-demographic data, I show that LASSO logistic regression and random forest are effective at predicting individual-level donation behavior. LASSO logistic regression correctly classifies 82.7% of test cases (61.9% of positive cases), and random forest correctly classifies 92.8% of test cases (99.9% of positive cases). Although both accuracy scores are notably higher than the 74.1% no-information rate, random forest proves to be the superior model by far.
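The no-information rate is the accuracy a model would get by always predicting the most common class, which is why it serves as the baseline here. The sketch below, with hypothetical helper names and toy labels rather than the project's data, shows how a classifier can beat that baseline on overall accuracy while still missing many positive cases.

```python
from collections import Counter


def no_information_rate(y_true):
    """Accuracy of always predicting the most common class."""
    return max(Counter(y_true).values()) / len(y_true)


def accuracy(y_true, y_pred):
    """Share of all cases classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred, positive=1):
    """Share of truly positive cases that the model labels positive."""
    hits = [p == positive for t, p in zip(y_true, y_pred) if t == positive]
    return sum(hits) / len(hits)


# Hypothetical toy labels: 1 = donor, 0 = non-donor.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0]

nir = no_information_rate(y_true)
acc = accuracy(y_true, y_pred)
sens = sensitivity(y_true, y_pred)
print(nir, acc, sens)
```

In this toy case accuracy exceeds the no-information rate even though only half of the positive cases are found, so reporting both figures side by side gives a fuller picture than accuracy alone.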
3+ years of experience as a quantitative researcher and data scientist.