Close

Karan Vombatkere

PhD Candidate in Computer Science | Boston University

Email: username @ domain
where username = kvombat , domain = bu.edu


I am a PhD student in the Massive Data & Algorithms (MiDAS) research group at BU, advised by Dr. Evimaria Terzi. My research interests are primarily focused on algorithmic data mining and computational social science problems. My current research is on designing efficient approximation algorithms for submodular optimization problems, specifically the Team Formation problem.

Before starting my PhD I was a Data Scientist at IBM in Cambridge, MA, where I developed machine learning solutions from the ground-up on exciting projects at the intersection of business and tech. Prior to IBM I studied Physics and Electrical & Computer Engineering at the University of Rochester, NY, and explored a variety of topics such as computer audition (with Dr. Zhiyao Duan), computational physics and game theory. I then received a M.S. in Data Science and worked on data mining problems with Dr. Jiebo Luo.

I play tennis and soccer, and enjoy outdoor activities such as skiing, hiking and biking. I also like to challenge myself with chess, poker and other strategy games. I am an eager home cook and I enjoy exploring the Boston/Cambridge area in search of good food and drink!

Research

Investigating Exploration and Exploitation on Social Media Feeds

Recommendation algorithms for social media feeds often function as black boxes from the perspective of users. We aim to detect whether social media feed recommendations are personalized to users, and we also aim to characterize the extent and factors contributing to personalization in these feeds. We introduce a general framework to examine a set of social media feed recommendations for a user as a temporal graph. We label vertices in the graph as the result of exploration vs. exploitation of the user’s interests on the part of the recommendation algorithm and introduce a set of metrics to capture the extent of personalization across user timelines. We apply our framework to a real TikTok dataset and validate our results using a baseline dataset generated from automated TikTok bots, as well as a randomized baseline. We also investigate the extent to which factors such as video viewing duration, liking, and following drive the personalization of content on TikTok. Our results demonstrate that our framework produces intuitive and explainable results, and can be used to audit and understand personalization in social media feeds.

[WebConf 2024]

Balancing Task Coverage and Expert Workload in Team Formation

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.

[SDM 2023]

Detecting trends in streaming financial data using Apache Flink

Modern financial analytics rely on high-volume streams of event notifications that report live market fluctuations based on supply and demand. Accurately identifying trends or breakout patterns based on the Exponential Moving Average (EMA) in the development of an instrument’s price early on is an important challenge, so as to buy while the price is low and sell before a downtrend begins. This paper aims to solve the above challenge with a distributed, event-streaming solution built using Apache Flink. We present and implement a solution that leverages customized window operators to calculate the EMA and find breakout patterns, using event generation parallelism to facilitate the rapid processing of the input stream uses sinks to collect and output results, and scales easily on a distributed Flink cluster. We empirically test our design on metrics specified by the benchmarking platform for the DEBS 2022 Grand Challenge and observe a throughput of 45 batches per second and an average latency of 120 ms.


[DEBS 2022]

Mining travel and weather patterns to understand the spread of COVID-19

We investigate the difference in the spread of COVID-19 between the states won by Donald Trump (Red) and the states won by Hillary Clinton (Blue) in the 2016 presidential election, by mining transportation patterns of US residents from March 2020 to July 2020. To ensure a fair comparison, we first use a K-means clustering method to group the 50 states into five clusters according to their population, area and population density. We then characterize daily transportation patterns of the residents of different states using the mean percentage of residents traveling and the number of trips per person. We also use temperature data to attempt to explain the difference in the way residents travel and practice safety measures.


[SBP - BriMS 2021]

Displaying automatic lyrics for streaming audio

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab at the University of Rochester during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time. I implemented a multi-threaded audio framework with a shared synchronous buffer to handle simultaneous recording and processing. I extracted chroma feature vectors from the waveform audio data and then used online dynamic time warping to align the live audio stream with a pre-annotated version of the audio based on their harmonic similarity, and display lyrics in real-time.