Karan Vombatkere

A QUBO Framework for Team Formation

The team formation problem assumes a set of experts and a task, where each expert has a set of skills and the task requires some skills. The objective is to find a set of experts that maximizes coverage of the required skills while simultaneously minimizing the costs associated with the experts. Different definitions of cost have traditionally led to distinct problem formulations and algorithmic solutions. We introduce the unified TeamFormation formulation that captures all cost definitions for team formation problems that balance task coverage and expert cost. Specifically, we formulate three TeamFormation variants with different cost functions using quadratic unconstrained binary optimization (QUBO), and we evaluate two distinct general-purpose solution methods. We show that solutions based on the QUBO formulations of TeamFormation problems are at least as good as those produced by established baselines. Furthermore, we show that QUBO-based solutions leveraging graph neural networks can effectively learn representations of experts and skills to enable transfer learning, allowing node embeddings from one problem to be applied to another.

[Preprint]

Forming Coordinated Teams that Balance Task Coverage and Expert Workload

We deviate from the classic team formation problem setting where one is asking to cover all skills of each given task; instead we aim to cover as many skills as possible while also trying to minimize the maximum workload among the experts. We also consider a variant of this problem, where the experts are organized into a graph, which encodes how well they work together. Utilizing such a coordination graph, we aim to find teams to assign to tasks such that each team's radius does not exceed a given threshold. We develop a generic template algorithm for approximating both problems in polynomial time, and we show that our template algorithm has provable guarantees. We describe a set of computational speedups that we can apply to our algorithms and make them scale for reasonably large datasets. From the practical point of view, we demonstrate how to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real-world datasets demonstrate the utility of our problem formulation as well as the efficiency of our algorithms in practice.

[DMKD'25]

Investigating Exploration and Exploitation on Social Media Feeds

Recommendation algorithms for social media feeds often function as black boxes from the perspective of users. We aim to detect whether social media feed recommendations are personalized to users, and we also aim to characterize the extent and factors contributing to personalization in these feeds. We introduce a general framework to examine a set of social media feed recommendations for a user as a temporal graph. We label vertices in the graph as the result of exploration vs. exploitation of the user’s interests on the part of the recommendation algorithm and introduce a set of metrics to capture the extent of personalization across user timelines. We apply our framework to a real TikTok dataset and validate our results using a baseline dataset generated from automated TikTok bots, as well as a randomized baseline. We also investigate the extent to which factors such as video viewing duration, liking, and following drive the personalization of content on TikTok. Our results demonstrate that our framework produces intuitive and explainable results, and can be used to audit and understand personalization in social media feeds.

[WebConf'24]

Balancing Task Coverage and Expert Workload in Team Formation

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.

[SDM'23]

Detecting trends in streaming financial data using Apache Flink

Modern financial analytics rely on high-volume streams of event notifications that report live market fluctuations based on supply and demand. Accurately identifying trends or breakout patterns based on the Exponential Moving Average (EMA) in the development of an instrument’s price early on is an important challenge, so as to buy while the price is low and sell before a downtrend begins. This paper aims to solve the above challenge with a distributed, event-streaming solution built using Apache Flink. We present and implement a solution that leverages customized window operators to calculate the EMA and find breakout patterns, using event generation parallelism to facilitate the rapid processing of the input stream uses sinks to collect and output results, and scales easily on a distributed Flink cluster. We empirically test our design on metrics specified by the benchmarking platform for the DEBS 2022 Grand Challenge and observe a throughput of 45 batches per second and an average latency of 120 ms.

[DEBS'22]

Mining travel and weather patterns to understand the spread of COVID-19

We investigate the difference in the spread of COVID-19 between the states won by Donald Trump (Red) and the states won by Hillary Clinton (Blue) in the 2016 presidential election, by mining transportation patterns of US residents from March 2020 to July 2020. To ensure a fair comparison, we first use a K-means clustering method to group the 50 states into five clusters according to their population, area and population density. We then characterize daily transportation patterns of the residents of different states using the mean percentage of residents traveling and the number of trips per person. We also use temperature data to attempt to explain the difference in the way residents travel and practice safety measures.

Displaying automatic lyrics for streaming audio

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab at the University of Rochester during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time. I implemented a multi-threaded audio framework with a shared synchronous buffer to handle simultaneous recording and processing. I extracted chroma feature vectors from the waveform audio data and then used online dynamic time warping to align the live audio stream with a pre-annotated version of the audio based on their harmonic similarity, and display lyrics in real-time.