Karan Vombatkere

PhD Candidate in Computer Science | Boston University

Email: username @ domain
where username = kvombat , domain =


I am a PhD student in the Massive Data & Algorithms (MiDAS) research group at BU, advised by Dr. Evimaria Terzi. My research interests are primarily focused on algorithmic data mining and computational social science problems. My current research is on designing efficient approximation algorithms for submodular optimization problems, specifically the Team Formation problem.

Before starting my PhD I worked for ~3 years as a Data Scientist at IBM in Cambridge, MA, where I developed machine learning solutions from the ground-up on exciting projects at the intersection of business and tech. Prior to IBM I studied Physics and Electrical & Computer Engineering at the University of Rochester, NY, and explored a variety of topics such as computer audition (with Dr. Zhiyao Duan), computational physics and game theory. I then received a M.S. in Data Science and worked on data mining problems with Dr. Jiebo Luo.

I play tennis and soccer, and enjoy outdoor activities such as skiing, hiking and biking. I also like to challenge myself with chess, poker and other strategy games. I am an eager home cook and I enjoy exploring the Boston/Cambridge area in search of good food and drink!


Balancing Task Coverage and Expert Workload in Team Formation

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.

SDM 2023

Detecting trends in streaming financial data using Apache Flink

Modern financial analytics rely on high-volume streams of event notifications that report live market fluctuations based on supply and demand. Accurately identifying trends or breakout patterns based on the Exponential Moving Average (EMA) in the development of an instrument’s price early on is an important challenge, so as to buy while the price is low and sell before a downtrend begins. This paper aims to solve the above challenge with a distributed, event-streaming solution built using Apache Flink. We present and implement a solution that leverages customized window operators to calculate the EMA and find breakout patterns, using event generation parallelism to facilitate the rapid processing of the input stream uses sinks to collect and output results, and scales easily on a distributed Flink cluster. We empirically test our design on metrics specified by the benchmarking platform for the DEBS 2022 Grand Challenge and observe a throughput of 45 batches per second and an average latency of 120 ms.

DEBS 2022

Mining travel and weather patterns to understand the spread of COVID-19

We investigate the difference in the spread of COVID-19 between the states won by Donald Trump (Red) and the states won by Hillary Clinton (Blue) in the 2016 presidential election, by mining transportation patterns of US residents from March 2020 to July 2020. To ensure a fair comparison, we first use a K-means clustering method to group the 50 states into five clusters according to their population, area and population density. We then characterize daily transportation patterns of the residents of different states using the mean percentage of residents traveling and the number of trips per person. We also use temperature data to attempt to explain the difference in the way residents travel and practice safety measures.

SBP-BRiMS 2021

Displaying automatic lyrics for streaming audio

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab at the University of Rochester during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time. I implemented a multi-threaded audio framework with a shared synchronous buffer to handle simultaneous recording and processing. I extracted chroma feature vectors from the waveform audio data and then used online dynamic time warping to align the live audio stream with a pre-annotated version of the audio based on their harmonic similarity, and display lyrics in real-time.

IEEE Signal Processing Magazine, Jan 2017