Karan Vombatkere

PhD Candidate in Computer Science | Boston University

MCS 105A, Department of Computer Science
111 Cummington Mall, Boston, MA 02215

Email: username @ domain
where username = kvombat , domain =



I am a Computer Science PhD student at Boston University, advised by Dr. Evimaria Terzi. I am a part of the Massive Data & Algorithms (MiDAS) research group at BU, and my research is primarily focused on data mining algorithms. My current research is on designing efficient approximation algorithms for submodular optimization problems, such as the Team Formation/Max-k Coverage problem.

My broader research interests are shaped by the diversity of my past experiences: I am excited about all things statistical, algorithmic and game theoretic. I am interested in studying about optimality and efficient strategies in environments of incomplete and imperfect information. I am curious about using machine learning and data-driven methods to formulate a better understanding of social systems.

Before starting my PhD I worked for 3 years as a Data Scientist at IBM, where I developed machine learning solutions from the ground-up on exciting projects at the intersection of business and tech. I developed a bid-price optimization engine in Python, where I used hierarchical clustering and logistic model trees to build a microservice to handle real-time pricing requests. I also helped automate pre-surgical authorization for patients using natural language processing on clinical data. Prior to IBM I was at the University of Rochester, where I studied Physics and Electrical & Computer Engineering as an undergraduate, and explored a variety of topics such as computer audition (with Prof. Zhiyao Duan), computational physics and game theory. I then pursued a masters degree in Data Science with a focus in computational and statistical methods, and worked with Prof. Jiebo Luo.

I play tennis and soccer, and enjoy outdoor activities such as skiing, hiking and biking. I also like to challenge myself with chess, poker and other strategy games. I am an eager home cook and I enjoy exploring the Boston/Cambridge area in search of good food and drink!


Team Formation: Balancing Task Coverage and Expert Workload

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. Here we deviate from this setting and tackle a variant of the problem where all the skills of every task need not be covered. We aim to maximize total task coverage while also simultaneously minimizing the maximum workload among all experts. We combine these two objectives into one objective function and call the corresponding assigninment problem the Balanced-Coverage problem. We first show that this problem is NP-hard, and then design a polynomial-time algorithm with provable approximation guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale well. Finally, we experiment with a variety of real datasets to demonstrate the practical utility of our problem formulation as well as the efficiency and efficacy of our algorithm in practice.

CIKM '22 Submission.

Detecting trends in streaming financial data using Apache Flink

Modern financial analytics rely on high-volume streams of event notifications that report live market fluctuations based on supply and demand. Accurately identifying trends or breakout patterns based on the Exponential Moving Average (EMA) in the development of an instrument’s price early on is an important challenge, so as to buy while the price is low and sell before a downtrend begins. This paper aims to solve the above challenge with a distributed, event-streaming solution built using Apache Flink. We present and implement a solution that leverages customized window operators to calculate the EMA and find breakout patterns, using event generation parallelism to facilitate the rapid processing of the input stream uses sinks to collect and output results, and scales easily on a distributed Flink cluster. We empirically test our design on metrics specified by the benchmarking platform for the DEBS 2022 Grand Challenge and observe a throughput of 45 batches per second and an average latency of 120 ms.

DEBS 2022 Grand Challenge paper.

Mining travel and weather patterns to understand the spread of COVID-19

We investigate the difference in the spread of COVID-19 between the states won by Donald Trump (Red) and the states won by Hillary Clinton (Blue) in the 2016 presidential election, by mining transportation patterns of US residents from March 2020 to July 2020. To ensure a fair comparison, we first use a K-means clustering method to group the 50 states into five clusters according to their population, area and population density. We then characterize daily transportation patterns of the residents of different states using the mean percentage of residents traveling and the number of trips per person. We also use temperature data to attempt to explain the difference in the way residents travel and practice safety measures.

SBP-BRiMS 2021.

Displaying automatic lyrics for streaming audio

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab at the University of Rochester during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time. I implemented a multi-threaded audio framework with a shared synchronous buffer to handle simultaneous recording and processing. I extracted chroma feature vectors from the waveform audio data and then used online dynamic time warping to align the live audio stream with a pre-annotated version of the audio based on their harmonic similarity, and display lyrics in real-time.

IEEE Signal Processing Magazine, Jan 2017.


Boston University | Boston, MA

Doctoral Student

Computer Science PhD candidate advised by Dr. Evimaria Terzi.

IBM | Cambridge, MA

Data Scientist

At IBM, I designed real-world solutions and developed ML models from the ground up on exciting projects at the intersection of business and tech.
Under the guidance of Dr. Mark Freeman, I developed a novel algorithmic method in Python for business-to-business bid price optimization, and successfully deployed it as a microservice for a large telecommunications company.
I also worked with Watson Health to develop an NLP solution to automate pre-surgical authorization. My contributions included building a rules engine in Python and implementing algorithms to extract natural language features from patient clinical data.

University of Rochester | Rochester, NY

Teaching Assistant

I was awarded the Citation for Achievement in College Leadership in recognition of outstanding undergraduate teaching and research commitment. I led workshops, recitations and labs for groups of 15-20 students for the following courses across the Physics, Electrical & Computer Engineering and Math departments:

  • ECE 231: Applied Electromagnetics
  • ECE 111: Circuit Analysis; ECE 112: Digital Logic
  • MTH 161: Calculus I; MTH 162: Calculus II
  • PHY 113: Mechanics; PHY 122: Electromagnetism
  • AST 105 & AST 106: Introductory Astronomy

Research Fellow

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time.


Coreset Algorithms for Clustering and Streaming

Python implementation of Coreset algorithms for clustering and streaming. The code in this repository can be used to generate Coresets for use-cases such as median estimation, minumum-enclosing ball (MEB), k-center clustering and Gaussian mixture models (GMMs).

Settlers of Catan AI Framework

Settlers of Catan boardgame built in Python. The goal of this project is to implement full multiplayer game functionality and use machine learning to build an AI player that can effectively explore-exploit heuristic strategies.

Xenou, Konstantia, Georgios Chalkiadakis, and Stergos Afantenos. "Deep Reinforcement Learning in Strategic Board Game Environments." European Conference on Multi-Agent Systems. Springer, Cham, 2018.
Gendre, Quentin, and Tomoyuki Kaneko. "Playing Catan with Cross-Dimensional Neural Network." International Conference on Neural Information Processing. Springer, Cham, 2020.

Tennis Player Performance Prediction

Under the guidance of Prof. Jiebo Luo, I used efficient feature extraction on historical tennis match data, combined with a machine learning implementation in Python to predict the likelihood of professional tennis player success with 80% accuracy. I implemented code to create player-specific statistical feature sets aggregated from individual match data, and used neural network and logistic regression classification models to categorize player success.

Adversarial Search: Ultimate Tic Tac Toe

An Ultimate Tic Tac Toe framework in Java, with an implementation of adversarial search using MiniMax with Alpha-Beta pruning. Ultimate Tic Tac Toe comprises nine 3x3 Tic Tac Toe boards, and the goal is to win 3 boards. I also developed a heuristic AI player, which was tested to beat a control player in 99 out of 100 games.

German WWII Enigma Machine

A complete implementation of the Enigma Machine in Python. I use an object-oriented framework to design the plugboard, reflector and rotor set, allowing for full encryption and decryption functionality. I also implement code to crack the Enigma cipher, using a known-plaintext attack methodology.

Non-Linear Dynamics of the Damped and Driven Pendulum

I investigated the non-linear dynamics of the damped and driven pendulum and developed a theoretical framework for the system. I then computationally solved the classical mechanics problem using Mathematica to discover regions of chaotic and non-chaotic motion.

Augmented Audio Reality: Binaural Headphones

Designed, built and tested binaural headphones with real-time recording and filtering capability, with a 12ms latency. Modeled the head-related tranfer function (HRTF) using a Neumann head microphone, and implemented the FFT algorithm in C++ to enable real-time audio filtering.

Brownian Motion Stock Price Evolution

A statistical framework in Python to predict stock price evolution using geometric Brownian motion. The model was tested to have under 5% error using Monte Carlo simulations on 2 years of historical Nike stock prices.

Reference: Dmouj, Abdelmoula. "Stock price modelling: Theory and Practice." Masters Degree Thesis, Vrije Universiteit (2006).


Boston University

Aug 2021 - Present

Doctor of Philosophy in Computer Science

Focus in Machine Learning and Data Mining

University of Rochester

Aug 2017 - May 2018

Master of Science in Data Science

Focus in Computational & Statistical Methods
Relevant coursework: Artificial Intelligence, Statistical Learning, Data Mining, Database Systems, Cryptography

University of Rochester

Aug 2013 - May 2017

Bachelor of Science in Electrical & Computer Engineering
Bachelor of Arts in Physics

Highest Distinction, Magna Cum Laude
Relevant coursework: Computational Physics, Data Structures & Algorithms, Robot Control, Game Theory, Classical Mechanics, Linear Algebra, Signal Processing