Research

Pareto frontier tradeoffs

Computing Approximate Pareto Frontiers for Submodular Utility and Cost Tradeoffs

arXiv preprint · PDF ·

We study submodular optimization problems where the goal is to maximize utility while minimizing cost, and move beyond traditional approaches that return only a single solution with a fixed tradeoff. We develop efficient algorithms to compute approximate Pareto frontiers, providing representative solutions that expose the full spectrum of achievable utility–cost tradeoffs in applications such as recommender systems, influence maximization, and team formation.

In many data-mining applications, including recommender systems, influence maximization, and team formation, the goal is to pick a subset of elements (e.g., items, nodes in a network, experts to perform a task) to maximize a monotone submodular utility function while simultaneously minimizing a cost function. Classical formulations model this tradeoff via cardinality or knapsack constraints, or by combining utility and cost into a single weighted objective. However, such approaches require committing to a specific tradeoff in advance and return only a single solution, offering limited insight into the space of viable utility-cost tradeoffs. In this paper, we depart from the single-solution paradigm and examine the problem of computing representative sets of high-quality solutions that expose different tradeoffs between submodular utility and cost. For this, we introduce $(\alpha_1,\alpha_2)$-approximate Pareto frontiers that provably approximate the achievable tradeoffs between submodular utility and cost. Specifically, we formalize the Pareto-$\langle f,c \rangle$ problem and develop efficient algorithms for multiple instantiations arising from different combinations of submodular utility $f$ and cost functions $c$. Our results offer a principled and practical framework for understanding and exploiting utility-cost tradeoffs in submodular optimization. Experiments on datasets from diverse application domains demonstrate that our algorithms efficiently compute approximate Pareto frontiers in practice.
QUBO framework diagram

A QUBO Framework for Team Formation

ECML PKDD 2025 · PDF ·

We introduce a unified QUBO-based formulation for team formation that jointly optimizes skill coverage and expert cost across multiple cost definitions. Our approach achieves performance comparable to established methods and enables transfer learning via graph neural networks that learn reusable expert and skill representations

The team formation problem assumes a set of experts and a task, where each expert has a set of skills and the task requires some skills. The objective is to find a set of experts that maximizes coverage of the required skills while simultaneously minimizing the costs associated with the experts. Different definitions of cost have traditionally led to distinct problem formulations and algorithmic solutions. We introduce the unified TeamFormation formulation that captures all cost definitions for team formation problems that balance task coverage and expert cost. Specifically, we formulate three TeamFormation variants with different cost functions using quadratic unconstrained binary optimization (QUBO), and we evaluate two distinct general-purpose solution methods. We show that solutions based on the QUBO formulations of TeamFormation problems are at least as good as those produced by established baselines. Furthermore, we show that QUBO-based solutions leveraging graph neural networks can effectively learn representations of experts and skills to enable transfer learning, allowing node embeddings from one problem to be applied to another.
Team formation visualization

Forming Coordinated Teams that Balance Task Coverage and Expert Workload

DMKD 2025 · PDF ·

We study team formation problems that balance maximizing skill coverage with minimizing expert workload, and extend the setting to coordination graphs that capture collaboration constraints between experts. We develop scalable approximation algorithms with provable guarantees that efficiently optimize coverage, workload balance, and coordination on real-world datasets.

We deviate from the classic team formation problem setting where one is asking to cover all skills of each given task; instead we aim to cover as many skills as possible while also trying to minimize the maximum workload among the experts. We also consider a variant of this problem, where the experts are organized into a graph, which encodes how well they work together. Utilizing such a coordination graph, we aim to find teams to assign to tasks such that each team's radius does not exceed a given threshold. We develop a generic template algorithm for approximating both problems in polynomial time, and we show that our template algorithm has provable guarantees. We describe a set of computational speedups that we can apply to our algorithms and make them scale for reasonably large datasets. From the practical point of view, we demonstrate how to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real-world datasets demonstrate the utility of our problem formulation as well as the efficiency of our algorithms in practice.
Social media feed analysis

Investigating Exploration and Exploitation on Social Media Feeds

The Web Conference 2024 · PDF ·

We develop a temporal graph framework to detect and quantify personalization in social media feeds by distinguishing between exploration and exploitation in recommendation timelines. Using real and baseline TikTok datasets, we show how user interactions such as viewing, liking, and following shape personalization, enabling transparent and explainable auditing of recommendation systems.

Recommendation algorithms for social media feeds often function as black boxes from the perspective of users. We aim to detect whether social media feed recommendations are personalized to users, and we also aim to characterize the extent and factors contributing to personalization in these feeds. We introduce a general framework to examine a set of social media feed recommendations for a user as a temporal graph. We label vertices in the graph as the result of exploration vs. exploitation of the user's interests on the part of the recommendation algorithm and introduce a set of metrics to capture the extent of personalization across user timelines. We apply our framework to a real TikTok dataset and validate our results using a baseline dataset generated from automated TikTok bots, as well as a randomized baseline. We also investigate the extent to which factors such as video viewing duration, liking, and following drive the personalization of content on TikTok. Our results demonstrate that our framework produces intuitive and explainable results, and can be used to audit and understand personalization in social media feeds.
Balanced coverage diagram

Balancing Task Coverage and Expert Workload in Team Formation

SDM 2023 · PDF ·

We introduce the Balanced Coverage problem, which assigns experts to tasks to maximize skill coverage while simultaneously minimizing imbalance in workload across experts. Despite the problem’s NP-hardness, we develop a scalable approximation algorithm with provable guarantees that efficiently balances coverage and workload on real-world datasets.

In the classical team-formation problem the goal is to identify a team of experts such that the skills of these experts cover all the skills required by a given task. We deviate from this setting and propose a variant of the classical problem in which we aim to cover the skills of every task as well as possible, while also trying to minimize the maximum workload among the experts. Instead of setting the coverage constraint and minimizing the maximum load, we combine these two objectives into one. We call the corresponding assignment problem the Balanced Coverage problem, and show that it is NP-hard. We adopt a weaker notion of approximation and we show that under this notion we can design a polynomial-time approximation algorithm with provable guarantees. We also describe a set of computational speedups that we can apply to the algorithm and make it scale for reasonably large datasets. From the practical point of view, we demonstrate how the nature of the objective function allows us to efficiently tune the two parts of the objective and tailor their importance to a particular application. Our experiments with a variety of real datasets demonstrate the utility of our problem formulation as well as the efficacy and efficiency of our algorithm in practice.
Financial streaming data

Detecting trends in streaming financial data using Apache Flink

DEBS 2022 Grand Challenge · PDF · ·

Distributed Apache Flink pipeline for streaming EMA-based trend detection and breakout identification. Uses custom window operators and parallel event generation for scalable throughput. Benchmarked on DEBS 2022 Grand Challenge metrics with low latency and high batch throughput.

Modern financial analytics rely on high-volume streams of event notifications that report live market fluctuations based on supply and demand. Accurately identifying trends or breakout patterns based on the Exponential Moving Average (EMA) in the development of an instrument's price early on is an important challenge, so as to buy while the price is low and sell before a downtrend begins. This paper aims to solve the above challenge with a distributed, event-streaming solution built using Apache Flink. We present and implement a solution that leverages customized window operators to calculate the EMA and find breakout patterns, using event generation parallelism to facilitate the rapid processing of the input stream uses sinks to collect and output results, and scales easily on a distributed Flink cluster. We empirically test our design on metrics specified by the benchmarking platform for the DEBS 2022 Grand Challenge and observe a throughput of 45 batches per second and an average latency of 120 ms.
COVID-19 mobility analysis

Mining travel and weather patterns to understand the spread of COVID-19

arXiv 2020 · PDF · ·

Comparative analysis of COVID-19 spread and mobility across US states, controlled by clustering on population and density. Characterizes travel behavior using daily trip metrics and explores temperature effects on safety practices. Highlights political and behavioral contrasts in early-pandemic mobility patterns.

We investigate the difference in the spread of COVID-19 between the states won by Donald Trump (Red) and the states won by Hillary Clinton (Blue) in the 2016 presidential election, by mining transportation patterns of US residents from March 2020 to July 2020. To ensure a fair comparison, we first use a K-means clustering method to group the 50 states into five clusters according to their population, area and population density. We then characterize daily transportation patterns of the residents of different states using the mean percentage of residents traveling and the number of trips per person. We also use temperature data to attempt to explain the difference in the way residents travel and practice safety measures.
Audio alignment visualization

Displaying automatic lyrics for streaming audio

IEEE Signal Processing Magazine 2016 · PDF · ·

Real-time lyric alignment system for live music, supported by a Xerox research fellowship with the Audio Information Research Lab. Built a multi-threaded audio pipeline and aligned live streams using chroma features and online dynamic time warping. Demonstrates low-latency synchronization between live audio and pre-encoded lyrics.

I was awarded a research fellowship by Xerox to conduct research under the guidance of Prof. Zhiyao Duan in the Audio Information Research Lab at the University of Rochester during the summer of 2016. The primary purpose of my research was to design a computational system that could follow live musical performances and display pre-encoded lyrics in real-time. I implemented a multi-threaded audio framework with a shared synchronous buffer to handle simultaneous recording and processing. I extracted chroma feature vectors from the waveform audio data and then used online dynamic time warping to align the live audio stream with a pre-annotated version of the audio based on their harmonic similarity, and display lyrics in real-time.

Teaching

Boston University

BU Summer Challenge Course Instructor · 2023, 2024

Teaching Fellow · 2022 - 2026

  • CS 460: Database Systems
  • CS 565: Algorithmic Data Mining
  • CS 131: Combinatoric Structures
  • CS 132: Geometric Algorithms
  • CS 391: Algorithms to Live By

University of Rochester

Teaching Assistant · 2014 - 2018

  • ECE 231: Applied Electromagnetism
  • ECE 111, ECE 112: Analysis of Electrical Circuits and Logic Circuit Design
  • MTH 161, MTH 162: Differential and Integral Calculus
  • PHY 113, PHY 122: Mechanics and Electricity & Magnetism
  • AST 105, AST 106: Introductory Astronomy