Optimization and Machine Learning Seminar
Spring 2022
Upcoming talk: April 21st
Speaker: David Shirokoff, Department of Mathematical Sciences, NJIT
Topic: Stochastic Gradient Descent as a Markov Chain
Stochastic gradient descent (SGD) is a widely used method for optimizing objective functions written as the sum of many terms, such as those arising in machine learning applications. We will first review how SGD can be cast as a Markov chain with transition probabilities, as well as the asymptotic form of the diffusion approximation (Fokker-Planck PDE). When the underlying objective functions are convex and the step size (learning rate) is small, SGD is well known to minimize the objective function, with the diffusion approximation providing a good characterization. Unfortunately, these assumptions often fail to hold in practical situations. We will examine several situations where the diffusion approximation fails to capture the correct long-time behavior of SGD. We will also discuss ongoing work on the extent to which SGD has a Lyapunov functional and satisfies a notion of detailed balance (the diffusion approximation satisfies both). Identifying whether SGD has a notion of a Lyapunov function would provide insight into how SGD samples the minima of a given objective function.
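For a concrete picture of the Markov-chain viewpoint, here is a minimal sketch (not the speaker's code): each SGD step draws a random term of the sum, so the next iterate depends only on the current iterate and that random draw.

```python
import numpy as np

# Minimal sketch: SGD on f(x) = (1/n) * sum_i (x - a_i)^2 / 2, whose gradient for a
# single term i is (x - a_i). Each step samples one term at random, so the sequence
# x_0, x_1, ... is a Markov chain: the distribution of x_{k+1} depends only on x_k
# and the random index drawn at step k.
rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, scale=2.0, size=100)    # data defining the terms of the sum
eta = 0.05                                      # step size (learning rate)

x = 10.0
trajectory = [x]
for k in range(2000):
    i = rng.integers(len(a))        # random term: the only source of randomness
    x = x - eta * (x - a[i])        # transition x_k -> x_{k+1}
    trajectory.append(x)

# For small eta the iterates fluctuate around the minimizer mean(a); the stationary
# spread of those fluctuations is what the diffusion (Fokker-Planck) approximation models.
print("empirical mean of late iterates:", np.mean(trajectory[1000:]))
print("true minimizer:", a.mean())
```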
April 14th
Speaker: Soheil Saghafi, Department of Mathematical Sciences, NJIT
Topic: A Markov Chain Monte Carlo version of the genetic algorithm Differential Evolution: easy Bayesian computing for real parameter spaces
Differential Evolution (DE) is a stochastic global optimization technique that uses a simple genetic algorithm for numerical optimization in real parameter spaces. In the statistical context, the uncertainty distribution of the optimal value can be obtained using Markov Chain Monte Carlo (MCMC) simulation, a method from Bayesian statistics. In this talk, I will mainly focus on an MCMC method known as the Metropolis-Hastings algorithm and go through some of the specifics of its implementation and calibration, using examples animated in MATLAB. By introducing the Differential Evolution-Markov Chain (DE-MC) method, which integrates the essential ideas of DE and MCMC, we will see how this method outperforms other MCMC approaches in some cases involving nonlinearity, high dimensionality, and multimodality. Finally, I will discuss applying this method to parameter estimation, using the Rosenbrock function as a test case.
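As a rough illustration of the Metropolis-Hastings step discussed in the talk (a Python sketch rather than the speaker's MATLAB examples), targeting an unnormalized density built from the Rosenbrock function:

```python
import numpy as np

# Minimal random-walk Metropolis-Hastings sketch (not the speaker's implementation).
# Target density proportional to exp(-Rosenbrock(x, y)), used purely as a stand-in
# for an unnormalized posterior.
def log_target(p):
    x, y = p
    return -((1.0 - x) ** 2 + 100.0 * (y - x ** 2) ** 2)

rng = np.random.default_rng(1)
current = np.array([0.0, 0.0])
current_logp = log_target(current)
samples = []

for step in range(20000):
    proposal = current + rng.normal(scale=0.1, size=2)   # symmetric random-walk proposal
    proposal_logp = log_target(proposal)
    # Accept with probability min(1, target(proposal) / target(current)).
    if np.log(rng.uniform()) < proposal_logp - current_logp:
        current, current_logp = proposal, proposal_logp
    samples.append(current)

samples = np.array(samples[5000:])   # discard burn-in
print("posterior mean estimate:", samples.mean(axis=0))   # mass concentrates near (1, 1)
```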
April 7th
Speaker: Sepideh Nikookar, Department of Computer Science, NJIT
Topic: Human-AI Complex Task Planning
The process of complex task planning is ubiquitous and arises in a variety of compelling applications. Leading examples include designing a personalized course plan or trip plan, or even planning routes of naval assets to collaboratively discover an unknown destination. For all of these applications, creating a plan requires satisfying a basic construct: composing a sequence of sub-tasks that optimizes several criteria and satisfies constraints. Task plans need to be individualized and designed under uncertainty. When done manually, the process is human-intensive and tedious, and unlikely to scale. This talk will discuss a set of computational frameworks that synthesize the capabilities of humans and AI algorithms to enable task planning at scale, satisfying multiple objectives and complex constraints by adapting reinforcement learning. Following that, the talk will discuss the data engineering and data management opportunities that have been explored to design scalable algorithms that address deployment challenges in real-world settings.
March 24th
Topic: Applications and Algorithms for kNN (k-Nearest Neighbors) in the Wasserstein Metric
The k-nearest neighbors method aims to classify each point in a dataset by finding its k nearest points. When applied to structured datasets, each point is often viewed as an element of a high-dimensional Euclidean space. However, the Euclidean distance can be poorly suited for measuring similarity among certain structured datasets. In this talk, I will introduce the notion of kNN in the Wasserstein metric as an alternative for data points with underlying geometric structure. The first half will focus on W-kNN in natural language processing, where the Wasserstein distance has been used to measure the semantic similarity between text documents. In the second half, I will discuss scalable algorithms, and their theoretical guarantees, for approximating W-kNN on large datasets.
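A minimal brute-force sketch of W-kNN for one-dimensional empirical distributions (illustrative only; document data and the scalable algorithms in the talk require higher-dimensional optimal-transport solvers):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Brute-force W-kNN sketch: each "point" is a small sample (an empirical distribution),
# and neighbors are ranked by the 1-D Wasserstein-1 distance to the query.
rng = np.random.default_rng(2)
dataset = [rng.normal(loc=mu, size=50) for mu in np.linspace(0.0, 5.0, 20)]
query = rng.normal(loc=1.0, size=50)
k = 3

# Pairwise Wasserstein-1 distances from the query to every point in the dataset.
dists = np.array([wasserstein_distance(query, d) for d in dataset])
neighbors = np.argsort(dists)[:k]
print("indices of the k nearest neighbors:", neighbors)
print("their distances:", dists[neighbors])
```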
March 10th
Topic: Data Assimilation for Conductance-Based Modeling of Circadian Pacemaker Neurons
Data assimilation aims to optimally combine noisy measurements and imperfect models to yield accurate estimates of the underlying system's dynamics. In practice, there are two flavors of data assimilation strategies. Sequential methods, like the Kalman filter, can incrementally incorporate new measurements upon arrival to update system estimates in time using only the most recent estimate mean and uncertainty. Variational methods approximate the posterior probability distribution of the states of the system conditioned on the observations, often resulting in a nonlinearly constrained cost function over a specified observation window. Versions of these techniques will be introduced in tandem under certain assumptions, such as Gaussian errors, and applied to a twin experiment with a minimalistic conductance-based model, i.e., the Morris-Lecar model. An additional formulation of the variational method using a "nudging" term, which stabilizes the synchronization manifold of the model to the data, is then applied to construct models for circadian pacemaker neurons of Rhabdomys pumilio, a diurnal rodent, and of Drosophila. Our models were used to understand what governs the distinct electrical variation that occurs in these cells over the course of the day/night cycle.
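As a minimal illustration of the sequential flavor, here is a one-dimensional Kalman filter sketch on a toy random-walk model (not the conductance-based models from the talk):

```python
import numpy as np

# 1-D Kalman filter sketch (illustrative only; the talk's twin experiment uses the
# Morris-Lecar conductance-based model, not this toy random walk).
# Model: x_{k+1} = x_k + process noise,  observation y_k = x_k + measurement noise.
rng = np.random.default_rng(3)
Q, R = 0.01, 0.25                      # process and measurement noise variances
truth, estimate, P = 0.0, 0.0, 1.0     # true state, estimate mean, estimate variance

for k in range(50):
    truth += rng.normal(scale=np.sqrt(Q))        # true state evolves
    y = truth + rng.normal(scale=np.sqrt(R))     # noisy measurement arrives

    # Predict: propagate the previous estimate and inflate its uncertainty.
    P = P + Q
    # Update: blend prediction and measurement, weighted by their uncertainties.
    K = P / (P + R)                              # Kalman gain
    estimate = estimate + K * (y - estimate)
    P = (1.0 - K) * P

print("final truth %.3f, estimate %.3f, variance %.3f" % (truth, estimate, P))
```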
March 3rd
Hosted by: Binan Gu, NJIT
Topic: A PDE-Based Analysis of the Symmetric Two-Armed Bernoulli Bandit
The multi-armed bandit is a classic sequential prediction problem. At
each round, the predictor (player) selects a probability distribution
from a finite collection of distributions (arms) with the goal of
minimizing the difference (regret) between the player's rewards
sampled from the selected arms and the rewards of the best performing
arm at the final round. The player's choice of the arm and the reward
sampled from that arm are revealed to the player, and this prediction
process is repeated until the final round. Our work addresses a
version of the two-armed bandit problem where the arms are distributed
independently according to Bernoulli distributions and the sum of the
means of the arms is one (the symmetric two-armed Bernoulli bandit).
In a regime where the gap between these means goes to zero and the
number of prediction periods approaches infinity, we obtain the
leading order terms of the expected regret and pseudoregret for this
problem by associating each of them with a solution of a linear
parabolic partial differential equation. Our results improve upon the
previously known results; specifically, we explicitly compute the
leading order term of the optimal regret and pseudoregret in three
different scaling regimes for the gap. Additionally, we obtain new
non-asymptotic bounds for any given time horizon. This is joint work
with Robert Kohn available at https://arxiv.org/abs/2202.05767.
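For readers unfamiliar with the setup, here is a minimal simulation of the symmetric two-armed Bernoulli bandit with a generic Thompson-sampling player (illustrative only; not the PDE-based analysis of the paper):

```python
import numpy as np

# Symmetric two-armed Bernoulli bandit: arm means p and 1 - p, so they sum to one.
# The player here uses generic Thompson sampling with Beta posteriors; pseudoregret
# accumulates the gap between the best mean and the mean of the arm actually played.
rng = np.random.default_rng(4)
gap = 0.05                               # difference between the two arm means
p = np.array([0.5 + gap / 2, 0.5 - gap / 2])
T = 20000
successes = np.zeros(2)
failures = np.zeros(2)
pseudoregret = 0.0

for t in range(T):
    # Sample a mean estimate for each arm from its Beta posterior and play the best.
    theta = rng.beta(successes + 1, failures + 1)
    arm = int(np.argmax(theta))
    reward = rng.uniform() < p[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward
    pseudoregret += p.max() - p[arm]

print("pseudoregret after %d rounds: %.1f" % (T, pseudoregret))
```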
February 24th
Hosted by: Binan Gu, NJIT
Topic: On the Change of Variables Formula and the Problem of Diffeomorphic Density Matching
Suppose you would like to redistribute a probability measure in a prescribed way. Under very general conditions this can be done, but not necessarily uniquely. A well-known method for doing this efficiently is known as Optimal Transport. However, some applications such as those arising in computer graphics and computational geometry require that the redistribution of mass density be done in a smooth way. Here we introduce a way of doing this by starting from the space of diffeomorphisms on a compact manifold and exploring the geometric connections with the space of smooth densities. This will allow us to derive a simple PDE to solve with wide applicability over many manifolds.
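For reference, the change of variables (Jacobian) equation that underlies density matching, stated in standard notation rather than the speaker's: a diffeomorphism phi transports a density f to a density g exactly when

```latex
% Jacobian equation for diffeomorphic density matching:
% \varphi pushes f\,dx forward to g\,dx precisely when
(g \circ \varphi)\,\lvert \det D\varphi \rvert \;=\; f .
```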
February 17th
Speaker: Aaron Dharna, Department of Computer Science, New Jersey Institute of Technology
Topic: Open-Ended Learning Leads to Generally Capable Agents
In this work we create agents that can perform well beyond a single, individual task, that exhibit much wider generalisation of behaviour to a massive, rich space of challenges. We define a universe of tasks within an environment domain and demonstrate the ability to train agents that are generally capable across this vast space and beyond. The environment is natively multi-agent, spanning the continuum of competitive, cooperative, and independent games, which are situated within procedurally generated physical 3D worlds. The resulting space is exceptionally diverse in terms of the challenges posed to agents, and as such, even measuring the learning progress of an agent is an open research problem. We propose an iterative notion of improvement between successive generations of agents, rather than seeking to maximise a singular objective, allowing us to quantify progress despite tasks being incomparable in terms of achievable rewards. We show that through constructing an open-ended learning process, which dynamically changes the training task distributions and training objectives such that the agent never stops learning, we achieve consistent learning of new behaviours. The resulting agent is able to score reward in every one of our humanly solvable evaluation levels, with behaviour generalising to many held-out points in the universe of tasks. Examples of this zero-shot generalisation include good performance on Hide and Seek, Capture the Flag, and Tag. Through analysis and hand-authored probe tasks we characterise the behaviour of our agent, and find interesting emergent heuristic behaviours such as trial-and-error experimentation, simple tool use, option switching, and cooperation. Finally, we demonstrate that the general capabilities of this agent could unlock larger scale transfer of behaviour through cheap finetuning.
February 10th
Speaker: Yunan Yang, Advanced Fellow at the Institute for Theoretical Studies, ETH Zürich
Topic: A Generalized Weighted Optimization Method for Computational Learning and Inversion
Abstract: The generalization capacity of various machine learning models exhibits different phenomena in the under- and over-parameterized regimes. In this paper, we focus on regression models such as feature regression and kernel regression and analyze a generalized weighted least-squares optimization method for computational learning and inversion with noisy data. The highlight of the proposed framework is that we allow weighting in both the parameter space and the data space. The weighting scheme encodes both a priori knowledge on the object to be learned and a strategy to weight the contribution of different data points in the loss function. Here, we characterize the impact of the weighting scheme on the generalization error of the learning method, where we derive explicit generalization errors for the random Fourier feature model in both the under- and over-parameterized regimes. For more general feature maps, error bounds are provided based on the singular values of the feature matrix. We demonstrate that appropriate weighting from prior knowledge can improve the generalization capability of the learned model.
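As a rough sketch of the kind of doubly weighted objective involved (a toy example, not the paper's exact scheme): least squares with a data-space weight W and a parameter-space weight L, solved via its normal equations.

```python
import numpy as np

# Toy doubly weighted least squares (not the paper's scheme): minimize
# ||W (A x - y)||^2 + lam * ||L x||^2, where W weights the data points and
# L encodes a priori weighting of the parameters.
rng = np.random.default_rng(5)
n, d = 200, 50
A = rng.normal(size=(n, d))                          # feature matrix
x_true = rng.normal(size=d) / np.arange(1, d + 1)    # decaying ground-truth coefficients
y = A @ x_true + 0.1 * rng.normal(size=n)            # noisy data

W = np.diag(rng.uniform(0.5, 1.5, size=n))           # data-space weights
L = np.diag(np.arange(1, d + 1, dtype=float))        # parameter-space weights (penalize high modes)
lam = 1e-2

# Normal equations: (A^T W^T W A + lam * L^T L) x = A^T W^T W y.
lhs = A.T @ W.T @ W @ A + lam * L.T @ L
rhs = A.T @ W.T @ W @ y
x_hat = np.linalg.solve(lhs, rhs)
print("relative error: %.3f" % (np.linalg.norm(x_hat - x_true) / np.linalg.norm(x_true)))
```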
January 27th
Topic: The Mean Drift: Tailoring the Mean Field Theory for Markov Processes for Real-World Applications
Abstract: In the usual sense, "mean-field" analysis targets a system whose population size increases to infinity. However, in most engineering applications one faces the problem of analyzing moderately sized systems in which the number of agents is bounded. In this talk, I will introduce a tailoring of classic mean-field theory to the analysis of population processes of arbitrary size, via the mean drift and the associated mathematical constructs.
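As background (and explicitly not the speaker's mean-drift construct), here is a sketch contrasting the classical mean-field ODE of an SIS-type population process with a simulation of the underlying finite-N Markov chain:

```python
import numpy as np

# Background illustration: classical mean-field ODE for an SIS epidemic versus a
# discrete-time approximation of the finite-population Markov chain with N agents.
rng = np.random.default_rng(6)
N = 200                       # bounded number of agents, as in the talk's setting
beta, gamma = 1.5, 1.0        # infection and recovery rates
T, dt = 20.0, 0.01
steps = int(T / dt)

# Mean-field ODE: dx/dt = beta * x * (1 - x) - gamma * x, with x the infected fraction.
x = 0.1
for _ in range(steps):
    x += dt * (beta * x * (1.0 - x) - gamma * x)

# Finite-N Markov chain, approximated on the same time grid (tau-leaping style).
I = int(0.1 * N)
for _ in range(steps):
    new_inf = rng.poisson(beta * (I / N) * (N - I) * dt)   # expected infections in dt
    new_rec = rng.poisson(gamma * I * dt)                  # expected recoveries in dt
    I = min(max(I + new_inf - new_rec, 0), N)

print("final infected fraction: mean-field ODE %.3f vs finite-N chain %.3f" % (x, I / N))
```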
January 20th
Hosted by: Binan Gu, NJIT
Topic: Wasserstein GANs work because they fail to approximate the Wasserstein distance
Abstract: Generative Adversarial Networks (GANs) were introduced by Goodfellow et al. in 2014. Subsequently, the Wasserstein GAN (WGAN), introduced in 2017, sought to fix some known issues with training GANs by minimizing the Wasserstein-1 distance between a real observed distribution and a generated distribution as the loss function. In practice, many have found that WGANs work quite well and, like vanilla GANs, produce life-like images that can fool discriminators. However, there is a large gap between practice and theory. A very recent alarmist paper (October 2021) critically examines the fact that in practice the minimum of the Wasserstein-1 distance is not achieved! The algorithm works mostly because of other aspects of its design, not the central choice of the Wasserstein distance. When the Wasserstein-1 distance is actually minimized, the images look poor. So why do WGANs work anyway? How important is the choice of loss function for this problem?
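For reference (standard background, not a result of the paper under discussion), the WGAN loss is built on the Kantorovich-Rubinstein dual form of the Wasserstein-1 distance, with the critic standing in for the 1-Lipschitz function f:

```latex
% Kantorovich--Rubinstein duality: W_1 as a supremum over 1-Lipschitz functions,
% which the WGAN critic only approximates in practice.
W_1(P_r, P_g) \;=\; \sup_{\lVert f \rVert_{L} \le 1}
  \mathbb{E}_{x \sim P_r}\!\left[f(x)\right] - \mathbb{E}_{x \sim P_g}\!\left[f(x)\right]
```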