-
Permanent and Transient Representations for Continual Reinforcement Learning. Nishanth Anand, Doina Precup. Preprint 2026.
[PDF]
Continual reinforcement learning agents struggle to adapt to new situations while retaining past knowledge, a manifestation of the stability-plasticity trade-off. An appealing solution is to decompose the agent's predictions into permanent and transient components---one for long-term retention and the other for rapid adaptation---thereby achieving a better balance (Anand & Precup, 2023). Building on this idea, we propose using different sets of feature representations to estimate the permanent and transient value functions, enabling even faster adaptation. We demonstrate the effectiveness of our approach on small-scale prediction and control tasks, analyze its theoretical properties, and show its benefits on the Craftax-Classic benchmark using a novel non-parametric approximator for transient value function estimation. Our method facilitates online learning and outperforms the PQN baseline.
-
AIF-GEN: Open-Source Platform and Synthetic Dataset Suite for Reinforcement Learning on Large Language Models. Jacob Chmura, Shahrad Mohammadzadeh, Ivan Anokhin, Jacob-Junqi Tian, Mandana Samiei, Taz Scott-Talib, Irina Rish, Doina Precup, Reihaneh Rabbany, Nishanth Anand. CodeML Workshop, ICML 2025.
[PDF]
Reinforcement learning has proven effective for fine-tuning large language models (LLMs) using reward models trained on human preference data. However, collecting such feedback remains expensive, especially in dynamic settings like personalized tutoring, where users' preferences shift over time and through past interactions. To address this, we present AIF-GEN, the first synthetic preference data generation platform designed for both traditional and lifelong RLHF. We use AIF-GEN to instantiate 18 synthetic datasets and evaluate their quality using an LLM. We also perform a human evaluation on a subset of the generated datasets to further confirm their quality. Our results show AIF-GEN's potential to support the development of traditional and lifelong RLHF algorithms that align LLMs.
-
Prediction and Control in Continual Reinforcement Learning. Nishanth Anand, Doina Precup. NeurIPS 2023.
[PDF]
[YouTube]
Temporal difference (TD) learning is often used to update the estimate of the value function, which RL agents use to extract useful policies. In this paper, we focus on value function estimation in continual reinforcement learning. We propose to decompose the value function into two components which update at different timescales: a permanent value function, which holds general knowledge that persists over time, and a transient value function, which allows quick adaptation to new situations. We establish theoretical results showing that our approach is well suited for continual learning and draw connections to the complementary learning systems (CLS) theory from neuroscience. Empirically, this approach improves performance significantly on both prediction and control problems.
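The decomposition above admits a compact tabular sketch. The function names, update rule, and learning rates below are illustrative assumptions rather than the paper's exact algorithm: the transient value function adapts quickly via TD(0) on the combined estimate, and is periodically consolidated into a slowly changing permanent value function.

```python
import numpy as np

def pt_td_update(v_perm, v_trans, s, r, s_next, alpha_t=0.5, gamma=0.9):
    """One TD(0) step on the transient component, bootstrapping from the
    combined estimate v_perm + v_trans (fast timescale)."""
    v = lambda x: v_perm[x] + v_trans[x]   # combined value estimate
    td_error = r + gamma * v(s_next) - v(s)
    v_trans[s] += alpha_t * td_error
    return td_error

def consolidate(v_perm, v_trans, alpha_p=0.1):
    """Slowly absorb transient knowledge into the permanent component,
    then reset the transient part (e.g. at a task boundary)."""
    v_perm += alpha_p * v_trans
    v_trans[:] = 0.0
```

The two learning rates encode the two timescales: `alpha_t` governs rapid adaptation, while the smaller `alpha_p` governs slow, stable retention.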
-
Preferential Temporal Difference Learning. Nishanth Anand, Doina Precup. ICML 2021.
[PDF]
[YouTube]
Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD error, which is then propagated to other states. However, when computing updates, it may be useful to take into account information other than whether a state was visited. For example, some states might be more important than others (such as states frequently seen along successful trajectories), while others might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach for re-weighting the states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
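One way such a reweighting could look in code, as a hedged sketch: a preference weight `beta[s]` in [0, 1] scales how strongly each state is updated, and an eligibility-style trace passes credit through low-preference states. The recursion and parameter names are illustrative, not the paper's exact Preferential TD update (which also reweights states on the target side).

```python
import numpy as np

def preferential_td_episode(V, beta, transitions, alpha=0.1, gamma=0.9):
    """Run TD updates over one episode, weighting each state's update
    by its preference beta[s]; credit flows through low-preference
    states via the trace rather than updating them directly."""
    e = np.zeros_like(V)                # eligibility-style trace
    for s, r, s_next in transitions:    # transitions: (state, reward, next state)
        e *= gamma * (1.0 - beta[s])    # low-preference states pass credit along
        e[s] += beta[s]                 # update s in proportion to its preference
        td_error = r + gamma * V[s_next] - V[s]
        V += alpha * td_error * e
    return V
```

A state with `beta[s] = 0` receives no direct update, while a state with `beta[s] = 1` behaves as in ordinary TD(0).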
-
Recurrent Learning in Reinforcement Learning. Pierre Thodoroff*, Nishanth Anand*, Lucas Caccia, Doina Precup, Joelle Pineau. SPiRL workshop, ICLR 2019.
[PDF]
In sequential modelling, exponential smoothing is one of the most widely used techniques for maintaining temporal consistency in estimates. In this work, we propose Recurrent Learning, a method that estimates the value function in reinforcement learning using exponential smoothing along the trajectory. We establish its asymptotic convergence properties under smoothness assumptions on the reward. The proposed algorithm yields a natural way to learn a state-dependent emphasis function that selectively learns to emphasize or ignore states based on trajectory information. We demonstrate the potential of this selective updating on a partially observable domain and several continuous control tasks.
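The smoothing itself is a one-line recursion. A minimal sketch, assuming per-step smoothing weights are given (in the paper the emphasis is a learned, state-dependent function, not a fixed sequence):

```python
def smooth_along_trajectory(values, betas):
    """Exponentially smooth value estimates along a trajectory:
    v_rec[t] = (1 - beta[t]) * v[t] + beta[t] * v_rec[t-1],
    so beta[t] near 1 leans on past estimates (ignoring the current
    state) and beta[t] near 0 keeps the current estimate unchanged."""
    v_rec = values[0]
    smoothed = [v_rec]
    for v, b in zip(values[1:], betas[1:]):
        v_rec = (1.0 - b) * v + b * v_rec
        smoothed.append(v_rec)
    return smoothed
```

This is why the method helps under partial observability: a noisy or aliased state can be largely ignored by pushing its smoothing weight toward 1.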
-
Recurrent Value Functions. Pierre Thodoroff*, Nishanth Anand*, Lucas Caccia, Doina Precup, Joelle Pineau. RLDM 2019.
[PDF]
Despite recent successes in reinforcement learning, value-based methods often suffer from high variance, hindering performance. In this paper, we illustrate this in a continuous control setting where state-of-the-art methods perform poorly whenever sensor noise is introduced. To overcome this issue, we introduce Recurrent Value Functions (RVFs) as an alternative way to estimate the value function of a state: we estimate the value function of the current state using the value functions of past states visited along the trajectory. Due to the nature of their formulation, RVFs have a natural way of learning an emphasis function that selectively emphasizes important states. First, we establish the asymptotic convergence properties of RVFs in the tabular setting. We then demonstrate their robustness on a partially observable domain and on continuous control tasks. Finally, we provide a qualitative interpretation of the learned emphasis function.
-
Temporal Credit Assignment via Traces in Reinforcement Learning. Nishanth Anand. MSc Thesis.
[PDF]
Reinforcement learning is a framework for sequential decision making that is widely used in many domains, such as robotics and autonomous driving. Due to this sequential nature, there arises the problem of assigning credit to actions taken in the past; in reinforcement learning, this is known as temporal credit assignment. Temporal credit assignment lies at the core of many methods within the reinforcement learning framework, such as options, online learning, and off-policy learning. Several problems, including high variance in value function estimates, sub-optimal policies, and high sample complexity, are consequences of improper temporal credit assignment. In this thesis, we introduce and examine a couple of temporal credit assignment techniques. Specifically, we mitigate the problem of variance in the value function by effectively assigning credit. First, we discuss the fundamental concepts of signals and reinforcement learning. Then, we introduce Recurrent Learning, which smooths the value function along the trajectory, and analyze its strengths experimentally. Finally, we introduce filters from signal processing as a general framework for various traces in reinforcement learning, and show the effectiveness of filters on a couple of toy examples.
-
Stock Market Prediction Using Optimum Threshold Based Relevance Vector Machines. HS Karthik, Nishanth Anand, J Manikandan. ADCOM 2016.
[PDF]
Machine learning is employed for a myriad of applications ranging from engineering to non-engineering, medicine to finance, and sports to studies. The huge demand for machine learning has spearheaded various techniques such as Hidden Markov Models (HMM), Artificial Neural Networks (ANN), Support Vector Machines (SVM), and Relevance Vector Machines (RVM). It is well reported in the literature that RVM outperforms SVM in terms of sparseness as well as accuracy, and hence it is employed for the proposed work. In this paper, stock market prediction using an optimum-threshold-based RVM is reported and its performance is evaluated using given input parameters for the share market. To assess the performance of the proposed system, datasets from four stock exchanges are considered for evaluation: NASDAQ, the National Stock Exchange (NSE), the New York Stock Exchange (NYSE), and the London Stock Exchange (LSE). It is observed that 19.17 - 83.33% of relevance vectors are pruned using the proposed optimum-threshold-based RVM technique. A user-friendly graphical user interface is also developed for the proposed work, which can easily be extended to various other machine learning applications.
-
SAR image compression using Relevance Vector Machines. Nishanth Anand, J Manikandan. INDICON 2015.
[PDF]
Synthetic Aperture Radar (SAR) images are built on board an aircraft or spacecraft with the help of backscatter and, depending on the system in which they are employed, are displayed in the cockpit, transmitted to a ground station, or stored on on-board storage disks. SAR images carry vital information for a large variety of applications, including automatic target recognition; hence there is a need to compress these images with negligible degradation in image quality. In this paper, a novel attempt is made to compress SAR images using RVM for aerospace and satellite applications. An optimum-threshold-based RVM image compressor is also proposed and its performance is evaluated. To assess the effectiveness of the proposed system, datasets from the USC-SIPI image database are used. It is observed that the images are compressed by 40.36% to 88.53%, with a PSNR ranging from 24.34 dB to 33.81 dB, using the proposed optimum-threshold-based RVM model.
-
Sparse representation using optimum threshold based relevance vector machine. Nishanth Anand, J Manikandan. INDICON 2015.
[PDF]
Sparse representation is a signal processing technique capable of reconstructing an entire signal from relatively few samples. Support vector machines (SVM) and relevance vector machines (RVM) are among the most commonly used sparse representation techniques, where the ability of the model to estimate the output is directly related to its sparsity. It is also reported in the literature that the performance of RVM is superior to that of SVM in terms of accuracy and sparseness. In this paper, an optimum-threshold-based relevance vector machine is proposed for sparse representation. To assess the sparseness of the proposed approach, three signals and datasets from UCI databases are used for sparse approximation with the proposed RVM model, and the results are reported. The performance of the proposed system is assessed using two metrics: relative error and mean square error. It is observed that the number of relevance vectors is pruned by 7.18 - 69.46% using the proposed optimum-threshold-based RVM model for sparse approximation.
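The pruning step shared by the three RVM papers above can be sketched in a few lines. The function below is an illustrative assumption of how threshold-based pruning might work, not the papers' actual optimum-threshold selection procedure (which chooses the threshold rather than fixing it in advance):

```python
import numpy as np

def prune_relevance_vectors(weights, threshold):
    """Drop relevance vectors whose weight magnitude is at or below
    the threshold; return the keep-mask and the percentage pruned."""
    keep = np.abs(weights) > threshold
    pct_pruned = 100.0 * (1.0 - keep.mean())
    return keep, pct_pruned
```

The reported pruning percentages (e.g. 7.18 - 69.46% here) correspond to `pct_pruned`; sparser models trade a larger pruned fraction against approximation error.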