Temporal Difference Learning in Dialogue Systems

Ziyi Zhu / March 08, 2025

8 min read

Reinforcement Learning (RL) has become increasingly important in training conversational AI systems to better align with human preferences. In this blog post, we'll explore how Temporal Difference (TD) learning, a fundamental RL technique, can be applied to improve conversational agents. We'll start with the general case and then examine simplified scenarios with only terminal rewards before extending the framework to Generalized Advantage Estimation (GAE).

Modeling Conversations as RL Problems

In a conversational setting, we can model the interaction as a sequence of states and actions:

\tau = (s_0, a_0, s_1, a_1, \ldots)

where states s_i represent the conversation history up to the end of each user message, and actions a_i represent the assistant's responses. This sequential nature makes RL a natural framework for optimization.
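To make this concrete, here is a minimal sketch of how such a trajectory might be represented in code. The `Turn` and `Trajectory` names and the example conversation are illustrative, not from any particular library.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One (state, action) pair: the history the assistant saw and the reply it gave."""
    state: str   # conversation history up to the end of the latest user message (s_i)
    action: str  # the assistant's response (a_i)

@dataclass
class Trajectory:
    """A full conversation tau = (s_0, a_0, s_1, a_1, ...)."""
    turns: list[Turn] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)  # one reward per turn; often all zero except the last

# A two-turn customer-service exchange with a single terminal reward.
traj = Trajectory(
    turns=[
        Turn(state="User: My order hasn't arrived.",
             action="Assistant: Let me look that up for you."),
        Turn(state="(full history)\nUser: It's order 1234.",
             action="Assistant: Thanks! It is out for delivery and should arrive tomorrow."),
    ],
    rewards=[0.0, 1.0],
)
print(len(traj.turns), traj.rewards)
```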

Value Functions and Returns

In RL, we aim to maximize expected return, which is the sum of rewards received over time. Two common formulations are:

  1. Finite-horizon undiscounted return: Sum of rewards within a fixed window (e.g., a single conversation session):

    R(\tau)=\sum^{T}_{t=0} r_t
  2. Infinite-horizon discounted return: Sum of all rewards, with future rewards discounted:

    R(\tau)=\sum^{\infty}_{t=0} \gamma^t r_t

    where \gamma \in (0,1) is the discount factor. Both returns are computed in the short sketch after this list.
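Here is that small numeric sketch of the two return definitions; the function names and reward values are made up for illustration.

```python
def finite_horizon_return(rewards):
    """R(tau) = sum_{t=0}^{T} r_t (undiscounted, fixed window)."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t r_t (infinite-horizon form, truncated at the data we have)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]                     # e.g. terminal-only feedback on a 3-turn conversation
print(finite_horizon_return(rewards))         # 1.0
print(discounted_return(rewards, gamma=0.9))  # 0.81 = 0.9^2 * 1.0
```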

For each state, we define the state value function as the expected return starting from state s and following policy \pi:

V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \vert s_0=s\right]

Similarly, the action-value function gives the expected return when starting in state s, taking action a, and then following policy \pi:

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \vert s_0=s, a_0=a \right]

Bellman Equations

The value functions satisfy recursive relationships known as Bellman equations:

V^\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a) + \gamma V^\pi(s') \right]
Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s,a) + \gamma \mathbb{E}_{a' \sim \pi} \left[Q^\pi(s', a')\right]\right]

where s' \sim P indicates that the next state is sampled according to the environment dynamics P(s'|s,a).
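To see the recursion in action, here is a tiny self-contained check on a made-up two-state MDP under a fixed policy: we solve the linear Bellman system for V^\pi and confirm that it satisfies the recursion. All numbers are invented for illustration.

```python
import numpy as np

# Toy 2-state MDP under a fixed policy pi (numbers are made up).
# P_pi[s, s'] = probability of moving to s' from s; r_pi[s] = expected reward in s.
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([0.0, 1.0])
gamma = 0.9

# V^pi solves the linear Bellman system: V = r_pi + gamma * P_pi @ V
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Check the recursion V(s) = E[r + gamma * V(s')] state by state.
print(V)
print(np.allclose(V, r_pi + gamma * P_pi @ V))  # True
```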

Temporal Difference Learning

TD learning is a method for learning value functions that combines elements of Monte Carlo and dynamic programming approaches. The core idea is to update estimates based on the difference between successive predictions.

TD(0): One-Step Updates

The simplest form of TD learning is TD(0), which updates value estimates based on the immediate next state:

V(s_t) \leftarrow V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]

where \alpha is the learning rate and r_t + \gamma V(s_{t+1}) - V(s_t) is called the TD error. This approach bootstraps from the next state's value estimate rather than waiting for the actual return, allowing for online learning.
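A minimal tabular sketch of this update, assuming a dictionary-based value table keyed by conversation state; the helper name and the two transitions are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    bootstrap = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * bootstrap - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}  # value table keyed by conversation state (e.g. a hash of the history)

# Final turn first: the terminal reward creates a non-zero TD error ...
td0_update(V, s="turn_2", r=1.0, s_next=None, terminal=True)
# ... which then propagates backward when the earlier turn bootstraps from V["turn_2"].
td0_update(V, s="turn_1", r=0.0, s_next="turn_2")
print(V)
```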

TD(\lambda): Multi-Step Updates

TD(\lambda) extends this by blending multiple n-step returns, creating a more flexible update mechanism. First, let's define the n-step return:

G_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})

The \lambda-return is a weighted average of these n-step returns:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

Here, \lambda determines how much weight is given to longer-term returns versus immediate ones. Higher \lambda values emphasize longer-term outcomes, while lower values prioritize immediate estimates.

Derivation of the Finite and Infinite Horizon Components

Let's derive the decomposition of the \lambda-return into finite-horizon and infinite-horizon components. Starting with the definition:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

For a conversation of length T (ending at time T), we can split this sum into two parts:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) \sum_{n=T-t}^{\infty} \lambda^{n-1} G_t^{(n)}

The first sum involves all n-step returns that terminate before the end of the conversation, while the second sum involves returns that extend to or beyond the terminal state.

For all n \geq T-t, the n-step return G_t^{(n)} involves the actual terminal reward r_T, so they all equal G_t^{(\infty)} (the full return from t to T).

Therefore:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) G_t^{(\infty)} \sum_{n=T-t}^{\infty} \lambda^{n-1}

The second sum is a geometric series starting at \lambda^{T-t-1}:

\sum_{n=T-t}^{\infty} \lambda^{n-1} = \lambda^{T-t-1} + \lambda^{T-t} + \lambda^{T-t+1} + \ldots = \frac{\lambda^{T-t-1}}{1-\lambda}

Substituting this back:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) G_t^{(\infty)} \frac{\lambda^{T-t-1}}{1-\lambda}

Simplifying:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t^{(\infty)}

This elegant decomposition shows how TD(\lambda) blends between immediate value estimates and the actual return. When \lambda = 0, we rely only on the one-step estimate; when \lambda = 1, we use the full return from the episode.

The value function update is then:

V(s_t) \leftarrow V(s_t) + \alpha \left[G_t^{\lambda} - V(s_t)\right]
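As a sanity check on the derivation, the sketch below computes G_t^\lambda both from the finite decomposition and from the (truncated) raw definition, then applies the value update. All helper names and numbers are made up for illustration.

```python
def n_step_return(r, V, t, n, gamma, T):
    """G_t^(n): n rewards plus a bootstrapped value; at or beyond the end it is the full return."""
    if n >= T - t:  # reaches the terminal reward, so it equals G_t^(inf)
        return sum(gamma ** k * r[t + k] for k in range(T - t + 1))
    return sum(gamma ** k * r[t + k] for k in range(n)) + gamma ** n * V[t + n]

def lambda_return_decomposed(r, V, t, lam, gamma, T):
    """G_t^lambda via the finite decomposition derived above."""
    finite = (1 - lam) * sum(lam ** (n - 1) * n_step_return(r, V, t, n, gamma, T)
                             for n in range(1, T - t))
    return finite + lam ** (T - t - 1) * n_step_return(r, V, t, T - t, gamma, T)

def lambda_return_definition(r, V, t, lam, gamma, T, n_max=500):
    """G_t^lambda from the raw weighted average, truncated at a large n_max."""
    return (1 - lam) * sum(lam ** (n - 1) * n_step_return(r, V, t, n, gamma, T)
                           for n in range(1, n_max + 1))

T = 4                            # the conversation ends at time T
r = [0.0, 0.0, 0.0, 0.0, 1.0]    # rewards r_0 .. r_T (terminal-only here)
V = [0.2, 0.4, 0.5, 0.8]         # current value estimates for s_0 .. s_{T-1}

g = lambda_return_decomposed(r, V, t=1, lam=0.9, gamma=1.0, T=T)
print(g, lambda_return_definition(r, V, t=1, lam=0.9, gamma=1.0, T=T))  # both roughly 0.932

# TD(lambda)-style update: V(s_t) <- V(s_t) + alpha * (G_t^lambda - V(s_t))
alpha = 0.1
V[1] = V[1] + alpha * (g - V[1])
print(V[1])
```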

The Case of Terminal-Only Rewards

In many conversational AI applications, rewards are only provided at the end of a conversation (e.g., user satisfaction ratings, task completion success, or feedback scores). This simplifies our equations considerably and makes TD learning particularly relevant.

When r_t = 0 for all t < T and r_T is the terminal reward:

  1. The finite-horizon return becomes simply R(\tau) = r_T
  2. The value function at any state should equal the expected terminal reward: V^\pi(s) = \mathbb{E}[r_T \vert s]

This is a common scenario in conversational AI, where we often lack immediate feedback after each turn but receive user ratings or other metrics at the conversation's end.

In this case, the n-step returns simplify to:

G_t^{(n)} = \begin{cases} \gamma^n V(s_{t+n}) & \text{if } n < T-t \\ \gamma^{T-t} r_T & \text{otherwise} \end{cases}

And the \lambda-return becomes:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} \gamma^n V(s_{t+n}) + \lambda^{T-t-1} \gamma^{T-t} r_T

For the common case where \gamma = 1 (no discounting within a session), this further simplifies to:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} V(s_{t+n}) + \lambda^{T-t-1} r_T

At the extremes:

  • When \lambda = 1, the return is simply the terminal reward: G_t^{1} = r_T
  • When \lambda = 0, the return is just the next state's value: G_t^{0} = V(s_{t+1})

This formulation allows us to propagate the terminal reward signal backward through the conversation, with the \lambda parameter controlling how much we trust our value estimates versus waiting for the actual outcome.
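Here is a small sketch of the terminal-only, \gamma = 1 formula, showing the two extremes above; the helper name and values are illustrative.

```python
def terminal_only_lambda_return(V, r_T, t, lam, T):
    """G_t^lambda for terminal-only rewards with gamma = 1:
    (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * V(s_{t+n}) + lam^(T-t-1) * r_T
    """
    bootstrap = (1 - lam) * sum(lam ** (n - 1) * V[t + n] for n in range(1, T - t))
    return bootstrap + lam ** (T - t - 1) * r_T

V = [0.3, 0.5, 0.6, 0.9]   # value estimates for s_0 .. s_{T-1}
T, r_T = 4, 1.0            # conversation ends at T with a single terminal reward

print(terminal_only_lambda_return(V, r_T, t=1, lam=0.0, T=T))  # == V[2]  (pure bootstrap)
print(terminal_only_lambda_return(V, r_T, t=1, lam=1.0, T=T))  # == r_T   (pure Monte Carlo)
print(terminal_only_lambda_return(V, r_T, t=1, lam=0.7, T=T))  # a blend of the two
```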

Advantage Functions

While value functions tell us how good a state is, advantage functions tell us how much better a specific action is compared to the average action in that state:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

Given the relationship between value and action-value functions:

V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a)\right]

We can rewrite the advantage as:

A^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right] - V^\pi(s)

In conversational AI, advantage functions help us understand which responses are better than the average response given the current conversation state.
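In practice we rarely have Q^\pi directly, so a single observed transition gives a one-sample estimate of the advantage via the identity above. A minimal sketch, with illustrative names and numbers:

```python
def one_sample_advantage(r, v_s, v_s_next, gamma=1.0, terminal=False):
    """A(s, a) ~ r(s, a) + gamma * V(s') - V(s), estimated from a single transition."""
    bootstrap = 0.0 if terminal else v_s_next
    return r + gamma * bootstrap - v_s

# Did this assistant response raise or lower the predicted conversation outcome?
print(one_sample_advantage(r=0.0, v_s=0.55, v_s_next=0.70))  # roughly +0.15: better than average
print(one_sample_advantage(r=0.0, v_s=0.55, v_s_next=0.40))  # roughly -0.15: worse than average
```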

Generalized Advantage Estimation (GAE)

GAE, introduced by Schulman et al., extends TD(\lambda) to advantage functions, providing a more stable estimate for policy optimization. The GAE parameter \lambda \in [0,1] controls the trade-off between bias and variance.

The k-step advantage estimator is defined as:

\hat{A}_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)

GAE is then a weighted sum of these k-step estimators:

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{k=1}^{\infty} (\gamma \lambda)^{k-1} \delta_{t+k-1}^V

where \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) is the TD error.

This can also be written recursively:

\hat{A}_t^{GAE(\gamma, \lambda)} = \delta_t^V + \gamma \lambda \hat{A}_{t+1}^{GAE(\gamma, \lambda)}

For the special case of terminal-only rewards, GAE simplifies to:

For the special case of terminal-only rewards, the TD errors contain no reward term except at the final step, and GAE simplifies to:

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{k=1}^{T-t-1} (\gamma \lambda)^{k-1} \left[\gamma V(s_{t+k}) - V(s_{t+k-1})\right] + (\gamma \lambda)^{T-t-1} \left[\gamma r_T - V(s_{T-1})\right]

With \gamma = 1, this becomes:

\hat{A}_t^{GAE(1, \lambda)} = \sum_{k=1}^{T-t-1} \lambda^{k-1} \left[V(s_{t+k}) - V(s_{t+k-1})\right] + \lambda^{T-t-1} \left[r_T - V(s_{T-1})\right]

which, after regrouping the value terms, is exactly the terminal-only \lambda-return from above minus the baseline: G_t^{\lambda} - V(s_t).

GAE is particularly useful in conversational AI because it provides a balanced estimate of how good an assistant's response was, accounting for both immediate user reactions and long-term conversation outcomes.
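The recursive form translates directly into a backward pass over the conversation. Below is a minimal sketch (names are mine, not reference code from the GAE paper); it attaches the terminal reward to the final turn and treats the value after the final turn as zero.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `rewards` has one entry per assistant turn; `values` has one extra entry
    for the state after the final turn (0.0 once the conversation has ended)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Terminal-only reward: zero everywhere except the final turn.
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.0]  # V(s_0), V(s_1), V(s_2), and 0 after the terminal turn
print(gae_advantages(rewards, values, gamma=1.0, lam=0.95))
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # reduces to r_T - V(s_t) per turn
```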

Practical Implementation for Conversational AI

In practical implementations, we often use neural networks to parameterize the value function V_\theta(s) and the policy \pi_\phi(a \vert s). The training objective for the value function using TD(\lambda) can be formulated as:

L_V(\theta) = \frac{1}{2} \mathbb{E}_t \left[\left(G_t^{\lambda} - V_\theta(s_t)\right)^2\right]

For binary outcomes (like user satisfaction or task completion), we can use a sigmoid-transformed value prediction:

V_\theta(s_t) = \sigma(v_t)

where v_t is the raw output of the value network and \sigma is the sigmoid function.

The corresponding loss function then becomes:

J_t^{\lambda} = -\left[ G_t^{\lambda} \log \sigma(v_t) + \left(1-G_t^{\lambda}\right) \log\left(1-\sigma(v_t)\right) \right]

This binary cross-entropy loss is particularly appropriate for conversational AI applications where outcomes are often binary (success/failure) or can be binarized (satisfaction above/below threshold).
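A short sketch of this loss using PyTorch, assuming the value head outputs a raw logit v_t per turn; `binary_cross_entropy_with_logits` applies the sigmoid internally, and the tensors here are toy values.

```python
import torch
import torch.nn.functional as F

def value_loss(value_logits, lambda_returns):
    """Binary cross-entropy between sigma(v_t) and lambda-return targets in [0, 1].
    The sigmoid is applied inside binary_cross_entropy_with_logits."""
    return F.binary_cross_entropy_with_logits(value_logits, lambda_returns)

# Toy batch: raw value-head outputs v_t for three turns and their G_t^lambda targets.
value_logits = torch.tensor([0.2, -0.5, 1.3])
targets = torch.tensor([0.9, 0.4, 1.0])  # e.g. from the terminal-only formula above
print(value_loss(value_logits, targets).item())
```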

Example: Applying TD Learning to Conversation Quality

Consider a conversational AI trained to help users with customer service tasks. At the end of each conversation, users provide a satisfaction rating (1-5 stars). We can:

  1. Transform this into a binary outcome (4-5 stars = success, 1-3 stars = failure)
  2. Train a value function to predict this outcome at each turn
  3. Use TD(\lambda) to propagate the final reward through the conversation
  4. Train our policy to maximize predicted advantage

This approach helps the assistant learn which conversation patterns lead to positive outcomes, even when feedback is delayed until the conversation's end.
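To tie the steps above together, here is a rough end-to-end sketch for a single conversation, covering steps 1-3 with \gamma = 1 and terminal-only rewards; every name and number is illustrative, and step 4 (policy optimization) is only indicated in a comment.

```python
def binarize_rating(stars):
    """Step 1: 4-5 stars -> 1.0 (success), 1-3 stars -> 0.0 (failure)."""
    return 1.0 if stars >= 4 else 0.0

def lambda_targets(values, r_T, lam=0.9):
    """Step 3: terminal-only TD(lambda) targets G_t^lambda for every turn (gamma = 1)."""
    T = len(values)
    targets = []
    for t in range(T):
        boot = (1 - lam) * sum(lam ** (n - 1) * values[t + n] for n in range(1, T - t))
        targets.append(boot + lam ** (T - t - 1) * r_T)
    return targets

# One conversation: per-turn value predictions from the value network and a 5-star rating.
values = [0.35, 0.50, 0.65, 0.80]               # V(s_0) .. V(s_{T-1})
r_T = binarize_rating(5)                        # step 1
targets = lambda_targets(values, r_T, lam=0.9)  # steps 2-3: training targets per turn
print(targets)
# Step 4 would compute advantages (e.g. with GAE) and update the policy to prefer
# responses with positive advantage.
```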