Temporal Difference Learning in Dialogue Systems

Ziyi Zhu / March 08, 2025

8 min read

Reinforcement Learning (RL) has become increasingly important in training conversational AI systems to better align with human preferences. In this blog post, we'll explore how Temporal Difference (TD) learning, a fundamental RL technique, can be applied to improve conversational agents. We'll start with the general case and then examine simplified scenarios with only terminal rewards before extending the framework to Generalized Advantage Estimation (GAE).

Modeling Conversations as RL Problems

In a conversational setting, we can model the interaction as a sequence of states and actions:

\tau = (s_0, a_0, s_1, a_1, \ldots)

where states s_i represent the conversation history up to the end of each user message, and actions a_i represent the assistant's responses. This sequential nature makes RL a natural framework for optimization.
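To make this concrete, here is a minimal sketch of how such a trajectory might be represented in code. The `Turn` and `Trajectory` names and the example conversation are illustrative, not from any particular library.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    """One (state, action) pair: the history the assistant saw and the reply it gave."""
    state: str   # conversation history up to the end of the latest user message (s_i)
    action: str  # the assistant's response (a_i)

@dataclass
class Trajectory:
    """A full conversation tau = (s_0, a_0, s_1, a_1, ...)."""
    turns: list[Turn] = field(default_factory=list)
    rewards: list[float] = field(default_factory=list)  # one reward per turn; often all zero except the last

# A two-turn customer-service exchange with a single terminal reward.
traj = Trajectory(
    turns=[
        Turn(state="User: My order hasn't arrived.",
             action="Assistant: Let me look that up for you."),
        Turn(state="(full history)\nUser: It's order 1234.",
             action="Assistant: Thanks! It is out for delivery and should arrive tomorrow."),
    ],
    rewards=[0.0, 1.0],
)
print(len(traj.turns), traj.rewards)
```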

Value Functions and Returns

In RL, we aim to maximize expected return, which is the sum of rewards received over time. Two common formulations are:

  1. Finite-horizon undiscounted return: Sum of rewards within a fixed window (e.g., a single conversation session):

    R(\tau)=\sum^{T}_{t=0} r_t
  2. Infinite-horizon discounted return: Sum of all rewards, with future rewards discounted:

    R(\tau)=\sum^{\infty}_{t=0} \gamma^t r_t

    where \gamma \in (0,1) is the discount factor. Both returns are computed in the short sketch after this list.
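Here is that small numeric sketch of the two return definitions; the function names and reward values are made up for illustration.

```python
def finite_horizon_return(rewards):
    """R(tau) = sum_{t=0}^{T} r_t (undiscounted, fixed window)."""
    return sum(rewards)

def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t r_t (infinite-horizon form, truncated at the data we have)."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0]                     # e.g. terminal-only feedback on a 3-turn conversation
print(finite_horizon_return(rewards))         # 1.0
print(discounted_return(rewards, gamma=0.9))  # 0.81 = 0.9^2 * 1.0
```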

For each state, we define the state value function as the expected return starting from state s and following policy \pi:

V^\pi(s) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \vert s_0=s\right]

Similarly, the action-value function gives the expected return when starting in state s, taking action a, and then following policy \pi:

Q^\pi(s, a) = \mathbb{E}_{\tau \sim \pi}\left[R(\tau) \vert s_0=s, a_0=a \right]

Bellman Equations

The value functions satisfy recursive relationships known as Bellman equations:

V^\pi(s) = \mathbb{E}_{a \sim \pi, s' \sim P}\left[r(s,a) + \gamma V^\pi(s') \right]
Q^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[ r(s,a) + \gamma \mathbb{E}_{a' \sim \pi} \left[Q^\pi(s', a')\right]\right]

where s' \sim P indicates that the next state is sampled according to the environment dynamics P(s'|s,a).
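To see the recursion in action, here is a tiny self-contained check on a made-up two-state MDP under a fixed policy: we solve the linear Bellman system for V^\pi and confirm that it satisfies the recursion. All numbers are invented for illustration.

```python
import numpy as np

# Toy 2-state MDP under a fixed policy pi (numbers are made up).
# P_pi[s, s'] = probability of moving to s' from s; r_pi[s] = expected reward in s.
P_pi = np.array([[0.8, 0.2],
                 [0.3, 0.7]])
r_pi = np.array([0.0, 1.0])
gamma = 0.9

# V^pi solves the linear Bellman system: V = r_pi + gamma * P_pi @ V
V = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Check the recursion V(s) = E[r + gamma * V(s')] state by state.
print(V)
print(np.allclose(V, r_pi + gamma * P_pi @ V))  # True
```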

Temporal Difference Learning

TD learning is a method for learning value functions that combines elements of Monte Carlo and dynamic programming approaches. The core idea is to update estimates based on the difference between successive predictions.

TD(0): One-Step Updates

The simplest form of TD learning is TD(0), which updates value estimates based on the immediate next state:

V(s_t) \leftarrow V(s_t) + \alpha \left[r_t + \gamma V(s_{t+1}) - V(s_t)\right]

where \alpha is the learning rate and r_t + \gamma V(s_{t+1}) - V(s_t) is called the TD error. This approach bootstraps from the next state's value estimate rather than waiting for the actual return, allowing for online learning.
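A minimal tabular sketch of this update, assuming a dictionary-based value table keyed by conversation state; the helper name and the two transitions are illustrative.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99, terminal=False):
    """One TD(0) step: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    bootstrap = 0.0 if terminal else V.get(s_next, 0.0)
    td_error = r + gamma * bootstrap - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error

V = {}  # value table keyed by conversation state (e.g. a hash of the history)

# Final turn first: the terminal reward creates a non-zero TD error ...
td0_update(V, s="turn_2", r=1.0, s_next=None, terminal=True)
# ... which then propagates backward when the earlier turn bootstraps from V["turn_2"].
td0_update(V, s="turn_1", r=0.0, s_next="turn_2")
print(V)
```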

TD(\lambda): Multi-Step Updates

TD(\lambda) extends this by blending multiple n-step returns, creating a more flexible update mechanism. First, let's define the n-step return:

G_t^{(n)} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots + \gamma^{n-1} r_{t+n-1} + \gamma^n V(s_{t+n})

The \lambda-return is a weighted average of these n-step returns:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

Here, \lambda determines how much weight is given to longer-term returns versus immediate ones. Higher \lambda values emphasize longer-term outcomes, while lower values prioritize immediate estimates.

Derivation of the Finite and Infinite Horizon Components

Let's derive the decomposition of the \lambda-return into finite-horizon and infinite-horizon components. Starting with the definition:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}

For a conversation of length T (ending at time T), we can split this sum into two parts:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) \sum_{n=T-t}^{\infty} \lambda^{n-1} G_t^{(n)}

The first sum involves all n-step returns that terminate before the end of the conversation, while the second sum involves returns that extend to or beyond the terminal state.

For all n \geq T-t, the n-step return G_t^{(n)} involves the actual terminal reward r_T, so they all equal G_t^{(\infty)} (the full return from t to T).

Therefore:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) G_t^{(\infty)} \sum_{n=T-t}^{\infty} \lambda^{n-1}

The second sum is a geometric series starting at \lambda^{T-t-1}:

\sum_{n=T-t}^{\infty} \lambda^{n-1} = \lambda^{T-t-1} + \lambda^{T-t} + \lambda^{T-t+1} + \ldots = \frac{\lambda^{T-t-1}}{1-\lambda}

Substituting this back:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + (1-\lambda) G_t^{(\infty)} \frac{\lambda^{T-t-1}}{1-\lambda}

Simplifying:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} G_t^{(n)} + \lambda^{T-t-1} G_t^{(\infty)}

This elegant decomposition shows how TD(\lambda) blends between immediate value estimates and the actual return. When \lambda = 0, we rely only on the one-step estimate; when \lambda = 1, we use the full return from the episode.

The value function update is then:

V(s_t) \leftarrow V(s_t) + \alpha \left[G_t^{\lambda} - V(s_t)\right]
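As a sanity check on the derivation, the sketch below computes G_t^\lambda both from the finite decomposition and from the (truncated) raw definition, then applies the value update. All helper names and numbers are made up for illustration.

```python
def n_step_return(r, V, t, n, gamma, T):
    """G_t^(n): n rewards plus a bootstrapped value; at or beyond the end it is the full return."""
    if n >= T - t:  # reaches the terminal reward, so it equals G_t^(inf)
        return sum(gamma ** k * r[t + k] for k in range(T - t + 1))
    return sum(gamma ** k * r[t + k] for k in range(n)) + gamma ** n * V[t + n]

def lambda_return_decomposed(r, V, t, lam, gamma, T):
    """G_t^lambda via the finite decomposition derived above."""
    finite = (1 - lam) * sum(lam ** (n - 1) * n_step_return(r, V, t, n, gamma, T)
                             for n in range(1, T - t))
    return finite + lam ** (T - t - 1) * n_step_return(r, V, t, T - t, gamma, T)

def lambda_return_definition(r, V, t, lam, gamma, T, n_max=500):
    """G_t^lambda from the raw weighted average, truncated at a large n_max."""
    return (1 - lam) * sum(lam ** (n - 1) * n_step_return(r, V, t, n, gamma, T)
                           for n in range(1, n_max + 1))

T = 4                            # the conversation ends at time T
r = [0.0, 0.0, 0.0, 0.0, 1.0]    # rewards r_0 .. r_T (terminal-only here)
V = [0.2, 0.4, 0.5, 0.8]         # current value estimates for s_0 .. s_{T-1}

g = lambda_return_decomposed(r, V, t=1, lam=0.9, gamma=1.0, T=T)
print(g, lambda_return_definition(r, V, t=1, lam=0.9, gamma=1.0, T=T))  # both roughly 0.932

# TD(lambda)-style update: V(s_t) <- V(s_t) + alpha * (G_t^lambda - V(s_t))
alpha = 0.1
V[1] = V[1] + alpha * (g - V[1])
print(V[1])
```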

The Case of Terminal-Only Rewards

In many conversational AI applications, rewards are only provided at the end of a conversation (e.g., user satisfaction ratings, task completion success, or feedback scores). This simplifies our equations considerably and makes TD learning particularly relevant.

When r_t = 0 for all t < T and r_T is the terminal reward:

  1. The finite-horizon return becomes simply R(\tau) = r_T
  2. The value function at any state should equal the expected terminal reward: V^\pi(s) = \mathbb{E}[r_T \vert s]

This is a common scenario in conversational AI, where we often lack immediate feedback after each turn but receive user ratings or other metrics at the conversation's end.

In this case, the n-step returns simplify to:

G_t^{(n)} = \begin{cases} \gamma^n V(s_{t+n}) & \text{if } n < T-t \\ \gamma^{T-t} r_T & \text{otherwise} \end{cases}

And the \lambda-return becomes:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} \gamma^n V(s_{t+n}) + \lambda^{T-t-1} \gamma^{T-t} r_T

For the common case where \gamma = 1 (no discounting within a session), this further simplifies to:

G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} V(s_{t+n}) + \lambda^{T-t-1} r_T

At the extremes:

  • When \lambda = 1, the return is simply the terminal reward: G_t^{1} = r_T
  • When \lambda = 0, the return is just the next state's value: G_t^{0} = V(s_{t+1})

This formulation allows us to propagate the terminal reward signal backward through the conversation, with the \lambda parameter controlling how much we trust our value estimates versus waiting for the actual outcome.
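Here is a small sketch of the terminal-only, \gamma = 1 formula, showing the two extremes above; the helper name and values are illustrative.

```python
def terminal_only_lambda_return(V, r_T, t, lam, T):
    """G_t^lambda for terminal-only rewards with gamma = 1:
    (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * V(s_{t+n}) + lam^(T-t-1) * r_T
    """
    bootstrap = (1 - lam) * sum(lam ** (n - 1) * V[t + n] for n in range(1, T - t))
    return bootstrap + lam ** (T - t - 1) * r_T

V = [0.3, 0.5, 0.6, 0.9]   # value estimates for s_0 .. s_{T-1}
T, r_T = 4, 1.0            # conversation ends at T with a single terminal reward

print(terminal_only_lambda_return(V, r_T, t=1, lam=0.0, T=T))  # == V[2]  (pure bootstrap)
print(terminal_only_lambda_return(V, r_T, t=1, lam=1.0, T=T))  # == r_T   (pure Monte Carlo)
print(terminal_only_lambda_return(V, r_T, t=1, lam=0.7, T=T))  # a blend of the two
```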

Advantage Functions

While value functions tell us how good a state is, advantage functions tell us how much better a specific action is compared to the average action in that state:

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

Given the relationship between value and action-value functions:

V^\pi(s) = \mathbb{E}_{a \sim \pi}\left[Q^\pi(s, a)\right]

We can rewrite the advantage as:

A^\pi(s, a) = \mathbb{E}_{s' \sim P}\left[r(s,a) + \gamma V^\pi(s')\right] - V^\pi(s)

In conversational AI, advantage functions help us understand which responses are better than the average response given the current conversation state.
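In practice we rarely have Q^\pi directly, so a single observed transition gives a one-sample estimate of the advantage via the identity above. A minimal sketch, with illustrative names and numbers:

```python
def one_sample_advantage(r, v_s, v_s_next, gamma=1.0, terminal=False):
    """A(s, a) ~ r(s, a) + gamma * V(s') - V(s), estimated from a single transition."""
    bootstrap = 0.0 if terminal else v_s_next
    return r + gamma * bootstrap - v_s

# Did this assistant response raise or lower the predicted conversation outcome?
print(one_sample_advantage(r=0.0, v_s=0.55, v_s_next=0.70))  # roughly +0.15: better than average
print(one_sample_advantage(r=0.0, v_s=0.55, v_s_next=0.40))  # roughly -0.15: worse than average
```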

Generalized Advantage Estimation (GAE)

GAE, introduced by Schulman et al., extends TD(\lambda) to advantage functions, providing a more stable estimate for policy optimization. The GAE parameter \lambda \in [0,1] controls the trade-off between bias and variance.

The k-step advantage estimator is defined as:

\hat{A}_t^{(k)} = \sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}) - V(s_t)

GAE is then a weighted sum of these k-step estimators:

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{k=1}^{\infty} (\gamma \lambda)^{k-1} \delta_{t+k-1}^V

where \delta_t^V = r_t + \gamma V(s_{t+1}) - V(s_t) is the TD error.

This can also be written recursively:

\hat{A}_t^{GAE(\gamma, \lambda)} = \delta_t^V + \gamma \lambda \hat{A}_{t+1}^{GAE(\gamma, \lambda)}

For the special case of terminal-only rewards, GAE simplifies to:

For the special case of terminal-only rewards, the TD errors contain no reward term except at the final step, and GAE simplifies to:

\hat{A}_t^{GAE(\gamma, \lambda)} = \sum_{k=1}^{T-t-1} (\gamma \lambda)^{k-1} \left[\gamma V(s_{t+k}) - V(s_{t+k-1})\right] + (\gamma \lambda)^{T-t-1} \left[\gamma r_T - V(s_{T-1})\right]

With \gamma = 1, this becomes:

\hat{A}_t^{GAE(1, \lambda)} = \sum_{k=1}^{T-t-1} \lambda^{k-1} \left[V(s_{t+k}) - V(s_{t+k-1})\right] + \lambda^{T-t-1} \left[r_T - V(s_{T-1})\right]

which, after regrouping the value terms, is exactly the terminal-only \lambda-return from above minus the baseline: G_t^{\lambda} - V(s_t).

GAE is particularly useful in conversational AI because it provides a balanced estimate of how good an assistant's response was, accounting for both immediate user reactions and long-term conversation outcomes.
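The recursive form translates directly into a backward pass over the conversation. Below is a minimal sketch (names are mine, not reference code from the GAE paper); it attaches the terminal reward to the final turn and treats the value after the final turn as zero.

```python
def gae_advantages(rewards, values, gamma=1.0, lam=0.95):
    """Backward recursion A_t = delta_t + gamma * lam * A_{t+1},
    with delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `rewards` has one entry per assistant turn; `values` has one extra entry
    for the state after the final turn (0.0 once the conversation has ended)."""
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Terminal-only reward: zero everywhere except the final turn.
rewards = [0.0, 0.0, 1.0]
values = [0.4, 0.5, 0.7, 0.0]  # V(s_0), V(s_1), V(s_2), and 0 after the terminal turn
print(gae_advantages(rewards, values, gamma=1.0, lam=0.95))
print(gae_advantages(rewards, values, gamma=1.0, lam=1.0))  # reduces to r_T - V(s_t) per turn
```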

Practical Implementation for Conversational AI

In practical implementations, we often use neural networks to parameterize the value function V_\theta(s) and the policy \pi_\phi(a \vert s). The training objective for the value function using TD(\lambda) can be formulated as:

L_V(\theta) = \frac{1}{2} \mathbb{E}_t \left[\left(G_t^{\lambda} - V_\theta(s_t)\right)^2\right]

For binary outcomes (like user satisfaction or task completion), we can use a sigmoid-transformed value prediction:

V_\theta(s_t) = \sigma(v_t)

where v_t is the raw output of the value network and \sigma is the sigmoid function.

The corresponding loss function then becomes:

J_t^{\lambda} = -\left[ G_t^{\lambda} \log \sigma(v_t) + \left(1-G_t^{\lambda}\right) \log\left(1-\sigma(v_t)\right) \right]

This binary cross-entropy loss is particularly appropriate for conversational AI applications where outcomes are often binary (success/failure) or can be binarized (satisfaction above/below threshold).
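A short sketch of this loss using PyTorch, assuming the value head outputs a raw logit v_t per turn; `binary_cross_entropy_with_logits` applies the sigmoid internally, and the tensors here are toy values.

```python
import torch
import torch.nn.functional as F

def value_loss(value_logits, lambda_returns):
    """Binary cross-entropy between sigma(v_t) and lambda-return targets in [0, 1].
    The sigmoid is applied inside binary_cross_entropy_with_logits."""
    return F.binary_cross_entropy_with_logits(value_logits, lambda_returns)

# Toy batch: raw value-head outputs v_t for three turns and their G_t^lambda targets.
value_logits = torch.tensor([0.2, -0.5, 1.3])
targets = torch.tensor([0.9, 0.4, 1.0])  # e.g. from the terminal-only formula above
print(value_loss(value_logits, targets).item())
```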

Example: Applying TD Learning to Conversation Quality

Consider a conversational AI trained to help users with customer service tasks. At the end of each conversation, users provide a satisfaction rating (1-5 stars). We can:

  1. Transform this into a binary outcome (4-5 stars = success, 1-3 stars = failure)
  2. Train a value function to predict this outcome at each turn
  3. Use TD(\lambda) to propagate the final reward through the conversation
  4. Train our policy to maximize predicted advantage

This approach helps the assistant learn which conversation patterns lead to positive outcomes, even when feedback is delayed until the conversation's end.
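To tie the steps above together, here is a rough end-to-end sketch for a single conversation, covering steps 1-3 with \gamma = 1 and terminal-only rewards; every name and number is illustrative, and step 4 (policy optimization) is only indicated in a comment.

```python
def binarize_rating(stars):
    """Step 1: 4-5 stars -> 1.0 (success), 1-3 stars -> 0.0 (failure)."""
    return 1.0 if stars >= 4 else 0.0

def lambda_targets(values, r_T, lam=0.9):
    """Step 3: terminal-only TD(lambda) targets G_t^lambda for every turn (gamma = 1)."""
    T = len(values)
    targets = []
    for t in range(T):
        boot = (1 - lam) * sum(lam ** (n - 1) * values[t + n] for n in range(1, T - t))
        targets.append(boot + lam ** (T - t - 1) * r_T)
    return targets

# One conversation: per-turn value predictions from the value network and a 5-star rating.
values = [0.35, 0.50, 0.65, 0.80]               # V(s_0) .. V(s_{T-1})
r_T = binarize_rating(5)                        # step 1
targets = lambda_targets(values, r_T, lam=0.9)  # steps 2-3: training targets per turn
print(targets)
# Step 4 would compute advantages (e.g. with GAE) and update the policy to prefer
# responses with positive advantage.
```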