Dynamic Programming for Deterministic Discrete-Time Systems with Uncertain Gain

We generalise the optimisation technique of dynamic programming for discrete-time systems with an uncertain gain function. We assume that uncertainty about the gain function is described by an imprecise probability model, which generalises the well-known Bayesian, or precise, models. We compare various optimality criteria that can be associated with such a model, and which coincide in the precise case: maximality, robust optimality and maximinity. We show that (only) for the first two an optimal feedback can be constructed by solving a Bellman-like equation.


Introduction to the Problem
The main objective in optimal control is to find out how a system can be influenced, or controlled, in such a way that its behaviour satisfies certain requirements, while at the same time maximising a given gain function. A very efficient method for solving optimal control problems for discrete-time systems is the recursive dynamic programming technique, introduced by Richard Bellman [1].
To explain the ideas behind it, we refer to Figures 1 and 2. In Figure 1 we depict a situation where a system can go from state $a$ to state $c$ through state $b$ in three ways: following the paths $\alpha\beta$, $\alpha\gamma$ and $\alpha\delta$. We denote the gains associated with these paths by $J_{\alpha\beta}$, $J_{\alpha\gamma}$ and $J_{\alpha\delta}$ respectively. Assume that path $\alpha\gamma$ is optimal, meaning that $J_{\alpha\gamma} > J_{\alpha\beta}$ and $J_{\alpha\gamma} > J_{\alpha\delta}$. Then it follows that path $\gamma$ is the optimal way to go from $b$ to $c$. To see this, observe that $J_{\alpha\nu} = J_\alpha + J_\nu$ for $\nu \in \{\beta, \gamma, \delta\}$ (we shall assume throughout that gains are additive along paths) and derive from the inequalities above that $J_\gamma > J_\beta$ and $J_\gamma > J_\delta$. This simple observation, which Bellman called the principle of optimality, forms the basis for the recursive technique of dynamic programming for solving an optimal control problem.
To see how this is done in principle, consider the situation depicted in Figure 2. Suppose we want to find the optimal way to go from state $a$ to state $e$. After one time step, we can reach the states $b$, $c$ and $d$ from state $a$, and the optimal paths from these states to the final state $e$ are known to be $\alpha$, $\gamma$ and $\eta$, respectively. To find the optimal path from $a$ to $e$, we only need to compare the gains $J_\lambda + J_\alpha$, $J_\mu + J_\gamma$ and $J_\nu + J_\eta$ of the respective candidate optimal paths $\lambda\alpha$, $\mu\gamma$ and $\nu\eta$, since the principle of optimality tells us that the paths $\lambda\beta$, $\nu\delta$ and $\nu\epsilon$ cannot be optimal: if they were, then so would be the paths $\beta$, $\delta$ and $\epsilon$. This, written down in a more formal language, is what is essentially known as Bellman's equation. It allows us to solve an optimal control problem fairly efficiently through a recursive procedure, by calculating optimal paths backwards from the final state.
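The backward reasoning described above can be sketched in a few lines of Python. The graph, its node names and the numerical gains below are hypothetical (they are not taken from Figures 1 and 2); the sketch only assumes that gains are additive along paths and that every non-final node has at least one successor.

```python
# Hypothetical stage gains on a small layered graph (illustration only).
EDGES = {
    "a": {"b": 2.0, "c": 1.0, "d": 3.0},
    "b": {"e": 4.0},
    "c": {"e": 6.0},
    "d": {"e": 2.5},
    "e": {},
}


def best_path(edges, start, goal):
    """Return (gain, path) of a gain-maximising path from start to goal."""
    memo = {}

    def solve(node):
        if node == goal:
            return 0.0, [goal]
        if node not in memo:
            candidates = []
            for successor, stage_gain in edges[node].items():
                # Principle of optimality: only the best tail from each
                # successor has to be considered.
                tail_gain, tail = solve(successor)
                candidates.append((stage_gain + tail_gain, [node] + tail))
            memo[node] = max(candidates, key=lambda c: c[0])
        return memo[node]

    return solve(start)


gain, path = best_path(EDGES, "a", "e")
print(gain, path)
```

Each node is solved once, and the value of a node is obtained from the values of its successors, which is the recursive structure that Bellman's equation expresses.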
In applications, it may happen that the gain function, which associates a gain with every possible control action and the resulting behaviour of the system, is not well known. This problem is most often treated by modelling the uncertainty about the gain by means of a probability measure, and by maximising the expected gain under this probability measure. Due to the linearity of the expectation operator, this approach does not change the nature of the optimisation problem in any essential way, and the usual dynamic programming method can therefore still be applied.
As an example, consider the simple linear system described by $x_{k+1} = a x_k + b u_k$ (Eq. (1)), where $x_k \in \mathbb{R}$ denotes the system state and $u_k \in \mathbb{R}$ the control at time $k$, and where $a$ and $b$ are non-zero real numbers. Given an initial state $x_0$ and a sequence $u_\bullet$ of successive controls $u_0, u_1, \ldots, u_{N-1}$, the system goes through the successive states $x_1, x_2, \ldots, x_N$ determined by Eq. (1), and we assume that with this control there is associated a gain $J(x_0, u_\bullet, \omega)$, where $\omega$ is some positive real constant. Solving the present optimal control problem consists in finding a control $u_\bullet$ that brings the system at time $N$ in a given final state $x_f$, while at the same time maximising the gain $J(x_0, u_\bullet, \omega)$. The dynamic programming approach achieves this by reasoning backwards in time. First, the control $u_{N-1}$ is determined that maximises the gain of the final time step. This control also determines a unique $x_{N-1}$, and the procedure is then repeated by finding a control $u_{N-2}$ that maximises the gain $x_{N-2}^2 + \omega u_{N-2}^2$, and so on. The principle of optimality then ensures that the control $u_\bullet$ found in this recursive manner indeed solves the optimal control problem. When $\omega$ is not well known, and only its probability distribution is given, the optimal control problem is solved by maximising the expected value of the gain, which can in this special example be done by replacing $\omega$ with its expectation.
It has however been argued by various scholars (see [2, Chapter 5] for a detailed discussion with many references) that uncertainty cannot always be modelled adequately by (precise) probability measures, because, roughly speaking, there may not be enough information available to identify a single probability measure. In those cases, it is more appropriate to represent the available knowledge through a so-called imprecise probability model, e.g., by a coherent lower prevision, or what is mathematically equivalent, by a set of probability measures. For applications of this approach, see for instance [3,4]. In the example above, it may for instance happen that the probability distribution for $\omega$ is only known to belong to a given set: e.g., $\omega$ is normally distributed with mean zero, but the variance is only known to belong to an interval $[\underline{\sigma}^2, \overline{\sigma}^2]$; or $\omega$ itself is only known to belong to an interval $[\underline{\omega}, \overline{\omega}]$.
Two questions now arise naturally. First of all, how should we formulate the optimal control problem: what does it mean for a control to be optimal with respect to an uncertain gain function, where the uncertainty is represented through an imprecise probability model? In Section 2 we identify three different optimality criteria, each with a different interpretation (although they coincide for precise probability models), and we study the relations between them. Secondly, is it still possible to solve the corresponding optimal control problems using the ideas underlying Bellman's dynamic programming method? We show in Section 3 that this is the case for only two of the three optimality criteria we study: only for these does a generalised principle of optimality hold, and the optimal controls are solutions of suitably generalised Bellman-like equations. In order to arrive at this conclusion, we study the properties that an abstract notion of optimality should satisfy for the Bellman approach to work. To illustrate how our ideas can be implemented, we present a numerical example in Section 4.
We recognise that other authors (see for instance [5,6,7,8,9]) have extended the dynamic programming algorithm to systems with uncertain gain and/or uncertain dynamics, where the uncertainty is modelled by an imprecise probability model. But none of them seem to have questioned under what assumptions their generalised dynamic programming method leads to optimal paths. Here we approach the problem from the opposite, and in our opinion, more logical side: one should first define a notion of optimality and investigate whether the dynamic programming argument holds for it, rather than blindly "generalise" Bellman's algorithm without showing that it actually yields optimal controls.
In the remainder of this section, we introduce the basic systems-theoretic concepts and notation used in the rest of the paper.

The System
For $a$ and $b$ in $\mathbb{N}$, the set of natural numbers $c$ that satisfy $a \le c \le b$ is denoted by $[a, b]$. Let $x_{k+1} = f(x_k, u_k, k)$ describe a discrete-time dynamical system with $k \in \mathbb{N}$, $x_k \in X$ and $u_k \in U$. The set $X$ is the state space (e.g., $\mathbb{R}^n$, $n \in \mathbb{N} \setminus \{0\}$), and the set $U$ is the control space (e.g., $\mathbb{R}^m$, $m \in \mathbb{N} \setminus \{0\}$). The map $f : X \times U \times \mathbb{N} \to X$ describes the evolution of the state in time: given the state $x_k \in X$ and the control $u_k \in U$ at time $k \in \mathbb{N}$, it returns the next state $x_{k+1}$ of the system. For practical reasons, we impose a final time $N$ beyond which we are not interested in the dynamics of the system. Moreover, it may happen that not all states and controls are allowed at all times: we demand that $x_k$ should belong to a set of admissible states $X_k$ at every instant $k \in [0, N]$, and that $u_k$ should belong to a set of admissible controls $U_k$ at every instant $k \in [0, N-1]$, where $X_k \subseteq X$ and $U_k \subseteq U$ are given. The set $X_N$ may be thought of as the set we want the state to end up in at time $N$.
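As a minimal sketch of these definitions, the following Python fragment generates the state trajectory produced by a control sequence and checks admissibility. The function names, the representation of the admissible sets $X_k$ and $U_k$ as predicates of (value, time), and the example dynamics are assumptions made for illustration only.

```python
def trajectory(f, x, k, controls):
    """States x_k, ..., x_{k+len(controls)} generated by x_{i+1} = f(x_i, u_i, i)."""
    states = [x]
    for offset, u in enumerate(controls):
        states.append(f(states[-1], u, k + offset))
    return states


def is_admissible(states, controls, k, state_ok, control_ok):
    """Check x_i in X_i and u_i in U_i; the sets are passed as predicates (value, time)."""
    states_fine = all(state_ok(x, k + i) for i, x in enumerate(states))
    controls_fine = all(control_ok(u, k + i) for i, u in enumerate(controls))
    return states_fine and controls_fine


# Hypothetical example: x_{k+1} = 2 x_k + u_k, non-negative states, |u| <= 3.
def dynamics(x, u, k):
    return 2 * x + u


controls = [0, -1, 3]
states = trajectory(dynamics, 1, 0, controls)
admissible = is_admissible(
    states, controls, 0,
    state_ok=lambda x, i: x >= 0,
    control_ok=lambda u, i: abs(u) <= 3,
)
print(states, admissible)
```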

Paths
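A path is a triple $(x, k, u_\bullet)$, where $x \in X$ is a state, $k \in [0, N]$ a time instant, and $u_\bullet$ a sequence of controls applied from time $k$ onwards. The set of admissible paths starting in state $x \in X_k$ at time $k \in [0, N]$ and ending at time $\ell \in [k, N]$ is denoted by $\mathcal{U}(x, k)$. In particular we have that Moreover, for any $(x, k, u_\bullet) \in \mathcal{U}(x, k)$ and any $\mathcal{V} \subseteq \mathcal{U}(x_\ell, \ell)$, we use the notation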

The Gain Function
We assume that applying the control action $u \in U$ to the system in state $x \in X$ at time $k \in [0, N-1]$ yields a real-valued gain $g(x, u, k, \omega)$. Moreover, reaching the final state $x \in X$ at time $N$ also yields a gain $h(x, \omega)$. The parameter $\omega \in \Omega$ represents the (unknown) state of the world, and it is a device used to model that the gains are not well known. If we knew that the real state of the world was $\omega_o$, we would know the gains to be $g(x, u, k, \omega_o)$ and $h(x, \omega_o)$. As it is, the real state of the world is uncertain, and so are the gains, which could be considered as random variables. It is important to note that the parameter $\omega$ only influences the gains; it has no effect on the system dynamics, which are assumed to be known perfectly well.
We shall only consider the important case where the gains are additive along paths, i.e., with a path $(x, k, u_\bullet)$ we associate a gain $J(x, k, u_\bullet, \omega)$ given by $J(x, k, u_\bullet, \omega) = \sum_{i=k}^{N-1} g(x_i, u_i, i, \omega) + h(x_N, \omega)$ for any $\omega \in \Omega$ (gain additivity), where $x_k = x$ and $x_{i+1} = f(x_i, u_i, i)$. If $M < N$, we also use the notation It will be convenient to associate a zero gain with an empty control action: for
The main objective of optimal control can now be formulated as follows: given that the system is in the initial state $x \in X$ at time $k \in [0, N]$, find a control sequence $u_\bullet : [k, N-1] \to U$ resulting in an admissible path $(x, k, u_\bullet)$ such that the corresponding gain $J(x, k, u_\bullet, \omega)$ is maximal. Moreover, we would like this control sequence $u_\bullet$ to be such that its value $u_k$ at time $k$ is a function of $x$ and $k$ only, since in that case the control can be realised through state feedback.
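Gain additivity can be illustrated with a short Python sketch that accumulates the stage gains $g$ along the trajectory and adds the terminal gain $h$. The function name `path_gain` and the example choices of `f`, `g` and `h` are hypothetical; note that $\omega$ enters the gains only, never the dynamics.

```python
def path_gain(f, g, h, x, k, controls, omega):
    """Sum of stage gains along the path, plus the terminal gain at the end."""
    total = 0.0
    for offset, u in enumerate(controls):
        i = k + offset
        total += g(x, u, i, omega)
        x = f(x, u, i)  # the dynamics do not depend on omega
    return total + h(x, omega)


# Hypothetical ingredients, for illustration only.
def f(x, u, k):
    return x + u


def g(x, u, k, omega):
    return -omega * u * u


def h(x, omega):
    return -x * x


whole = path_gain(f, g, h, 1.0, 0, [-0.5, -0.5], omega=2.0)
# Additivity: first stage gain plus the gain of the remaining path.
split = g(1.0, -0.5, 0, 2.0) + path_gain(f, g, h, 0.5, 1, [-0.5], omega=2.0)
print(whole, split)
```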
If $\omega$ is known, then the problem reduces to the classical problem of dynamic programming, first studied and solved by Bellman [1]. We shall assume here that the available information about the true state of the world is modelled through a coherent lower prevision $\underline{P}$ defined on the set $\mathcal{L}(\Omega)$ of gambles, or bounded real-valued maps, on $\Omega$. A special case of this obtains when $\underline{P}$ is a linear prevision $P$. Linear previsions are the precise probability models; they can be interpreted as expectation operators associated with (finitely additive) probability measures, and they are previsions or fair prices in the sense of de Finetti [10]. We assume that the reader is familiar with the basic ideas behind the theory of coherent lower (and linear) previsions (see [2] for more details).
For a given path $(x, k, u_\bullet)$, the corresponding gain $J(x, k, u_\bullet, \omega)$ can be seen as a real-valued map on $\Omega$, which is denoted by $J(x, k, u_\bullet)$ and is called the gain gamble associated with $(x, k, u_\bullet)$. In the same way we define the gain gambles There is gain additivity: We denote by $\mathcal{J}(x, k)$ the set of gain gambles for admissible paths from initial state $x \in X_k$ at time $k \in [0, N]$: $\mathcal{J}(x, k) = \{J(x, k, u_\bullet) : (x, k, u_\bullet) \in \mathcal{U}(x, k)\}$.


Optimality Criteria

$\underline{P}$-Maximality
The lower prevision $\underline{P}(X)$ of a gamble $X$ has a behavioural interpretation as a subject's supremum acceptable price for buying the gamble $X$: it is the highest value of $\mu$ such that the subject accepts the gamble $X - x$ (i.e., accepts to buy $X$ for a price $x$) for all $x < \mu$. The conjugate upper prevision $\overline{P}(X) = -\underline{P}(-X)$ of $X$ is then the subject's infimum acceptable price for selling $X$. This way of looking at a coherent lower prevision $\underline{P}$ defined on the set $\mathcal{L}(\Omega)$ of all gambles allows us to define a strict partial order $>_{\underline{P}}$ on $\mathcal{L}(\Omega)$ whose interpretation is that of strict preference.
Definition 1 For any gambles $X$ and $Y$ in $\mathcal{L}(\Omega)$ we say that $X$ strictly dominates $Y$, or that $X$ is strictly preferred to $Y$ (with respect to $\underline{P}$), and we write $X >_{\underline{P}} Y$, if $\underline{P}(X - Y) > 0$ or ($X \ge Y$ and $X \ne Y$).
Indeed, if $X \ge Y$ and $X \ne Y$, then the subject should be willing to exchange $Y$ for $X$, since this can only improve his gain. On the other hand, $\underline{P}(X - Y) > 0$ expresses that the subject is willing to pay some strictly positive price to exchange $Y$ for $X$, which again means that he strictly prefers $X$ to $Y$.
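For a finite $\Omega$, the strict preference relation can be sketched as follows. The sketch assumes that the lower prevision is given as the lower envelope of a finite list of probability mass functions (the extreme points of a credal set); since expectations are linear, this equals the lower envelope over their convex hull. The names `lower_prevision`, `strictly_dominates` and the numbers are hypothetical.

```python
def lower_prevision(gamble, credal_set):
    """Lower envelope of the expectations of `gamble` over a finite set of pmfs."""
    return min(sum(p * v for p, v in zip(pmf, gamble)) for pmf in credal_set)


def strictly_dominates(X, Y, credal_set):
    """X is strictly preferred to Y: the lower prevision of X - Y is positive,
    or X >= Y pointwise with X != Y."""
    diff = [a - b for a, b in zip(X, Y)]
    pointwise = all(d >= 0 for d in diff) and any(d > 0 for d in diff)
    return pointwise or lower_prevision(diff, credal_set) > 0


# Two states of the world; the probability of the first lies in [0.25, 0.75].
CREDAL_SET = [(0.25, 0.75), (0.75, 0.25)]

X, Y, V = (4, 0), (0, 4), (4, 3)
print(strictly_dominates(V, Y, CREDAL_SET))  # V is preferred to Y
print(strictly_dominates(X, Y, CREDAL_SET), strictly_dominates(Y, X, CREDAL_SET))
```

In this example `X` and `Y` are incomparable: neither is strictly preferred to the other, which cannot happen for distinct expectations under a single probability mass function.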
It is clear that we can also use the coherent lower prevision $\underline{P}$ to express a strict preference between any two paths $(x, k, u_\bullet)$ and $(x, k, v_\bullet)$, based on their gains: if $J(x, k, u_\bullet) >_{\underline{P}} J(x, k, v_\bullet)$, then we say that the path $(x, k, u_\bullet)$ is strictly preferred to $(x, k, v_\bullet)$, and we use the notation $(x, k, u_\bullet) >_{\underline{P}} (x, k, v_\bullet)$.
The relation $>_{\underline{P}}$ is anti-reflexive and transitive. It is therefore indeed a strict partial order on $\mathcal{L}(\Omega)$, and in particular also on $\mathcal{J}(x, k)$ and on $\mathcal{U}(x, k)$. But it is generally not linear: unless $\underline{P}$ is a linear prevision, there will typically be gambles $X$ and $Y$ such that $\underline{P}(X - Y) \le 0 \le \overline{P}(X - Y)$, and therefore $X \not>_{\underline{P}} Y$ and $Y \not>_{\underline{P}} X$. Two paths need not be comparable with respect to this order, and it does not always make sense to look for greatest elements, i.e., for paths that strictly dominate all the others. Rather, we should look for maximal, or undominated, elements: paths $(x, k, u_\bullet)$ that are not dominated by any other path, meaning that $(x, k, v_\bullet) \not>_{\underline{P}} (x, k, u_\bullet)$ for all admissible paths $(x, k, v_\bullet)$. Observe that a maximal gamble $X$ in a set $K$ with respect to $>_{\underline{P}}$ can be characterised as a maximal element of $K$ with respect to $\ge$ (i.e., it is point-wise undominated) such that $\overline{P}(X - Y) \ge 0$ for all $Y \in K$. In case $\underline{P}$ is a linear prevision $P$, maximal gambles with respect to $>_P$ are precisely the point-wise undominated gambles whose prevision is maximal; they maximise the expected gain. This motivates the following optimality definition.
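We denote the set of the $\underline{P}$-maximal paths in $\mathcal{V}$ by $\mathrm{opt}_{>_{\underline{P}}}(\mathcal{V})$. The operator $\mathrm{opt}_{>_{\underline{P}}}$ is called the optimality operator induced by $>_{\underline{P}}$, associated with $\mathcal{U}(x, k)$.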
The $\underline{P}$-maximal paths in $\mathcal{U}(x, k)$ are precisely those admissible paths starting at time $k$ in state $x$ for which the associated gain gamble is a maximal element of $\mathcal{J}(x, k)$ with respect to the strict partial order $>_{\underline{P}}$. If we denote the set of these $>_{\underline{P}}$-maximal gain gambles in $\mathcal{J}(x, k)$ by $\mathrm{opt}_{>_{\underline{P}}}(\mathcal{J}(x, k))$, then for all $(x, k, u_\bullet) \in \mathcal{U}(x, k)$: $(x, k, u_\bullet) \in \mathrm{opt}_{>_{\underline{P}}}(\mathcal{U}(x, k))$ if and only if $J(x, k, u_\bullet) \in \mathrm{opt}_{>_{\underline{P}}}(\mathcal{J}(x, k))$.
$\underline{P}$-maximal paths do not always exist: not every partially ordered set has maximal elements. A fairly general sufficient condition for the existence of $\underline{P}$-maximal elements in $\mathcal{J}(x, k)$ (and hence in $\mathcal{U}(x, k)$) is that $\mathcal{J}(x, k)$ should be compact (and of course non-empty). This follows from a general result mentioned in [2, Section 3.9.2], which is also proven in Lemma 3 below. In fact, we use this lemma to prove a stronger result in Theorem 4, whose Corollary 5 turns out to be very important in showing that the dynamic programming approach works for $\underline{P}$-maximality (see Section 3.2).
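For a finite set of gain gambles on a finite $\Omega$ (a finite set is trivially compact), the maximal elements can be found by direct pairwise comparison. The sketch below again assumes a lower prevision given as the lower envelope of a finite list of probability mass functions; the helper names and the numbers are illustrative assumptions.

```python
def lower_prevision(gamble, credal_set):
    """Lower envelope of the expectations of `gamble` over a finite set of pmfs."""
    return min(sum(p * v for p, v in zip(pmf, gamble)) for pmf in credal_set)


def strictly_dominates(X, Y, credal_set):
    """Strict preference: positive lower prevision of X - Y, or pointwise dominance."""
    diff = [a - b for a, b in zip(X, Y)]
    pointwise = all(d >= 0 for d in diff) and any(d > 0 for d in diff)
    return pointwise or lower_prevision(diff, credal_set) > 0


def maximal_elements(gambles, credal_set):
    """Gambles that are not strictly dominated by any other gamble in the set."""
    return [X for X in gambles
            if not any(strictly_dominates(Y, X, credal_set) for Y in gambles)]


CREDAL_SET = [(0.25, 0.75), (0.75, 0.25)]
GAMBLES = [(4, 0), (0, 4), (2, 2), (0, 3)]
maximal = maximal_elements(GAMBLES, CREDAL_SET)
print(maximal)
```

Here `(0, 3)` is discarded because `(0, 4)` dominates it point-wise, while the three remaining gambles are mutually incomparable and hence all maximal.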
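In order to prove Lemma 3 and Theorem 4, it is convenient to introduce the partial preorder (a reflexive and transitive relation) $\ge_{\underline{P}}$ on $\mathcal{L}(\Omega)$ defined by: $X \ge_{\underline{P}} Y$ if $\underline{P}(X - Y) \ge 0$. Recall that an element $X$ of a subset $K$ of $\mathcal{L}(\Omega)$ is a maximal element of $K$ with respect to $\ge_{\underline{P}}$ if it is undominated, i.e., if and only if $(\forall Y \in K)(Y \ge_{\underline{P}} X \Rightarrow X \ge_{\underline{P}} Y)$ (Condition (2)).
Lemma 3 For any non-empty compact subset $K$ of $\mathcal{L}(\Omega)$ the following statements hold.
(i) If $X$ is a maximal element of $K$ with respect to $\ge_{\underline{P}}$, then $\overline{P}(X - Y) \ge 0$ for all $Y$ in $K$.
(ii) For every $X$ in $K$, the up-set $\uparrow_{\underline{P}} X = \{Y \in K : Y \ge_{\underline{P}} X\}$ of $K$ is non-empty and compact.
(iii) There is a maximal element of $K$ with respect to $\ge_{\underline{P}}$.
(iv) For every $X$ in $K$ there is a maximal element $Y$ of $K$ with respect to $\ge_{\underline{P}}$ such that $Y \ge_{\underline{P}} X$.
(v) For every $X$ in $K$ there is a maximal element $Y$ of $K$ with respect to the point-wise order $\ge$ such that $Y \ge X$.
(vi) There is a maximal element of $K$ with respect to $>_{\underline{P}}$.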
PROOF. Assume that $X$ is a maximal element of $K$ with respect to $\ge_{\underline{P}}$. Consider $Y$ in $K$, then it follows from Condition (2) that $\underline{P}(Y - X) < 0$ or $\underline{P}(X - Y) \ge 0$. In both cases it follows that $\overline{P}(X - Y) \ge 0$. This proves (i).
It is obvious that $X \ge_{\underline{P}} X$, so $\uparrow_{\underline{P}} X$ is non-empty. Consider a sequence $(X_n)$ in $\uparrow_{\underline{P}} X$ that converges to some gamble $X_\infty$: $\sup_{\omega \in \Omega} |X_\infty(\omega) - X_n(\omega)| \to 0$. Since $K$ is compact and therefore closed, we know that $X_\infty \in K$. It now follows from the coherence of $\underline{P}$ (see [2, Theorem 2.6.1]) and $\underline{P}(X_n - X) \ge 0$ that $\underline{P}(X_\infty - X) \ge \underline{P}(X_\infty - X_n) + \underline{P}(X_n - X) \ge \underline{P}(X_\infty - X_n)$ and that $\underline{P}(X_\infty - X_n) \to 0$, and therefore $\underline{P}(X_\infty - X) \ge 0$, whence $X_\infty \in \uparrow_{\underline{P}} X$. This tells us that $\uparrow_{\underline{P}} X$ is a closed subset of the compact $K$ and therefore also compact, proving (ii).
To prove (iii), let $K'$ be any subset of the non-empty compact set $K$ that is linearly ordered with respect to $\ge_{\underline{P}}$. If we can show that $K'$ has an upper bound in $K$ with respect to $\ge_{\underline{P}}$, then we can infer from Zorn's lemma that $K$ has a $\ge_{\underline{P}}$-maximal element. Let then $\{X_1, X_2, \ldots, X_n\}$ be an arbitrary finite subset of $K'$. We can assume without loss of generality that $X_1 \ge_{\underline{P}} X_2 \ge_{\underline{P}} \ldots \ge_{\underline{P}} X_n$, and consequently $\uparrow_{\underline{P}} X_1 \subseteq \uparrow_{\underline{P}} X_2 \subseteq \ldots \subseteq \uparrow_{\underline{P}} X_n$. This implies that the intersection $\bigcap_{k=1}^{n} \uparrow_{\underline{P}} X_k = \uparrow_{\underline{P}} X_1$ of these up-sets is non-empty. We see that the collection $\{\uparrow_{\underline{P}} X : X \in K'\}$ of compact and therefore closed subsets of the compact set $K$ has the finite intersection property. Consequently, the intersection $\bigcap_{X \in K'} \uparrow_{\underline{P}} X$ is non-empty as well, and this is the set of upper bounds of $K'$ in $K$ with respect to $\ge_{\underline{P}}$.
To prove (iv), combine (ii) and (iii) to show that the non-empty compact set $\uparrow_{\underline{P}} X$ has a maximal element $Y$ with respect to $\ge_{\underline{P}}$. It is then a trivial step to prove that $Y$ is also $\ge_{\underline{P}}$-maximal in $K$.
The fifth statement follows from the fourth: let $\underline{P}$ be the (coherent) so-called vacuous lower prevision, defined by $\underline{P}(X) = \inf\{X(\omega) : \omega \in \Omega\}$. Then the order $\ge_{\underline{P}}$ is nothing but the point-wise order $\ge$.
We now come to the last statement. By combining (i) and (iii), we know that there is some $Y_o$ in $K$ such that $\overline{P}(Y_o - X) \ge 0$ for all $X \in K$. From (v) we infer that there is some $\ge$-maximal $Y$ in $K$ such that $Y \ge Y_o$, and therefore (by coherence) $\overline{P}(Y - X) \ge \overline{P}(Y_o - X) \ge 0$ for all $X \in K$. This means that $Y$ is a maximal element of $K$ with respect to $>_{\underline{P}}$.
Theorem 4 For every element $X$ of a compact subset $K$ of $\mathcal{L}(\Omega)$ that is not a maximal element of $K$ with respect to $>_{\underline{P}}$, there is some maximal element $Y$ of $K$ with respect to $>_{\underline{P}}$ such that $Y >_{\underline{P}} X$.
PROOF. Consider an element $X$ of $K$ that is not $>_{\underline{P}}$-maximal in $K$. We may assume that $X$ is $\ge$-maximal in $K$. Indeed, if $X$ is not $\ge$-maximal then by Lemma 3(v) there is some $\ge$-maximal $Z$ in $K$ such that $Z \ge X$ and $Z \ne X$, whence $Z >_{\underline{P}} X$. If $Z$ is $>_{\underline{P}}$-maximal in $K$ then there is nothing left to prove. So we are left with the case that $Z$ is not $>_{\underline{P}}$-maximal. If we can prove for this $\ge$-maximal $Z$ that there is some $>_{\underline{P}}$-maximal $Y$ in $K$ such that $Y >_{\underline{P}} Z$ then also $Y >_{\underline{P}} X$ and the proof is complete.
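Since $X$ is $\ge$-maximal in $K$ but not $>_{\underline{P}}$-maximal, there is some $U$ in $K$ such that $\underline{P}(U - X) > 0$. By Lemma 3(ii) and (vi) there is a $>_{\underline{P}}$-maximal element $Y$ in $\uparrow_{\underline{P}} U$. Since $\underline{P}(Y - U) \ge 0$ we infer from the coherence of $\underline{P}$ (see [2, Theorem 2.6.1(e)]) that $\underline{P}(Y - X) = \underline{P}(Y - U + U - X) \ge \underline{P}(Y - U) + \underline{P}(U - X) > 0$, whence $Y >_{\underline{P}} X$. It remains to prove that $Y$ is also $>_{\underline{P}}$-maximal in $K$. Assume ex absurdo that there is some $V$ in $K$ such that $V >_{\underline{P}} Y$. Then there are two possibilities. If $V \ge Y$ and $V \ne Y$ then it follows from the coherence of $\underline{P}$ (see [2, Theorem 2.6.1(d)]) that $\underline{P}(V - U) \ge \underline{P}(Y - U) \ge 0$, whence $V \in \uparrow_{\underline{P}} U$, a contradiction. If $\underline{P}(V - Y) > 0$ then it follows from the coherence of $\underline{P}$ (see [2, Theorem 2.6.1(e)]) that $\underline{P}(V - U) \ge \underline{P}(V - Y) + \underline{P}(Y - U) > 0$, whence again $V \in \uparrow_{\underline{P}} U$, a contradiction.
Corollary 5 Let $k \in [0, N]$ and $x \in X_k$. If $\mathcal{J}(x, k)$ is compact, then for every admissible path in $\mathcal{U}(x, k)$ that is not $\underline{P}$-maximal there is a $\underline{P}$-maximal path in $\mathcal{U}(x, k)$ that is strictly preferred to it.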

$\underline{P}$-Maximinity
We now turn to a different optimality criterion that can be associated with a lower prevision $\underline{P}$. We use $\underline{P}$ to define another strict order on $\mathcal{L}(\Omega)$:
Definition 6 For any gambles $X$ and $Y$ in $\mathcal{L}(\Omega)$ we write $X \sqsupset_{\underline{P}} Y$ if $\underline{P}(X) > \underline{P}(Y)$ or ($X \ge Y$ and $X \ne Y$).
$\sqsupset_{\underline{P}}$ induces a strict partial order on $\mathcal{U}(x, k)$, since it is anti-reflexive and transitive on $\mathcal{L}(\Omega)$. A maximal element $X$ of a subset $K$ of $\mathcal{L}(\Omega)$ with respect to $\sqsupset_{\underline{P}}$ is easily seen to be a point-wise undominated element of $K$ that maximises the lower prevision: $\underline{P}(X) \ge \underline{P}(Y)$ for all $Y \in K$.
We can consider as optimal in $\mathcal{U}(x, k)$ those admissible paths $(x, k, u_\bullet)$ for which the associated gain gamble $J(x, k, u_\bullet)$ is a maximal element of $\mathcal{J}(x, k)$ with respect to $\sqsupset_{\underline{P}}$; they are the paths $(x, k, u_\bullet)$ that maximise the 'lower expected gain' $\underline{P}(J(x, k, u_\bullet))$ and whose gain gambles $J(x, k, u_\bullet)$ are point-wise undominated.
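We denote the set of the $\underline{P}$-maximin paths in $\mathcal{V}$ by $\mathrm{opt}_{\sqsupset_{\underline{P}}}(\mathcal{V})$. The operator $\mathrm{opt}_{\sqsupset_{\underline{P}}}$ is called the optimality operator induced by $\sqsupset_{\underline{P}}$, associated with $\mathcal{U}(x, k)$.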
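The maximin criterion can be sketched for a finite set of gambles on a finite $\Omega$ in the same way as maximality. As before, the lower prevision is assumed to be the lower envelope of a finite list of probability mass functions, and all names and numbers are hypothetical.

```python
def lower_prevision(gamble, credal_set):
    """Lower envelope of the expectations of `gamble` over a finite set of pmfs."""
    return min(sum(p * v for p, v in zip(pmf, gamble)) for pmf in credal_set)


def maximin_prefers(X, Y, credal_set):
    """Maximin order: higher lower prevision, or pointwise dominance with X != Y."""
    pointwise = (all(a >= b for a, b in zip(X, Y))
                 and any(a > b for a, b in zip(X, Y)))
    return pointwise or lower_prevision(X, credal_set) > lower_prevision(Y, credal_set)


def maximin_elements(gambles, credal_set):
    """Gambles to which no other gamble in the set is preferred in the maximin order."""
    return [X for X in gambles
            if not any(maximin_prefers(Y, X, credal_set) for Y in gambles)]


CREDAL_SET = [(0.25, 0.75), (0.75, 0.25)]
GAMBLES = [(4, 0), (0, 4), (2, 2), (0, 3)]
maximin = maximin_elements(GAMBLES, CREDAL_SET)
print(maximin)
```

With these numbers only the constant gamble `(2, 2)` survives, since its lower prevision 2 exceeds the lower prevision 1 of both `(4, 0)` and `(0, 4)`.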
Proposition 8 $\underline{P}$-maximinity implies $\underline{P}$-maximality. For a linear prevision $P$, $P$-maximinity is equivalent to $P$-maximality.
PROOF. Consider a set of gambles $K$ and assume that $X$ is a maximal element of $K$ with respect to $\sqsupset_{\underline{P}}$. In order to prove that $X$ is also a maximal element of $K$ with respect to $>_{\underline{P}}$, it obviously suffices to show that $\overline{P}(X - Y) \ge 0$ for all $Y \in K$. We know that $\underline{P}(X) \ge \underline{P}(Y)$ for any $Y \in K$, and consequently, taking into account coherence (see [2, Section 2.6.1(e)]): $\overline{P}(X - Y) \ge \underline{P}(X) - \underline{P}(Y) \ge 0$. If $\underline{P}$ is a linear prevision $P$, assume that $X$ is a maximal element of $K$ with respect to $>_P$. In order to prove that $X$ is also a maximal element of $K$ with respect to $\sqsupset_P$, it suffices to show that $P(X) \ge P(Y)$ for all $Y \in K$. Since we know that for any $Y \in K$, $P(X - Y) \ge 0$, and that $P(X - Y) = P(X) - P(Y)$, the desired result follows at once.
The existence of maximal elements with respect to $\sqsupset_{\underline{P}}$ in an arbitrary set of gambles $K$ is obviously not guaranteed. But if $K$ is compact, then we may easily infer from the continuity of any coherent lower prevision $\underline{P}$ that the counterparts of Theorem 4 and Corollary 5 hold for $\sqsupset_{\underline{P}}$.

M-Maximality
There is a tendency, especially among robust Bayesians, to consider an imprecise probability model as a compact convex set of linear previsions $\mathcal{M} \subseteq \mathbb{P}(\Omega)$, where $\mathbb{P}(\Omega)$ is the set of all linear previsions on $\mathcal{L}(\Omega)$. $\mathcal{M}$ is assumed to contain the true, but unknown, linear prevision $P_T$ that models the available information [12,13].
A gamble $X$ is then certain to be strictly preferred to a gamble $Y$ under the true linear prevision $P_T$ if and only if it is strictly preferred under all candidate models $P \in \mathcal{M}$. This observation leads to the definition of a 'robustified' strict partial order $>_{\mathcal{M}}$ on $\mathcal{L}(\Omega)$.
Since $\mathcal{M}$ is assumed to be compact and convex, it is not difficult to show that the strict partial orders $>_{\mathcal{M}}$ and $>_{\underline{P}}$ are one and the same, where the coherent lower prevision $\underline{P}$ is the so-called lower envelope of $\mathcal{M}$, defined by $\underline{P}(X) = \inf\{P(X) : P \in \mathcal{M}\}$ for all $X \in \mathcal{L}(\Omega)$. Conversely, given a coherent lower prevision $\underline{P}$, the strict partial orders $>_{\mathcal{M}(\underline{P})}$ and $>_{\underline{P}}$ are identical, where $\mathcal{M}(\underline{P}) = \{P \in \mathbb{P}(\Omega) : P(X) \ge \underline{P}(X) \text{ for all } X \in \mathcal{L}(\Omega)\}$ is the set of linear previsions that dominate $\underline{P}$. These strict partial orders have the same maximal elements, and lead to the same notion of optimality.
But there is in the literature yet another notion of optimality that can be associated with a compact convex set of linear previsions $\mathcal{M}$: a gamble $X$ is considered optimal in a set of gambles $K$ if it is a maximal element of $K$ with respect to the strict partial order $>_P$ for some $P \in \mathcal{M}$. This notion of optimality is called 'E-admissibility' by Levi [14, Section 4.8]. It does not generally coincide with the ones associated with the strict partial orders $>_{\mathcal{M}}$ and $>_{\underline{P}}$, unless the set $K$ is convex [2, Section 3.9]. We are therefore led to consider a third notion of optimality, associated with a lower prevision $\underline{P}$, or a set of linear previsions $\mathcal{M}$.
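it is $P$-maximal in $\mathcal{V}$ for some $P$ in $\mathcal{M}$, i.e., if it is $\ge$-maximal in $\mathcal{V}$ and maximises $P(J(x, k, u_\bullet))$ over $\mathcal{V}$ for some $P \in \mathcal{M}$. The set of all $\mathcal{M}$-maximal elements of $\mathcal{V}$ is denoted by $\mathrm{opt}_{\mathcal{M}}(\mathcal{V})$.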
Interestingly, for any set of paths $\mathcal{V} \subseteq \mathcal{U}(x, k)$:


Dynamic Programming

A General Notion of Optimality
So far, we have discussed three different ways of associating optimal paths with a lower prevision $\underline{P}$, all of which occur in the literature. We now propose to find out whether, for these different types of optimality, we can use the ideas behind the dynamic programming method to solve the corresponding optimal control problems.
To do this, we take a closer look at Bellman's analysis as described in Section 1, and we investigate which properties a generic notion of optimality must satisfy for his method to work. Let us therefore assume that there is some property, called $*$-optimality, which a path in a given set of paths $\mathcal{P}$ either has or does not have. If a path in $\mathcal{P}$ has this property, we say that it is $*$-optimal in $\mathcal{P}$. We shall denote the set of the $*$-optimal elements of $\mathcal{P}$ by $\mathrm{opt}_*(\mathcal{P})$. By definition, $\mathrm{opt}_*(\mathcal{P}) \subseteq \mathcal{P}$. Further on, we shall apply our findings to the various instances of $*$-optimality described above.
Consider Figure 3, where we want to find the $*$-optimal paths from state $a$ to state $e$. Suppose that after one time step, we can reach the states $b$, $c$ and $d$ from state $a$. The $*$-optimal paths from these states to the final state $e$ are known to be $\alpha$, $\gamma$, and $\delta$ and $\eta$, respectively. For the dynamic programming approach to work, we need to be able to infer from this a generalised form of the Bellman equation, stating essentially that the $*$-optimal paths from $a$ to $e$, a priori given by $\mathrm{opt}_*(\{\lambda\alpha, \lambda\beta, \mu\gamma, \nu\delta, \nu\epsilon, \nu\eta\})$, are actually also given by $\mathrm{opt}_*(\{\lambda\alpha, \mu\gamma, \nu\delta, \nu\eta\})$, i.e., the $*$-optimal paths in the set of concatenations of $\lambda$, $\mu$ and $\nu$ with the respective $*$-optimal paths $\alpha$, $\gamma$, and $\delta$ and $\eta$. It is therefore necessary to exclude that the concatenations $\lambda\beta$ and $\nu\epsilon$ with the non-$*$-optimal paths $\beta$ and $\epsilon$ can be $*$-optimal. This amounts to requiring that the operator $\mathrm{opt}_*$ should satisfy some appropriate generalisation of Bellman's principle of optimality that will allow us to conclude that $\lambda\beta$ and $\nu\epsilon$ cannot be $*$-optimal because then $\beta$ and $\epsilon$ would be $*$-optimal as well. Definition 13 below provides a precise general formulation.
But, perhaps surprisingly for someone familiar with the traditional form of dynamic programming, $\mathrm{opt}_*$ should satisfy an additional property: the omission of the non-$*$-optimal paths $\lambda\beta$ and $\nu\epsilon$ from the set of candidate $*$-optimal paths should not have any effect on the actual $*$-optimal paths: we need that $\mathrm{opt}_*(\{\lambda\alpha, \lambda\beta, \mu\gamma, \nu\delta, \nu\epsilon, \nu\eta\}) = \mathrm{opt}_*(\{\lambda\alpha, \mu\gamma, \nu\delta, \nu\eta\})$.
This is obviously true for the simple type of optimality that we have looked at in Section 1, but it need not be true for the more abstract types that we want to consider here. Equality will be guaranteed if $\mathrm{opt}_*$ is insensitive to the omission of non-$*$-optimal elements from $\{\lambda\alpha, \lambda\beta, \mu\gamma, \nu\delta, \nu\epsilon, \nu\eta\}$, in the following sense.
Definition 11 Consider a set $S \ne \emptyset$ and an optimality operator $\mathrm{opt}_*$ defined on the set $\wp(S)$ of subsets of $S$ such that $\mathrm{opt}_*(T) \subseteq T$ for all $T \subseteq S$. Elements of $\mathrm{opt}_*(T)$ are called $*$-optimal in $T$. The optimality operator $\mathrm{opt}_*$ is called insensitive to the omission of non-$*$-optimal elements from $S$ if $\mathrm{opt}_*(S) = \mathrm{opt}_*(T)$ for all $T$ such that $\mathrm{opt}_*(S) \subseteq T \subseteq S$.
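For a small finite set $S$, the insensitivity property of Definition 11 can be checked by brute force over all sets $T$ between $\mathrm{opt}_*(S)$ and $S$. The two optimality operators below are hypothetical examples chosen only to show one operator that has the property and one that does not.

```python
from itertools import combinations


def is_insensitive(opt, S):
    """Check opt(S) == opt(T) for every T with opt(S) <= T <= S (finite S only)."""
    S = frozenset(S)
    best = frozenset(opt(S))
    rest = S - best
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            T = best | frozenset(extra)
            if frozenset(opt(T)) != best:
                return False
    return True


def greatest(T):
    """Classical optimality: the greatest element of a set of numbers."""
    return {t for t in T if t == max(T)}


def above_mean(T):
    """A hypothetical operator: elements that are at least the mean of the set."""
    values = list(T)
    mean = sum(values) / len(values)
    return {t for t in values if t >= mean}


print(is_insensitive(greatest, {1, 2, 3}))
print(is_insensitive(above_mean, {0, 4, 5}))
```

The second operator fails because removing the non-optimal element 0 raises the mean, so that 4 stops being optimal: omitting a non-optimal element changes the optimal set.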
The following proposition gives an interesting sufficient condition for this insensitivity in case optimality is associated with a (family of) strict partial order(s): it suffices that every non-optimal path is strictly dominated by an optimal one.
Proposition 12 Let $S$ be a non-empty set provided with a family of strict partial orders $>_j$, $j \in J$. Define for $T \subseteq S$, $\mathrm{opt}_{>_j}(T) = \{a \in T : (\forall b \in T)(b \not>_j a)\}$ as the set of maximal elements of $T$ with respect to $>_j$, and let $\mathrm{opt}_J(T) = \bigcup_{j \in J} \mathrm{opt}_{>_j}(T)$.
Then $\mathrm{opt}_{>_j}$, $j \in J$ and $\mathrm{opt}_J$ are optimality operators. If for some $j \in J$, $(\forall a \in S \setminus \mathrm{opt}_{>_j}(S))(\exists b \in \mathrm{opt}_{>_j}(S))(b >_j a)$ (Condition (4)), then $\mathrm{opt}_{>_j}$ is insensitive to the omission of non-$>_j$-optimal elements from $S$. If Condition (4) holds for all $j \in J$, then $\mathrm{opt}_J$ is insensitive to the omission of non-$J$-optimal elements from $S$.
PROOF. Consider $j$ in $J$, and assume that Condition (4) holds for this $j$. Let $\mathrm{opt}_{>_j}(S) \subseteq T \subseteq S$, then we must prove that $\mathrm{opt}_{>_j}(S) = \mathrm{opt}_{>_j}(T)$. First of all, if $a \in \mathrm{opt}_{>_j}(S)$ then $b \not>_j a$ for all $b$ in $S$, and a fortiori for all $b$ in $T$, whence $a \in \mathrm{opt}_{>_j}(T)$. Consequently, $\mathrm{opt}_{>_j}(S) \subseteq \mathrm{opt}_{>_j}(T)$. Conversely, let $a \in \mathrm{opt}_{>_j}(T)$ and assume ex absurdo that $a \notin \mathrm{opt}_{>_j}(S)$. It then follows from (4) that there is some $c$ in $\mathrm{opt}_{>_j}(S)$ and therefore in $T$ such that $c >_j a$, which contradicts $a \in \mathrm{opt}_{>_j}(T)$.
Next, assume that (4) holds for all $j \in J$. Let $\mathrm{opt}_J(S) \subseteq T \subseteq S$, then we must prove that $\mathrm{opt}_J(S) = \mathrm{opt}_J(T)$. Consider any $j \in J$, then $\mathrm{opt}_{>_j}(S) \subseteq \mathrm{opt}_J(S) \subseteq T \subseteq S$, so we may infer from the first part of the proof that $\mathrm{opt}_{>_j}(S) = \mathrm{opt}_{>_j}(T)$. By taking the union over all $j \in J$, we find that indeed $\mathrm{opt}_J(S) = \mathrm{opt}_J(T)$.
We are now ready for a precise formulation of the dynamic programming approach for solving optimal control problems associated with general types of optimality.
We assume that we have some type of optimality, called $*$-optimality, that allows us to associate with the set of admissible paths $\mathcal{U}(x, k)$ starting at time $k$ in initial state $x$, an optimality operator $\mathrm{opt}_*$ defined on the set $\wp(\mathcal{U}(x, k))$ of subsets of $\mathcal{U}(x, k)$. For each such subset $\mathcal{V}$, $\mathrm{opt}_*(\mathcal{V})$ is then the set of admissible paths that are $*$-optimal in $\mathcal{V}$. The principle of optimality states that the optimality operators associated with the various $\mathcal{U}(x, k)$ should be related in a special way.
Definition 13 (Principle of Optimality) $*$-optimality satisfies the principle of optimality if it holds for all $k$ This may also be expressed as: The Bellman equation now states that applying the optimality operator to the right hand side suffices to achieve equality. (Usually this is stated with $\ell = k + 1$.)
Theorem 14 (Bellman Equation) Let $k \in [0, N]$ and $x \in X_k$. Assume that $*$-optimality satisfies the principle of optimality, and that the optimality operator $\mathrm{opt}_*$ for $\mathcal{U}(x, k)$ is insensitive to the omission of non-$*$-optimal elements from $\mathcal{U}(x, k)$.
Then for all $\ell \in [k, N]$: that is, a path is $*$-optimal if and only if it is a $*$-optimal concatenation of an admissible path $(x, k, u_\bullet)$ and a $*$-optimal path of $\mathcal{U}(x_\ell, \ell)$.
By the principle of optimality, no path in $\mathcal{V}_2$ is $*$-optimal in $\mathcal{U}(x, k)$, so and since $\mathrm{opt}_*$ is assumed to be insensitive to the omission of non-$*$-optimal elements from $\mathcal{U}(x, k)$, it follows that
Let us now apply these general results to the specific types of optimality introduced in the previous section. For all three optimality operators $\mathrm{opt}_{>_{\underline{P}}}$, $\mathrm{opt}_{\mathcal{M}}$ and $\mathrm{opt}_{\sqsupset_{\underline{P}}}$, we shall check whether we can use a Bellman equation to solve the corresponding optimal control problem.

$\underline{P}$-Maximality
We first consider the optimality operator $\mathrm{opt}_{>_{\underline{P}}}$ that selects from a set of gambles (or paths) $S$ those gambles (or paths) that are the maximal elements of $S$ with respect to the strict partial order $>_{\underline{P}}$. The following lemma roughly states that the preference amongst paths with respect to $>_{\underline{P}}$ is preserved under concatenation and truncation. It yields a sufficient condition for the principle of optimality with respect to $\underline{P}$-maximality to hold. Moreover, the lemma, and the principle of optimality, do not necessarily hold for preference with respect to $\underline{P}$-maximinity.
PROOF. Let $X$, $Y$ and $Z$ be gambles on $\Omega$. The statement is proven if we can show that $Y >_{\underline{P}} Z$ implies $X + Y >_{\underline{P}} X + Z$. Assume that $Y >_{\underline{P}} Z$. Since $(X + Y) - (X + Z) = Y - Z$, both conditions in the definition of $>_{\underline{P}}$ carry over unchanged, whence $X + Y >_{\underline{P}} X + Z$.
As a direct consequence of Corollary 5 and Proposition 12, we see that if $\mathcal{J}(x, k)$ is compact, then the optimality operator $\mathrm{opt}_{>_{\underline{P}}}$ associated with $\mathcal{U}(x, k)$ is insensitive to the omission of non-$>_{\underline{P}}$-optimal elements. Together with Proposition 16 and Theorem 14, this allows us to infer a Bellman equation for $\underline{P}$-maximality.
that is, a path is $\underline{P}$-maximal if and only if it is a $\underline{P}$-maximal concatenation of an admissible path $(x, k, u_\bullet)$ and a $\underline{P}$-maximal path of $\mathcal{U}(x_\ell, \ell)$.
Corollary 17 results in a procedure to calculate all $\underline{P}$-maximal paths. Indeed, $\mathrm{opt}_{>_{\underline{P}}}(\mathcal{U}(x, N)) = \{u_\emptyset\}$ for every $x \in X_N$, and $\mathrm{opt}_{>_{\underline{P}}}(\mathcal{U}(x, k))$ can be calculated recursively through Eq. (5). It also provides a method for constructing a $\underline{P}$-maximal feedback: for every $x \in X_k$, choose any $(x, k, u^*_\bullet(x, k)) \in \mathrm{opt}_{>_{\underline{P}}}(\mathcal{U}(x, k))$. Then $\phi(x, k) = u^*_k(x, k)$ realises a $\underline{P}$-maximal feedback.
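The recursive procedure can be sketched in Python for a finite toy problem, with the recursion taken one time step at a time. Everything specific below is an assumption made for illustration: $\Omega$ has two elements, gain gambles are tuples, the lower prevision is the lower envelope of two probability mass functions, and the dynamics, gains and control set are invented. The last lines compare the result with a brute-force search over all paths.

```python
def lower_prevision(gamble, credal_set):
    """Lower envelope of the expectations of `gamble` over a finite set of pmfs."""
    return min(sum(p * v for p, v in zip(pmf, gamble)) for pmf in credal_set)


def strictly_dominates(X, Y, credal_set):
    """Strict preference: positive lower prevision of X - Y, or pointwise dominance."""
    diff = [a - b for a, b in zip(X, Y)]
    pointwise = all(d >= 0 for d in diff) and any(d > 0 for d in diff)
    return pointwise or lower_prevision(diff, credal_set) > 0


def maximal(candidates, credal_set):
    """Keep the (controls, gain gamble) pairs whose gain gamble is undominated."""
    return [c for c in candidates
            if not any(strictly_dominates(d[1], c[1], credal_set) for d in candidates)]


def extend(x, k, tails_of):
    """Concatenate every control at (x, k) with the given tails from the next state."""
    out = []
    for u in controls(x, k):
        stage = g(x, u, k)
        for tail_controls, tail_gain in tails_of(f(x, u, k), k + 1):
            total = tuple(s + t for s, t in zip(stage, tail_gain))
            out.append(((u,) + tail_controls, total))
    return out


def maximal_paths(x, k, N, credal_set):
    """Backward recursion: only maximal tails are concatenated and compared."""
    if k == N:
        return [((), h(x))]
    tails = lambda y, j: maximal_paths(y, j, N, credal_set)
    return maximal(extend(x, k, tails), credal_set)


def all_paths(x, k, N):
    """Brute force enumeration of every path, used only as a sanity check."""
    if k == N:
        return [((), h(x))]
    return extend(x, k, lambda y, j: all_paths(y, j, N))


# Hypothetical toy problem (two states of the world, three controls).
CREDAL_SET = [(0.25, 0.75), (0.75, 0.25)]
def controls(x, k): return (-1, 0, 1)
def f(x, u, k): return x + max(u, 0)
def g(x, u, k): return {1: (2 * (x + 1), 0), 0: (0, 3), -1: (0, 1)}[u]
def h(x): return (0, 0)

result = maximal_paths(0, 0, 2, CREDAL_SET)
brute = maximal(all_paths(0, 0, 2), CREDAL_SET)
feedback_at_start = result[0][0][0]  # first control of one maximal path
print(result)
```

In this toy problem the control `-1` is point-wise dominated at every stage, so the recursion discards it early, and four mutually incomparable maximal paths remain; any of them can be used to define the feedback.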

M-Maximality
We now turn to the optimality operator $\mathrm{opt}_{\mathcal{M}}$, defined through (3). If we recall Proposition 12, we see that $\mathrm{opt}_{\mathcal{M}}$ is insensitive to the omission of non-$\mathcal{M}$-maximal elements of $\mathcal{U}(x, k)$ whenever $\mathcal{J}(x, k)$ is compact. By Proposition 16, $\mathrm{opt}_{\mathcal{M}}$ satisfies the principle of optimality (indeed, if a path is $\mathcal{M}$-maximal, then it must be $P$-maximal for some $P \in \mathcal{M}$, and by the proposition any truncation of it is also $P$-maximal, hence also $\mathcal{M}$-maximal). This means that the Bellman equation also holds for $\mathcal{M}$-maximality under similar conditions as for $\underline{P}$-maximality. As already mentioned in Section 2.3, both types of optimality coincide if $\mathcal{J}(x, k)$ is convex.

P-Maximinity
Finally, we come to the type of optimality associated with P-maximinity. As before, if J(x, k) is compact, the corresponding optimality operator for U(x, k) is insensitive to the omission of non-optimal paths from U(x, k). But, as the following counterexample shows, we cannot guarantee that the principle of optimality holds for P-maximinity, and therefore the dynamic programming approach may not work here. Essentially, this is because the strict partial order underlying P-maximinity is not a vector ordering on L(Ω): it is not compatible with gain additivity, since, contrary to expected gains, lower expected gains are not additive.
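A small illustration of why non-additivity matters. It assumes a two-element space Ω = {a, b} and the vacuous lower prevision P(X) = min{X(a), X(b)}; both are hypothetical choices made for this sketch, and this is not the counterexample of Fig. 4. Write gambles as pairs (X(a), X(b)) and take X = (1, 0), Y = (0, 1) and Z = (0.5, 0.5):

$$
P(X) + P(Y) = 0 + 0 = 0 < 1 = P(X + Y),
$$

$$
P(Y) = 0 < 0.5 = P(Z),
\qquad\text{but}\qquad
P(X + Y) = 1 > 0.5 = P(X + Z).
$$

On its own, Z has the higher lower expected gain, yet after the common gamble X is added the ranking is reversed. A maximin-optimal tail therefore need not remain optimal once it is concatenated with an earlier path segment.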
The following theorem gives a sufficient condition for P-maximinity to satisfy the principle of optimality. It seems that this condition is implicitly assumed to hold in most of the literature studying maximin strategies by dynamic programming.
The idea that underlies this theorem is simple: the principle of optimality will hold if there is additivity of lower expected gains. In order to formulate the theorem in a way that is sufficiently general, we need to introduce a new concept. Assume that the set Ω is a Cartesian product of non-empty sets Ω_0, Ω_1, ..., Ω_N. Let P be a lower prevision defined on L(Ω). Then we call P externally additive relative to Ω_0, Ω_1, ..., Ω_N if for all X_k in L(Ω_k), where k = 0, ..., N, it holds that

$$
P\Big(\sum_{k=0}^{N} X_k\Big) = \sum_{k=0}^{N} P(X_k),
$$

where we have identified gambles on the Ω_k with the corresponding gambles on Ω that only depend on the ω_k (their so-called cylindrical extensions).
Theorem 19 Suppose that Ω = Ω_0 × Ω_1 × ··· × Ω_N and assume that the gain gambles g(x_k, u_k, k) are a function of ω_k only, and similarly, that h(x_N) is a function of ω_N only. Let the coherent lower prevision P on L(Ω) be externally additive relative to the sets Ω_0, Ω_1, ..., Ω_N. Then the principle of optimality holds for P-maximinity.
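External additivity can be checked numerically in a simple case. The sketch below assumes a vacuous lower prevision on a finite product space; the sets `OMEGA_0`, `OMEGA_1` and the gambles `x0`, `x1` are hypothetical. For such a model the worst case of a sum of gambles that depend on different coordinates is the sum of the separate worst cases.

```python
from itertools import product

# Hypothetical finite product space Ω = Ω_0 × Ω_1.
OMEGA_0 = (0, 1, 2)
OMEGA_1 = (0, 1)
OMEGA = list(product(OMEGA_0, OMEGA_1))


def lower_prevision(gamble):
    """Vacuous lower prevision on OMEGA: the worst-case value of the gamble."""
    return min(gamble(w) for w in OMEGA)


def x0(w):
    """A gamble that depends on ω_0 only (cylindrical extension to Ω)."""
    return (w[0] - 1) ** 2


def x1(w):
    """A gamble that depends on ω_1 only (cylindrical extension to Ω)."""
    return 3 - 2 * w[1]


def total(w):
    """The sum of the two stage gambles."""
    return x0(w) + x1(w)


print(lower_prevision(x0), lower_prevision(x1), lower_prevision(total))
```

The equality holds here because the minimising ω_0 and ω_1 can be chosen independently of each other. It would generally fail if both gambles depended on the same coordinate.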

A numerical example
Suppose we have a total amount of money x at our disposal, which we can invest in two companies, denoted by 0 and 1. We denote our investment in company 0 by u_0, and in company 1 by u_1. Observe that x, u_0 and u_1 are non-negative real numbers, and u_0 + u_1 ≤ x. The total gain is

$$
J(x, u_0, u_1) = \omega_0 u_0 + \omega_1 u_1 + \omega_2 (x - u_0 - u_1),
$$

where ω_0 > 0 and ω_1 > 0 are gain factors (for companies 0 and 1), and ω_2 > 0 is the devaluation factor (of the money we have not invested). We wish to maximise the gain, but we are uncertain about ω_0, ω_1 and ω_2. We know that ω_0 = 1 + g_0 + ε and ω_1 = 1 + g_1 + ε. Here g_0 and g_1 model the productivities of the companies, and ε models economic variations that affect each company in the same way, such as the state of the global economy. We do not make any assumption about the dependence between g_0, g_1, ε and ω_2. We only know that g_0 ∈ [0.0, 0.3], g_1 ∈ [0.1, 0.2], ε ∈ [−0.1, 0.2] and ω_2 ∈ [0.85, 0.95]. This leads to the following lower prevision on L(Ω_0 × Ω_1 × Ω_2):

$$
P(X) = \inf\big\{ X(1 + g_0 + \varepsilon,\ 1 + g_1 + \varepsilon,\ \omega_2) :
g_0 \in [0, 0.3],\ g_1 \in [0.1, 0.2],\ \varepsilon \in [-0.1, 0.2],\ \omega_2 \in [0.85, 0.95] \big\}.
$$

We refer to [18] for a detailed discussion about why this lower prevision really captures the available information. We now wish to find all u_0 and u_1 such that the gain J(x, u_0, u_1) is P-maximal. Observe that this is a two-dimensional optimisation problem.
We formulate this problem in terms of a dynamical system. If we define x_0 = x and, recursively, x_{k+1} = x_k − u_k, the total gain is precisely equal to J(x, u_•, 0), with g(x_k, u_k, k, ω) = ω_k u_k and h(x_2, ω) = ω_2 x_2. Each state x_k represents the money we can invest in companies ≥ k, and should therefore be non-negative. There is gain additivity, and the set of admissible gain gambles is compact. Corollary 17 applies: we can solve this problem using dynamic programming.
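For k = 1, we find that the control u_1 = x_1 is optimal from state x_1 at time 1. Indeed, first observe that all controls are maximal with respect to the point-wise order. In that case, optimality of u_1 is equivalent to \overline{P}(J(x_1, u_1, 1) − J(x_1, v_1, 1)) ≥ 0 for all v_1, where \overline{P} denotes the upper prevision conjugate to P. This holds iff

$$
\sup\big\{ (1 + g_1 + \varepsilon - \omega_2)(u_1 - v_1) :
g_1 \in [0.1, 0.2],\ \varepsilon \in [-0.1, 0.2],\ \omega_2 \in [0.85, 0.95] \big\} \ge 0
$$

for all v_1, and thus, since the factor 1 + g_1 + ε − ω_2 is at least 0.05 on these ranges, iff u_1 ≥ v_1 for all v_1. Hence, optimal paths maximise u_1. The highest u_1 we can choose such that x_2 is still non-negative is u_1 = x_1.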
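The same reasoning can be carried one step further back, to k = 0. The following is a sketch, assuming that maximality is again tested through the sign of the upper prevision of the gain difference, and that the optimal continuation u_1 = x_1 = x − u_0 is used from time 1 onwards. Comparing the controls u_0 and v_0, the variation common to both companies cancels:

$$
J(x, (u_0, x - u_0), 0) - J(x, (v_0, x - v_0), 0)
= (\omega_0 - \omega_1)(u_0 - v_0)
= (g_0 - g_1)(u_0 - v_0),
$$

$$
\sup\big\{ (g_0 - g_1)(u_0 - v_0) : g_0 \in [0, 0.3],\ g_1 \in [0.1, 0.2] \big\}
= 0.2\,|u_0 - v_0| \ge 0 .
$$

Since g_0 − g_1 ranges over [−0.2, 0.2], the condition holds for every v_0, so under these assumptions every u_0 ∈ [0, x] is P-maximal at time 0.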
In conclusion, the information implies that we should invest all money x, but we cannot infer how we should divide x over the two companies.
By our dynamic programming approach we have managed to solve this two-dimensional optimisation problem by reducing it to two one-dimensional ones, each of which is very easy to solve. In the more general case of uncertain investment with n companies, we initially have an n-dimensional optimisation problem, and dynamic programming reduces this to n very simple one-dimensional optimisation problems.
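The result of the example can be cross-checked by brute force on the original two-dimensional problem. The sketch below rests on several assumptions: investments are restricted to a coarse grid, dominance is tested through P(a − b) > 0 only, and the infimum over the parameter box is evaluated at its vertices (which suffices because gain differences are affine in the parameters). The name `e` stands for the variation common to both companies, and the function names are hypothetical.

```python
from itertools import product

# Vertices of the box of (g0, g1, e, w2).
VERTICES = list(product((0.0, 0.3), (0.1, 0.2), (-0.1, 0.2), (0.85, 0.95)))


def gain(x, u0, u1):
    """Total gain J(x, u0, u1), evaluated at every vertex of the parameter box."""
    return [
        (1 + g0 + e) * u0 + (1 + g1 + e) * u1 + w2 * (x - u0 - u1)
        for g0, g1, e, w2 in VERTICES
    ]


def dominates(a, b, tol=1e-9):
    """a >_P b, taken here as: the lower prevision (min over vertices) of a - b is positive."""
    return min(p - q for p, q in zip(a, b)) > tol


def maximal_investments(x, steps=4):
    """Grid investments (u0, u1) with u0 + u1 <= x that no other grid point dominates."""
    grid = [x * i / steps for i in range(steps + 1)]
    feasible = [(u0, u1) for u0 in grid for u1 in grid if u0 + u1 <= x + 1e-12]
    gains = {c: gain(x, *c) for c in feasible}
    return [
        c for c in feasible
        if not any(dominates(gains[d], gains[c]) for d in feasible)
    ]


for u0, u1 in maximal_investments(1.0):
    print(u0, u1)
```

Every investment that leaves money uninvested is dominated by moving the remainder into company 1, while the fully invested splits are mutually incomparable. The exhaustive comparison grows quadratically with the number of grid points in the two-dimensional grid, whereas the dynamic programming approach only compares one-dimensional candidates at each stage.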

Conclusion
The main conclusion of our work is that the method of dynamic programming can in principle be extended to deterministic systems with an uncertain gain, where the uncertainty about the gain is modelled by a coherent lower prevision, or by a set of linear previsions (probability measures).
But our general study of what conditions a generalised notion of optimality should satisfy for the Bellman approach to work is of some interest in itself too. In particular, besides an obvious extension of the well-known principle of optimality, another condition emerges that relates to the nature of the optimality operators per se: the optimality of a path should be invariant under the omission of non-optimal paths from the set of paths under consideration. If optimality is induced by a strict partial ordering of paths, then this second condition is satisfied whenever the existence of dominating optimal paths for non-optimal ones is guaranteed.
Another important observation is that, in contradistinction to P-maximality and M-maximality, the dynamic programming method cannot be used to solve optimisation problems corresponding to P-maximinity in general: for this notion the principle of optimality is not guaranteed to hold, in particular when the external additivity property is not satisfied.
It is possible to refine our results by considering an additional equivalence relation on paths expressing some notion of indifference (relating, for instance, paths with the same expected gain). This allows us to partition a set opt*(V) of optimal elements into equivalence classes of mutually indifferent paths. Any two paths in opt*(V) that belong to different equivalence classes are necessarily incomparable: the available information, modelled through P, does not allow us to choose between these two paths. A discussion of such matters presents no great conceptual difficulties, but has been omitted from the present paper due to limitations of space.
Throughout the paper we have assumed the system dynamics to be deterministic, that is, independent of ω. This greatly simplifies the discussion, still encompasses a large number of interesting applications, and does not suffer from the computational problems often encountered when dealing with non-deterministic dynamical systems, where in general the number of possible (random) paths tends to grow exponentially with the size of the state space X. However, we should note that dropping this assumption still leads to a Bellman-type equation, connecting operators of optimality associated with random states x : Ω → X. We intend to present our results about, and views on, this issue elsewhere.

Fig. 3. A More General Type of Dynamic Programming

Fig. 4. A Counterexample

sequence of controls. Such a path fixes a unique state trajectory x_• : [k, N] → X, which is defined recursively through x_k = x and x_{ℓ+1} = f(x_ℓ, u_ℓ, ℓ) for every ℓ ∈ [k, N − 1]. It is said to be admissible if x_ℓ ∈ X_ℓ for every ℓ ∈ [k, N] and u_ℓ ∈ U for every ℓ ∈ [k, N − 1]. We denote the unique map from the empty set ∅ to U by u_∅. If k = N, the control u_• does nothing: it is equal to u_∅. The unique path starting and ending at time k = N in x ∈ X is denoted by (x, N, u_∅). For example, U(x, N) = {(x, N, u_∅)} whenever x ∈ X_N and U(x, N) = ∅ otherwise. If we consider a path with final time M different from N, then we write (x, k, u_•)_M (assume k ≤ M ≤ N). Observe that (x, k, u_•)_k can be identified with (x, k, u_∅).