
Optimal Control of Probability on a Target Set for Continuous-Time Markov Chains

Chenglin Ma and Huaizhong Zhao

Abstract—In this article, a stochastic optimal control problem is considered for a continuous-time Markov chain taking values in a denumerable state space over a fixed finite horizon. The optimality criterion is the probability that the process remains in a target set before and at a certain time. The optimal value is a superadditive capacity of target sets. Under some minor assumptions on the controlled Markov process, we establish the dynamic programming principle, based on which we prove that the value function is a classical solution of the Hamilton-Jacobi-Bellman (HJB) equation on a discrete lattice space. We then prove that there exists an optimal deterministic Markov control under the compactness assumption on the control domain. We further prove that the value function is the unique solution of the HJB equation. We also consider the case starting from outside of the target set and give the corresponding results. Finally, we apply our results to two examples.

Index Terms—Controlled Markov chains, dynamic programming principle (DPP), Hamilton-Jacobi-Bellman (HJB) equation, optimal controls, risk probability criteria.

I. INTRODUCTION
Stochastic optimal control problems for Markov chains, also known as Markov decision processes (MDPs), have been widely studied due to their rich applications in real-world contexts, such as communication engineering [1], finance [6], queueing systems [21], control of epidemics [24], and so on. Existing articles mainly focus on MDPs with expected/average reward criteria; see, for example, [7], [9], [12], [17], [22], [25], and [26]. However, such a setup is not always suitable. For example, when measuring market risk in finance and economics, it is reasonable to minimize the probability of a loss exceeding a fixed value. Inspired by such real-world considerations, some authors started to study MDPs with risk probability criteria.
MDPs with risk probability criteria can be roughly divided into two kinds: the discrete-time case and the continuous-time case. For the discrete-time scenario, a general study can be found in [5] and [29]. Recently, a discrete-time optimal dividend problem with a risk probability criterion was considered in [28], the aim of which was to minimize the risk probability of reaching a given dividend goal before the time of ruin and to find the optimal dividend policy. In [13], a two-player nonzero-sum discrete-time stochastic game under the probability criterion was considered; it was shown that the optimal value function of each player is the unique solution of the corresponding optimality equation, and the existence of Nash equilibria was established under mild conditions.
Continuous-time MDPs with risk probability criteria were considered for the first time in [16]. Under some conditions, it was proved that the value function is a solution of the optimality equation. Following the publication of this work, there have been further works on continuous-time MDPs with risk probability criteria, such as [4] and [15]. Bhabak and Saha [4] studied a zero-sum stochastic game for continuous-time Markov chains. Under some assumptions, they showed the existence of the value of the game and characterized it as the unique solution of a pair of Shapley equations. Huo and Guo [15] dealt with finite-horizon continuous-time MDPs with unbounded transition rates and established the existence and uniqueness of a solution of the corresponding optimality equation. They also proved the existence of a risk probability optimal policy.
In this article, we would like to find the optimal control processes that maximize the probability that the controlled Markov process stays in a target set during the fixed finite horizon [0, T]. Such a risk probability setup can be regarded as the surviving probability on a safety set in many real-world contexts, such as keeping the number of cancer cells in a patient within a certain safety range. In the context of this article, the admissible controls we consider are processes taking values in a compact control domain and adapted to the natural filtration generated by the underlying Markov chain. We first give the dynamic programming principle (DPP) by considering a family of stochastic optimal control subproblems initiated at different times and states. We find that the global optimal control is also locally optimal over any second half-horizon [t, T] in the sense of conditional expectation. We then establish the relationship among these subproblems by deriving the so-called Hamilton-Jacobi-Bellman (HJB) equation. This is a nonlinear first-order differential-difference equation. The value function is a classical solution of the HJB equation due to its right differentiability with respect to (w.r.t.) the time variable. Under the compactness assumption on the control domain, we give the existence theorem of optimal deterministic Markov controls for the dynamic programming (DP) problem by employing a measurable selection theorem (see [2] and references therein). We further prove that the value function is the unique solution of the HJB equation. We then also consider the case starting from outside of the target set, maximizing the probability on the target set from any time t_0 ∈ (0, T].

A. Problem Statement
Let (Ω, F, P) be a complete probability space on which a continuous-time Markov chain {X_t, 0 ≤ t ≤ T} is defined over the finite horizon [0, T] for a fixed T > 0. We denote by {F_t, 0 ≤ t ≤ T} the natural filtration generated by X(·) and augmented by all P-null sets of F, that is, F_t = σ{X_s, 0 ≤ s ≤ t} ∨ N_P, where N_P is the set of all P-null sets of F. The state space S of the process X_t is a denumerable space endowed with the discrete topology. A finite subset B ⊂ S is the target set, with B^c := S \ B. The control domain U ⊂ R is a nonempty compact set equipped with its Borel σ-algebra B(U).

Define F_s^t = σ{X_r^{t,x}, r ∈ [t, s]} ∨ N_P. The admissible control set U_t consists of processes taking values in U and adapted to {F_s^t}; we write U := U_0. An admissible control u(·) is called a deterministic Markov control if the value of u(·) depends only on the current time and state. Denote by M the set of all deterministic Markov controls over [0, T] and by M_t the set of deterministic Markov controls over [t, T]. Obviously, M ⊂ U and M_t ⊂ U_t for all 0 < t ≤ T.

For any given u(·) ∈ U, the process X_t^{u(·)} is assumed to satisfy the regularity condition lim_{s↓t} P{X_s^{t,x,u(·)} = x'} = δ_{xx'}, where δ_{xx'} = 1 if x = x' and 0 otherwise, which implies that the process X_t^{u(·)} has only finitely many jumps over the finite horizon [0, T] with probability one. The superscript of X_s^{t,x,u(·)} denotes the initial time and state under consideration, and X_s^{0,x_0,u(·)} is abbreviated as X_s^{x_0,u(·)}. For each u(·) ∈ U_t (with u_t = u ∈ U), the infinitesimal transition probabilities of X_t^{u(·)} are given by

P{X_{t+δ}^{u(·)} = j | X_t^{u(·)} = i} = δ_{ij} + λ_{ij}(t, u)δ + o(δ), as δ ↓ 0    (1)

where λ_{ij}(t, u) are the transition rates of X_t and are supposed to satisfy the following assumptions.
1) The transition rates are continuous, that is, for any i, j ∈ S, the map (t, u) ↦ λ_ij(t, u) is continuous on [0, T] × U.
2) The transition rates are conservative, that is, for any (t, u) ∈ [0, T] × U and i ∈ S, we have Σ_{j∈S} λ_ij(t, u) = 0.
3) The transition rates are stable, that is, for any i ∈ S, sup_{(t,u)∈[0,T]×U} (−λ_ii(t, u)) < ∞.

Remark 1:
1) All the properties of the controlled Markov process are determined by the transition rates, so it is sufficient to make assumptions about the transition rates only. However, in practice, identifying the transition rates is a difficult task for which statistical analysis is useful. This is not the aim of this article, so we will not expand on this aspect here and leave it for a future project.
2) For each fixed i ∈ S, the transition rate λ_ij is bounded due to the continuity and stability.
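To make the standing assumptions concrete, the following minimal Python sketch encodes a rate matrix λ_ij(t, u) for a three-state controlled chain and checks the conservativeness and stability conditions numerically on a grid. The specific rates, the control domain [1, 2], and the state space are hypothetical illustrations, not data from this article.

    import numpy as np

    # Hypothetical transition rates lambda_{ij}(t, u) for S = {0, 1, 2}.
    # The functional form is an illustrative assumption chosen only to
    # satisfy continuity in (t, u), conservativeness, and stability.
    def rates(t, u):
        """Rate matrix Q(t, u) = (lambda_{ij}(t, u))_{i,j in S}."""
        return np.array([
            [-(1.0 + u),      1.0 + u,         0.0],
            [0.5,            -(0.5 + u * t),   u * t],
            [0.0,             2.0,            -2.0],
        ])

    # Numerical check of the assumptions on a (t, u) grid, with U = [1, 2].
    for t in np.linspace(0.0, 1.0, 5):
        for u in np.linspace(1.0, 2.0, 5):
            Q = rates(t, u)
            assert np.allclose(Q.sum(axis=1), 0.0)  # conservative rows
            assert np.isfinite(Q).all()             # stable (bounded) rates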

Problem (S):
The optimal control problem we are interested in is to maximize the utility function given by the probability of staying in the given target set B:

J(t, x, u(·)) := P{X_s^{t,x,u(·)} ∈ B, ∀s ∈ [t, T]}, (t, x) ∈ [0, T] × B    (2)

over u(·) ∈ U_t. The value function associated with (2) is defined as

V(t, x) := sup_{u(·)∈U_t} J(t, x, u(·)), (t, x) ∈ [0, T] × B.    (3)

Note the boundary condition

V(T, x) = 1 for x ∈ B, and V(t, x) = 0 for (t, x) ∈ [0, T] × B^c.    (4)

Remark 2:
1) Let τ_1 := inf{s ≥ t : X_s ≠ x | X_t = x} denote the first jump time of X_s after t. By (1), for any control u(·) ∈ M_t, we have

J(t, x, u(·)) ≥ P{τ_1 > T} = exp( ∫_t^T λ_xx(s, u_s) ds ) > 0

which implies that V(t, x) > 0 always holds for each x ∈ B.
2) For a fixed x ∈ B, let τ(u(·)) := inf{s ≥ t : X_s^{t,x,u(·)} ∉ B} denote the first exit time of X_s^{t,x,u(·)} from B; then, the utility function (2) can be rewritten as J(t, x, u(·)) = P{τ(u(·)) > T}.

Remark 3: The optimality criterion in this article is similar to that in [14] on the optimal risk probability for first passage models of semi-MDPs, since there is no reward/cost structure in these models. The admissible controls we consider here are stochastic processes adapted to the natural filtration generated by the underlying Markov chain, which differ from the policies studied in [14] as well as in some continuous-time MDPs with risk probability criteria, such as [15] and [16]. In those works, the policies change only at the jump points.
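As a sanity check on the criterion (2), the sketch below estimates J(0, x, u(·)) by Gillespie-type simulation under a fixed deterministic Markov control, reusing the hypothetical rates(t, u) defined earlier with the time argument frozen at 0 so that the chain is time-homogeneous and the exponential holding times are exact. The target set, horizon, and feedback map are again assumptions chosen only for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    T, B = 1.0, {0, 1}                  # illustrative horizon and target set
    ubar = {0: 1.0, 1: 1.5, 2: 2.0}     # a fixed deterministic Markov control

    def survives(x0):
        """One path: does X stay in B on the whole of [0, T]?"""
        t, x = 0.0, x0
        while True:
            Q = rates(0.0, ubar[x])       # t frozen: time-homogeneous chain
            lam = -Q[x, x]                # total jump intensity out of x
            t += rng.exponential(1.0 / lam)
            if t >= T:
                return True               # no exit from B before the horizon
            p = Q[x].clip(min=0.0) / lam  # distribution of the next state
            x = int(rng.choice(len(p), p=p))
            if x not in B:
                return False              # first exit time tau <= T

    n = 20_000
    print(sum(survives(0) for _ in range(n)) / n)  # estimate of J(0, 0, u(.))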

II. DYNAMIC PROGRAMMING PRINCIPLE
One of the most commonly used approaches to solving stochastic optimal control problems is to establish the DPP, based on the pioneering work of Bellman [3]. The basic idea of the DPP is to consider a family of optimal control subproblems initiated at different times and states, then to establish the connections among these subproblems and finally solve all of them. However, for any s ∈ (t, T], X_s^{t,x} is a random variable on (Ω, F, P) rather than a deterministic state in S; it can nevertheless be regarded as almost surely deterministic under the conditional probability measure P{·|F_s^t}(ω) for each fixed ω ∈ Ω, in the sense that all the dynamics of X_·^{t,x} during the period [t, s] are known under the filtration F_s^t, as explained in [31]. Then, for each (t, x) ∈ [0, T) × B, s ∈ (t, T], and a given u(·) ∈ U_t, taking the conditional expectation yields

J(t, x, u(·)) = E[ I_{{X_r^{t,x,u(·)} ∈ B, ∀r∈[t,s]}} J(s, X_s^{t,x,u(·)}, u(·)) ]    (7)

where I_F denotes the indicator function of a set F ∈ F. Here, we used the flow relation X_r^{t,x,u(·)} = X_r^{s,X_s^{t,x,u(·)},u(·)} for r ∈ [s, T].

Theorem 1: For any (t, x) ∈ [0, T) × B and s ∈ (t, T], the value function V(t, x) satisfies the DP equation

V(t, x) = sup_{u(·)∈U_t} E[ I_{{X_r^{t,x,u(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,u(·)}) ].    (8)

In particular, for a sufficiently small δ > 0, we have

V(t, x) = sup_{u(·)∈U_t} E[ V(t + δ, X_{t+δ}^{t,x,u(·)}) ] + o(δ).    (9)

Proof: First, for any ε > 0, there exists û(·) such that, for any t ≤ s ≤ T and y ∈ B,

J(s, y, û(·)) ≥ V(s, y) − ε    (10)

and there exists u(·) ∈ U_t such that

E[ I_{{X_r^{t,x,u(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,u(·)}) ] ≥ sup_{ũ(·)∈U_t} E[ I_{{X_r^{t,x,ũ(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,ũ(·)}) ] − ε.    (11)

Let ū_r := u_r for r ∈ [t, s) and ū_r := û_r for r ∈ [s, T], which forms an admissible control in U_t. Thus, by (7) and combining the 2ε from (10) and (11), we have

V(t, x) ≥ J(t, x, ū(·)) ≥ sup_{ũ(·)∈U_t} E[ I_{{X_r^{t,x,ũ(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,ũ(·)}) ] − 2ε.

Since ε > 0 is arbitrary, we have

V(t, x) ≥ sup_{u(·)∈U_t} E[ I_{{X_r^{t,x,u(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,u(·)}) ].    (12)

Conversely, for any ε > 0, there exists u(·) ∈ U_t such that J(t, x, u(·)) ≥ V(t, x) − ε; by (7) and J(s, ·, u(·)) ≤ V(s, ·), this yields the reverse of (12), and (8) follows.
Moreover, the transition probabilities (1) imply that the probability of two or more jumps occurring in a small interval of length δ is o(δ). In fact, the event {X_{t+δ}^{t,x,u(·)} = x' ∈ B} includes two distinct cases: Ω_1 := {X_s stays in B during the whole period [t, t + δ] and X_{t+δ}^{t,x,u(·)} = x'} and Ω_2 := {X_s jumps out of B and then returns, with X_{t+δ}^{t,x,u(·)} = x' ∈ B}. Since Ω_2 requires at least two jumps in [t, t + δ], (1) yields P(Ω_2) = o(δ). That is,

P{X_r^{t,x,u(·)} ∈ B, ∀r ∈ [t, t + δ], X_{t+δ}^{t,x,u(·)} = x'} = P{X_{t+δ}^{t,x,u(·)} = x'} + o(δ), x' ∈ B.

Then, by the boundary condition (4) of V(t, x), under which the states outside B contribute nothing, (8) with s = t + δ gives

V(t, x) = sup_{u(·)∈U_t} E[ V(t + δ, X_{t+δ}^{t,x,u(·)}) ] + o(δ)

which is (9). This completes the proof.

Remark 4:
The DP equations we obtained in Theorem 1 are different from the optimality equations given in [15] and [16]. The latter describe the relationship of value functions at successive jump points, while the DP equations here describe the relationship of value functions at any two different time points. Furthermore, a nonlinear partial differential equation, i.e., the HJB equation, is obtained in the next section from the DPP. This was not obtained from the optimality equations in [15] and [16]. Our results on the DP equation and the HJB equation for this kind of problem are new.
Remark 5: If T = ∞, we can also obtain a DP equation, in which the freely chosen time s should be replaced by suitable stopping times. Some examples of the DPP involving stopping times can be found in [18], [30], and references therein.
From Remark 2, P{X_r^{t,x,u*(·)} ∈ B, ∀r ∈ [t, s]} > 0, and by the definitions of the utility function and the value function, one has

J(s, X_s^{t,x,u*(·)}, u*(·)) = V(s, X_s^{t,x,u*(·)}) P-a.s. on {X_r^{t,x,u*(·)} ∈ B, ∀r ∈ [t, s]}.

Naturally, the DP equations (8) and (9) turn out to be

V(t, x) = E[ I_{{X_r^{t,x,u*(·)} ∈ B, ∀r∈[t,s]}} V(s, X_s^{t,x,u*(·)}) ]    (15)

and

V(t, x) = E[ V(t + δ, X_{t+δ}^{t,x,u*(·)}) ] + o(δ)    (16)

respectively.
If u*(·) ∈ U_t is an optimal control of Problem (S), Theorem 2 means that u*(·) restricted to [s, T] is also optimal P-almost surely.

III. HJB EQUATION AND EXISTENCE OF OPTIMAL CONTROL
From the DPP obtained in the last section, we now establish the relationships among these optimal control subproblems by deriving the so-called HJB equation. In classical DP problems, the state processes are usually characterized as stochastic differential equations driven by standard Brownian motions, which implies that the state processes are continuous and the HJB equations are normally backward partial differential equations with continuous terminal conditions; readers are referred to [12], [31], and references therein for more details. Different from the classical HJB equation, the HJB equation (20) derived below in the context of this article is a backward differential-difference equation with a Dirichlet boundary value and a terminal value given by an indicator function of the surviving set. We first discuss the infinitesimal generator of the controlled Markov jump process.
Define the operator L as the infinitesimal generator of X_t. For any (t, x) ∈ [0, T) × S, u(·) ∈ U_t (with u_t = u ∈ U), and bounded function f, from (1), we have

L^u f(t, x) := Σ_{y∈S} λ_xy(t, u) f(t, y).    (17)

For each deterministic Markov control u(·) ∈ M_t, define

M_s^f := f(X_s^{t,x,u(·)}) − ∫_t^s L^{u_r} f(X_r^{t,x,u(·)}) dr, s ∈ [t, T].

Then, M_s^f is {F_s^t}-adapted and integrable since the transition rates of the process X_t are assumed to be stable. Moreover, for ŝ ∈ [t, s], from the definition of the infinitesimal generator, it is easy to see that

E[ f(X_s^{t,x,u(·)}) − f(X_ŝ^{t,x,u(·)}) | F_ŝ^t ] = E[ ∫_ŝ^s L^{u_r} f(X_r^{t,x,u(·)}) dr | F_ŝ^t ].

It turns out that E[M_s^f | F_ŝ^t] = M_ŝ^f. Therefore, for any t ≤ ŝ ≤ s ≤ T, {M_s^f}_{t≤s≤T} is an {F_s^t}-martingale with mean f(x). Thus, we obtain Dynkin's formula

E[ f(X_s^{t,x,u(·)}) ] = f(x) + E[ ∫_t^s L^{u_r} f(X_r^{t,x,u(·)}) dr ].    (18)

Similarly, we can prove that, for any function f(t, x) which is bounded in x and right differentiable w.r.t. t, we have

E[ f(s, X_s^{t,x,u(·)}) ] = f(t, x) + E[ ∫_t^s ( f_r^+(r, X_r^{t,x,u(·)}) + L^{u_r} f(r, X_r^{t,x,u(·)}) ) dr ].    (19)

Theorem 3: The value function V(t, x) is right differentiable w.r.t. t and is a classical solution of the following first-order nonlinear differential-difference equation:

V_t^+(t, x) + sup_{u∈U} L^u V(t, x) = 0, (t, x) ∈ [0, T) × B    (20)

with boundary condition (4), where V_t^+(·, x) denotes the right derivative of V w.r.t. t.
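For a finite chain, the generator (17) is just a matrix-vector product, and Dynkin's formula (18) can be checked against the matrix exponential over a short interval. The sketch below does this with the hypothetical rates(t, u) introduced earlier, frozen in time so that exp(Qδ) is the exact transition matrix; scipy is assumed to be available.

    import numpy as np
    from scipy.linalg import expm

    f = np.array([0.0, 1.0, 0.5])   # an arbitrary bounded test function on S
    t, u, delta = 0.3, 1.5, 1e-4
    Q = rates(t, u)                 # rates frozen at time t for this check
    Lf = Q @ f                      # (L^u f)(x) = sum_y lambda_{xy} f(y)

    P = expm(Q * delta)             # exact transition matrix over delta
    # Dynkin (18) to first order: E[f(X_{t+delta})] - f(x) ~ delta * (L^u f)(x)
    print(np.max(np.abs((P @ f - f) - delta * Lf)))  # should be O(delta^2)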
Proof: Consider a sufficiently small δ > 0. By (9), for any x ∈ B, we have

V(t, x) = sup_{u(·)∈U_t} E[ V(t + δ, X_{t+δ}^{t,x,u(·)}) ] + o(δ).

For any u(·) ∈ U_t (with u_t = u ∈ U), we have

V(t, x) ≥ E[ V(t + δ, X_{t+δ}^{t,x,u(·)}) ] + o(δ).

From the definition (17) of the generator of X_t and taking the supremum over u ∈ U, we have

V(t, x) − V(t + δ, x) ≥ δ sup_{u∈U} L^u V(t + δ, x) + o(δ).    (21)

Conversely, for any ε > 0, there exists û(·) ∈ U_t (with û_t = û ∈ U) such that

V(t, x) ≤ E[ V(t + δ, X_{t+δ}^{t,x,û(·)}) ] + εδ + o(δ) ≤ V(t + δ, x) + δ L^û V(t + δ, x) + εδ + o(δ).

Since ε > 0 is arbitrary, taking the supremum over u ∈ U, we have

V(t, x) − V(t + δ, x) ≤ δ sup_{u∈U} L^u V(t + δ, x) + εδ + o(δ).    (22)

Since the transition rates of X_t are continuous and V ∈ [0, 1], then

lim_{δ↓0} sup_{u∈U} L^u V(t + δ, x) = sup_{u∈U} L^u V(t, x).    (23)

This, combining (21) and (22), leads to

lim_{δ↓0} (V(t + δ, x) − V(t, x))/δ = −sup_{u∈U} L^u V(t, x).

Then, the limit V_t^+(t, x) := lim_{δ↓0} (V(t + δ, x) − V(t, x))/δ exists and is finite and unique because of the uniqueness of the supremum. Thus, the HJB equation (20) holds, and the value function V(t, x) is its classical solution. It is easy to see that the boundary conditions are also satisfied.
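Although the HJB equation (20) involves only a right time derivative, the DP equation (9) suggests the explicit backward scheme sketched below: starting from the terminal condition V(T, ·) = I_B and marching backwards with step δ, approximate V(t, x) by the maximum over a finite grid of U of V(t+δ, x) + δ L^u V(t+δ, x), with V pinned to 0 outside B. The state space, rates, target set, and control grid are the hypothetical ones from the earlier sketches; the scheme also records, at each (t, x), a maximizing control value, anticipating the measurable selection used below.

    import numpy as np

    T, delta, n_states = 1.0, 1e-3, 3
    U_grid = np.linspace(1.0, 2.0, 11)       # finite grid over the compact U
    B_mask = np.array([True, True, False])   # hypothetical target set B

    V = B_mask.astype(float)                 # terminal condition V(T, .) = I_B
    ubar = np.zeros(n_states)                # stores a maximizing control value
    t = T
    while t > delta / 2:
        t -= delta
        V_new = np.zeros(n_states)           # V(t, x) = 0 off B automatically
        for x in range(n_states):
            if not B_mask[x]:
                continue
            # V(t,x) ~ max_u [ V(t+d,x) + d * sum_y lambda_{xy}(t,u) V(t+d,y) ]
            cand = [V[x] + delta * (rates(t, u)[x] @ V) for u in U_grid]
            k = int(np.argmax(cand))
            V_new[x], ubar[x] = cand[k], U_grid[k]
        V = V_new
    print(V, ubar)                           # V(0, .) and a feedback control at 0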
Remark 6: In classical DP problems, the differentiability of the value function V w.r.t. the time variable and the state variable is normally unattainable. One usually can only prove that the value function is a viscosity solution [8]. The existence and uniqueness of viscosity solutions to the HJB equations can be found in [8] and some other literature, such as [18], [19], and [27]. In this article, we proved that the value function is a classical solution of the HJB equation (20); the uniqueness will be given in Theorem 5.
Since the transition rates of the process X_t are assumed to be stable, the function (u, t, x) ↦ L^u V(t, x) is continuous w.r.t. u in U and measurable w.r.t. (t, x) in [0, T) × B. Thus, by the compactness assumption on U and the measurable selection theorem [2], there exists a measurable control function ū(·, ·): [0, T) × B → U such that

L^{ū(t,x)} V(t, x) = sup_{u∈U} L^u V(t, x), (t, x) ∈ [0, T) × B.    (24)

Define the deterministic Markov control u*(·) by the feedback

u*_s := ū(s, X_s^{t,x,u*(·)}), s ∈ [t, T).    (25)

Next, we will prove that the deterministic Markov control u*(·) defined by (25) is an optimal control of Problem (S). This result provides the correspondence between the optimal control in terms of the HJB equation and that of the utility surviving probability.
Theorem 4: The deterministic Markov control u*(·) ∈ M defined by (25) is an optimal control of Problem (S), that is, V(t, x) = J(t, x, u*(·)) for all (t, x) ∈ [0, T) × B.

Proof: Let u*(·) be the deterministic Markov control defined by (25). Then, by Dynkin's formula (19), we have

E[ V(s, X_s^{t,x,u*(·)}) ] = V(t, x) + E[ ∫_t^s ( V_r^+ + L^{u*_r} V )(r, X_r^{t,x,u*(·)}) dr ].

Note that this does not directly imply that V(t, x) = J(t, x, u*(·)). To proceed, we consider, for a sufficiently small δ > 0 such that (T − t)/δ is an integer, the discretization with step size of length δ.

Furthermore, summing over the states x' ∈ B, since for each x ∈ B, V(T, x) = 1, by iteration over the grid points t, t + δ, ..., T, we have

V(t, x) = E[ I_{{X_r^{t,x,u*(·)} ∈ B, ∀r∈[t,T]}} V(T, X_T^{t,x,u*(·)}) ] + ((T − t)/δ) o(δ) = J(t, x, u*(·)) + ((T − t)/δ) o(δ).

In the limit of δ → 0, we have

V(t, x) = J(t, x, u*(·)).    (26)

Thus, u*(·) is an optimal control of Problem (S).
Theorem 4 says that there exists an optimal deterministic Markov control of Problem (S); thus, the value function can be rewritten as

V(t, x) = sup_{u(·)∈M_t} J(t, x, u(·)).    (27)

Theorem 3 says that V is a classical solution of the HJB equation (20), and Theorem 4 gives the existence theorem of optimal controls, based on which we have the following verification theorem.

Theorem 5: The value function V is the unique solution of the HJB equation (20).
Proof: We need to prove that if there exists another bounded and measurable function ψ(t, x) solving (20), which is right differentiable w.r.t. t for each x ∈ B, then

ψ(t, x) = V(t, x), (t, x) ∈ [0, T] × S.    (28)

In fact, we only need to prove that (28) holds for each (t, x) ∈ [0, T) × B. By Dynkin's formula (19), for each u(·) ∈ M_t, we have

E[ ψ(s, X_s^{t,x,u(·)}) ] = ψ(t, x) + E[ ∫_t^s ( ψ_r^+ + L^{u_r} ψ )(r, X_r^{t,x,u(·)}) dr ].

Considering the discretization of the process X_s^{t,x,u(·)} with step size of length δ such that (T − t)/δ is an integer, and arguing as in the proof of Theorem 4, we have, for each u(·) ∈ M_t,

J(t, x, u(·)) ≤ ψ(t, x), and hence V(t, x) ≤ ψ(t, x).    (30)

Conversely, by the measurable selection theorem, there exists a deterministic Markov control u*(·) ∈ M_t attaining the supremum in (20) with V replaced by ψ. Using the discretization method used to derive (26), we have

J(t, x, u*(·)) = ψ(t, x), and hence ψ(t, x) ≤ V(t, x).    (31)

Combining (30) and (31), we have ψ(t, x) = V(t, x). This completes the proof.
It is not difficult to find that the control u*(·) constructed in the proof of Theorem 5 is also an optimal control of Problem (S).

IV. THE CASE STARTING FROM OUTSIDE OF THE TARGET SET
In the previous sections, we discussed the stochastic optimal control problem whose utility function is given by the probability of staying in a given set B when X_0 = x_0 ∈ B. However, x_0 does not necessarily belong to B. For the case that x_0 ∉ B and any t_0 ∈ (0, T], we can consider the optimal control problem of finding an optimal control that maximizes the probability P{X_s^{t,x,u(·)} ∈ B, ∀s ∈ [t_0, T]}. In this section, we will find the optimal control within M. In fact, same as in the proof of Theorem 1, we have

P{X_s^{t,x,u(·)} ∈ B, ∀s ∈ [t_0, T]} = E[ Σ_{x'∈B} I_{{X_{t_0}^{t,x,u(·)} = x'}} P{X_s^{t_0,x',u(·)} ∈ B, ∀s ∈ [t_0, T]} ].    (32)

We have proved that there exists u*(·) ∈ M_{t_0} maximizing the inner probabilities. Denote these maximal probabilities by a_{x'}; then, (32) turns out to be

E[ Σ_{x'∈B} a_{x'} I_{{X_{t_0}^{t,x,u(·)} = x'}} ].

It turns out that the optimal control problem that we consider in this section is to maximize the utility function

J(t, x, u(·)) := E[ g(X_{t_0}^{t,x,u(·)}) ], (t, x) ∈ [0, t_0] × S

over u(·) ∈ M_t, where g is a bounded function. Then, the value function is

V(t, x) := sup_{u(·)∈M_t} J(t, x, u(·)), (t, x) ∈ [0, t_0] × S.

With proofs similar to those of Theorems 1, 3, 4, and 5, we have the following theorem.

Theorem 6: The value function V(t, x) satisfies the DP equation

V(t, x) = sup_{u(·)∈M_t} E[ V(s, X_s^{t,x,u(·)}) ], t ≤ s ≤ t_0.

The value function V is the unique solution of the HJB equation

V_t^+(t, x) + sup_{u∈U} L^u V(t, x) = 0, (t, x) ∈ [0, t_0) × S, V(t_0, x) = g(x).

There exists a deterministic Markov control u*(·) ∈ M_t such that V(t, x) = J(t, x, u*(·)), (t, x) ∈ [0, t_0] × S.
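In this setting, the computation naturally splits into two stages, sketched below under the same hypothetical data as before: first obtain g(x) = V(t_0, x) (zero off B) from the backward scheme after Theorem 3 run on [t_0, T], and then propagate backwards on [0, t_0] over the whole state space S with terminal reward g and no pinning, which is exactly the HJB problem of Theorem 6.

    import numpy as np

    t0 = 0.5
    g = V.copy()          # stand-in for V(t0, .): zero outside B by (4)
    W = g.copy()          # W approximates the value function of Theorem 6
    t = t0
    while t > delta / 2:
        t -= delta
        # no boundary condition here: the chain may travel through B^c
        W = np.array([max(W[x] + delta * (rates(t, u)[x] @ W) for u in U_grid)
                      for x in range(n_states)])
    print(W)              # probability-maximizing value from any start in S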

V. EXAMPLES
Example 1: Let S = {1, 2, 3} be the state space of the Markov chain X_t, t ∈ [0, T], and let B = {2} be the target set. By the discussion in the previous sections, the value function V(t, x) satisfies the HJB equation (20). By the boundary condition of V, we have V(t, 1) = V(t, 3) = 0 for all t ∈ [0, T], so (20) reduces to

V_t^+(t, 2) + sup_{u∈U} λ_22(t, u) V(t, 2) = 0, V(T, 2) = 1.

That is,

V(t, 2) = exp( ∫_t^T sup_{u∈U} λ_22(s, u) ds ).

By the existence theorem of optimal control, we have u*(s) = ū(s, 2), where ū(s, 2) attains sup_{u∈U} λ_22(s, u).

Example 2: Consider a controlled time-homogeneous birth-and-death process, denoted by {X_t}_{0≤t≤T}, as an example of the general controlled Markov processes considered in the main result of this article. The state space of the process X_t is S = {0, 1, ..., K}, and the control domain is U = [a, b]. The transition rates λ_{x,x±1}(u) depend on a birth parameter r, a death parameter d, and the control u. Our optimal control problem is to maximize the utility function J(t, x, u(·)) = P{X_s^{t,x,u(·)} ∈ B, ∀s ∈ [t, T]}, and the corresponding value function is V(t, x) = sup_{u(·)∈U_t} J(t, x, u(·)). This example satisfies all the conditions set in this article. Recall the DP equation (8) and the HJB equation (20). By (24), the optimal control is the control maximizing L^u V(t, x), which takes the form G_x (u − 2x/K)^2 up to terms independent of u; if G_x is negative, we need to find u to minimize (u − 2x/K)^2. Because of the right continuity of X_t, it is easy to see from (25) and Theorem 4 that the control process is right continuous and therefore measurable. Unfortunately, the partial derivative in the HJB equation is only a right derivative, and the HJB equation is a backward differential-difference equation, so it cannot be used directly for the simulation of this stochastic optimal control problem. We can, however, investigate the properties of the optimal control and simulate the value function using the DP equation.
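The closed form in Example 1 can be cross-checked numerically under the hypothetical three-state rates used throughout the sketches, with S = {0, 1, 2} standing in for {1, 2, 3} and B = {1} for {2}. Assuming the rates are additionally frozen in time so that λ_11 does not depend on s, the closed form collapses to V(t, 1) = exp((T − t) · max_u λ_11(u)); the snippet below compares this with the backward scheme.

    import numpy as np

    T, delta = 1.0, 1e-4
    U_grid = np.linspace(1.0, 2.0, 11)
    t_frozen = 0.5                        # freeze time so rates are constant
    lam11 = np.array([rates(t_frozen, u)[1, 1] for u in U_grid])
    closed_form = np.exp(T * lam11.max()) # exp((T - t) max_u lambda_11(u)), t = 0

    v = 1.0                               # numerical solution of (20) on B = {1}
    for _ in range(int(T / delta)):
        v = max(v + delta * rates(t_frozen, u)[1, 1] * v for u in U_grid)
    print(closed_form, v)                 # the two values should nearly agree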
Step 1: Set the terminal values V(T, x) = 1 for x ∈ B and V(t, x) = 0 for x ∉ B. Step 2: Iterate the DP equation (9) backwards on a time grid of step δ, maximizing over the finite control grid U at each state x ∈ B and each time step. Step 3: Plot V(0, x). We calculate V(0, x) for U = {1, 1.01, 1.02, ..., 2} and plot the graph in Fig. 1 as the curve in red dotted line. We also take u = 1, 1.5, and 2 as three different fixed values, calculate V(0, x) according to the above algorithm without seeking the optimal control u, and plot the graphs x → V(0, x) in Fig. 1 as the curves in orange, blue, and black dotted lines.
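A compact, runnable rendering of Steps 1-3 is sketched below. Since the exact birth-and-death rates of Example 2 are not reproduced here, the logistic-type rates (birth r·u·x(1 − x/K), death d·x) are a hypothetical stand-in chosen only to produce a concrete controlled birth-and-death chain on S = {0, ..., K} with control grid U = {1, 1.01, ..., 2} and target set B = {30, ..., 60}; the values of r, d, T, and δ are likewise illustrative.

    import numpy as np

    K, r, d, T, delta = 60, 0.5, 0.5, 1.0, 1e-3
    U_grid = np.arange(1.0, 2.0 + 1e-9, 0.01)      # U = {1, 1.01, ..., 2}
    B = np.zeros(K + 1, dtype=bool)
    B[30:] = True                                  # B = {30, ..., 60}

    V = B.astype(float)                            # Step 1: V(T, .) = I_B
    for _ in range(int(T / delta)):                # Step 2: backward DP (9)
        V_new = np.zeros(K + 1)
        for x in np.flatnonzero(B):
            best = -np.inf
            for u in U_grid:
                birth = r * u * x * (1.0 - x / K)  # hypothetical birth rate
                death = d * x                      # hypothetical death rate
                LV = death * (V[x - 1] - V[x])
                if x < K:
                    LV += birth * (V[x + 1] - V[x])
                best = max(best, V[x] + delta * LV)
            V_new[x] = best
        V = V_new
    print(V[30:])                                  # Step 3: plot x -> V(0, x)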
Remark 8: 1) The difference between the optimal surviving probabilities under a proper control (taking U = {1, 1.01, ..., 2}) and the surviving probabilities with a fixed u is clearly shown, especially for x near the top end of the target set. It is easy to see that the red line is higher than or equal to the other three lines for each x.
2) The optimal control process is recorded as a matrix, which provides the optimal policy for each DP problem initiated at each state and time.
We also solve the optimal control problems for the surviving probability with U = {1, 1.01, ..., 2}, B = {30, 31, ..., 60}, and different values of d (r is assumed to be the same as before): d = 0.6r, 0.8r, r, 1.2r, and 1.4r. We plot them in Fig. 2. It is noted that the probability V(0, x) increases as d decreases for each x ∈ B, which means that decreasing the death rate helps the controlled Markov chain stay in a safety range when the birth rate remains unchanged. We also simulate value functions according to the above algorithm (Steps 1-3) for B = {0, 1, ..., 30} and calculate them with U = {1, 1.01, ..., 2} and different values of d (r is assumed to be the same as before): d = 0.6r, 0.8r, r, 1.2r, and 1.4r. We plot them in Fig. 3.
Different from Fig. 2, the value function V(0, x) in Fig. 3 increases as d increases for each x ∈ B, under the condition that r remains unchanged. An example of the model in Fig. 2 is a human population model staying in a set of relatively moderate or large size; in this case, decreasing the death rate can make a substantial difference only when the population has a relatively moderate size. The model in Fig. 3 corresponds, e.g., to a cancer cell population model, where a smaller number, or extinction, of cancer cells is desirable. In the latter case, increasing the death rate improves the probability of keeping the number of cancer cells in the safety target set.
We also consider the case starting from outside of the target set B and use the value function V(0, x) with U = {1, 1.01, 1.02, ..., 2} (see the red line in Fig. 1) as the terminal condition of the value function V(T, x). Using the algorithm as explained in Step 2 with U = {1, 1.01, ..., 2}, we carry out our numerical computation and plot the result in Fig. 4. Note that, in this case, the algorithm runs on the whole state space S, not just on the target set B.

VI. CONCLUSION
In this article, we consider continuous-time MDPs and derive the DPP and HJB equation for the optimal surviving probability V(t, x) = sup_{u(·)∈U_t} P{X_s^{t,x,u(·)} ∈ B, ∀s ∈ [t, T]}, instead of using the optimality equations as in [15] and [16]. The optimality criterion we consider is the risk probability that the first exit time of the controlled Markov chain from a target set exceeds a fixed value. Such a setup is applicable in many real-world contexts. The main effort of this article is to establish the DPP, derive the HJB equation, prove the existence of optimal controls, and verify that the value function is the unique solution of the HJB equation. The problem is not covered by the traditional stochastic optimal control problem and the associated DPP and HJB equation. It is also not covered by the risk probability criteria considered in [15] and [16]. In fact, as V(t, x) depends on the surviving set B, we can define a set function V_{t,x}(B) = V(t, x). Then, it is easy to see that V_{t,x}: V(S) → [0, 1] is a superadditive capacity. In this sense, we obtained the HJB equation for a superadditive capacity given by an optimal surviving probability. This is in contrast to the traditional stochastic optimal control problems, where the value function is a sublinear expectation operator [23]. Given the recent progress in the ergodic theory of sublinear semigroups and capacities [10], [11], and since superlinear semigroups and sublinear semigroups are conjugate to each other, it would be very interesting to ask whether there exists an invariant superlinear distribution μ such that V_{t,μ} = μ, where V_{t,μ}(B) = (μ V_{t,·})(B). If so, is μ a continuous distribution, and is it ergodic? The existence of such an invariant distribution is important in applications, giving the equilibrium of a controlled superlinear process. We will address this in a future publication. The DPP and the HJB equation that we established in this article will be important tools for the analysis of the invariant distribution and its ergodicity.
In this article, we mainly study the stochastic control problem with a risk probability criterion for a given controlled Markov process model. To link our results directly with applications, the transition rates should be estimated by studying inverse problems and carrying out statistical analysis of datasets. This is clearly very important and worth pursuing in a future project. Controllability and observability for stochastic control systems are also interesting problems to investigate; we refer to [20] and [32] and will expand on this aspect in the future.