Complex Engineering Systems

Open Access Research Article

Prof. Qiugang Lu, Department of Chemical Engineering, Texas Tech University, P.O. Box 43121, Lubbock, TX 79409-3121, USA. E-mail: jay.lu@ttu.edu


© The Author(s) 2022. **Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Proportional-Integral-Derivative (PID) control has been the dominant control strategy in the process industry due to its simplicity in design and effectiveness in controlling a wide range of processes. However, most traditional PID tuning methods rely on trial and error for complex processes where insights about the system are limited and may not yield the optimal PID parameters. To address the issue, this work proposes an automatic PID tuning framework based on reinforcement learning (RL), particularly the deterministic policy gradient (DPG) method. Different from existing studies on using RL for PID tuning, in this work, we explicitly consider the closed-loop stability throughout the RL-based tuning process. In particular, we propose a novel episodic tuning framework that allows for an episodic closed-loop operation under selected PID parameters where the actor and critic networks are updated once at the end of each episode. To ensure the closed-loop stability during the tuning, we initialize the training with a conservative but stable baseline PID controller and the resultant reward is used as a benchmark score. A supervisor mechanism is used to monitor the running reward (e.g., tracking error) at each step in the episode. As soon as the running reward exceeds the benchmark score, the underlying controller is replaced by the baseline controller as an early correction to prevent instability. Moreover, we use layer normalization to standardize the input to each layer in actor and critic networks to overcome the issue of policy saturation at action bounds, to ensure the convergence to the optimum. The developed methods are validated through setpoint tracking experiments on a second-order plus dead-time system. Simulation results show that with our scheme, the closed-loop stability can be maintained throughout RL explorations and the explored PID parameters by the RL agent converge quickly to the optimum. 
Moreover, through simulation verification, the developed RL-based PID tuning method can adapt the PID parameters to changes in the process model automatically without requiring any knowledge about the underlying operating condition, in contrast to other adaptive methods such as the gain scheduling control.

Reinforcement learning, PID tuning, closed-loop stability, deterministic policy gradient

Proportional-Integral-Derivative (PID) controllers have been the dominant type of controller in the process industry, accounting for more than 80% of industrial process control ^{[1]}. This popularity arises from their simple structure, which eases design, and their effectiveness in controlling practical systems ^{[2, 3]}. However, some roadblocks still prevent PID design from reaching a higher level of automation and adaptivity for controlling complex processes. First, the control performance of PID depends strongly on the setting of its parameters. Although some empirical guidelines are available for tuning PID control even for complex systems, the resultant PID parameters may not be optimal ^{[4]}. As reported in ^{[5]}, most conventional methods use trial and error to tune PID for complex systems where insights about the system may be limited. Therefore, how to automate the search for optimal PID parameters has become an interesting topic ^{[6]}. Second, the parameters of traditional PID controllers generally remain fixed after tuning. As a result, they are mainly applicable to time-invariant systems. For time-varying systems, which are frequently encountered in practice (e.g., in robotics and vehicles), fixed-parameter PID controllers lack the adaptivity needed to maintain high control performance ^{[7]}.

Traditional PID tuning methods can be roughly classified into three categories: heuristic tuning, rule-based tuning, and optimization-based (or model-based) tuning ^{[8]}. Heuristic tuning often relies extensively on trial and error based on the understanding of the role of each PID parameter. This method is easy to implement; however, it can be time-consuming and cannot guarantee a robust or optimal solution ^{[8]}. Rule-based tuning establishes simple models (often a first-order plus dead-time model) to approximate the process based on a step test, and includes methods such as Ziegler-Nichols, Cohen-Coon, Kappa-Tau, and Lambda tuning ^{[9]}. These methods are widely used; however, they are sensitive to discrepancies between the true process and the approximation model. Optimization-based methods can find the optimal PID parameters given an accurate process model and the desired engineering specification. However, such methods require a sufficiently accurate model ^{[10]}, which can be difficult to obtain in practice.

Adaptive PID has been proposed in the literature to mitigate the above non-adaptivity issue. These control strategies can be classified into model-based approaches ^{[11, 12]}, evolutionary optimization-based approaches ^{[13]}, and neural network-based approaches ^{[14]}. Model-based approaches assume the availability of an accurate model that represents the true dynamics exactly to enable the adaptivity of PID controllers, which often does not hold in practice because such high-quality models are difficult to obtain for complex dynamical systems ^{[15]}. Evolutionary optimization-based approaches struggle to achieve real-time, online adaptation due to their slow computation speed ^{[16]}. Adaptive PID based on neural networks employs supervised learning to optimize network parameters for enabling adaptivity, but it is limited by the fact that the teaching signal is hard to acquire ^{[6]}. To this end, more advanced machine learning techniques have been adopted to promote user-free automatic tuning and adaptivity of traditional PID controllers.

Among emerging machine learning techniques for further empowering PID control, reinforcement learning (RL) has shown unique advantages in addressing the above issues. As a sequential decision-making method, RL can iteratively optimize an objective, usually a black-box one, via guided trial-and-error exploration ^{[17]}. First, due to its data-driven nature, automatic tuning of PID parameters can be achieved without expert knowledge. Second, RL can learn the optimal strategy online and in real time by interacting with the environment. PID tuning is essentially a black-box optimization problem where the relation between tuning parameters and control performance is unknown, and the optimal PID parameters are the ones that optimize the control performance. In light of this observation, RL has been preliminarily employed to facilitate PID tuning ^{[18]}. Typical model-free RL algorithms can be broadly grouped into value-based methods (e.g., Q-learning ^{[19]}, SARSA ^{[20]}, and temporal difference (TD) learning ^{[17]}) and policy-based algorithms (e.g., policy gradient ^{[21]}). Value-based methods learn an action or state value function, from which the optimal policy can be derived. In contrast, policy-based methods directly learn the optimal policy from experience. Therefore, they have better convergence properties and are effective for continuous action spaces. In particular, the deterministic policy gradient (DPG) method, proposed by DeepMind ^{[22]}, has received wide attention due to its sample efficiency, which comes from considering deterministic rather than stochastic policies. As a result, the off-policy actor-critic-based DPG algorithm has shown attractive prospects for addressing the PID tuning problem ^{[23]}.

Treating PID tuning as an RL problem has been reported in the literature, where different RL algorithms have been employed ^{[4, 24]}. Actor-critic learning based on a single RBF network is proposed in ^{[16]} to enable adaptive self-tuning PID control, which is later applied to wind turbine control ^{[25]}. The combination of fuzzy PID and RL is reported in ^{[1, 26]}. Applications of RL-based self-tuning PID include soccer robots ^{[27]}, multicopters ^{[28]}, and human-in-the-loop physical assistive control ^{[5]}. For process control, the reports in ^{[23, 29]} consider PID tuning in sample-by-sample and episodic modes, respectively. A sample-efficient deep RL method with episodic policy transfer is proposed for PID tuning in a robotic catheter system in ^{[30]}. Despite these rapid advancements, maintaining closed-loop stability during the RL search has not been thoroughly studied. Note that the ultimate goal of RL-based PID tuning is to accomplish online and adaptive tuning that maintains high performance even in the presence of changes in operating conditions. Such expectations require that the attempted PID parameters be located in the stable region to avoid an unstable response of the closed-loop process.

The main contribution of this work is to develop a stability-preserving automatic PID tuning approach based on RL, particularly the DPG algorithm. The majority of the literature in this direction so far has focused on establishing an adaptive auto-tuning framework with RL approaches. However, how to maintain the closed-loop stability during the exploration of the PID parameter space by the RL agent has not been well studied. For instance, as one of the few reports that briefly discuss the stability issue, the authors in ^{[29]} use an anti-windup strategy to force the input to be within the bound so as to maintain the boundedness of the output in the presence of unstable PID controller during RL search. However, this paper does not directly tackle the stability issue in a rigorous way. In contrast to existing work, our approach explicitly accounts for the closed-loop stability issue during RL exploration. For the proposed framework, we use a conservative but stable PID controller as a baseline. A supervisor mechanism monitors the closed-loop behavior of process variables at every time step throughout a training episode (i.e., one complete step test). Once the running reward (e.g., cumulative tracking error) exceeds a threshold, the employed PID controller will be replaced by the baseline controller to correct the process variables and prevent instability. In this work, we primarily focus on the automatic tuning aspect of RL-based PID tuning with stability preservation. In the simulation, we demonstrate the adaptivity of our method with respect to the changes in the system model throughout the offline episodic tuning. However, how to fully accomplish the online adaptivity to varying operating conditions with stability preservation will be tackled in our future work.

This paper is outlined as follows. In Section 2, we introduce the fundamentals about the form of PID control used in this work and the DPG algorithm that will be adapted in our proposed method. Section 3 is devoted to the introduction of the proposed episodic RL-based PID auto-tuning method, where the closed-loop stability can be maintained throughout the RL training. In Section 4, simulation examples are provided to demonstrate the effectiveness and adaptivity of the proposed techniques. This paper concludes in Section 5.

We consider the standard position form of PID control ^{[9]}:

where

where

We employ the *anti-windup compensation*^{[31]} to address the issue that the integral action continues to increase or decrease if the saturation persists. The presence of windup of the integral action can trigger the nonlinearity of the controller that may lead to instability of the closed-loop system ^{[23]}. With the anti-windup compensation, the integral term does not accumulate once the control action reaches the upper or lower bound. Further, to avoid the sudden jump of the derivative of the error when the setpoint is adjusted, the *derivative kick* will be overcome by using the differentiated output
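The two safeguards described above can be illustrated with a minimal discrete-time sketch. It is not the controller used in this work: the gain parameterization (`kp`, `ki`, `kd`), the rectangular integration, the default bounds, and the clamping form of anti-windup are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PID:
    """Position-form PID with clamping anti-windup (illustrative sketch).

    Assumed, not taken from the paper: the kp/ki/kd parameterization,
    rectangular integration, and the default actuator bounds.
    """
    kp: float
    ki: float
    kd: float
    dt: float = 1.0
    u_min: float = -1.0
    u_max: float = 1.0
    integral: float = 0.0
    prev_y: Optional[float] = None

    def step(self, setpoint: float, y: float) -> float:
        error = setpoint - y
        # The derivative acts on the measured output rather than the error,
        # so a setpoint step does not create a derivative kick.
        dy = 0.0 if self.prev_y is None else (y - self.prev_y) / self.dt
        self.prev_y = y

        candidate = self.integral + error * self.dt
        u = self.kp * error + self.ki * candidate - self.kd * dy
        u_sat = min(max(u, self.u_min), self.u_max)
        # Anti-windup: the integral is only allowed to accumulate while
        # the unsaturated action stays inside the bounds.
        if u == u_sat:
            self.integral = candidate
        return u_sat
```

With `PID(kp=1.0, ki=0.5, kd=0.0)` and a large setpoint, the returned action sits at `u_max` and `integral` stays at zero until the action leaves saturation.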

The standard reinforcement learning contains a learning agent that interacts with an environment. The agent (e.g., a controller) represents a decision-making mechanism, whereas the environment often represents an object (e.g., a plant or a process) ^{[32]}. The objective of an agent is to find the best policy such that the long-term cumulative reward (known as *return*) is optimized by interacting with the environment that is often characterized by a Markov decision process (MDP) ^{[17]}. Specifically, at each time step

The agent's policy is often described by a mapping

where

The state value function *expected* return at a given state if the policy follows

Similarly, the action-value function

With the Markov property, the recursive relation (Bellman expectation equation) between the

where

The advantage of using

Value-based methods have been well developed to solve the Bellman expectation or optimality equations. Among these methods, Q-learning has been the most widely used due to its off-policy nature and ability to discover the global optimum ^{[19, 33]}. Q-learning solves the Bellman optimality equation (8) via temporal difference (TD):

where

To generalize Q-learning to continuous state and action spaces, neural networks have been extensively used as FAs to represent

Note that the objective above is a supervised learning problem with

where

In contrast to value-based methods where action or state value functions must be obtained before acquiring the optimal policy, policy-based methods directly search for the best policy to optimize the RL objective:

where *actor network*, parameterized by

where ^{[21]}, turns out to be

which is in terms of the *critic network*. The parameters of actor and critic networks can be updated sequentially using (12) and (14). This forms the well-known *actor-critic* architecture ^{[34]}. The actor aims to find the optimal policy and the critic serves as a judge to assess the current policy prescribed by the actor.

Stochastic policies often demand a large number of samples for evaluating the policy gradient (15), especially for high-dimensional action space. As an extension of DQN, *deterministic policy gradient* (DPG) is proposed by Silver et al. ^{[22]} where the deterministic policy is given as

Compared with the stochastic counterpart (15), the gradient of the deterministic policy above has expectation with respect to only the state transition probability, thereby the required number of data samples can be significantly reduced.
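For reference, the deterministic policy gradient of Silver et al. is commonly written in the form below. The notation is an assumption and may differ from the symbols used elsewhere in this paper: $\mu_{\theta}$ denotes the deterministic policy, $Q^{\mu}$ its action-value function, and $\rho^{\mu}$ the state distribution induced by the policy.

```latex
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}
    \left[ \nabla_{\theta}\, \mu_{\theta}(s)\;
           \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)} \right]
```

The expectation is taken over states only, which is why fewer samples are typically needed than for a stochastic policy, where actions must also be integrated out.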

As an extension of DQN, the DPG algorithm inherits some techniques adopted by DQN. (i): As shown in (8), the target of Q-learning problem with FA can be represented as *Fixed Q target* is one solution to alleviate this issue, in which a *critic target network**target actor network**replay buffer* is often used to store a large number of past transition pairs ^{[22, 35]}.
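The two techniques named above, target networks and a replay buffer, can be sketched in a few lines. This is a minimal illustration, not the implementation used in this work. The class and function names are hypothetical, network parameters are represented as a flat list of floats, and the update convention `target = rho * target + (1 - rho) * online` is an assumption about how the Polyak coefficient is applied.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity=1000):
        # With maxlen set, the oldest transition is discarded when full.
        self.data = deque(maxlen=capacity)

    def store(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # Uniform sampling weakens the temporal correlation between the
        # transitions used in one gradient step.
        k = min(batch_size, len(self.data))
        return random.sample(list(self.data), k)


def polyak_update(target_params, online_params, rho=0.999):
    """Move each target parameter a small step toward its online value.

    Assumed convention: rho close to 1 means a slowly moving target.
    """
    return [rho * t + (1.0 - rho) * w
            for t, w in zip(target_params, online_params)]
```

A slowly moving target keeps the regression target of the critic nearly fixed between updates, which is the purpose of the fixed Q target idea.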

In this section, we demonstrate the procedures of adapting the above DPG algorithm to the PID tuning problem with stability guarantee.

The overall structure of the proposed DPG-based PID tuning with stability guarantee is shown in Figure 2. The proposed scheme is *multi-scale* in time for RL training and closed-loop operation. For clarity, we define the *update step* as the step that the parameters of RL agent are updated once, or equivalently, the parameters in the PID controller are changed once. We use *operation step* is defined as one time step in a closed-loop operation, denoted as ^{[23]}, if the RL agent and PID parameters are updated at every operation step, as has been done by many works, the fast switching of controller parameters may lead to instability ^{[36]}. The details of each block in Figure 2 are provided below.

In the proposed scheme, the closed-loop system with PID control is viewed as the environment. The outputs of the environment are the trajectories of process variables, such as MVs, CVs, and setpoint, throughout an episodic closed-loop operation. Specifically, after the

where ^{[18, 32]} generally stacks the entire input and output trajectories, possibly with setpoint and other information, into a tuple as the environment state. However, this can lead to the issue of overly high-dimensional input to the DPG networks, especially for slow processes that require a large episode length

where

At the end of each episode, the RL agent receives information from the environment, updates its parameters, and then delivers one set of PID parameters to the controller. The closed-loop system then operates under the new PID controller for the next episode.

The proposed method will use the reward

where

As discussed in Section 2, our RL will utilize the DPG algorithm for training the agent. At the end of each closed-loop operation episode, the setpoint, MV, and CV trajectories will be consumed to extract the environment state, action (PID parameters used for this episode), and reward, as shown in Figure 2. Such information is saved into the replay buffer. A batch of ^{[37]}. As a result, the returned action ^{[35]}. In this work, we use layer normalization ^{[38]} to mitigate this issue. For the actor network, in order to encourage the agent's exploration of PID parameter space at the beginning, we add a Gaussian noise after the 'tanh' function. However, the variance of the noise decays over episodes so that the agent will focus more on exploitation at the later stage of the training. Note that the architectures (e.g., the number of layers and nodes for each layer) of these networks in Figure 3 for specific problems need fine tuning to optimize the performance of the trained RL agent. As shown in Figure 2, the output (action) of the RL agent is the next set of PID parameters to be deployed to the closed-loop system. However, some PID parameters, if implemented, may give rise to unstable closed-loop responses that can cause disastrous consequences in practice. To preserve the stability and thus ensure safety, a novel mechanism is proposed to correct those poorly selected PID parameters before causing instability, as discussed in the next subsection.
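The actor-side ingredients mentioned above (layer normalization, a 'tanh' output, and decaying Gaussian exploration noise) can be sketched as follows. This is an illustration under stated assumptions rather than the networks in Figure 3: the exponential form of the noise decay, the clipping of the noisy action to [-1, 1], the affine scaling to parameter bounds, and all function names are hypothetical.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Standardize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def noise_std(episode, var0=0.05, decay=0.001):
    """Standard deviation of the exploration noise (assumed exponential decay)."""
    return math.sqrt(var0 * math.exp(-decay * episode))


def to_pid_parameters(pre_activation, bounds, episode, rng=random):
    """Map raw actor outputs to PID parameters that lie inside `bounds`."""
    sigma = noise_std(episode)
    params = []
    for z, (low, high) in zip(pre_activation, bounds):
        a = math.tanh(z) + rng.gauss(0.0, sigma)  # noise is added after tanh
        a = max(-1.0, min(1.0, a))                # assumed: clip back to [-1, 1]
        params.append(low + 0.5 * (a + 1.0) * (high - low))
    return params
```

As a quick illustration of why normalization helps, `math.tanh` of the raw values `[50.0, 80.0, 20.0]` is essentially 1 for all three entries, whereas after `layer_norm` the same vector maps to clearly distinct values, so the output is not pinned at an action bound.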

Figure 3. The structure of the actor and critic networks. Left: The actor network where layer normalization is used before each network layer. Decaying noise is added to the output to encourage exploration at the beginning of RL training. Right: The critic network that consumes state and action, and returns the

Ensuring the stability of the closed-loop system during RL exploration of PID parameters is critical for practical implementation. However, algorithms such as DPG employed by RL agent focus solely on maximizing the return (13) that does not directly relate to the closed-loop stability, especially when the training episode has a short duration where the unstable behavior cannot be clearly manifested. Moreover, constrained RL algorithms ^{[39]} that restrict the searchable action space to ensure safety may not be applicable since the stability region of the PID parameter space is usually extremely difficult to obtain and be formulated as explicit constraints. Reward modification methods such as Lagrangian relaxation ^{[40, 41]} only add some soft penalty on constraint violation or unstable response to discourage such explorations to a certain extent, that is, they cannot strictly ensure that the recommended PID parameters by RL agent for subsequent trials are within the stable region. In this work, we propose a novel idea that is based on a baseline PID controller for ensuring the stability for RL-based PID tuning.

The proposed method utilizes a conservative but stable PID controller as a *baseline*. The corresponding PID parameters are denoted as *de-tuned* setup (thus possibly poor performance). Prior to the training of RL agent, the baseline controller is deployed to the system to observe the closed-loop response over an episode. The tracking error in (20) then can be computed based on the acquired response profiles, and this is used as the *benchmark* reward

Figure 4. The flow chart for the proposed stability-preserving RL-based PID tuning approach. The running reward is denoted as "RR".

At time *explored parameters*, whereas the actually utilized parameters until the end of the episode are defined as *implemented parameters*. For an unstable PID controller, the implemented parameters are indeed the same as the baseline parameters. The conservativeness of the baseline controller can, to the maximum extent, prevent the process variables from divergence and drive them to the steady-state. The developed stability-preserving DPG-based PID tuning algorithm is shown in Table 1. The theoretical proof of the closed-loop stability of our method can be found in the Appendix.
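The supervisor logic can be sketched as a single simulated episode. The sketch makes several assumptions that are not from this work: the plant is a hypothetical first-order lag with a 5-sample input delay, a PI controller is used instead of full PID for brevity, the running reward is the cumulative squared tracking error, and the integrator is reset when the supervisor switches to the baseline.

```python
from collections import deque


def run_episode(gains, baseline, threshold, n_steps=200, setpoint=1.0):
    """Simulate one episode and fall back to `baseline` once RR > threshold.

    Returns (running_reward, switched, outputs). The plant below is a
    hypothetical first-order lag with input delay, for illustration only.
    """
    a, b, delay = 0.9, 0.1, 5
    kp, ki = gains
    y, integral, rr, switched = 0.0, 0.0, 0.0, False
    pipeline = deque([0.0] * delay, maxlen=delay)  # inputs not yet applied
    outputs = []
    for _ in range(n_steps):
        error = setpoint - y
        rr += error ** 2                 # running reward: cumulative squared error
        if not switched and rr > threshold:
            kp, ki = baseline            # supervisor: early correction
            integral = 0.0               # assumed: reset integrator on switch
            switched = True
        integral += error
        u = kp * error + ki * integral
        delayed_u = pipeline[0]          # input computed `delay` steps earlier
        pipeline.append(u)
        y = a * y + b * delayed_u        # one-step simulation forward
        outputs.append(y)
    return rr, switched, outputs
```

A possible usage pattern is to run the baseline once with `threshold=float("inf")` to obtain a benchmark score, and then pass that score as the threshold when evaluating explored parameters. In this toy setup, aggressive gains such as `(5.0, 1.0)` trigger the switch early in the episode and the output remains bounded.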

Table 1

Pseudo code for stability-preserving DPG-based PID tuning

Algorithm: Stability-preserving DPG-based PID tuning | |

1: | Input: initial policy parameter |

2: | Initialize target networks |

3: | Repeat |

4: | Set running reward |

5: | Perform closed-loop operation under the explored PID parameters: for each operation step in the episode, if the running reward exceeds the threshold, replace the PID parameters with the baseline parameters; then simulate one step forward and update the running reward |

6: | Observe the next state |

7: | Store the transition in the replay buffer |

8: | Update RL agent parameters: |

9: | Randomly select a batch of transitions from the replay buffer |

10: | Compute the target for each sample in the batch |

11: | Update the Q-function by one step of gradient descent using the cost function w.r.t. the critic parameters |

12: | Update the policy by one step of gradient descent using the cost function w.r.t. the actor parameters |

13: | Update the parameters of the target networks |

14: | Until convergence, then End Repeat |

**Remark 1.** Similar to the definition of reward (20), the running reward defined in (21) can be flexible by including other performance metrics, e.g., penalty on the inputs and even the input change rate. However, the form of the running reward shall be consistent with that of the reward in (20). The underlying principle of using the performance metric as the reward for the RL agent to learn stays the same. The selection of

RL-based PID auto-tuning as proposed in this work shares similarities with gain scheduling ^{[42]}. Both strategies pursue the ultimate goal of achieving online adaptivity of controllers to varying operating conditions. However, they also have important differences. For gain scheduling, a set of scheduling parameters that indicate the current operating condition must be measured online. The deployed controller under the current scheduling parameters is updated in real time by, for example, linearly combining baseline controller parameters that were designed beforehand at prescribed operating conditions. For RL-based PID tuning, the optimal PID parameters are obtained through online interactions between the RL agent and the closed-loop system, and thus no scheduling parameter needs to be measured. In the literature, two strategies have been reported: PID parameters can be updated in real time at every time step ^{[29]} or in an episodic mode as in our work. For the former approach, since the PID parameters are changed instantaneously instead of waiting for the closed-loop response to evolve for a duration, it is hard to define a reasonable metric (e.g., overshoot or integrated squared error) to assess the control performance under given parameters. Moreover, the overly fast switching of PID parameters during the RL search may lead to closed-loop instability ^{[29]}. Our episodic approach addresses both issues, although one potential drawback is the large number of time steps required to complete the training. One of our future research topics is how to accelerate the training of the RL agent using sample-efficient RL techniques that significantly reduce the number of data samples required, which would mitigate this drawback.
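For contrast, the linear combination used in gain scheduling can be sketched as a simple interpolation between two pre-designed controllers. The function name, the operating points, and all numeric PID values below are placeholders chosen for illustration; they are not the controllers designed in this work.

```python
def schedule_pid(gain,
                 low=(0.3, (2.0, 0.10, 1.0)),
                 high=(0.6, (1.0, 0.05, 0.5))):
    """Linearly interpolate PID parameters between two design points.

    `low` and `high` are (process_gain, (kp, ki, kd)) pairs. The numeric
    values are placeholders, not controllers from the paper.
    """
    (g0, p0), (g1, p1) = low, high
    w = (gain - g0) / (g1 - g0)
    w = min(1.0, max(0.0, w))  # hold the end-point controller outside the range
    return tuple((1.0 - w) * a + w * b for a, b in zip(p0, p1))
```

The function needs the measured scheduling parameter `gain` at every call, which is exactly the requirement that the RL-based approach avoids.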

**Remark 2.** Despite the stability-preserving feature of the proposed approach, some potential limitations may exist for its practical applications. One of the main difficulties, as mentioned above, is the relatively low sample efficiency of not only our method but also general RL approaches. We will tackle this problem in our future work by employing sample-efficient RL methods to accelerate the training by reducing the number of required samples. Another potential limitation is the determination of hyper-parameters for the proposed algorithm (e.g., learning rate and actor/critic network structures), which could affect the convergence and sample efficiency. One solution could be to employ existing well-developed RL models as a starting point (similar to the idea of transfer learning) to facilitate the selection of hyper-parameters.

In this section, we study the PID tuning for a second-order plus dead-time (SOPDT) process with model as

with time-delay as 10 seconds. For simulating this process, the sampling time is selected as 1 second. Due to the large time-delay, PID control is the preferred mode. The ranges for the three parameters are respectively

Table 2 lists the key parameters adopted by our algorithm for the following simulations. In determining some parameters such as learning rates, noise level, replay buffer, and batch size, we referred to benchmark DPG models ^{[22]} and slightly tuned those benchmark values. Other parameters such as actor and critic network structure and noise level are determined based on our insight into the specific problem along with some trial and error. Through this process, we find that the algorithm outcome is sensitive to some parameters such as the learning rate and network structure, but not to parameters such as the buffer size and batch size. Our interpretation is that the sensitive parameters are directly related to the overall complexity, scale, and optimization performance of the resultant RL problem. Therefore, one potential challenge in implementing such RL-based PID auto-tuning methods is the determination of these parameters for the specific application. Accelerating and automating the selection of hyper-parameters will be an interesting and important future topic.

Table 2

List of key hyper-parameters used by our RL-based PID tuning algorithm

Parameters | Values | Parameters | Values |

Learning rate | 0.0002 | Learning rate | 0.0002 |

Memory size | 1000 | Batch size | 32 |

Episode length | 200 | Environment state dimension | 30 |

Discount factor | 0.99 | Polyak averaging coefficient | 0.999 |

Structure for actor and its target networks | [40, 30, 1] | Structure for critic and its target networks | [40, 30, 1] |

Reward threshold | 15 | Scaling parameter | 1 |

Initial noise variance | 0.05 | Noise decay factor | 0.001 |

**Case 1:** For the first case, PID control is used but only two free parameters

Figure 5. Stability-preserving automatic PID tuning results with Algorithm 1 with two tunable parameters

**Case 2:** For the second case, we consider the full-scale setup of PID tuning with three tunable parameters. We use the same baseline controller as in the previous case. However, to better clarify the difference between stability-preserving and traditional RL-based PID tuning without stability preservation, in this example, we choose the bounds of MV to be ^{[23]}, the MV bound can ensure bounded unstable response so as to give a meaningful reward value to the RL agent. Without such a bound, the presence of a portion of divergent responses will significantly deteriorate the data quality in the replay buffer, which will prevent the RL agent from learning a useful policy. These results again clearly verify the effectiveness of our stability-preserving RL-based PID automatic tuning framework.

Figure 6. PID tuning results with Algorithm 1 with three tunable parameters

**Comparison with existing PID tuning methods**. There are numerous methods available in the literature for PID tuning such as Ziegler-Nichols method ^{[43]}, direct synthesis method, internal model control (IMC) tuning, IMC-Maclaurin closed-loop tuning ^{[44]} and so on. In this work, we specifically select the following three classical methods for comparison with our method: IMC-PID tuning, IMC-Maclaurin tuning, and closed-loop specified tuning. Prior to using the IMC-PID tuning method, a first-order plus dead-time (FOPDT) model has to be developed to approximate the SOPDT model. To this end, the following relations are used to obtain the FOPDT approximation model ^{[44]}:

where

Table 3

The selected classical PID tuning methods for SOPDT model

Method | PID parameter determination |

IMC-PID tuning ^{[44]} | |

IMC-MAC tuning ^{[45]} | |

Closed-loop specified tuning ^{[46]} |

Figure 7 compares the setpoint tracking performance of our method and the three selected tuning guidelines. Among the step response curves of the output, our method tracks the setpoint quickly despite the presence of a secondary peak, while the closed-loop specified tuning gives the worst performance. The embedded bar plot shows that our method yields the smallest tracking error of the four methods. The reason is that our RL-based PID tuning aims at finding the optimum in the PID parameter space to maximize the reward, whereas the classical tuning methods are more empirical and their purpose is to quickly provide a good set of PID parameters without explicitly optimizing a specific objective.

Figure 7. Comparison of the setpoint tracking performance of RL-based PID auto-tuning (our approach), closed-loop specified, IMC-MAC, and IMC-PID tuning methods.

**Adaptivity to changes in system model**. In this subsection, we extend the previous results from Case 2 to study the adaptivity of our stability-preserving RL-based PID tuning. To this end, after the initial RL training, we purposely introduce a change in the system gain. We then examine whether the RL agent is able to adapt to the new system via interactions with the closed-loop system. Specifically, this simulation experiment starts after the episodic reward in the top left plot of Figure 6 settles down, e.g., after 2000 episodes. With the trained RL agent, we operate the closed-loop system in an episodic mode and gradually change the process gain from 0.3 to 0.5 over 1000 episodes. Such a slow transition in the system parameter is analogous to practical situations in which the process operating condition slowly drifts over time. If a scheduling parameter is measurable (e.g., if the process gain is measurable in our case) to indicate the current operating condition in real time, gain scheduling control is a viable approach to enable the adaptivity of controllers ^{[42]}. In this subsection, we compare our RL-based PID auto-tuning approach with gain-scheduling PID by assuming that the process gain (i.e., the scheduling parameter) is measurable in real time. Note that our RL-based PID tuning does not require any measurable scheduling parameters, in contrast to gain scheduling. For gain-scheduling PID control, several benchmark controllers under prescribed operating conditions must be designed in advance. Then, the real-time PID controllers can be obtained by linearly interpolating those benchmark controllers based on the current value of the scheduling parameter. For simplicity, we pre-designed two PID controllers under two prescribed conditions where the process gains are 0.3 and 0.6, respectively. We used the guideline provided by ^{[9]} for SOPDT to design two benchmark PID controllers as

Figure 8. Comparison of the adaptivity of RL-based PID auto-tuning (our approach), gain-scheduling PID, and the fixed baseline PID controller to the process gain change. Left: episodic rewards (the negative of the mean squared tracking error) for these methods; Right: the ultimate CV profiles after the PID adapts to the new process gain based on these methods. Note that the green color shows the results where a fixed baseline PID from Case 2 is used throughout the entire experiment.

Figure 9. Evolution of the CV profiles with RL-based PID auto-tuning after introducing the process gain change.
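The gain-scheduling comparison described in this subsection can be illustrated with a minimal Python sketch. It assumes the drift from 0.3 to 0.5 is linear in the episode index (the text only says "gradually"), and the two sets of PID gains are placeholders chosen for illustration, not the benchmark controllers designed in the paper. The names `PIDGains`, `process_gain_at`, and `schedule_gains` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PIDGains:
    kp: float
    ki: float
    kd: float


def process_gain_at(episode: int, k_start: float = 0.3, k_end: float = 0.5,
                    n_episodes: int = 1000) -> float:
    """Process gain after `episode` episodes, assuming a linear drift."""
    frac = min(max(episode / n_episodes, 0.0), 1.0)
    return k_start + frac * (k_end - k_start)


def schedule_gains(k: float, k_lo: float, gains_lo: PIDGains,
                   k_hi: float, gains_hi: PIDGains) -> PIDGains:
    """Linearly interpolate PID gains between two pre-designed controllers."""
    w = (k - k_lo) / (k_hi - k_lo)
    w = min(max(w, 0.0), 1.0)  # hold the nearest design outside [k_lo, k_hi]
    return PIDGains(
        kp=(1.0 - w) * gains_lo.kp + w * gains_hi.kp,
        ki=(1.0 - w) * gains_lo.ki + w * gains_hi.ki,
        kd=(1.0 - w) * gains_lo.kd + w * gains_hi.kd,
    )


# Placeholder designs at process gains 0.3 and 0.6 (NOT the paper's values).
GAINS_AT_03 = PIDGains(kp=2.0, ki=0.4, kd=1.0)
GAINS_AT_06 = PIDGains(kp=1.0, ki=0.2, kd=0.5)

if __name__ == "__main__":
    for episode in (0, 500, 1000):
        k = process_gain_at(episode)
        gains = schedule_gains(k, 0.3, GAINS_AT_03, 0.6, GAINS_AT_06)
        print(f"episode {episode}: process gain {k:.2f}, {gains}")
```

The scheduler needs the current process gain `k` as an input at every step, which is exactly the measurability assumption that the RL-based tuner avoids.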

**Remark 3.** As with other RL-based methods, and as shown by the results in this simulation section, the major computational burden of the proposed algorithm lies in the training of the RL agent. For a large and complex PID parameter space, exploring the space and discovering the optimum can take a large number of training samples. Indeed, a number of factors can directly affect the training time, such as the learning rate, episode length, and action noise level. If the learning rate is overly small and the episode is long, the RL agent takes more time to obtain the state and reward from the closed-loop system and requires more iterations to converge to the optimum, which can considerably increase the training time. In addition, if the action noise level is too large, RL's exploration of the PID space can consume much more time. Therefore, a good trade-off between exploration and exploitation is also an important factor that affects the computational complexity of the proposed method.

**Remark 4.** In this section, we used a linear second-order system as an example to implement and validate our algorithms. Although the selected system is simple, our main objective here is to verify that the proposed stability-preserving approach can ensure closed-loop stability during the search for optimal PID parameters, as indicated by the results above. However, our ultimate long-term objective is to achieve a fully online automatic PID tuning framework for generic nonlinear systems (e.g., CSTR) based on RL methods, to enable online adaptivity, optimality, and stability throughout varying operating conditions. As part of our future work, the proposed stability-preserving method will be applied to nonlinear systems where online auto-tuning will also be achieved. The determination of closed-loop stability can be conducted via monitoring the running reward, instead of relying on theoretical approaches that can be nontrivial for nonlinear systems.
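The idea of judging stability by monitoring the running reward can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation. The plant is a toy first-order lag with input delay rather than the SOPDT system of the case studies, and the running reward is expressed as a cost (the cumulative squared tracking error). The switching threshold is taken to be the total cost of the baseline controller over one episode, which is one plausible reading of the benchmark score. Controller state is not transferred at the switch, and all names and gain values are hypothetical.

```python
from typing import Optional, Tuple

Gains = Tuple[float, float, float]  # (kp, ki, kd)


class PID:
    """Discrete positional-form PID controller with sample time dt."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float = 1.0):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error: Optional[float] = None

    def control(self, error: float) -> float:
        self.integral += error * self.dt
        # No derivative action on the first call, to avoid a start-up kick.
        derivative = (0.0 if self.prev_error is None
                      else (error - self.prev_error) / self.dt)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


class ToyPlant:
    """Stand-in plant (assumption): y[k+1] = a*y[k] + b*u[k-delay]."""

    def __init__(self, a: float = 0.9, b: float = 0.05, delay: int = 2):
        self.a, self.b = a, b
        self.y = 0.0
        self.buffer = [0.0] * delay

    def step(self, u: float) -> float:
        self.buffer.append(u)
        self.y = self.a * self.y + self.b * self.buffer.pop(0)
        return self.y


def run_episode(candidate: Gains, baseline: Gains,
                benchmark: Optional[float] = None, n_steps: int = 200,
                setpoint: float = 1.0) -> Tuple[float, Optional[int]]:
    """Run one episode; return (cumulative squared error, switch step or None)."""
    plant = ToyPlant()
    active = PID(*candidate)
    running_cost, switch_step, y = 0.0, None, 0.0
    for k in range(n_steps):
        error = setpoint - y
        running_cost += error ** 2
        # Supervisor: fall back to the baseline once the benchmark is exceeded.
        if benchmark is not None and switch_step is None and running_cost > benchmark:
            active = PID(*baseline)
            switch_step = k
        y = plant.step(active.control(error))
    return running_cost, switch_step


BASELINE: Gains = (0.5, 0.05, 0.0)     # conservative PI, assumed stabilizing
AGGRESSIVE: Gains = (50.0, 0.0, 0.0)   # destabilizes the toy loop

benchmark, _ = run_episode(BASELINE, BASELINE)
cost, switched_at = run_episode(AGGRESSIVE, BASELINE, benchmark=benchmark)
print(f"benchmark cost {benchmark:.2f}, supervised cost {cost:.2f}, "
      f"switched at step {switched_at}")
```

Because the cumulative squared error never decreases, a candidate that merely matches the baseline is never interrupted, while a diverging candidate crosses the threshold early in the episode and is replaced before the response grows without bound.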

We presented a novel stability-preserving framework for RL-based automatic PID tuning. This work is motivated by the observation that, although PID tuning methods exist for complex systems, many conventional methods rely on trial and error, and the PID parameters obtained from such empirical methods may not be optimal. Moreover, existing RL-based tuning methods cannot ensure closed-loop stability during policy search. To this end, we proposed a novel multi-scale episodic PID automatic tuning framework based on the DPG algorithm. In this method, the RL agent receives the reward and updates the network parameters once after each entire closed-loop operation. To preserve closed-loop stability, we employ a stable PID controller as a baseline whose reward is used as a benchmark. A supervisor mechanism monitors the running reward for any explored PID parameter. Once the running reward exceeds the benchmark, the supervisor mechanism replaces the underlying PID controller with the baseline controller to prevent an unstable response. Simulation examples show that, in contrast to standard RL-based PID tuning without stability preservation, our method not only discovers the optimal tuning parameters efficiently but also preserves closed-loop stability for all explored PID parameters. Moreover, the proposed method is able to adapt to system changes without requiring any knowledge about the underlying operating condition. Future work includes enabling online and efficient PID tuning with RL-based methods for nonlinear systems while ensuring stability during the policy search.

This appendix is devoted to the proof of the overall closed-loop stability of our stability-preserving method.

We consider a generic nonlinear system

where

where

**Definition (Exponential stability)**^{[47]}. Consider a generic autonomous nonlinear closed-loop system

for all

For the proposed stability-preserving strategy, define the reward threshold for switching into the baseline controller as

Note that the switching time

M.A. Chowdhury acknowledges the support of Distinguished Graduate Student Assistantships (DGSA) from Texas Tech University. Q. Lu acknowledges the new faculty startup funds from Texas Tech University. All the authors acknowledge the valuable comments and suggestions from anonymous reviewers.

Concept development and data acquisition: Lakhani AI, Lu Q

Drafting the manuscript: Lu Q, Chowdhury MA

Modifying the manuscript: Lakhani AI, Chowdhury MA

Data will be made available upon e-mail request to the corresponding author.

None.

All authors declared that there are no conflicts of interest.

Not applicable.


1. Boubertakh H, Tadjine M, Glorennec PY, Labiod S. Tuning fuzzy PD and PI controllers using reinforcement learning. *ISA Transactions* 2010;49:543-51.

2. Ziegler JG, Nichols NB, et al. Optimum settings for automatic controllers. *Trans ASME* 1942;64.

3. Borase RP, Maghade D, Sondkar S, Pawar S. A review of PID control, tuning methods and applications. *International Journal of Dynamics and Control* 2020:1-10.

4. Lee D, Lee SJ, Yim SC. Reinforcement learning-based adaptive PID controller for DPS. *Ocean Engineering* 2020;216:108053.

5. Zhong J, Li Y. Toward human-in-the-loop PID control based on CACLA reinforcement learning. In: International Conference on Intelligent Robotics and Applications. Springer; 2019. pp. 605-13. DOI: 10.1007/978-3-030-27535-8-54.

6. Guan Z, Yamamoto T. Design of a reinforcement learning PID controller. *IEEJ Transactions on Electrical and Electronic Engineering* 2021;16:1354-60.

7. Teoh E, Yee Y. Implementation of adaptive controllers using digital signal processor chips. In: Intelligent Tuning and Adaptive Control. Elsevier; 1991. pp. 109-13. DOI: 10.1016/B978-0-08-040935-1.50022-3.

8. Bucz Š, Kozáková A. Advanced methods of PID controller tuning for specified performance. *PID Control for Industrial Processes* 2018:73-119.

9. Seborg DE, Edgar TF, Mellichamp DA, Doyle III FJ. Process dynamics and control. John Wiley & Sons; 2016. DOI: 10.1021/cen-v056n033.p045.

10. Abushawish A, Hamadeh M, Nassif AB. PID controller gains tuning using metaheuristic optimization methods: A survey. *Journal of Huaqiao University (Natural Science)* 2020;14:87-95.

11. Chang WD, Hwang RC, Hsieh JG. A multivariable on-line adaptive PID controller using auto-tuning neurons. *Engineering Applications of Artificial Intelligence* 2003;16:57-63.

12. Yu D, Chang T, Yu D. A stable self-learning PID control for multivariable time varying systems. *Control Engineering Practice* 2007;15:1577-87.

13. Zhou K, Zhen L. Optimal design of PID parameters by evolution algorithm. *Journal of Huaqiao University (Natural Science)* 2005;26:85-88.

14. Chen J, Huang TC. Applying neural networks to on-line updated PID controllers for nonlinear process control. *Journal of Process Control* 2004;14:211-30.

15. Hou Z, Chi R, Gao H. An overview of dynamic-linearization-based data-driven control and applications. *IEEE Transactions on Industrial Electronics* 2016;64:4076-90.

16. Wang XS, Cheng YH, Wei S. A proposal of adaptive PID controller based on reinforcement learning. *Journal of China University of Mining and Technology* 2007;17:40-44.

17. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018. DOI: 10.1109/TNN.1998.712192.

18. Shin J, Badgwell TA, Liu KH, Lee JH. Reinforcement Learning-Overview of recent progress and implications for process control. *Computers & Chemical Engineering* 2019;127:282-94.

19. Watkins CJ, Dayan P. Q-learning. *Machine Learning* 1992;8:279-92.

20. Rummery GA, Niranjan M. On-line Q-learning using connectionist systems. vol. 37. Citeseer; 1994.

21. Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems; 2000. pp. 1057-63.

22. Silver D, Lever G, Heess N, Degris T, Wierstra D, et al. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. PMLR; 2014. pp. 387-95.

23. Lawrence NP, Stewart GE, Loewen PD, Forbes MG, Backstrom JU, et al. Reinforcement learning based design of linear fixed structure controllers. *IFAC-PapersOnLine* 2020;53:230-35.

24. Qin Y, Zhang W, Shi J, Liu J. Improve PID controller through reinforcement learning. In: 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC). IEEE; 2018. pp. 1-6. DOI: 10.1109/GNCC42960.2018.9019095.

25. Sedighizadeh M, Rezazadeh A. Adaptive PID controller based on reinforcement learning for wind turbine control. In: Proceedings of World Academy of Science, Engineering and Technology. vol. 27. Citeseer; 2008. pp. 257-62.

26. Boubertakh H, Glorennec PY. Optimization of a fuzzy PI controller using reinforcement learning. In: 2006 2nd International Conference on Information & Communication Technologies. vol. 1. IEEE; 2006. pp. 1657-62. DOI: 10.1109/ICTTA.2006.1684633.

27. El Hakim A, Hindersah H, Rijanto E. Application of reinforcement learning on self-tuning PID controller for soccer robot multi-agent system. In: 2013 Joint International Conference on Rural Information & Communication Technology and Electric-vehicle Technology (rICT & ICeV-T). IEEE; 2013. pp. 1-6. DOI: 10.1109/rICT-ICeVT.2013.6741546.

28. Park D, Yu H, Xuan-Mung N, Lee J, Hong SK. Multicopter PID attitude controller gain auto-tuning through reinforcement learning neural networks. In: Proceedings of the 2019 2nd International Conference on Control and Robot Technology; 2019. pp. 80-84. DOI: 10.1145/3387304.3387327.

29. Lawrence NP, Stewart GE, Loewen PD, Forbes MG, Backstrom JU, et al. Optimal PID and antiwindup control design as a reinforcement learning problem. *IFAC-PapersOnLine* 2020;53:236-41.

30. Omisore OM, Akinyemi T, Duan W, Du W, Wang L. A novel sample-efficient deep reinforcement learning with episodic policy transfer for PID-based control in cardiac catheterization robots. *arXiv preprint arXiv:2110.14941* 2021. DOI: 10.1145/3387304.3387327.

31. Astrom KJ, Rundqwist L. Integrator windup and how to avoid it. In: 1989 American Control Conference. IEEE; 1989. pp. 1693-98. DOI: 10.23919/ACC.1989.4790464.

32. Spielberg S, Tulsyan A, Lawrence NP, Loewen PD, Gopaluni RB. Deep reinforcement learning for process control: A primer for beginners. *arXiv preprint arXiv:2004.05490* 2020. DOI: 10.1002/aic.16689.

33. Watkins CJCH. Learning from delayed rewards. King's College, Cambridge United Kingdom; 1989. DOI: 10.1016/0921-8890(95)00026-C.

34. Degris T, White M, Sutton RS. Off-policy actor-critic. *arXiv preprint arXiv:1205.4839* 2012.

35. Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, et al. Continuous control with deep reinforcement learning. *arXiv preprint arXiv:1509.02971* 2015.

36. Malmborg J, Bernhardsson B, Åström KJ. A stabilizing switching scheme for multi controller systems. *IFAC Proceedings Volumes* 1996;29:2627-32.

37. Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference On Machine Learning. PMLR; 2015. pp. 448-56.

38. Ba JL, Kiros JR, Hinton GE. Layer normalization. *arXiv preprint arXiv:1607.06450* 2016.

39. Liu Y, Halev A, Liu X. Policy learning with constraints in model-free reinforcement learning: A survey. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence; 2021.

40. Chow Y, Ghavamzadeh M, Janson L, Pavone M. Risk-constrained reinforcement learning with percentile risk criteria. *The Journal of Machine Learning Research* 2017;18:6070-120.

41. Bohez S, Abdolmaleki A, Neunert M, Buchli J, Heess N, et al. Value constrained model-free continuous control. *arXiv preprint arXiv:1902.04623* 2019. DOI: 10.24963/ijcai.2021/614.

42. Stewart GE. A pragmatic approach to robust gain scheduling. *IFAC Proceedings Volumes* 2012;45:355-62.

43. Ellis G. Four types of controllers. *Control System Design Guide* 2012. DOI: 10.1016/B978-0-12-385920-4.00006-0.

44. Panda RC, Yu CC, Huang HP. PID tuning rules for SOPDT systems: Review and some new results. *ISA Transactions* 2004;43:283-95.

45. Rivera DE, Morari M, Skogestad S. Internal model control: PID controller design. *Industrial & Engineering Chemistry Process Design and Development* 1986;25:252-65.

46. Lee Y, Park S, Lee M, Brosilow C. PID controller tuning for desired closed-loop responses for SI/SO systems. *AIChE Journal* 1998;44:106-15.

47. Khalil HK. Nonlinear Systems (Third Edition). *Prentice Hall* 2002;115.

Lakhani AI, Chowdhury MA, Lu Q. Stability-preserving automatic tuning of PID control with reinforcement learning. *Complex Eng Syst* 2022;2:3. http://dx.doi.org/10.20517/ces.2021.15
