Proportional-integral-derivative (PID) control has been the dominant control strategy in the process industry due to its simplicity in design and effectiveness in controlling a wide range of processes. However, for complex processes where insight about the system is limited, most traditional PID tuning methods rely on trial and error and may not yield the optimal PID parameters. To address this issue, this work proposes an automatic PID tuning framework based on reinforcement learning (RL), particularly the deterministic policy gradient (DPG) method. Different from existing studies on using RL for PID tuning, in this work we explicitly consider closed-loop stability throughout the RL-based tuning process. In particular, we propose a novel episodic tuning framework that allows for an episodic closed-loop operation under selected PID parameters, where the actor and critic networks are updated once at the end of each episode. To ensure closed-loop stability during tuning, we initialize the training with a conservative but stable baseline PID controller, and the resultant reward is used as a benchmark score. A supervisor mechanism monitors the running reward (e.g., tracking error) at each step in the episode. As soon as the running reward exceeds the benchmark score, the underlying controller is replaced by the baseline controller as an early correction to prevent instability. Moreover, we use layer normalization to standardize the input to each layer of the actor and critic networks, which overcomes policy saturation at the action bounds and ensures convergence to the optimum. The developed methods are validated through setpoint tracking experiments on a second-order plus dead-time system. Simulation results show that with our scheme, closed-loop stability can be maintained throughout RL exploration, and the PID parameters explored by the RL agent converge quickly to the optimum. Moreover, simulations verify that the developed RL-based PID tuning method can automatically adapt the PID parameters to changes in the process model without requiring any knowledge of the underlying operating condition, in contrast to other adaptive methods such as gain-scheduling control.
Proportional-integral-derivative (PID) controllers have been the dominant type of controller in the process industry, accounting for more than 80% of industrial process control ^{[1]}. This popularity arises from their structural simplicity and their effectiveness in controlling practical systems ^{[2, 3]}. However, some roadblocks still exist for PID design to enable a higher level of automation and adaptivity in controlling complex processes. First, the control performance of PID strictly relies on the setting of its parameters. Although some empirical guidelines are available for tuning PID control even for complex systems, the resultant PID parameters may not be optimal ^{[4]}. As reported in ^{[5]}, most conventional methods use trial and error to tune PID for complex systems where insight about the system may be limited. Therefore, how to automate the search for optimal PID parameters becomes an interesting topic ^{[6]}. Second, the parameters of traditional PID controllers generally remain fixed after tuning. As a result, they are mainly applicable to controlling time-invariant systems. For time-varying systems, as frequently encountered in practice in, for example, robotics and vehicles, the lack of adaptivity of traditional fixed-parameter PID controllers prevents them from maintaining high control performance ^{[7]}.
Traditional PID tuning methods can be roughly classified into three categories: heuristic tuning, rule-based tuning, and optimization-based (or model-based) tuning ^{[8]}. Heuristic tuning often relies extensively on trial and error guided by an understanding of the role of each PID parameter. This method is easy to implement; however, it can be time-consuming and cannot guarantee a robust or optimal solution ^{[8]}. Rule-based tuning establishes simple models (often a first-order plus dead-time model) to approximate the process based on a step test, and includes methods such as Ziegler-Nichols, Cohen-Coon, Kappa-Tau, and Lambda tuning ^{[9]}. These methods are widely used; however, they are sensitive to discrepancies between the true process and the approximation model. Optimization-based methods can find the optimal PID parameters given an accurate process model along with the desired engineering specification. However, a sufficiently accurate model is required for such methods ^{[10]}, which can be difficult to obtain in practice.
Adaptive PID has been proposed in the literature to mitigate the above non-adaptivity issue. These control strategies can be classified into model-based approaches ^{[11, 12]}, evolutionary-optimization-based approaches ^{[13]}, and neural-network-based approaches ^{[14]}. Model-based approaches assume the presence of an accurate model that represents the true dynamics exactly to enable the adaptivity of PID controllers; this does not hold in practice due to the difficulty of obtaining such high-quality models for complex dynamical systems ^{[15]}. Evolutionary-optimization-based approaches struggle to achieve real-time, online adaptation due to their slow computation speed ^{[16]}. Adaptive PID based on neural networks employs supervised learning to optimize network parameters for adaptivity, but it is limited by the fact that the teaching signal is hard to acquire ^{[6]}. To this end, more advanced machine learning techniques have been adopted to promote user-free automatic tuning and adaptivity of traditional PID controllers.
Among emerging machine learning techniques for further empowering PID control, reinforcement learning (RL) has shown unique advantages in addressing the above issues. As a sequential decision-making method, RL can iteratively optimize an objective, usually black-box, via guided trial-and-error exploration ^{[17]}. First, due to its data-driven nature, automatic tuning of PID parameters can be achieved without expert knowledge. Second, RL can learn the optimal strategy in real-time and online by interacting with the environment. PID tuning is essentially a black-box optimization problem in which the relation between tuning parameters and control performance is unknown, and the optimal PID parameters are those that optimize the control performance. In light of this observation, RL has been preliminarily employed to facilitate PID tuning ^{[18]}. Typical model-free RL algorithms can be broadly grouped into value-based methods (e.g., Q-learning ^{[19]}, SARSA ^{[20]}, and temporal difference (TD) learning ^{[17]}) and policy-based algorithms (e.g., policy gradient ^{[21]}). Value-based methods learn an action- or state-value function, from which the optimal policy can be derived. In contrast, policy-based methods directly learn the optimal policy from experience; therefore, they have better convergence properties and are effective for continuous action spaces. In particular, the deterministic policy gradient (DPG) method, proposed by DeepMind ^{[22]}, has received wide attention due to its sample efficiency, achieved by considering deterministic rather than stochastic policies. As a result, the off-policy actor-critic-based DPG algorithm has shown attractive prospects for addressing the PID tuning problem ^{[23]}.
Treating PID tuning as an RL problem has been reported in the literature, where different RL algorithms have been employed ^{[4, 24]}. Actor-critic learning based on a single RBF network is proposed in ^{[16]} to enable adaptive self-tuning PID control, later applied to wind turbine control ^{[25]}. The combination of fuzzy PID and RL is reported in ^{[1, 26]}. Applications of RL-based self-tuning PID include soccer robots ^{[27]}, multicopters ^{[28]}, and human-in-the-loop physical assistive control ^{[5]}. For process control, the reports in ^{[23, 29]} consider PID tuning in sample-by-sample and episodic modes, respectively. A sample-efficient deep RL scheme with episodic policy transfer is proposed for PID tuning in a robotic catheter system in ^{[30]}. Despite these rapid advancements, maintaining closed-loop stability during the RL search has not been thoroughly studied. Note that the ultimate goal of RL-based PID tuning is to accomplish online and adaptive tuning that maintains high performance even in the presence of changes in operating conditions. Such expectations require that the attempted PID parameters lie in the stable region to avoid an unstable response of the closed-loop process.
The main contribution of this work is to develop a stability-preserving automatic PID tuning approach based on RL, particularly the DPG algorithm. The majority of the literature in this direction has so far focused on establishing an adaptive auto-tuning framework with RL approaches. However, how to maintain closed-loop stability while the RL agent explores the PID parameter space has not been well studied. For instance, as one of the few reports that briefly discuss the stability issue, the authors in ^{[29]} use an anti-windup strategy to force the input to stay within bounds so as to maintain the boundedness of the output in the presence of an unstable PID controller during the RL search. However, that work does not directly tackle the stability issue in a rigorous way. In contrast to existing work, our approach explicitly accounts for closed-loop stability during RL exploration. In the proposed framework, we use a conservative but stable PID controller as a baseline. A supervisor mechanism monitors the closed-loop behavior of process variables at every time step throughout a training episode (i.e., one complete step test). Once the running reward (e.g., cumulative tracking error) exceeds a threshold, the employed PID controller is replaced by the baseline controller to correct the process variables and prevent instability. In this work, we primarily focus on the automatic tuning aspect of RL-based PID tuning with stability preservation. In the simulation, we demonstrate the adaptivity of our method with respect to changes in the system model throughout the offline episodic tuning. How to fully accomplish online adaptivity to varying operating conditions with stability preservation will be tackled in our future work.
This paper is outlined as follows. Section 2 introduces the fundamentals of the form of PID control used in this work and the DPG algorithm that is adapted in our proposed method. Section 3 is devoted to the proposed episodic RL-based PID auto-tuning method, in which closed-loop stability is maintained throughout the RL training. In Section 4, simulation examples demonstrate the effectiveness and adaptivity of the proposed techniques. Section 5 concludes the paper.
We consider the standard position form of PID control ^{[9]}:
where
where
We employ the
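For concreteness, the position-form PID law can be sketched in discrete time as follows. This is a minimal illustration assuming independent gains and rectangular integration; the paper's exact parameterization and symbols are elided above, so the names here (kp, ki, kd, ts) are illustrative.

```python
class PositionFormPID:
    """Discrete position-form PID:
    u_k = kp * e_k + ki * Ts * sum(e) + kd * (e_k - e_{k-1}) / Ts."""

    def __init__(self, kp, ki, kd, ts):
        self.kp, self.ki, self.kd, self.ts = kp, ki, kd, ts
        self.integral = 0.0       # running integral of the error
        self.prev_error = 0.0     # error at the previous sample

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.ts                   # rectangular integration
        derivative = (error - self.prev_error) / self.ts   # backward difference
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative
```

In practice the derivative term is often filtered and the integral clamped (anti-windup); both refinements are omitted here for brevity.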
The standard reinforcement learning setup contains a learning agent that interacts with an environment. The agent (e.g., a controller) represents a decision-making mechanism, whereas the environment often represents an object (e.g., a plant or a process) ^{[32]}. The objective of the agent is to find the best policy such that the long-term cumulative reward (known as
Illustration of the interaction between RL agent and environment.
The agent's policy is often described by a mapping
where
The state value function
Similarly, the actionvalue function
With the Markov property, the recursive relation (Bellman expectation equation) between the
where
The advantage of using
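For reference, the elided definitions are standard; in the notation of Sutton and Barto ^{[17]} (symbols here may differ slightly from the paper's), they read:

```latex
% State-value and action-value functions under policy \pi
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s\right],
\qquad
Q^{\pi}(s,a) = \mathbb{E}_{\pi}\!\left[\sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \,\middle|\, s_t = s,\ a_t = a\right].

% Bellman expectation equation linking V^{\pi} at successive states
V^{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)
             \left[ r(s, a, s') + \gamma\, V^{\pi}(s') \right].
```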
Value-based methods have been well developed to solve the Bellman expectation or optimality equations. Among these methods, Q-learning has been the most widely used due to its off-policy nature and ability to discover the global optimum ^{[19, 33]}. Q-learning solves the Bellman optimality equation (8) via temporal difference (TD):
where
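Q-learning's TD update can be sketched in tabular form as follows; the state/action indices and the learning rate name alpha are illustrative, not the paper's symbols.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One temporal-difference update of tabular Q-learning:
    Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]."""
    td_target = r + gamma * max(Q[s_next])   # bootstrapped target from the next state
    td_error = td_target - Q[s][a]           # temporal-difference error
    Q[s][a] += alpha * td_error
    return Q[s][a]
```

Here Q is a mapping from states to per-action value lists; neural networks replace this table when the state and action spaces are continuous, as discussed next.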
To generalize Q-learning to continuous state and action spaces, neural networks have been extensively used as FAs to represent
Note that the objective above is a supervised learning problem with
where
In contrast to value-based methods, where action or state value functions must be obtained before acquiring the optimal policy, policy-based methods directly search for the best policy to optimize the RL objective:
where
where
which is in terms of the
Stochastic policies often demand a large number of samples for evaluating the policy gradient (15), especially for high-dimensional action spaces. As an extension of DQN,
Compared with its stochastic counterpart (15), the gradient of the deterministic policy above takes an expectation with respect to only the state transition probability; thus, the required number of data samples can be significantly reduced.
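For reference, the stochastic policy gradient ^{[21]} and the deterministic policy gradient ^{[22]} take the following standard forms (generic notation, which may differ from the paper's):

```latex
% Stochastic policy gradient for parameterized policy \pi_\theta
\nabla_{\theta} J(\pi_{\theta})
  = \mathbb{E}_{s \sim \rho^{\pi},\, a \sim \pi_{\theta}}
    \left[ \nabla_{\theta} \log \pi_{\theta}(a \mid s)\, Q^{\pi}(s, a) \right]

% Deterministic policy gradient for parameterized policy \mu_\theta
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}
    \left[ \nabla_{\theta} \mu_{\theta}(s)\,
           \nabla_{a} Q^{\mu}(s, a) \big|_{a = \mu_{\theta}(s)} \right]
```

The deterministic form requires no expectation over actions, which is the source of the sample-efficiency advantage noted above.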
As an extension of DQN, the DPG algorithm inherits some techniques adopted by DQN. (i): As shown in (8), the target of the Q-learning problem with FA can be represented as
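Two of these inherited techniques, experience replay and slowly updated target networks, can be sketched as follows. The class and function names are illustrative; the default Polyak coefficient matches the value 0.999 listed in the paper's hyperparameter table.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size experience replay memory, as used in DQN/DPG."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def store(self, transition):
        # transition = (state, action, reward, next_state)
        self.buffer.append(transition)

    def sample(self, batch_size):
        # Uniform sampling breaks temporal correlations in the training data.
        return random.sample(self.buffer, batch_size)

def polyak_update(target_params, online_params, tau=0.999):
    """Soft target-network update: theta_target <- tau*theta_target + (1-tau)*theta."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]
```

Slowly tracking target networks stabilize the bootstrapped targets in (8), while the replay buffer supplies decorrelated mini-batches for the gradient steps.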
In this section, we describe the procedure for adapting the above DPG algorithm to the PID tuning problem with a stability guarantee.
The overall structure of the proposed DPG-based PID tuning with stability guarantee is shown in
Illustration of the stability-preserving RL-based PID tuning.
In the proposed scheme, the closed-loop system with PID control is viewed as the environment. The outputs of the environment are the trajectories of process variables, such as MVs, CVs, and the setpoint, throughout an episodic closed-loop operation. Specifically, after the
where
where
At the end of each episode, the RL agent receives information from the environment, updates its parameters, and then delivers one set of PID parameters to the controller. The closed-loop system then operates under the new PID controller for the next episode.
The proposed method will use the reward
where
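The episodic reward is reported in the simulations as the negative of the mean squared tracking error, so larger is better; a minimal sketch (the function name is illustrative):

```python
def episodic_reward(setpoints, cvs):
    """Negative mean squared tracking error over one episode (higher is better)."""
    assert len(setpoints) == len(cvs), "trajectories must have equal length"
    return -sum((sp - cv) ** 2 for sp, cv in zip(setpoints, cvs)) / len(setpoints)
```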
As discussed in Section 2, our method utilizes the DPG algorithm for training the agent. At the end of each closed-loop operation episode, the setpoint, MV, and CV trajectories are consumed to extract the environment state, the action (PID parameters used for this episode), and the reward, as shown in
The structure of the actor and critic networks. Left: The actor network where layer normalization is used before each network layer. Decaying noise is added to the output to encourage exploration at the beginning of RL training. Right: The critic network that consumes state and action, and returns the
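Layer normalization (Ba et al., see the reference list) standardizes each layer's input vector before the affine transformation; a dependency-free sketch with learnable scale gamma and shift beta (names illustrative):

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    """Normalize a vector to zero mean / unit variance, then scale and shift.
    Keeping pre-activation inputs standardized helps prevent the actor's
    output from saturating at the action bounds."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    gamma = gamma or [1.0] * n   # learnable scale (identity by default)
    beta = beta or [0.0] * n     # learnable shift (zero by default)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]
```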
Ensuring the stability of the closed-loop system during RL exploration of PID parameters is critical for practical implementation. However, algorithms such as DPG employed by the RL agent focus solely on maximizing the return (13), which does not directly relate to closed-loop stability, especially when the training episode has a short duration in which unstable behavior cannot clearly manifest. Moreover, constrained RL algorithms ^{[39]} that restrict the searchable action space to ensure safety may not be applicable, since the stability region of the PID parameter space is usually extremely difficult to obtain and formulate as explicit constraints. Reward modification methods such as Lagrangian relaxation ^{[40, 41]} only add a soft penalty on constraint violation or unstable response to discourage such explorations to a certain extent; that is, they cannot strictly ensure that the PID parameters recommended by the RL agent for subsequent trials lie within the stable region. In this work, we propose a novel idea based on a baseline PID controller for ensuring stability in RL-based PID tuning.
The proposed method utilizes a conservative but stable PID controller as a
The flow chart for the proposed stability-preserving RL-based PID tuning approach. The running reward is denoted as "RR".
At time
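At each time step, the supervisor logic reduces to a simple guarded switch; the names and the cost-accumulation convention here are illustrative (the paper compares a running reward, e.g., cumulative tracking error, against the baseline's benchmark score):

```python
def supervise(running_cost, threshold, rl_pid, baseline_pid):
    """Supervisor mechanism: once the running cost (e.g., cumulative tracking
    error) exceeds the benchmark threshold, fall back to the conservative but
    stable baseline PID for the remainder of the episode."""
    if running_cost > threshold:
        return baseline_pid    # early correction to prevent instability
    return rl_pid              # keep the RL-proposed PID parameters
```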
Pseudo code for stability-preserving DPG-based PID tuning

1:  Input: initial policy parameter and initial Q-function parameter
2:  Initialize the target networks
3:  for each episode do
4:    Set the running reward
5:    Perform closed-loop operation under the selected PID parameters:
      for each time step in the episode:
        if the running reward exceeds the threshold, switch to the baseline PID controller
        simulate one step forward and update the running reward
6:    Observe the next state and the episodic reward
7:    Store the transition in the replay memory
8:    if it is time to update the networks then
9:      Randomly select a batch of transitions from the memory
10:     Compute the target for each sample in the batch
11:     Update the Q-function by one step of gradient descent on its cost function
12:     Update the policy by one step of gradient descent on its cost function
13:     Update the target network parameters by Polyak averaging
14: end for
RL-based PID auto-tuning as proposed in this work shares similarities with gain scheduling ^{[42]}. Both strategies pursue the ultimate goal of achieving online adaptivity of controllers to varying operating conditions. However, they also have important differences. For gain scheduling, a set of scheduling parameters that indicate the current operating condition must be measured online. The deployed controller under the current scheduling parameters is updated in real-time by, for example, linearly combining baseline controller parameters that were designed beforehand at prescribed operating conditions. For RL-based PID tuning, the optimal PID parameters are obtained through online interactions between the RL agent and the closed-loop system, so no scheduling parameter needs to be measured. In the literature, two strategies have been reported: PID parameters can be updated in real-time at every time step ^{[29]} or in an episodic mode, as in our work. For the former approach, since the PID parameters are changed instantaneously instead of waiting for the closed-loop response to evolve for a duration, it is hard to define a reasonable metric to assess the control performance (e.g., overshoot, integrated squared error, etc.) under given parameters. Moreover, the abrupt switching of PID parameters during the RL search may lead to closed-loop instability ^{[29]}. Our episodic approach addresses these two issues well; one potential issue is the large number of time steps required to complete the training. In fact, one of our future research topics is exactly how to accelerate the training of the RL agent using sample-efficient RL techniques to significantly reduce the number of data samples required for training. In this way, this potential issue can be mitigated.
In this section, we study the PID tuning for a second-order plus dead-time (SOPDT) process with the model
with a time delay of 10 seconds. For simulating this process, the sampling time is selected as 1 second. Due to the large time delay, PID control is the preferred mode. The ranges for the three parameters are respectively
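Since the transfer function itself is elided above, the sketch below assumes a generic SOPDT model G(s) = K e^{-θs} / ((τ1 s + 1)(τ2 s + 1)) with illustrative gain and time constants; only the 10-second delay and 1-second sampling time follow the text.

```python
from collections import deque

def simulate_sopdt_step(u_seq, K=1.0, tau1=20.0, tau2=10.0, delay=10, ts=1.0):
    """Euler-discretized SOPDT response to an input sequence u_seq.
    K, tau1, tau2 are illustrative; delay (in samples) and ts follow the paper."""
    delay_line = deque([0.0] * delay)       # dead-time buffer of `delay` samples
    x1 = x2 = 0.0                           # states of the two first-order lags
    y_seq = []
    for u in u_seq:
        ud = delay_line.popleft()           # input delayed by `delay` samples
        delay_line.append(u)
        x1 += (ts / tau1) * (K * ud - x1)   # first first-order lag
        x2 += (ts / tau2) * (x1 - x2)       # second first-order lag
        y_seq.append(x2)
    return y_seq
```

Such a simulator serves as the RL environment: each episode applies the PID law to this model and returns the CV trajectory from which the episodic reward is computed.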
List of key hyperparameters used by our RL-based PID tuning algorithm

Learning rate (actor)                          0.0002
Learning rate (critic)                         0.0002
Memory size                                    1000
Batch size                                     32
Episode length                                 200
Environment state dimension                    30
Discount factor                                0.99
Polyak averaging coefficient                   0.999
Structure for actor and its target networks    [40, 30, 1]
Structure for critic and its target networks   [40, 30, 1]
Reward threshold                               15
Scaling parameter                              1
Initial noise variance                         0.05
Noise decay factor                             0.001
Stability-preserving automatic PID tuning results with Algorithm 1 with two tunable parameters
PID tuning results with Algorithm 1 with three tunable parameters
where
The selected classical PID tuning methods for the SOPDT model
IMC-PID tuning ^{[44]}
IMC-MAC tuning ^{[45]}
Closed-loop specified tuning ^{[46]}
Comparison of the RL-based PID auto-tuning (our approach), closed-loop specified, IMC-MAC, and IMC-PID tuning methods.
Comparison of the adaptivity of RL-based PID auto-tuning (our approach), gain-scheduling PID, and the fixed baseline PID controller to the process gain change. Left: episodic rewards (the negative of the mean squared tracking error) for these methods; Right: the ultimate CV profiles after the PID adapts to the new process gain based on these methods. Note that the green color shows the results where a fixed baseline PID from Case 2 is used throughout the entire experiment.
Evolution of the CV profiles with RL-based PID auto-tuning after introducing the process gain change.
We presented a novel stability-preserving framework for RL-based automatic PID tuning. This work is motivated by the observation that, although methods exist for tuning PID for complex systems, many conventional PID tuning methods rely on trial and error, and the PID parameters obtained from such empirical methods may not be optimal. Moreover, existing RL-based tuning methods cannot ensure closed-loop stability during the policy search. To this end, we propose a novel multi-scale episodic PID automatic tuning framework based on the DPG algorithm. In this method, the RL agent receives the reward and updates the network parameters once after each entire closed-loop operation. To preserve closed-loop stability, we employ a stable PID controller as a baseline whose reward is used as a benchmark. A supervisor mechanism monitors the running reward for any explored PID parameters. Once the running reward exceeds the benchmark, the supervisor mechanism replaces the underlying PID controller with the baseline controller to prevent an unstable response. Simulation examples show that our method can not only discover the optimal tuning parameters efficiently but also preserve closed-loop stability for all explored PID parameters, compared with standard RL-based PID tuning without stability preservation. Moreover, the proposed method is able to adapt to system changes without requiring any knowledge of the underlying operating condition. Future work includes enabling online and efficient PID tuning with RL-based methods for nonlinear systems while ensuring stability during the policy search.
This appendix is devoted to the proof of the overall closed-loop stability of our stability-preserving method.
We consider a generic nonlinear system
where
where
for all
For the proposed stability-preserving strategy, define the reward threshold for switching to the baseline controller as
Note that the switching time
M.A. Chowdhury acknowledges the support of Distinguished Graduate Student Assistantships (DGSA) from Texas Tech University. Q. Lu acknowledges the new faculty startup funds from Texas Tech University. All the authors acknowledge the valuable comments and suggestions from anonymous reviewers.
Concept development and data acquisition: Lakhani AI, Lu Q
Drafting the manuscript: Lu Q, Chowdhury MA
Modifying the manuscript: Lakhani AI, Chowdhury MA
Data will be made available upon email request to the corresponding author.
None.
All authors declared that there are no conflicts of interest.
Not applicable.
© The Author(s) 2022.
Zhong J, Li Y. Toward human-in-the-loop PID control based on CACLA reinforcement learning. In: International Conference on Intelligent Robotics and Applications. Springer; 2019. pp. 605-13.
10.1007/978-3-030-27535-8_54
Teoh E, Yee Y. Implementation of adaptive controllers using digital signal processor chips. In: Intelligent Tuning and Adaptive Control. Elsevier; 1991. pp. 109-13.
10.1016/B978-0-08-040935-1.50022-3
Seborg DE, Edgar TF, Mellichamp DA, Doyle III FJ. Process dynamics and control. John Wiley & Sons; 2016.
10.1021/cen-v056n033.p045
Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018.
10.1109/TNN.1998.712192
Rummery GA, Niranjan M. On-line Q-learning using connectionist systems. vol. 37. Citeseer; 1994.
Sutton RS, McAllester DA, Singh SP, Mansour Y. Policy gradient methods for reinforcement learning with function approximation. In: Advances in Neural Information Processing Systems; 2000. pp. 1057-63.
Silver D, Lever G, Heess N, Degris T, Wierstra D, et al. Deterministic policy gradient algorithms. In: International Conference on Machine Learning. PMLR; 2014. pp. 387-95.
Qin Y, Zhang W, Shi J, Liu J. Improve PID controller through reinforcement learning. In: 2018 IEEE CSAA Guidance, Navigation and Control Conference (CGNCC). IEEE; 2018. pp. 1-6.
10.1109/GNCC42960.2018.9019095
Sedighizadeh M, Rezazadeh A. Adaptive PID controller based on reinforcement learning for wind turbine control. In: Proceedings of World Academy of Science, Engineering and Technology. vol. 27. Citeseer; 2008. pp. 257-62.
Boubertakh H, Glorennec PY. Optimization of a fuzzy PI controller using reinforcement learning. In: 2006 2nd International Conference on Information & Communication Technologies. vol. 1. IEEE; 2006. pp. 1657-62.
10.1109/ICTTA.2006.1684633
El Hakim A, Hindersah H, Rijanto E. Application of reinforcement learning on self-tuning PID controller for soccer robot multi-agent system. In: 2013 Joint International Conference on Rural Information & Communication Technology and Electric-vehicle Technology (rICT & ICeVT). IEEE; 2013. pp. 1-6.
10.1109/rICT-ICeVT.2013.6741546
Park D, Yu H, Xuan-Mung N, Lee J, Hong SK. Multicopter PID attitude controller gain auto-tuning through reinforcement learning neural networks. In: Proceedings of the 2019 2nd International Conference on Control and Robot Technology; 2019. pp. 80-84.
10.1145/3387304.3387327
Omisore OM, Akinyemi T, Duan W, Du W, Wang L. A novel sample-efficient deep reinforcement learning with episodic policy transfer for PID-based control in cardiac catheterization robots.
Astrom KJ, Rundqwist L. Integrator windup and how to avoid it. In: 1989 American Control Conference. IEEE; 1989. pp. 1693-98.
10.23919/ACC.1989.4790464
Spielberg S, Tulsyan A, Lawrence NP, Loewen PD, Gopaluni RB. Deep reinforcement learning for process control: A primer for beginners.
10.1002/aic.16689
Watkins CJCH. Learning from delayed rewards. King's College, Cambridge United Kingdom; 1989.
10.1016/0921-8890(95)00026-C
Degris T, White M, Sutton RS. Off-policy actor-critic.
Lillicrap TP, Hunt JJ, Pritzel A, Heess N, Erez T, et al. Continuous control with deep reinforcement learning.
Ioffe S, Szegedy C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International Conference On Machine Learning. PMLR; 2015. pp. 448-56.
Ba JL, Kiros JR, Hinton GE. Layer normalization.
Liu Y, Halev A, Liu X. Policy learning with constraints in model-free reinforcement learning: A survey. In: Proceedings of the 30th International Joint Conference on Artificial Intelligence; 2021.
Bohez S, Abdolmaleki A, Neunert M, Buchli J, Heess N, et al. Value constrained modelfree continuous control.
10.24963/ijcai.2021/614
Ellis G. Four types of controllers.
10.1016/B978-0-12-385920-4.00006-0