© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Offshore crane operations are frequently carried out under adverse weather conditions. While offshore cranes attempt to finish the load-landing or -lifting operation, the impact between the loads and the vessels is critical, as it can cause serious injuries and extensive damage. Multiple offshore crane operations, including load-landing operations, have used reinforcement learning (RL) to control their activities. In this paper, the Q-learning algorithm is used to develop optimal control sequences for the offshore crane's actuators to minimize the impact velocity between the crane's load and the moving vessel. To expand the RL environment, a mathematical model is constructed for the dynamical analysis utilizing the Denavit–Hartenberg (DH) technique and the Lagrange approach. The Double Q-learning algorithm is used to locate the common bias in Q-learning algorithms. The average return feature was studied to assess the performance of the Q-learning algorithm. Furthermore, the trained control sequence was tested on a separate sample of episodes, and the hypothesis that, unlike supervised learning, reinforcement learning cannot have a global optimal control sequence but only a local one, was confirmed in this application domain.
Marine operation, offshore crane, Q-learning algorithm, reinforcement learning
Setting a heavy object down on the deck of a vessel is one of the most common marine operations. During the load-landing or -lifting process, any disturbance such as heave motion may lead to a significant impact on the load and equipment, which may cause fatal injury to the crew and permanent damage. Due to a variety of factors, such as ship motions, crane mechanics, and other factors, achieving a soft load landing with acceptable impact force and a small distance is challenging. Hence, a lot of efforts have been carried out to facilitate this operation; some introduced a variety of control algorithms to automate the process, and others intended to provide training methodology for the operators through simulators, virtual reality, and augmented reality. Several control techniques for marine operations have been proposed. A payload position control of offshore crane was developed by Park et al. using uniformly ultimately bounded (UUB) theory, integrated with the input–output linearization control technique (IOLC). A passive heave compensation system for the offshore landing process was studied by Huster et al. and Ni et al.. The passive shock isolator has some limitations due to the higher weights of loads and the hard working environment; hence, Zhu et al. presented a feedback control strategy where a shock isolator with an optimal controller is introduced to reduce and minimize the peak force transferred to the load. The commonality of the previous techniques is that they are model-based control algorithms. Hence, different dynamical models are presented in the literature. Mackojc et al. presented a six-degree of freedom (6DOF) mathematical model for the vessel and a 3DOF model for the lifting system, and they combined them to produce a payload–vessel system to facilitate the comprehensive investigation of mutual interactions. Idres et al. and Ellermann et al. developed a nonlinear coupled model for a crane and cargo based on the assumption that the cargo was a point mass. Cha et al. established the coupled model between the floating crane and the cargo considering the geometries of cargo. Spong's book and Williams et al. built a robot-like model for the knuckle crane using DH notation and Lagrange's approach. Another approach is using the reinforcement learning (RL) theory. RL is a learning method that gradually explores the optimal policy by interacting with the environment. Andersson et al. proposed actuator space control policies for a forestry crane manipulator to be energy efficient in log grasping by involving a simple curriculum in a deep reinforcement learning setup. Moreover, Gaudet et al. introduced a new integrated guidance and control method for a spaceship based on reinforcement learning concepts. They used the control algorithm to learn a policy that directly maps the lander's estimated state to a commanded thrust for each engine, resulting in precise and fuel-efficient trajectories. Sun and Xie introduced a backstepping control scheme based on reinforcement fuzzy Q-learning to control container cranes. Ding built different environments and used deep neural networks with RL algorithms to set down the load softly. Based on the aforementioned results, in this work, we focus on providing optimal control sequences for an offshore crane to get acceptable impact velocity while landing its load on a moving vessel using reinforcement learning algorithms, where a mathematical model of the offshore crane was established to be involved in the reinforcement learning environment, Q-learning algorithm was created using MATLAB platform to control the crane actuators, and the algorithm performance was measured against the variation of its hyperparameters. Moreover, the bias that usually occurs in Q-learning algorithms was tested using Double Q-learning algorithm. Double Q-learning is an off-policy value-based reinforcement learning algorithm with no positive bias in estimating the action values in stochastic environments.
The remainder of this paper is organized as follows. In Section 2, the problem statement with the operation system assumptions are described in detail, and the algorithms' structures are provided as well. In Section 3, a description of the RL algorithms that are used during this work is given. In Section 4, a simulation for the control sequences is demonstrated, and a comparison between the behavior of the environment under different assumptions is introduced. Section 5 is a discussion of the obtained results, and it stands on the strength of this work.
In this work, two different environments are considered, namely initial environment and upgraded environment, as described in the following.
In this environment, the wave of every episode is randomly generated from the JONSWAP spectra created by the same sea state. Therefore, the wave elevation differs between episodes but not the statistical properties . Here, the sea state is assumed to be:
where Hs is the wave height and Tp is the peak wave period. In a real application, the vertical motion of a vessel can be predicted approximately 4 s ahead of time with high accuracy. This gives us the reliability to plane a control sequence that can be visualized to the crane operator using technologies such as augmented reality. Moreover, the following assumptions are imposed on the load-landing operation problem:
Ⅰ Neglect all the crane and load dynamics.
Ⅱ The mass and stiffness of the barge are neglected.
Ⅲ The barge has the same amplitude of the wave (barge dynamics is neglected and its response is 1 to all the wave frequency).
Ⅳ Neglect the wave effect on the crane.
Ⅴ Consider the hoist speed as the action to the control input directly.
In this environment, the agent–environment interaction is described through the following set of equations:
where Ph is the hoist position; Pw is the wave height; Dr is the relative distance between Ph and Pw; at is the chosen action (hoist velocity); and tstep is the time step of discretization. The action space in this environment consists of 11 actions
The environment is upgraded by inserting the forward kinematic model of a knuckle crane. Considering the crane as a three-revolute joints robotics arm, the input to the environment, in this case, is the actuators' angles.
State observation (forward kinematic)
The mathematical model for the knuckle crane was generated by simulating the crane as a robotic arm with three joins, as shown in Figure 1, with the following assumptions:
Ⅰ Neglect the dynamics of the load and wires.
Ⅱ The second and third joints are actuated through hydraulic cylinders, but the actuator models are not taken into account.
Ⅲ The wave amplitude changes with time but not with the x-direction of the global reference; in other words, the crane end-effector is assumed to be interacting with the same wave amplitude in all of its x-direction positions at a specific time instant.
The model is established using the Denavit–Hartenberg (DH) parameters in Table 1.
Hence, the four homogeneous transformation matrices from frame 0 to 3 are obtained as follows:
The position of the end-effector, as a function of the joints angles, can be represented as:
where qi, with
For the action space in this environment, the effect of the hydraulic cylinders that actuate the crane links is neglected, and the crane angles are assumed to be controlled directly. Hence, the action space in this environment is described as:
Note that q1 is an array of zeros and only a 2D workspace is considered. Hence, in the upgraded environment, we have 25 different combinations of actions, where we take the permutations of q2 and q3 and set the number of elements of q1 to 25 elements. Moreover, the dynamical model [Equation (3)] was developed using the Lagrange approach and the equation of motion, which are used to extract the torque control sequence from the optimal control sequence of joint angles.
The impact velocity
The state-of-the-art Q-learning algorithm is used in this work with deterministic dynamics state estimation and one-step TD (TD(0)) learning approach; the structure of the Q-Learning algorithm was modified by implementing the reward structures described in Section 2.2, the state observation methodology depending on the environment under consideration, and the setting parameters in Table 2. The full structure of the algorithm is described in Algorithm 1.
Algorithm 1 Q-Learning Algorithm. 1: initialize Q-function, 2: set initial state 3: policy = 4: let P(exploratory) = Probability of selecting an action according to exploratory policy.
5: repeat 6: set P(exploratory) = 7: for every time step k=0, 1, 2, 8: if policy = exploratory then 9: select action 10: else 11: select action 12: end if 13: apply 14: if impact occurred then 15: measure impact velocity 16: set 17: else 18: measure intermediate reward 19: set 20: end if 21:
22: 23: end for 24: update N-steps, update 25: until
|Episode length||90 s||0.97|
|Action space||Av or Aq|
|State space||150 states|
|Agent||Ph(1)+ 30 s of Pw||N-steps(limit)||10, 000|
The Double Q-learning algorithm has the ability to determine an unbiased estimate of the Q-value, because for each update, one set of weights is used to determine the greedy policy and the other to determine its value . Algorithm 2 is the structure of the Double Q-learning in this framework, where the action
Algorithm 2 Double Q-Learning Algorithm. 1: initialize 2: set initial state 3: repeat 4: Choose 5: Choose (random) either UPDATE(A) or UPDATE(B)
6: if UPDATE(A) then 7: Define 8: else if UPDATE(B) then 9: Define 10: end if 11: apply 12: if impact occurred then 13: measure impact velocity 14: end if 15: 16: until
In Table 2,
Figure 3 contains acceptable landing operations, where the agent managed to land the load on the vessel with impact velocity less than 0.3 m/s.
Note that, at the beginning of the episodes, the hoist velocity is fixed to 0.3 m/s; hence, the learning process takes place after the hoist reaches 2 m from the average wave amplitude. Moreover, this fixed starting action helps to avoid the delusion at the beginning of an episode, for example, going above the starting position, which would slow down the learning process. In addition, these results were obtained with the initial setting mentioned in Table 2.
Since the algorithm is model-free, it is not possible to measure the bias that usually exists in the Q-learning algorithm. Hence, the results of the Q-learning are compared to the Double Q-learning results so the bias value can be measured. Figure 5 shows that DQL has the same sparsity
Moreover, it is clear that DQL does not suffer from the delusion of the values of the average return at the beginning of the learning process; on the contrary, QL does suffer from overshooting the asymptotic value, which is clear in the case of
Note that the episodes number in the case of
The same Q-learning algorithm was tested in the upgraded environment, and it can be noticed that in the majority of the episodes in this case, the agent could find an accepted impact velocity at early stages of the episode time span, as shown in Figure 4.
Moreover, by using the dynamical model, the input torques corresponding to the optimal control sequence of angles of one of the episodes can be calculated, as shown in Figure 6, where it is clear that Link 2 has the highest inertia, which extends the limits of the torque domain.
Figure 7 indicates two differences between the initial environment (Ini Env) and the upgraded environment (Upg Env). First, the average return in the case of the upgraded environment is much less sparse than the initial environment. Second, the asymptotic average return values do not coincide.
Reinforcement learning is, by definition, an online learning environment. There is no separate test phase because the agent never stops learning. In other words, RL problems are usually of the "continual learning'' type and the goal is to get the highest total reward, which is usually discounted over time. The most common technique for measuring RL performance is to look at the average return. In contrast with the supervised learning case, there are no standard performance measures yet, even in popular application domains. Hence, the usual testing phase is to compare the algorithms to each other, as in our comparison between Q-learning and Double Q-learning in Section 4.1.2. In this section, we aim to generalize the optimal control sequences that were achieved in the online learning process. Hence, a testing sample of 500 episodes was generated considering the initial environment. For specific configurations, the agent was forced to follow a randomly selected control sequence from those considered optimal in the online learning process. The accuracy, which describes the agent success ratio to achieve acceptable impact velocity among the 500 episodes of the testing sample, is reported in Table 3 for different configurations. Note that the agent accuracy on the upgraded environment does not exceed 1.5% for all configurations.
Agent accuracy on 500 episodes with
The purpose of this work is to control the impact that occurs in offshore crane load landing operations in one of the most complicated environments due to the impact of several conditions. Although strong assumptions were set, two environments were established. The first considered the crane hoist as a point mass and the second was enlarged with the crane structure using a robot-like mathematical model, which was established and simulated using MATLAB and Simulink. Both environments have the sea waves model involved in the vessel's motion where the crane's load is supposed to be placed.
Optimal control sequences were generated using the Q-learning algorithm, and they managed to reach acceptable impact velocities in online learning processes. The performance of the Q-learning algorithm was tested, and we conclude the following: an intermediate reward structure is needed to overcome the effect of the agent delusion that occurs, as the reward in the usual Q-learning algorithm is delayed to the impact point; hence, less time would be required to accomplish the task.
For the same discount factor, the lower the learning rate, the less sparse the values of the average return, and the more delayed the convergence to the asymptotic value. Moreover, the asymptotic value of the average return is reduced while reducing the learning rate; although this effect is not essential in the initial environment, it is clear in more complicated environments such as the upgraded environment in our work. The change of the discount factor---for the same other hyperparameters values---is directly proportional to the asymptotic value of the average return but with no magnificent effect on the convergence time. The algorithm performance was compared to the Double Q-learning technique; hence, the bias in the Q-learning technique can be adjusted.
Reinforcement learning, in general, does not have a separate testing sample, as every learning process is online and every control sequence is unique on its own episode. In other words, we can say that the obtained control sequence has local optimality. This was verified. We tested the agents on a sample of 500 episodes and concluded that the accuracy of the agent is facing a high variance; even when we set all the algorithm hyperparameters to be fixed, the initial position of the crane where the control sequence is tested still has a significant effect in the variation of the accuracy value.
On the other hand, although the accuracy range of the best trials in the initial environment was between 71% and 80%, this range was tremendously reduced to between 0.1% and 1.5% in the upgraded environment. We attribute that to the size of action_space, as the initial environment agent has one input of 11 actions and the upgraded environment agent has three inputs and 25 actions; hence, the more complicated the action_space, the more unique the optimal control sequence and the harder it is to find a global optimal control sequence. Thus, choosing the action_space is important not only from the learning point of view but also from the feasibility in the physical domain; in this work, it is shown that avoiding setting up sudden movements and more effort to the links with high inertia are needed.
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Maamoun KSA, Karimi HR
This work was partially supported by the Italian Ministry of Education, University and Research through the Project "Department of Excellence LIS4.0-Lightweight and Smart Structures for Industry 4.0" and in part by the Horizon Marie Skłodowska-Curie Actions program (101073037).
All authors declared that there are no conflicts of interest.
© The Author(s) 2022.
1. Park HC, Chakir S, Kim YB, Lee DH. A Robust payload control system design for offshore cranes: experimental study. Electronics 2021;10:462.DOI
2. Huster A, Bergstrom H, Gosior J, White D. Design and operational performance of a standalone passive heave compensation system for a work class ROV. In: OCEANS 2009. IEEE; 2009. pp. 1–8.DOI
3. Ni J, Liu S, Wang M, Hu X, Dai Y. The simulation research on passive heave compensation system for deep sea mining. In: 2009 International Conference on Mechatronics and Automation. IEEE; 2009. pp. 5111–16.DOI
4. Zhu M, Zhang P, Zhu C, Jia X. Dynamic analysis and optimal control of the landing process of the offshore installation. Adv Mech Eng 2017; doi: 10.1177/1687814017727971.DOI
5. Mackojć A, Chiliński B. Preliminary modelling methodology of a coupled payload-vessel system for offshore lifts of light and heavyweight objects. Bulletin of the Polish Academy of Sciences: Technical Sciences 2022;70: e139003. Available from:.DOI
6. Idres M, Youssef K, Mook D, Nayfeh A. A nonlinear 8-DOF coupled crane-ship dynamic model. In: 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference; 2003. p. 1855.DOI
7. Ellermann K, Kreuzer E, Markiewicz M. Nonlinear dynamics of floating cranes. Nonlinear Dynamics 2002;27:107-83.DOI
8. Cha JH, Roh MI, Lee KY. Dynamic response simulation of a heavy cargo suspended by a floating crane based on multibody system dynamics. Ocean Engineering 2010;37:1273-91.DOI
9. Spong MW, Hutchinson S, Vidyasagar M, et al. Robot modeling and control. vol. 3. Wiley New York; 2006.DOI
10. Williams LA. Modelling, Simulation and Control of offshore crane Develop a kinematic and dynamic crane model and study of several control designs [MastersThesis]. Universitetet i Agder; University of Agder. Norway; 2018. Available from: http://hdl.handle.net/11250/2564033.
11. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018. Available from: https://mitpress.mit.edu/9780262039246/.
12. Andersson J, Bodin K, Lindmark D, Servin M, Wallin E. Reinforcement Learning Control of a Forestry Crane Manipulator. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2021. pp. 2121–26.DOI
13. Gaudet B, Linares R, Furfaro R. Deep reinforcement learning for six degree-of-freedom planetary landing. Adv Space Res 2020;65:1723-41.DOI
14. Sun X, Xie Z. Reinforcement Learning-Based Backstepping Control for Container Cranes. Mat Pro Eng 2020;2020.DOI
15. Ding M. Reinforcement Learning For Offshore Crane Set-down Operations [MastersThesis]. University of Croningen. Netherlands; 2018. Available from: https://www.ai.rug.nl/~mwiering/Thesis_Mingcheng_Ding.pdf.
16. Vazirizade SM. An intelligent integrated method for reliability estimation of offshore structures wave loading applied in time domain[PhdThesis]. The University of Arizona. USA; 2019. Available from: https://repository.arizona.edu/handle/10150/636592.
17. Küchler S, Mahl T, Neupert J, Schneider K, Sawodny O. Active control for an offshore crane using prediction of the vessel's motion. IEEE/ASME Transactions on Mechatronics 2010;16:297-309.DOI
18. Hasselt H. Double Q-learning. Advances in neural information processing systems 2010;23. Available from: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html.
19. Zhang C, Vinyals O, Munos R, Bengio S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv: 180406893 2018.
Maamoun KSA, Karimi HR. Reinforcement learning-based control for offshore crane load-landing operations. Complex Eng Syst 2022;2:13. http://dx.doi.org/10.20517/ces.2022.28