Correspondence to: Khaled Said Ahmed Maamoun, Automation and control, Politecnico di Milano, 32, Piazza Leonardo Da Vinci, Milan 20133, Italy. Email: khaledsaid.maamoun@mail.polimi.it
© The Author(s) 2022. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, sharing, adaptation, distribution and reproduction in any medium or format, for any purpose, even commercially, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Offshore crane operations are frequently carried out under adverse weather conditions. While an offshore crane completes a load-landing or lifting operation, the impact between the load and the vessel is critical, as it can cause serious injuries and extensive damage. Reinforcement learning (RL) has been used to control several offshore crane operations, including load-landing operations. In this paper, the Q-learning algorithm is used to develop optimal control sequences for the offshore crane's actuators to minimize the impact velocity between the crane's load and the moving vessel. To expand the RL environment, a mathematical model is constructed for the dynamical analysis utilizing the Denavit–Hartenberg (DH) technique and the Lagrange approach. The Double Q-learning algorithm is used to locate the common bias in Q-learning algorithms. The average return feature was studied to assess the performance of the Q-learning algorithm. Furthermore, the trained control sequence was tested on a separate sample of episodes, and the hypothesis that, unlike supervised learning, reinforcement learning cannot have a global optimal control sequence but only a local one, was confirmed in this application domain.
Marine operation, offshore crane, Q-learning algorithm, reinforcement learning
Setting a heavy object down on the deck of a vessel is one of the most common marine operations. During the load-landing or lifting process, any disturbance such as heave motion may lead to a significant impact on the load and equipment, which may cause fatal injury to the crew and permanent damage. Owing to factors such as ship motions and crane mechanics, achieving a soft load landing with an acceptable impact force and a small distance is challenging. Hence, considerable effort has been devoted to facilitating this operation; some studies introduced control algorithms to automate the process, and others aimed to provide training methodologies for operators through simulators, virtual reality, and augmented reality. Several control techniques for marine operations have been proposed. A payload position control of an offshore crane was developed by Park et al.^{[1]} using uniformly ultimately bounded (UUB) theory, integrated with the input–output linearization control technique (IOLC). A passive heave compensation system for the offshore landing process was studied by Huster et al.^{[2]} and Ni et al.^{[3]}. The passive shock isolator has some limitations due to the higher weights of loads and the harsh working environment; hence, Zhu et al.^{[4]} presented a feedback control strategy where a shock isolator with an optimal controller is introduced to minimize the peak force transferred to the load. What the previous techniques have in common is that they are model-based control algorithms. Hence, different dynamical models are presented in the literature. Mackojć et al.^{[5]} presented a six-degree-of-freedom (6-DOF) mathematical model for the vessel and a 3-DOF model for the lifting system, and they combined them to produce a payload–vessel system to facilitate the comprehensive investigation of mutual interactions.
Idres et al.^{[6]} and Ellermann et al.^{[7]} developed a nonlinear coupled model for a crane and cargo based on the assumption that the cargo was a point mass. Cha et al.^{[8]} established the coupled model between the floating crane and the cargo considering the geometry of the cargo. Spong's book^{[9]} and Williams et al.^{[10]} built a robot-like model for the knuckle crane using DH notation and Lagrange's approach. Another approach is to use reinforcement learning (RL) theory. RL is a learning method that gradually explores the optimal policy by interacting with the environment^{[11]}. Andersson et al.^{[12]} proposed actuator-space control policies for a forestry crane manipulator to be energy efficient in log grasping by involving a simple curriculum in a deep reinforcement learning setup. Moreover, Gaudet et al.^{[13]} introduced a new integrated guidance and control method for a spaceship based on reinforcement learning concepts. They used the control algorithm to learn a policy that directly maps the lander's estimated state to a commanded thrust for each engine, resulting in precise and fuel-efficient trajectories. Sun and Xie^{[14]} introduced a backstepping control scheme based on reinforcement fuzzy Q-learning to control container cranes. Ding^{[15]} built different environments and used deep neural networks with RL algorithms to set down the load softly. Building on these results, this work focuses on providing optimal control sequences for an offshore crane to obtain an acceptable impact velocity while landing its load on a moving vessel using reinforcement learning algorithms. A mathematical model of the offshore crane was established and included in the reinforcement learning environment, a Q-learning algorithm was implemented on the MATLAB platform to control the crane actuators, and the algorithm performance was measured against the variation of its hyperparameters.
Moreover, the bias that usually occurs in Q-learning algorithms was tested using the Double Q-learning algorithm. Double Q-learning is an off-policy value-based reinforcement learning algorithm with no positive bias in estimating the action values in stochastic environments.
The remainder of this paper is organized as follows. In Section 2, the problem statement and the operation system assumptions are described in detail, and the algorithms' structures are provided as well. In Section 3, a description of the RL algorithms used in this work is given. In Section 4, a simulation of the control sequences is demonstrated, and a comparison between the behavior of the environment under different assumptions is introduced. Section 5 discusses the obtained results and highlights the strengths of this work.
In this work, two different environments are considered, namely the initial environment and the upgraded environment, as described in the following.
In this environment, the wave of every episode is randomly generated from the JONSWAP spectrum of the same sea state. Therefore, the wave elevation differs between episodes, but its statistical properties do not^{[16]}. Here, the sea state is assumed to be:
where H_{s} is the wave height and T_{p} is the peak wave period. In a real application, the vertical motion of a vessel can be predicted approximately 4 s ahead of time with high accuracy^{[17]}. This makes it feasible to plan a control sequence that can be visualized for the crane operator using technologies such as augmented reality. Moreover, the following assumptions are imposed on the load-landing operation problem:
Assumption 1
Ⅰ Neglect all the crane and load dynamics.
Ⅱ The mass and stiffness of the barge are neglected.
Ⅲ The barge has the same amplitude as the wave (barge dynamics are neglected and its response is 1 at all wave frequencies).
Ⅳ Neglect the wave effect on the crane.
Ⅴ Take the hoist speed directly as the action (control input).
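The wave generation described above can be sketched as follows. This is a minimal illustration rather than the MATLAB implementation used in this work: it assumes the common H_s/T_p parameterization of the JONSWAP spectrum and a sum of cosines with random phases, and the function names, the frequency range, and the values of `hs`, `tp`, and `gamma` are placeholders, since the sea state is not reproduced here. Under Assumption 1, the barge heave equals the wave elevation.

```python
import math
import random


def jonswap(omega, hs, tp, gamma=3.3):
    """JONSWAP spectral density S(omega) in m^2*s/rad (Hs/Tp form, assumed)."""
    wp = 2.0 * math.pi / tp                      # peak angular frequency
    sigma = 0.07 if omega <= wp else 0.09        # spectral width parameter
    r = math.exp(-((omega - wp) ** 2) / (2.0 * sigma ** 2 * wp ** 2))
    a_gamma = 1.0 - 0.287 * math.log(gamma)      # approximate normalizing factor
    pm = (5.0 / 16.0) * hs ** 2 * wp ** 4 * omega ** -5 \
        * math.exp(-1.25 * (wp / omega) ** 4)    # Pierson-Moskowitz part
    return a_gamma * pm * gamma ** r


def make_wave(hs, tp, n_components=200, w_min=0.2, w_max=3.0, seed=None):
    """Return (elevation(t), components) for one random wave realization."""
    rng = random.Random(seed)
    dw = (w_max - w_min) / n_components
    comps = []
    for i in range(n_components):
        w = w_min + (i + 0.5) * dw               # midpoint of each frequency bin
        amp = math.sqrt(2.0 * jonswap(w, hs, tp) * dw)
        comps.append((amp, w, rng.uniform(0.0, 2.0 * math.pi)))

    def elevation(t):
        return sum(a * math.cos(w * t + p) for a, w, p in comps)

    return elevation, comps
```

Two calls with different seeds give different elevation histories, while the component amplitudes, and therefore the variance of the elevation, stay the same. This matches the statement that the episodes differ in elevation but not in statistical properties.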
State observation
In this environment, the agent–environment interaction is described through the following set of equations:
where P_{h} is the hoist position; P_{w} is the wave height; D_{r} is the relative distance between P_{h} and P_{w}; a_{t} is the chosen action (hoist velocity); and t_{step} is the time step of discretization. The action space in this environment consists of 11 actions
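A minimal sketch of one interaction step in the initial environment is given next. The update equations are not reproduced in this text, so the rule used here is an assumption: the hoist position is integrated from the chosen hoist velocity over one time step, the relative distance is D_r = P_h − P_w, and an impact is declared when D_r reaches zero. The 11 velocity values, the sign convention (negative means lowering), and the names `HoistState` and `step` are hypothetical.

```python
from dataclasses import dataclass

T_STEP = 0.2  # s, time step of discretization (Table 2)

# Hypothetical action set: 11 hoist velocities in m/s, negative = lowering.
ACTIONS = [round(-0.5 + 0.1 * i, 1) for i in range(11)]


@dataclass
class HoistState:
    p_h: float  # hoist position (m)
    p_w: float  # wave height below the load (m)

    @property
    def d_r(self):
        """Relative distance between hoist and wave (m)."""
        return self.p_h - self.p_w


def step(state, action_index, next_wave_height, t_step=T_STEP):
    """Apply one action; return (new_state, impact, impact_velocity)."""
    a_t = ACTIONS[action_index]
    new_state = HoistState(state.p_h + a_t * t_step, next_wave_height)
    impact = new_state.d_r <= 0.0
    # Assumed definition: relative velocity between hoist and wave surface.
    wave_velocity = (next_wave_height - state.p_w) / t_step
    impact_velocity = abs(a_t - wave_velocity) if impact else None
    return new_state, impact, impact_velocity
```

For example, `step(HoistState(2.0, 0.0), 0, 0.1)` lowers the hoist by 0.1 m and reports no impact, because the relative distance is still positive.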
The environment is upgraded by inserting the forward kinematic model of a knuckle crane. Considering the crane as a robotic arm with three revolute joints, the input to the environment, in this case, is the actuators' angles.
State observation (forward kinematic)
The mathematical model for the knuckle crane was generated by simulating the crane as a robotic arm with three joints, as shown in Figure 1, with the following assumptions:
Assumption 2
Ⅰ Neglect the dynamics of the load and wires.
Ⅱ The second and third joints are actuated through hydraulic cylinders, but the actuator models are not taken into account.
Ⅲ The wave amplitude changes with time but not with the x-direction of the global reference; in other words, the crane end-effector is assumed to be interacting with the same wave amplitude in all of its x-direction positions at a specific time instant.
The model is established using the Denavit–Hartenberg (DH) parameters in Table 1.
DH parameters
Link  a_{i}  α_{i}  d_{i}  θ_{i}
1  0  l_{1}  q_{1}  
2  l_{2}  0  0  q_{2} 
3  l_{2}  0  0  q_{3} 
Hence, the four homogeneous transformation matrices from frame 0 to 3 are obtained as follows:
and
The position of the endeffector, as a function of the joints angles, can be represented as:
where q_{i}, with
For the action space in this environment, the effect of the hydraulic cylinders that actuate the crane links is neglected, and the crane angles are assumed to be controlled directly. Hence, the action space in this environment is described as:
Note that q_{1} is an array of zeros and only a 2D workspace is considered. Hence, in the upgraded environment, we have 25 different combinations of actions, obtained by taking all pairings of the q_{2} and q_{3} values and setting q_{1} to an array of 25 zeros. Moreover, the dynamical model [Equation (3)] was developed using the Lagrange approach; the resulting equations of motion are used to extract the torque control sequence from the optimal control sequence of joint angles.
where
The impact velocity
where
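The planar forward kinematics and the 25-action space of the upgraded environment can be sketched as below. This is an illustration under stated assumptions: the crane is treated as a standard elbow-type arm with q_1 = 0, so the end-effector moves in the x–z plane, and the link lengths are passed in as free parameters. The joint increments in `DQ` and the names `end_effector` and `ACTIONS` are hypothetical, since the numerical action values are not reproduced in this text.

```python
import math
from itertools import product


def end_effector(q2, q3, l1, l2, l3):
    """Planar end-effector position (x, z) for q1 = 0 (assumed elbow arm)."""
    x = l2 * math.cos(q2) + l3 * math.cos(q2 + q3)
    z = l1 + l2 * math.sin(q2) + l3 * math.sin(q2 + q3)
    return x, z


# Hypothetical joint-angle increments per step (rad).
DQ = [-0.02, -0.01, 0.0, 0.01, 0.02]

# Every pairing of the q2 and q3 increments, with q1 fixed to zero:
# 5 x 5 = 25 actions of the form (dq1, dq2, dq3).
ACTIONS = [(0.0, dq2, dq3) for dq2, dq3 in product(DQ, DQ)]
```

With both joints at zero the arm is stretched horizontally, so `end_effector(0.0, 0.0, 1.0, 2.0, 3.0)` returns `(5.0, 1.0)`.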
The state-of-the-art Q-learning algorithm is used in this work with deterministic dynamics state estimation and a one-step TD (TD(0)) learning approach; the structure of the Q-learning algorithm was modified by implementing the reward structures described in Section 2.2, the state observation methodology depending on the environment under consideration, and the parameter settings in Table 2. The full structure of the algorithm is described in Algorithm 1.
Algorithm 1 Q-Learning Algorithm.
1: initialize Q-function, 
2: set initial state 
3: policy = 
4: let P(exploratory) = Probability of selecting an action according to exploratory policy. 
5: repeat 
6: set P(exploratory) = 
7: for every time step k=0, 1, 2, 
8: if policy = exploratory then 
9: select action 
10: else 
11: select action 
12: end if 
13: apply 
14: if impact occurred then 
15: measure impact velocity 
16: set 
17: else 
18: measure intermediate reward 
19: set 
20: end if 
21:

22: 
23: end for 
24: update Nsteps, update 
25: until 
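The core of Algorithm 1 can be illustrated with a generic tabular sketch: an exploratory (epsilon-greedy) action choice followed by the one-step TD(0) update. This is the textbook form of Q-learning, not a transcription of the MATLAB code used in this work; the reward structure, the exploration schedule, and the function names are assumptions.

```python
import random
from collections import defaultdict


def make_q_table(n_actions):
    """Q-table that returns zeros for states seen for the first time."""
    return defaultdict(lambda: [0.0] * n_actions)


def select_action(q, state, epsilon, rng=random):
    """Exploratory action with probability epsilon, greedy action otherwise."""
    values = q[state]
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=values.__getitem__)


def q_update(q, state, action, reward, next_state, done, alpha, gamma):
    """One-step TD(0) update; returns the new Q(state, action)."""
    # At impact (done) there is no successor value to bootstrap from.
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

Here `alpha` is the learning rate and `gamma` the discount factor; with `done=True` the update uses only the terminal reward measured at impact.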
Algorithm setting
Setting  Value  Setting  Value 
t_{step}  0.2 s  
Episode length  90 s  0.97  
Action space  A_{v} or A_{q}  
State space  150 states  
Agent  P_{h}(1) + 30 s of P_{w}  Nsteps(limit)  10,000
The Double Q-learning algorithm has the ability to determine an unbiased estimate of the Q-value, because for each update, one set of weights is used to determine the greedy policy and the other to determine its value^{[18]}. Algorithm 2 is the structure of the Double Q-learning in this framework, where the action
Algorithm 2 Double Q-Learning Algorithm.
1: initialize 
2: set initial state 
3: repeat 
4: Choose 
5: Choose (random) either UPDATE(A) or UPDATE(B) 
6: if UPDATE(A) then 
7: Define 
8: else if UPDATE(B) then 
9: Define 
10: end if 
11: apply 
12: if impact occurred then 
13: measure impact velocity 
14: end if 
15: 
16: until 
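The UPDATE(A)/UPDATE(B) step of Algorithm 2 can be sketched in tabular form as follows. It follows the standard Double Q-learning rule of reference [18]: the table being updated selects the greedy action in the next state, and the other table supplies its value. The function names and the use of the summed tables for action selection are assumptions, not details taken from the implementation in this work.

```python
import random
from collections import defaultdict


def make_table(n_actions):
    """Action-value table that returns zeros for unseen states."""
    return defaultdict(lambda: [0.0] * n_actions)


def double_q_update(q_a, q_b, state, action, reward, next_state, done,
                    alpha, gamma, rng=random):
    """Randomly pick UPDATE(A) or UPDATE(B) and apply one update."""
    if rng.random() < 0.5:
        learn, other = q_a, q_b   # UPDATE(A)
    else:
        learn, other = q_b, q_a   # UPDATE(B)
    if done:
        target = reward
    else:
        nxt = learn[next_state]
        best = max(range(len(nxt)), key=nxt.__getitem__)     # argmax from own table
        target = reward + gamma * other[next_state][best]    # value from other table
    learn[state][action] += alpha * (target - learn[state][action])


def behaviour_values(q_a, q_b, state):
    """Combined values commonly used to choose the next action."""
    return [a + b for a, b in zip(q_a[state], q_b[state])]
```

Because the maximizing action and its value come from different tables, the positive bias of the single-table maximum is avoided.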
In Table 2,
Figure 3 contains acceptable landing operations, where the agent managed to land the load on the vessel with impact velocity less than 0.3 m/s.
Note that, at the beginning of the episodes, the hoist velocity is fixed to 0.3 m/s; hence, the learning process takes place after the hoist reaches 2 m from the average wave amplitude. Moreover, this fixed starting action helps to avoid the delusion at the beginning of an episode, for example, going above the starting position, which would slow down the learning process. In addition, these results were obtained with the initial setting mentioned in Table 2.
Since the algorithm is model-free, it is not possible to measure directly the bias that usually exists in the Q-learning algorithm. Hence, the results of Q-learning are compared to the Double Q-learning (DQL) results so the bias value can be measured. Figure 5 shows that DQL has the same sparsity
Moreover, it is clear that DQL does not suffer from the delusion of the values of the average return at the beginning of the learning process; on the contrary, QL does suffer from overshooting the asymptotic value, which is clear in the case of
Note that the episodes number in the case of
The same Q-learning algorithm was tested in the upgraded environment; in the majority of the episodes, the agent found an acceptable impact velocity at an early stage of the episode time span, as shown in Figure 4.
Moreover, by using the dynamical model, the input torques corresponding to the optimal control sequence of angles of one of the episodes can be calculated, as shown in Figure 6, where it is clear that Link 2 has the highest inertia, which extends the limits of the torque domain.
Figure 7 indicates two differences between the initial environment (Ini Env) and the upgraded environment (Upg Env). First, the average return in the case of the upgraded environment is much less sparse than the initial environment. Second, the asymptotic average return values do not coincide.
Reinforcement learning is, by definition, an online learning setting. There is no separate test phase because the agent never stops learning. In other words, RL problems are usually of the "continual learning" type and the goal is to get the highest total reward, which is usually discounted over time. The most common technique for measuring RL performance is to look at the average return. In contrast with the supervised learning case, there are no standard performance measures yet, even in popular application domains^{[19]}. Hence, the usual testing phase is to compare the algorithms to each other, as in our comparison between Q-learning and Double Q-learning in Section 4.1.2. In this section, we aim to generalize the optimal control sequences that were achieved in the online learning process. Hence, a testing sample of 500 episodes was generated considering the initial environment. For specific configurations, the agent was forced to follow a randomly selected control sequence from those considered optimal in the online learning process. The accuracy, defined as the fraction of the 500 test episodes in which the agent achieves an acceptable impact velocity, is reported in Table 3 for different configurations. Note that the agent accuracy on the upgraded environment does not exceed 1.5% for all configurations.
Agent accuracy on 500 episodes with
Agent  Initial pos  Accuracy 
QL  
DQL  
QL  
DQL  
QL  
DQL 
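The accuracy measure reported in Table 3 can be computed as in the following sketch, assuming that a test episode counts as a success when its impact velocity is below the 0.3 m/s limit used for acceptable landings and that an episode without a landing is recorded as `None`. The function name and the `None` convention are illustrative.

```python
def landing_accuracy(impact_velocities, limit=0.3):
    """Fraction of episodes with |impact velocity| below `limit` (m/s).

    An entry of None marks an episode in which the load never landed;
    it is counted as a failure.
    """
    if not impact_velocities:
        raise ValueError("at least one episode is required")
    successes = sum(
        1 for v in impact_velocities if v is not None and abs(v) < limit
    )
    return successes / len(impact_velocities)
```

For a 500-episode test sample, the function is called once per configuration with the list of measured impact velocities.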
The purpose of this work is to control the impact that occurs in offshore crane load-landing operations, which take place in one of the most complicated environments owing to the influence of several conditions. Although strong assumptions were set, two environments were established. The first considered the crane hoist as a point mass and the second was extended with the crane structure using a robot-like mathematical model, which was established and simulated using MATLAB and Simulink. Both environments include the sea wave model in the motion of the vessel on which the crane's load is to be placed.
Optimal control sequences were generated using the Q-learning algorithm, and they managed to reach acceptable impact velocities in online learning processes. The performance of the Q-learning algorithm was tested, and we conclude the following: an intermediate reward structure is needed to overcome the agent delusion that occurs because the reward in the usual Q-learning algorithm is delayed until the impact point; with it, less time is required to accomplish the task.
For the same discount factor, the lower the learning rate, the less sparse the values of the average return, and the more delayed the convergence to the asymptotic value. Moreover, the asymptotic value of the average return is reduced while reducing the learning rate; although this effect is not essential in the initial environment, it is clear in more complicated environments such as the upgraded environment in our work. The change of the discount factor, for the same values of the other hyperparameters, is directly proportional to the asymptotic value of the average return but has no significant effect on the convergence time. The algorithm performance was compared to the Double Q-learning technique; hence, the bias in the Q-learning technique can be adjusted.
Reinforcement learning, in general, does not have a separate testing sample, as every learning process is online and every control sequence is unique to its own episode. In other words, we can say that the obtained control sequence has local optimality. This was verified: we tested the agents on a sample of 500 episodes and concluded that the agent accuracy has a high variance; even with all the algorithm hyperparameters fixed, the initial position of the crane where the control sequence is tested still has a significant effect on the accuracy value.
On the other hand, although the accuracy range of the best trials in the initial environment was between 71% and 80%, this range was drastically reduced to between 0.1% and 1.5% in the upgraded environment. We attribute that to the size of the action space, as the initial environment agent has one input of 11 actions and the upgraded environment agent has three inputs and 25 actions; hence, the more complicated the action space, the more unique the optimal control sequence and the harder it is to find a global optimal control sequence. Thus, choosing the action space is important not only from the learning point of view but also for feasibility in the physical domain; this work shows that the action space should avoid sudden movements and excessive effort on the links with high inertia.
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Maamoun KSA, Karimi HR
This work was partially supported by the Italian Ministry of Education, University and Research through the Project "Department of Excellence LIS4.0 - Lightweight and Smart Structures for Industry 4.0" and in part by the Horizon Marie Skłodowska-Curie Actions program (101073037).
All authors declared that there are no conflicts of interest.
1. Park HC, Chakir S, Kim YB, Lee DH. A robust payload control system design for offshore cranes: experimental study. Electronics 2021;10:462.
2. Huster A, Bergstrom H, Gosior J, White D. Design and operational performance of a standalone passive heave compensation system for a work class ROV. In: OCEANS 2009. IEEE; 2009. pp. 1–8.
3. Ni J, Liu S, Wang M, Hu X, Dai Y. The simulation research on passive heave compensation system for deep sea mining. In: 2009 International Conference on Mechatronics and Automation. IEEE; 2009. pp. 5111–16.
4. Zhu M, Zhang P, Zhu C, Jia X. Dynamic analysis and optimal control of the landing process of the offshore installation. Adv Mech Eng 2017; doi: 10.1177/1687814017727971.
5. Mackojć A, Chiliński B. Preliminary modelling methodology of a coupled payload-vessel system for offshore lifts of light and heavyweight objects. Bulletin of the Polish Academy of Sciences: Technical Sciences 2022;70:e139003.
6. Idres M, Youssef K, Mook D, Nayfeh A. A nonlinear 8-DOF coupled crane-ship dynamic model. In: 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference; 2003. p. 1855.
7. Ellermann K, Kreuzer E, Markiewicz M. Nonlinear dynamics of floating cranes. Nonlinear Dynamics 2002;27:107-83.
8. Cha JH, Roh MI, Lee KY. Dynamic response simulation of a heavy cargo suspended by a floating crane based on multibody system dynamics. Ocean Engineering 2010;37:1273-91.
9. Spong MW, Hutchinson S, Vidyasagar M, et al. Robot modeling and control. vol. 3. Wiley New York; 2006.
10. Williams LA. Modelling, Simulation and Control of offshore crane - Develop a kinematic and dynamic crane model and study of several control designs [MastersThesis]. Universitetet i Agder; University of Agder. Norway; 2018. Available from: http://hdl.handle.net/11250/2564033.
11. Sutton RS, Barto AG. Reinforcement learning: An introduction. MIT press; 2018. Available from: https://mitpress.mit.edu/9780262039246/.
12. Andersson J, Bodin K, Lindmark D, Servin M, Wallin E. Reinforcement Learning Control of a Forestry Crane Manipulator. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2021. pp. 2121–26.
13. Gaudet B, Linares R, Furfaro R. Deep reinforcement learning for six degree-of-freedom planetary landing. Adv Space Res 2020;65:1723-41.
14. Sun X, Xie Z. Reinforcement Learning-Based Backstepping Control for Container Cranes. Math Probl Eng 2020;2020.
15. Ding M. Reinforcement Learning For Offshore Crane Set-down Operations [MastersThesis]. University of Groningen. Netherlands; 2018. Available from: https://www.ai.rug.nl/~mwiering/Thesis_Mingcheng_Ding.pdf.
16. Vazirizade SM. An intelligent integrated method for reliability estimation of offshore structures wave loading applied in time domain [PhdThesis]. The University of Arizona. USA; 2019. Available from: https://repository.arizona.edu/handle/10150/636592.
17. Küchler S, Mahl T, Neupert J, Schneider K, Sawodny O. Active control for an offshore crane using prediction of the vessel's motion. IEEE/ASME Transactions on Mechatronics 2010;16:297-309.
18. Hasselt H. Double Q-learning. Advances in neural information processing systems 2010;23. Available from: https://proceedings.neurips.cc/paper/2010/hash/091d584fced301b442654dd8c23b3fc9-Abstract.html.
19. Zhang C, Vinyals O, Munos R, Bengio S. A study on overfitting in deep reinforcement learning. arXiv preprint arXiv:1804.06893 2018.
Maamoun KSA, Karimi HR. Reinforcement learning-based control for offshore crane load-landing operations. Complex Eng Syst 2022;2:13. http://dx.doi.org/10.20517/ces.2022.28