Offshore crane operations are frequently carried out under adverse weather conditions. While an offshore crane completes a load-landing or load-lifting operation, the impact between the load and the vessel is critical, as it can cause serious injuries and extensive damage. Reinforcement learning (RL) has been applied to control several offshore crane operations, including load landing. In this paper, the Q-learning algorithm is used to develop optimal control sequences for the offshore crane's actuators that minimize the impact velocity between the crane's load and the moving vessel. To extend the RL environment, a mathematical model is constructed for the dynamical analysis using the Denavit–Hartenberg (DH) technique and the Lagrange approach. The Double Q-learning algorithm is used to quantify the maximization bias that is common in Q-learning. The average return was studied to assess the performance of the Q-learning algorithm. Furthermore, the trained control sequence was tested on a separate sample of episodes, and the hypothesis that, unlike supervised learning, reinforcement learning yields only locally rather than globally optimal control sequences was confirmed in this application domain.
Setting a heavy object down on the deck of a vessel is one of the most common marine operations. During the load-landing or -lifting process, any disturbance such as heave motion may lead to a significant impact on the load and equipment, which can cause fatal injuries to the crew and permanent damage. Due to a variety of factors, such as ship motions and crane mechanics, achieving a soft load landing with an acceptable impact force and a small distance is challenging. Hence, considerable effort has been devoted to facilitating this operation: some works introduced control algorithms to automate the process, while others provided training methodologies for operators through simulators, virtual reality, and augmented reality. Several control techniques for marine operations have been proposed. A payload position control of offshore crane was developed by Park
The remainder of this paper is organized as follows. Section 2 describes the problem statement and the operating-system assumptions in detail, and provides the algorithms' structures. Section 3 describes the RL algorithms used in this work. Section 4 demonstrates a simulation of the control sequences and compares the behavior of the environment under different assumptions. Section 5 discusses the obtained results and highlights the strengths of this work.
In this work, two different environments are considered, namely the initial environment and the upgraded environment, as described in the following.
In this environment, the wave of every episode is randomly generated from the JONSWAP spectrum defined by the same sea state. Therefore, the wave elevation differs between episodes, but the statistical properties do not^{[16]}. Here, the sea state is assumed to be:
where H_{s} is the significant wave height and T_{p} is the peak wave period. In a real application, the vertical motion of a vessel can be predicted approximately 4 s ahead of time with high accuracy^{[17]}. This makes it feasible to plan a control sequence that can be visualized to the crane operator using technologies such as augmented reality. Moreover, the following assumptions are imposed on the load-landing operation problem:
Assumption 1
Ⅰ All crane and load dynamics are neglected.
Ⅱ The mass and stiffness of the barge are neglected.
Ⅲ The barge follows the wave amplitude exactly (barge dynamics are neglected, and its response is unity at all wave frequencies).
Ⅳ The wave effect on the crane is neglected.
Ⅴ The hoist speed is taken directly as the control input (action).
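The episode-wave generation described above can be sketched as follows. This is a minimal illustration: the frequency band and its resolution, the peak-enhancement factor γ = 3.3, and the example sea state (H_{s} = 2 m, T_{p} = 8 s) are assumptions for illustration, not values taken from this work.

```python
import numpy as np

def jonswap_spectrum(omega, hs, tp, gamma=3.3):
    """JONSWAP spectral density S(omega) in the Hs/Tp parameterization."""
    wp = 2.0 * np.pi / tp                       # peak angular frequency
    sigma = np.where(omega <= wp, 0.07, 0.09)   # spectral width parameter
    r = np.exp(-((omega - wp) ** 2) / (2.0 * sigma ** 2 * wp ** 2))
    a_gamma = 1.0 - 0.287 * np.log(gamma)       # normalization factor
    return (5.0 / 16.0 * hs ** 2 * wp ** 4 / omega ** 5
            * np.exp(-1.25 * (wp / omega) ** 4) * a_gamma * gamma ** r)

def sample_wave(hs, tp, duration, dt, rng):
    """One random wave-elevation episode: a superposition of cosines with
    amplitudes drawn from the spectrum and uniform random phases."""
    omega = np.linspace(0.2, 3.0, 200)          # assumed frequency band [rad/s]
    d_omega = omega[1] - omega[0]
    amps = np.sqrt(2.0 * jonswap_spectrum(omega, hs, tp) * d_omega)
    phases = rng.uniform(0.0, 2.0 * np.pi, omega.size)
    t = np.arange(0.0, duration, dt)
    eta = np.sum(amps * np.cos(np.outer(t, omega) + phases), axis=1)
    return t, eta

# Same statistical properties every episode, different elevation per seed.
t, eta = sample_wave(hs=2.0, tp=8.0, duration=90.0, dt=0.2,
                     rng=np.random.default_rng(0))
```

Because only the phases are random, every realization shares the spectrum of the chosen sea state, matching the assumption that episodes differ in elevation but not in statistics.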
In this environment, the agent–environment interaction is described through the following set of equations:
where P_{h} is the hoist position; P_{w} is the wave elevation; D_{r} is the relative distance between P_{h} and P_{w}; a_{t} is the chosen action (hoist velocity); and t_{step} is the discretization time step. The action space in this environment consists of 11 actions
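A minimal sketch of this agent-environment interaction follows; the sinusoidal wave used in the usage example and the symmetric range of the 11 hoist velocities are assumptions for illustration, not the paper's exact values.

```python
import numpy as np

# 11 hoist velocities [m/s]; the symmetric range is an assumption.
ACTIONS = np.linspace(-0.5, 0.5, 11)

class InitialEnv:
    """Point-mass environment: the hoist P_h moves with the chosen
    velocity, and the state is the relative distance D_r = P_h - P_w."""

    def __init__(self, wave, t_step=0.2, p_h0=5.0):
        self.wave = wave        # precomputed wave elevation per time step
        self.t_step = t_step
        self.p_h = p_h0         # initial hoist position [m]
        self.k = 0              # discrete time index

    def step(self, action_index):
        a_t = ACTIONS[action_index]
        self.p_h += a_t * self.t_step            # P_h <- P_h + a_t * t_step
        self.k += 1
        d_r = self.p_h - self.wave[self.k]       # relative distance
        done = d_r <= 0.0                        # load touches the deck
        v_impact = None
        if done:
            # impact velocity: hoist velocity relative to the deck's
            # vertical velocity (finite-difference estimate)
            v_wave = (self.wave[self.k] - self.wave[self.k - 1]) / self.t_step
            v_impact = a_t - v_wave
        return d_r, v_impact, done
```

An episode terminates when the relative distance reaches zero, at which point the impact velocity is recorded as the quantity to be minimized.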
The environment is upgraded by inserting the forward kinematic model of a knuckle crane. Considering the crane as a three-revolute-joint robotic arm, the input to the environment in this case is the actuators' angles.
The mathematical model for the knuckle crane was generated by simulating the crane as a robotic arm with three joints, as shown in
Crane kinematic structure.
Assumption 2
Ⅰ Neglect the dynamics of the load and wires.
Ⅱ The second and third joints are actuated through hydraulic cylinders, but the actuator models are not taken into account.
Ⅲ The wave amplitude changes with time but not along the x-direction of the global reference; in other words, the crane end-effector is assumed to interact with the same wave amplitude at all of its x-direction positions at a given time instant.
The model is established using the Denavit–Hartenberg (DH) parameters in
DH parameters

Link | a_i | α_i | d_i | θ_i
1 | 0 | 90° | l_1 | q_1
2 | l_2 | 0 | 0 | q_2
3 | l_3 | 0 | 0 | q_3
Hence, the four homogeneous transformation matrices from frame 0 to 3 are obtained as follows:
and
The position of the end-effector, as a function of the joints angles, can be represented as:
where q_{i}, with
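The forward kinematics can be sketched directly from the DH rows. The link lengths and the 90° twist on the first joint used below are illustrative assumptions consistent with a three-revolute-joint structure, not values taken from the crane model.

```python
import numpy as np

def dh_transform(theta, d, a, alpha):
    """Homogeneous transform for one DH row (standard convention:
    Rot_z(theta) Trans_z(d) Trans_x(a) Rot_x(alpha))."""
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def end_effector(q, l1=2.0, l2=3.0, l3=2.0):
    """Position of the crane tip for joint angles q = (q1, q2, q3);
    the link lengths l1..l3 are assumed placeholders."""
    t01 = dh_transform(q[0], l1, 0.0, np.pi / 2)   # base slew joint
    t12 = dh_transform(q[1], 0.0, l2, 0.0)         # main boom joint
    t23 = dh_transform(q[2], 0.0, l3, 0.0)         # knuckle joint
    t03 = t01 @ t12 @ t23                          # frame 0 -> frame 3
    return t03[:3, 3]
```

With all joint angles at zero, the tip sits at (l_2 + l_3, 0, l_1) in the base frame, which is a quick sanity check on the composed transforms.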
For the action space in this environment, the effect of the hydraulic cylinders that actuate the crane links is neglected, and the crane angles are assumed to be controlled directly. Hence, the action space in this environment is described as:
Note that q_{1} is an array of zeros and only a 2D workspace is considered. Hence, in the upgraded environment, there are 25 different combinations of actions, obtained from all pairs of the q_{2} and q_{3} increments, with q_{1} fixed to a 25-element array of zeros. Moreover, the dynamical model [Equation (3)] was developed using the Lagrange approach and the resulting equations of motion, which are used to extract the torque control sequence from the optimal control sequence of joint angles.
where
The impact velocity
where
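The 25-element action space described above (all combinations of q_{2} and q_{3} increments, with q_{1} held at zero) can be constructed as follows; the increment magnitudes are illustrative assumptions.

```python
import itertools
import numpy as np

# Five candidate per-step increments for each of q2 and q3 [rad]; the
# magnitudes are assumptions chosen only to illustrate the construction.
RATES = np.deg2rad([-2.0, -1.0, 0.0, 1.0, 2.0])

# q1 is held at zero (2D workspace), so each of the 5 x 5 = 25 actions
# pairs a zero slew increment with one (dq2, dq3) combination.
ACTION_SPACE = [(0.0, dq2, dq3)
                for dq2, dq3 in itertools.product(RATES, RATES)]
```

The Cartesian product makes the 5 × 5 = 25 count explicit, and keeping the q_{1} component in every tuple matches the three-input structure of the upgraded environment.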
The state-of-the-art Q-learning algorithm is used in this work with deterministic state dynamics and a one-step TD (TD(0)) learning approach. The structure of the Q-learning algorithm was modified by implementing the reward structures described in Section 2.2, the state observation methodology corresponding to the environment under consideration, and the setting parameters in
1: initialize Q-function
2: set initial state
3: set policy
4: let P(exploratory) = probability of selecting an action according to the exploratory policy
5–8: set P(exploratory) for the current episode
9–12: select an action (exploratory with probability P(exploratory), greedy otherwise)
13–14: apply the selected action
15–17: measure the impact velocity and set the terminal reward
18–20: measure the intermediate reward and set the shaped reward
21–23: update the Q-function with the TD(0) target
24–25: update N-steps and update the state
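A minimal tabular sketch of the Q-learning procedure used here, with an epsilon-greedy exploratory policy and a one-step TD(0) update; the hyperparameter defaults, seed, and the environment callbacks are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def q_learning(env_reset, env_step, n_states, n_actions,
               episodes=300, alpha=0.1, gamma=0.97, eps=0.2, seed=0):
    """Tabular Q-learning with a one-step TD (TD(0)) update and an
    epsilon-greedy exploratory policy; env_step returns the next state,
    the (possibly intermediate) reward, and a terminal flag."""
    rng = np.random.default_rng(seed)
    q = np.zeros((n_states, n_actions))
    returns = []                                 # average-return diagnostics
    for _ in range(episodes):
        s, total, done = env_reset(), 0.0, False
        while not done:
            if rng.random() < eps:               # exploratory action
                a = int(rng.integers(n_actions))
            else:                                # greedy action
                a = int(np.argmax(q[s]))
            s2, r, done = env_step(s, a)
            target = r if done else r + gamma * np.max(q[s2])
            q[s, a] += alpha * (target - q[s, a])    # TD(0) update
            s, total = s2, total + r
        returns.append(total)
    return q, returns
```

Because `env_step` may return a shaped reward at every step, the agent is not limited to the reward delayed to the impact point, which is the modification described above.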
Algorithm setting

Parameter | Value
t_{step} | 0.2 s
Episode length | 90 s
Discount factor γ | 0.97
Action space | A_v or A_q
State space | 150 states
Agent observation | P_h(1) + 30 s of P_w
N-steps (limit) | 10,000
The Double Q-learning algorithm can determine an unbiased estimate of the Q-value because, in each update, one set of weights is used to determine the greedy policy and the other to determine its value^{[18]}. Algorithm 2 shows the structure of Double Q-learning in this framework, where the action
1: initialize Q^A and Q^B
2: set initial state
3–4: choose an action
5: choose (at random) either UPDATE(A) or UPDATE(B)
6–7: if UPDATE(A), define a* as the greedy action of Q^A at the next state
8–9: if UPDATE(B), define b* as the greedy action of Q^B at the next state
10–12: apply the selected action
13–14: measure the impact velocity
15–16: update the chosen table toward the other table's value of a* (or b*)
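The core Double Q-learning update can be sketched as follows; the function signature and default hyperparameters are assumptions for illustration.

```python
import numpy as np

def double_q_update(qa, qb, s, a, r, s2, done, alpha=0.1, gamma=0.97,
                    rng=None):
    """One Double Q-learning step: a fair coin picks which table to
    update; the updated table chooses the greedy action at s2, while the
    *other* table supplies that action's value, which removes the
    maximization bias of the single-estimator Q-learning update."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                        # UPDATE(A)
        a_star = int(np.argmax(qa[s2]))           # greedy action from A
        target = r if done else r + gamma * qb[s2, a_star]  # value from B
        qa[s, a] += alpha * (target - qa[s, a])
    else:                                         # UPDATE(B)
        b_star = int(np.argmax(qb[s2]))           # greedy action from B
        target = r if done else r + gamma * qa[s2, b_star]  # value from A
        qb[s, a] += alpha * (target - qb[s, a])
    return qa, qb
```

Because the argmax and the value come from different estimators, the positive bias of maximizing over noisy value estimates is avoided, which is why the comparison against this update can be used to gauge the bias of plain Q-learning.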
In
Agent–environment interactions.
Acceptable landing episodes.
Note that, at the beginning of each episode, the hoist velocity is fixed at 0.3 m/s; hence, the learning process takes place after the hoist reaches 2 m from the average wave amplitude. Moreover, this fixed starting action helps to avoid delusion at the beginning of an episode, for example, moving above the starting position, which would slow down the learning process. In addition, these results were obtained with the initial settings mentioned in
Since the algorithm is model-free, it is not possible to measure directly the bias that usually exists in the Q-learning algorithm. Hence, the Q-learning results are compared with the Double Q-learning results so that the bias can be estimated.
Acceptable landing episodes.
A comparison of QL and DQL with
Moreover, it is clear that DQL does not suffer from delusion in the average-return values at the beginning of the learning process; on the contrary, QL does suffer from overshooting the asymptotic value, which is clear in the case of
Note that the number of episodes in the case of
The same Q-learning algorithm was tested in the upgraded environment, where it can be noticed that, in the majority of the episodes, the agent could find an acceptable impact velocity at early stages of the episode time span, as shown in
Moreover, by using the dynamical model, the input torques corresponding to the optimal control sequence of angles of one of the episodes can be calculated, as shown in
Example of a torque optimal control sequence.
Environments comparison.
Reinforcement learning is, by definition, an online learning setting: there is no separate test phase because the agent never stops learning. In other words, RL problems are usually of the "continual learning" type, and the goal is to obtain the highest total reward, usually discounted over time. The most common technique for measuring RL performance is to examine the average return. In contrast with the supervised learning case, there are no standard performance measures yet, even in popular application domains^{[19]}. Hence, the usual testing phase is to compare algorithms to each other, as in our comparison between Q-learning and Double Q-learning in Section 4.1.2. In this section, we aim to assess how well the optimal control sequences achieved in the online learning process generalize. Hence, a testing sample of 500 episodes was generated considering the initial environment. For specific configurations, the agent was forced to follow a randomly selected control sequence from those considered optimal in the online learning process. The accuracy, which is the ratio of episodes in the 500-episode testing sample in which the agent achieves an acceptable impact velocity, is reported in
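The accuracy computation over the testing sample can be sketched as follows; `run_episode` and the acceptance threshold `v_max` are hypothetical placeholders for the environment rollout and the paper's acceptance criterion.

```python
def accuracy(control_sequence, sample_waves, run_episode, v_max=0.2):
    """Success ratio of a fixed control sequence over a testing sample.

    run_episode(wave, control_sequence) is assumed to return the impact
    velocity for that episode; a landing counts as acceptable when the
    absolute impact velocity stays below v_max (an assumed placeholder,
    not the paper's threshold)."""
    hits = sum(abs(run_episode(w, control_sequence)) < v_max
               for w in sample_waves)
    return hits / len(sample_waves)
```

Replaying one frozen sequence over many independently sampled waves is what turns the online-learning result into an offline generalization test.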
Agent accuracy of QL and DQL on 500 episodes.
The purpose of this work is to control the impact that occurs in offshore crane load-landing operations, which take place in one of the most complicated environments due to the combined effect of several conditions. Although strong assumptions were imposed, two environments were established: the first considered the crane hoist as a point mass, and the second was extended with the crane structure using a robot-like mathematical model, which was established and simulated using MATLAB and Simulink. Both environments include the sea wave model driving the motion of the vessel on which the crane's load is to be placed.
Optimal control sequences were generated using the Q-learning algorithm, and they reached acceptable impact velocities in the online learning processes. From testing the performance of the Q-learning algorithm, we conclude the following: an intermediate reward structure is needed to overcome the agent delusion that occurs because the reward in the standard Q-learning formulation is delayed until the impact point; with such a structure, less time is required to accomplish the task.
For the same discount factor, the lower the learning rate, the less scattered the values of the average return and the more delayed the convergence to the asymptotic value. Moreover, the asymptotic value of the average return is reduced when the learning rate is reduced; although this effect is not essential in the initial environment, it is clear in more complicated environments such as the upgraded environment in our work. The discount factor (for the same values of the other hyperparameters) is directly proportional to the asymptotic value of the average return but has no significant effect on the convergence time. The algorithm's performance was compared with the Double Q-learning technique so that the bias in the Q-learning technique could be assessed.
Reinforcement learning, in general, does not have a separate testing sample, as every learning process is online and every control sequence is unique to its own episode. In other words, the obtained control sequence is only locally optimal. This was verified: we tested the agents on a sample of 500 episodes and concluded that the agent's accuracy exhibits high variance; even when all the algorithm hyperparameters are fixed, the initial position of the crane from which the control sequence is tested still has a significant effect on the variation of the accuracy value.
On the other hand, although the accuracy range of the best trials in the initial environment was between 71% and 80%, this range was drastically reduced to between 0.1% and 1.5% in the upgraded environment. We attribute this to the size of the action space: the initial-environment agent has one input with 11 actions, while the upgraded-environment agent has three inputs and 25 actions; hence, the more complicated the action space, the more unique the optimal control sequence and the harder it is to find a globally optimal one. Thus, choosing the action space is important not only from the learning point of view but also for feasibility in the physical domain; in this work, it is shown that sudden movements should be avoided and more actuation effort should be allocated to the links with high inertia.
Made substantial contributions to conception and design of the study and performed data analysis and interpretation: Maamoun KSA, Karimi HR
Not applicable.
This work was partially supported by the Italian Ministry of Education, University and Research through the Project "Department of Excellence LIS4.0-Lightweight and Smart Structures for Industry 4.0" and in part by the Horizon Marie Skłodowska-Curie Actions program (101073037).
All authors declared that there are no conflicts of interest.
Not applicable.
Not applicable.
© The Author(s) 2022.
Huster A, Bergstrom H, Gosior J, White D. Design and operational performance of a standalone passive heave compensation system for a work class ROV. In: OCEANS 2009. IEEE; 2009. pp. 1–8. doi:10.23919/OCEANS.2009.5422276.
Ni J, Liu S, Wang M, Hu X, Dai Y. The simulation research on passive heave compensation system for deep sea mining. In: 2009 International Conference on Mechatronics and Automation. IEEE; 2009. pp. 5111–16. doi:10.1109/ICMA.2009.5246165.
Mackojć A, Chiliński B. Preliminary modelling methodology of a coupled payload-vessel system for offshore lifts of light and heavyweight objects. doi:10.24425/bpasts.2021.139003.
Idres M, Youssef K, Mook D, Nayfeh A. A nonlinear 8-DOF coupled crane-ship dynamic model. In: 44th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference; 2003. p. 1855. doi:10.2514/6.2003-1855.
Spong MW, Hutchinson S, Vidyasagar M. Robot modeling and control. vol. 3. Wiley, New York; 2006. doi:10.1109/tac.2006.890316.
Williams LA. Modelling, simulation and control of offshore crane: develop a kinematic and dynamic crane model and study of several control designs [Master's thesis]. University of Agder, Norway; 2018. Available from: http://hdl.handle.net/11250/2564033.
Sutton RS, Barto AG. Reinforcement learning: an introduction. MIT Press; 2018. Available from: https://mitpress.mit.edu/9780262039246/.
Andersson J, Bodin K, Lindmark D, Servin M, Wallin E. Reinforcement learning control of a forestry crane manipulator. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE; 2021. pp. 2121–26. doi:10.1109/IROS51168.2021.9636219.
Ding M. Reinforcement learning for offshore crane set-down operations [Master's thesis]. University of Groningen, Netherlands; 2018. Available from: https://www.ai.rug.nl/~mwiering/Thesis_Mingcheng_Ding.pdf.
Vazirizade SM. An intelligent integrated method for reliability estimation of offshore structures wave loading applied in time domain [PhD thesis]. The University of Arizona, USA; 2019. Available from: https://repository.arizona.edu/handle/10150/636592.
Hasselt H. Double Q-learning. In: Advances in Neural Information Processing Systems 23; 2010.
Zhang C, Vinyals O, Munos R, Bengio S. A study on overfitting in deep reinforcement learning. arXiv preprint; 2018.