Optimizing the Path of a Mobile Agent in the Environment with Static Obstacles using Reinforcement Learning

A. Sawicki & M. Tomera
Gdynia Maritime University, Gdynia, Poland

the International Journal on Marine Navigation and Safety of Sea Transportation, Volume 19, Number 3, September 2025, DOI: 10.12716/1001.19.03.20, http://www.transnav.eu

ABSTRACT: Path planning for a mobile agent concerns the problem of searching for a collision-free and optimal path between the initial and target positions. The space in which the agent moves contains a number of obstacles modeled by ordered grids, each representing the obstacle location in the space of the agent movement. The final boundary of an obstacle is formed by its actual boundary plus a minimum safe distance that takes into account the size of the agent, which allows the obstacles to be treated as points in the environment. In the article, reinforcement learning algorithms were used to determine the path, including numerous design methods: dynamic programming (policy iteration algorithm and value iteration algorithm), the Monte Carlo method (Monte Carlo control algorithm), temporal-difference (TD) learning (Q-learning algorithm and Sarsa algorithm), eligibility traces (Q(λ) algorithm and Sarsa(λ) algorithm), planning and learning (Dyna-Q algorithm), and gradient methods (Q-learning algorithm with Adam optimizer and Sarsa algorithm with Adam optimizer). The reinforcement learning algorithms operate on the principle of determining the agent's policy, which seeks the minimum distance between the initial and target positions of the mobile agent while avoiding obstacles. These learning procedures differ between dynamic programming, which requires good knowledge of the environment model to determine the agent's policy, and the other methods, which do not require this knowledge. The aim of the reported work was to examine the above-named algorithms in terms of their effectiveness and speed of finding the optimal solution. Based on the results of the simulation studies, the most effective method turned out to be the one using gradient methods for optimization, i.e. Q-learning with the Adam optimizer.

1 INTRODUCTION
Path planning is one of the most fundamental issues in
autonomous control of robots, cars, ships, and aircraft.
In these control systems, objects need to be moved
along the shortest possible path while avoiding
obstacles. The basis of path planning is the
development of appropriate algorithms, which have
evolved from the initial traditional graph-search based
methods to modern methods based on optimization.
Path planning can be divided into global planning
intended to determine optimal routes when a map is
available, and local path planning focusing on dealing
with obstacles not represented on the map by using
localized obstacle avoidance techniques [14].
Depending on whether the obstacles are stationary or
moving, local path planning can be further divided
into static or dynamic [35].
The algorithms used in global path planning can be
further divided into three main groups. The first group
consists of traditional algorithms (Dijkstra, A*), while
the second group includes optimization algorithms
based on artificial intelligence (GA-genetic algorithm,
PSO-particle swarm optimization algorithm, etc.), and
the third group consists of algorithms based on RRT
(rapidly exploring random tree) path planning.
In the first group, the oldest algorithm is Dijkstra's
algorithm, proposed in 1959, which is based on graph
search [7]. The algorithm builds a graph model and
then traverses most of the subnodes, one by one
according to the greedy principle, using the relaxation
method to optimize the path selection. Dijkstra's algorithm gives a good planning result when the volume of map data is small. When it is large, the results become poorer and can even fail to meet the requirements of path planning [31]. The A-star (A*) algorithm is a traditional algorithm obtained by improving Dijkstra's algorithm by adding a heuristic function. This algorithm is based on the use of a raster grid to discretize the solution search space [3, 26], thus
making the basis for drawing graphs of connections
between the points. The A-star algorithm traverses the
graph and finds paths for a given weighted graph,
source node, and destination node [16, 32].
The second group includes artificial intelligence-
based optimization algorithms applicable in global
route planning. These algorithms determine the
optimal traversal path by establishing mathematical
models, such as genetic algorithms (GA) [15],
differential evolution (DE) [6], ant colony optimization
(ACO) [24], firefly algorithm (FA) [1], particle swarm
optimization (PSO) [36], whale optimization algorithm
(WOA) [29], flower pollination algorithm (FPA) [4],
grey wolf optimizer (GWO) [18], sparrow search
algorithm (SSA) [34], and cuckoo search algorithm
(CSA) [20]. Swarm intelligence algorithms are widely
used in optimization computations due to their strong
optimization ability and ease of implementation.
The third global path planning group includes the
algorithm based on stochastic sampling with rapidly-
exploring random tree (RRT) path planning [21]. This
algorithm is designed to efficiently search nonconvex and high-dimensional spaces by randomly building a space-filling tree. The tree is constructed incrementally from samples randomly taken from the search space and is inherently biased to grow toward large, unsearched problem regions. The RRT algorithm was developed by Steven LaValle [13].
The algorithms used in local route planning can be
divided into four main groups. The first group consists
of algorithms based on Dubins curves, while the
second group includes the dynamic window approach
(DWA) [15, 24], the third group consists of algorithms
based on the artificial potential field (APF) method,
and the fourth group consists of reinforcement
learning (RL).
In the first group, there are path planning
algorithms based on mathematical rules, such as those
using Dubins curves [9]. These algorithms make it
possible to obtain paths with bounded curvature and
derivative of curvature which connect two assigned
positions (positions and orientations). The solution is
based on the Dubins shortest path approximation and
can be used to connect an ordered sequence of points
located between the initial and target points [22, 30].
The second group includes the dynamic window
approach (DWA). This approach can make the agent
avoid unknown obstacles in time and improve the
smoothness of the transition path in the moving
process. However, it has some drawbacks, such as a single movement state, long paths, and a tendency to fall into local minima. Therefore, this method is
often used in combination with the A* algorithm [17]
or the genetic algorithm (GA) [15] or the ant colony
optimization (ACO) algorithm [24].
The third group of local path planning methods includes the artificial potential field (APF) method, which is algorithmically simple, requires few iterations, offers high real-time performance and produces smooth paths, but it can easily fall into a local minimum and performs poorly in scenarios with many obstacles [19].
The fourth group includes reinforcement learning
(RL) [5], being the object of study in this article.
Combining reinforcement learning with artificial
neural networks (ANN) creates deep reinforcement
learning (DRL) which shows promise in path planning
[25]. Reinforcement learning is an approach that aims
to find the optimal strategy for an agent interacting
with an unknown environment. For the actions
performed, the agent receives positive, negative or zero
reward signals, and the calculated strategy should
maximize the cumulative rewards. The agent explores
the environment to quickly learn about actions with
significant rewards. This definition covers a wide
range of problems that can be solved, including path
planning [11].
2 MATHEMATICAL MODELING OF
REINFORCEMENT LEARNING
Reinforcement learning (RL) is a part of machine
learning (ML), being, in turn, a subset of artificial
intelligence (AI). The basic components of
reinforcement learning are a computational agent and
an environment, as shown in Figure 1. Based on the
observation of the state st of the environment, the agent
takes the action at. The environment is an entity with
which the agent can interact through a predefined set
of actions A = {a1, a2, …}. The set A describes all possible
actions, while the set S describes all possible states of
the environment S = {s1, s2, … } [28].
Figure 1. Interaction between agent and environment.
The agent learns by exploring various actions in the
environment, receiving feedback in the form of
rewards, and adjusting its behaviour to maximize the
expected rewards. At any time step t, the agent first
observes the current state st of the environment and its
corresponding reward value rt. Then, it decides what
to do next, based on the information on the state and
the reward. The action at that the agent intends to
perform is introduced into the environment, as a result
of which the agent can obtain a new state st+1 and a new
reward rt+1. In this way, the agent creates a sequence or
trajectory that starts as follows: s0, a0, r1, s1, a1, r2, s2, a2, r3, … (state, action, reward, and so on).
The basic task in reinforcement learning is to determine a policy for the agent, denoted as π(at|st), which defines the agent's decision-making strategy. It maps states to actions, indicating what action the agent should take in each state. The goal of reinforcement learning is to find parameters that define the policy π(a|s) and maximize the value of the sum of rewards on the trajectory.
The sequential decision making by the agent is
modeled as a Markov decision process (MDP).
2.1 Markov Decision Process
Markov decision processes (MDP) allow modeling
stochastic environments in which the future state
depends solely on the current state and the action
taken, which is known as the Markov property. This
means that subsequent states in the future do not
depend on the past but only on the current value of the
state, and, consequently, when the agent makes a
decision, it is not necessary to analyze its past history
on the trajectory.
The Markov decision process (MDP) provides a
formal way to represent the interaction of an agent
with its environment, in which the agent takes actions
in different states to maximize cumulative rewards.
The MDP is described by five variables (S, A, p, R, γ), where [2]:
− S is the set of states (discrete or continuous);
− A is the set of actions (discrete or continuous);
− p(st+1|st, at) is the transition function that determines the probability of transitioning to state st+1 after performing action at in state st;
− R(st+1|st, at) is the reward function assigning a value (reward or penalty) for transitioning from state st to st+1 as a result of the action at;
− γ ∈ [0,1] is the discount rate that determines the importance of future rewards relative to the current ones.
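To make the notation concrete, the sketch below shows one possible Python representation of such a five-tuple (S, A, p, R, γ) for a small deterministic grid; the helper name make_mdp, the move encoding and the reward values are illustrative assumptions, not elements of the paper.

GAMMA = 0.99  # discount rate

def make_mdp(width=3, height=3, goal=(2, 2), obstacles=frozenset()):
    # S: reachable grid cells; A: four compass moves
    states = [(x, y) for x in range(width) for y in range(height)
              if (x, y) not in obstacles]
    actions = ['N', 'S', 'E', 'W']
    moves = {'N': (0, -1), 'S': (0, 1), 'E': (1, 0), 'W': (-1, 0)}

    def p(s, a):
        # deterministic transition function: returns (next state, probability)
        nxt = (s[0] + moves[a][0], s[1] + moves[a][1])
        if nxt not in states:        # blocked cell or outside the grid: stay put
            nxt = s
        return nxt, 1.0

    def reward(s, a, s_next):
        # R(s_{t+1}|s_t, a_t): assumed reward scheme, for illustration only
        return 50.0 if s_next == goal else -1.0

    return states, actions, p, reward, GAMMA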
Figure 2 shows a graphical model of the Markov
decision process, where an edge denotes the transition
probability between states. The graphical model
contains rewards, states, and actions, and the transition probabilities are conditioned on both states and actions, hence p(st+1|st, at).
Figure 2. Graphical model of the Markov decision process
[10].
2.2 Expected Return
Through interactions with the environment, the agent
learns from the outcomes and adjusts its policy to
improve its decision-making over time. According to
this approach, the agent tries to choose actions such
that the sum of discounted rewards it receives in the
future is maximized. In particular, it chooses an action
that maximizes the expected return Gt, which is the
sum of discounted rewards on the trajectory:
$G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=t+1}^{T} \gamma^{k-t-1} r_k$   (1)
where T is the final time step, γ is a parameter, 0 ≤ γ ≤ 1, referred to as the discount rate and used to reduce the weights of rewards as the time step t increases, and rt is the immediate reward received at time t. The value of the reward r depends on the state and the action, or sometimes only on the state, and determines which states and actions are better. The final time step T occurs when the agent's interaction with the environment is decomposed into finite-length sequences, called episodes. Each episode ends with a special state, called the terminal state, after which the return to the standard initial state takes place. Formula
(1) also applies to the cases in which T = 0 and γ = 0.
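As a small numeric illustration of equation (1) (not taken from the paper), the return of a finite reward sequence can be accumulated backwards in a few lines of Python:

def discounted_return(rewards, gamma=0.99):
    # G_t = sum_{k=t+1}^{T} gamma^(k-t-1) * r_k, folded from the last reward backwards
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# e.g. a hypothetical four-step episode with -1 per step and 50 at the goal
print(discounted_return([-1.0, -1.0, -1.0, 50.0]))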
2.3 Value Functions
The agent's goal is to find a policy π(a|s) that optimizes the state-value function or the action-value function. The state-value function allows the agent to assess how good the state it currently occupies is, while the action-value function allows it to assess how good it is to perform a given action in a given state.
The state-value function vπ(s) is the expected return when the agent follows policy π starting from the given state s:

$v_\pi(s) = E_\pi[G_t \mid s_t = s] = E_\pi\Big[\sum_{k=t+1}^{T} \gamma^{k-t-1} r_k \,\Big|\, s_t = s\Big]$   (2)
where Eπ{·} denotes the expected value under policy π. The function vπ bears the name of the state-value function for policy π.
A fundamental property of the value functions used throughout reinforcement learning and dynamic programming is that they satisfy recurrence relations similar to those described for the expected return given by (2). For any policy π and any state s, the following consistency condition holds between the value of state s and the values of its possible successor states [28]:
$v_\pi(s) = E_\pi[G_t \mid s_t = s] = E_\pi[\,r_{t+1} + \gamma G_{t+1} \mid s_t = s\,] = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_\pi(s')\,]$   (3)
where s' is the next state st+1 and r is the immediate reward of going to state st+1. Equation (3) is the Bellman equation for vπ(s). The value of the function vπ(s) is the unique solution of the Bellman equation. From the definition of the state-value function, its optimal value can be easily determined as

$v_*(s) = \max_{\pi} v_\pi(s)$   (4)
Similarly, the action-value function qπ(s,a) is the expected return starting from state s, after taking action a, and then following policy π:
$q_\pi(s,a) = E_\pi[G_t \mid s_t = s, a_t = a] = E_\pi\Big[\sum_{k=t+1}^{T} \gamma^{k-t-1} r_k \,\Big|\, s_t = s, a_t = a\Big]$   (5)
The function qπ bears the name of the action-value function for policy π. Similarly to the state-value function vπ, the optimal value of the action-value function qπ can be defined as

$q_*(s,a) = \max_{\pi} q_\pi(s,a) = E[\,r_{t+1} + \gamma v_*(s_{t+1}) \mid s_t = s, a_t = a\,]$   (6)
Compared to the state-value function vπ, a special feature of the action-value function qπ is that the optimal policy π* can be obtained directly from q*(s,a):

$\pi_*(s) = \arg\max_{a \in A} q_*(s,a)$   (7)
The optimal state-value function v*(s) denotes the expected discounted return when the agent is in a given state s and then follows the policy π*. The optimal action-value function q*(s,a) denotes the expected discounted return when the agent is in a given state s, takes a given action a in that state, and thereafter follows the policy π* [10].
2.4 Bellman’s Optimality Equations
Because v* is the state-value function for policy π*, it must satisfy the self-consistency condition described by Bellman's equation for state values (3). Because it is the optimal state-value function, however, the consistency condition for v* can be written in a special form, without reference to any particular policy. This is the Bellman optimality equation for v*. Intuitively, the Bellman optimality equation expresses the fact that the value of a state under an optimal policy must equal the expected return for the best action from that state [28]:

$v_*(s) = \max_{a \in A} q_{\pi_*}(s,a) = \max_{a} E[\,r_{t+1} + \gamma v_*(s_{t+1}) \mid s_t = s, a_t = a\,] = \max_{a} \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma v_*(s')\,]$   (8)
Equation (8) is the Bellman optimality equation for v*. The corresponding Bellman optimality equation for the action-value function q* has the form:

$q_*(s,a) = E[\,r_{t+1} + \gamma \max_{a'} q_*(s_{t+1}, a') \mid s_t = s, a_t = a\,] = \sum_{s', r} p(s', r \mid s, a)\,[\,r + \gamma \max_{a'} q_*(s', a')\,]$   (9)
For Markov decision processes with a finite number of states, the Bellman optimality equation (8) for v* has a single solution. This general equation is in fact a system of equations, one for each state, so if there are n states, then there are n equations and n unknowns. If the dynamics p of the environment is known, then this system of equations can be solved for v* using any method for solving systems of nonlinear equations. A related set of equations can also be solved for q*. However, in this case, problems may arise for very large numbers of states, since the computational complexity of the solution grows on the order of n^3, where n is the number of states [28]. In addition, the tasks solved in reinforcement learning rely on the interaction of the agent with an environment of unknown dynamics. Hence, direct methods for solving the system of nonlinear equations cannot be successfully applied to the Bellman optimality equations (8) and (9). Fortunately, in practice, there are iterative methods to solve problems modeled as Markov decision processes, including dynamic programming, Monte Carlo estimation, and temporal-difference learning, which also allow solving the Bellman optimality equations [8].
3 ALGORITHMS FOR DETERMINING THE
OPTIMAL AGENT POLICY
Depending on the level of knowledge of the environment, different methods are used to determine the agent policy π. In the case of full knowledge of the environment model, dynamic programming methods can be used. Otherwise, if the environment model is not known, then such methods as Monte Carlo or temporal-difference learning are available.
3.1 Dynamic Programming
Dynamic programming (DP) refers to a set of
algorithms that can be used to compute optimal agent
policies assuming that an accurate model of the
environment, described as a Markov decision process
(MDP) with finite sets of states S, actions A, and
rewards R, is given. The key idea of dynamic programming, as well as of reinforcement learning in general, is to use the value functions (vπ, qπ) to search for good policies π. The updates of the value function are closely related to the Bellman equations, transformed into assignment instructions. Classical methods used in dynamic programming iteratively update the value of one state based on the values of all possible states following it and the probabilities of their occurrence. When the updates no longer cause any changes in the values, convergence occurs to values that satisfy the Bellman equation.
Policy evaluation refers to iterative computation of
a value function for a given policy. Policy
improvement is accomplished by computing an
improved policy with the given value function for that
policy. Combining these two computations yields the
policy iteration and the value iteration, the two most
popular DP methods. Each of them can be used to
reliably compute optimal policies and value functions
for finite MDPs, assuming full knowledge of the MDP.
3.1.1 Policy Iteration Algorithm
The policy iteration consists of finding the optimal policy via a sequence of monotonically improving policies πk, based on the evaluation of the value function vπk−1 for the previous policy πk−1. Policy iteration consists of two simultaneous, interacting processes: one that makes the value function consistent with the current policy (policy evaluation), and the other that makes the policy greedy with respect to the current value function (policy improvement). In policy iteration, these two processes alternate, each ending before the other begins, but this is not necessary. For example, in value iteration, only one policy evaluation iteration is performed between consecutive policy improvements. Algorithm 1 contains the pseudocode for implementing the policy iteration.
Algorithm 1. Policy Iteration algorithm for estimating π ≈ π*
Setting parameter values: γ = 0.99 and θ = 10^-12.
1. Initialize:
   V(s) ← 0, for all s ∈ S
   π(s) ← 0, for all s ∈ S
2. Policy Evaluation
   repeat
     Δ ← 0
     for each state s ∈ S:
       v ← V(s)
       V(s) ← Σ_{s',r} p(s', r | s, π(s)) [r + γ V(s')]
       Δ ← max(Δ, |v − V(s)|)
     end
   until Δ < θ
3. Policy Improvement
   policy-stable ← True
   for each state s ∈ S:
     old-action ← π(s)
     π(s) ← argmax_a Σ_{s',r} p(s', r | s, a) [r + γ V(s')]
     if old-action ≠ π(s) then
       policy-stable ← False
     end
   end
   if policy-stable then
     stop and return V ≈ v* and π ≈ π*
   else
     go to 2
   end
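A minimal Python sketch of Algorithm 1 for a deterministic, tabular model is given below; it assumes a model function step(s, a) returning (next state, reward), such as the one sketched in Section 2.1, and the names and the terminal-state handling are illustrative assumptions rather than the paper's MATLAB implementation.

def policy_iteration(states, actions, step, gamma=0.99, theta=1e-12,
                     terminals=frozenset()):
    V = {s: 0.0 for s in states}             # state-value table
    pi = {s: actions[0] for s in states}     # initial (arbitrary) policy
    while True:
        # 2. Policy evaluation: sweep until the largest value change is below theta
        while True:
            delta = 0.0
            for s in states:
                if s in terminals:
                    continue                 # value of a terminal state stays 0
                v_old = V[s]
                s_next, r = step(s, pi[s])   # deterministic model: one successor
                V[s] = r + gamma * V[s_next]
                delta = max(delta, abs(v_old - V[s]))
            if delta < theta:
                break
        # 3. Policy improvement: act greedily with respect to the current V
        policy_stable = True
        for s in states:
            old_action = pi[s]
            pi[s] = max(actions,
                        key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
            if pi[s] != old_action:
                policy_stable = False
        if policy_stable:
            return V, pi                     # V approximates v*, pi approximates pi*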
3.1.2 Value Iteration Algorithm
The value iteration is obtained by transforming the
Bellman optimality equation (8) into an update rule for
each state. Each such operation updates the value of
one state based on the values of all possible successor
states and their probability of occurrence. When the
updates no longer cause any further changes in the values, convergence occurs to values that satisfy the Bellman equation. Based on the converged state-value function and the Bellman optimality equation (9), the optimal policy is then determined. Algorithm 2 shows the subsequent steps performed during the value iteration.
Algorithm 2. Value Iteration algorithm for estimating π ≈ π*
Setting parameter values: γ = 0.99 and θ = 10^-12.
1. Initialize:
   V(s) ← 0, for all s ∈ S
   π(s) ← 0, for all s ∈ S
2. Value Iteration
   repeat
     Δ ← 0
     for each state s ∈ S:
       q(a) ← 0, for all a ∈ A
       for each action a ∈ A:
         s', r ← p(s', r | s, a)
         δ ← r + γ V(s')
         q(a) ← q(a) + δ
       end
       qmax ← max_a q(a)
       Δ ← max(Δ, |qmax − V(s)|)
       V(s) ← qmax
     end
   until Δ < θ
3. Policy Calculation
   for each state s ∈ S:
     q(a) ← 0, for all a ∈ A
     for each action a ∈ A:
       s', r ← p(s', r | s, a)
       δ ← r + γ V(s')
       q(a) ← q(a) + δ
     end
     π(s) ← argmax_a q(a)
   end
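Under the same assumptions as above, Algorithm 2 shrinks to one sweep-and-greedy routine in Python; this is again a sketch, not the paper's code.

def value_iteration(states, actions, step, gamma=0.99, theta=1e-12,
                    terminals=frozenset()):
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            if s in terminals:
                continue
            # q(a) = r + gamma * V(s') for the single successor of each action
            q = [step(s, a)[1] + gamma * V[step(s, a)[0]] for a in actions]
            q_max = max(q)
            delta = max(delta, abs(q_max - V[s]))
            V[s] = q_max
        if delta < theta:
            break
    # 3. Policy calculation: one final greedy pass over the converged values
    pi = {s: max(actions,
                 key=lambda a, s=s: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in states}
    return V, pi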
3.2 Monte Carlo Methods
Monte Carlo methods are used to find optimal policies
when only sample episodes are given and no other
knowledge about the dynamics of the environment.
However, they require a great many episodes to be able
to determine the optimal policy, making them of little
practical use for solving any but the smallest problems.
3.2.1 Monte Carlo Control Algorithm
This paper presents the Monte Carlo control method written as pseudocode in Algorithm 3. This is a method based on an ε-soft greedy policy, which means that π(a|s) > 0 for all s ∈ S, but which gradually moves closer and closer to a deterministic optimal policy. All non-greedy actions are determined by formula (10), for all a ∈ A:
$\pi(a \mid s_t) = \begin{cases} 1 - \varepsilon + \varepsilon/|A| & \text{if } a = A^* \\ \varepsilon/|A| & \text{if } a \neq A^* \end{cases}$   (10)

where A* = argmaxa Q(st, a).
The greedy strategy refers to a strategy in which the
agent always chooses the action with the highest value
of the value function. The selected action refers to the
greedy action and the agent is said to exploit the
environment. The
-greedy strategy refers to a strategy
in which the agent chooses the greedy action with
probability 1-
and a random action with a small
probability
. The random action refers to the non-
greedy action and the agent is said to explore the
environment.
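The ε-greedy choice used by the algorithms that follow amounts to a few lines of code; the dictionary layout of Q and the function name are assumptions of this sketch.

import random

def epsilon_greedy(Q, s, actions, eps=0.1, rng=random):
    # with probability eps explore uniformly, otherwise exploit argmax_a Q(s, a)
    if rng.random() < eps:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])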
Algorithm 3. Monte Carlo Control algorithm (first visit, for ε-soft policies)
Setting parameter value: γ = 0.99.
Algorithm parameter: small ε > 0
Initialize:
   π(s) ← an arbitrary ε-soft policy
   Q(s,a) ← arbitrary value in R, for all s ∈ S, a ∈ A
   Returns(s,a) ← empty list, for all s ∈ S, a ∈ A
repeat (for each episode):
   Generate an episode s0, a0, r1, …, sT−1, aT−1, rT:
     repeat (for each step t of episode):
       at ← ε-soft-greedy(st, Q)
     until s is terminal
   G ← 0
   for each step of episode, t = T−1, T−2, …, 0:
     G ← γG + rt+1
     if first visit of the pair (st, at) in s0, a0, s1, a1, …, st−1, at−1:
       Append G to Returns(st, at)
       Q(st, at) ← average(Returns(st, at))
       A* ← argmaxa Q(st, a)
     end
   end
until stop
3.3 Temporal-Difference Learning
Temporal-difference (TD) learning is a combination of
the ideas of Monte Carlo methods and dynamic
programming (DP). Like Monte Carlo methods, TD
methods can learn directly from experience without
engaging a model of environment dynamics. Like DP,
TD methods update the estimates based in part on
other learned estimates, without waiting for the final
result. The most popular algorithms based on temporal-difference learning are Q-learning and Sarsa.
Both of these algorithms learn the action-value
function.
3.3.1 Q-learning Algorithm
Q-learning learns the action-value function Q(s,a),
which is a prediction of the return associated with each
action a ∈ A, in each state s ∈ S. The temporal-difference (TD) between the target value and the estimated value at different time steps is used to update the value function Q [27]. The most basic method of calculating the temporal difference involves updating the state-value function V(s) as follows:

$V(s_t) \leftarrow V(s_t) + \alpha\,[\,r_t + \gamma V(s_{t+1}) - V(s_t)\,]$   (11)

where α is the learning rate and γ is the discount factor.
At each state, the value function V corresponds to the
action-value function Q for the action with the highest
predicted return, which can be written as
$V(s_t) = Q(s_t, a_t)$   (12)
Since the primary goal is to maximize the obtained
return Q(st,at), the function estimating the value of the
state V(st+1) can be written as
$V(s_{t+1}) = \max_{a \in A} Q(s_{t+1}, a)$   (13)
After substituting the dependencies (12) and (13)
into formula (11), the single-step update equation in
the Q-learning method is obtained
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_t + \gamma \max_{a \in A} Q(s_{t+1}, a) - Q(s_t, a_t)\,]$   (14)
As shown in [33], this update converges to some
fixed value for the Markov decision problems with a
finite number of states, for which the function Q is
stored in an array. When the function Q converges, the optimal policy selects, in each state, the action with the highest predicted return.
Algorithm 4 contains a pseudocode based on equation (14), which makes it possible to determine the route for the mobile agent using the Q-learning method.
Algorithm 4. Q-learning algorithm
Setting parameter values: γ = 0.99 and α = 0.01.
Initialize: Q(s,a) ← 0, for all s ∈ S, a ∈ A
repeat (for each episode):
   Initialize state s0
   repeat (for each step t of episode):
     at ← ε-greedy(st, Q)
     rt, st+1 ← Env(st, at)
     Q(st, at) ← Q(st, at) + α[rt + γ maxa Q(st+1, a) − Q(st, at)]
     st ← st+1
   until s is terminal
until stop
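A compact tabular Q-learning loop corresponding to Algorithm 4 and update (14) could look as follows in Python; the environment interface env.reset()/env.step(a) returning (next state, reward, done) and the default parameter values are assumptions of this sketch, not the paper's MATLAB implementation.

import random
from collections import defaultdict

def q_learning(env, actions, episodes=7000, alpha=0.01, gamma=0.99,
               eps=0.1, max_steps=1000):
    Q = defaultdict(float)                       # Q[(s, a)] initialised to 0
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # epsilon-greedy action selection
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda a_: Q[(s, a_)])
            s_next, r, done = env.step(a)
            # one-step Q-learning update, cf. equation (14)
            target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
            if done:
                break
    return Q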
3.3.2 Sarsa Algorithm
This method was proposed by Rummery and
Niranjan [23] under the name of Modified
Connectionist Q-Learning (MCQ-L). The currently used name of this method, Sarsa, proposed by Richard Sutton [28], is derived from the first letters of the words state-action-reward-state-action, and the method is used to determine the table Q of values of taking a specific action a in state s. The update rule in this method is formulated as follows:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,[\,r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\,]$   (15)
Algorithm 5 presents the Sarsa algorithm steps to
update the Q table based on equation (15).
Algorithm 5. Sarsa algorithm
Setting parameter values: γ = 0.99 and α = 0.01.
Initialize: Q(s,a) ← 0, for all s ∈ S, a ∈ A
repeat (for each episode):
   Initialize state s0
   a0 ← argmaxa Q(s0, a)
   repeat (for each step t of episode):
     rt+1, st+1 ← Env(st, at)
     at+1 ← ε-greedy(st+1, Q)
     Q(st, at) ← Q(st, at) + α[rt+1 + γ Q(st+1, at+1) − Q(st, at)]
     st ← st+1
     at ← at+1
   until s is terminal
until stop
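In code, the only change with respect to the Q-learning loop above is the bootstrap term, which uses the action actually selected for the next step; a minimal sketch of a single update, with the same assumed dictionary layout of Q:

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.01, gamma=0.99):
    # one Sarsa step, cf. equation (15): bootstrap on the action actually chosen
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])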
3.4 Eligibility Traces
Eligibility traces unify and generalize temporal-difference (TD) learning and Monte Carlo methods. When TD methods are supplemented with eligibility traces, they form a family of methods spanning a spectrum that has Monte Carlo methods at one end (λ = 1) and single-step TD methods at the other end (λ = 0). In between, there are intermediate methods that are often better than either of the extreme methods. An eligibility trace is a temporary record that maintains a trail of the state-action pairs taken over time.
When the eligibility traces are combined with Q-learning, the obtained combination is known as the Watkins Q(λ) algorithm [also simply referred to as the Q(λ) algorithm]. It is similar to the Q-learning algorithm, except that it is augmented with eligibility traces. The eligibility traces are updated in two steps. First, if a non-greedy action is taken, the eligibility traces are set to zero for all state-action pairs. Otherwise, they are decayed by γλ. When the eligibility traces are combined with the Sarsa algorithm, the obtained combination is known as the Sarsa(λ) algorithm. Here, the basic algorithm is similar to the Sarsa algorithm, except that the backups are performed n steps later rather than one step later. The algorithms Q(λ) and Sarsa(λ), from [28], are presented here as Algorithm 6 and Algorithm 7, respectively.
Algorithm 6. Q(λ) algorithm
Setting parameter values: γ = 0.99 and α = 0.01.
Initialize:
   Q(s,a) ← 0, for all s ∈ S, a ∈ A
   E(s,a) ← 0, for all s ∈ S, a ∈ A
repeat (for each episode):
   Initialize state s0
   Initialize action a0
   repeat (for each step t of episode):
     rt, st+1 ← Env(st, at)
     at+1 ← ε-greedy(st+1, Q)
     a* ← argmaxa Q(st+1, a)
     δ ← rt + γ Q(st+1, a*) − Q(st, at)
     E(st, at) ← 1
     for all s ∈ S:
       for all a ∈ A:
         Q(s,a) ← Q(s,a) + α δ E(s,a)
         if at+1 == a* then
           E(s,a) ← γ λ E(s,a)
         else
           E(s,a) ← 0
         end
       end
     end
     st ← st+1
     at ← at+1
   until s is terminal
until stop
Algorithm 7. Sarsa(λ) algorithm
Setting parameter values: γ = 0.99 and α = 0.01.
Initialize:
   Q(s,a) ← 0, for all s ∈ S, a ∈ A
   E(s,a) ← 0, for all s ∈ S, a ∈ A
repeat (for each episode):
   Initialize state s0
   Initialize action a0
   repeat (for each step t of episode):
     rt, st+1 ← Env(st, at)
     at+1 ← ε-greedy(st+1, Q)
     δ ← rt + γ Q(st+1, at+1) − Q(st, at)
     E(st, at) ← 1
     for all s ∈ S:
       for all a ∈ A:
         Q(s,a) ← Q(s,a) + α δ E(s,a)
         E(s,a) ← γ λ E(s,a)
       end
     end
     st ← st+1
     at ← at+1
   until s is terminal
until stop
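The trace bookkeeping of Algorithms 6 and 7 can be condensed into a single update routine; the sketch below follows the replacing-trace variant used in the pseudocode, with dictionary-based Q and E tables and parameter names that are assumptions of this illustration.

def sarsa_lambda_update(Q, E, s, a, r, s_next, a_next,
                        alpha=0.01, gamma=0.99, lam=0.94):
    # TD error for the transition actually experienced
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]
    E[(s, a)] = 1.0                       # replacing trace for the visited pair
    for key in list(E):
        Q[key] += alpha * delta * E[key]  # credit all recently visited pairs
        E[key] *= gamma * lam             # decay every eligibility trace

For the Watkins Q(λ) variant, the traces would additionally be reset to zero whenever a non-greedy action is taken, as described above.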
3.5 Planning and Learning
There are methods that learn a model from experience.
Then, the obtained model is used to plan the agent's
policy. A good example of this method is the Dyna-Q
algorithm, which has a simple architecture that
integrates the main functions needed for learning and
planning the agent's actions.
3.5.1 Dyna-Q Algorithm
Dyna-Q is a combination of planning and temporal-difference (TD) learning. Planning is a process that takes a model of the environment as input and creates a policy using randomly generated simulated experience, while direct TD learning uses real experience obtained from interactions with the environment to refine the action-value function Q. The pseudocode for Dyna-Q is shown as Algorithm 8 [28], where Model-transitions(s,a) contains the next predicted state and Model-rewards(s,a) contains the received reward for the state-action pair (s,a), with N being the number of planning steps. If the number of planning steps N = 0, then the only active algorithm remains Q-learning.
Algorithm 8. Dyna-Q algorithm
Setting parameter values: γ = 0.99 and α = 0.01.
Initialize:
   Q(s,a) ← 0, for all s ∈ S, a ∈ A
   Model-transitions(s,a) ← 0, for all s ∈ S, a ∈ A
   Model-rewards(s,a) ← 0, for all s ∈ S, a ∈ A
   N ← constant
repeat (for each episode):
   Initialize state s0
   repeat (for each step t of episode):
     at ← ε-greedy(st, Q)
     rt, st+1 ← Env(st, at)
     Q(st, at) ← Q(st, at) + α[rt + γ maxa Q(st+1, a) − Q(st, at)]
     Model-transitions(st, at) ← st+1
     Model-rewards(st, at) ← rt
     k ← 0
     while k < N do
       sk ← random previously observed state from Model-transitions(s,a)
       ak ← random action previously taken in sk from Model-transitions(s,a)
       sk+1 ← Model-transitions(sk, ak)
       rk ← Model-rewards(sk, ak)
       Q(sk, ak) ← Q(sk, ak) + α[rk + γ maxa Q(sk+1, a) − Q(sk, ak)]
       k ← k + 1
     end
     st ← st+1
   until s is terminal
until stop
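The planning loop of Algorithm 8 can be sketched as follows; the model is assumed to be a dictionary mapping observed (s, a) pairs to the predicted (next state, reward), Q is assumed to be a defaultdict of floats, and all names are illustrative.

import random

def dyna_q_planning(Q, model, actions, n_steps=10, alpha=0.01, gamma=0.99):
    # replay n_steps simulated transitions drawn from the learned model
    observed = list(model.keys())            # (s, a) pairs seen in real experience
    for _ in range(n_steps):
        s_k, a_k = random.choice(observed)   # random previously observed pair
        s_next, r_k = model[(s_k, a_k)]      # model's predicted outcome
        target = r_k + gamma * max(Q[(s_next, a_)] for a_ in actions)
        Q[(s_k, a_k)] += alpha * (target - Q[(s_k, a_k)])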
3.6 Temporal-Difference Learning with Adam Optimizer
The Adam (Adaptive Moment Estimation) optimizer is
an efficient stochastic optimization algorithm that
requires only first-order gradients with a small
memory footprint. The algorithm computes the first
and second moments of the gradients, which are the
estimates of the mean and uncentered variance of the
gradients, respectively. These moments are used to
update the model parameters [12].
Algorithms 9 and 10 contain pseudocodes that present the subsequent steps of updating the Q table parameters using the Q-learning and Sarsa methods with the Adam optimizer. The Adam algorithm introduces two hyperparameters, β1 and β2, which control the decay rates of the first and second moments, respectively. Typically, these parameters have the values β1 = 0.9 and β2 = 0.999, which have been found to work very well in practice [12]. The algorithm starts by setting the first (mt) and second (vt) moment variables equal to zero. During each training iteration, the gradients of the model parameters with respect to the loss function are computed. The first and second moments are then updated using exponential moving averages.
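For a single scalar table entry, the moment updates described above reduce to the following step, transcribed from Kingma and Ba [12] with the hyperparameter values listed in Algorithms 9 and 10; the function name and the scalar setting are assumptions of this sketch.

import math

def adam_step(theta, grad, m, v, t, alpha=0.01,
              beta1=0.9, beta2=0.999, eps=1e-8):
    # first moment (mean) and second moment (uncentred variance) estimates
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # bias-corrected step size, then the parameter update (step count t >= 1)
    alpha_t = alpha * math.sqrt(1.0 - beta2 ** t) / (1.0 - beta1 ** t)
    theta -= alpha_t * m / (math.sqrt(v) + eps)
    return theta, m, v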
Algorithm 9. Q-learning algorithm with Adam optimizer
Setting parameter values: α = 0.01, β1 = 0.9, β2 = 0.999, ε̂ = 10^-8, γ = 0.99.
Initialize:
   f(s,a;θ) ← 0, for all s ∈ S, a ∈ A
   m0 ← 0 (1st moment vector)
   v0 ← 0 (2nd moment vector)
repeat (for each episode):
   Initialize timestep: t ← 0
   Initialize state: s0
   repeat (for each step t of episode):
     Q(s,a) ← f(s,a;θ)
     at ← ε-greedy(st, Q)
     rt, st+1 ← Env(st, at)
     y ← rt + γ maxa Q(st+1, a)
     δQ ← Q(st, at) − y
     ∇f(s,a;θ) ← 0
     ∇f(st, at;θ) ← ∇f(st, at;θ) + δQ
     t ← t + 1
     gt ← ∇f(s,a;θ)
     mt ← β1 mt−1 + (1 − β1) gt
     vt ← β2 vt−1 + (1 − β2) gt^2
     α̂t ← α √(1 − β2^t) / (1 − β1^t)
     f(s,a;θ) ← f(s,a;θ) − α̂t mt / (√vt + ε̂)
     st ← st+1
   until s is terminal
until stop
Algorithm 10. Sarsa algorithm with Adam optimizer
Setting parameter values: α = 0.01, β1 = 0.9, β2 = 0.999, ε̂ = 10^-8, γ = 0.99.
Initialize:
   f(s,a;θ) ← 0, for all s ∈ S, a ∈ A
   Q(s,a) ← 0, for all s ∈ S, a ∈ A
   m0 ← 0 (1st moment vector)
   v0 ← 0 (2nd moment vector)
repeat (for each episode):
   Initialize timestep: t ← 0
   Initialize state: s0
   a0 ← argmaxa Q(s0, a)
   repeat (for each step t of episode):
     Q(s,a) ← f(s,a;θ)
     rt, st+1 ← Env(st, at)
     at+1 ← ε-greedy(st+1, Q)
     y ← rt + γ Q(st+1, at+1)
     δQ ← Q(st, at) − y
     ∇f(s,a;θ) ← 0
     ∇f(st, at;θ) ← ∇f(st, at;θ) + δQ
     t ← t + 1
     gt ← ∇f(s,a;θ)
     mt ← β1 mt−1 + (1 − β1) gt
     vt ← β2 vt−1 + (1 − β2) gt^2
     α̂t ← α √(1 − β2^t) / (1 − β1^t)
     f(s,a;θ) ← f(s,a;θ) − α̂t mt / (√vt + ε̂)
     st ← st+1
     at ← at+1
   until s is terminal
until stop
4 RESULTS OF SIMULATION STUDIES
The simulation studies of the developed reinforcement
learning algorithms were performed using MATLAB
software. The environment in these simulations was a
maze consisting of 20x20 = 400 fields, of which 56 fields,
i.e. 14% of all fields, were static obstacles. Hence, the
number of available environmental states was 344. The
exemplary arrangement of the assumed obstacles is
shown in Figure 3, with obstacle positions marked grey.
The initial field was located in the upper left corner of
the maze, while the target field was in the lower right
corner. A mobile agent can move from its current
position to an adjacent position in one of four
directions: north, south, east, or west, except for the
directions in which it encounters obstacles or leaves the
environment. In such a case, the agent maintains its
current position. The main goal of the reinforcement
learning algorithms was to determine the shortest path
for a mobile agent from the initial position (upper left
corner of the maze) to the target position (lower right
corner of the maze). In the analysed case, the shortest path consisted of 38 steps, and there are several paths in the maze which meet this condition. In the learning
process, for each algorithm, the minimum number of
episodes required to find the shortest path was sought.
This learning stage was called path planning for a
mobile agent. An episode is a single run of the
reinforcement learning algorithm, searching for a path
between the initial and target points. The learning
process is an iterative solution of the optimization
problem, in which the total reward obtained in a single
episode is maximized. In all algorithms for which the results are shown in Table 1, the agent, after reaching the target position, received a reward of 50, while in the remaining accessible fields it received a reward of −1. The fields containing static obstacles were inaccessible to the agent.
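The maze described above can be emulated with a very small environment class compatible with the learning-loop sketches from Section 3; the obstacle layout of Figure 3 is not reproduced here, so the obstacles argument is a placeholder, and the class itself is an illustrative reconstruction rather than the MATLAB code used in the study.

class MazeEnv:
    # 20x20 grid, start at the top-left corner, goal at the bottom-right corner
    def __init__(self, size=20, obstacles=frozenset()):
        self.size = size
        self.obstacles = obstacles      # placeholder for the 56 obstacle cells of Figure 3
        self.goal = (size - 1, size - 1)
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        moves = {'N': (0, -1), 'S': (0, 1), 'E': (1, 0), 'W': (-1, 0)}
        dx, dy = moves[action]
        nxt = (self.state[0] + dx, self.state[1] + dy)
        # moving into an obstacle or off the grid keeps the agent in place
        if (not (0 <= nxt[0] < self.size and 0 <= nxt[1] < self.size)
                or nxt in self.obstacles):
            nxt = self.state
        self.state = nxt
        done = nxt == self.goal
        reward = 50.0 if done else -1.0  # reward scheme described in the text
        return nxt, reward, done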
Figure 3 shows the routes found by the dynamic
programming algorithms (policy iteration and value
iteration), both routes are 38 steps long. In these
algorithms, during the learning stage, the mobile agent
could also move onto forbidden fields, which were treated as holes. After reaching the target position, the agent received a reward of 50, but on each hole-type field it got a reward of −10, and on the remaining accessible fields a reward of 0.
Figure 3. Paths planned by DP algorithms: policy iteration
(Algorithm 1; red dots) and value iteration (Algorithm 2; blue
dots).
Figure 4 presents the results of training three
algorithms: Monte Carlo control, Q-learning and Sarsa.
The figure shows the number of steps in individual
episodes that the learning algorithm performed in
search of a path between the initial and target points in
the tested maze. If the learning algorithm did not find
a path in the maximum number of steps planned for a
single episode, then the episode was ended without
reaching the goal. For the Q-learning and Sarsa
algorithms, the maximum number of steps in a single
episode was constant and amounted to 1000. The
situation was different in the Monte Carlo control
algorithm, in which the initial value of the maximum
number of steps in a single episode was 100000, and
only after the path to the target point had been found for the first time was it reduced to 10000. Then, after the next
three episodes in which it was possible to find a path
to the target point, it was reduced to 1000. The
minimum number of episodes required to run each
learning algorithm before finding the shortest
transition path consisting of 38 steps is given in Table
1. The transition routes obtained with the minimum
number of episodes required to run the algorithm are
shown in Figure 5. Of these three algorithms presented
in Figures 4 and 5, the best result was obtained for the
Q-learning algorithm, which required running 7000
episodes in the learning process.
Figure 6 presents the results of training three algorithms: Q(λ), Sarsa(λ) and Dyna-Q. Similarly to Figure 4, the graphs show the number of steps in individual episodes. For the Q(λ) and Sarsa(λ) algorithms, the values of λ were selected in such a way that they gave the minimum number of episodes required to find the shortest path, consisting of 38 steps, between the initial and target points. The paths obtained for each of these algorithms are shown in Figure 7, with the number of episodes given in Table 1. For the Dyna-Q algorithm, the value N = 10 was selected, which allowed it to obtain training results similar to those of the other two algorithms, Q(λ) and Sarsa(λ). Out of this group, the best result was obtained for the Sarsa(λ = 0.94) algorithm.
Figure 4. Comparison of steps per episode for the algorithms: Monte Carlo control (Algorithm 3), Q-learning (Algorithm 4) and Sarsa (Algorithm 5).
Figure 5. Paths planned by the algorithms: Monte Carlo control (green dots), Q-learning (blue dots) and Sarsa (red dots), with the number of episodes from Table 1.
The results of training the last three analyzed algorithms, i.e. Dyna-Q again, but this time with N=30, and Q-learning and Sarsa, both with the Adam optimizer, are shown in Figure 8. The graphs show the number of steps in individual episodes. The transition routes obtained for these algorithms are shown in Figure 9.
Figure 6. Comparison of steps per episode and sums of rewards per episode for the algorithms: Q(λ) (Algorithm 6), Sarsa(λ) (Algorithm 7) and Dyna-Q (N=10; Algorithm 8).
Figure 7. Paths planned by the algorithms: Dyna-Q (N=10; green dots), Q(λ) (blue dots) and Sarsa(λ) (red dots), with the number of episodes from Table 1.
It is noteworthy that the best results of all the tested algorithms have been obtained for this group. Analyzing the results shown in Figure 8, it can be seen that the Dyna-Q algorithm (N=30) finds a transition route close to the optimal one quite quickly, but the consecutive routes found afterwards are still longer than the shortest one, and many more episodes were needed to determine the shortest transition route.
The situation is different with the Q-learning and Sarsa algorithms with the Adam optimizer, which, after finding the shortest transition route very quickly, also quickly stabilize on this solution. In this group, the best result was obtained for the Q-learning algorithm with the Adam optimizer, and this algorithm, which required running 210 episodes in the learning process, was considered the best of all the algorithms tested.
Figure 8. Comparison of steps per episode and sums of rewards per episode for the algorithms: Dyna-Q (N=30; Algorithm 8), Q-learning with Adam optimizer (Algorithm 9) and Sarsa with Adam optimizer (Algorithm 10).
Figure 9. Paths planned by the algorithms: Dyna-Q (N=30), Q-learning with Adam optimizer and Sarsa with Adam optimizer, with the number of episodes from Table 1.
Table 1. The performance of the algorithms: Monte Carlo control, Q-learning, Sarsa, Q(λ), Sarsa(λ), Dyna-Q, Q-learning with Adam optimizer, and Sarsa with Adam optimizer.

Algorithm                   Episodes   Max steps per episode
Monte Carlo control         9000       1000 1)
Q-learning                  7000       1000
Sarsa                       7450       1000
Q(λ)                        2900       1000
Sarsa(λ)                    1200       1000
Dyna-Q (N=10)               2400       1000
Dyna-Q (N=30)               600        1000
Q-learning with Adam        210        1000
Sarsa with Adam             225        1000

1) This value was finally set after finding the path to the target position four times, with the initial value of 100000 for the first time and the value of 10000 for the second, third and fourth times.
5 CONCLUSIONS
The paper evaluates a number of selected
reinforcement learning algorithms used to solve the
problem of trajectory planning for a mobile agent in the
environment with static obstacles. The most important
evaluation criterion was the speed of determining the
shortest path between the initial and target points. A
wide range of algorithms was considered, including
dynamic programming, Monte Carlo, temporal-
difference learning, planning and learning, and the use
of the Adam optimizer.
The principle of operation of the dynamic programming algorithms, i.e. policy iteration and value iteration, is completely different from that of the remaining algorithms, as it requires knowledge of the exact model of the environment to determine the agent's policy that finds the minimum length of the transition route between the initial and target points, which in this case was equal to 38 steps.
The remaining algorithms did not require knowledge of the exact mathematical model of the environment; they taught the agent a policy that minimized the transition path between the initial and final points by gathering experience through interaction with the environment. In this group of algorithms, the simulation studies have shown that the Q-learning and Sarsa algorithms in which the Adam gradient optimizer was used were much faster than the others in finding the optimal transition path between the initial and target points. The analysis of the obtained simulation results shows that the Q-learning algorithm with the Adam optimizer can be considered the best choice for finding the transition path for a mobile agent in a real environment.
ACKNOWLEDGMENTS
This research was funded by the project of Gdynia Maritime
University in Poland, No. WE/2025/PZ/03: ”New methods of
controlling the motion of an autonomous ship in open and
restricted waters”.
REFERENCES
[1] M. N. Ab Wahab, A. Nazir, A. Khalil, B. Bhatt, M. H.
Mohd Noor, M. F. Akbar, and A. S. A. Mohamed,
“Optimised path planning using enhanced firefly
algorithm for a mobile robot,” PLoS ONE, vol. 19, no. 8,
e0308264, Aug. 2024, doi: 10.1371/journal.pone.0308264.
[2] R. E. Bellman, “A Markovian decision process,” J. Math.
Mech., vol. 6, no. 5, pp. 679-684, 1957, doi:
10.1512/iumj.1957.6.56038.
[3] M. Blaich, M. Rosenfelder, M. Schuster, O. Bittel, and J.
Reuter, “Fast grid based collision avoidance for vessels
using A-star search algorithm,” In Proceedings of the 17th
International Conference on Methods and Models in
Automation and Robotics, MMAR, Aug. 27-30, 2012; pp.
385-390, doi: 10.1109/MMAR.2012.6347858.
[4] P. Brundavani, and Y. Avanija, “Robot path planning and
tracking with the flower pollination search optimization
algorithm,” J. Comput. Allied Intell., vol. 2, no. 4, pp. 70-
81, Aug. 2024, doi: 10.69996/jcai.2024020.
[5] C. Chen, X.-Q. Chen, F. Ma, X.-J. Zeng, and J. Wang, “A
knowledge-free path planning approach for smart ships
based on reinforcement learning,” Ocean Eng., vol. 189,
e106299, Aug. 2019, doi: 10.1016/j.oceaneng.2019.106299.
[6] J. Chen, J. Liang, and Y. Tong, “Path planning of mobile
robot based on improved differential evolution
algorithm,” In Proceedings of the 16th International
Conference on Control, Automation, Robotics and Vision,
ICARCV, Dec. 13-15, 2020, pp. 811-816, doi:
10.1109/ICARCV50220.2020.9305415.
[7] E. W. Dijkstra, “A note on two problems in connexion
with graphs,” Numer. Math, vol. 1, pp. 269-271, 1959.
[8] H. Dong, Z. Ding, and S. Zhang, Eds. “Deep
Reinforcement Learning: Fundamentals, Research and
Applications,” Springer Nature Book, Jun. 2020,
https://deepreinforcementlearningbook.org.
[9] L. E. Dubins, “On curves of minimal length with a
constraint on average curvature, and with prescribed
initial and terminal positions and tangents.” Am. J. Math.,
vol. 79, pp. 497-516, 1957, doi: 10.2307/2372560.
[10] V. Francois-Lavet, P. Henderson, R. Islam, M. G.
Bellemare, and J. Pineau, “An Introduction to deep
reinforcement learning,” Found. Trends Mach. Learn.,
vol. 11, pp. 219-354, Nov. 2018, doi:
10.48550/arXiv.1811.12560.
[11] R. Jaramillo-Martínez, E. Chavero-Navarrete, and T.
Ibarra-Pérez, “Reinforcement-learning-based path
planning: A reward function strategy,” Applied Sci., vol.
14, e7654, Aug. 2024, doi: 10.3390/app14177654.
[12] D.P. Kingma, and J. L. Ba, “Adam: A method for
stochastic optimization,” Computing Research
Repository, Dec. 2014, https://arxiv.org/abs/1412.6980.
[13] S. M. LaValle, “Rapidly-exploring random trees: A new
tool for path planning,” Annu. Res. Rep., Computer
Science Department, Iowa State University. Ames, Iowa,
USA, 1998.
https://api.semanticscholar.org/CorpusID:14744621.
[14] A. Lazarowska, “Comparison of discrete artificial
potential field algorithm and wave-front algorithm for
autonomous ship trajectory planning,” IEEE Access, vol.
8, pp. 221013-22102, Dec. 2020, doi:
10.1109/ACCESS.2020.3043539.
[15] Y. Li, J. Zhao, Z. Chen, G. Xiong, and S. Liu, “A robot
path planning method based on improved genetic
algorithm and improved dynamic window approach,”
Sustainability, vol. 15, no. 5, e4656, Mar. 2023, doi:
10.3390/su15054656.
[16] M. Lin, K. Yuan, C. Shi, and Y. Wang, “Path planning of
mobile robot based on improved A* algorithm,” In
Proceedings of the 29th Chinese Control And Decision
Conference, CCDC, May 28-30, 2017, pp. 3570-3576, doi:
10.1109/CCDC.2017.7979125.
[17] L. Liu, J. Yao, D. He, J. Chen, J. Huang, H. Xu, B. Wang,
and J. Guo, “Global dynamic path planning fusion
algorithm combining jump-A* algorithm and dynamic
window approach,” IEEE Access, vol. 9, pp. 19632-19638,
Jan. 2021, doi: 10.1109/ACCESS.2021.3052865.
[18] L. Liu, L. Li, H. Nian, Y. Lu, H. Zhao, and Y. Chen,
“Enhanced grey wolf optimization algorithm for mobile
robot path planning,” Electronics, vol. 12, no. 18. e4026,
Sep. 2023, doi: 10.3390/electronics12194026.
[19] J. Luo, Z. X. Wang, and K. L. Pan, “Reliable path
planning algorithm based on improved artificial potential
field method,” IEEE Access, vol. 10, p. 108276-108284,
Oct. 2022, doi: 10.1109/ACCESS.2022.3212741.
[20] W. Min, L. Mo, B. Yin, and S. Li, “An improved cuckoo
search algorithm and its application in robot path
planning,” Appl. Sci., vol. 14, no. 20, e9572, Oct. 2024, doi:
10.3390/app14209572.
[21] D. K. Muhsen, F. A. Raheem, and A. T. Sadiq, “A
Systematic review of rapidly exploring random tree RRT
algorithm for single and multiple robots,” Cybern. Inf.
Technol., vol. 24, pp. 78-101, Sep. 2024, doi: 10.2478/cait-
2024-0026.
[22] G. Parlangeli, and G. Indiveri, “Dubins inspired 2D
smooth paths with bounded curvature and curvature
derivative,” IFAC Proceedings Volumes, vol. 43, no. 16,
pp. 252-257, 2010, doi: 10.3182/20100906-3-IT-2019.00045.
[23] G. A. Rummery, and M. Niranjan, “On-line Q-learning
using connectionist systems,” Technical Report,
Cambridge University Engineering Department.
Cambridge, England. 1994.
[24] D. Shan, S. Zhang, X. Wang, and P. Zhang, “Path-
planning strategy: adaptive ant colony optimization
combined with an enhanced dynamic window
approach,” Electronics, vol. 13, no. 5, e825, Feb. 2024, doi:
10.3390/electronics13050825.
[25] R. Singh, J. Ren, and X. Lin, “A review of deep
reinforcement learning algorithms for mobile robot path
planning,” Vehicles, vol. 5, no. 4, pp. 1423-1451, Oct. 2023,
doi: 10.3390/vehicles5040078.
[26] P. Sudhakara, and V. Ganapathy, “Trajectory planning
of a mobile robot using enhanced A-star algorithm,” J. Sci.
Technol., vol. 9, no. 41, pp. 1-10, 2016, doi:
10.17485/ijst/2016/v9i41/93816.
[27] R. S. Sutton, “Learning to predict by the methods of
temporal differences,” Mach. Learn., vol. 3, pp. 9-44, 1988, doi: 10.1007/BF00115009.
[28] R. S. Sutton, and A. G. Barto, “Reinforcement Learning.
An Introduction,” 2d edition, MIT Press. Cambridge, MA,
USA, 2020.
[29] R. Tang, X. Tang, and H. Zhao, “Enhancing whale
optimization algorithm with differential evolution and
Lévy flight for robot path planning.” Int. J. Adv. Comput.,
Sci. Appl., vol. 15, no. 6, pp. 401-410, 2024, doi:
10.14569/IJACSA.2024.0150540.
[30] P. Vana, and J. Faigl, “Optimal solution of the
generalized Dubins interval problem: Finding the
shortest curvature-constrained path through a set of
regions,” Auton. Robots, vol. 44, pp. 1359-1376, Aug.
2020, doi: 10.1007/s10514-020-09932-x.
[31] S. X. Wang, “The improved Dijkstra's shortest path
algorithm and its application”. Procedia Eng., vol. 29, pp.
1186-1190, Feb. 2012, doi: 10.1016/j.proeng.2012.01.110.
[32] X. Wang, Z. Liu, and J. Liu, “Mobile robot path planning
based on an improved A-star algorithm,” In Proceedings
of the 2nd International Conference on Computer
Graphics, Artificial Intelligence, and Data Processing,
ICCAID, May 23, 2023, doi: 10.1117/12.2674526.
[33] C. Watkins, “Learning from delayed rewards,” Ph.D. dissertation, Cambridge University, Cambridge, U.K., 1989.
[34] Y. Xu, B. Sang, and Y. Zhang, “Application of improved
sparrow search algorithm to path planning of mobile
robots,” Biomimetics, vol. 9, no. 6, e351, Jun. 2024, doi:
10.3390/biomimetics9060351.
[35] G. Zhang, Y. Deng, W. Zhang, and C. Huang, “Novel
DVS guidance and path-following control for
underactuated ships in presence of multiple static and
moving obstacles”, Ocean Eng., vol. 170, pp. 100–110,
Dec. 2018, doi: 10.1016/j.oceaneng.2018.10.009.
[36] L. Zheng, W. Yu, G. Li, G. Qin, and Y. Luo, “Particle
swarm algorithm path-planning method for mobile
robots based on artificial potential fields,” Sensors, vol.
23, no. 13, e6082, Apr. 2023, doi: 10.3390/s23136082.