Edge-Guided Multi-Scale Fusion and Importance-Aware Learning for Real-Time Semantic Segmentation in Waterborne Navigation

L. Chen¹, J. Zou¹, Y. Huang², Y. Zhou¹, G. Hao¹ & Y. Zhang³
¹ Wuhan University of Technology, Wuhan, China
² Wuhan Institute of Shipbuilding Technology, Wuhan, China
³ Hubei University of Chinese Medicine, Wuhan, China

DOI: 10.12716/1001.19.02.30

ABSTRACT: Effective multi-scale feature representation and focused attention on critical objects are essential for accurate perception of waterborne navigation scenes. To address the insufficient exploitation of multi-scale information in existing methods, which leads to imprecise segmentation, this study proposes a real-time semantic segmentation method for waterborne navigation scenes through multi-scale information enhancement and importance-weighted optimization. First, DDRNet-23-slim is selected as the backbone network for feature extraction. An edge-guided branch is embedded into its shallow layers, and a Dynamic Feature Fusion Module (DFFM) is constructed by integrating a lightweight hybrid attention mechanism, effectively enhancing multi-scale feature interaction capabilities. Second, the loss function is improved using an importance-weighted strategy to prioritize critical objects during training. Finally, a parameter-free attention mechanism is introduced in the upsampling stage, maintaining real-time performance while ensuring segmentation stability for key objects under complex background interference. Evaluations on the On_Water and SeaShips datasets demonstrate that the proposed method achieves mIoU scores of 83.1% and 73.2%, respectively, with ship segmentation accuracy reaching 88.2% on On_Water. The inference speed attains 69.1 FPS, outperforming mainstream real-time segmentation models (e.g., DDRNet, STDC) in balancing accuracy and efficiency. Notably, it exhibits stronger robustness in complex inland river scenarios with dense shore structures and numerous small targets.
1 INTRODUCTION
With the rapid development of artificial intelligence
and automation technologies, intelligent shipping and
unmanned vessels are gradually moving toward
practical applications, bringing new opportunities to
fields such as water traffic management, autonomous
ship navigation, and ocean monitoring [1]. Achieving
autonomous navigation and obstacle avoidance for
unmanned vessels relies on the precise understanding
of various objects in maritime traffic scenes, such as
ships, buoys, water bodies, and the sky. Semantic
segmentation technology, as a key means to achieve
this goal, is gaining widespread attention in maritime
scenarios [2].
Early studies employed traditional digital image
processing methods for water body detection and
obstacle recognition. For example, P. Santana et al. [3]
utilized digital image processing techniques for water
body detection, and Fefilatye et al. [4] proposed
obstacle detection methods based on handcrafted
features. However, these methods' reliance on simple
features leads to significant declines in segmentation
accuracy in complex scenarios such as coastal areas,
near-shore regions, or docks, especially under
conditions of visual ambiguity and light reflections.
Additionally, with the widespread adoption of high-
resolution devices, traditional methods face challenges
of slow processing speeds and poor segmentation
performance when handling large-scale image data.
In recent years, semantic segmentation methods
based on deep convolutional neural networks (CNNs)
have achieved remarkable results in terrestrial traffic
scenes due to their powerful feature learning
capabilities. However, directly applying these methods
(e.g., PSPNet) to maritime traffic scenes still presents
multiple challenges. Cheng et al. [5] proposed a deep
network-based method for land-sea boundary
segmentation in high-resolution remote sensing
images, but it often suffers from underfitting and
insufficient dynamic water surface feature extraction in
maritime scenes. C.Y. Jeong et al. [6] used the Pyramid
Scene Parsing Network (PSPNet) for horizon detection,
which improved boundary recognition to some extent,
but misjudgments still occur under conditions of visual
ambiguity or unclear boundaries between the horizon
and the sky. Meanwhile, the monocular obstacle detection model of M. Kristan et al. [7] struggles with large sea-surface fluctuations and small obstacle recognition.
Some studies have attempted to enhance segmentation
performance by incorporating multimodal data. For
instance, Scherer et al. [8] fused data from stereo
cameras, IMU/GPS, and laser scanners, achieving
certain improvements. However, these methods
heavily rely on external hardware and complex post-
processing steps, making real-time segmentation
difficult in resource-constrained environments. Bovcon et al. [9] improved segmentation accuracy
under visually ambiguous conditions by designing a
semantic separation loss function combined with high-
precision IMU data, but such reliance on external high-
precision data limits their application on small devices
like mobile platforms. Furthermore, to address issues
such as water surface reflections and boundary
ambiguity, Qiao Yulong et al. [10] improved water
surface segmentation accuracy through image
preprocessing and sea-sky line estimation parameters.
Bao Xuecai et al. [11] introduced attention mechanisms
and fully connected conditional random field models
to enhance floating object boundary recognition, while
Xiong Rui et al. [12] designed a fast feature extraction
network to balance real-time performance and
accuracy. However, these methods still fall short in
effectively fusing multi-scale information, focusing on
key targets, and ensuring overall system real-time
performance, making it difficult to fully meet the dual
requirements of segmentation accuracy and real-time
performance for autonomous navigation and obstacle
avoidance of unmanned vessels.
In summary, although some solutions for semantic
segmentation in maritime navigation scenes have been
proposed, existing methods still struggle to achieve
ideal segmentation results due to challenges such as
complex lighting variations, adverse weather
conditions, dynamic backgrounds, and small target
detection in maritime environments. Some scholars
have directly borrowed segmentation methods from
terrestrial traffic scenes, but these methods fail to
adequately account for interference factors such as
water surface reflections, rain, and fog in maritime
environments, leading to issues like edge blurring and
inaccurate recognition in practical applications.
Additionally, given the limited computational
resources of small unmanned vessels and the demand
for real-time processing, there is an urgent need to
develop an efficient and specialized real-time semantic
segmentation method for maritime navigation scenes.
To this end, this paper proposes a real-time semantic
segmentation model for maritime navigation scenes
that integrates multi-scale information and importance
weighting (Multi-Scale Importance-Aware Network,
MSIA-Net). The model aims to address edge blurring
in object segmentation, enhance recognition accuracy
for complex shore backgrounds and multi-scale object
contours, and thereby provide more accurate and real-
time environmental understanding for autonomous
navigation and obstacle avoidance of unmanned
vessels.
2 MODEL NETWORK ARCHITECTURE DESIGN
To address the challenges of numerous floating objects,
frequent small obstacles, and complex background
interference in maritime navigation scenes while
ensuring real-time segmentation performance, this
paper introduces an edge-guided branch and a
dynamic feature fusion module into the backbone
network of DDRNet-23-slim, achieving precise capture
of target boundaries and multi-scale information.
Specifically, the edge-guided branch employs the Sobel
operator for edge detection and integrates the SE
module for channel attention adjustment, generating
edge-aware feature maps to enhance the model's
ability to detect small obstacles and floating objects.
The dynamic feature fusion module combines the
edge-guided module with multi-scale features from
the backbone network, adaptively generating weights
for features at different scales through a lightweight
attention network, thereby achieving efficient fusion
and further improving the model's segmentation
performance for multi-scale objects.
Additionally, to enhance the quality of feature maps
without increasing model complexity, a parameter-free
attention mechanism is introduced during the
upsampling phase. This mechanism adjusts feature
maps in both channel and spatial dimensions,
effectively enhancing their representational capacity
while maintaining the model's lightweight design.
Finally, to further improve the model's performance
in complex background and multi-scale object
segmentation tasks, an improved loss function based
on importance weighting is proposed. This loss
function assigns different weights to different pixels,
enabling the model to focus more on pixels that
significantly impact segmentation results during
training, thereby enhancing the model's robustness
and accuracy. The overall architecture of MSIA-Net is
illustrated in Figure 1.
Figure 1. Network structure of MSIA-Net
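To make the overall data flow concrete, the following PyTorch-style sketch shows how the components described above could be wired together. The class name, the assumption that the backbone returns a pair of low- and high-resolution feature maps, and the exact injection points are illustrative simplifications rather than the authors' released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MSIANetSketch(nn.Module):
    """Schematic wiring of the MSIA-Net pipeline; component modules are supplied externally."""
    def __init__(self, backbone, egm, dffm, attention, seg_head):
        super().__init__()
        self.backbone = backbone    # DDRNet-23-slim style dual-resolution backbone
        self.egm = egm              # edge-guided branch (Sobel + SE), Section 2.1.1
        self.dffm = dffm            # dynamic feature fusion module, Section 2.1.2
        self.attention = attention  # parameter-free attention (SimAM), Section 2.2
        self.seg_head = seg_head    # classifier producing per-class logits

    def forward(self, x):
        f_edge = self.egm(x)                        # edge-aware features from the RGB input
        f_low, f_high = self.backbone(x)            # multi-scale features from the backbone
        fused = self.dffm(f_low, f_high, f_edge)    # adaptively weighted multi-scale fusion
        fused = self.attention(fused)               # parameter-free attention in the upsampling stage
        logits = self.seg_head(fused)
        # restore the full input resolution
        return F.interpolate(logits, size=x.shape[2:], mode="bilinear", align_corners=False)
```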
2.1 Enhancement of the Backbone Network with Multi-
Scale Information
To adapt to the segmentation tasks of ships of varying
sizes and complex shore backgrounds in maritime
navigation scenes, this paper improves the backbone
network of DDRNet by introducing an edge-guided
branch and integrating multi-scale information
through a dynamic feature fusion module, thereby
enhancing the model's segmentation accuracy for
ships, obstacles, and shore boundaries in complex
water scenes. Specifically, by incorporating the Edge
Guidance Module (EGM), the model's sensitivity to
water surfaces, waves, and small object boundaries is
significantly enhanced. Combined with the dynamic
feature fusion module, multi-scale contextual
information is dynamically aggregated, expanding the
receptive field and improving the model's ability to
model complex background regions.
2.1.1 Edge-guided Branch
To address the segmentation task of ships of
varying sizes and complex shoreline backgrounds in
waterborne navigation scenarios, this paper improves
the backbone network of DDRNet-23-slim by
introducing an edge-guided branch (whose structure is
shown in Figure 2). The EGM is the core component of
the edge enhancement branch, aiming to explicitly
enhance edge information to improve the model's
segmentation accuracy of target boundaries in complex
scenes. This module first converts the three-channel
RGB image into a grayscale image, then uses the Sobel
operator to perform edge detection on the grayscale
image to enhance the edge response. Subsequently, the
Squeeze-and-Excitation (SE) module is employed to
adaptively weight the edge features, ensuring that
edge information plays a more significant role in the
subsequent feature fusion process.
The Sobel operator is a widely used edge detection
method in the field of image processing. It primarily
relies on discrete differential operators to approximate
the first-order gradient magnitude and direction at
each pixel location in an image. The Sobel operator
employs a pair of 3×3 convolution kernels to estimate
the gradients in the horizontal and vertical directions, $G_x$ and $G_y$, respectively:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * f_{input}, \qquad G_y = \begin{bmatrix} +1 & +2 & +1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * f_{input} \qquad (1)$$

where $*$ denotes the convolution operation and $f_{input}$ represents the input feature map. These two
convolution kernels highlight regions with rapid
intensity changes along specific directions, thereby
identifying edges in the image. To ensure that edge
information plays a more significant role in the
subsequent feature fusion process, the EGM module
adaptively weights the edge features using the
Squeeze-and-Excitation (SE) module. The SE module
computes channel-wise weights by performing global
average pooling on the feature map, followed by
weighted fusion of the features to enhance the
important characteristics. The computational process of this module is as follows:

$$f_{edge} = \sigma\big(Conv\big(AVGPool(G)\big)\big) \otimes G \qquad (2)$$

where $G$ denotes the Sobel edge response, $\sigma$ is the Sigmoid function, and $\otimes$ denotes channel-wise multiplication.
Figure 2. Edge-guided Branch
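A minimal PyTorch sketch of the edge-guided branch as described, using fixed Sobel kernels for Eq. (1) and an SE-style gate for the channel weighting, is given below. The output channel width, the reduction ratio, and the grayscale conversion weights are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgeGuidanceModule(nn.Module):
    """Sketch of the edge-guided branch: Sobel edge response + SE channel weighting."""
    def __init__(self, out_channels=32, reduction=4):
        super().__init__()
        # Fixed 3x3 Sobel kernels for the horizontal and vertical gradients (Eq. 1).
        gx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        gy = torch.tensor([[1., 2., 1.], [0., 0., 0.], [-1., -2., -1.]])
        self.register_buffer("sobel", torch.stack([gx, gy]).unsqueeze(1))  # shape (2,1,3,3)
        # Project the 2-channel gradient map to the desired feature width.
        self.proj = nn.Sequential(
            nn.Conv2d(2, out_channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        # Squeeze-and-Excitation: global pooling -> bottleneck -> sigmoid gate (Eq. 2).
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_channels, out_channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels // reduction, out_channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, rgb):
        # RGB -> grayscale (ITU-R BT.601 weights assumed), then Sobel edge response.
        gray = 0.299 * rgb[:, 0:1] + 0.587 * rgb[:, 1:2] + 0.114 * rgb[:, 2:3]
        grad = F.conv2d(gray, self.sobel, padding=1)   # (B,2,H,W): Gx and Gy responses
        feat = self.proj(grad)
        return feat * self.se(feat)                    # edge-aware, channel-reweighted features
```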
2.1.2 Dynamic Feature Fusion Module with Multi-scale
Information Enhancement
In the context of semantic segmentation for
waterborne navigation scenes, the fusion of multi-scale
information presents a significant challenge.
Waterborne environments typically encompass a
variety of features at different scales, such as intricate
shoreline structures, water surfaces, and vessels, each
carrying distinct semantic information across scales. To
address this, we propose an enhanced dynamic feature
fusion module (whose structure is shown in Figure 3)
designed to dynamically integrate features from
multiple levels. This module assigns appropriate
fusion weights to different spatial locations, effectively
capturing positional discrepancies between feature
maps and the original input image.
The implementation begins with upsampling
feature maps of different scales, f1 and f2 , to the size of
fedge followed by concatenation along the channel
dimension. A 1×1×1 spatial convolution kernel is then
applied to the concatenated feature map, and a
Sigmoid function is used to compute the spatial weight
wsp which enhances the representation of critical spatial
regions. The spatial weight wsp is calculated as:
$$w_{sp} = \sigma\Big(Conv_{1\times1\times1}\big(concat(f_1;\, f_2;\, f_{edge})\big)\Big) \qquad (3)$$
where $\sigma$ denotes the Sigmoid function. Concurrently,
channel weights wch are derived through a cascade of
average pooling (AVGPool), a convolutional layer
(Conv), and a Sigmoid activation. The channel weight
wch is expressed as:
$$w_{ch} = \sigma\Big(Conv_{1\times1}\big(AVGPool\big(concat(f_1;\, f_2;\, f_{edge})\big)\big)\Big) \qquad (4)$$
Subsequently, each feature map is weighted using
the computed spatial weights wsp and channel weights
wch, followed by element-wise addition. A 1×1
convolution is applied to the fused feature map to
compress its channels, yielding the final feature map $f_{end}$. The final output is computed as:

$$f_{end} = Conv_{1\times1}\big((f_1 + f_2 + f_{edge}) \otimes w_{sp} \otimes w_{ch}\big) \qquad (5)$$
By generating dynamic weights, the proposed
module adaptively adjusts the importance of each
feature based on the global context of the input. This
enables the model to selectively emphasize key
features, thereby enhancing its ability to represent
multi-scale information effectively. The dynamic
feature fusion mechanism ensures robust performance
in capturing diverse and complex features inherent in
waterborne navigation scenes.
Figure 3. Improved Dynamic Feature Fusion Module
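The following PyTorch sketch illustrates one way to realise Eqs. (3)-(5). It assumes that f1, f2 and f_edge share the same channel count and interprets the 1×1×1 spatial convolution as a standard 1×1 convolution producing a single-channel weight map; these details are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFeatureFusion(nn.Module):
    """Sketch of the dynamic feature fusion module (Eqs. 3-5)."""
    def __init__(self, channels):
        super().__init__()
        # Spatial weight: 1x1 conv over the concatenated features, then sigmoid (Eq. 3).
        self.spatial = nn.Conv2d(3 * channels, 1, kernel_size=1)
        # Channel weight: global average pooling -> 1x1 conv -> sigmoid (Eq. 4).
        self.channel = nn.Conv2d(3 * channels, channels, kernel_size=1)
        # Final 1x1 conv compressing the fused map (Eq. 5).
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f1, f2, f_edge):
        # Upsample f1 and f2 to the spatial size of the edge feature map.
        size = f_edge.shape[2:]
        f1 = F.interpolate(f1, size=size, mode="bilinear", align_corners=False)
        f2 = F.interpolate(f2, size=size, mode="bilinear", align_corners=False)
        cat = torch.cat([f1, f2, f_edge], dim=1)

        w_sp = torch.sigmoid(self.spatial(cat))                             # (B,1,H,W)
        w_ch = torch.sigmoid(self.channel(F.adaptive_avg_pool2d(cat, 1)))   # (B,C,1,1)

        fused = (f1 + f2 + f_edge) * w_sp * w_ch   # weight each map, then sum and compress
        return self.fuse(fused)
```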
2.2 Parameter-free Attention Guidance
In real-time semantic segmentation for waterborne
navigation scenarios, key targets (such as ships,
obstacles, and navigation marks) typically occupy only
a small portion of the image, while the majority of the
area consists of water surfaces. This not only results in
relatively sparse information about key targets in the
overall image but also introduces interference in
feature extraction due to factors such as water
reflections, waves, and environmental lighting
changes. Traditional attention mechanisms (e.g.,
spatial attention or channel attention) often rely on
additional convolutional or fully connected layers to
generate attention weights, thereby increasing model
parameters and computational overhead, which is not
ideal for real-time segmentation tasks.
To address this issue, the SimAM (Simple Parameter-Free Attention Module) is introduced (its structure is shown in Figure 4). It constructs an energy function to measure the importance of each neuron, enabling direct attention weighting on the original feature maps without introducing additional parameters. The core idea of SimAM is to evaluate the importance of a neuron based on its ability to distinguish itself from other neurons within the same channel. Specifically, if a neuron can be clearly
distinguished from other neurons in the same channel
(i.e., it exhibits strong linear separability), it indicates
that the neuron is more active in expressing key
features and its information is more representative.
Conversely, neurons with lower discriminability are
considered less important. By dynamically assessing
the importance of each neuron, SimAM can perform
fine-grained spatial and channel-wise weighting
adjustments on the original feature maps without
adding extra parameters. This enhances the focus on
key targets while suppressing irrelevant background
information. The specific implementation is as follows:
$$E(t) = \min_{w,b}\ \frac{1}{N}\sum_{i=1}^{N}\big(-1 - (w\,o_i + b)\big)^2 + \big(1 - (w\,t + b)\big)^2 + \lambda w^2 \qquad (6)$$
where $N$ represents the number of neurons, $t$ represents the output of the target neuron, $\{o_i\}_{i=1}^{N}$ represents the outputs of the remaining neurons, $\lambda$ is a small positive constant used to prevent overfitting during the optimization process, $w$ is a weight parameter, and $b$ is a bias term used to adjust the output of the linear combination, helping the model better fit the requirements of the energy function.
Finally, the energy values are normalized using a
Sigmoid function to obtain the final attention weights
as follows:
$$A(t) = \sigma\big(-E(t)\big) \qquad (7)$$
Through the above formula, regions with lower
energy values, which indicate higher discriminability,
are assigned higher weights. This achieves the goal of
emphasizing important pixel regions while
suppressing less important ones.
Figure 4. Parameter-free Attention Mechanism
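Because the energy in Eq. (6) has a closed-form minimiser, SimAM is usually implemented without any iterative optimisation. The sketch below follows the widely used closed-form formulation from the original SimAM work; whether the authors use exactly this variant is an assumption.

```python
import torch
import torch.nn as nn

class SimAM(nn.Module):
    """Parameter-free attention via the closed-form minimiser of the SimAM energy (Eqs. 6-7)."""
    def __init__(self, eps=1e-4):
        super().__init__()
        self.eps = eps  # the small positive constant (lambda) in the energy function

    def forward(self, x):
        b, c, h, w = x.shape
        n = h * w - 1
        # Per-channel deviation and variance terms used by the closed-form energy.
        mu = x.mean(dim=(2, 3), keepdim=True)
        d = (x - mu).pow(2)
        v = d.sum(dim=(2, 3), keepdim=True) / n
        # Inverse energy: low-energy (more distinctive) neurons receive larger values.
        inv_energy = d / (4 * (v + self.eps)) + 0.5
        # Sigmoid-normalised attention weights applied to the original features (Eq. 7).
        return x * torch.sigmoid(inv_energy)
```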
2.3 Improved Loss Function Based on Importance
Weighting
In waterborne navigation scenarios, safety is always
the core concern. Particularly, targets such as ships and
obstacles, due to their significant impact on collision
risks and emergency avoidance, should be given
higher attention during model training. While
navigation aids (e.g., buoys) also play a supporting
role, their associated safety risks are relatively lower.
On the other hand, water surfaces, as the background,
contribute the least to safety warnings. Therefore,
when designing the loss function, it is necessary to
assign different importance weights to samples of
different categories. This ensures that the model can
focus more on the detection and segmentation of
critical targets during training. Figure 5 illustrates the
safety-driven weight-aware ranking, where ships and
obstacles are categorized as Class 1 (highest weight),
navigation aids as Class 2 (medium weight), and water
surfaces as Class 3 (lowest weight). This ranking
intuitively reflects the importance of each category
under safety considerations.
Figure 5. Weight-aware Ranking
To incorporate the aforementioned task
requirements into the loss function, a scaling factor is
introduced for each pixel's category, combined with
the proportion of that category in the overall dataset to
determine its importance weight wi. The enhanced loss
function is defined as:
$$Loss_{improved} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L\big(f(x_i;\theta),\, y_i\big) \qquad (8)$$
where $N$ is the total number of samples, $f(x_i;\theta)$ represents the model's predicted output, $y_i$ is the ground-truth label, and $\theta$ denotes the model parameters. In this study, the scaling factor for ships and obstacles (Class 1) is set to $\alpha_1 = 1.5$, for navigation aids and ports (Class 2) to $\alpha_2 = 1.25$, and for water surfaces (Class 3) to $\alpha_3 = 1$.
The importance weight wi for each category is
calculated as follows:
$$w_i = \frac{\alpha_i}{f_i + \epsilon} \qquad (9)$$

where $\alpha_i$ is the scaling factor of class $i$ and $f_i$ is the proportion of class $i$ in the overall dataset.
Additionally, to enhance the smoothness of the
model's segmentation, a regularization term is
introduced into the loss function to penalize abrupt
changes in the model's output. The final importance-
weighted loss function is defined as follows:
$$Loss_{weighted} = \frac{1}{N}\sum_{i=1}^{N} w_i\, L\big(f(x_i;\theta),\, y_i\big) + \lambda R(f) \qquad (10)$$
where $\epsilon$ is a small positive constant to avoid division by zero, the regularization term $R(f)$ sums $y'_i$, the squared differences between the predicted values of adjacent pixels, and $\lambda$ is the
regularization coefficient. Through this design, the loss
function can better address class imbalance issues,
enhance the model's focus on critical categories, and
improve the robustness and accuracy of segmentation.
This makes it suitable for practical applications in
complex environments.
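A compact PyTorch sketch of the importance-weighted loss of Eqs. (8)-(10) is shown below. The class ordering, the class-frequency values, the regularisation coefficient, and the choice of a total-variation-style penalty on the predicted probabilities as R(f) are all illustrative assumptions consistent with the description above, not the authors' exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImportanceWeightedLoss(nn.Module):
    """Sketch of the importance-weighted loss (Eqs. 8-10)."""
    def __init__(self, class_scale, class_freq, eps=1e-6, reg_lambda=0.1):
        super().__init__()
        # w_i = alpha_i / (f_i + eps): scaling factor over class frequency (Eq. 9).
        weights = torch.tensor(class_scale) / (torch.tensor(class_freq) + eps)
        self.register_buffer("weights", weights)
        self.reg_lambda = reg_lambda  # regularisation coefficient (assumed value)

    def forward(self, logits, target):
        # Per-pixel cross-entropy weighted by class importance (Eq. 8).
        ce = F.cross_entropy(logits, target, weight=self.weights)
        # Smoothness regulariser R(f): penalise abrupt changes between adjacent predictions.
        prob = logits.softmax(dim=1)
        tv = (prob[:, :, 1:, :] - prob[:, :, :-1, :]).pow(2).mean() \
           + (prob[:, :, :, 1:] - prob[:, :, :, :-1]).pow(2).mean()
        return ce + self.reg_lambda * tv  # Eq. 10

# Example: ships/obstacles weighted highest, water surface lowest (scales from Section 2.3).
# Class order and dataset proportions below are assumptions for illustration only:
# [background, water surface, ship, navigation aid, port, obstacle]
criterion = ImportanceWeightedLoss(
    class_scale=[1.0, 1.0, 1.5, 1.25, 1.25, 1.5],
    class_freq=[0.05, 0.55, 0.15, 0.05, 0.1, 0.1],
)
```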
3 EXPERIMENT RESULTS AND COMPARATIVE
ANALYSIS
The experimental environment in this study includes
the CentOS 7 operating system, an Intel(R) Xeon(R)
Platinum 8260C CPU @ 2.30GHz processor, 32GB of
memory, and an NVIDIA Tesla P100 GPU with 16GB
of memory. The experiments were conducted using the
PyTorch deep learning framework, with CUDA
version 11.8.
3.1 Dataset and Hyperparameter Settings
3.1.1 Waterborne Navigation Scene Dataset
The experimental data for the On_Water dataset is
sourced from high-resolution videos of the Yangtze
River's Wuhan section and maritime ship navigation
videos. To enhance the diversity of the dataset, images
meeting the requirements of this study were selected
from the Multi-modal Obstacle Detection Dataset
(MODD) [13] and the Singapore Maritime Dataset
(SMD) [14] as supplementary data. Ultimately, 1200
images with a resolution of 1920×1080 were obtained.
Based on the characteristics of waterborne navigation
scenarios, the objects in the dataset were categorized
into five classes: water surface, ships, navigation aids,
ports, and obstacles. Semantic segmentation labels
were created using the LabelMe annotation tool, and
the dataset was divided into training, validation, and
test sets in a 6:2:2 ratio. To improve the model's
generalization ability, 20% of the images were
randomly rotated by 15°, 20% were horizontally
flipped, and an additional 10% were augmented with
Gaussian noise. Figure 6 shows a partial display of the
labels and original images from the On_Water dataset,
where ships are marked in red, water surfaces in green,
obstacles in yellow, ports in purple, navigation aids in
blue, and the background in black.
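For reference, the augmentation policy described above could be applied per image as in the sketch below; whether the rotations are sampled within ±15° or fixed at 15°, and whether augmentation is performed offline or on the fly, are assumptions not specified in the text.

```python
import random
import torch
import torchvision.transforms.functional as TF

def augment(image, mask):
    """Illustrative augmentation for an image (float tensor, CxHxW) and a label mask (long tensor, HxW)."""
    if random.random() < 0.2:                              # 20%: random rotation within +/-15 degrees
        angle = random.uniform(-15.0, 15.0)
        image = TF.rotate(image, angle)
        # rotate the label map with the default nearest-neighbour interpolation
        mask = TF.rotate(mask.unsqueeze(0).float(), angle).squeeze(0).long()
    if random.random() < 0.2:                              # 20%: horizontal flip of image and labels
        image = TF.hflip(image)
        mask = TF.hflip(mask.unsqueeze(0)).squeeze(0)
    if random.random() < 0.1:                              # 10%: additive Gaussian noise on the image only
        image = (image + 0.05 * torch.randn_like(image)).clamp(0.0, 1.0)
    return image, mask
```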
The Seaships dataset is a publicly available dataset
specifically designed for ship recognition and
classification tasks in waterborne navigation scenarios.
It contains a large number of real-world waterborne
navigation images, covering a variety of aquatic
environments and ship types. To better adapt to the
semantic segmentation task for waterborne navigation
scenarios, this study selected 200 images from the
dataset that include typical waterborne navigation
scenes. The objects in these images were categorized
into four classes: ships, water surfaces, obstacles, and
navigation aids, along with a background class. Table
1 provides the counts of each object category in the
On_Water and SeaShips datasets.
Figure 6. Visualization of On_Water Dataset
Figure 7. Visualization of SeaShips Dataset
3.1.2 Experimental Hyperparameter Settings
The specific hyperparameter settings used during
training are shown in Table 1. For performance
comparison experiments with other algorithms, the
hyperparameters were set to the same values.
Table 1. Hyperparameter settings

Parameter             On_Water        SeaShips
Batch Size            8               8
Input Image Size      720×720         720×720
Iterations            200             200
Learning Rate Decay   warmup Poly     warmup Poly
Optimizer             Adam            Adam
Loss Function         Weighted_loss   Weighted_loss
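The warmup-poly learning rate decay listed in Table 1 can be implemented as a linear warmup followed by polynomial decay. The sketch below is a generic version; the base learning rate, warmup length, and polynomial power are assumed values not reported in the table.

```python
import torch

def warmup_poly_lr(base_lr, cur_iter, max_iter, warmup_iters, power=0.9):
    """Warmup-poly schedule: linear warmup, then polynomial decay of the learning rate."""
    if cur_iter < warmup_iters:
        return base_lr * (cur_iter + 1) / warmup_iters
    frac = (cur_iter - warmup_iters) / max(1, max_iter - warmup_iters)
    return base_lr * (1.0 - frac) ** power

# Example usage with the optimizer from Table 1 (base_lr and warmup length are assumed values).
model = torch.nn.Conv2d(3, 6, 1)                       # stand-in module for MSIA-Net
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for it in range(200):                                  # 200 iterations, as in Table 1
    lr = warmup_poly_lr(1e-3, it, max_iter=200, warmup_iters=10)
    for g in optimizer.param_groups:
        g["lr"] = lr
    # ... forward pass, weighted loss, backward() and optimizer.step() would follow here ...
```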
3.1.3 Model Evaluation Metrics
In this study, Pixel Accuracy (PA), mean Pixel
Accuracy (mPA), and mean Intersection over Union
(mIoU) are used to evaluate the accuracy of the
semantic segmentation model. The specific calculation
formulas are as follows:
$$PA = \frac{\sum_i N_{ii}}{\sum_i \sum_j N_{ij}} \qquad (11)$$
$$mPA = \frac{1}{n}\sum_i \frac{N_{ii}}{\sum_j N_{ij}} \qquad (12)$$
$$mIoU = \frac{1}{n}\sum_i \frac{N_{ii}}{\sum_j N_{ij} + \sum_j N_{ji} - N_{ii}} \qquad (13)$$
where $N_{ii}$ represents the number of pixels correctly classified as category $i$, $N_{ij}$ denotes the number of pixels whose true category is $i$ but which are classified as category $j$, and $n$ is the number of categories. This study employs FPS (Frames Per
Second) to evaluate the real-time performance of the
semantic segmentation model, which indicates the
number of image frames processed per second. The
specific calculation formula is as follows:
$$FPS = \frac{1}{T} \qquad (14)$$
where T represents the time (in seconds) required to
process each frame.
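These metrics can be computed from a single class confusion matrix; the NumPy sketch below is one straightforward way to do so and is not tied to the authors' evaluation code. FPS in Eq. (14) is then simply the reciprocal of the measured average per-frame inference time.

```python
import numpy as np

def segmentation_metrics(pred, label, num_classes):
    """PA, mPA and mIoU from a confusion matrix N[i, j] (true class i predicted as j), Eqs. (11)-(13)."""
    valid = label < num_classes                               # ignore pixels outside the valid classes
    idx = num_classes * label[valid].astype(int) + pred[valid].astype(int)
    conf = np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

    tp = np.diag(conf)                                        # N_ii
    with np.errstate(divide="ignore", invalid="ignore"):
        pa = tp.sum() / conf.sum()                            # pixel accuracy, Eq. (11)
        mpa = np.nanmean(tp / conf.sum(axis=1))               # mean per-class accuracy, Eq. (12)
        miou = np.nanmean(tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp))  # mean IoU, Eq. (13)
    return pa, mpa, miou
```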
3.2 Ablation Study
3.2.1 Ablation Study of the Proposed Modules
To verify the impact of each proposed improvement
on the model's performance, ablation experiments
were conducted on the On_Water test set. The results
of the ablation experiments are shown in Table 2.
Table 2. Ablation study results of different module combinations (Improved_Loss, DFFM+EGM, SimAM)

Group      Parameters (M)   Inference Speed (FPS)   mIoU (%)
Baseline   5.69             76.7                    72.3
1          5.72             74.2                    74.5
2          5.73             72.0                    75.6
3          5.69             72.6                    77.9
4          5.76             71.2                    78.1
5          5.73             71.5                    79.3
6          5.73             70.6                    80.5
Ours       5.76             69.1                    83.1
Analyzing the results in Table 2, it can be observed
that the SimAM module significantly improves mIoU
without increasing the number of parameters, making
it suitable for waterborne navigation applications that
require high precision and have constraints on model
complexity. The DFFM and EGM modules further
enhance the model's multi-scale feature extraction
capability by introducing an edge-guided branch and
dynamically adjusting the weights of feature maps at
different scales, improving the understanding of
complex waterborne environments. Additionally, the
introduction of the Improved_Loss function, which
incorporates importance weighting, enhances the
segmentation stability of the model. Combining the
experimental results from all groups, the proposed
optimal model achieves an mIoU of 83.1% and an
inference speed of 69.1 frames per second (FPS). This
demonstrates that the proposed method maintains a
high inference speed while significantly improving
segmentation accuracy, validating its superiority in
waterborne navigation scenarios.
3.2.2 Ablation Study on Attention Mechanisms
To further validate the performance of the proposed
parameter-free attention mechanism, this study
conducts comparative experiments with other
attention mechanisms and visualizes the feature maps
guided by attention using Grad-CAM. The
performance analysis of different attention
mechanisms is shown in Table 3, where the results are
obtained by keeping other structures consistent and
only replacing the attention mechanism. The
visualization of attention-guided feature maps is
presented in Figure 8.
Table 3. Ablation Study Results of Different Attention Mechanisms

Group             Parameters (M)   Inference Speed (FPS)   mIoU (%)
Baseline          5.69             76.7                    72.3
+CBAM             5.89             55.6                    73.5
+Dual Attention   5.72             33.4                    74.6
+SimAM (Ours)     5.69             72.6                    77.9
Figure 8. Feature Visualization of Different Attention
Mechanisms
The experimental results from Table 3 and Figure 8
demonstrate that the introduction of different attention
mechanisms significantly impacts segmentation
performance in waterborne navigation scenarios. The
incorporation of CBAM (Convolutional Block
Attention Module), which combines channel attention
and spatial attention, effectively focuses on key regions
of ships and exhibits strong background suppression
capabilities. However, it shows slight limitations in
handling edge details and weaker responses to small
target areas. The introduction of Dual Attention,
leveraging its global modeling capability, excels in
capturing the overall contours and edge features of
ships, with heatmaps displaying more uniform
distributions and stronger detail-capturing abilities.
Nevertheless, it performs slightly worse in background
suppression, and its higher computational complexity
limits its applicability in real-world scenarios. In the
proposed model, the introduction of SimAM demonstrates superior performance. Its
simple structure and computational efficiency enable
significant background suppression while maintaining
strong responses to ship regions, making it highly
suitable for scenarios with high real-time
requirements. Comprehensive analysis indicates that
the parameter-free attention mechanism offers a better
balance between real-time performance and
segmentation accuracy, making it more suitable for
waterborne navigation segmentation tasks.
3.2.3 Ablation Study on Loss Functions
To verify the impact of the proposed improved loss
function on the model's performance, this study
conducts a performance analysis on the On_Water test
dataset using CE_Loss, Focal_Loss, Dice_Loss, and
Weighted_Loss. Additionally, the training loss curves
under different loss functions are compared. Table 4
presents the model's performance under different loss
functions.
Table 4. Ablation Study Results of Different Loss Functions

Group                   mPA (%)   mIoU (%)
CE_Loss                 77.2      72.3
+Focal_Loss             78.1      72.9
+Dice_Loss              79.4      73.5
+Improved_Loss (Ours)   80.2      74.5
Analysis of Table 4 reveals that the proposed
Improved_Loss demonstrates significant superiority
compared to various loss functions. Compared to
CE_Loss, this method more effectively addresses class
imbalance issues, thereby improving the model's
segmentation accuracy in complex scenarios. While
both Focal_Loss and the proposed Improved_Loss can
handle imbalanced samples, the latter further
integrates the severity of misclassification across
different categories, achieving overall performance
optimization. This is particularly evident in
waterborne navigation scenarios, where the proposed
loss function exhibits higher stability in detail-rich
environments. Although Dice_Loss itself improves small-target segmentation and accounts for misclassification severity, the proposed Improved_Loss combines the advantages of both, achieving significant gains in segmentation accuracy (mIoU) and pixel accuracy (mPA). This highlights its excellent segmentation performance and robustness in waterborne navigation scenarios. Therefore, the proposed Improved_Loss provides a more effective
solution for handling complex and class-imbalanced
practical applications, demonstrating stronger
adaptability and stability in diverse and imbalanced
environments.
3.3 Model Performance Comparison Experiments
To validate the effectiveness of the proposed method,
comparative experiments were conducted on the
On_Water Dataset and the SeaShips dataset,
comparing the proposed method with baseline models
and several mainstream segmentation methods.
3.3.1 Performance Comparison with Baseline Models on
the On_Water Dataset
To analyze the performance improvement of the
proposed method for different categories in
waterborne navigation scenarios, a segmentation
accuracy comparison experiment was conducted for
each category on the On_Water test dataset.
Additionally, to more intuitively demonstrate the
improvement in recognition performance for various
categories in waterborne navigation scenarios, typical
images containing waterborne navigation scenes were
selected for visualization comparison experiments. The
results are as follows.
Table 5. Segmentation Accuracy Comparison of Different Algorithms for Each Class on the On_Water Test Set

Category        Baseline   Ours
Boat            75.4       85.2
Water Surface   84.6       91.7
Obstacle        70.3       82.4
Buoy            62.1       78.9
Dock            69.1       75.4
mIoU (%)        72.3       83.1
Analysis of Table 5 shows that the proposed
method, through the introduction of an improved
backbone network, a parameter-free attention
mechanism, a multi-scale feature fusion module, and
an importance-weighted loss function, effectively
enhances segmentation accuracy across multiple
categories. Overall, the mean Intersection over Union
(mIoU) increased from 72.3% to 83.1%, a significant
improvement of 10.8 percentage points, demonstrating
a substantial enhancement in the model's semantic
segmentation capability. Specifically, the segmentation
accuracy for boats improved from 75.4% to 85.2%,
water surfaces from 84.6% to 91.7%, obstacles from
70.3% to 82.4%, buoys from 62.1% to 78.9%, and docks
from 69.1% to 75.4%. Notably, the improvement is substantial for large objects such as boats and water surfaces, indicating the effectiveness of the improved backbone network and multi-scale feature fusion module in handling large-scale targets. For complex small objects such as buoys, the segmentation accuracy improved the most, by 16.8 percentage points, proving the significant advantage of the multi-scale feature fusion module and the importance-weighted loss function in handling fine structures. The segmentation accuracy for docks improved by 6.3 percentage points, demonstrating the superior performance of the proposed method in handling relatively complex small
targets. Overall, the proposed method achieved
improvements in segmentation accuracy across all
categories in waterborne navigation scenarios, with an
average mIoU increase of 10.8 percentage points, fully
validating the effectiveness of the proposed method in
enhancing the overall performance of semantic
segmentation models.
3.3.2 Performance Comparison and Analysis with
Different Models on the On_Water Dataset
As shown in Table 6, the performance of the
proposed model is compared with state-of-the-art real-
time semantic segmentation models on the On_Water
test set (FPS is measured on a 4060 laptop GPU with an input image size of 1920×1080). Figure 9 provides
a visualization of the segmentation results of the
proposed model and other advanced real-time
semantic segmentation models on the On_Water test
set.
Table 6. Performance Comparison of Different Models on the On_Water Dataset

Model            mPA (%)   Parameters (M)   FPS (frame·s⁻¹)   mIoU (%)
ENet             60.5      0.37             21.9              58.3
ContextNet       67.3      0.87             85.8              66.1
ERFNet           72.1      2.06             11.9              69.7
STDC2            81.6      11.45            39.8              74.7
STDC1            78.0      7.43             62.8              73.5
DeepLabv3+       79.1      40.3             7.1               77.6
WaSR             85.0      13.23            8.2               80.1
BiSeNet          70.5      16.26            21.5              66.1
BiSeNetV2        71.4      1.88             32.4              68.4
PIDNet-L         79.6      36.9             14.4              78.2
PIDNet-M         76.8      28.5             19.1              75.6
PIDNet-S         74.4      7.6              53.4              73.5
Fast-SCNN        72.0      1.9              111.5             68.9
DDRNet-23-slim   79.8      5.7              81.4              72.3
Ours             89.1      5.9              69.1              83.1
The experimental results in Table 6 demonstrate
that the proposed method exhibits comprehensive
superiority in real-time semantic segmentation tasks
for waterborne navigation scenarios. In terms of
segmentation accuracy, the proposed model achieves
an mIoU of 83.1%, significantly outperforming
mainstream models such as WaSR (80.1%) and
DeepLabv3+ (77.6%). Notably, the IoU for boundary
segmentation of key targets such as ships and obstacles
improves by over 15%, validating the effectiveness of
the multi-scale feature enhancement and boundary
perception modules. In terms of real-time
performance, the model achieves a high frame rate of
69.1 FPS, far exceeding traditional methods and fully
meeting the real-time perception requirements of high-
speed navigation scenarios. Compared to lightweight
models like DDRNet-23-slim (81.4 FPS), the proposed
method achieves a better balance between accuracy
(mIoU improved by 10.8 percentage points) and speed. In terms of
model complexity, the proposed method has only 5.9M
parameters, significantly fewer than WaSR (13.23M)
and STDC2 (11.45M). By incorporating dynamic
feature fusion and a parameter-free attention
mechanism, the model reduces computational
redundancy while ensuring efficient deployment on
resource-constrained onboard devices. In summary,
the proposed method overcomes the trade-off between
accuracy and speed, offering a high-precision (mPA
89.1%), high-frame-rate (69.1 FPS), and low-parameter
(5.9M) solution for real-time semantic segmentation in
complex waterborne scenarios. This significantly
enhances the environmental perception and obstacle
avoidance capabilities of autonomous navigation
systems.
To further analyze the performance of the proposed
algorithm compared to other algorithms in the
semantic segmentation task for waterborne navigation
scenarios, segmentation visualization experiments
were conducted on the On_Water test dataset, with the
results shown in Figure 9. Analysis of the data reveals
that, as seen in the first row of Figure 9d, the inclusion
of the dynamic feature fusion module enables better
integration of multi-scale information, effectively
distinguishing between ships, water surfaces, and
background elements (e.g., sky, distant mountains) in
the image. Particularly for multi-scale targets such as
shoreline structures, the proposed method achieves
clearer boundary segmentation compared to the
baseline. The dynamic feature fusion module
effectively addresses the scale variation of ships at
different distances, demonstrating superior
performance in segmenting smaller targets. As shown
in the second row of Figure 9d, the proposed method
produces more accurate segmentation of ship contours
with less background noise, a result of the importance-
weighted loss function that enhances the model's focus
on target objects (e.g., ships) while reducing
interference from background information. As seen in
the third row of Figure 9d, the proposed method
achieves the highest segmentation accuracy for objects
in waterborne navigation scenarios, particularly
excelling in distinguishing between foreground objects
(e.g., ships) and background elements (e.g., distant
buildings, shorelines). The experimental results
demonstrate that the proposed model can quickly and
accurately segment ships and other important objects,
producing clearer contours and significantly
improving segmentation performance in complex
waterborne environments.
Figure 9. Visualization of Segmentation Results of Different
Models on the On_Water Dataset
3.3.3 Performance Comparison with Other Advanced
Methods on the SeaShips Dataset
To further validate the effectiveness of the proposed
algorithm for waterborne navigation scenarios, a
performance comparison was conducted on the re-
annotated SeaShips test dataset. Table 7 shows the
performance comparison of different models on the
SeaShips dataset, and Figure 10 presents the
visualization of segmentation results of the proposed
model and existing state-of-the-art real-time semantic
segmentation models on the SeaShips dataset.
Table 7. Performance Comparison of Different Models on the SeaShips Dataset

Model            mPA (%)   Parameters (M)   FPS (frame·s⁻¹)   mIoU (%)
PSPNet           70.5      65               5                 64.5
SegNet           70.3      29               10                64.1
STDCNet          75.3      11.46            48.8              69.1
Fast-SCNN        71.0      1.14             111.5             65.4
DeepLabv3+       72.1      40.35            8.9               66.4
WaSR             77.3      13.23            9.5               70.8
BiSeNet          70.4      16.26            27.7              64.5
BiSeNetV2        71.5      3.63             26.6              65.3
PIDNet-S         76.4      7.62             52.7              70.2
DDRNet-23-slim   73.1      5.69             101.6             68.5
Ours             80.1      5.92             69.1              73.2
Figure 10. Visualization Comparison of Segmentation
Results with Other Real-Time Semantic Segmentation
Algorithms on the SeaShips Dataset
Analysis of Figure 10 and Table 7 demonstrates that the proposed algorithm, MSIA-Net, further validates
the effectiveness of the improvement strategies in
multiple aspects. In the second row of the
visualization, the small ship on the right side of the
image is accurately segmented by MSIA-Net and
PIDNet, while other methods either miss it entirely or
produce blurred and incomplete contours. This proves
that the proposed method significantly enhances the
model's ability to extract features from small target
regions, enabling clearer identification and
segmentation of key objects. In the first and third rows,
MSIA-Net better captures edge details at the junctions
between ships, water surfaces, and land. For example,
in the third row, the stacked cargo (red area) on the
ship is clearly and completely segmented by MSIA-Net,
while other methods produce blurred and irregular
edges. This is attributed to the introduction of the edge-
guided branch, which allows the model to focus more
on edge regions. Additionally, in the second row, the
background mountain (yellow area) and water surface
(green area) are clearly distinguished in MSIA-Net's
segmentation results without significant
misclassification. In contrast, other methods, such as
STDCNet and DDRNet, exhibit discontinuous
boundaries in the mountain area. The dynamic feature
fusion module enables the model to efficiently capture
multi-scale contextual information, maintaining
segmentation accuracy in complex backgrounds. In the
first and third rows, MSIA-Net produces more
uniform segmentation results for multiple categories
(e.g., water surfaces, ships, mountains), without
overemphasizing one category or neglecting small
targets. This validates the effectiveness of the
importance-weighted loss function in handling class
imbalance and segmenting key objects. In summary,
the analysis of these specific details shows that
MSIA-Net significantly outperforms other models in
segmenting critical regions. This further demonstrates
the effectiveness of the proposed method for semantic
segmentation tasks in waterborne navigation
scenarios.
4 CONCLUSION
This paper proposes a real-time semantic segmentation
method for waterborne navigation scenarios, called
MSIA-Net, which enhances multi-scale information
and incorporates importance weighting to improve
segmentation accuracy in complex aquatic
environments while maintaining real-time
performance. By introducing an edge-guided branch
and a Dynamic Feature Fusion Module (DFFM), the
model's ability to perceive multi-scale information is
significantly enhanced. Additionally, an improved
loss function based on importance weighting is
designed to increase the model's focus on critical
objects in waterborne navigation scenarios. A
parameter-free attention mechanism is also integrated
into the decoder to combine regional information from
the encoder and semantic information from the
decoder, restoring spatial details in the image and
guiding the model to focus on more critical objects.
On the On_Water dataset and the SeaShips dataset
constructed in this study, the proposed method
achieves mIoU scores of 83.1% and 73.2%,
respectively, while achieving an inference speed of 69.1
frames per second with only 5.76M parameters.
Considering the balance between parameters,
accuracy, and speed, the proposed algorithm
outperforms other lightweight networks in identifying
targets such as ships and small-sized obstacles, making
it more suitable for waterborne navigation scenarios.
LIMITATIONS AND FUTURE DIRECTIONS
This paper has demonstrated advancements in real-
time semantic segmentation for water navigation
scenes. However, the rapid evolution of intelligent
shipping and unmanned vessel technology
necessitates enhanced perception capabilities for
increasingly complex environments and demanding
applications. Consequently, further optimization of
current methodologies remains crucial. Future
research should focus on the following key areas:
1. Enhancing Model Robustness and Generalization
through Multi-Modal Integration: Future work
should focus on integrating multi-modal perception
methods to improve the model's robustness and
adaptability across diverse water environments and
when encountering dynamic targets.
2. Optimizing Perception of Dynamic Scenes and
Small Objects: Further research is needed to
enhance the model's ability to accurately segment
dynamic objects and small-scale targets on the
water surface, potentially through temporal
information integration and improved feature
representation.
REFERENCES
[1] Praczyk, T. Artificial neural networks application in maritime, coastal, spare positioning system. Theor. Appl. Inf. 2006, 18, 1175–1189.
[2] Praczyk, T. Neural anti-collision system for autonomous surface vehicle. Neurocomputing 2015, 149, 559–572.
[3] P. Santana, R. Mendica, and J. Barata, "Water detection with segmentation guided dynamic texture recognition," in Proc. IEEE Int. Conf. Robot. Biomimet. (ROBIO), Guangzhou, China, 2012, pp. 1836–1841.
[4] S. Fefilatyev and D. Goldgof, "Detection and tracking of marine vehicles in video," in Proc. Int. Conf. Pattern Recognit., Tampa, FL, USA, 2008, pp. 1–4.
[5] Cheng, D.-C.; Meng, G.-F.; Cheng, G.-L.; Pan, C.-H. SeNet: Structured edge network for sea–land segmentation. IEEE Geosci. Remote Sens. Lett. 2017, 14, 247–251.
[6] C.Y. Jeong, H.S. Yang, K.D. Moon. Horizon detection in maritime images using scene parsing network[J]. Image and Vision Processing and Display Technology, 2018, 54(12): 760-762.
[7] M. Kristan, V. S. Kenk, S. Kovačič, and J. Perš, "Fast image-based obstacle detection from unmanned surface vehicles," IEEE Transactions on Cybernetics, vol. 46, no. 3, pp. 641–654, 2016.
[8] S. Scherer et al., "River mapping from a flying robot: State estimation, river detection, and obstacle mapping," Auton. Robots, vol. 33, nos. 1–2, pp. 189–214, 2012.
[9] Bovcon, B., & Kristan, M. (2019). Benchmarking Semantic Segmentation Methods for Obstacle Detection on a Marine Environment.
[10] Qiao Y L, Zhao X C. Obstacle detection method based on improved semantic segmentation model[J]. Journal of Naval University of Engineering, 2023, 35(01): 18-24.
[11] Bao X C, Liu F Y, Nie J G, et al. Research on Multi-type Floating Object Segmentation Method on Water Surface Based on Improved Deeplabv3+[J/OL]. Water Resources and Hydropower Engineering: 1-16.
[12] Xiong R, Cheng L, Hu T, et al. Research on Fast Segmentation Algorithm for Feasible Region and Obstacles of Unmanned Surface Vehicle[J]. Journal of Electronic Measurement and Instrumentation, 2023, 37(02): 11-20.
[13] Kristan M, Sulic V, Kovacic S. Fast image-based obstacle detection from unmanned surface vehicles[J]. IEEE Transactions on Cybernetics, 2015, 46(12): 2809-2821.
[14] Prasad D K, Rajan D, Rachmawati L, et al. Video Processing from Electro-optical Sensors for Object Detection and Tracking in Maritime Environment: A Survey[J]. IEEE Transactions on Intelligent Transportation Systems, 2017, 18(08): 1993-2016.