In addition to these key components, the expert
assessors were provided with clear guidelines
outlining the importance of each criterion. For instance,
the 'Aim and objective' component was assessed based
on how well the scenario’s learning goals aligned with
the training needs of maritime professionals. The
'Scenario details' were evaluated for clarity, realism,
and relevance to practical training, including whether
the decision-making points were appropriately
challenging and applicable to real-world situations. The
'Assessment guide' was analyzed to determine how
well the scenario incorporated methods for evaluating
participant performance, ensuring the scenario was not
only instructional but also assessable in terms of
learning outcomes.
The expert evaluators were also given a scoring
methodology to assign numerical values to each
component. The scoring system ranged from 1 to 10, where 1 represented a poor scenario and 10 an excellent one. Each criterion was
rated individually, and the total score for each scenario
was the sum of the scores across all components.
Experts were instructed to provide narrative feedback
to support their ratings, offering insight into why a
particular scenario scored highly or poorly. The final score
for each scenario was calculated by averaging the
ratings of all five experts.
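As a concrete illustration of this aggregation, the sketch below sums each expert's component ratings into a scenario total and then averages the totals across the five experts. The component labels and rating values are invented placeholders, not data from the study.

```python
# Illustrative aggregation of expert ratings (placeholder values only).
# Each expert rates three components on a 1-10 scale; a scenario's total
# per expert is the sum of its component scores, and the final score is
# the mean of those totals across the five experts.

COMPONENTS = ["aim_and_objective", "scenario_details", "assessment_guide"]

# One dict of component scores per expert (five experts).
expert_ratings = [
    {"aim_and_objective": 8, "scenario_details": 7, "assessment_guide": 9},
    {"aim_and_objective": 7, "scenario_details": 8, "assessment_guide": 8},
    {"aim_and_objective": 9, "scenario_details": 7, "assessment_guide": 8},
    {"aim_and_objective": 8, "scenario_details": 6, "assessment_guide": 7},
    {"aim_and_objective": 7, "scenario_details": 8, "assessment_guide": 9},
]

def final_score(ratings):
    """Sum each expert's component scores, then average across experts."""
    totals = [sum(r[c] for c in COMPONENTS) for r in ratings]
    return sum(totals) / len(totals)

print(f"Final score: {final_score(expert_ratings):.2f}")  # 23.20 here
```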
To ensure a comprehensive evaluation of scenario
design, a Multi-Criteria Analysis (MCA) was
employed to assign weighted scores based on
predefined criteria, including scenario complexity,
assessment methodology, and adherence to maritime
training standards. The Intraclass Correlation Coefficient (ICC) was calculated to assess the reliability and consistency of the experts' assessments. ICC values range from 0 to 1: a value close to 1 indicates strong agreement among experts, while a value close to 0 suggests significant variability in their ratings. This structured
approach, combining quantitative scoring with
narrative analysis, provided both a numerical ranking
and qualitative insights into critical scenario elements.
The integration of ICC results strengthened the robustness of the evaluation: the quantitative findings complemented and validated the qualitative observations, offering a nuanced understanding of variations in scenario design.
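To make these two computations concrete, the sketch below combines a weighted MCA score with a two-way random-effects ICC computed from a scenarios-by-experts rating matrix. The criterion weights and rating values are invented for illustration, and the choice of ICC form, ICC(2,1) (two-way random effects, absolute agreement, single rater), is an assumption, since the text does not specify which variant was used.

```python
import numpy as np

# Hypothetical MCA weights for the stated criteria (placeholders that
# sum to 1; the study's actual weights are not reproduced here).
WEIGHTS = {"complexity": 0.4, "assessment_methodology": 0.3, "standards": 0.3}

def mca_score(scores: dict) -> float:
    """Weighted MCA score: sum of weight * criterion score."""
    return sum(WEIGHTS[c] * s for c, s in scores.items())

def icc_2_1(x: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `x` is an (n_scenarios, k_experts) matrix of ratings.
    """
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)  # between scenarios
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)  # between experts
    ss_error = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Invented ratings: four scenarios (rows) rated by five experts (columns).
ratings = np.array([
    [8, 7, 9, 8, 7],
    [6, 6, 7, 5, 6],
    [9, 8, 9, 9, 8],
    [5, 6, 5, 6, 5],
], dtype=float)

example = {"complexity": 8, "assessment_methodology": 7, "standards": 9}
print(f"MCA score: {mca_score(example):.2f}")
print(f"ICC(2,1):  {icc_2_1(ratings):.3f}")
```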
2.2 Limitations of the Study and Evaluation Process
While this study provides valuable insights into the
structure, content, and pedagogical relevance of AI-
generated training scenarios, certain limitations should
be acknowledged. A key limitation is the variability of
AI-generated content, as identical prompts do not
always produce consistent results. Additionally, the
absence of real-world validation means that the
practical applicability of these scenarios in actual
training environments remains uncertain. Since the
study was conducted without testing the scenarios in a
nautical simulator, their effectiveness in real-life
instructional settings could not be fully assessed.
The evaluation process itself also presents certain
challenges. Although a structured Multi-Criteria
Analysis (MCA) was employed to introduce a
quantitative dimension to the assessment, expert
judgment remained integral to the evaluation. While
the blind review approach minimized bias, expert
subjectivity cannot be entirely eliminated, as
assessments are influenced by individual expertise and
experience. Despite these constraints, the methodological framework, which combines quantitative scoring with qualitative analysis, provides a structured and transparent approach to scenario
evaluation. This contributes to a deeper understanding
of how AI can support the development of maritime
training exercises.
3 RESULTS
While the AI-generated scenarios were created using
ChatGPT Plus, the level of human involvement varied
considerably based on the creator's expertise. Scenario
creator III, with minimal prior experience in maritime
training, was generally satisfied with the outputs
generated by ChatGPT and made few modifications. In
contrast, scenario creator I, with relevant knowledge
but limited practical training experience, used the AI-
generated content as a basis to explore relevant
maritime applications, further refining the task
structure. Scenario creator II, an expert in maritime
education, provided more targeted feedback to
ChatGPT, specifying exact details regarding traffic
conditions, vessel types, and specific evaluation
criteria for the exercise, ensuring that the scenario
aligned with practical training needs.
The expert evaluations revealed notable differences in the quality of the generated scenarios. Some of these variations can be attributed to the individual expertise of the creators, while others reflect how the AI model was used during the creation process: as described above, the least experienced creator largely accepted ChatGPT's outputs as delivered, whereas the more experienced creators interacted more actively with the model, refining the scenarios to better fit practical training requirements. This pattern suggests that human expertise played a key role in shaping the final scenario outputs, with more experienced creators leveraging their knowledge to guide the AI model more effectively. The specific contributions of human expertise and AI model outputs to the scenario development process will be a topic for future research.
To systematically compare the evaluated scenarios, the
following tables present the expert ratings and their
descriptive statistics. Table 3 presents the individual scores assigned by the five experts for each scenario across three key criteria, Aim & Scope (A&S), Scenario Details (SD), and Evaluation Part (EP), together with the Average Score (AS).