LLM-based Maritime Training Feedback System: Implementing RAG-Enhanced Assessment Analysis with STCW Compliance

ABSTRACT: This paper presents the implementation and evaluation of a Retrieval-Augmented Generation (RAG) system that automatically generates STCW-compliant feedback on maritime assessment questions. Building on preliminary findings from ongoing research into technological proficiency [1] (β=0.457) and institutional readiness [2] (β=0.341), the implementation addresses a critical gap: the need for automated feedback systems that maintain regulatory alignment while reducing instructor workload. The system uses the Mistral-7B large language model, optimized with QLoRA for efficient local deployment, combined with a RAG architecture to keep feedback contextually relevant. In evaluation, the system generated accurate feedback with response times under 15 seconds and STCW concept coverage of 85%, addressing key implementation barriers identified in our previous studies. The paper discusses how the system lowers technological proficiency barriers (β=0.457, p<0.001) and enhances perceived usefulness through automated, standards-compliant feedback that supports both individual competency development and institutional readiness.
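To make the reported "STCW concept coverage" figure concrete, the following is a minimal sketch of one way such a metric could be computed. It assumes coverage means the fraction of expected concepts that appear in a piece of generated feedback; the abstract does not define the metric, so the function name, the matching rule (case-insensitive substring match), and the example data are all illustrative assumptions rather than the paper's method.

```python
def concept_coverage(feedback: str, expected_concepts: list[str]) -> float:
    """Return the fraction of expected concepts mentioned in the feedback.

    Assumption: a concept counts as covered if it occurs as a
    case-insensitive substring of the feedback text.
    """
    if not expected_concepts:
        return 0.0
    text = feedback.lower()
    hits = sum(1 for concept in expected_concepts if concept.lower() in text)
    return hits / len(expected_concepts)


# Hypothetical example data, not taken from the paper.
feedback = (
    "Your answer covers lookout duties and the use of radar, "
    "but omits the requirement to determine the risk of collision."
)
concepts = ["lookout", "radar", "risk of collision", "safe speed"]
coverage = concept_coverage(feedback, concepts)
print(f"Concept coverage: {coverage:.0%}")
```

A substring match is the simplest possible rule; a production system would more plausibly use embedding similarity or expert annotation, which this sketch does not attempt.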

KEYWORDS: Large Language Models (LLMs), Retrieval-Augmented Generation (RAG), STCW-Compliant Feedback, Automated Assessment in MET, Model Optimization, Regulatory Alignment, Feedback Generation Performance, Maritime Education Technology Adoption

REFERENCES

S. Baradziej, T. E. Kim, and L. I. Magnussen. “Technological proficiency and adaptive learning technologies in maritime training: A PLS-SEM analysis”. In: Maritime Policy & Management (2025). Under review.

S. Baradziej, T. E. Kim, and L. I. Magnussen. “Institutional Readiness for Adaptive Learning Technologies in Maritime Education”. In: WMU Journal of Maritime Affairs (2025). Under review.

International Maritime Organization. International Convention on Standards of Training, Certification and Watchkeeping for Seafarers (STCW). International Maritime Organization, 2011.

Z. Qi, R. Xu, Z. Guo, C. Wang, H. Zhang, and W. Xu. “Long2RAG: Evaluating Long-Context Long-Form Retrieval-Augmented Generation with Key Point Recall”. In: Findings of the Association for Computational Linguistics: EMNLP 2024 (2024), pp. 4852–4872. - doi:10.18653/v1/2024.findings-emnlp.279

A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. Le Scao, T. Lavril, T. Wang, T. Lacroix, and W. El Sayed. “Mistral 7B”. In: arXiv preprint arXiv:2310.06825 (2023).

T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer. “QLoRA: Efficient finetuning of quantized LLMs”. In: Advances in Neural Information Processing Systems. Vol. 36. 2023.

E. Kasneci, K. Sessler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier, et al. “ChatGPT for good? On opportunities and challenges of large language models for education”. In: Learning and Individual Differences 103 (2023), p. 102274. - doi:10.1016/j.lindif.2023.102274

T. H. Kung, M. Cheatham, A. Medenilla, C. Sillos, L. De Leon, C. Elepaño, et al. “Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models”. In: PLOS Digital Health 2.2 (2023), e0000198. - doi:10.1371/journal.pdig.0000198

C.-H. Chiang and H.-y. Lee. “Can Large Language Models Be an Alternative to Human Evaluation?” In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics. 2023, pp. 15607–15631. - doi:10.18653/v1/2023.acl-long.870

Q. Collaborative. “A framework for human evaluation of large language models in healthcare derived from literature review”. In: NPJ Digital Medicine 7.1 (2024), pp. 1–12. - doi:10.1038/s41746-024-01086-9

P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, S. Riedel, and D. Kiela. “Retrieval-augmented generation for knowledge-intensive NLP tasks”. In: Advances in Neural Information Processing Systems. Vol. 33. 2020, pp. 9459–9474.

T. Zhang, S. G. Patil, et al. “RAFT: Adapting Language Model to Domain Specific RAG”. In: (2024).

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. “LoRA: Low-rank adaptation of large language models”. In: International Conference on Learning Representations. 2022.

G. Emad and W. M. Roth. “Contradictions in the practices of training for and assessment of competency: A case study from the maritime domain”. In: Education + Training 50.3 (2008), pp. 260–272. - doi:10.1108/00400910810874026

C. Sellberg. “Simulators in bridge operations training and assessment: a systematic review and qualitative synthesis”. In: WMU Journal of Maritime Affairs 16.2 (2017), pp. 247–263. - doi:10.1007/s13437-016-0114-8

F. Davis. “Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology”. In: MIS Quarterly (1989), pp. 319–340. - doi:10.2307/249008

A. R. Hevner, S. T. March, J. Park, and S. Ram. “Design Science in Information Systems Research”. In: MIS Quarterly 28 (Mar. 2004), pp. 75–. - doi:10.2307/25148625

J. Johnson, M. Douze, and H. Jégou. “Billion-scale similarity search with GPUs”. In: IEEE Transactions on Big Data 7.3 (2021), pp. 535–547. - doi:10.1109/TBDATA.2019.2921572

Citation note:

Baradziej S.: LLM-based Maritime Training Feedback System: Implementing RAG-Enhanced Assessment Analysis with STCW Compliance. TransNav, the International Journal on Marine Navigation and Safety of Sea Transportation, Vol. 19, No. 3, doi:10.12716/1001.19.03.16, pp. 831-838, 2025

Authors in other databases:

Simon Baradziej: orcid.org/0009-0000-6561-6904 (ORCID iD)