4 RESULTS AND DISCUSSION
The applied optimization methods show consistent
and significant improvements across both the real
HELCOM AIS dataset and the synthetically generated
data. As presented in Table 7, each successive method
improved the data size reduction, compression ratio,
and storage efficiency.
For the generated 70M AIS dataset, the original raw
size was 10.69 GB, compressing to 1.55 GB using Zstd.
After applying all optimization steps (reordering,
column reduction, and normalization), the final
compressed size was reduced to 1.02 GB, giving a
relative compression ratio of 10.46 and a total
space saving of over 90%. The optimizations also
reduced the uncompressed dataset size by 38% relative
to the original.
In the case of the real HELCOM AIS dataset (70M
rows), the unoptimized data compressed to 2.8 GB. The
most effective method—normalization combined with
column reduction—reduced the raw dataset from 10.2
GB to 4.2 GB, and the compressed size to 838 MB,
achieving a relative compression ratio of 12.20. The
compressed output is thus 8.2% of the original
dataset size, a space saving of over 91%.
Further, in an ultra-compression mode, where
maximum compression settings were used, the
compressed size dropped even further to 560 MB,
achieving a compression ratio of 18.25x, albeit with
significantly longer compression times (CPU time over
3 hours).
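The ratios and savings quoted above can be reproduced from the reported sizes. The following is a minimal sketch, assuming decimal units (1 GB = 1000 MB) and the usual definitions of compression ratio (raw size divided by compressed size) and space saving (one minus the inverse ratio); the function names are illustrative and are not taken from the study. Because the reported sizes are rounded, the computed values agree with the published ones only to within rounding.

```python
# Illustrative helpers (names are assumptions, not from the study).
GB = 1000 ** 3  # assumption: decimal gigabytes
MB = 1000 ** 2  # assumption: decimal megabytes


def compression_ratio(raw_bytes: float, compressed_bytes: float) -> float:
    """Raw size divided by compressed size."""
    return raw_bytes / compressed_bytes


def space_saving(raw_bytes: float, compressed_bytes: float) -> float:
    """Fraction of the raw size that no longer has to be stored."""
    return 1.0 - compressed_bytes / raw_bytes


# Sizes as reported in the text for each configuration.
cases = {
    "generated, all optimizations": (10.69 * GB, 1.02 * GB),
    "HELCOM, normalization + column reduction": (10.2 * GB, 838 * MB),
    "HELCOM, ultra-compression mode": (10.2 * GB, 560 * MB),
}

for label, (raw, compressed) in cases.items():
    ratio = compression_ratio(raw, compressed)
    saving = space_saving(raw, compressed)
    print(f"{label}: ratio {ratio:.2f}, space saving {saving:.1%}")
```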
These results show that: a) structural data cleanup
and normalization lead to 2–3x better compression
than compression alone; b) reordering and
deduplicating static fields is a high-impact, low-effort
optimization; c) the Zstandard (Zstd) compressor
consistently outperforms other methods in both
compression ratio and efficiency; d) the methodology
is effective across both real and synthetic AIS datasets,
confirming its generalizability.
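The effect of reordering and of separating static fields can be illustrated on a toy dataset. The sketch below is not the pipeline used in the study: it generates hypothetical AIS-like rows (the field layout, value ranges, and vessel counts are assumptions), and it uses zlib from the Python standard library as a stand-in for Zstd so that it runs without extra dependencies. It compares the compressed size of shuffled versus vessel-sorted rows, and the uncompressed size before and after moving static vessel fields into their own table.

```python
import random
import string
import zlib

rng = random.Random(0)  # fixed seed so the toy data is reproducible


def make_rows(n_vessels=2000, reports_per_vessel=20):
    """Hypothetical AIS-like rows: (mmsi, name, ship_type, lat, lon)."""
    rows = []
    for mmsi in rng.sample(range(200_000_000, 800_000_000), n_vessels):
        name = "".join(rng.choices(string.ascii_uppercase, k=12))
        ship_type = rng.randrange(20, 100)
        for _ in range(reports_per_vessel):
            lat = f"{rng.uniform(54.0, 66.0):.4f}"
            lon = f"{rng.uniform(10.0, 30.0):.4f}"
            rows.append((mmsi, name, ship_type, lat, lon))
    rng.shuffle(rows)  # mimic arrival order, with vessels interleaved
    return rows


def to_csv(rows):
    return "\n".join(",".join(map(str, r)) for r in rows).encode("ascii")


def compressed_size(data: bytes) -> int:
    # zlib stands in for Zstd here; absolute numbers will differ.
    return len(zlib.compress(data, 9))


rows = make_rows()
sorted_rows = sorted(rows, key=lambda r: r[0])  # group reports by vessel

unsorted_size = compressed_size(to_csv(rows))
sorted_size = compressed_size(to_csv(sorted_rows))

# Normalization: static vessel fields are stored once per vessel,
# and the dynamic table keeps only the key and the position fields.
static_table = sorted({(r[0], r[1], r[2]) for r in rows})
dynamic_table = [(r[0], r[3], r[4]) for r in sorted_rows]

raw_size = len(to_csv(rows))
normalized_raw_size = len(to_csv(static_table)) + len(to_csv(dynamic_table))

print("compressed, shuffled:", unsorted_size)
print("compressed, sorted:  ", sorted_size)
print("raw, single table:   ", raw_size)
print("raw, normalized:     ", normalized_raw_size)
```

Sorting places repeated static values next to each other, where a dictionary-based compressor can match them cheaply. Normalization removes the repetition before compression is applied at all, which is why it also shrinks the uncompressed size.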
In terms of efficiency percentage (calculated as the
compression performance relative to the original
method), the most optimized method achieved over
65% storage efficiency improvement on the generated
dataset and about 30% improvement on the real
dataset.
The study showed that method 2, which applies the
corresponding compression method to sorted datasets
without any schema modification, reduced storage
usage by a factor of 12. For instance, the
historical AIS dataset for the Gulf of Riga decreased
from 323 GB to 26 GB. More advanced techniques,
such as schema modification and normalization,
enabled reductions of up to a factor of 18.
These approaches, however, require additional data
pre- and post-processing steps, as well as modifications
to existing IT solutions.
5 CONCLUSION
Optimizing AIS data formats significantly enhances
both storage efficiency and ecological sustainability. By
applying structural reorganizations—grouping,
column reordering, and deduplication—storage
savings of up to 66% were achieved. This research
demonstrates the potential of data-aware format
engineering over traditional compression methods
alone. The development of a synthetic data generator
further enables reproducibility and broader
experimentation.
Specialized algorithms, like adaptive Douglas-
Peucker variants and top-down kinematic
compression, exploit spatial and kinematic attributes
(position, speed, course) to achieve higher compression
ratios while maintaining critical trajectory features.
Combining these specialized methods with general
archivers as a second compression stage can maximize
storage savings [13].
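For orientation, the classic (non-adaptive) Douglas-Peucker simplification can be sketched as follows. This is the textbook algorithm, not the adaptive or kinematic variants referred to above, and it assumes points already projected to planar coordinates; raw AIS latitude and longitude would need a projection or a geodesic distance first. The function names and the sample track are illustrative.

```python
import math


def perpendicular_distance(p, a, b):
    """Distance from point p to the line through a and b (planar)."""
    (x, y), (x1, y1), (x2, y2) = p, a, b
    dx, dy = x2 - x1, y2 - y1
    if dx == 0 and dy == 0:
        # Degenerate segment: fall back to point-to-point distance.
        return math.hypot(x - x1, y - y1)
    return abs(dy * (x - x1) - dx * (y - y1)) / math.hypot(dx, dy)


def douglas_peucker(points, epsilon):
    """Drop points that lie within epsilon of the simplified track."""
    if len(points) < 3:
        return list(points)
    # Find the interior point farthest from the chord first -> last.
    max_dist, index = 0.0, 0
    for i in range(1, len(points) - 1):
        d = perpendicular_distance(points[i], points[0], points[-1])
        if d > max_dist:
            max_dist, index = d, i
    if max_dist <= epsilon:
        # Every interior point is close enough: keep only the endpoints.
        return [points[0], points[-1]]
    # Otherwise keep the farthest point and simplify both halves.
    left = douglas_peucker(points[: index + 1], epsilon)
    right = douglas_peucker(points[index:], epsilon)
    return left[:-1] + right  # avoid duplicating the split point


# Hypothetical track: a nearly straight leg followed by a turn.
track = [(0, 0), (1, 0.01), (2, -0.01), (3, 0), (4, 2), (5, 4)]
simplified = douglas_peucker(track, epsilon=0.1)
print(simplified)
```

The simplified track keeps the start point, the turn, and the end point, and it could then be passed to a general-purpose compressor as the second stage.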
This composite view highlights that structural
optimization, dedicated trajectory compression, and
general archiving together offer the best approach for
efficient AIS data storage and sustained ecological
benefits.
These results are not just relevant for maritime
analysts, but also for data engineers, archivists, and
policymakers aiming for sustainable data practices in
marine science and beyond.
ACKNOWLEDGMENTS
The authors gratefully acknowledge HELCOM (Helsinki
Commission) for providing access to the AIS datasets used in
this study.
BIBLIOGRAPHY
[1] Clissa, L. (2022). Survey of big data sizes in 2021. arXiv.
https://doi.org/10.48550/arXiv.2202.07659
[2] Corvino, M., Daffinà, F., Francalanci, C., Giacomazzi, P.,
Magliani, M., Ravanelli, P., & Stahl, T. (2025). A
Methodology to extract Geo-Referenced Standard Routes
from AIS Data [Preprint]. arXiv.
https://doi.org/10.48550/arXiv.2503.22734
[3] Monserrate, S. G. (2022). The Cloud Is Material: On the
Environmental Impacts of Computation and Data
Storage. MIT Case Studies in Social and Ethical
Responsibilities of Computing, Winter 2022.
https://doi.org/10.21428/2c646de5.031d4553
[4] Safdie, S. (2024). What is the carbon footprint of data
storage? Greenly.
https://greenly.earth/en-gb/blog/industries/what-is-the-carbon-footprint-of-data-storage
[5] Aujoux, C., Kotera, K., & Blanchard, O. (2021). Estimating
the carbon footprint of the GRAND project, a multi-
decade astrophysics experiment. Astroparticle Physics,
131, 102587.
https://doi.org/10.1016/j.astropartphys.2021.102587
[6] Istrate, R., Tulus, V., Grass, R. N., Vanbever, L., Stark, W.
J., & Guillén-Gosálbez, G. (2024). The environmental
sustainability of digital content consumption. Nature
Communications, 15(1), 3981.
https://doi.org/10.1038/s41467-024-47621-w
[7] Stecuła, B., Stecuła, K., & Kapczyński, A. (2022).
Compression of text in selected languages—Efficiency,
volume, and time comparison. Sensors, 22(17), 6393.
https://doi.org/10.3390/s22176393
[8] Sobczyński, S. (2025, May 25). zstd vs zip vs 7-Zip
(LZMA2): .NET compression benchmark. hasto.pl.
https://hasto.pl/compression-benchmark-zip-vs-7-zip-lzma2-vs-zstandard