1 INTRODUCTION
Maritime surveillance is an essential priority for global security, in both
the civilian and the military sectors. In this respect, Maritime Domain
Awareness (MDA) plays a critical role. MDA is “the effective understanding of
any activity associated with the maritime environment that could impact upon
the security, safety, economy or environment” [1]. Thus, MDA implies the
collection, fusion and dissemination of enormous quantities of data in order
to build intelligence and create a comprehensive Common Operating Picture
(COP). According to [2], MDA is the sine qua non of maritime security and
depends on surveillance and the exchange of information between international
communities. However, the capabilities needed to achieve that awareness are
still under development, especially with regard to integrating data from
different sources and improving the quality of maritime-related data.
Therefore, the potential of this data is not yet fully exploited,
particularly in terms of data fusion and the use of intelligent data analysis
tools.
To fulfil this potential, methods and systems for creating a complete
maritime situation picture are required. These include, for example, systems
that integrate static and dynamic data about vessels from AIS with
information from external sources (hereafter called ancillary information).
Such systems would support operators in monitoring and controlling maritime
traffic, as well as in the OODA loop [3]:
- Observe: to know what is going on,
- Orient: to understand what is going on,
Named Entity Disambiguation for Maritime-related
Data Retrieved from Heterogenous Sources
J. Małyszko, W. Abramowicz & M. Stróżyna
Poznań University of Economics and Business, Poznań, Poland
ABSTRACT: The article concerns the integration and disambiguation of data
related to the maritime domain. It describes a system that collects and
merges data about several maritime-related entities (vessels, vessel types,
ports, companies etc.) retrieved from different internet sources and feeds
the data into a single database. This process is, however, not trivial:
several challenges need to be faced to conduct it successfully. Firstly, in
different sources, entities may be referenced in different ways, for example
by using different text strings. Additionally, some of these references may
be ambiguous, i.e. a reference may potentially point to more than one entity.
To enable efficient analysis of data coming from different sources, such
ambiguities must be resolved automatically as a preprocessing step, before
the data is uploaded to the database and used in further computations. The
aim of the disambiguation process is to assign an artificial, unique
identifier to each entity and then, if possible, to automatically assign
these identifiers to each data item related to a given entity. The article
discusses the methods developed for resolving such ambiguities and presents
their evaluation.
http://www.transnav.eu
the International Journal
on Marine Navigation
and Safety of Sea Transportation
Volume 10
Number 3
September 2016
DOI:10.12716/1001.10.03.12
- Decide: to weigh the options and their impact,
- Act: to carry out the decision.
In this article, we describe part of the work conducted during the SIMMO
(System for Intelligent Maritime MOnitoring) project, which aimed at
integrating data from multiple sources to enhance MDA. The main goal of
SIMMO was to develop a prototype of a system, based on state-of-the-art
information fusion and intelligence analysis techniques, which generates an
enhanced Recognised Maritime Picture (RMP) and thus supports a user in
situation analysis and decision-making. This aim was addressed by providing
higher-quality information about vessels and by automatically detecting
potential threats (suspicious vessels) with regard to defined criteria. The
system is addressed to different stakeholders and entities from the maritime
domain.
As mentioned above, the SIMMO system collects and fuses data from two types
of data sources: AIS and selected internet data sources. In this article we
focus only on the latter. The data is retrieved from the selected internet
sources by web scrapers, and the system then merges and integrates it into a
consistent dataset. The data concerns different maritime-related entities,
such as vessels, flags, ports, vessel types, classification societies and
companies. Each of these entities is described in more detail in section 3.
Data integration is a complex process and several challenges need to be
faced to conduct it successfully. Firstly, in each data source the same
entity may be referenced in different ways and different categories may be
used to describe the same issues. For example, different words (names) may
be used for the same entity (e.g. a port or a ship), or the categories used
in two sources may be defined at different levels of granularity. Therefore,
before the data is added to the database, such differences must be
recognized and the data needs to be aligned. For example, the system should
recognize which entities are referenced in the data and, based on that,
assign to this data a unique identifier, which can easily be used in
subsequent analysis. This process is called disambiguation. In this article
we present a set of methods, designed and implemented within the SIMMO
project, which aim at solving such issues.
The article is organized as follows. First, related work is briefly analysed
to give the reader the context of the task that had to be performed to reach
the objectives of the research. Section 3 constitutes the main part of the
article and describes the approaches developed for all analyzed entities:
vessels (subsection 3.1), flags (3.2), ports (3.3), vessel types (3.4),
classification societies (3.5) and companies (3.6). The article concludes
with a short summary and an outlook on possible future research directions.
2 RELATED WORK
2.1 Disambiguation process
The research is related to the ETL (Extract, Transform, Load) task. ETL
refers to a process in database usage, especially in data warehousing, that:
- extracts data from homogeneous or heterogeneous data sources. In ETL,
  these are usually databases, which may be accessed directly or through a
  dedicated API. In traditional ETL research, an important issue is reducing
  the load that extraction places on the data source (to ensure that the
  performance of the original data source does not suffer) while keeping the
  data as up-to-date as possible [4]. When the sources are web pages (as is
  the case in SIMMO), this issue no longer applies, as the rate at which
  queries may be sent to a web page is strictly defined.
- transforms the data in order to store it in the proper format or
  structure for the purposes of querying and analysis. The transformation
  steps used here are often ad hoc, developed to fit a given situation, and
  straightforward when studied individually. Still, as the number of such
  transformation steps may grow, a proper approach should be used to ensure
  efficiency and semantic elegance [4].
- loads the data into the final target (a database or data warehouse) for
  possible exploitation.
The area of interest of this article differs from traditional ETL, as it
focuses on merging data from different sources rather than on the whole
process. One of the most important tasks in the area of data integration,
which was conducted during our work, is entity disambiguation, which in the
literature is also referred to as duplicate detection, record linkage,
reference matching or the entity-name clustering and matching problem [6].
It is a well-known problem in the area of data integration. It results from
the fact that references to a single entity may differ for various reasons,
such as typographical errors, abbreviations etc. [6]. The problem is
especially important when data from many different sources is to be
integrated. As different systems are developed and maintained by different
parties, often to serve specific needs, the same entities may be referenced
in these systems in completely different ways [7].
According to [7], the following steps should be followed to perform the
discussed task:
1 data analysis, whose goal is to identify errors and inconsistencies that
  need to be removed,
2 definition of the transformation workflow and mapping rules, which
  results in methods and their implementations for data disambiguation,
3 verification, whose goal is to evaluate to what extent the methods
  developed in the previous step give the expected results; this step,
  together with the previous one, may be performed iteratively multiple
  times,
4 transformation, which processes the available data using the methods
  selected during the previous steps and updates the available database with
  the final values,
5 backflow of cleaned data, which updates the source database with the new,
  cleaned data (if possible).
The basic tools used for the entity disambiguation problem are string
similarity measures. Given two strings as input, these measures return a
numerical value representing the distance (or similarity) between them.
Based on such measures, for example, two strings which were found to be very
similar to each other may be recognized as referring to the same entity (the
difference between them may result, for example, from misspelling [8]).
Simple, well-known string similarity measures are the Levenshtein distance
[6] and the Jaro distance [9].
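As an illustration of such a measure, the Levenshtein distance can be
computed with a short dynamic-programming routine. The sketch below is a
generic textbook formulation, not the SIMMO implementation; the `similarity`
helper and its normalisation by the longer string are assumptions added for
illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Hypothetical helper: map the distance to a score in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Two port names differing by a single misspelt character then obtain a score
close to 1, and a threshold on that score decides whether they are treated
as the same entity.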
By applying string similarity measures to the attributes used to identify
entities, it is possible to match records based on similarities between the
values of these fields. An even greater challenge must be faced when there
is no single uniquely identifying field for a certain entity. In such
situations, multiple fields must be compared to establish a similarity
measure between the two records [10].
To successfully perform entity disambiguation, lexical resources may also be
needed to identify the different ways in which a certain entity may be
referenced. For example, in [11] one of the resources used for entity
disambiguation was a Disambiguation Dictionary that maps all ambiguous
proper names to the set of unique entities they refer to. An example given
in that article is similar to many cases that were also encountered in
SIMMO. Let us assume that the abbreviation ACC is used to refer to an entity
which is known to the system under a main name, e.g. American College of
Cardiology. Such a mapping cannot be easily identified using, for example,
the Jaro measure. This problem may be solved if there is a proper dictionary
in which alternative names for known entities are defined. Additionally, in
many situations such mappings may be ambiguous: e.g. ACC may also be the
Asian Cricket Council. To resolve such difficulties, additional data must
usually be taken into account, e.g. the context in which a given word
appears.
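The ACC example can be sketched as a dictionary lookup followed by a simple
context check. The dictionary contents, the context keywords and the scoring
rule below are illustrative assumptions, not the resources used in [11] or
in SIMMO.

```python
# Hypothetical disambiguation dictionary: surface string -> candidate entities.
DISAMBIGUATION_DICTIONARY = {
    "ACC": {"American College of Cardiology", "Asian Cricket Council"},
    "American College of Cardiology": {"American College of Cardiology"},
}

# Hypothetical context keywords associated with each entity.
CONTEXT_KEYWORDS = {
    "American College of Cardiology": {"cardiology", "heart", "medical"},
    "Asian Cricket Council": {"cricket", "match", "tournament"},
}


def resolve(name, context_words):
    """Return the entity for name, or None when it stays unknown or ambiguous."""
    candidates = DISAMBIGUATION_DICTIONARY.get(name, set())
    if len(candidates) <= 1:
        return next(iter(candidates), None)
    # Score each candidate by the number of context words it shares.
    scores = {c: len(CONTEXT_KEYWORDS.get(c, set()) & set(context_words))
              for c in candidates}
    best = max(scores.values())
    winners = [c for c, s in scores.items() if s == best]
    return winners[0] if best > 0 and len(winners) == 1 else None
```

Returning None for unresolved cases leaves the decision to a later step (or
a human expert) instead of guessing.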
2.2 Maritime-related internet data sources
As indicated in the Introduction, creating the enhanced Maritime Picture
requires the use of different data sources. The data sources applicable in
the maritime surveillance domain can be divided into three categories. The
first and most widely used are sensors, which provide kinematic data for the
observed objects in their coverage area and can be further divided into
active (e.g. radar, sonar) and passive (which rely on data broadcast
intentionally by objects, e.g. AIS, LRIT). A survey of sensors used in
maritime surveillance can be found in [12].
The first and the second category of sources are basically accessible only
to the maritime authorities. Therefore they can be referred to as closed
data sources. Moreover, most of them do not publish data on the Internet in
any way.
The third category consists of data sources which are publicly available via
the Internet (hereinafter referred to as internet data sources). This data
includes, inter alia, vessel traffic data, reports and news. There are
organizations and communities that provide maritime-related data online and
make it accessible to the public. For example, different organizations, such
as ports, publish their vessel traffic data or information about their
facilities online. In addition, there are various online communities, such
as blogs, forums and social networks, which make it possible to share
information about maritime events [13]. The main advantages of using such
internet data sources are: the possibility of revealing facts which are not
reported to the maritime authorities or available in their databases, the
global context of the data, and the lack of legal limitations on exchanging
data between different countries.
Maritime-related internet sources can also be divided into shallow and deep
sources. The former are sources which are indexable by conventional search
engines, like Google or Yahoo. The deep sources consist of online databases
that are accessible via a Web interface but poorly indexed by regular search
engines and, in consequence, not available through regular Web search [14].
Such web pages are not directly accessible through static URL links, but are
instead dynamically generated in response to queries submitted through the
query interface of an underlying database [15].
The deep web is an important source of information in the maritime domain.
The analysis conducted within the SIMMO project revealed that there are a
number of online databases containing valuable information on various
maritime entities, such as vessels, ports, ship owners etc.
As a result, there are different kinds of data sources in the maritime
domain that provide heterogeneous data regarding maritime entities. However,
in existing maritime surveillance systems, usually only the data received
from sensors is used [16, 12]. Non-sensor data includes, for example, expert
knowledge, which is further fused with sensor data [17]. Mano et al. [18]
proposed a system that collects data from radars and databases such as an
environmental database, Lloyd’s Insurance and TF2000 Vessel DB. Ding et al.
[19] in turn proposed an architecture of a centralized integrated maritime
surveillance system for the Canadian coasts, fusing HFSWR, ADS (Automatic
Dependent Surveillance) reports, visual reports, information sources and
radar. The only research which focuses on the use of open data available on
the Internet for the purpose of maritime surveillance is presented in [13].
3 RESULTS
In the SIMMO system, data about different maritime-related entities is
retrieved from several internet sources and then combined into a single data
model. These entities are:
- vessels,
- ports, which may be referenced in many different contexts, for example as
  the current destination of a given vessel, the home port of a vessel, the
  location where a vessel inspection takes place etc.,
- flags, corresponding to the country of registration of a given vessel,
- classification societies, which are organizations providing
  classification and statutory services and assistance to the maritime
  industry, as well as to regulatory bodies, with regard to maritime safety
  and pollution prevention, based on the accumulation of maritime knowledge
  and technology¹,
- companies, which may be in certain relationships with vessels (e.g.
  owners or managers).
It is crucial to ensure that, as a result of data integration, it is
possible to easily identify which entity a particular data item refers to,
regardless of the source from which it was retrieved. To make this possible,
artificial identifiers (IDs) have been introduced in the data model and are
assigned to all entities. Such identifiers have the following
characteristics:
- data items concerning the same entity should have the same ID assigned,
- data items concerning different entities should have different IDs
  assigned.
With such IDs assigned, it is possible to query the database using regular
SQL queries and retrieve the required results, regardless of the fact that
the same entity may be referenced in different ways in different data
sources. Still, the main challenge is how to automatically assign the
identifiers to the entities so that the two characteristics of IDs described
above are satisfied to the greatest possible extent. The process of
assigning IDs to different entities is called entity disambiguation.
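The benefit of such identifiers for querying can be illustrated with a tiny
in-memory database. The table and column names (`port`, `port_reference`,
`port_id`) and all rows below are hypothetical and do not reflect the actual
SIMMO data model.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with hypothetical tables and rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE port (port_id INTEGER PRIMARY KEY, main_name TEXT);
        CREATE TABLE port_reference (
            source TEXT, raw_name TEXT,
            port_id INTEGER REFERENCES port(port_id)
        );
        INSERT INTO port VALUES (1, 'Gdansk'), (2, 'Gdynia');
        INSERT INTO port_reference VALUES
            ('source_a', 'GDANSK', 1),
            ('source_b', 'Port of Gdansk', 1),
            ('source_b', 'Gdynia', 2);
        """
    )
    return conn


def references_for(conn: sqlite3.Connection, port_id: int):
    """Return every (source, raw_name) pair that was resolved to port_id."""
    rows = conn.execute(
        "SELECT source, raw_name FROM port_reference "
        "WHERE port_id = ? ORDER BY source, raw_name",
        (port_id,),
    )
    return rows.fetchall()
```

Once the differently spelt names share one identifier, a plain equality
condition on that identifier is enough to collect all data about the entity.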
3.1 Vessels
In the SIMMO project, the main focus is on data about vessels. Thus, apart
from the disambiguation process, an additional step is performed in the
system, aimed at fusing data into a single record. Fusion is understood as
choosing one final value for each attribute of a given vessel, which is then
used by the analytical module and presented to the system’s user in the
display module. As a result of data fusion, a single record with values for
all of the ship’s attributes is generated. This record, built according to a
set of defined rules, is the one most likely to be correct and valid. The
processes of disambiguation and fusion of vessel data are presented below.
Vessel data disambiguation. Vessel data disambiguation is the process of
assigning the same identifier to each data record concerning the same vessel
(such an identifier should be unique to a given vessel). An example of
vessel disambiguation is presented in Figure 1, which shows two records with
selected data about static vessel features from two different sources
(MarineTraffic and Maritime Connector). Let us assume that the call sign and
vessel name were found to be equal in both records. Based on that, it may be
decided that these records concern the same vessel. In such a situation, the
same shipId is assigned to both records. This shipId is also to be assigned
to any other data item which concerns this particular vessel.
In the research it was assumed that disambiguation of vessel data may be
performed as in the example above, that is, by checking whether the values
of a certain attribute (or a collection of attributes, e.g. a pair of
attributes, as in the example above) in records coming from different
sources are equal. As soon as the system identifies a match between the
values of some attributes, the same shipId is assigned to both records. To
ensure that such processing gives the best possible results, it is important
to correctly define the order in which different attributes are analyzed in
search of a match. The attribute believed to give the most reliable results
should be analysed first, and if no match can be found (e.g. because the
values of that attribute are missing), the system should move on to less
reliable ones.

¹ http://www.iacs.org.uk/document/public/explained/Class_WhatWhy&How.PDF, accessed 2016-03-23
Figure 1. A simple schema presenting the goal of the vessel data merging
process.
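The matching order described above can be sketched as a cascade over
attribute sets, from the most to the least reliable. This is a simplified,
single-pass illustration under several assumptions: the record layout, the
`PRIORITY` list and all sample values are hypothetical, and clusters that
turn out to be connected only by a later record are not merged
retroactively.

```python
from itertools import count

# Hypothetical priority order: most reliable attribute set first.
PRIORITY = [("imo",), ("mmsi",), ("ship_name", "call_sign")]


def assign_ship_ids(records):
    """Assign a shared ship_id to records that agree on a prioritised key."""
    next_id = count(1)
    index = {}  # (attribute names, values) -> ship_id
    for record in records:
        # Build the usable keys of this record, skipping missing values.
        keys = []
        for attrs in PRIORITY:
            values = tuple(record.get(a) for a in attrs)
            if all(values):
                keys.append((attrs, values))
        # The first key (in priority order) that is already known wins.
        ship_id = next((index[k] for k in keys if k in index), None)
        if ship_id is None:
            ship_id = next(next_id)
        record["ship_id"] = ship_id
        for k in keys:
            index.setdefault(k, ship_id)
    return records
```

A record without an IMO number can still join an existing vessel through
its MMSI, and only when both are missing does the name and call sign pair
come into play.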
Vessels are characterized by a number of different attributes which may be
used for disambiguation purposes. Some of these attributes are specifically
assigned by various organizations to enable unique identification of vessels
in certain contexts. In the analyzed data sources these attributes are:
- IMO (International Maritime Organization Ship Identification Number
  Scheme) number, assigned permanently to each ship for identification
  purposes. The number should remain unchanged upon transfer of the ship to
  other flag(s) and is inserted in the ship’s certificates²,
- MMSI (Maritime Mobile Service Identity), a nine-digit number used by
  several systems (including AIS) to uniquely identify a ship or a coast
  radio station. MMSIs are regulated and managed internationally by the
  International Telecommunication Union in Geneva, Switzerland³,
- Call Sign,
- Vessel Name.

² http://www.imo.org/en/OurWork/MSAS/Pages/IMO-identification-number-scheme.aspx, accessed 2016-04-01
³ http://www.navcen.uscg.gov/index.php?pageName=mtMmsi, accessed 2016-04-01
Table 1. Duplicates in source-attribute value pairs for attributes which may
potentially be used for disambiguation of vessel data records
__________________________________________________________________________________________________
Attribute               # of duplicates   % of duplicates   # of duplicates     % of duplicates
                        (all values)      (all values)      (distinct values)   (distinct values)
__________________________________________________________________________________________________
IMO                     361               0.098             178                 0.088
MMSI                    2043              0.559             987                 0.321
Ship Name + Call Sign   2268              1.010             1100                0.541
Call Sign               26109             11.630            5261                2.979
Ship Name               157150            24.988            42230               11.433
__________________________________________________________________________________________________
Some of these identifiers are assigned by international organizations and
are meant to be unique on a worldwide scale (e.g. MMSI and IMO). Thus, if
two records from two distinct sources share the same MMSI or IMO number, it
is highly probable that they concern the same vessel. For the other
attributes this assumption holds with less certainty, as their values may
not be distinct.
For each of the attributes listed above, it is estimated to what degree the
attribute is reliable as a unique identifier of a vessel. The reliability is
estimated using the following heuristic. For each attribute it is checked
how many times its value is duplicated within a single data source. For
example, for MMSI it is counted how many times a value is duplicated in
MarineTraffic, then how many times a value is duplicated in Maritime
Connector etc. Finally, these numbers are summed. The more duplicated values
are found, the less reliable the attribute is as far as unique
identification of vessels is concerned. Thus, the attributes are ordered by
increasing share of such duplicates (relative to the number of all values of
the attribute in the database), i.e. from the most to the least reliable.
The ordered list is then used in the disambiguation process, i.e.
disambiguation is performed using the most reliable attributes first.
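The heuristic can be sketched as follows. The record layout (dictionaries
with a `source` key) and the exact counting rule (records whose value occurs
more than once within the same source) are assumptions made for
illustration; the counts reported in the article may have been obtained
differently.

```python
from collections import Counter


def duplicate_share(records, attrs):
    """Share of records whose value for attrs repeats within one source."""
    counts = Counter()
    for record in records:
        values = tuple(record.get(a) for a in attrs)
        if all(values):  # ignore records with missing values
            counts[(record["source"], values)] += 1
    total = sum(counts.values())
    duplicated = sum(n for n in counts.values() if n > 1)
    return duplicated / total if total else 0.0


def rank_attributes(records, candidates):
    """Order attribute sets from the most to the least reliable."""
    return sorted(candidates, key=lambda attrs: duplicate_share(records, attrs))
```

Sorting by increasing duplicate share places the most reliable attribute
sets at the front of the list used by the matching cascade.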
The results of the data analysis are presented in Table 1. The second and
third columns refer to all available data records (i.e. in how many records
there are values which are duplicated), while the fourth and fifth columns
refer to distinct values (i.e. among the distinct values of a given
attribute, how many appear more than once in a single data source). Based on
the results, the attribute with the highest reliability is the IMO number,
as duplicates occur in less than one per mille of cases. For MMSI,
duplicates occur in slightly more than half a percent of cases, which may
still be considered reasonably low. Therefore it can also be used in the
disambiguation process. For Call Sign and Ship Name, duplicates are much
more common and thus disambiguation based on these attributes is likely to
give much worse results. However, they may be used together, since
duplicates for the combination of Ship Name and Call Sign occur in about 1%
of cases.
Another important issue concerning different (sets of) attributes is how
often a certain value of the attribute appears in two or more data sources.
Obviously, a value can be used to establish that two records from different
data sources refer to the same ship only if it appears in more than one data
source. Statistics concerning this issue are presented in Table 2.
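A statistic of this kind could be computed along the following lines. The
record layout is the same hypothetical one as in the other sketches, and the
function is an assumption about how such a share may be derived, not the
procedure used to produce the published figures.

```python
from collections import defaultdict


def cross_source_share(records, attrs):
    """Share of distinct values of attrs that occur in more than one source."""
    sources_by_value = defaultdict(set)
    for record in records:
        values = tuple(record.get(a) for a in attrs)
        if all(values):  # ignore records with missing values
            sources_by_value[values].add(record["source"])
    if not sources_by_value:
        return 0.0
    shared = sum(1 for sources in sources_by_value.values() if len(sources) > 1)
    return shared / len(sources_by_value)
```

An attribute with few duplicates but a low cross-source share is reliable
yet rarely useful, which is why both statistics are considered.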
Finally, Table 3 presents statistics on how many rows are affected if
disambiguation is performed using the described approach, in the order
presented in the table (first based on IMO, then on MMSI etc.). The value in
the second column takes into account the fact that disambiguation was
already performed based on the previous attributes. Thus, if a row was
disambiguated based on IMO, it is not further analyzed whether it can also
be disambiguated based on MMSI. The last row shows for how many rows there
were no matches for any of the analysed attributes (or attribute sets). The
third column contains the values from the second column divided by the
number of all rows with data about vessels, expressed as percentages, and
sums up to approximately 100.
Table 2. How often the same value of different attributes may be found in
more than one source (the Ship Name + Call Sign row works on pairs of values
of these two attributes)
_______________________________________________________________
Attribute               # of distinct values   % of distinct values
_______________________________________________________________
IMO                     102515                 50.627
MMSI                    48150                  15.700
Ship Name + Call Sign   17334                  8.569
Call Sign               22124                  12.884
Ship Name               79407                  22.677
_______________________________________________________________
Table 3. How many rows are affected when merging of vessel data is conducted
in the order presented in the table, from top to bottom (the Ship Name +
Call Sign row works on pairs of values of these two attributes)
_______________________________________________________________
Attribute               # of rows affected     % of rows affected
_______________________________________________________________
IMO                     269662                 42.518
MMSI                    29744                  4.690
Ship Name + Call Sign   678                    0.107
Call Sign               1405                   0.222
Ship Name               25047                  3.949
_______________________________________________________________
Not merged              308371                 48.621
_______________________________________________________________
In our research, vessel data disambiguation was performed according to the
described approach and its results are presented in Table 3. The attributes
Ship Name and Call Sign were skipped as they were the least reliable
attributes.
The proposed approach could be further extended by using additional
attributes of vessels in the disambiguation process. Many additional
attributes of vessels were collected in the database, such as flag, length,
year of build and owner. These attributes may be used as extra
disambiguation information for data for which less reliable attributes, such
as Ship Name, were used in the disambiguation (for example, both the Ship
Name and the flag attributes must be equal to decide that two rows concern
the same vessel). Still, as the number of rows which could be disambiguated
based on Call Sign and Ship Name was relatively low, this issue was not
addressed in the performed work.
Vessel data fusion. The goal of data fusion is to select, for each attribute
describing a certain vessel and from all records describing that vessel, a
single attribute value which will be considered the most accurate one. For
example, let us assume that we have three records from three different
sources for the vessel with shipId=1. According to source A, the flag of
this ship is Poland, according to source B it is Germany, and according to
source C it is again Poland. The goal of data fusion is to select one of
these values, Germany or Poland, as the primary value for this attribute.
The record with fused data is to consist of such primary values for each
attribute, as presented in Figure 2.
The data fusion may be performed based on:
- selecting the most common value, i.e. the value that occurs in the data
  sources most often. It may be assumed that the value is correct because
  many or most sources report exactly the same value (an argumentum ad
  populum-like inference),
- assigning different priorities to different data sources (based on some
  previous assessment of the data sources) and selecting the value from the
  source with the highest priority. The priority should reflect how reliable
  the source is according to the conducted assessment,
- analysis of agreement between different attributes. For example, the
  first characters of a Call Sign correspond to the flag of the vessel. Thus,
  if the value of the Flag attribute differs from what was expected based on
  the Call Sign, this value may be treated as less reliable.
Figure 2. A simple schema of vessel data fusion.
There might be situations in which the value of a given attribute is
provided by only one source. In this case, this value is used in the fused
record. Also, in many cases a given attribute will have the same value
assigned in each record concerning a given ship (i.e. many sources provide
the same value of a given attribute). In such a situation this value is,
obviously, chosen as the primary value. In other cases, if different values
are assigned in records from different sources (i.e. different data sources
provide different values of a certain attribute), some of the values must be
discarded and only the one chosen as the primary value is put in the final,
fused record. In the developed system, the rule for fusing the values of
each vessel attribute was chosen by an expert.
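Two of the fusion rules discussed in this subsection, the most common value
and the source priority, can be sketched as small functions. The function
names, the source labels and the priority list are hypothetical; which rule
applies to which attribute is, as stated above, an expert decision.

```python
from collections import Counter


def fuse_most_common(values):
    """Pick the value reported most often, ignoring missing values.

    Ties are resolved in favour of the value encountered first.
    """
    present = [v for v in values if v is not None]
    return Counter(present).most_common(1)[0][0] if present else None


def fuse_by_priority(values_by_source, priority):
    """Pick the value from the highest-priority source that provides one."""
    for source in priority:
        value = values_by_source.get(source)
        if value is not None:
            return value
    return None
```

For the flag example (Poland, Germany, Poland), the first rule yields
Poland, while the second yields Germany if source B is ranked highest.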
3.2 Flags
In the data sources used in the research, each vessel has a flag assigned,
which reflects its country of registration. A flag is referred to by a
string which is the name of a given country. Although each country has
exactly one name, it may be written in different variants, e.g. due to
abbreviations of country names or spelling errors. For example, one can
easily find different ways in which the United States of America is referred
to in different data sources:
- USA (US)
- U.S.A.
- United States of America
- United States
- UnitedStates (without space)
Apart from that, sometimes the flag name does not refer to the name of the
country, but to one of its territories, e.g. Isle of Man rather than the
United Kingdom. In some scenarios, it might be useful to recognize, based on
the name of the territory, which main country is associated with a given
string.
Figure 3. A paragraph (together with a part of its HTML code) from the
Wikipedia article about Mexico-United States relations, in which the phrase
“United Mexican States” is used as an anchor to link to the article named
Mexico. The Wikipedia lexicalization dataset is generated based on such
links.
In the developed system, the list of flags to be used was defined up front.
Each flag was assigned a single string as its main name, as well as a
numerical identifier called flagId. The goal of flag disambiguation is to
identify, for a certain string representing a flag’s name, which flagId this
string corresponds to.
A basic prerequisite for such identification is a comprehensive lexical
resource, containing for example flag name variant-flagId mappings. Given a
string representing a certain flag, the system can check in the lexical
resource whether there is a mapping for the given flag name variant and,
based on that, assign the appropriate flagId (the one paired with the given
flag name variant). Obviously, the crucial factor in successful
disambiguation of flags is obtaining a comprehensive list of such flag name
variant-flagId mappings. First of all, the system should be able to assign a
flagId to any flag name variant found in the corpus. This can be done
relatively easily, as the number of unique flag name strings in the
available corpus is less than 600. In this case, a human expert is able to
manually analyze all the cases and add the necessary mappings to ensure 100%
coverage of the flag name variants from the corpus in the lexical resource.
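Such a lexical resource can be pictured as a plain mapping from flag name
variants to flagIds. In the sketch below the flagId values, the variants and
the `normalize` step (case folding and removal of punctuation and spaces)
are illustrative assumptions; the article does not state whether any
normalisation was applied before the lookup.

```python
def normalize(name: str) -> str:
    """Hypothetical normalisation: case-fold and keep letters and digits only."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


# Hypothetical flag name variant -> flagId mappings.
VARIANTS = {
    "USA": 1,
    "US": 1,
    "U.S.A.": 1,
    "United States of America": 1,
    "United States": 1,
    "Poland": 2,
}

LEXICON = {normalize(variant): flag_id for variant, flag_id in VARIANTS.items()}


def flag_id(raw_name: str):
    """Return the flagId for a raw flag string, or None if it is unknown."""
    return LEXICON.get(normalize(raw_name))
```

Strings for which the lookup returns None are the ones that would be handed
to a human expert or resolved through an extended lexicon.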
Still, the aim of the research was to develop a method which would also make
it possible to obtain other mappings, not available in the initial corpus,
which would extend the lexicon. The extended lexicon would be necessary when
new data (e.g. data from a new data source) containing previously unknown
variants of a flag’s name is added to the system.
The resource used to automatically generate such an extended lexicon was the
lexicalization dataset from the DBpedia project. “DBpedia is a crowd-sourced
community effort to extract structured information from Wikipedia and make
this information available on the Web”⁴. The lexicalization dataset may be
understood as a lexicon containing a list of Wikipedia concepts (i.e.
article names) and their alternative names (i.e. text strings which may be
used to refer to these concepts).
The lexicalization dataset is generated automatically based on an analysis
of hyperlinks within Wikipedia (the so-called interwiki links). In many
cases, a Wikipedia article includes links which point to other Wikipedia
pages. In such links, the anchor text can often be treated as an alternative
name of the concept the link points to. We refer to such anchor texts as
surface forms. As an example, please refer to Figure 3, based on which the
phrase (surface form) “United Mexican States” may be identified as an
alternative name of the concept “Mexico”. Thus, if it is possible to
establish that a given Wikipedia concept corresponds to a certain flag, all
surface forms of links pointing to that concept can be retrieved
automatically and considered alternative flag names.
The lexicalization dataset is provided by DBpedia in the form of a plain
text file in a defined format. A sample of data from this file is presented
in Listing 1.
_______________________________________________
<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> "Poland"@en <http://dbepdia.org/spotlight/id/Poland---Poland> .
<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> "Polish"@en <http://dbepdia.org/spotlight/id/Poland---Polish> .
<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> "Republic of Poland"@en <http://dbepdia.org/spotlight/id/Poland---Republic_of_Poland> .
_______________________________________________
Listing 1. Selected lines from the DBpedia lexicalization dataset with
surface forms pointing to the concept "Poland"
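Lines in the format of Listing 1 can be turned into a concept-to-surface-form
mapping with a regular expression. The sketch below assumes one statement
per line and English labels, as in the listing; the pattern and the function
name are illustrative, and a full RDF parser would be a more robust choice
for the complete dataset.

```python
import re
from collections import defaultdict

# Matches the subject (concept) and the English label (surface form).
LINE = re.compile(
    r'<http://dbpedia\.org/resource/(?P<concept>[^>]+)>\s+'
    r'<http://lexvo\.org/ontology#label>\s+'
    r'"(?P<form>[^"]*)"@en'
)

# The three statements of Listing 1, one per line.
SAMPLE = [
    '<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> '
    '"Poland"@en <http://dbepdia.org/spotlight/id/Poland---Poland> .',
    '<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> '
    '"Polish"@en <http://dbepdia.org/spotlight/id/Poland---Polish> .',
    '<http://dbpedia.org/resource/Poland> <http://lexvo.org/ontology#label> '
    '"Republic of Poland"@en '
    '<http://dbepdia.org/spotlight/id/Poland---Republic_of_Poland>.',
]


def surface_forms(lines):
    """Group surface forms by the concept they point to."""
    result = defaultdict(set)
    for line in lines:
        match = LINE.match(line.strip())
        if match:  # silently skip lines in any other format
            result[match["concept"]].add(match["form"])
    return result
```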
The surface forms which point to the concept name are
often correct alternative names of a given country.
However, in some cases it may turn out that some of the
retrieved surface forms are useless from the point of
view of the flag disambiguation process and only
introduce noise. For example, the surface forms pointing
to the concept Poland include “Polish”, “Poland’s”,
“Polish-born” and “Pole”. These surface forms are
unnecessary, as flag names in data sources are referred
to using nouns.
Such unnecessary variants may be easily filtered
out by discarding words ending with predefined
sequences (e.g. “ish”, “-born” and “’s”).
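The suffix-based filtering described above can be sketched as follows. This is a minimal illustration, not the SIMMO implementation: the function name, the suffix tuple and the sample list are assumptions made for the example, and only the three suffixes quoted in the text are used.

```python
# Suffixes quoted in the text; both apostrophe styles are covered.
NOISE_SUFFIXES = ("ish", "-born", "'s", "’s")


def filter_surface_forms(forms, suffixes=NOISE_SUFFIXES):
    """Keep only surface forms that do not end with a noise suffix."""
    return [form for form in forms if not form.lower().endswith(suffixes)]


# Sample surface forms for the concept "Poland", as listed in the text.
surface_forms = ["Poland", "Polish", "Poland’s", "Polish-born",
                 "Republic of Poland", "Pole"]
kept = filter_surface_forms(surface_forms)
print(kept)
```

With these three suffixes the form “Pole” is not removed, so a rule of this kind reduces the noise but does not eliminate it.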

4 http://wiki.dbpedia.org/
The above-described inference of flag name variants is
correct only when it is possible to connect the name of
the flag, as known to our system, with the name of the
Wikipedia concept corresponding to the given country.
This inference can be done automatically only when the
flag name and the name of the Wikipedia concept are
exactly the same. Additionally, it must be ensured that
the found concept indeed concerns the given country and
is not another concept with the same name. For this
purpose a SPARQL query is used⁵ (a correctly defined
SPARQL query may ensure that a given concept is indeed a
country). When the correct concept cannot be found in
this way, additional processing must be conducted, based
on the following procedure:
1 Take the main flag name as known to the system and
check if there are any interwiki links in DBpedia with
surface forms equal to this string. Fetch the list of
the matching surface forms and the concepts which
these surface forms point to.
2 Fetch from DBpedia a list of all concepts which refer
to existing countries using a SPARQL query.
3 Intersect the two sets obtained in steps 1 and 2 and,
based on that, identify the name of the concept
corresponding to the given country.
4 Get the surface forms of all interwiki links pointing
to the found concept and add them as flag name
variants.
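The four steps above can be sketched with in-memory sets. This is only an illustration under stated assumptions: the function name and the toy data are hypothetical, the list of country concepts (step 2) is passed in rather than fetched with SPARQL, and returning no result when the intersection does not contain exactly one concept is a choice made for the sketch.

```python
from collections import defaultdict


def flag_name_variants(flag_name, link_pairs, country_concepts):
    """Return (concept, variants) for a flag name, or (None, set()) if unresolved.

    link_pairs: iterable of (concept, surface_form) from the lexicalization data.
    country_concepts: concepts known to be countries (step 2, e.g. via SPARQL).
    """
    concepts_by_form = defaultdict(set)
    forms_by_concept = defaultdict(set)
    for concept, form in link_pairs:
        concepts_by_form[form].add(concept)
        forms_by_concept[concept].add(form)

    candidates = concepts_by_form.get(flag_name, set())   # step 1
    matches = candidates & set(country_concepts)          # step 3
    if len(matches) != 1:                                 # none, or still ambiguous
        return None, set()
    concept = next(iter(matches))
    return concept, set(forms_by_concept[concept])        # step 4


# Hypothetical toy data, not taken from DBpedia.
link_pairs = [
    ("Russia", "Russia"),
    ("Russia", "Russian Federation"),
    ("Russian_Federation_(album)", "Russian Federation"),
]
countries = {"Russia", "Poland"}
concept, variants = flag_name_variants("Russian Federation", link_pairs, countries)
print(concept, sorted(variants))
```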
Using the above-described approach, it was possible to
extract around 1500 (flag name variant, flagId) pairs in
total. Thanks to that, the developed system was able to
automatically disambiguate almost every flag name string
which was retrieved from internet data sources. For the
remaining flag names, which still could not be
disambiguated, appropriate mappings were added manually
by experts to ensure full coverage. Finally, the
developed solution was evaluated manually by an expert,
who was shown a sample of 300 flag name strings as found
in data sources together with the flagIds assigned by
the system. According to the expert, the system
performed the disambiguation correctly in 299 out of
these 300 cases (more than 99.6%).
3.3 Ports
In the internet data sources used, ports appear in
different contexts, including:
- home port of a particular vessel,
- port visited by a ship,
- current vessel destination,
- port in which the Port State Control inspection
took place.
Similarly to the flags, in the SIMMO system there is a
predefined list of known ports, and each port has a
unique identifier assigned, the portId (which is an
integer value), as well as the main name of the port (a
text string). Additionally, for each port its location
and LOCODE⁶ are specified.

5 SPARQL is a semantic query language for databases, able to retrieve and
manipulate data stored in the Resource Description Framework (RDF) format
used by DBpedia.
6 LOCODE is a geographic coding scheme developed and maintained by the
United Nations Economic Commission for Europe.
When new information referencing a port is
acquired from a data source, the system needs to
recognize which port this information concerns and
assign an appropriate portId. This disambiguation is
performed in the manner described in the following
paragraphs.
Development of lexical resources. From the
technical point of view, the disambiguation of port
names, in its basic form, is very similar to the
disambiguation of the flag names. Again, there is a
lexical resource with (port name variant, portId) pairs.
Having a collection of such pairs, stored in the form of
a database table, for any port name string extracted
from a data source the system searches through all pairs
to find the matching port name variant and assigns the
corresponding portId to the port string.
The lexical resource of the port name variants used
in the SIMMO system was obtained using the same
approach as the one described for the flags, i.e.
utilizing the DBpedia lexicalization dataset. Using this
procedure, for each port name it was possible to
obtain its name variants. For example, for the port of
Saint Petersburg in Russia the following port name
variants were identified:
St Petersberg; St. Petersburg; Leningrad; Saint
Peterburg; Sankt Peterburg; SanktPetersburg; St.
Petersberg; Petrograd; SP; SaintPetersburg; Saint
Petersburg; St. Petersburg, Russia; Petersburg; St.
Petersburgh; St.Petersburg; Piter; SanktPetersburg; St
Petersburg; St.Peterburg; LeningradSaintPetersburg;
SaintPetersburg, Russia; SaintPetersberg
Nevertheless, contrary to the flag name variants, in the
case of the ports some additional requirements had to
be taken into account.
First of all, the names of ports are not unique. In
many cases, there is more than one port with a given
name (or a given name variant). For example, apart
from St. Petersburg in Russia, there is a city with
exactly the same name in Florida, USA, which has a port
as well. As a result, if only the port name string is
taken into account, it would be impossible to choose the
correct port other than by chance. Sometimes a port name
string contains additional information about the country
in which the port is located (e.g. “St. Petersburg,
Russia”). If correctly processed, this information may
be used as an indication of which port is the correct
one. Still, if there is no such information, another
approach must be used. The following subsections present
the approaches developed for coping with this issue.
Disambiguation of the home port based on vessel
flag. First, let us analyze a situation in which a
certain vessel in a data source has a home port name
assigned and this name is not unique, e.g. Portsmouth
(there are four ports with this name known to the
SIMMO system). Thus, based solely on the port name
string, the system does not know which port this
information actually refers to. To solve this issue, it
was assumed that the home port is likely located in the
country associated with the flag of the analyzed vessel.
Therefore, in such situations, the final portId is
assigned according to the following procedure:
1 Get all portIds for which the given port name string
is a name variant,
2 Get the list of countries in which these ports are
located, based on their LOCODEs (in the system
each port has a LOCODE assigned and the first two
letters of the LOCODE refer to a country),
3 Check if any of the candidate ports (from the list
from step 1) is located in the country associated with
the flag of the vessel; if so, assign the corresponding
portId as the identifier of the home port of the
vessel being processed.
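The procedure above can be sketched as a small lookup. The sketch assumes that the flag is already expressed as a two-letter country code; the portIds, the LOCODE values and all names are hypothetical and serve only the illustration. Assigning a portId only when exactly one candidate lies in the flag country is a design choice of the sketch, since the text does not state how several matching candidates are handled.

```python
def home_port_id(port_name, flag_country, ports_by_variant, locode_by_port):
    """Return the portId of the home port, or None if the flag does not settle it.

    flag_country: two-letter country code associated with the vessel flag.
    """
    candidates = ports_by_variant.get(port_name.strip().lower(), [])    # step 1
    in_flag_country = [
        port_id for port_id in candidates
        if locode_by_port[port_id][:2].upper() == flag_country.upper()  # steps 2 and 3
    ]
    # Assign only when exactly one candidate lies in the flag country.
    return in_flag_country[0] if len(in_flag_country) == 1 else None


# Hypothetical portIds; the LOCODE values are illustrative only.
ports_by_variant = {"portsmouth": [101, 102, 103, 104]}
locode_by_port = {101: "GBPME", 102: "USPSM", 103: "USPTM", 104: "DMPOR"}

print(home_port_id("Portsmouth", "GB", ports_by_variant, locode_by_port))
print(home_port_id("Portsmouth", "US", ports_by_variant, locode_by_port))
```

In this toy data the British flag selects a single port, while the US flag still leaves two candidates and therefore yields no assignment.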
Using the described approach, we processed the data on
ports extracted from internet sources. The described
ambiguity was found in 2118 cases. Among them, in
around 74.7% of cases the flag of the ship matched the
country of one of the ambiguous ports. This information
was then used to decide which portId should be assigned.
The flag of the vessel was not known in 14.3% of cases.
In 10.95% of cases the flag associated with the given
ship did not match any of the possible home ports.
Disambiguation of the visited ports based on AIS
messages and geographical proximity. Another
scenario in which assigning the correct portId is
challenging is information about historical visits of
vessels in ports. A list of vessel port calls with names
of ports is retrieved from the internet sources (e.g.
MarineTraffic). The port strings used in this data must
be disambiguated and proper portIds must be assigned.
If it is unclear which port was actually visited by
the vessel (e.g. the name of the visited port is
Portsmouth), information about the geographical
coordinates of the vessel at a given timestamp, taken
from AIS messages, is used by the system to resolve the
ambiguity. In this approach, the geographical proximity
of the vessel to the locations of the different ports is
calculated. As a result, the port which is closest to
the vessel is selected and its portId is assigned to the
data on visited ports.
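A minimal sketch of the proximity-based selection is given below. The text does not specify how the distance is computed, so the great-circle (haversine) formula is an assumption here, as are the function names, the portIds and the approximate coordinates.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def nearest_port(vessel_position, candidate_ports):
    """Return the portId of the candidate closest to the vessel position."""
    return min(
        candidate_ports,
        key=lambda port_id: haversine_km(*vessel_position, *candidate_ports[port_id]),
    )


# Hypothetical portIds with approximate (lat, lon) of three ports named Portsmouth.
candidate_ports = {101: (50.80, -1.09), 102: (43.07, -70.76), 103: (36.84, -76.30)}
vessel_position = (50.75, -1.10)   # position taken from an AIS message (made up)

print(nearest_port(vessel_position, candidate_ports))
```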
Disambiguation of ports based on the port
importance. In some cases, the approaches described
in the previous subsections are not sufficient to
correctly determine which port (out of those with a
similar name) should be chosen during the
disambiguation process. This may be, for example, due
to the fact that there is no indication at all of which
port is actually referenced. For example, ports with the
same name may be geographically very close to each
other, as in the case of the two Vancouver ports, just
across the USA-Canadian border. In such a case, the
proximity of a vessel to these ports may be
insufficient to determine which port should be chosen
during the disambiguation process.
For a human, in many situations it is obvious, after
analyzing the available data, which port the data
probably refers to. For example, Vancouver in
Canada is a huge city and a very important port (47th
largest container port according to the World Shipping
Council⁷), while a town with the same name in the
United States is likely much less important, at least
from the point of view of the maritime domain.

7 http://www.worldshipping.org/about-the-industry/global-trade/top-50-world-container-ports, accessed 26 Jan 2016
Thus, similar reasoning was implemented in the
SIMMO system. To this end, additional data about
different ports (and the cities associated with them)
was utilized. Again DBpedia was used as a knowledge
base, to which SPARQL queries about concepts (ports
and cities) were sent in order to get values of DBpedia
attributes which might potentially be useful for
determining the importance of the port and city. The
obtained attributes included:
- populationTotal: population of the city, as a
measure of the size of the city; it was assumed that
ports in larger cities are usually of greater
importance than ports in smaller cities,
- shipBuilder: the larger the number of ships built in
this city, the more important this port probably is,
- shipHomeport: if the port is a home port for a
larger number of vessels, it is probably more
important as well.
Having these values, the system is able to choose
the most important port based on the following
heuristic. The candidate ports are ranked separately
for each of these three values. Then the port which on
average holds the highest position in the rankings is
selected as the final disambiguated port.
Granularity of ports. Another difficulty with port
disambiguation arises from the fact that ports may be
perceived on different levels of granularity. Since
SIMMO uses only the main name of the city as the
name of the port, there still may be smaller ports or
docks in the area of the city, with names containing not
the name of the main city but the name of a city
district. For example, the port name string
“Hoogvliet” corresponds to a district of Rotterdam
and in some data sources is provided as the name of
the visited port. Still, in the system’s knowledge base
there is only information about the Rotterdam port and
not about Hoogvliet. In such cases, the portId of
Rotterdam should be assigned to “Hoogvliet”.
However, often no mappings which could be used in
such a scenario are found in the DBpedia lexicalization
dataset.
_______________________________________________
<geoname>
<toponymName>Gemeente Rotterdam</toponymName>
<name>Gemeente Rotterdam</name>
<lat>51.88246</lat>
<lng>4.28784</lng>
<geonameId>2747890</geonameId>
<countryCode>NL</countryCode>
<countryName>Netherlands</countryName>
<fcl>A</fcl>
<fcode>ADM2</fcode>
</geoname>
<geoname>
<toponymName>Hoogvliet</toponymName>
<name>Hoogvliet</name>
<lat>51.86333</lat>
<lng>4.3625</lng>
<geonameId>2753666</geonameId>
<countryCode>NL</countryCode>
<countryName>Netherlands</countryName>
<fcl>P</fcl>
<fcode>PPL</fcode>
</geoname>
_______________________________________________
Listing 2. An excerpt from the response of the GeoNames Place
Hierarchy service for a query about the Hoogvliet place name
To resolve the situation described above, the GeoNames⁸
web service is used. For each port name string which
the system was not able to disambiguate based on
mappings from the DBpedia lexicalization dataset, the
GeoNames Place Hierarchy web service is queried⁹
in order to check whether any of the geographical units
higher in the hierarchy than the given port can be found
in the system’s knowledge base (in the list of port
names or port name variants; see Listing 2). If such a
geographical unit (port) is found, in the next step it
is checked whether the location of the analyzed unit is
similar to the location of the known port identified in
the previous step (as was previously mentioned, the
locations of known ports are stored in the system). If
the coordinates are similar, it can be concluded that
there is a suburb-city relationship between the district
(the processed port call) and the city associated with
the port from the system’s knowledge base. In such a
case, the name of the suburb is added to the knowledge
base as a name variant of the port, in the same way as
it was done for mappings retrieved from the DBpedia
lexicalization dataset.
8 http://www.geonames.org/
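The hierarchy-based check can be sketched on already retrieved data, without calling the web service. Everything beyond the idea itself is an assumption: the function names, the substring match between an ancestor name and a known port name, the 25 km threshold, the hypothetical portId and the approximate Rotterdam port coordinates. The coordinates of Hoogvliet and Gemeente Rotterdam are those shown in Listing 2.

```python
import math


def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def resolve_via_hierarchy(place, ancestors, known_ports, max_km=25.0):
    """Return the portId of a known port that is an ancestor of `place`, else None."""
    for ancestor in ancestors:
        for port_name, (port_id, port_position) in known_ports.items():
            name_matches = port_name.lower() in ancestor["name"].lower()
            if name_matches and distance_km(place["pos"], port_position) <= max_km:
                return port_id
    return None


place = {"name": "Hoogvliet", "pos": (51.86333, 4.3625)}
ancestors = [{"name": "Gemeente Rotterdam", "pos": (51.88246, 4.28784)}]
# Hypothetical portId and approximate port location.
known_ports = {"Rotterdam": (301, (51.917, 4.5))}

print(resolve_via_hierarchy(place, ancestors, known_ports))
```

If a portId is returned, the processed string (here “Hoogvliet”) would be stored as an additional name variant of that port.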
Evaluation of the port disambiguation process. The
proposed methods for port name disambiguation
were evaluated using the datasets extracted from the
external data sources and stored in the SIMMO
system. The results of the evaluation are presented
below.
Using the above-described approach, the system
was able to assign portIds in 234710 cases out of
343610 records in which a port string was specified.
This is more than 68% of cases. The inability to assign
a portId to the remaining port strings may result from
one of two reasons:
- the analysed port is not included in the system’s
knowledge base (e.g. the given port is a small port
on a river) and thus disambiguation could not give
any results,
- the analysed port is included in the system’s
knowledge base, but the disambiguation failed to
identify it.
To check the possible reasons, a sample of 150 port
strings was randomly drawn out of all cases for
which disambiguation failed. This sample was
presented to human annotators, who analysed
whether each port name string is present in the system’s
knowledge base. It turned out that only 12.8% of port
strings without portIds assigned were actually
available in the knowledge base. Therefore, the
developed methods failed to assign portIds in less
than 5% of cases (12.8% of the 32% of cases for which
the portId was not assigned).
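The estimate above can be verified with a short calculation that uses only the figures reported in the text; the share of unassigned records follows from the 234710 assigned records out of 343610:

```latex
\[
\underbrace{\left(1 - \frac{234710}{343610}\right)}_{\text{no portId assigned}\ \approx\ 0.317}
\times
\underbrace{0.128}_{\text{present in the knowledge base}}
\ \approx\ 0.041
\]
```

That is, roughly 4.1% of all records with a port string, which is consistent with the stated bound of less than 5%.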
Further evaluation concerned checking to what
degree the assigned portIds are correct. The
evaluation of such disambiguation may be difficult, as
in some situations there may not be enough data even
for a human being to decide which port the given
data actually corresponds to. Therefore, it was
decided to run the evaluation only for the visits in
ports. For this data, it was automatically checked
what the geographical distance was between a given
vessel and the port connected with a given portId at a
defined timestamp. If the distance was relatively
small, then it may be assumed that the correct
portId was assigned.

9 This web service returns all GeoNames higher up in the hierarchy of a place name. Source:
http://www.geonames.org/export/place-hierarchy.html, accessed 12 Apr 2016
Figure 4 presents the cumulative distribution
of the distances between the position of the analysed
vessel and the location of the disambiguated port. The
median of the distances is 10.17 miles. In 90% of cases
the distance was below 45 miles and in 95% of cases
the distance was below 130 miles. For 150 miles, this
value settles at 97% and it does not rise with a
further increase of the distance. While it is impossible
to set a firm threshold to determine when the vessel
actually is in a given port, the accuracy of port
disambiguation using the defined methods may be
estimated as being between 90% and 97%.
Figure 4. Distances between positions of the vessel and the
location of the disambiguated port which, according to the
data, the vessel visited at the given timestamp
3.4 Vessel types
Each vessel may be described as being of a certain
type, e.g. tug, fishing vessel, cargo etc. Usually, in a
given data source, there is a predefined taxonomy of
such vessel types, which is used consistently to
describe vessels. Still, different data sources usually
use different taxonomies. The same vessel type may
be described using different strings, e.g. in source A
as “Fishing vessel”, while in source B simply as
“fishing”. Also, in different data sources vessel types
may be perceived on different granularity levels or be
part of hierarchies that are somehow orthogonal to
each other. This is a common problem when dealing
with interoperability of different systems and data
integration.
Again, if the data about vessel types is to be
disambiguated automatically, simply relying on the
words used to describe such vessel types (strings) is
not sufficient; rather, a unique identifier,
vesselTypeId, should be assigned to each vessel
type (for each word corresponding to any vessel type,
regardless of the source). Also, such an identifier
should be consistent across different data sources, so
that even if in two data sources two different text
strings are used to refer to the same vessel type, in
the system the same vesselTypeId is assigned.
A basic rule used to determine whether two different
strings from different sources refer to the same vessel
type is the identification of vessel type name pairs
used in data sources to refer to the same vessel. For
example, let us assume that for a given vessel, in
source A its type is specified as “fishing” and in
source B as “fishing vessel”. Let us also assume that
the vessel type list which is used in the system as the
main list of the vessel types is the one from source A
(further referred to as the main list of vessel types).
Finally, let us assume that it was identified that a
certain number of ships which in source A are assigned
the “fishing” vessel type are described in source B as
“fishing vessels”. Thus, it may be expected that both
strings refer to the same vessel type.
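The rule above amounts to counting, over all vessels known to both sources, how often a given pair of type names is attached to the same shipId. A minimal sketch follows; the function name and the toy records are assumptions, and lower-casing the names is a normalisation chosen for the example.

```python
from collections import Counter


def count_type_pairs(types_in_a, types_in_b):
    """Count (type in A, type in B) pairs over vessels present in both sources.

    Both arguments map a shipId to the vessel type string used by that source.
    """
    pair_counts = Counter()
    for ship_id, type_a in types_in_a.items():
        type_b = types_in_b.get(ship_id)
        if type_b is not None:
            pair_counts[(type_a.lower(), type_b.lower())] += 1
    return pair_counts


# Toy records with made-up shipIds.
types_in_a = {1: "fishing", 2: "fishing", 3: "tanker", 4: "cargo"}
types_in_b = {1: "Fishing vessel", 2: "fishing vessel", 3: "oil/chemical tanker", 5: "tug"}

pair_counts = count_type_pairs(types_in_a, types_in_b)
print(pair_counts.most_common())
```

Pairs with high counts are the candidates for being treated as two names of one vessel type.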
The reasoning described above was used in
SIMMO as the primary method for vessel type
disambiguation. Still, using only one approach may
be insufficient and some other methods should be
used as well, e.g. string similarity measures between
vessel type names.
_______________________________________________
['fishing vessel', 'fishing']: 1020
['fishing', 'trawler']: 1156
['container ship', 'cargo hazard a (major)']: 1222
['passenger', 'ro-ro/passenger ship']: 1246
['crude oil tanker', 'tanker']: 1347
['tanker', 'oil products tanker']: 1661
['tanker', 'oil/chemical tanker']: 2086
['cargo', 'container ship']: 2162
['cargo', 'general cargo']: 5536
['cargo', 'bulk carrier']: 6309
_______________________________________________
Listing 3. The most frequent mappings between vessel types
in two data sources used in the SIMMO system:
MarineTraffic and Maritime Connector. The numbers to the
right refer to the number of cases in which both vessel type
names from the two different data sources (values in brackets)
refer to a vessel with the same shipId assigned.
What has also been taken into account is that in
different sources the taxonomy of vessel type names
may have a different granularity. For example, in
source A some vessel may be assigned the type “inland
tanker”, while in source B there is only a more
general vessel type, “tanker”. In such a case, an is-a
relationship occurs, which is true only in one
direction and false in the other. For example, it is
true that each “inland tanker” is a “tanker”, but it is
false that each “tanker” is an “inland tanker”.
Therefore, a mapping is correct only if the more general
vessel type name is the one specified in the main list
of vessel type names. In such a case, a string referring
to a more detailed vessel type can be used as a vessel
type name variant for the more general type.
In order to identify such situations, the following
heuristic was used. It is assumed that the longer name
(i.e. consisting of a larger number of words) describes
a more detailed entity. This assumption is based on
the observation that additional words in vessel type
name strings often restrict the number of vessels that
may be described using this name. For example,
“oil/chemical tanker” is more detailed than a simple
“tanker”. Moreover, to ensure that both vessel type
names refer to similar concepts, it must be checked
whether the longer name contains the shorter one. For
example, the string “oil/chemical tanker” contains the
string “tanker”. Thus, if in a given mapping it is
identified that the more general term (the shorter
string) is in the main list of vessel type names used in
the system and the other vessel type name from the
mapping contains this string, then this mapping may be
used in disambiguation (the less general string may be
used as a vessel type name variant for the more general
one).
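The heuristic can be sketched as two checks: the more detailed name has more words and contains the more general name as a substring, and the more general name belongs to the main list. The function names and the content of the main list are assumptions; the pairs and counts are a subset of those reported in Listing 3.

```python
def is_specialisation(general, specific):
    """True if `specific` has more words than `general` and contains it."""
    g, s = general.lower().strip(), specific.lower().strip()
    return len(s.split()) > len(g.split()) and g in s


def usable_mappings(pair_counts, main_list):
    """Map each detailed type name to the more general name from the main list."""
    main = {name.lower() for name in main_list}
    variants = {}
    for name_a, name_b in pair_counts:
        # A pair may list the general name first or second, so try both orders.
        for general, specific in ((name_a, name_b), (name_b, name_a)):
            if general.lower() in main and is_specialisation(general, specific):
                variants[specific.lower()] = general.lower()
    return variants


pair_counts = {
    ("tanker", "oil/chemical tanker"): 2086,
    ("cargo", "container ship"): 2162,
    ("fishing vessel", "fishing"): 1020,
    ("crude oil tanker", "tanker"): 1347,
}
main_list = ["tanker", "cargo", "fishing", "passenger"]   # hypothetical main list

variants = usable_mappings(pair_counts, main_list)
print(variants)
```

The pair (“cargo”, “container ship”) is rejected although it is frequent, because the longer name does not contain the shorter one; such pairs are left for manual analysis.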
A list of the most common mappings for which the
above-mentioned heuristic is used is presented in
Listing 4.
_______________________________________________
['tug', 'pusher tug']: 98
['tanker', 'lng tanker']: 101
['tanker', 'bunkering tanker']: 119
['dredger', 'trailing suction hopper dredger']: 187
['tanker', 'chemical tanker']: 222
['cargo', 'ro-ro cargo']: 342
['tanker', 'inland tanker']: 397
['tanker', 'lpg tanker']: 507
['passenger', 'passengers ship']: 644
['fishing', 'fishing vessel']: 1020
['passenger', 'ro-ro/passenger ship']: 1246
['tanker', 'crude oil tanker']: 1347
['tanker', 'oil products tanker']: 1661
['tanker', 'oil/chemical tanker']: 2086
['cargo', 'general cargo']: 5536
_______________________________________________
Listing 4. Vessel type mappings, filtered using a simple
string similarity measure. The numbers to the right refer to
the number of cases in which the two vessel type names in
brackets referred in two different data sources to a vessel
with the same shipId assigned.
Finally, manual analysis of other potential mappings
may be performed by an expert and, based on
that, additional mappings may be added to the
system’s knowledge base.
3.5 Classification societies
Each vessel belongs to a classification society. The
goal of the classification societies is “to provide
classification and statutory services and assistance to
the maritime industry and regulatory bodies as
regards maritime safety and pollution prevention,
based on the accumulation of maritime knowledge
and technology”¹⁰. Names of classification societies,
similarly to other data types, are expressed as strings,
and in each data source the same classification society
may be referred to using a different string. Therefore,
for each acquired classification society name, a proper
identifier, classId, should be assigned in the
disambiguation process. In the SIMMO system, there was
an initial list of known classification societies with
assigned classIds. This list was later extended during
the disambiguation process.
_______________________________________________
['bureau veritas', 'nippon kaiji kyokai']: 22
['american bureau of shipping', 'bureau veritas']: 29
['det norske veritas', 'lloyds register']: 32
['american bureau of shipping', 'lloyds register']: 41
['dnv gl', 'germanischer lloyd']: 56
['registro italiano navale', 'american bureau of shipping']: 61
['korean shipping register', 'korean register']: 121
['dnv gl', 'det norske veritas']: 176
['lloyd\'s shipping register', 'lloyds register']: 267
['lloyds shipping register', 'lloyds register']: 605
_______________________________________________
Listing 5. The most frequent mappings between
classification societies, based on the fact that the same vessel
was assigned different classification society strings in
different sources. The results are much worse than for
vessel types.
The analysis of classification society names
started with the generation of mappings in the same
manner as it was done for flags and vessel types, i.e.
by checking if a single vessel has different
classification society names assigned in different data
sources. However, in the case of the classification
societies, this approach did not bring many correct
results, as shown in Listing 5; only a few of the most
common mappings were correct and used in further
analysis. This is probably due to the fact that vessels
may change their classification society relatively
often in comparison to a change of the vessel type (e.g.
changing the vessel type may require expensive
modifications of the vessel itself). Therefore,
different classification societies assigned to the same
ship in different sources may result from the fact that
information in one source may be outdated in
comparison to information provided in the other one.
10 See http://www.iacs.org.uk/document/public/explained/Class_WhatWhy&How.PDF for details.
Taking into account the obtained results, it turned out
that the number of distinct classification society
names for which the system was not able to assign a
classId based on the string comparison method was only
192. Since this number was relatively small, a manual
analysis of the strings and assignment of the correct
classIds could be performed. Based on the analysis, the
system’s knowledge base of classification society name
variants was updated. This allowed all classification
society name strings to be disambiguated.
3.6 Company names
In different data sources different strings may be used
to refer to the same company. In many cases, such
strings are similar, for example “Star Shipping Ltd”
and “Star Shipping Limited”. The aim of
disambiguation in this case is to determine if two
strings in fact refer to the same company and, if so,
assign the same identifier, companyId, to both of them.
In the first step, identifying whether different strings
refer to the same company was performed by
utilizing a string similarity measure, namely the Jaro
distance [9]. Given two strings, this measure returns
a numeric value between 0 and 1. The more similar
the strings are, the higher the value returned.
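For illustration, a textbook formulation of the Jaro measure is sketched below (in the form where higher values mean more similar strings, as described above). It is not necessarily the exact variant used in SIMMO; in particular, any pre-processing of company names, such as case folding, is not covered by the text and is left out.

```python
def jaro_similarity(s1, s2):
    """Jaro similarity between two strings, a value between 0.0 and 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match if they are equal and not too far apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(jaro_similarity("Star Shipping Ltd", "Star Shipping Limited"))
```

The first call uses a standard test pair for this measure; the second applies the function to the company name pair quoted in the text.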
The basic difficulty in the disambiguation of
company names results from the fact that even for
humans this task can be performed only with a
limited level of certainty (i.e. it is hard to say to
what extent the output of the disambiguation is
correct). It may be even more difficult to define how
the term “single company” is understood and how to
relate that to the analysis being performed.
Let us analyze the following pair of company names:
“Palmali Rostov, Russia” and “Palmali Shipping
Services Instabul, Turkey”. It is clear (at least for a
human) that these strings refer to entities located in
different countries. Still, after performing a search
on the Internet, it may be learnt that both entities
belong to the same group, the Palmali Group of
Companies¹¹. In such a case, classifying these two
strings as the names of either the same company or two
different companies depends on the definition of a
single company.
Still, in some cases, names of companies are not
similar as far as the Jaro measure is concerned, but
they may nevertheless refer to the same company. For
example, let us assume that we have the following
strings: “U.S., Dept. of Transportation” and “USA
Government‐Washington DC, U.S.A”. The Jaro similarity
between them is only around 0.54. Still, a human will
notice that the Department of Transportation is a part
of the USA Government. What is more, in the analysed
data in 17 cases these two names were used in different
data sources as the owners of the same ships.
11 http://palmali.com.tr/en/default.asp
The above-mentioned example shows that a string
similarity measure is in many situations not sufficient
to decide whether two strings refer to the same
company or not. Based on this observation, an additional
analysis was performed in which associations
between the company names and the vessels were
identified. The analysis is similar to the one
conducted for the vessel types. Again, for all ships it
was analysed what company name strings are
provided in different data sources for a given vessel
((company name, shipId, company name) mappings). If
a certain pair of names occurs frequently in such an
analysis, it may be assumed that this pair refers to the
same company. Still, similarly to the case of the
classification societies, the owner of a vessel may
change relatively often, so if the data in different
sources are outdated, the created mappings may be
incorrect.
Taking all these aspects into account, it was
analysed with what precision the automatic
disambiguation of company names was performed. In
the conducted experiment, different values of the string
similarity measure were set as a threshold for
classifying two company names as referring to the
same company. Two variants were analysed: 1) in
which only the string similarity measure was used
and 2) in which all pairs for which no common ship was
found were discarded. In both variants, the company
names can be classified as referring either to the same
company or to different companies.
To be able to evaluate the proposed approach, a
sample of data was presented to human experts. The
task was performed by three annotators. Each of them
was presented with a collection of pairs of company
names with different similarities between them. Also,
for a part of these pairs both strings were actually
related to the same vessel, while for the other part
they were not (the annotators did not know what the
similarity between the strings was or whether the pair
was found in the mappings). Each pair was annotated by
exactly two annotators. To each pair, the annotators
were to assign one of three values:
- both company names refer to the same company,
- company names refer to different companies,
- unknown (there is not enough information to
decide which of the two other options should be
chosen).
Figure 5. Precision of the proposed company
disambiguation method for different thresholds on string
similarity and for two variants: with or without additional
filtering based on mappings found in the available data
Then exactly the same data sample was processed
automatically by the system. The obtained results
were compared with the annotations produced by the
human experts to check the accuracy. Figure 5 presents
the results of the performed experiments. After setting
a certain threshold for the string similarity, each pair
of company names with similarity greater than or equal
to the threshold may be classified as referring to the
same company. Optionally, additional filtering can
be performed to discard all pairs which were not
found in the database as referring to the same vessel.
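The two variants and the precision measure can be sketched as follows. The function names, the company names other than the “Star Shipping” pair, the similarity values and the reference annotations are all made up for the illustration; only the decision rule (threshold, optional filtering by shared vessel) follows the text.

```python
def classify_same_company(pairs, threshold, shared_vessel_pairs=None):
    """Return the name pairs classified as referring to the same company.

    pairs: iterable of (name_a, name_b, similarity).
    shared_vessel_pairs: optional set of frozenset({name_a, name_b}) that were
        found for one shipId; when given, all other pairs are discarded.
    """
    accepted = []
    for name_a, name_b, similarity in pairs:
        if similarity < threshold:
            continue
        key = frozenset((name_a, name_b))
        if shared_vessel_pairs is not None and key not in shared_vessel_pairs:
            continue
        accepted.append((name_a, name_b))
    return accepted


def precision(accepted, annotated_same):
    """Share of accepted pairs that the annotators marked as the same company."""
    if not accepted:
        return None
    correct = sum(1 for a, b in accepted if frozenset((a, b)) in annotated_same)
    return correct / len(accepted)


# Made-up similarity values and annotations.
pairs = [
    ("Star Shipping Ltd", "Star Shipping Limited", 0.93),
    ("Star Shipping Ltd", "Sun Shipping Ltd", 0.90),
    ("Alpha Marine", "Alfa Marine SA", 0.80),
]
shared_vessel_pairs = {
    frozenset(("Star Shipping Ltd", "Star Shipping Limited")),
    frozenset(("Alpha Marine", "Alfa Marine SA")),
}
annotated_same = set(shared_vessel_pairs)

variant_1 = classify_same_company(pairs, 0.7)
variant_2 = classify_same_company(pairs, 0.7, shared_vessel_pairs)
print(precision(variant_1, annotated_same), precision(variant_2, annotated_same))
```

In this toy sample the similarity-only variant accepts a pair of distinct companies with similar names, while the filtered variant does not.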
The chart in Figure 5 presents the precision of
classification¹² according to the annotators. The blue
line presents the precision obtained for the pairs for
which at least one (company name, shipId, company
name) mapping was found in the available data, while
the green line corresponds to the pairs without this
additional requirement. The chart clearly shows that
the precision of the results obtained solely based on
string similarity is very low. Even after setting a
very high threshold, it is not higher than 0.5.
Utilization of (company name, shipId, company name)
mappings makes it possible to increase the precision
dramatically, even for much lower thresholds.
Based on the experiments, the disambiguation of
company names with the threshold equal to 0.7 was
conducted. As a result, for pairs found in the
identified mappings only, the precision of
disambiguation amounted to 90%. Using this approach, it
was possible to assign IDs to 11525 out of 115419
records, which constitutes around 10% of the company
names found in the data.
4 SUMMARY AND FUTURE WORK
The disambiguation of named entities is a basic task
which needs to be performed to integrate data coming
from heterogeneous internet data sources and to enable
further analysis of the integrated data. In the article,
various approaches to the disambiguation of named
entities related to the maritime domain are presented.
Using the developed approaches, for some types of
entities the disambiguation could be performed with
high accuracy. This concerns, inter alia, ports, flags,
vessels and classification societies.
Still, for the other types of entities, like
maritime-related companies or vessel types, there is a
need for further research and development of methods
which would provide a more precise fusion of data. For
the vessel types, a different data model (e.g. a
taxonomy with is-a relationships) could probably be
used. However, it would require a more prolonged
engagement of the domain experts. For the
disambiguation of company names, additional
reasoning may be implemented which would utilize
data from additional sources indicating which strings
are used to reference the same company. Still,
according to the performed evaluation, it may be
concluded that in general the presented approaches
may be successfully utilized in similar systems in the
future.
12 Precision is understood as the ratio of pairs correctly classified by the system
as referring to the same company to all pairs classified as such.
ACKNOWLEDGEMENT
This work was supported by a grant provided for the
project SIMMO: System for Intelligent Maritime
MOnitoring (contract no A1341RTGP), financed by
the Contributing Members of the JIP-ICET 2
Programme and supervised by the European Defence
Agency.