1 INTRODUCTION
The integration of various databases is a prerequisite for studying and
predicting different Earth processes, such as climate change and sea level
rise. Further analysis and modeling of natural phenomena require updating,
harmonizing and standardizing the various measured parameters in order to
prepare scientifically based assessments and forecasts. This requires the
available primary measurements and preliminarily analyzed data to be
subjected to qualitative and quantitative checks. Recent advances in
processing large amounts of data concern the development of algorithms that
extract hidden and potentially useful knowledge from them, on the assumption
that the data are complete and reliable.
One common problem in time series analysis is the presence of gaps
(sequences of missing values or omitted observations), which disrupt the
series or make it impossible to use for research and practical purposes. In
practice, different mathematical models and methods are applied to fill in
(“recover”) the missing values. Often, when these values span a short
interval, linear interpolation is sufficient. Commonly used methods for
reconstructing missing values in time series are:
1 Substitution by the mean value: no new information is added to the time
series, while the Root Mean Square Error (RMSE) is reduced;
2 Single linear, multiple linear or nonlinear regression, which takes the
available information into account, increases the dimensionality of the
sample and reduces the RMSE;
3 Multiple filling, the so-called Monte Carlo algorithms with Markov chains,
in which the missing value is filled with estimated values;
Gap Filling of Daily Sea Levels by Artificial Neural Networks
L. Pashova
National Institute of Geophysics, Geodesy and Geography, Bulgarian Academy of Sciences, Bulgaria
P. Koprinkova-Hristova
Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Bulgaria
S. Popova
Institute of System Engineering and Robotics, Bulgarian Academy of Sciences, Bulgaria
ABSTRACT: In recent years, intelligent methods such as artificial neural
networks have been successfully applied to data analysis in different fields
of the geosciences. One of the practical problems encountered is the presence
of gaps in time series, which prevents their comprehensive use for scientific
and practical purposes. The article briefly describes two types of artificial
neural network (ANN) architectures: Feed-Forward Backpropagation (FFBP) and
the recurrent Echo State Network (ESN). In some cases, the ANN can be used as
an alternative to the traditional methods for filling in missing values in
time series. We conducted several experiments to fill the missing values of
daily sea levels spanning a 5-year period using both ANN architectures. A
multiple linear regression was also applied for the same purpose. The sea
level data are derived from the records of the tide gauge Burgas, which is
located on the western Black Sea coast. The results show that the ANN models
perform better than the classical one and that they are very promising for
real-time interpolation of missing data in time series.
http://www.transnav.eu
the International Journal on Marine Navigation and Safety of Sea Transportation
Volume 7, Number 2, June 2013
DOI: 10.12716/1001.07.02.10
4 Kalman filter: a recursive two-step method that processes the time series
on the prediction-correction principle, etc.
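The two simplest options mentioned above, substitution by the mean value and linear interpolation over a short gap, can be sketched as follows. This is only an illustration: the function names, the use of `None` to mark a missing value and the sample numbers are assumptions made for the example, not part of the study.

```python
from statistics import mean


def fill_with_mean(series):
    """Replace every missing value (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in series]


def fill_linear(series):
    """Linearly interpolate interior gaps; leading/trailing gaps stay None."""
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    # Walk over consecutive pairs of observed positions and fill between them.
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):
            w = (i - left) / span
            filled[i] = (1 - w) * series[left] + w * series[right]
    return filled


# Hypothetical daily values with a two-day gap and a missing last value.
levels = [10.0, None, None, 16.0, 18.0, None]
print(fill_linear(levels))
print(fill_with_mean(levels))
```

Linear interpolation needs an observed value on both sides of the gap, so the trailing missing value is deliberately left unfilled here; a predictive model is required for that case.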
If there is a long sequence of missing values, the method used to fill the
gaps needs to be chosen very carefully because of its effect on the
subsequent analysis of the time series. Any method that can be used in such a
case has its advantages and disadvantages. Sometimes, when missing data are
not rare and occur in different segments of the time series, a suitable
compromise between computational speed and quality of results has to be made.
The choice of procedure depends mainly on the properties of the time series
and the main purpose of their analysis. Comparisons of different methods for
filling in missing values in time series are presented in (Dergachev et al.,
2001; Kondrashov & Ghil, 2006; Moffat et al., 2007; Musial et al., 2011).
Artificial neural networks (ANN), as an innovative approach, have greatly
enhanced the opportunities for analysis and treatment of information because
they have less restrictive requirements with respect to the available
knowledge about the character of the relationships among the processed data,
functional models, type of distribution, etc. They provide a rich, powerful
and robust nonparametric modeling framework with proven and potential
applications in many fields of science. The advantages of ANN have encouraged
many researchers to use neural network models in a broad spectrum of
real-world applications. Sometimes, ANNs are a better alternative, either
substitutive or complementary, to the traditional computational schemes for
solving many scientific and engineering problems (e.g., Wenzel & Schröter,
2010; Pashova & Popova, 2011). The multilayer ANN with feed-forward
connections trained using the backpropagation algorithm (Feed-Forward
Backpropagation Network, FFBP) is one of the first neural architectures
widely used for modeling nonlinear dependences (Rumelhart and Clelland, 1986;
Allende et al., 2002). For modeling dynamic dependences, however, it is often
necessary to use a recurrent ANN (RNN). One such modern architecture, the
so-called “echo” state network (Echo State Network, ESN), offers a simplified
training algorithm and has become widely used for studying nonlinear
dynamical dependencies (Jaeger, 2003; Lukosevicius & Jaeger, 2009;
Koprinkova-Hristova et al., 2011). These ANNs are recognized as the best
models for time series analysis and prediction (Zang & Behera, 2012).
Nearshore sea level variations are of great importance for studying relative
sea level change, the practical realization of the height reference surface
in geodesy, many coastal engineering applications, etc. These variations are
registered by tide gauges, whose continuous records of the sea level
represent a superposition of many stochastic and nonlinear processes. Missing
observations in time series of this type are very common. This requires the
application of various interpolation and/or extrapolation methods that fill
the incomplete time series with the accuracy necessary for further analysis.
The article presents the results obtained after applying two types of
artificial neural networks and a multiple linear regression (MLR) to fill
gaps in time series of daily sea levels. Data from the tide gauge in Burgas,
located on the western Black Sea coast, spanning the period 1985-1989 are
analyzed to model the maximum, mean and minimum sea levels. A comparison of
the performance of the two ANN models and the MLR model for filling gaps in
the daily sea levels is also presented.
2 ARTIFICIAL NEURAL NETWORK MODELS USED IN THE STUDY
Since sea level variations can be represented as a nonlinear dynamic process,
ANN architectures were considered appropriate candidates for modeling them.
Different approaches to applying ANNs to sea level analysis can be found in
(Tsai et al., 2009; Pashova & Popova, 2011). There are several network
architectures that can be used for modeling and filling the missing values of
sea levels. Multiple linear regression is another method often used for
filling missing values in time series. In this study we applied the FFBP and
ESN architectures and compared them with an MLR model. The performance of the
ANNs and the MLR was assessed in terms of the root mean square error (RMSE)
and the correlation coefficient R (or the coefficient of determination R²).
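The two assessment criteria can be computed as in the minimal sketch below. The function names and sample values are illustrative assumptions; R is taken here as the Pearson correlation between observed and modeled values, and R² simply as its square.

```python
import math


def rmse(observed, modeled):
    """Root mean square error between two equally long sequences."""
    n = len(observed)
    return math.sqrt(sum((o - m) ** 2 for o, m in zip(observed, modeled)) / n)


def correlation(observed, modeled):
    """Pearson correlation coefficient R between observed and modeled values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_m = sum(modeled) / n
    cov = sum((o - mean_o) * (m - mean_m) for o, m in zip(observed, modeled))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_m = sum((m - mean_m) ** 2 for m in modeled)
    return cov / math.sqrt(var_o * var_m)


# Hypothetical observed and modeled sea levels.
obs = [1.0, 2.0, 3.0, 4.0]
mod = [1.5, 2.5, 3.5, 4.5]
print(rmse(obs, mod), correlation(obs, mod), correlation(obs, mod) ** 2)
```

The example also shows why both criteria are reported: a constant offset leaves R at 1 while the RMSE is clearly nonzero.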
2.1 Feed-forward backpropagation (FFBP) ANN
Feed-forward (FF) or layered ANNs are among the first neural network
architectures; their typical structure is shown in Figure 1. They consist of
several consecutive layers of nonlinear units called neurons. Connections are
allowed only between neighboring layers and are directed from the first
(input) to the last (output) layer. The specification of the FF model
structure includes determining the number of input and output neurons
(depending on the specifics of the function to be modeled), choosing the
number of hidden layers and the number of neurons in each of them, and
choosing the nonlinear processing functions of all neurons (usually a kind of
sigmoid-shaped nonlinearity). The “neurons” in the first layer, shown as
squares, are not typical nonlinear units. They only distribute the input
vector to the first hidden layer (marked by circles in Figure 1). It is well
known that one hidden layer is usually sufficient to model any complex
nonlinear dependence between the input and output vectors. This type of
neural network is usually trained by error backpropagation (BP), from which
its popular name has been shortened to FFBP.
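The output of each hidden layer of neurons is calculated as a nonlinear
function of the linear combination of the outputs of the neurons in the
previous hidden layer:

$$x_j(t) = f\left(W_{ij}\, x_i(t)\right) \tag{1}$$

For the input and the first hidden layer the dependences are:

$$x_1(t) = f\left(W_{in}\, x_{in}(t)\right), \qquad x_{in}(t) = \left[\mathrm{in}(t),\ \mathrm{in}(t-\Delta t),\ \dots,\ \mathrm{in}(t-n\Delta t)\right]^{T} \tag{2}$$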
[Figure 1 labels: in(t), z^-1, in(t−Δt), in(t−nΔt), TDLs, W_in, W_ij, x_1(t), x_2(t), x_k(t), out(t)]
Figure 1. Neural network with feed-forward backpropagation (FFBP) architecture.
Here t denotes a discrete moment in time, $\Delta t$ is the (sampling)
discretization step and f is a monotonically increasing function, usually a
nonlinear sigmoid (logistic sigmoid or hyperbolic tangent) for the hidden
layers and usually a linear function for the output layer of the network.
This architecture represents a static dependence model between the input and
output vectors of the network. To model the dependences of a dynamical
process, lines of time delay elements (TDLs for short) are inserted at the
network input to keep a “memory” of the past states of the modeled process.
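A minimal sketch of how a delay line turns a scalar series into input vectors of the form used in equation (2), followed by one forward pass through a single hidden layer with a hyperbolic tangent and a linear output. The function names are hypothetical, the weights are placeholders rather than trained values, and bias terms are omitted because equations (1) and (2) do not show them.

```python
import math


def tdl_samples(series, n_delays):
    """Build (input vector, target) pairs from a delay line.

    Each input is [in(t), in(t-1), ..., in(t-n_delays)] and the target is the
    next value of the series.
    """
    samples = []
    for t in range(n_delays, len(series) - 1):
        window = [series[t - k] for k in range(n_delays + 1)]
        samples.append((window, series[t + 1]))
    return samples


def forward(window, w_in, w_out):
    """Hidden layer x_1 = tanh(W_in x_in), then a linear output neuron."""
    hidden = [math.tanh(sum(w * x for w, x in zip(row, window))) for row in w_in]
    return sum(w * h for w, h in zip(w_out, hidden))


series = [float(v) for v in range(10)]      # placeholder data
samples = tdl_samples(series, n_delays=6)   # 6 delays give 7 inputs per sample
print(len(samples), samples[0])
```

With `n_delays=6` each input vector has 7 elements, because the undelayed value is included together with its 6 delayed copies.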
2.2 Echo state network (ESN)
Echo state networks (ESN) are a relatively new class of RNNs that belong to
the so-called “reservoir” approach (Lukosevicius & Jaeger, 2009). The main
idea of this approach is to generate a rich “reservoir” of dynamic neurons
with nonlinear activation functions and recurrent connections between them.
The network output is calculated as a linear combination of the current
states of the “reservoir” neurons. Training of this type of architecture is
simplified to setting the parameters of the linear combination (i.e. the
weights of the connections between the “reservoir” and the output) using the
least squares algorithm. Hence the RNN training is significantly faster, and
applying the recursive version of the training algorithm also allows on-line
training.
The Echo State Network (Jaeger, 2003) is a simplified version of the
“reservoir” architecture with a sigmoid output nonlinearity of the
“reservoir” neurons (usually the hyperbolic tangent). Figure 2 shows the
structure of the ESN. Its output layer calculates a linear combination of the
current state of the network input in(t) and the “reservoir” X(t) as follows:

$$\mathrm{out}(t) = W_{out} \begin{bmatrix} \mathrm{in}(t) \\ X(t) \end{bmatrix} \tag{3}$$
Figure 2. Echo state network (ESN) architecture.
$W_{out}$ is an $n_{out} \times (n_{in} + n_X)$ dimensional matrix, where
$n_{out}$, $n_{in}$ and $n_X$ are the dimensions of the vectors out, in and X
respectively. The current state of the “reservoir” neurons depends on their
previous state and on the current network input:

$$X(t) = \tanh\left(W_{in}\, \mathrm{in}(t) + W_{res}\, X(t-\Delta t)\right) \tag{4}$$
Here $W_{in}$ and $W_{res}$ are matrices containing the weights of the
connections at the input and inside the “reservoir”, with dimensions
$n_X \times n_{in}$ and $n_X \times n_X$ respectively. These matrices are
randomly generated and are not subject to training. The recurrent connections
inside the “reservoir” create a “memory” of the past network states, which
makes this architecture a suitable candidate for modeling dynamic
dependences. Its advantage over static layered neural networks with TDLs at
the input is that no a priori information is needed about the number of TDLs
required for the particular process to be modeled.
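Equations (3) and (4) can be illustrated with the short sketch below. It is not the toolbox implementation used in the study: the helper names, the uniform weight ranges and the scaling of the reservoir weights (chosen here so that every row of the reservoir matrix has absolute sum below 1) are assumptions made only for this example, and the readout weights are untrained placeholders.

```python
import math
import random


def random_matrix(n_rows, n_cols, bound, rng):
    """Matrix with entries drawn uniformly from [-bound, bound]."""
    return [[rng.uniform(-bound, bound) for _ in range(n_cols)]
            for _ in range(n_rows)]


def update_state(state, u, w_in, w_res):
    """Equation (4): new state = tanh(W_in in(t) + W_res * previous state)."""
    return [math.tanh(sum(a * x for a, x in zip(row_in, u)) +
                      sum(b * s for b, s in zip(row_res, state)))
            for row_in, row_res in zip(w_in, w_res)]


def readout(u, state, w_out):
    """Equation (3): out(t) = W_out [in(t); X(t)]."""
    z = list(u) + list(state)
    return [sum(w * v for w, v in zip(row, z)) for row in w_out]


rng = random.Random(0)                      # fixed seed for repeatability
n_in, n_res, n_out = 1, 15, 1
w_in = random_matrix(n_res, n_in, 0.5, rng)
w_res = random_matrix(n_res, n_res, 0.9 / n_res, rng)    # assumed scaling
w_out = random_matrix(n_out, n_in + n_res, 0.1, rng)     # untrained placeholder

state = [0.0] * n_res
for value in [0.1, 0.3, -0.2, 0.4]:         # hypothetical scaled inputs
    state = update_state(state, [value], w_in, w_res)
    out = readout([value], state, w_out)
print(len(state), out)
```

Only `w_out` would be estimated during training; `w_in` and `w_res` stay fixed after they are generated.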
3 APPLICATION OF DIFFERENT MODELS FOR FILLING UP MISSING DAILY SEA LEVELS
3.1 Handling incomplete time series of sea levels
Forecasting sea level variations in real time is an important activity in the
design of coastal engineering structures, decision-making related to the
navigation of vessels and the construction of offshore platforms in the Black
Sea. The main sources of information for studying these variations are the
continuously operating tide gauges established on the sea coasts. Such
information is urgently needed to support the development, calibration and
improvement of the operational capacity of integrated systems for forecasting
and early warning of dangerous natural phenomena in the sea and coastal
areas. Continuous monitoring of the sea level along the Bulgarian Black Sea
coast has been carried out since 1928 (Pashova & Popova, 2011). Since then,
the data on average daily, monthly and annual values of the sea level contain
gaps of different duration. The missing data are due to various factors:
technical reasons, failure of the recording equipment; interruption of the
registration due to defective recording equipment; misuse and incorrect
handling of records by the field staff; etc. During extreme events such as
storm surges or high waves, continuous registration is also interrupted
because of the technical limitations of the equipment.
[Figure 2 labels: in(t), W_in, Reservoir, W_res, X(t), W_out, out(t)]
To fill gaps in the sea level time series, the tidal regime in the Black Sea
has to be known a priori. The missing values for different time periods are
filled in for scientific and applied research purposes. Gaps in observational
data used for modeling and forecasting natural phenomena should be restored
at the earliest possible stage of processing the original measurements. The
classical methods for modeling sea level fluctuations (e.g. harmonic
analysis) cannot always represent the complex time-varying meteorological
effects on sea level, which are produced by weather conditions such as wind,
atmospheric pressure, rainfall, etc. Therefore, the models need to be adapted
in real time in order to account better for the time-varying environmental
changes.
3.2 FFBP and ESN
The structure of the FFBP neural network model was chosen after repeated
testing for the optimal choice of parameters (Pashova and Popova, 2011). An
individual FFBP model is trained for each variable (daily maximum H_max, mean
H_mean or minimum H_min sea level). Increasing the number of neurons and the
number of delays allows the ANN to solve more complicated problems, but it
requires more computation and tends to overfit the data when the numbers are
set too high. After several tests, the best number of tapped delay lines
(TDLs) was determined to be 6, based on the autocorrelation function of the
daily values. Hence the input vector of each model consists of the previous 7
daily values of the modeled variable (the most recent value and its 6 delayed
copies), i.e. its size is 7. The output of the network predicts the current
value of the variable, i.e. its size is 1. The number of neurons in the
remaining layers is determined by applying the criteria of minimum squared
error and highest correlation coefficient between the observed and modeled
daily sea levels. The number of neurons in the hidden layer was chosen on the
basis of multiple reruns of different structures of the FFBP models (Pashova
& Popova, 2011). One hidden layer was found to be appropriate for modeling
sea levels, and the optimal number of neurons in it was found to be 15. Hence
our FFBP model has a 7:15:1 architecture. The Matlab programming environment
is used for training the FFBP models (Demuth & Beale, 2000; Gilat, 2011). The
standard training procedure divides the time series randomly into 3 parts in
the ratio 70:15:15% for training, testing and verification respectively.
Training is done with the Levenberg-Marquardt algorithm, which usually has
the fastest convergence for FFBP networks. The iterations are stopped when
the error on the verification sample begins to increase. This model, the
criteria for evaluating its applicability, the main characteristics of the
time series of daily sea levels and the factors influencing sea level change
are described in detail in previous studies (Pashova & Popova, 2011; Pashova
et al., 2012).
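The random 70:15:15 division of the samples can be sketched as below. This is not the Matlab routine used in the study; the function name, the fixed seed and the sample count of 1000 are assumptions chosen for illustration.

```python
import random


def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Randomly split the indices 0..n-1 into three disjoint parts."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # seeded for repeatability
    n_first = round(fractions[0] * n)
    n_second = round(fractions[1] * n)
    first = idx[:n_first]
    second = idx[n_first:n_first + n_second]
    third = idx[n_first + n_second:]        # whatever remains
    return first, second, third


train, test, verify = split_indices(1000)
print(len(train), len(test), len(verify))
```

The part held out for verification is the one whose error would be monitored to decide when to stop the iterations.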
The structure of the ESN model also contains 15 neurons in the “reservoir” to
be comparable with the FFBP ANN model. It was found that the difference in
the prediction results for the sea level data between reservoirs of 15 and
100 neurons was insignificant. To evaluate the effect of the “memory” of the
“reservoir”, two versions of the ESN model were trained for the modeled daily
sea level values of the three variables: with 1 input (one step back in time)
and with 7 inputs (7 steps back in time). The ESN model was trained using a
freely available Matlab toolbox
(http://www.reservoircomputing.org/software). In contrast to the FFBP model,
the time series are divided into training and test samples in a ratio of
85:15%. Since the ESN is trained by a non-iterative procedure that applies
linear regression with a single presentation of each element of the training
sample, there is no need to define stopping criteria for its training.
In the case of batch training of the ESN, all training data for the model
input are presented consecutively to the network, and the corresponding
output is calculated and collected. The weights of the output connections are
determined by solving a linear regression equation in one step using all
network input/output data. Hence the reservoir state “evolves” with each new
data point as if there were no “gaps”. In the case of on-line training, each
input of the training data is presented to the network. The corresponding
output is calculated and the output weights are adjusted using the recursive
least squares (RLS) method. If a “data gap” is reached, the output predicted
by the model is used to replace the missing data at the model input. In this
way the reservoir state depends on the ESN model predictions and evolves
according to the knowledge about the process dynamics accumulated up to the
current moment. This allows more “realistic” predictions, especially for
longer data gaps.
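One step of the recursive least squares adjustment can be written as in the sketch below, for a single linear output y ≈ w · x. In the ESN setting x would be the stacked vector of input and reservoir state and w one row of the output weights. The function name, the forgetting factor `lam`, the large initial matrix `P` and the synthetic data are assumptions of this example rather than settings reported in the study.

```python
import math


def rls_update(w, P, x, y, lam=1.0):
    """One RLS step: returns the updated weights and inverse-correlation matrix."""
    n = len(x)
    Px = [sum(P[i][j] * x[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(x[i] * Px[i] for i in range(n))
    gain = [v / denom for v in Px]
    err = y - sum(wi * xi for wi, xi in zip(w, x))   # error before the update
    w_new = [wi + gi * err for wi, gi in zip(w, gain)]
    # P stays symmetric, so the row vector x^T P equals Px.
    P_new = [[(P[i][j] - gain[i] * Px[j]) / lam for j in range(n)]
             for i in range(n)]
    return w_new, P_new


# Synthetic check: recover known weights from noise-free data.
true_w = [2.0, -3.0, 0.5]
w = [0.0, 0.0, 0.0]
P = [[1e6 if i == j else 0.0 for j in range(3)] for i in range(3)]
for t in range(50):
    x = [math.sin(t), math.cos(2 * t), 1.0]
    y = sum(a * b for a, b in zip(true_w, x))
    w, P = rls_update(w, P, x, y)
print(w)
```

Each sample is processed once and then discarded, which is what makes this form of training usable in real time.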
The outcomes of training both types of ANN models depend directly on the
initial conditions; therefore 20 ESN and FFBP models were generated and
trained. The averaged mean squared errors (MSE) of the simulation with all
the data and the regression coefficients R, as well as the errors MSE_b and
regression coefficients R_b of the best-trained models, are presented in
Table 1.
3.3 MLR model
The missing values of daily sea levels for the same period were filled in for
the three time series under study by multiple linear regression (MLR) using
the following model:

$$\hat{y}(t+1) = \alpha_1\, y(t) + \alpha_2\, y(t-1) + \dots + \alpha_{s+1}\, y(t-s) \tag{5}$$

where $\hat{y}(t+1)$ is the sea level predicted by the MLR model, which is
filled in instead of the missing daily value, $y(t)$ is the current daily
mean, and $s$ is the number of backward steps, as in the case of the ANN
models. The predicted missing value is a linear combination of several
independent variables: the mean daily sea level and several daily values
before it. The unknown coefficients $\alpha_1, \alpha_2, \dots, \alpha_{s+1}$
are determined initially using all available values of the daily sea levels
for the 5-year period.
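A least squares estimate of the coefficients of equation (5) can be sketched as follows. The approach shown (normal equations solved by Gaussian elimination) and the names `solve` and `fit_mlr` are choices made for this example and are not taken from the study; rows that touch a missing value (`None`) are skipped, and the system is assumed to be well conditioned.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][j] * x[j] for j in range(r + 1, n))) / M[r][r]
    return x


def fit_mlr(series, s):
    """Estimate the s+1 coefficients that predict y(t+1) from y(t), ..., y(t-s)."""
    rows, targets = [], []
    for t in range(s, len(series) - 1):
        if None in series[t - s:t + 2]:     # skip windows touching a gap
            continue
        rows.append([series[t - k] for k in range(s + 1)])
        targets.append(series[t + 1])
    m = s + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(m)]
    return solve(A, b)


# Synthetic series generated by y(t+1) = 0.5 y(t) + 0.3 y(t-1).
y = [1.0, 2.0]
for _ in range(18):
    y.append(0.5 * y[-1] + 0.3 * y[-2])
coeffs = fit_mlr(y, s=1)
print(coeffs)
```

On this noise-free synthetic series the fit should return coefficients close to 0.5 and 0.3; real sea level data would of course give only an approximate fit.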
To fill missing daily values in gaps of different length, we proceed as
follows for all model types: if several consecutive values are missing, then
each missing value predicted by a model is included as known in the line of
the 6 TDL values used to predict the next one; this operation is repeated,
moving forward one step at a time, until all the consecutive missing values
are filled. This process continues until all missing values for the relevant
period are filled in.
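The step-by-step procedure just described can be sketched as a loop that feeds each prediction back into the lag window. The function name `fill_gaps`, the `None` marker and the stand-in predictor (a two-point linear extrapolation instead of a trained ANN or MLR model) are assumptions made for this illustration.

```python
def fill_gaps(series, predict, n_lags):
    """Fill None entries left to right, reusing earlier predictions as known values.

    `predict` receives the n_lags most recent values, newest first.
    """
    filled = list(series)
    for t in range(n_lags, len(filled)):
        if filled[t] is None:
            lags = [filled[t - 1 - k] for k in range(n_lags)]
            if None not in lags:            # needs a complete lag window
                filled[t] = predict(lags)
    return filled


def extrapolate(lags):
    """Stand-in one-step predictor: continue the line through the last two values."""
    return 2 * lags[0] - lags[1]


levels = [1.0, 2.0, None, None, 5.0, None]
print(fill_gaps(levels, extrapolate, n_lags=2))   # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Because predictions are reused as inputs, errors can accumulate along a long gap, which is why the choice of the one-step model matters more as gaps get longer.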
For all the models, the least squares error was the criterion minimized by
the respective training procedure used to estimate the unknown parameters of
the corresponding model. MATLAB codes were written to train and test each ANN
and MLR model.
4 RESULTS AND DISCUSSION
In this study, the time series of observations of the mean daily sea level is
regarded as a sequence of discrete values at regular intervals with the
sampling step $\Delta t = 1$ day. Here the daily maximum H_max, mean H_mean
and minimum H_min sea levels are modeled, which are determined with
millimeter precision relative to the “zero” point of the tide gauge Burgas.
To test the applicability of the ANN for filling the missing values in the
time series of daily sea levels, the period from 1 January 1985 to 31
December 1989 was selected. The required number of values in each of the
three time series is 1826, of which 151 (8.3%) are missing. Most of the data
gaps span from 1 to 3-4 days, up to 1 to 3 weeks, within the five-year
analyzed period.
The results for the full 5-year period of study are presented graphically in
Fig. 1. The results for two periods, covering two and three weeks of missing
values, are then presented in detail. In Fig. 2 (a, h) these periods lie
between days 1050 and 1110 and between days 1590 and 1660 of the observations
respectively. The observed daily sea levels and those modeled and predicted
by the three models are depicted in detail for both periods. Table 1 gives
the estimates of the MSE and the correlation coefficient R for all the
models. These estimates are obtained after filling the missing sea levels in
the time series.
The mean MSE obtained by averaging the MSEs of all 20 trained FFBP models and
the MSE_b of the best model differ by 0.2-0.4. The corresponding difference
between the mean MSE and the MSE_b of the best ESN model is an order of
magnitude smaller. This can be explained by the different algorithms used for
training the two neural networks. While the gradient algorithm used for the
FFBP model has a high probability of falling into a local minimum, a one-step
linear regression is used for training the ESN models. Although the
“reservoir” is generated randomly, similar results are obtained for all ESN
models. The best models of both ANN types were used to fill in the missing
daily sea levels in the three time series.
Comparatively lower accuracy is obtained for the on-line trained ESN model,
as can be seen in Table 1. This can be explained by the real-time training of
the network, which uses previous predictions of the model in the next
training steps. However, the achieved accuracy is still sufficient for
practical purposes. Moreover, the on-line procedure has the advantage of
training the model in real time with significantly fewer computational
resources than the other models. The results obtained for the on-line trained
ESN model are very promising for practical applications, given the need for
real-time prediction of sea level variations under extreme weather
conditions. This advantage can be used for modeling and predicting sea levels
with a smaller sampling step (e.g. several minutes), which is crucial in
forecasting coastal storm processes.
The comparison of the obtained estimates of the MSE of the FFBP, ESN and MLR
models shows that the correlation coefficients differ by ~0.05 from the
previous work (Pashova et al., 2012). This can be explained by the nature of
the modeled process and by the volume and location of the missing values in
the sample of daily sea levels.
The averaged MSE values of the 20 trained FFBP models in the previous work
(Pashova & Popova, 2011) were for a 2-year period, while the data in this
study refer to the 5-year period 1985-1989. The sample sizes for the two
periods differ; accordingly, the averaged mean square errors in training the
FFBP neural networks also differ. When a large volume of data is used for
training, the generalization ability of the ANN model increases, although the
RMSE may increase. This means that the model is able to predict with high
accuracy new values on which the network has not previously been trained.
Comparing the graphs and the estimation criteria presented in Table 1, we can
infer that:
The phenomenon of change in daily maximum,
mean and minimum sea levels is nonlinear, and
both types of ANN