Deep Learning for Biomedical Unstructured Time Series

1D convolutional neural networks (CNNs) for time series analysis, and inspiration from beyond the biomedical field
Petteri Teikari, PhD
Singapore Eye Research Institute (SERI)
Visual Neurosciences group
http://petteri-teikari.com/
Version “Wed 17 April 2019”

Time Series Analysis Very Short Intro

Time Series Basics
Regular time series vs. irregular time series
https://mediatum.ub.tum.de/doc/1444158/78684.pdf
Unstructured Biomedical 1D Time Series
Time-Frequency visualization
https://doi.org/10.3389/fnhum.2016.00605
Time series with discrete “states”
Sleep stages inferred from univariate or multivariate (multiple EEG electrode locations),
multimodal (EEG with ECG/EMG, etc.) dense 1D time series
Many types of ground truths possible also for 1D time
series: segmentation, classification, regression
https://arxiv.org/abs/1801.05394

Time Series Stationarity
Non-stationarities significantly distort short-term spectral, symbolic and entropy heart rate variability indices
November 2011, Physiological Measurement 32(11):1775-86
DOI: 10.1088/0967-3334/32/11/S05
Tests of Stationarity
https://stats.stackexchange.com/questions/182764/stationarity-tests-in-r-checking-mean-variance-and-covariance
Stationarity of order 2
For everyday use we often consider time series that have (instead of
strict stationarity): https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html
● a constant mean
● a constant variance
● an autocovariance that does not depend on time.
Such time series are known as second-order stationary or stationary of order 2.
Examples of non-stationary processes are random walk with or without a
drift (a slow steady change) and deterministic trends (trends that are
constant, positive or negative, independent of time for the whole life of the
series). https://www.investopedia.com/articles/trading/07/stationary.asp
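A rough way to see the second-order conditions in practice is to compare the mean and variance of the first and second half of a series. The sketch below is only an illustration under stated assumptions: the helper names (`half_stats`, `white_noise`, `random_walk`) and the synthetic signals are made up for this example, and a split-half comparison is not a formal stationarity test.

```python
import random
import statistics


def half_stats(series):
    """Return ((mean, variance) of first half, (mean, variance) of second half)."""
    mid = len(series) // 2
    first, second = series[:mid], series[mid:]
    return (
        (statistics.mean(first), statistics.pvariance(first)),
        (statistics.mean(second), statistics.pvariance(second)),
    )


def white_noise(n, seed=0):
    """Gaussian white noise: a second-order stationary process."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def random_walk(n, drift=0.0, seed=0):
    """Random walk with optional drift: a non-stationary process."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        x += drift + rng.gauss(0.0, 1.0)
        out.append(x)
    return out


if __name__ == "__main__":
    for name, series in [("white noise", white_noise(4000)),
                         ("random walk with drift", random_walk(4000, drift=0.5))]:
        (m1, v1), (m2, v2) = half_stats(series)
        print(f"{name}: mean {m1:.2f} -> {m2:.2f}, variance {v1:.2f} -> {v2:.2f}")
```

For the white noise the two halves should look alike, while for the random walk with drift the mean shifts strongly between halves.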

Time Series Analysis Literature Overview

Representation vs Similarity
https://arxiv.org/abs/1704.00794: “Time series
analysis approaches can be broadly categorized
into two families: (i) representation methods,
which provide high-level features for representing
properties of the time series at hand, and (ii)
similarity measures, which yield a meaningful
similarity between different time series for further
analysis.“
Classic representation methods are for instance
Fourier transforms, wavelets, singular value
decomposition, symbolic aggregate approximation,
and piecewise aggregate approximation.
Time series may also be represented through the
parameters of model-based methods such as
Gaussian mixture models (GMM), Markov models and
hidden Markov models (HMMs), time series bitmaps
and variants of ARIMA.
An advantage with parametric models is that they
can be naturally extended to the multivariate
case. For detailed overviews on representation
methods, we refer the interested reader to e.g.
Wang et al. (2013).
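As an example of a classic representation method, piecewise aggregate approximation (PAA) replaces a series by the means of equal-width frames. The following is a minimal sketch; the function name `paa` and the integer frame boundaries are choices made for this illustration, not taken from the cited works.

```python
def paa(series, n_segments):
    """Piecewise aggregate approximation: the mean of each of n_segments frames."""
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    out = []
    for i in range(n_segments):
        # Integer frame boundaries; frames differ by at most one sample in length.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        frame = series[start:end]
        out.append(sum(frame) / len(frame))
    return out


if __name__ == "__main__":
    print(paa([1, 2, 3, 4, 5, 6], 3))
```

The output has a fixed length regardless of the input length, which is what makes PAA usable as a compact representation.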
https://arxiv.org/abs/1704.00794: “Similarity-based approaches, once defined, such similarities
between pairs of time series may be utilized in a wide range of applications, such as
classification, clustering, and anomaly detection. Time series similarity measures include for
example dynamic time warping (DTW), the longest common subsequence (LCSS), the
extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and
represent state-of-the-art performance in univariate time series (UTS) prediction.
Attempts have been made to design kernels from non-metric distances such as DTW, of
which the global alignment kernel (GAK) is an example. There are also promising works on
deriving kernels from parametric models, such as the probability product kernel, Fisher kernel,
and reservoir based kernels. Common to all these methods is however a strong dependence
on a correct hyperparameter tuning, which is difficult to obtain in an unsupervised setting.
Moreover, many of these methods cannot naturally be extended to deal with multivariate time
series (MTS), as they only capture the similarities between individual attributes and do not
model the dependencies between multiple attributes. Equally important, these methods are not
designed to handle missing data, an important limitation in many existing scenarios, such
as clinical data where MTS originating from Electronic Health Records (EHRs) often contain
missing data.”
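To make the idea of a similarity measure concrete, here is a minimal sketch of dynamic time warping (DTW) for two univariate series. It assumes the absolute difference as local cost and no warping-window constraint; the function name `dtw_distance` is made up for this example, and practical implementations usually add constraints and speed-ups.

```python
def dtw_distance(a, b):
    """Unconstrained DTW distance between two univariate series."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = cheapest alignment of a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # repeat b[j-1]
                                     cost[i][j - 1],      # repeat a[i-1]
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]


if __name__ == "__main__":
    # The same shape at a different speed aligns with zero cost.
    print(dtw_distance([1, 2, 3], [1, 1, 2, 2, 3, 3]))
```

Note that this univariate version compares one attribute at a time, which is exactly the limitation for multivariate series pointed out in the quote.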
In this work, we propose a surgical site infection detection framework for
patients undergoing colorectal cancer surgery that is completely
unsupervised, hence alleviating the problem of getting access to labelled
training data. The framework is based on powerful kernels for multivariate
time series that account for missing data when computing similarities.

Analysis with Similarity Measures
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Karl Øyvind Mikalsen, Filippo Maria Bianchi, Cristina Soguero-Ruiz, Robert Jenssen (last revised 29 Jun 2017)
https://arxiv.org/abs/1704.00794 | https://github.com/kmi010/Time-series-cluster-kernel-TCK- (The TCK was implemented in R and Matlab)
Similarity-based approaches represent a
promising direction for time series analysis.
However, many such methods rely on
parameter tuning, and some have
shortcomings if the time series are
multivariate (MTS), due to dependencies
between attributes, or the time series
contain missing data.
In this paper, we address these challenges
within the powerful context of kernel
methods by proposing the robust time
series cluster kernel (TCK). The approach
taken leverages the missing data
handling properties of Gaussian
mixture models (GMM) augmented with
informative prior distributions. An ensemble
learning approach is exploited to ensure
robustness to parameters by combining the
clustering results of many GMM to
form the final kernel.
The experimental results demonstrated that the TCK
(1) is robust to hyperparameter settings, (2) is
competitive to established methods on prediction
tasks without missing data and (3) is better than
established methods on prediction tasks with missing
data.
In future works we plan to investigate whether the
use of more general covariance structures in the
GMM, or the use of HMMs as base probabilistic
models, could improve TCK.

Wavelets Shapelets → Shapelets ”1D Gabors” #1
Fast classification of univariate and multivariate time
series through shapelet discovery
https://doi.org/10.1007/s10115-015-0905-9
Josif Grabocka, Martin Wistuba, Lars Schmidt-Thieme
A Shapelet Selection Algorithm for Time Series Classification: New Directions
https://doi.org/10.1016/j.procs.2018.03.025
The high time complexity of the shapelet selection process hinders its application in real-time data processing.
To overcome this, in this paper we propose a fast shapelet selection algorithm (FSS), which sharply
reduces the time consumption of shapelet selection.
https://slideplayer.com/slide/8370683/
For example, a class of abnormal ECG measurement may be characterised by an
unusual pattern that only occurs occasionally at any point during the
measurement. Shapelets are subseries that capture this type of characteristic.
They allow for the detection of phase-independent localised similarity between
series within the same class.
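The phase independence comes from how a shapelet is matched: its distance to a series is the minimum distance over all windows of the same length. A minimal sketch follows, assuming plain Euclidean distance without the per-window normalisation that many shapelet methods apply; `shapelet_distance` is a name made up for this example.

```python
import math


def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and any same-length window."""
    m = len(shapelet)
    if m == 0 or m > len(series):
        raise ValueError("shapelet must be non-empty and no longer than the series")
    best = math.inf
    for start in range(len(series) - m + 1):
        dist = math.sqrt(sum((series[start + k] - shapelet[k]) ** 2
                             for k in range(m)))
        best = min(best, dist)
    return best


if __name__ == "__main__":
    spike = [0, 5, 0]
    # The spike is found wherever it occurs, and missed when it is absent.
    print(shapelet_distance([0, 0, 0, 5, 0, 0], spike))
    print(shapelet_distance([0, 5, 0, 0, 0, 0], spike))
    print(shapelet_distance([0, 0, 0, 0, 0, 0], spike))
```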
The great time series classification bake off: a review and experimental
evaluation of recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification

Wavelets Shapelets → Shapelets ”1D Gabors” #2
A fast shapelet selection algorithm for time
series classification
https://doi.org/10.1016/j.comnet.2018.11.031
The training time of shapelet-based algorithms is high, even though it is
computed off-line, and the authors aim to make it more efficient.
Shapelet transformation algorithms have attracted a great deal of attention in the last
decade. However, the time complexity of the shapelet selection process in shapelet
transformation algorithms is too high. To accelerate the shapelet selection process with
no reduction in accuracy, we presented FSS for ST.
The experimental results demonstrate that our proposed FSS was thousands of
times faster than the original shapelet transformation method with no reduction
in accuracy. Our results also demonstrate that our method was the fastest method
among shapelet methods that have the leading level of accuracy.

Representation Learning with deep learning #1
Towards a Universal Neural Network Encoder for Time
Series
Joan Serrà, Santiago Pascual, Alexandros Karatzoglou (Submitted on
10 May 2018) https://arxiv.org/abs/1805.03908
We have studied the use of a universal encoder for time
series in the specific case of classifying an out-of-sample data
set of an unseen data type. We have considered the cases of
no-adaptation, mapping adaptation, and full adaptation.
In all cases we achieve performances that are competitive with
the state-of-the-art that, in addition, involve a compact reusable
representation and few training iterations. We have also studied
the effect of the representation dimensionality, showing that
small representations have an impact to no-adaptation and
mapping adaptation approaches, but not much to full adaptation
ones.
In the future, we plan to refine the encoder architecture, as well
as optimizing some of the parameters we empirically use in our
experiments. A very interesting direction for future research is
the adoption of one-shot learning schemas (Snell et al. 2017;
Sutskever et al. 2014), which we find very suitable for the
current setting in time series classification problems.
A further option to enhance the performance of a universal
encoder is data augmentation, specially considering recent
linear instance/class interpolation approaches (Zhang et al. 2018).
In order to have sufficient knowledge to accomplish any task, and in order to be
applicable in the absence of labeled data or even without adaptation/re-training,
researchers have been increasingly adopting the generic concept of universal
encoders, specially within the text processing domain (note that related concepts also
exist in other domains).
The basic idea is to train a model (the encoder) that learns a common representation
which is useful for a variety of tasks and that, at the same time, can be reused for
novel tasks with minimal or no adaptation. While it would seem that classical
autoencoders and other unsupervised models should perfectly fit this purpose, recent
research in sentence encoding shows that, with current means, encoders learnt with a
sufficiently large set of supervised tasks, or mixing supervised and
unsupervised data, consistently outperform their purely unsupervised counterparts.

One Deep Music Representation to Rule Them All?
A comparative analysis of different representation learning
strategies
Jaehun Kim, Julian Urbano, Cynthia C. S. Liem, Alan Hanjalic
(Submitted on 13 Feb 2018)
Our work will address the following research questions:
– RQ1: Given a set of common learning tasks that can be used to train
a network, what is the influence of the number and type of the tasks on
the effectiveness of the learned deep representation?
– RQ2: How do various degrees of information sharing in the deep
architecture affect the ultimate success of a learned deep
representation?
– RQ3: What is the best way to assess the effectiveness of a deep
representation?
Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single
learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a
learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is
likely to be informative for the unseen task. At the same time, this representation may not be that informative to another
unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more
learning tasks increases robustness of the learned representation and its usability for a broader set of unseen tasks.

Learning Finer-class Networks for Universal
Representations
Julien Girard, Youssef Tamaazousti, Hervé Le Borgne, Céline
Hudelot (Submitted on 4 Oct 2018)
Many real-world visual recognition use-cases can not directly benefit from
state-of-the-art CNN-based approaches because of the lack of many
annotated data. The usual approach to deal with this is to transfer a
representation pre-learned on a large annotated source-task onto a target-
task of interest. This raises the question of how well the original
representation is "universal", that is to say directly adapted to many
different target-tasks. To improve such universality, the state-of-the-art
consists in training networks on a diversified source problem, that is
modified either by adding generic or specific categories to the initial set of
categories.
We propose two methods to improve universality, but pay special attention
to limit the need of annotated data. We also propose a unified
framework of the methods based on the diversifying of the training
problem. Finally, to better match Atkinson's cognitive study about
universal human representations, we proposed to rely on the
transfer-learning scheme as well as a new metric to evaluate universality.
We show that our method learns more universal representations than
state-of-the-art, leading to significantly better results on 10 target-tasks from
multiple domains, using several network architectures, either alone or
combined with networks learned at a coarser semantic level.

Improving Clinical Predictions through Unsupervised
Time Series Representation Learning
Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas,
Gunnar Rätsch (Submitted on 2 Dec 2018)
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
We empirically showed that in scenarios
where labeled medical time series data is
scarce, training classifiers on unsupervised
representations provides performance gains
over end-to-end supervised learning using
raw input signals, thus making effective use
of information available in a separate,
unlabeled training set.
The proposed model, explored for the first
time in the context of unsupervised patient
representation learning, produces
representations with the highest
performance in future signal prediction
and clinical outcome prediction,
exceeding several baselines.
The idea behind applying attention mechanisms to time series forecasting is to enable the
decoder to preferentially “attend” to specific parts of the input sequence
during decoding. This allows for particularly relevant events (e.g. drastic changes in heart
rate) to contribute more to the generation of different points in the output sequence.
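The "attend" step can be sketched as a softmax over similarity scores between the current decoder state and each encoder state. The example below assumes simple dot-product scoring, which may differ from the scoring used in the cited work; the names `softmax` and `attend` are made up for this illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Dot-product attention: return (weights, context vector)."""
    scores = [sum(q * h for q, h in zip(query, state))
              for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    # Context is the attention-weighted average of the encoder states.
    context = [sum(w * state[d] for w, state in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context


if __name__ == "__main__":
    # Third time step stands for a "drastic change" that matches the query.
    states = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0]]
    weights, context = attend([1.0, 0.0], states)
    print(weights, context)
```

The time step that best matches the query receives most of the weight and therefore dominates the context passed to the decoder.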

Unsupervised Scalable Representation Learning for Multivariate
Time Series
https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries
(PyTorch)
Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi
(Submitted on 30 Jan 2019)
Hence, we propose in the following an unsupervised
method to learn general-purpose representations for
multivariate time series that comply with the issues of
varying and potentially high lengths of the studied time
series. To this end, we adapt recognized deep learning tools
and introduce a novel unsupervised loss. Our
representations are computed by a deep convolutional
neural network with dilated convolutions (i.e. TCNs).
This network is then trained unsupervised, using the first
specifically designed triplet loss in the literature of
time series, taking advantage of the encoder resilience to
time series of unequal lengths.
We leave as future work the applicability of our method to
other tasks like forecasting, and the study of its impact if it
were to be added in powerful ensemble methods.
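For orientation, the generic margin-based triplet loss is sketched below. This is the textbook form, not the specific loss proposed in the paper (whose exact formulation is not given on this slide); the names `euclidean` and `triplet_loss` and the margin value are assumptions for this example.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic margin-based triplet loss on three embeddings.

    The loss is zero once the negative is at least `margin` farther
    from the anchor than the positive is.
    """
    return max(0.0,
               euclidean(anchor, positive)
               - euclidean(anchor, negative)
               + margin)


if __name__ == "__main__":
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))   # well separated
    print(triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5]))   # too close
```

In an unsupervised time series setting, the anchor and positive are typically drawn from the same series and the negative from another one, so no labels are needed.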

Unsupervised speech representation learning
using WaveNet autoencoder
Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron van den
Oord (Submitted on 25 Jan 2019)
We consider the task of unsupervised extraction of
meaningful latent representations of speech by applying
autoencoding neural networks to speech waveforms. The
goal is to learn a representation able to capture high level
semantic content from the signal, e.g. phoneme identities,
while being invariant to confounding low level details in the
signal such as the underlying pitch contour or background
noise. The behavior of autoencoder models depends on the
kind of constraint that is applied to the latent representation.
Our best models used MFCCs (mel-frequency cepstral
coefficient) as the encoder input, but reconstructed raw
waveforms at the decoder output. We used standard 13
MFCC features extracted every 10ms (i.e., at a rate of 100 Hz)
and augmented with their temporal first and second
derivatives. Such features were originally designed for
speech recognition and are mostly invariant to pitch and
similar confounding detail in the audio signal.

A Tale of Two Time Series Methods: Representation
Learning for Improved Distance and Risk Metrics
https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076345253-MIT.pdf
Divya Shanmugam (June 2018)
Architecture of the proposed model. A single convolutional layer
extracts local features from the input, which a strided maxpool
layer reduces to a fixed-size vector. A fully connected layer
with ReLU activation carries out further, nonlinear dimensionality
reduction to yield the embedding. A softmax layer is added at
training time.
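The forward pass described above (convolution, max pooling to a fixed size, fully connected layer with ReLU) can be sketched for a single input channel as follows. This is a simplified illustration, not the author's implementation: the helper names, the single-channel restriction, the binning scheme, and the hand-picked weights are all assumptions, and the training-time softmax layer is omitted.

```python
def conv1d(series, kernel):
    """Valid 1D cross-correlation of a series with a kernel."""
    k = len(kernel)
    return [sum(series[i + j] * kernel[j] for j in range(k))
            for i in range(len(series) - k + 1)]


def max_pool_fixed(features, n_bins):
    """Max over n_bins contiguous bins, so any input length maps to n_bins values."""
    n = len(features)
    return [max(features[i * n // n_bins:(i + 1) * n // n_bins])
            for i in range(n_bins)]


def dense_relu(vec, weights, bias):
    """Fully connected layer followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]


def embed(series, kernel, n_bins, weights, bias):
    """Convolution -> fixed-size max pooling -> dense ReLU embedding."""
    return dense_relu(max_pool_fixed(conv1d(series, kernel), n_bins),
                      weights, bias)


if __name__ == "__main__":
    kernel = [-1.0, 1.0]                  # responds to local increases
    weights = [[1.0, 0.0], [0.5, 0.5]]    # illustrative 2x2 dense layer
    bias = [0.0, 0.0]
    short = [0, 1, 3, 6, 4, 2, 1, 0, 0, 0]
    long_ = [0, 0, 1, 1, 2, 4, 8, 4, 2, 1] * 3
    # Series of different lengths give embeddings of the same size.
    print(embed(short, kernel, 2, weights, bias))
    print(embed(long_, kernel, 2, weights, bias))
```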
We introduce the multiple instance learning paradigm to risk
stratification. Risk stratification models aim to identify patients
at high risk for a given outcome so that doctors may intervene, with
the attempt of avoiding that outcome. Machine learning has led to
improved risk stratification models for a number of outcomes,
including stroke, cancer and treatment resistance [55]. To the best of
our knowledge, this is the first application of multiple instance learning
to risk stratification.
The extension of Jiffy to multi-label classification and unsupervised
learning poses a challenging but necessary task. The availability of
unlabeled time series data eclipses the availability of its annotated
counterpart. Thus, a simple network-based method for representation
learning on multivariate time series in the absence of labels is an important
line of work. There is also potential to further increase Jiffy’s speed by
replacing the fully connected layer with a structured [Bojarski et al. 2016]
or binarized [Rastegari et al. 2016] matrix.
The proposed risk stratification model extends naturally to a range of adverse
outcomes. The model is not limited to operating on ECG signals - it is
worth exploring whether the multiple instance learning approach may be
successful in other modalities of medical data, including voice. On a
theoretical level, strong generalization guarantees for distinguishing bags with
relative witness rates do not exist and are worth exploring as these models are
applied in the real world.

Intro to methods #1a
Highly comparative time-series analysis: the empirical
structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones

Intro to methods #1b
http://doi.org/10.1098/rsif.2013.0048
Structure in a library of 8651 time-series analysis operations. (a) A
summary of the four main classes of operations in our library, as determined by
a k-medoids clustering, reflects a crude but intuitive overview of the time-series
analysis literature. (b) A network representation of the operations in our library
that are most similar to the approximate entropy algorithm, ApEn(2,0.2) [7],
which were retrieved from our library automatically. Each node in the network
represents an operation and links encode distances between them (computed
using a normalized mutual information-based distance metric, cf. electronic
supplementary material, §S1.3.1). Annotated scatter plots show the outputs of
ApEn(2,0.2) (horizontal axis) against a representative member of each shaded
community (indicated by a heavily outlined node, vertical axis). Similar pictures
can be produced by targeting any given operation in our library, thereby
connecting different time-series analysis methods that nevertheless display
similar behaviour across empirical time series.
Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis
methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of
versatile techniques for addressing scientific time-series analysis problems, which are illustrated schematically in this figure.
The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as
empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and
operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each
panel.
(a) Time-series datasets can be organized automatically, revealing the structure in a given dataset (cf. figures 4a,b and 5a). (b) Collections of
scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures
3a and 5b). (c) Real-world and model-generated data with similar properties to a specific time-series target can be identified (cf. figure 4c,d).
(d) Given a specific operation, alternatives from across science can be retrieved (cf. figure 3b). (e) Regression: the behaviour of operations in
our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f)
Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the
differences between classes of labelled time-series datasets (cf. figure 5e).

Intro to methods #1c
http://doi.org/10.1098/rsif.2013.0048
Highly comparative techniques for time-series analysis tasks. We draw on our full
library of time-series analysis methods to: (a) structure datasets in meaningful ways,
and retrieve and organize useful operations for (b,e) classification and (c,d) regression
tasks. (a) Five classes of EEG signals are structured meaningfully in a two-dimensional
principal components space of our library of operations. (b) Pairwise linear
correlation coefficients measured between the 60 most successful operations for
classifying congestive heart failure and normal sinus rhythm RR interval series.
Clustering reveals that most operations are organized into one of three groups
(indicated by dashed boxes).

Most of the time, when people talk about time series and deep
learning, they are most likely talking about sequences (e.g. language)
instead of unstructured time series (e.g. a voice waveform)

“Sequences” vs “Time Series”
“Dense Time Series” at video frame rate
Ice hockey as a game can be simplified to discrete events (sequences)
Not always so black and white, but in our case time series are mainly dense 1D biosignals with ambiguous or missing discrete states

Time Series RNNs for sequences
The Unreasonable Effectiveness of
Recurrent Neural Networks
May 21, 2015 | Andrej Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
DanQ: a hybrid convolutional and recurrent deep neural network for
quantifying the function of DNA sequences
Daniel Quang, Xiaohui Xie. Nucleic Acids Research, Volume 44,
Issue 11, 20 June 2016, Page e107,
https://doi.org/10.1093/nar/gkw226
Deep Learning for Understanding Consumer Histories
by Tobias Lang - 25 Oct 2016
https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1
Sequences. Depending on your background you might be wondering:
What makes Recurrent Networks so special?

Time Series LSTMs Applied
DeepAir | UC Berkeley School of Information
https://www.ischool.berkeley.edu/projects/2017/deep-air
This project investigates the use of the LSTM recurrent neural network (RNN) as a
framework for forecasting in the future, based on time series data of pollution and
meteorological information in Beijing. Our results show that the LSTM framework
produces equivalent accuracy when predicting future time stamps compared to the
baseline support vector regression for a single time stamp. Using our LSTM framework,
we can now extend the prediction from a single time stamp out to 5 to 10 hours in the
future.
Overview of our self-supervised approach for posture and sequence representation learning
using CNNLSTM. After the initial training with motion-based detections we retrain our model for
enhancing the learning of the representations. https://doi.org/10.1109/CVPR.2017.399
Piano Genie: An Intelligent Musical Interface
Oct 15, 2018 | https://magenta.tensorflow.org/pianogenie
Chris Donahue (chrisdonahue, chrisdonahuey); Ian Simon (iansimon, iansimon); Sander Dieleman (benanne, sedielem)
A bidirectional LSTM encoder maps a sequence of piano notes to a sequence of controller
buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM
decoder then decodes these controller sequences back into piano performances. After
training, the encoder is discarded and controller sequences are provided by user input.

Time Series RNN/LSTMs are outdated? #1
The fall of RNN / LSTM
Eugenio Culurciello
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Combining multiple neural attention modules, comes the “hierarchical
neural attention encoder”… Notice there is a hierarchy of attention
modules here, very similar to the hierarchy of neural networks. This is also
similar to Temporal convolutional network (TCN)
→ Shapelets Attention Models, e.g. Pervasive Attention: 2D Convolutional
Neural Networks for Sequence-to-Sequence Prediction
Maha Elbayad, Laurent Besacier, Jakob Verbeek
(Submitted on 11 Aug 2018)
https://arxiv.org/abs/1808.03867 |
https://github.com/elbayadm/attn2d

An Empirical Evaluation of Generic Convolutional and
Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun
(Revised 19 Apr 2018)
https://arxiv.org/abs/1803.01271 | http://github.com/locuslab/TCN
For most deep learning practitioners, sequence modeling is
synonymous with recurrent networks. Yet recent results
indicate that convolutional architectures can outperform recurrent
networks on tasks such as audio synthesis and machine translation.
Given a new sequence modeling task or dataset, which architecture
should one use?
We conduct a systematic evaluation of generic convolutional and
recurrent architectures for sequence modeling. The models are
evaluated across a broad range of standard tasks that are commonly
used to benchmark recurrent networks. Our results indicate that a
simple convolutional architecture outperforms canonical
recurrent networks such as LSTMs across a diverse range of
tasks and datasets, while demonstrating longer effective memory. We
conclude that the common association between sequence modeling
and recurrent networks should be reconsidered, and convolutional
networks should be regarded as a natural starting point for sequence
modeling tasks.
The preeminence enjoyed by recurrent networks in sequence modeling
may be largely a vestige of history. Until recently, before the introduction of
architectural elements such as dilated convolutions and residual
connections, convolutional architectures were indeed weaker. Our
results indicate that with these elements, a simple convolutional
architecture is more effective across diverse sequence modeling tasks
than recurrent architectures such as LSTMs. Due to the comparable
clarity and simplicity of TCNs, we conclude that convolutional
networks should be regarded as a natural starting point and a
powerful toolkit for sequence modeling.
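The dilated convolutions mentioned above are what give a TCN its long effective memory. A minimal single-channel sketch of a causal dilated convolution follows, together with the receptive field of a stack that uses one such convolution per dilation level. The function names are made up for this example, and real TCN blocks (for instance with two convolutions per residual block) have a correspondingly different receptive field.

```python
def causal_dilated_conv(series, kernel, dilation):
    """y[t] = sum_k kernel[k] * x[t - k * dilation], with zero padding on the left.

    The output at time t depends only on inputs at times <= t (causality).
    """
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * series[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size, dilations):
    """Receptive field of stacked causal convolutions, one per dilation level."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    print(causal_dilated_conv([1, 2, 3, 4, 5], [1, 1], dilation=2))
    # Doubling the dilation per layer grows the receptive field exponentially.
    print(receptive_field(2, [1, 2, 4, 8]))
```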

Dilated Temporal Fully-Convolutional Network for
Semantic Segmentation of Motion Capture Data
Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik
Herrmann, Han Du, Klaus Fischer, Philipp Slusallek
(Submitted on 24 Jun 2018)
Semantic segmentation of motion capture sequences
plays a key part in many data-driven motion synthesis
frameworks. It is a preprocessing step in which long
recordings of motion capture sequences are partitioned
into smaller segments. Afterwards, additional methods like
statistical modeling can be applied to each group of
structurally-similar segments to learn an abstract motion
manifold. The segmentation task however often
remains a manual task, which increases the effort and
cost of generating large-scale motion databases.
We therefore propose an automatic framework for
semantic segmentation of motion capture data using a
dilated temporal fully-convolutional network. Our
model outperforms a state-of-the-art model in action
segmentation, as well as three networks for sequence
modeling.

Temporal Convolutional Networks and Dynamic Time Warping
can Drastically Improve the Early Prediction of Sepsis
Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten
Borgwardt (Submitted on 7 Feb 2019)
https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data
sources arising from the ICU. Futoma et al. (2017b) already
employed a subset of baseline covariates, medication effects, and
missingness indicator variables. However, a multitude of feature
classes still remain to be explored and properly integrated. For
instance, the combination of sequential and non-sequential
features has previously been handled by feeding non-sequential
data into the sequential model (Futoma et al.,2017a).
We hypothesize that this could be handled more efficiently by
using a more modular architecture that incorporates both
sequential and non-sequential parts. Furthermore, we aim to obtain
a better understanding of the time series features utilized by the
model. Specifically, we are interested in assessing the
interpretability of the learned filters of the MGP-TCN framework
and evaluate how much the activity of an individual filter contributes
to a prediction. This endeavor is somewhat facilitated by our use of a
convolutional architecture. The extraction of short per-channel
signals could prove very relevant for supporting diagnoses made by
clinical practitioners.
Overview of our model. The raw, irregularly spaced time series are provided to the Multi-task Gaussian Process
(MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly
spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which after a forward
pass returns a loss. Its gradient is then computed by backpropagating through the computational graph including
both the TCN and the MGP (green arrows). Both the MGP and TCN parameters are learned end-to-end during
training.
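The step from irregular samples to an evenly spaced grid can be illustrated with plain linear interpolation. Note that this is a deliberately simple stand-in and not the multi-task Gaussian process of the paper, which draws from a posterior distribution and is trained end-to-end; the function name `resample_to_grid` and the clamping at the edges are assumptions for this sketch.

```python
import bisect


def resample_to_grid(times, values, grid):
    """Linearly interpolate irregular samples onto grid times.

    `times` must be sorted in increasing order. Grid points outside the
    observed range take the nearest observed value.
    """
    out = []
    for g in grid:
        if g <= times[0]:
            out.append(float(values[0]))
        elif g >= times[-1]:
            out.append(float(values[-1]))
        else:
            j = bisect.bisect_right(times, g)  # times[j-1] <= g < times[j]
            t0, t1 = times[j - 1], times[j]
            v0, v1 = values[j - 1], values[j]
            out.append(v0 + (v1 - v0) * (g - t0) / (t1 - t0))
    return out


if __name__ == "__main__":
    # Measurements at hours 0, 2 and 5, resampled to an hourly grid.
    print(resample_to_grid([0, 2, 5], [0, 4, 10], [0, 1, 2, 3, 4, 5]))
```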
We evaluate all methods using Area under the Precision–Recall Curve
(AUPRC) and additionally display the (less informative) Area under the
Receiver Operator Characteristic (AUC). The current state-of-the-art
method, MGP-RNN, is shown in blue. The two approaches for early
detection of sepsis that were introduced in this paper, i.e. MGP-TCN and
DTW-KNN ensemble, are shown in pink and red, respectively. By using three
random splits for all measures and methods, we depict the mean (line) and
standard deviation error bars (shaded area).
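One common estimator of the area under the precision-recall curve is average precision, the mean of the precision values at the rank of each positive example. The sketch below assumes binary 0/1 labels and breaks score ties by input order; the paper may use a different estimator (for example trapezoidal integration), and `average_precision` is a name chosen for this example.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this positive
    return total / n_pos


if __name__ == "__main__":
    print(average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
```

Unlike the ROC-based AUC, this measure does not reward correctly ranked negatives, which is why it is often considered more informative when positives such as sepsis cases are rare.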

Clinical notes and text report understanding
Words as the sequences

Structuring Clinical Text
Comparative effectiveness of convolutional neural
network (CNN) and recurrent neural network (RNN)
architectures for radiology text report classification (2018)
https://doi.org/10.1016/j.artmed.2018.11.004
Department of Biomedical Data Science, Stanford University School of
Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for
information extraction from medical imaging free text
reports at a multi-institutional scale and compares them to the
state-of-the-art domain-specific rule-based system – PEFinder
and traditional machine learning methods – SVM and Adaboost.
Visualization methods have been developed to identify the
impact of input words on the output decision for both
deep learning models.
Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.

Clinical Text + Images
Unsupervised Multimodal Representation Learning across
Medical Images and Reports
(Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.)
https://arxiv.org/abs/1811.08615 MIT CSAIL
Joint embeddings between medical imaging modalities and
associated radiology reports have the potential to offer
significant benefits to the clinical community, ranging from cross-
domain retrieval to conditional generation of reports to the
broader goals of multimodal representation learning. In this work,
we establish baseline joint embedding results measured via both
local and global retrieval methods on the soon to be released
MIMIC-CXR dataset consisting of both chest X-ray images and
the associated radiology reports.
We establish baseline results using supervised and unsupervised joint embedding
methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval
evaluation metrics. Results show a possibility of incorporating more unsupervised data
into training for minimal-effort performance increase. A further study of joint
embeddings between these modalities may enable significant applications, such as
text/image generation or the incorporation of other EMR modalities.

Electronic Health Records
Visits as sequences,
each sequence can contain 1D biosignals

EHR Mining Risk Prediction model
Risk Prediction on Electronic Health Records with Prior
Medical Knowledge (2018)
https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for
risk prediction task, which can successfully incorporate
discrete prior medical knowledge into all of the state-of-the-
art predictive models using posterior regularization technique.
Different from traditional posterior regularization, we do not need
to manually set a bound for each piece of prior medical
knowledge when modeling desired distribution of the target
disease on patients. Moreover, the proposed PRIME can
automatically learn the importance of different prior knowledge
with a log-linear model.
The limitation of this work is that the proposed PRIME is only
effective for common diseases. For rare and emerging
diseases, since there is little medical knowledge about them, it
is hard to incorporate any prior knowledge into deep learning
predictive models. Thus, the proposed PRIME may achieve
similar performance to the state-of-the-art baselines. In our
future work, we will focus on how to improve predictive
performance of risk prediction for rare diseases.

Intro to cleaning
In the preprocessing component, the main purpose is to clean the
data, filter the unusual points and make it suitable as the input to the
CNN. Besides the normal steps, including timestamp alignment,
normalization and missing data imputation for time series data with
trend, the most important operation to improve the data quality is
outlier detection, interpolation and filtering, in particular for
clinical data. In clinical glucose time series there are many missing
or outlier data points due to errors in calibration and measurement,
and/or mistakes in the process of data collection and transmission.
Here, several methods are introduced to handle these scenarios [36].
●
Dimension Reduction Model: the time series can be projected into
lower dimensions using linear correlations such as principal component
analysis (PCA), and data with large residual errors can be considered
outliers.
●
Proximity-based Model: the data are characterized by nearest
neighbour analysis, cluster or density. Data instances that are
isolated from the majority are thus considered outliers.
●
Probabilistic Stochastic Filters: different filters for the signals, such
as Gaussian mixture models optimized using expectation-maximization.
In our case the filter can be implemented before the CNN, due to the
continuous characteristic of the input glycaemic time series data.
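To make the "detect, then interpolate" step concrete, here is a minimal sketch. It uses a Hampel-style rolling median filter, which is a swapped-in simple technique and not one of the three model families listed above, followed by linear interpolation over the flagged samples. The window size, the threshold and the toy glucose values are assumptions chosen for illustration.

```python
from statistics import median

def hampel_mask(x, half_window=3, n_sigmas=3.0):
    """Flag samples that deviate from the local median by more than
    n_sigmas robust standard deviations (1.4826 * MAD)."""
    flags = []
    for i, v in enumerate(x):
        window = x[max(0, i - half_window): i + half_window + 1]
        med = median(window)
        mad = median(abs(w - med) for w in window)
        # Note: a window of identical values gives mad == 0, so any
        # deviation from it is flagged; real data may need a floor here.
        flags.append(abs(v - med) > n_sigmas * 1.4826 * mad)
    return flags

def interpolate_flagged(x, flags):
    """Replace flagged samples by linear interpolation between the
    nearest unflagged neighbours (nearest value at the edges)."""
    good = [i for i, f in enumerate(flags) if not f]
    out = list(x)
    for i, f in enumerate(flags):
        if not f:
            continue
        left = max((g for g in good if g < i), default=None)
        right = min((g for g in good if g > i), default=None)
        if left is None and right is None:
            continue  # nothing trustworthy to interpolate from
        if left is None:
            out[i] = x[right]
        elif right is None:
            out[i] = x[left]
        else:
            t = (i - left) / (right - left)
            out[i] = x[left] + t * (x[right] - x[left])
    return out

# Hypothetical glucose readings (mmol/L) with one sensor glitch.
glucose = [5.0, 5.1, 5.2, 15.0, 5.4, 5.5, 5.6]
mask = hampel_mask(glucose)
print(mask)
print(interpolate_flagged(glucose, mask))
```

The same mask could also be used to mark samples as missing, so that a later imputation step treats outliers and true gaps uniformly.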
A convolutional neural network for ECG annotation as the basis for
classification of cardiac rhythms
Philipp Sodmann et al 2018 Physiol. Meas. in press
https://doi.org/10.1088/1361-6579/aae304
Signal cleaning:
In the data preprocessing, we performed resampling and signal denoising. We
resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG
segments of equal length on to the CNN.
To filter noisy components in the signal such as baseline wandering, respiration effects,
or powerline interference, we applied a discrete wavelet transform (DWT), which works
as a band-pass filter. For this, we used the Daubechies wavelet transform (Db4).
Before re-composition, each coefficient of the transform was multiplied by a factor
according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of
33 samples was applied to remove the persistent baseline.
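The final baseline-removal step can be sketched as a sliding trimmed mean that is subtracted from the signal. This is a minimal sketch under assumptions: the 15% is taken to be trimmed from each end of the sorted window, windows are truncated at the signal edges, and the wavelet denoising stage is not reproduced. The function names are hypothetical.

```python
def trimmed_mean(values, trim=0.15):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    s = sorted(values)
    k = int(len(s) * trim)           # samples dropped at each end
    core = s[k: len(s) - k]
    return sum(core) / len(core)

def remove_baseline(signal, window=33, trim=0.15):
    """Subtract a sliding trimmed-mean estimate of the baseline."""
    half = window // 2
    out = []
    for i, v in enumerate(signal):
        segment = signal[max(0, i - half): i + half + 1]
        out.append(v - trimmed_mean(segment, trim))
    return out

# Hypothetical trace: constant offset of 2.0 with one sharp peak.
ecg = [2.0] * 50
ecg[25] = 10.0
flat = remove_baseline(ecg)
print(flat[0], flat[25])
```

Because the extreme values in each window are trimmed before averaging, a sharp peak such as a QRS complex has little influence on the baseline estimate.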
https://doi.org/10.3389/fnins.2013.00267
MEG and EEG data analysis with MNE-Python

Time Series Invariances
A complexity-invariant distance measure for time series
https://doi.org/10.1137/1.9781611972818.60
Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J. Keogh.
In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM),
pages 699–710. SIAM, 2011. Cited by 216
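The slide gives only the citation, so the following is a sketch of the complexity-invariant distance (CID) as it is commonly stated: the Euclidean distance is multiplied by the ratio of the larger to the smaller "complexity estimate", where the complexity of a series is the length of the line obtained by stretching it out. The function names and the handling of flat series are assumptions made here for illustration.

```python
import math

def complexity_estimate(x):
    """Root of the summed squared successive differences of the series."""
    return math.sqrt(sum((x[i + 1] - x[i]) ** 2 for i in range(len(x) - 1)))

def cid(q, c):
    """Euclidean distance scaled by the complexity correction factor."""
    ed = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
    ce_q, ce_c = complexity_estimate(q), complexity_estimate(c)
    lo, hi = min(ce_q, ce_c), max(ce_q, ce_c)
    if lo == 0:
        # Assumption: two flat series keep the plain distance; a flat
        # series against a non-flat one is treated as infinitely far.
        return ed if hi == 0 else float("inf")
    return ed * hi / lo

# Same shape, different amplitude: the correction factor is 2.
print(cid([0, 2, 0, 2], [0, 1, 0, 1]))
```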

Time Series DTW the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using
Indexing Dynamic Time Warping and k*-Nearest Neighbors
Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018)
https://doi.org/10.1007/978-3-319-93794-6_7
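For readers who have not seen the classical method, dynamic time warping can be sketched as the textbook dynamic program below. It uses the absolute difference as local cost and no warping-window constraint; the indexing and k*-nearest-neighbor parts of the cited paper are not reproduced, and the function name is an assumption.

```python
def dtw_distance(a, b):
    """Cost of the cheapest monotone alignment between series a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # a advances, b repeats
                                 cost[i][j - 1],      # b advances, a repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]

# A series and a time-stretched copy of it align at zero cost.
print(dtw_distance([0, 1, 2, 3], [0, 0, 1, 2, 3]))
```

The quadratic cost in the series lengths is one reason the learned alternatives on the following slides are attractive for long clinical recordings.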

Learning invariances #1a
Learning to Exploit Invariances in Clinical
Time-Series Data using Sequence
Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens
(Submitted on 21 Aug 2018)
Recently, researchers have started applying convolutional neural
networks (CNNs) with 1D convolutions to clinical tasks
involving time-series data. This is due, in part, to their
computational efficiency, relative to recurrent neural networks
and their ability to efficiently exploit certain temporal invariances,
(e.g., phase invariance).
However, it is well-established that clinical data may exhibit many
other types of invariances (e.g., scaling). While preprocessing
techniques, (e.g., dynamic time warping) may successfully
transform and align inputs, their use often requires one to identify
the types of invariances in advance.
In contrast, we propose the use of Sequence Transformer
Networks, an end-to-end trainable architecture that learns to
identify and account for invariances in clinical time-series data.
Applied to the task of predicting in-hospital mortality, our
proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for
learning task-specific invariances related to amplitude, offset, and scale directly from
the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and
task-dependent transformations. In contrast to data augmentation approaches, our
proposed approach makes limited assumptions about the presence of invariances in the data.

Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series
Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens
(Submitted on 21 Aug 2018)
The proposed approach is not without limitation. More specifically, in its current form the
Sequence Transformer applies the same transformation across all features within an example,
instead of learning feature-specific transformations. Despite this limitation, the learned
transformations still lead to an increase in intra-class similarity. In conclusion, we are
encouraged by these preliminary results. Overall, this work represents a starting point on
which others can build. In particular, we hypothesize that the ability to capture local invariances
and feature-specific invariances could lead to further improvements in performance.

Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time
Series Using Sequence Autoencoders
Abubakar Abid, James Zou, Stanford University
(Submitted on 23 Oct 2018)
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time
warping (DTW), to apply on their data. In this paper, we propose Autowarp, an end-to-end
algorithm that optimizes and learns a good metric given unlabeled trajectories.
We define a flexible and differentiable family of warping metrics, which encompasses common
metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation
power of sequence autoencoders to optimize for a member of this warping distance
family. The output is a metric which is easy to interpret and can be robustly learned from relatively
few trajectories.
Future work will extend these results to more challenging time series data, such as those with higher
dimensionality or heterogeneous data.

Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks
Josif Grabocka, Lars Schmidt-Thieme (Submitted on 20 Dec 2018)
https://arxiv.org/abs/1812.08306
In this paper we propose to learn a warping function for
aligning the indices of time series in a deep latent
representation. We compared the suggested architecture
with two types of encoders (CNN, or RNN) and a deep
forward network as a warping function. Experimental
comparisons to non-parametric and un-warped Siamese
networks demonstrated that the proposed elastic deep
similarity measure is more accurate than prior models.

SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on
commodity hardware for imbalanced classification
Progress in Artificial Intelligence, December 2017, Volume 6,
Issue 4, pp 347–354
https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a
minority class, it is likely that a learning algorithm ignores the
latter and still achieves a high accuracy. There are three main
ways of dealing with these situations [16]:
●
Algorithmic modification: Modifying learning algorithms in
order to tackle the problem by design.
●
Cost-sensitive learning: Introducing costs for
misclassification of the minority class at data or algorithmic
level.
●
Data sampling: Preprocessing the data in order to reduce
the gap between the number of instances of each class.
The SMOTE technique is based on the idea of the
neighborhood of the k-nearest neighbor (kNN) rule.
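The neighborhood idea can be sketched in a few lines: each synthetic sample is placed at a random position on the segment between a minority sample and one of its k nearest minority neighbors. This is a minimal CPU sketch of the basic SMOTE interpolation, not the GPU implementation of the cited paper; the function name, the brute-force neighbor search and the toy points are assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic minority samples by kNN interpolation.
    Requires at least two minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # Brute-force k nearest minority neighbours of x (excluding x).
        others = [p for p in minority if p is not x]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment x -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Hypothetical minority class with four 2-D samples.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for sample in smote(minority, n_new=3):
    print(sample)
```

Oversampling of this kind should be applied to the training split only, otherwise synthetic neighbors of test samples leak into training.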
The area under the ROC curve results show that the use of
oversampling methods improves the detection of the minority
class in Big Data datasets. We have also shown how our design can
successfully work on a wide range of devices, including a laptop,
while requiring reasonable times, around 25 min on high-end devices,
and less than 2 h on the laptop, for the most time-demanding
experiment.
SMOTE for Learning from Imbalanced Data: Progress and
Challenges, Marking the 15-year Anniversary (2018)
https://doi.org/10.1613/jair.1.11192
●
GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015)
and OCHS-SSC (Dong et al., 2016) generate synthetic examples to
diminish the drawbacks produced by the absence of labeled examples.
Several learning techniques were checked and some properties,
such as the common hidden space between labeled samples and
the synthetic sample, were exploited.
●
The technique proposed by Park et al. (2014) is a
semi-supervised active learning method in which labels are
incrementally obtained and applied using a clustering algorithm.
In the context of the current challenges outlined, we highlighted the need
for enhancing the treatment of small disjuncts, noise, lack of data,
overlapping, dataset shift and the curse of dimensionality. To do so, the
theoretical properties of SMOTE regarding these data
characteristics, and its relationship with the new synthetic
instances, must be further analyzed in depth. Finally, we also posited
that it is important to focus on data sampling and pre-processing
approaches (such as SMOTE and its extensions) within the framework
of Big Data and real-time processing.

Outlier detection What to impute?

Types of Anomalies
global anomalies (x1, x2),
local anomaly x3,
micro-cluster c3.
A simple two-dimensional example
“This simple example already
illustrates that anomalies are not
always obvious and a score is
much more useful than a binary
label assignment.”
A Comparative Evaluation of Unsupervised
Anomaly Detection Algorithms for Multivariate
Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Three types of anomaly
schemes:
●
point anomaly detection
●
collective anomaly
●
contextual anomalies

State-of-the-art 2 years old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly
Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
Dozens of algorithms have been proposed in this area, but unfortunately
the research community still lacks a comparative universal evaluation as
well as common publicly available datasets.
These shortcomings are addressed in this study, where 19 different
unsupervised anomaly detection algorithms are evaluated on 10
different datasets from multiple application domains.
By publishing the source code and the datasets, this paper aims to
be a new well-founded basis for unsupervised anomaly detection
research. Additionally, this evaluation reveals the strengths and
weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend using
nearest-neighbor based methods, in particular k-NN for global tasks
and LOF for local tasks, instead of clustering-based methods. If
computation time is essential, HBOS is a good candidate, especially for
larger datasets. Special attention should be paid to the nature of the
dataset when applying local algorithms, and to whether local anomalies are of
interest at all in this case.
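The recommended k-NN global detector is simple enough to sketch directly: each point is scored by its average distance to its k nearest neighbors, so isolated points receive high scores. Using the average of the k distances (rather than only the k-th distance) and the brute-force search are assumptions of this sketch, as are the function name and toy data.

```python
import math

def knn_anomaly_scores(points, k=2):
    """Mean distance from each point to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Hypothetical data: a tight cluster plus one global anomaly.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = knn_anomaly_scores(data)
print(max(range(len(data)), key=scores.__getitem__))  # index of top-scored point
```

The output is a score per point rather than a binary label, which matches the earlier remark that a score is more useful than a hard assignment.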
Different anomaly detection modes
depending on the availability of labels
in the dataset.
(a) Supervised anomaly detection uses a
fully labeled dataset for training. (b) Semi-
supervised anomaly detection uses an
anomaly-free training dataset. Afterwards,
deviations in the test data from that normal
model are used to detect anomalies. (c)
Unsupervised anomaly detection
algorithms use only intrinsic information of
the data in order to detect instances
deviating from the majority of the data.

State-of-the-art 2 years old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for
Multivariate Data (2016) Markus Goldstein, Seiichi Uchida
A visualization of the results of the k-NN global
anomaly detection algorithm. The anomaly score is
represented by the bubble size whereas the color shows the
labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with Local Outlier Factor
(LOF) shows the usefulness of the reverse neighborhood set.
For the red instance, LOF takes only the neighbors in the gray
area into account resulting in a high anomaly score. INFLO
additionally takes the blue instances into account (reverse
neighbors) and thus scores the red instance more normal.

Anomaly detection Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for
Multivariate Time Series (2018)
Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng
Institute of Data Science, National University of Singapore
Unsupervised machine learning techniques can be used to model the
system behaviour and classify deviant behaviours as possible attacks.
In this work, we proposed a novel Generative Adversarial Networks-based
Anomaly Detection (GAN-AD) method for such complex networked CPSs.
We used LSTM-RNN in our GAN to capture the distribution of the
multivariate time series of the sensors and actuators under normal
working conditions of a CPS.
Instead of treating each sensor’s and actuator’s time series independently, we model
the time series of multiple sensors and actuators in the CPS
concurrently to take into account potential latent interactions between them.
To exploit both the generator and the discriminator of our GAN, we deployed the
GAN-trained discriminator together with the residuals between generator-reconstructed
data and the actual samples to detect possible anomalies in the
complex CPS.
We will also conduct further
research on feature
selection for multivariate
anomaly detection, and
investigate principled
methods for choosing the
latent dimension and PC
dimension with theoretical
guarantees.

Anomaly detection Financial time-series
Modeling approaches for time series forecasting and
anomaly detection (2018)
Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun
http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia
page accesses for a period of over twenty-four months. The methods
explored here are K-nearest neighbors (KNN), Long short-term memory
network (LSTM), and Sequence to Sequence with Convolution Neural
Network (CNN) and we will compare predicted values to actual web traffic.
The predictions can help us in anomaly detection in the series.
Pre-processing: “There are many series in which values are zero. This
could be a missing value, or actual lack of web page access. In addition,
there are significant spikes in the data, where values have a broad range
from 1 to hundreds/thousands for several web pages. We normalize this
data by adding 1 to all entries, taking the log of the values, and setting
the mean to zero and variance to one. We have the results of Fourier
analysis for exploring periodicity on a weekly/monthly/quarterly basis.”
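The normalization described in the quote can be sketched as follows: add 1, take the logarithm, then standardize to zero mean and unit variance. The function name and the toy page-view counts are assumptions; the handling of a constant series is also an assumption, since the quote does not cover it.

```python
import math
from statistics import mean, pstdev

def normalize_counts(series):
    """log(1 + x) transform followed by z-scoring."""
    logged = [math.log(v + 1) for v in series]
    mu, sigma = mean(logged), pstdev(logged)
    if sigma == 0:
        # Assumption: a constant series maps to all zeros.
        return [0.0] * len(logged)
    return [(v - mu) / sigma for v in logged]

# Hypothetical daily page views with zeros and one large spike.
views = [0, 3, 5, 0, 2500, 4, 6]
print(normalize_counts(views))
```

Adding 1 keeps zero counts finite under the logarithm, and the log compresses the spikes so that they do not dominate the variance.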
Our approaches to time series prediction depend on features extracted
from the time series data itself. Our models learn periodicity, ramp and
other regular trends quite well. However, none of our models are able to
capture spikes or outliers that arise from external sources. Enhancing
the performance of the models will require augmenting our feature set from
other sources such as news events and weather.

“Special Outliers” Disguised missing values
FAHES: A Robust Disguised Missing
Values Detector
Qatar Computing Research Institute, HBKU, Doha, Qatar
https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may
seriously affect data analytics such as simple statistics
and hypothesis testing. Generally speaking, there are
two types of missing values: explicitly missing
values (i.e. NULL values), and implicitly missing values
(a.k.a. disguised missing values (DMVs)) such as
"11111111" for a phone number and "Some college" for
education. While detecting explicitly missing values is
trivial, detecting DMVs is not; the essential challenge is
the lack of standardization about how DMVs are
generated.
One piece of future work we are planning
is to improve FAHES to
detect the DMVs that are generated
randomly within the range of the
data. For example, when a child tries
to create an account on a domain
that has a minimum age restriction,
the child fakes her age with a random
value that allows her to create the
account. Such random fake values
are hard, if not impossible, to detect.
Moreover, although DMVs are the
focus of this paper, there are more
types of errors found in the wild.
Many of the principles and
techniques we have used to detect
DMVs can be leveraged to detect
other types of errors, so a natural
next step is to extend the
infrastructure we have built to
detect those. This opens new
challenges related to the robust
identification of errors that could be
interpreted differently by different
modules.

Deep Learning Outlier Detection overview

Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased
Evaluation of “Outlier” Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
What makes this problem different from a typical supervised learning setting
is that we cannot model the diversity of out-of-distribution samples in
practice. The distribution of outliers used in training may not be the same as
the distribution of outliers encountered in the application. Therefore,
classical approaches that learn inliers vs. outliers with only two datasets
can yield optimistic results. We introduce OD-test, a three-dataset
evaluation scheme as a practical and more reliable strategy to assess
progress on this problem. The OD-test benchmark provides a
straightforward means of comparison for methods that address the
out-of-distribution sample detection problem.
In real life deployment of products that use complex machinery such as
deep neural networks (DNNs), we would have very little control over the
input. In the absence of extrapolation guarantees, when the independently
and identically distributed (IID) assumption is violated, the behaviour of the
pipeline may be unpredictable. From a quality assurance
perspective, it is desirable to detect and prevent these scenarios
automatically.
A reliable pipeline would first determine whether it can process a
given sample, then it would use the prediction of the target neural
network. The unfortunate incident that
mislabeled people as non-human, for instance, is a clear example of
OOD extrapolation that could have been prevented by such a
decision scheme: the model simply did not know that it did
not know. While incidents of similar nature have fueled research on
de-biasing the datasets and the deep learning machinery, we still
would need to identify the limitations of our models.
The application is not limited to fortifying large-scale user-
facing products. Successful detection of such violations could
also be used in active learning, unsupervised learning, learning with
noisy data, or simply be a condition to invoking transfer learning
strategies. In this work, we are interested in evaluating mechanisms
that detect OOD samples.

Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
“Outlier” Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View. A commonly invoked strategy in addressing
similar problems is to characterize a notion of uncertainty.
The literature distinguishes aleatoric uncertainty, the uncertainty inherent
to the process (the known unknowns, like flipping a coin), from epistemic
uncertainty, the uncertainty that can be eliminated with more information
(the unknown unknowns). The Bayesian approach to epistemic
uncertainty estimation is to measure the degree of disagreement among
the potentially viable models (the posterior).
The MC-Dropout approach is often advertised as a feasible method to
estimate uncertainty for a variety of applications. Similarly, we can adopt a
non-Bayesian approach by training independent models and then
measuring the disagreement. Lakshminarayanan et al. show that an ensemble of
five neural networks (DeepEnsemble) trained with an
adversarial-sample-augmented strategy is sufficient to provide a
non-Bayesian alternative to capturing predictive uncertainty. We evaluate
DeepEnsemble and MC-Dropout.
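"Measuring the disagreement" among ensemble members (or among MC-Dropout forward passes) can be sketched with an entropy decomposition: the entropy of the averaged prediction is the total uncertainty, and subtracting the average per-member entropy leaves the part due to disagreement. This particular decomposition is one common choice and an assumption here; the paper may score disagreement differently. Function names and probabilities are illustrative.

```python
import math

def entropy(p):
    """Shannon entropy (nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0)

def ensemble_uncertainty(member_probs):
    """member_probs: one class-probability vector per ensemble member
    (or per stochastic forward pass) for a single input.
    Returns (total uncertainty, disagreement between members)."""
    n = len(member_probs)
    mean_p = [sum(col) / n for col in zip(*member_probs)]
    total = entropy(mean_p)
    expected = sum(entropy(p) for p in member_probs) / n
    return total, total - expected

# Members agree: low disagreement even though each is slightly unsure.
print(ensemble_uncertainty([[0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]))
# Members are individually confident but contradict each other.
print(ensemble_uncertainty([[1.0, 0.0], [0.0, 1.0]]))
```

A high disagreement term is the signal of interest for out-of-distribution inputs, since it reflects what the models have not learned rather than noise in the data.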
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE), and a k-way logistic
regression loss (KL). CE loss is the typical choice for k-way classification tasks – it enforces
mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks –
it does not enforce mutual exclusivity of the predictions.
We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect
on the ability to predict OOD samples. CE loss cannot make a None prediction without an
explicitly defined None class, but KL loss can make None predictions through low activations of
all the classes.
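The difference between the two output layers can be seen in a small sketch. Softmax (paired with CE) always distributes a total probability of 1 over the known classes, whereas independent sigmoids (paired with the k-way logistic loss) can all be low at once. The 0.5 threshold, the function names and the logits are assumptions for illustration.

```python
import math

def softmax(logits):
    """Mutually exclusive class probabilities (sum to 1)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_or_none(logits, threshold=0.5):
    """k-way logistic view: return the top class, or None when every
    per-class activation stays below the threshold."""
    acts = [sigmoid(z) for z in logits]
    best = max(range(len(acts)), key=acts.__getitem__)
    return best if acts[best] >= threshold else None

weak = [-4.0, -5.0, -6.0]      # no class is supported by the evidence
print(softmax(weak))           # still forced to pick a favourite
print(predict_or_none(weak))   # low activations everywhere
print(predict_or_none([3.0, -2.0, -1.0]))
```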

Uncertainty and Novelty detection #1c
VGG-backed and Resnet-backed methods
significantly differ in accuracy. The gap
indicates the sensitivity of the methods to the
underlying networks.
This means that the image classification accuracy
may not be the only relevant factor in performance
of these methods. ODIN is less sensitive to the
underlying network.
Despite not enforcing mutual exclusivity, training
the networks with KL loss instead of CE loss
consistently reduces the accuracy of OOD
detection methods on average.

Uncertainty and Novelty detection #1d
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions:
(i) in-distribution techniques, and (ii) out-of-distribution techniques.
Guo et al. (2017) observed that
modern neural networks tend to
be overconfident in their
predictions. They show that
temperature scaling in the
softmax operator, a single-parameter
variant of Platt scaling, can be used to
calibrate the output probabilities of
a neural network to empirically
align the accuracy of a prediction
with its probability. Their efforts fall
under the uncertainty estimation
approaches.
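Temperature scaling divides the logits by a single scalar T before the softmax; T is chosen on held-out data to minimize the negative log-likelihood. The sketch below uses a coarse grid search instead of gradient-based optimization, which is an assumption made to keep it dependency-free, and the validation logits are hypothetical.

```python
import math

def softmax_t(logits, temperature=1.0):
    """Softmax of logits / temperature; T > 1 softens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels."""
    return -sum(math.log(softmax_t(z, temperature)[y])
                for z, y in zip(logit_rows, labels)) / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a grid that minimizes validation NLL."""
    grid = grid or [0.5 + 0.1 * i for i in range(46)]  # 0.5 .. 5.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Hypothetical validation set: confidence ~0.99 but only 3 of 4 correct.
val_logits = [[5.0, 0.0]] * 4
val_labels = [0, 0, 0, 1]
t = fit_temperature(val_logits, val_labels)
print(t, softmax_t([5.0, 0.0], t))
```

Because dividing by a positive T never changes the argmax, accuracy is unaffected; only the confidence attached to each prediction moves.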
Geifman and El-Yaniv (2017)
present a framework for selective
classification with deep neural
networks that follows the
abstention view. A selection
function decides whether to
make a prediction or not. For
the choice of selection function,
they experiment with MC-Dropout
and the softmax output. They
provide an analytical trade-off
between risk and coverage within
their formulation.
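The risk-coverage trade-off of a selective classifier can be sketched empirically: predictions whose confidence falls below a threshold are rejected, coverage is the fraction kept, and risk is the error rate among the kept predictions. This is only the empirical bookkeeping, not the guarantee derived in the cited work; names and numbers are assumptions.

```python
def risk_coverage(confidences, correct, threshold):
    """Return (coverage, risk) when abstaining below `threshold`.
    `correct` holds 1 for a correct prediction and 0 otherwise."""
    kept = [c for conf, c in zip(confidences, correct) if conf >= threshold]
    coverage = len(kept) / len(correct)
    risk = 1.0 - sum(kept) / len(kept) if kept else 0.0
    return coverage, risk

# Hypothetical softmax confidences and correctness flags.
conf = [0.90, 0.80, 0.60, 0.55]
ok = [1, 1, 0, 1]
for thr in (0.5, 0.7):
    print(thr, risk_coverage(conf, ok, thr))
```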
input perturbation serves as a way to assess how the network would behave nearby the given
input. When the temperature is 1 and the perturbation step is 0 we simply recover the
PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to
outperform the previous work [8] by a significant margin. We also assess the performance of ODIN
in our work.
These methods provide an abstract idea which depends on the successful training of GANs. To
the best of our knowledge, training GANs is itself an active area of research, and it is not apparent
what design decisions would be appropriate to implement these ideas in practice. Furthermore,
some of these ideas are prohibitively expensive to execute at the time of this writing.

Uncertainty and Novelty detection #1e
Datasets.
We extend the previous work by evaluating over a broader set
of datasets with varying levels of complexity. The
variation in complexity allows for a fine-grained evaluation of
the techniques. Since OOD detection is closely related to the
problem of density estimation, the dimensionality of the
input image will be of vital importance in practical
assessments. As the input dimensionality increases, we
expect the task to become much more difficult.
Therefore, to provide a more accurate picture of performance,
it is crucial to evaluate the methods on high dimensional data.
MC-Dropout
In low-dimensional
datasets, K-NNSVM
performs
similarly or better
than the other
methods.
The top-performing method, ODIN, is influenced by the
number of classes in the dataset. Similar to PbThreshold, ODIN
depends on the maximum signal in the class predictions,
therefore the increased number of classes would directly affect
both of the methods. Furthermore, neither of them consistently
prefers VGG over Resnet within all datasets. Overall, ODIN
consistently outperforms others in high-dimensional
settings, but all the methods have a relatively low average
accuracy in the 60%-78% range.

Uncertainty and Novelty detection #1f

Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Maya Gupta (2018)
Google Research; Google Brain
We propose a new score, called the trust
score, which measures the agreement
between the classifier and a modified
nearest-neighbor classifier on the testing
example. We show empirically that high
(low) trust scores produce surprisingly high
precision at identifying correctly (incorrectly)
classified examples, consistently
outperforming the classifier’s confidence
score as well as many other baselines.
Two example datasets and models. Predicting correctness (top row) and
incorrectness (bottom). The vertical dotted black line indicates accuracy level of the
classifier. The trust score consistently attains a higher precision for each given percentile
of classifier decision-rejection. Furthermore, the trust score generally shows increasing
precision as the percentile level increases, but surprisingly, many of the comparison
baselines do not.
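A simplified version of the trust score can be sketched as a distance ratio: the distance from the test point to the nearest training sample of any other class, divided by the distance to the nearest training sample of the predicted class. The sketch omits the density-based filtering of training points that the paper applies first, and the function name and data are assumptions.

```python
import math

def trust_score(x, predicted, train_points, train_labels):
    """Ratio > 1: the predicted class is also the closest class.
    Ratio < 1: some other class lies closer than the predicted one."""
    def nearest(keep):
        return min(math.dist(x, p)
                   for p, y in zip(train_points, train_labels) if keep(y))
    d_pred = nearest(lambda y: y == predicted)
    d_other = nearest(lambda y: y != predicted)
    return d_other / (d_pred + 1e-12)  # small constant avoids division by zero

# Hypothetical 2-D training set with two classes.
train_x = [(0, 0), (0, 1), (5, 5), (5, 6)]
train_y = [0, 0, 1, 1]
query = (0.2, 0.1)
print(trust_score(query, 0, train_x, train_y))  # classifier said class 0
print(trust_score(query, 1, train_x, train_y))  # classifier said class 1
```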

Interpreting Neural Networks With Nearest
Neighbors
Eric Wallace, Shi Feng, Jordan Boyd-Graber
Local model interpretation methods explain individual
predictions by assigning an importance value to each
input feature. This value is often determined by
measuring the change in confidence when a feature is
removed. However, the confidence of neural networks is
not a robust measure of model uncertainty.
This issue makes reliably judging the importance of the
input features difficult. We address this by changing
the test-time behavior of neural networks using
Deep k-Nearest Neighbors. Without harming text
classification accuracy, this algorithm provides a more
robust uncertainty metric which we use to generate
feature importance values.
The resulting interpretations better align with human
perception than baseline methods. Finally, we use our
interpretation method to analyze model predictions on
dataset annotation artifacts.
Deep k-Nearest Neighbors: Towards Confident,
Interpretable and Robust Deep Learning
Nicolas Papernot and Patrick D. McDaniel (2018)
Debugging ResNet model biases—This illustrates how the
DkNN algorithm helps to understand a bias identified by Stock and
Cisse [105] in the ResNet model for ImageNet. The image at the
bottom of each column is the test input presented to the DkNN.
Each test input is cropped slightly differently to include (left) or
exclude (right) the football. Images shown at the top are nearest
neighbors in the predicted class according to the representation
output by the last hidden layer. This comparison suggests that the
“basketball” prediction may have been a consequence of the ball
being in the picture. Also note how the white apparel color and
general arm positions of players often match the test image of
Barack Obama.

AND: Autoregressive Novelty Detectors
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara
(Submitted on 4 Jul 2018)
We propose an unsupervised model for novelty
detection. The subject is treated as a density estimation
problem, in which a deep neural network is employed to learn a
parametric function that maximizes probabilities of training
samples. This is achieved by equipping an autoencoder with a
novel module, responsible for the maximization of
compressed codes' likelihood by means of autoregression. We
illustrate design choices and proper layers to perform
autoregressive density estimation when dealing with both
image and video inputs. Despite a very general formulation, our
model shows promising results in diverse one-class novelty
detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder. Paired with a standard compression-reconstruction
network, a density estimation module learns the distribution of latent codes, via autoregression.

Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN
Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard
(Submitted on 11 Dec 2018)
https://arxiv.org/pdf/1812.02463
In this paper, we investigate GAN to perform anomaly detection on
time series dataset. In order to achieve this goal, a bibliography is
made focusing on theoretical properties of GAN and GAN used for
anomaly detection. A Wasserstein GAN has been chosen to learn the
representation of normal data distribution and a stacked encoder with
the generator performs the anomaly detection. W-GAN with encoder
seems to produce state of the art anomaly detection scores on MNIST
dataset and we investigate its usage on multi-variate time series.
Based on this literature review, we chose to perform anomaly detection
using a Wasserstein Generative Adversarial Network. The main
reason is that Wasserstein GAN does not collapse, in contrast to the
classical GAN, which needs to be heavily tuned in order to avoid this
problem. Mode collapse can be blocking if we need to perform
anomaly detection: if a subset of our data distribution is not learned by the
generator, then all samples that are similar to this subset might end up
classified as abnormal. Another added value of the Wasserstein GAN
compared to a standard GAN is the possibility of using the loss
function of the discriminator to evaluate convergence, since it is an
approximation of the Wasserstein distance between P_r and P_θ.
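For reference, the quantity that the WGAN discriminator (critic) loss approximates can be written in its Kantorovich-Rubinstein dual form. The symbol f for the 1-Lipschitz critic is notation assumed here; the slide text only names the real and generated distributions.

```latex
W(P_r, P_\theta) \;=\; \sup_{\|f\|_{L} \le 1}\;
  \mathbb{E}_{x \sim P_r}\big[f(x)\big] \;-\; \mathbb{E}_{x \sim P_\theta}\big[f(x)\big]
```

In practice the supremum is replaced by a neural network critic trained under a Lipschitz constraint, so the negated critic loss gives a running estimate of this distance during training.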
A future improvement consists in considering CNN for both
the generator and discriminator in order to detect anomalies from
raw time series data. 1-D convolutions are needed and will be
investigated to produce good visual representations of time
series samples. A more thorough study of the impact of the
architecture should also be done.

Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series
Data with Generative Adversarial Networks
Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng
(Submitted on 15 Jan 2019) Institute of Data Science, National University of Singapore
In this work, we propose a novel Multivariate Anomaly Detection
strategy with GAN (MAD-GAN) to model the complex multivariate
correlations among the multiple data streams to detect
anomalies using both the GAN-trained generator and discriminator.
Unlike traditional classification methods, the GAN-trained discriminator
learns to detect fake data from real data in an unsupervised fashion,
making it an attractive unsupervised machine learning technique for
anomaly detection.
Given that this is an early attempt on multivariate anomaly detection on
time series data using GAN, there are interesting issues that await further
investigations. For example, we have noted the issues of determining the
optimal subsequence length as well as the potential model instability of
the GAN approaches.
For future work, we plan to conduct further research on feature
selection for multivariate anomaly detection, and investigate principled
methods for choosing the latent dimension and PC dimension
with theoretical guarantees. We also hope to perform a detailed study on
the stability of the detection model. In terms of applications, we plan to
explore the use of MAD-GAN for other anomaly detection applications
such as predictive maintenance and fault diagnosis for smart buildings
and machineries.

Uncertainty Insights from NLP uncertainty
Quantifying Uncertainties in Natural Language
Processing Tasks
Yijun Xiao and William Yang Wang (Submitted on 18 May 2018)
In this paper, we propose novel methods to study the
benefits of characterizing model and data
uncertainties for natural language processing (NLP)
tasks. With empirical experiments on sentiment analysis,
named entity recognition, and language modeling using
convolutional and recurrent neural network models, we
show that explicitly modeling uncertainties is not only
necessary to measure output confidence levels, but also
useful at enhancing model performances in various
NLP tasks.
1. We mathematically define model and data
uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting
for model and data uncertainties, we observe
significant improvements in three important NLP tasks;
3. We show that our model outputs higher data
uncertainties for more difficult predictions in sentiment
analysis and named entity recognition tasks.
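The law of total variance mentioned in point 1 splits predictive variance into a model (epistemic) part and a data (aleatoric) part. The sketch below assumes that a network has been run T times with stochastic weights (for example with dropout left on) and returns a predicted mean and variance each time; the function name and the toy numbers are illustrative assumptions, not the authors' code.

```python
import numpy as np

def decompose_uncertainty(means, variances):
    """Split predictive variance via the law of total variance.

    means, variances: arrays of shape (T, N) holding, for T stochastic
    forward passes and N inputs, the predicted mean and predicted variance.
    Var[y] = Var_w(E[y | w]) + E_w(Var[y | w])
             (model uncertainty) + (data uncertainty)
    """
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    model_unc = means.var(axis=0)       # spread of the means across passes
    data_unc = variances.mean(axis=0)   # average predicted noise level
    return model_unc, data_unc, model_unc + data_unc

# Two passes, two inputs: the passes disagree on input 0 but agree on input 1.
means = [[1.0, 2.0],
         [3.0, 2.0]]
variances = [[0.5, 0.1],
             [0.5, 0.3]]
model_unc, data_unc, total_unc = decompose_uncertainty(means, variances)
```

Input 0 ends up with high model uncertainty because the passes disagree, while input 1 has only data uncertainty.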

Uncertainty CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes
Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio
Filippone. (Submitted on 26 May 2018)
Despite the considerable interest in combining CNNs
with GPs, little attention has been devoted to
understanding the implications in terms of the ability of
these models to accurately quantify the level of
uncertainty in predictions.
This is the first work that highlights the issues of
calibration of these models, showing that GPs cannot
cure the issues of miscalibration in CNNs. We
have proposed a novel combination of CNNs and GPs
where the resulting model becomes a particular form of
a Bayesian CNN for which inference using variational
inference is straightforward.
However, our results also indicate that combining CNNs
and GPs does not significantly improve the
performance of standard CNNs. This can serve as
a motivation for investigating new approximation
methods for scalable inference in GP models and
combinations with CNNs.
Calibration of Convolutional Networks:
The issue of calibration of classifiers in machine learning was popularized in the 90’s with the use of
support vector machines for probabilistic classification. Calibration techniques aim to learn a
transformation of the output using a validation set in order for the transformed output to give a reliable
account of the actual probability of class labels; interestingly, calibration can be applied regardless
of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques
include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs)
have been shown to be well-calibrated. The reason is that the optimization of the cross-entropy
loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the
correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a
particular case of DNNs, however, show that depth has a negative impact on calibration, despite
the use of a cross-entropy loss, and that regularization improves the calibration properties of
classifiers [Guo et al. 2017].
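One common way to quantify miscalibration is the expected calibration error (ECE), which compares confidence with accuracy inside confidence bins. The sketch below is a generic illustration under assumed inputs (per-sample top-class confidence and a 0/1 correctness flag); it is not taken from the papers quoted above, and the bin count is an arbitrary choice.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence per bin.

    confidences: predicted probability of the chosen class, in (0, 1].
    correct:     1 if the prediction was right, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap   # weight by the fraction of samples
    return float(ece)

# A calibrated toy classifier: 75% confident and right 3 times out of 4.
ece_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# An overconfident one: 95% confident but right only half of the time.
ece_overconfident = expected_calibration_error([0.95] * 4, [1, 0, 1, 0])
```

A post-hoc method such as Platt scaling would be judged by how much it reduces this gap on held-out data.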
Combinations of ConvNets and Gaussian Processes:
Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian
CNNs can “cure” the miscalibration of modern CNNs. Despite the abundant literature on Bayesian
DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these
approaches have not been investigated. In this work, we propose an alternative way to combine CNNs
and GPs, where GPs are approximated using random feature expansions. The random feature
expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation,
turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of
Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian
CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation
of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a
practical way of combining CNNs and GPs.

Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences
with Ambiguous Timestamps
(Submitted on 6 Dec 2018)
This paper proposes a method for modeling event
sequences with ambiguous timestamps, a time-
discounting convolution. Unlike in ordinary time series,
time intervals are not constant, small time-shifts
have no significant effect, and inputting timestamps or
time durations into a model is not effective. The criteria
that we require for the modeling are providing
robustness against time-shifts or timestamps
uncertainty as well as maintaining the essential
capabilities of time-series models, i.e., forgetting
meaningless past information and handling infinite
sequences.
The proposed method handles them with a
convolutional mechanism across time with specific
parameterizations, which efficiently represents the event
dependencies in a time-shift invariant manner while
discounting the effect of past events, and a dynamic
pooling mechanism, which provides robustness
against the uncertainty in timestamps and enhances the
time-discounting capability by dynamically changing the
pooling window size.

Types of Missing Values
Feldman et al. (2018): “Rubin (1976) discusses three possible
mechanisms for the formation of missing values, each reflecting a
different form of missing-data probabilities and relationships between the
measured variables, and each may lead to different imputation methods
(Luengo et al., 2012)”
Missing Completely at Random (MCAR): a missing value that cannot be
related to the value itself or to other variable values in that record. This is a
completely unsystematic missing pattern and therefore the observed data
can be thought of as a random unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to
other variable values in that record, but not to the value itself (e.g., a person with
a "marital status" value "single", has a missing value in the "spouse name"
attribute). In other words, in MAR scenarios, incomplete data can be partially
explained and the actual value can be possibly predicted by other variable
values.
Missing Not at Random (MNAR): the missing value is not random and
depends on the actual value itself; hence, cannot be explained by other values
(e.g., an overweight person is reluctant to provide the "weight" value in a
survey). MNAR scenarios are the most difficult to analyze and handle, as the
missing data cannot be associated with other data items that are available in
the dataset.
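The three mechanisms can be simulated on synthetic data to see how they differ. The sketch below uses made-up `age` and `weight` variables and arbitrary missingness probabilities, all of which are assumptions chosen for illustration; it only shows where the probability of a value being missing comes from in each case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
age = rng.uniform(20, 80, size=n)      # always observed
weight = rng.normal(75, 15, size=n)    # the variable that will get gaps

# MCAR: every weight has the same 20% chance of being missing.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness of weight depends only on the observed age.
mar_mask = rng.random(n) < np.where(age > 50, 0.4, 0.05)

# MNAR: missingness depends on the unobserved weight itself.
mnar_mask = rng.random(n) < np.where(weight > 90, 0.6, 0.05)

def observed_mean(values, mask):
    """Mean of the values that remain after applying a missingness mask."""
    return float(values[~mask].mean())

true_mean = float(weight.mean())
mcar_mean = observed_mean(weight, mcar_mask)
mnar_mean = observed_mean(weight, mnar_mask)
```

Under MCAR the observed mean stays close to the true mean, while under MNAR the heavy values are preferentially lost and the observed mean is biased downwards.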
https://statistical-programming.com/missing-data/
Missing inaction: the dangers of ignoring missing data
https://doi.org/10.1016/j.tree.2008.06.014

Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time
Series Data Using Different Interpolation Algorithms
August 2018
https://doi.org/10.1007/s10766-018-0595-5
“When collecting the Internet of Things data using various sensors or
other devices, it may be possible to miss several kinds of values of
interest. In this paper, we focus on estimating the missing values in IoT
time series data using three interpolation algorithms, including
(1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3)
Adaptive Inverse Distance Weighted.“
On the choice of the best imputation methods for missing values
considering three groups of classification methods
June 2011
https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
“In this work, we focus on a classification task with twenty-three classification methods
and fourteen different imputation approaches to missing values treatment that
are presented and analyzed. The analysis involves a group-based approach, in which
we distinguish between three different categories of classification methods.
Each category behaves differently, and the evidence obtained shows that the use of
determined missing values imputation methods could improve the accuracy obtained
for these methods. In this study, the convenience of using imputation methods
for preprocessing data sets with missing values is stated. The analysis suggests
that the use of particular imputation methods conditioned to the groups is required.“
We have discovered that the
Combined Multivariate Collapsing
(CMC) and Event Covering (EC)
methods show good behavior for
these two measures, and they are
two methods that provide good
results for an important range of
learning methods, as we have
previously analyzed. In short, these
two approaches introduce less
noise and maintain the mutual
information better.
Class center based approach for missing value
imputation 2018
https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, which is composed of
two modules. Each class center and its distances from the other
observed data are measured to identify a threshold. Then, the
identified threshold is used for missing value imputation. The
proposed approach outperforms the other approaches for both
numerical and mixed datasets. It requires much less imputation
time than the machine learning based methods.

Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time
Series
Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li
(Submitted on 27 May 2018) https://arxiv.org/abs/1805.10572
https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong
assumptions of the underlying data generating process,
such as linear dynamics in the state space. In this paper, we
propose BRITS, a novel method based on recurrent neural
networks for missing value imputation in time series data.
Our proposed method directly learns the missing
values in a bidirectional recurrent dynamical system, without
any specific assumption. The imputed values are treated as
variables of RNN graph and can be effectively updated during
the backpropagation. We simultaneously perform missing
value imputation and classification/regression of applications
jointly in one neural graph.
BRITS has three advantages: (a) it can handle multiple
correlated missing values in time series; (b) it generalizes
to time series with nonlinear dynamics underlying; (c) it
provides a data-driven imputation procedure and
applies to general settings with missing data.
We evaluate the imputation performance in terms of
mean absolute error (MAE) and mean relative error
(MRE).
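As a sketch of how these two metrics are typically computed, the snippet below evaluates imputed values only at positions that were deliberately hidden and for which the ground truth is known. The function name, the mask convention, and the normalization of MRE by the sum of absolute true values are assumptions for illustration and should be checked against the paper's own definitions.

```python
import numpy as np

def imputation_errors(imputed, truth, eval_mask):
    """MAE and MRE over the entries selected by eval_mask.

    eval_mask is True where a known value was hidden from the model,
    so the imputation can be scored against the ground truth.
    """
    imputed = np.asarray(imputed, dtype=float)
    truth = np.asarray(truth, dtype=float)
    eval_mask = np.asarray(eval_mask, dtype=bool)
    abs_err = np.abs(imputed - truth)[eval_mask]
    mae = abs_err.mean()
    mre = abs_err.sum() / np.abs(truth[eval_mask]).sum()
    return float(mae), float(mre)

truth = [1.0, 2.0, 3.0, 4.0]
imputed = [1.0, 2.5, 3.0, 3.0]
eval_mask = [False, True, False, True]   # positions 1 and 3 were held out
mae, mre = imputation_errors(imputed, truth, eval_mask)
```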

End-to-End Time Series Imputation via Residual Short Paths
Lifeng Shen, Qianli Ma, Sen Li (2018)
http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual
short paths, called Residual IMPutation LSTM (RIMP-LSTM), a
flexible combination of residual short paths with graph-based
temporal dependencies. We construct a residual sum unit (RSU),
which enables RIMP-LSTM to make full use of previous revealed
information to model incomplete time series and reduce the
negative impact of missing values. Moreover, a switch unit is
designed to detect the missing values and a new loss function is
then developed to train our model with time series in the presence of
missing values in an end-to-end way, which also allows
simultaneous imputation and prediction.
RIMP-LSTM combines the merits of graph-based models with
explicitly modeled temporal dependencies via weighted
residual connection between nodes, with the ones of LSTM that can
accumulate historical residual information and learn the underlying
patterns of incomplete time series automatically.
On the other hand, compared with IMP-LSTM, RIMP-LSTM has
better performance as it is good at modeling temporal
dependencies with weighted residual short paths, which
demonstrates the reasonableness of using these weighted residual
paths to model graph-like temporal dependencies for imputation.

A context encoder for audio inpainting
Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (Submitted on 29 Oct 2018)
http://www.github.com/andimarafioti/audioContextEncoder
(Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio
content based on its context, a process usually referred to as audio inpainting.
We focused on gaps in the range of tens of milliseconds, a condition which has
not received much attention yet. The proposed DNN structure was trained on
audio signals containing music and musical instruments, separately, with 64-ms
long gaps.
Here, the STFT features, meant as a reasonable first choice,
provided a decent performance. In the future, we expect more
hearing-related features to provide even better reconstructions. In
particular, an investigation of Audlet frames, i.e., invertible time-
frequency systems adapted to perceptual frequency scales, as
features for audio inpainting presents intriguing opportunities.
Here, preferred architectures are those not relying on a
predetermined target and input feature length, e.g., a recurrent
network. Recent advances in generative networks will provide
other interesting alternatives for analyzing and processing audio
data as well. These approaches are yet to be fully explored.
Finally, music data can be highly complex and it is unreasonable to
expect a single trained model to accurately inpaint a large number
of musical styles and instruments at once. Thus, instead of training
on a very general dataset, we expect significantly improved
performance for more specialized networks that could be
trained by restricting the training data to specific genres or
instrumentation. Applied to a complex mixture and potentially
preceded by a source-separation algorithm, the resulting
models could be used jointly in a mixture-of-experts approach.

Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation
Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (Submitted on 30 Jan 2019)
Leveraging multiresolution modeling and adversarial training, NAOMI is able to
learn the conditional distribution given very few known observations and
achieves superior performances in various experiments of both deterministic and
stochastic dynamics. Future work will investigate how to infer the
underlying distribution when complete training data is unavailable. The
trade-off between partial observations and external constraints is another direction for
deep generative imputation models.

Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in
classification problems
Received 09 Mar 2016, Accepted 22 Dec 2016, Accepted author version posted online: 13 Jan 2017,
https://doi.org/10.1080/03610926.2016.1277752
“This study suggests that the negative impact of poor data quality (DQ) on decision making is often
mediated by biased model estimation. To highlight this perspective, we develop an analytical framework
that links three quality levels – data, model, and decision. The general framework is first developed at a
high-level”
Evolutionary Machine Learning for
Classification with Incomplete Data
Tran, Cao Truong (2018, PhD Thesis)
http://hdl.handle.net/10063/7639
“The thesis develops approaches for
improving imputation for
classification with incomplete data by
integrating clustering and feature
selection with imputation. The approaches
improve both the effectiveness and the
efficiency of using imputation for
classification with incomplete data.
The thesis develops interval genetic
programming to directly evolve classifiers
for incomplete data. The results show that
classifiers generated by interval genetic
programming can be more effective and
efficient than classifiers generated by the
combination of imputation and traditional
genetic programming. Interval genetic
programming is also more effective than
common classification algorithms able to
work directly with incomplete data.”

Imputation and Classification
Missing Data Imputation for Supervised Learning
August 2018
https://doi.org/10.1080/08839514.2018.1448143
“This paper compares methods for imputing missing
categorical data for supervised classification tasks.“
The results of the present study show that perturbation can help increase predictive accuracy
for imputed models, but not one-hot encoded models. Future work can identify the conditions
under which missing-data perturbation can improve prediction accuracy. Interesting
extensions of this paper include evaluating the benefits of using missing-data
perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing data imputation, for various levels of MCAR-perturbed categorical training features (x-axis).
The Adult dataset contains N = 48,842 examples
and 14 features (6 continuous and 8 categorical). The
prediction task is to determine whether a person
makes over $50,000 a year.

Decomposition Literature Review

CEEMD Empirical Mode Decomposition
Empirical mode decomposition for
seismic time-frequency analysis
Jiajun Han and Mirko van der Baan
Geophysics (2013) 78 (2):O9-O19.
https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode
decomposition decomposes a
seismic signal into a sum of
oscillatory components, with
guaranteed positive and smoothly
varying instantaneous frequencies.
Analysis on synthetic and real data
demonstrates that this method
promises higher spectral-spatial
resolution than the short-time
Fourier transform or wavelet
transform. Application on field data
thus offers the potential of
highlighting subtle geologic
structures that might otherwise
escape unnoticed.
CEEMD is a robust extension of EMD methods. It
solves not only the mode mixing problem, but also leads to
complete signal reconstructions. After CEEMD,
instantaneous frequency spectra manifest visibly higher
time-frequency resolution than short-time Fourier and
wavelet transforms on synthetic and field data examples.
These characteristics render the technique highly
promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the
ensemble empirical mode decomposition (July 2015)
Computational Statistics 31(2):1-13 P. J. J. Luukko, Jouni Helske, E.
Räsänen. C, R and Python
http://doi.org/10.1007/s00180-015-0603-9
https://bitbucket.org/luukko/libeemd

Source Separation “signal decomposition” #1
Wave-U-Net: A Multi-Scale Neural Network for
End-to-End Audio Source Separation
Daniel Stoller, Sebastian Ewert, Simon Dixon
Queen Mary University of London, Spotify
(Submitted on 8 Jun 2018)
https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
“Models for audio source separation usually operate on the
magnitude spectrum, which ignores phase information and
makes separation performance dependant on hyper-parameters
for the spectral front-end. Therefore, we investigate end-to-end
source separation in the time-domain, which allows
modelling phase information and avoids fixed spectral
transformations. Due to high sampling rates for audio, employing a
long temporal input context on the sample level is difficult, but
required for high quality separation results because of long-range
temporal correlations.
In this context, we propose the Wave-U-Net, an adaptation of the
U-Net to the one-dimensional time domain, which repeatedly
resamples feature maps to compute and combine features at
different time scales. We introduce further architectural
improvements, including an output layer that enforces source
additivity, an upsampling technique and a context-aware
prediction framework to reduce output artifacts.
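One way to read "an output layer that enforces source additivity" is that the network predicts all but one source and obtains the last one as the remainder of the mixture. The sketch below illustrates that idea with NumPy on random numbers; the function name, the tanh squashing, and the array shapes are assumptions for illustration and not the reference implementation.

```python
import numpy as np

def difference_output(mixture, raw_estimates):
    """Turn K-1 unconstrained outputs into K sources that sum to the mixture.

    mixture:       array of shape (T,), the input waveform.
    raw_estimates: array of shape (K-1, T), hypothetical network outputs.
    """
    estimates = np.tanh(raw_estimates)          # keep estimates in (-1, 1)
    remainder = mixture - estimates.sum(axis=0) # last source by subtraction
    return np.vstack([estimates, remainder])

rng = np.random.default_rng(0)
mixture = rng.uniform(-1.0, 1.0, size=16)
raw = rng.normal(size=(2, 16))                  # stand-in for two network outputs
sources = difference_output(mixture, raw)
```

With this construction the sum of the estimated sources equals the mixture by design, so the constraint does not have to be learned.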
Experiments for singing voice separation indicate that our
architecture yields a performance comparable to a state-of-the-art
spectrogram-based U-Net architecture, given the same data.”
75 tracks from the training partition of the MUSDB
multi-track database are randomly assigned to
our training set. For singing voice separation, we
also add the whole CCMixter database to the
training set. No further data preprocessing is performed, only a
conversion to mono (except for stereo models) and downsampling to
22050 Hz.
For future work, we could investigate to
which extent our model performs a
spectral analysis, and how to incorporate
computations similar to those in a multi-
scale filterbank, or to explicitly compute
a decomposition of the input signal into a
hierarchical set of basis signals and
weightings on which to perform the
separation, similar to the TasNet [12].
Furthermore, better loss functions for
raw audio prediction should be investigated
such as the ones provided by generative
adversarial networks [3, 21], since the MSE
might not reflect the perceived loss of
quality well.

TasNet: Surpassing Ideal Time-Frequency
Masking for Speech Separation
Yi Luo, Nima Mesgarani
(Submitted on 21 Sep 2018)
“TasNet uses a convolutional encoder to create a representation
of the signal that is optimized for extracting individual speakers.
Speaker extraction is achieved by applying a weighting
function (mask) to the encoder output. The modified encoder
representation is then inverted to the sound waveform using a
linear decoder. A linear deconvolution layer serves as a decoder
by inverting the encoder output back to the sound waveform. This
encoder-decoder framework is similar to the ICA method when
a nonnegative mixing matrix (NMF) is used [Wang et al. 2009] and
to the semi-nonnegative matrix factorization method (semi-NMF)
[Ding et al. 2008], where the basis signals are the parameters of
the decoder.
The masks are found using a temporal convolutional network
(TCN) consisting of dilated convolutions, which allow the
network to model the long-term dependencies of the speech
signal. This end-to-end speech separation algorithm significantly
outperforms previous time-frequency methods in terms
of separating speakers in mixed audio, even when compared to
the separation accuracy achieved with the ideal time-frequency
mask of the speakers. In addition, TasNet has a smaller model size
and a shorter minimum latency, making it a suitable solution for
both offline and real-time speech separation applications.“
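The reason dilated convolutions can capture long-term dependencies is that the receptive field grows with the sum of the dilation factors rather than with depth alone. The arithmetic can be sketched as follows; the kernel size of 3 and the doubling dilation schedule are illustrative assumptions, not a statement of the configuration used in the paper.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field (in samples) of stacked stride-1 dilated 1D convolutions.

    Each layer with dilation d adds (kernel_size - 1) * d samples of context.
    """
    return 1 + sum((kernel_size - 1) * d for d in dilations)

# One stack with dilations 1, 2, 4, ..., 128 (eight layers).
dilations = [2 ** i for i in range(8)]
rf_one_stack = receptive_field(3, dilations)

# Repeating the same stack three times extends the context further.
rf_three_stacks = receptive_field(3, dilations * 3)

# Eight undilated layers for comparison.
rf_plain = receptive_field(3, [1] * 8)
```

Eight dilated layers cover 511 samples of the encoder output, compared with 17 for the same number of undilated layers.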

Disentangling Correlated Speaker and Noise for
Speech Synthesis via Data Augmentation and
Adversarial Factorization
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang,
Yonghui Wu, James Glass.
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
https://openreview.net/pdf?id=Bkg9ZeBB37
“To leverage crowd-sourced data to train multi-speaker text-
to-speech (TTS) models that can synthesize clean speech
for all speakers, it is essential to learn disentangled
representations which can independently control the
speaker identity and background noise in generated signals.
However, learning such representations can be challenging,
due to the lack of labels describing the recording conditions of
each training example, and the fact that speakers and
recording conditions are often correlated, e.g. since users
often make many recordings using the same equipment.
This paper proposes three components to address this
problem by: (1) formulating a conditional generative model
with factorized latent variables, (2) using data augmentation
to add noise that is not correlated with speaker identity and
whose label is known during training, and (3) using
adversarial factorization to improve disentanglement.
Experimental results demonstrate that the proposed method
can disentangle speaker and noise attributes even if
they are correlated in the training data, and can be used to
consistently synthesize clean speech for all speakers.”

Decompose High and Low frequencies
Drop an Octave: Reducing Spatial Redundancy in
Convolutional Neural Networks with Octave
Convolution
Yunpeng Chen, Haoqi Fan, Bing Xu, Zhicheng Yan, Yannis Kalantidis,
Marcus Rohrbach, Shuicheng Yan, Jiashi Feng
(Submitted on 10 Apr 2019)
https://export.arxiv.org/abs/1904.05049
In this work, we propose to factorize the mixed feature maps by
their frequencies and design a novel Octave Convolution
(OctConv) operation to store and process feature maps that vary
spatially "slower" at a lower spatial resolution reducing both memory
and computation cost. Unlike existing multi-scale methods,
OctConv is formulated as a single, generic, plug-and-play
convolutional unit that can be used as a direct
replacement of (vanilla) convolutions without any
adjustments in the network architecture. It is also orthogonal and
complementary to methods that suggest better topologies or
reduce channel-wise redundancy like group or depth-wise
convolutions. We experimentally show that by simply replacing
convolutions with OctConv, we can consistently boost
accuracy for both image and video recognition tasks, while reducing
memory and computational cost.

Decompose Signal and the Noise
Deep learning of dynamics and signal-noise
decomposition with time-stepping constraints
Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton
Department of Applied Mathematics / Mechanical Engineering, University of Washington, Seattle,
last revised 22 Aug 2018
https://github.com/snagcliffs/RKNN
“We propose a novel paradigm for data-driven modeling that
simultaneously learns the dynamics and estimates the
measurement noise at each observation. By constraining our
learning algorithm, our method explicitly accounts for measurement
error in the map between observations, treating both the
measurement error and the dynamics as unknowns to be
identified, rather than assuming idealized noiseless trajectories.
We also discuss issues with the generalizability of neural network
models for dynamical systems and provide open-source code for
all examples.”
The combination of neural networks and numerical time-stepping
schemes suggests a number of high-priority research
directions in system identification and data-driven forecasting.
Future extensions of this work include considering systems with
process noise, a more rigorous analysis of the specific method for
interpolating f, including time delay coordinates to accommodate
latent variables, and generalizing the method to identify
partial differential equations. Rapid advances in hardware and
the ease of writing software for deep learning will enable these
innovations through fast turnover in developing and testing
methods.

Signal Restoration Literature Review

Super-resolution Insights from audio
Time-frequency networks for audio super-resolution
Teck Yian Lim et al. (2018)
http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf
http://tlim11.web.engr.illinois.edu/
“Audio super-resolution (a.k.a. bandwidth extension) is
the challenging task of increasing the temporal resolution of
audio signals. Recent deep networks approaches achieved
promising results by modeling the task as a regression
problem in either time or frequency domain. In this paper,
we introduced Time-Frequency Network (TFNet), a
deep network that utilizes supervision in both the time and
frequency domain. We proposed a novel model architecture
which allows the two domains to be jointly optimized.”
Spectrogram corresponding to
the LR input (frequencies above
4 kHz missing), HR
reconstruction, and the HR
ground truth. Our approach
successfully recovers the high
frequency components from the
LR audio signal.

GANs Also for time-series denoising #1a
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi, Tim Oates, Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
“In this paper, we explicitly learn to remove
noise from time series data without
assuming a prior distribution of noise.
We propose an online, fully automated,
end-to-end system for denoising time series data.
Our model for denoising time series is trained
using unpaired training corpora and does
not need information about the source of the
noise or how it is manifested in the time series.
We propose a new architecture called
AsymmetricGAN that uses a generative
adversarial network for denoising time series
data.”
Consider, for example, a widely used method for time series featurization called Symbolic Aggregate
approXimation (SAX) that assumes time series are generated from a single normal distribution.
This assumption does not hold in several real-life time series datasets. Other techniques
assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This
assumption does not hold for data sources like Electroencephalography (EEG), where noise can have diverse
characteristics and originate from different sources. Hence, in this work, we focus on learning the
characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high
computational complexity and large memory requirements, making it unsuitable for real-time applications.
For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need
paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for
applications like artifact removal in EEG data, as we cannot record clean versions of noisy EEG.

GANs Also for time-series denoising #1b
Pre-processing
The DC component in EEG data is different for each
recording. We normalize every window of clean and
noisy data to remove the DC offset from the data. We
remove the DC offset by subtracting the median of the
data in the window.
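The per-window median subtraction described above can be sketched in a few lines. The window length and the handling of a trailing partial window (dropped here) are assumptions for illustration; the paper's exact windowing is not given on this slide.

```python
import numpy as np

def remove_dc_offset(signal, window):
    """Split a 1D signal into non-overlapping windows and subtract each median.

    Samples that do not fill a complete final window are dropped in this sketch.
    Returns an array of shape (n_windows, window).
    """
    signal = np.asarray(signal, dtype=float)
    n_full = (len(signal) // window) * window
    windows = signal[:n_full].reshape(-1, window)
    return windows - np.median(windows, axis=1, keepdims=True)

# Two windows with very different DC levels end up centred around zero.
eeg_like = [10.0, 11.0, 12.0, 100.0, 101.0, 105.0]
centred = remove_dc_offset(eeg_like, window=3)
```

The median is less sensitive than the mean to large-amplitude artifacts inside a window, which is a plausible reason for choosing it here.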
Evaluation of EEG data is challenging as the
ground truth noiseless signals are not
known. Multiple approaches to evaluation
have been proposed in recent years,
however, authors do not agree on a single
mechanism for evaluating artifact removal.

GANs Also for speech denoising
SEGAN: Speech enhancement generative
adversarial network.
Santiago Pascual, Antonio Bonafonte, and Joan Serra (2017)
https://github.com/santi-pdp/segan
“For the purpose of speech enhancement
and denoising, the SEGAN was developed,
employing a neural network with an encoder and
decoder pathway that successively halves and
doubles the resolution of feature maps in each
layer, respectively, and features skip connections
between encoder and decoder layers.
The model works as an encoder-decoder fully-
convolutional structure, which makes it fast to
operate for denoising waveform chunks. The
results show that, not only the method is viable, but it
can also represent an effective alternative to current
approaches.
Possible future work involves the exploration of
better convolutional structures and the inclusion of
perceptual weightings in the adversarial training,
so that we reduce possible high frequency artifacts
that might be introduced by the current model.
Further experiments need to be done to compare
SEGAN with other competitive approaches.”
The dataset is a selection of 30 speakers
from the Voice Bank corpus.
