Deep Learning
for Biomedical
Unstructured
Time-series
1D Convolutional neural
networks (CNNs) for time
series analysis, and
inspiration from beyond
biomedical field
Petteri Teikari, PhD
Singapore Eye Research Institute (SERI)
Visual Neurosciences group
http://petteri-teikari.com/
Version “Wed 17 April 2019”
Time Series Analysis: Very Short Intro
Time Series Basics
Regular time series vs. irregular time series
https://mediatum.ub.tum.de/doc/1444158/78684.pdf
Unstructured Biomedical 1D Time Series
Time-Frequency visualization
https://doi.org/10.3389/fnhum.2016.00605
Time series with discrete “states”
Sleep stages inferred from univariate or multivariate (multiple EEG electrode locations),
multimodal (EEG with ECG/EMG, etc.) dense 1D time series
Many types of ground truths are possible also for 1D time series: segmentation, classification, regression
https://arxiv.org/abs/1801.05394
Time Series Stationarity
Non-stationarities significantly distort short-term spectral, symbolic and entropy heart rate variability indices.
Physiological Measurement 32(11):1775-86, November 2011.
DOI: 10.1088/0967-3334/32/11/S05
Tests of Stationarity
https://stats.stackexchange.com/questions/182764/stationarity-tests-in-r-checking-mean-variance-and-covariance
Stationarity of order 2: For everyday use we often consider time series that have (instead of
strict stationarity): https://people.maths.bris.ac.uk/~magpn/Research/LSTS/TOS.html
● a constant mean
● a constant variance
● an autocovariance that does not depend on time.
Such time series are known as second-order stationary or stationary of order 2.
Examples of non-stationary processes are random walk with or without a
drift (a slow steady change) and deterministic trends (trends that are
constant, positive or negative, independent of time for the whole life of the
series). https://www.investopedia.com/articles/trading/07/stationary.asp
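As a concrete companion to these definitions (my addition, assuming Python with statsmodels is available), the ADF and KPSS tests from the StackExchange thread above can be run as follows; note that their null hypotheses point in opposite directions:

```python
# Minimal sketch: checking (weak) stationarity with ADF and KPSS.
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

rng = np.random.default_rng(0)
random_walk = np.cumsum(rng.normal(size=1000))   # non-stationary example
white_noise = rng.normal(size=1000)              # stationary example

for name, x in [("random walk", random_walk), ("white noise", white_noise)]:
    adf_p = adfuller(x)[1]                             # H0: unit root (non-stationary)
    kpss_p = kpss(x, regression="c", nlags="auto")[1]  # H0: (level-)stationary
    print(f"{name}: ADF p={adf_p:.3f}, KPSS p={kpss_p:.3f}")
```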
Time Series Analysis: Literature Overview
Representation vs. Similarity
https://arxiv.org/abs/1704.00794: “Time series
analysis approaches can be broadly categorized
into two families: (i) representation methods,
which provide high-level features for representing
properties of the time series at hand, and (ii)
similarity measures, which yield a meaningful
similarity between different time series for further
analysis.“
Classic representation methods are for instance
Fourier transforms, wavelets, singular value
decomposition, symbolic aggregate approximation,
and piecewise aggregate approximation.
Time series may also be represented through the
parameters of model-based methods such as
Gaussian mixture models (GMM), Markov models and
hidden Markov models (HMMs), time series bitmaps
and variants of ARIMA.
An advantage with parametric models is that they
can be naturally extended to the multivariate
case. For detailed overviews on representation
methods, we refer the interested reader to e.g.
Wang et al. (2013).
https://arxiv.org/abs/1704.00794: “Similarity-based approaches: once defined, such similarities
between pairs of time series may be utilized in a wide range of applications, such as
classification, clustering, and anomaly detection. Time series similarity measures include for
example dynamic time warping (DTW), the longest common subsequence (LCSS), the
extended Frobenius norm (Eros), and the Edit Distance with Real sequences (EDR), and
represent state-of-the-art performance in univariate time series (UTS) prediction.
Attempts have been made to design kernels from non-metric distances such as DTW, of
which the global alignment kernel (GAK) is an example. There are also promising works on
deriving kernels from parametric models, such as the probability product kernel, Fisher kernel,
and reservoir based kernels. Common to all these methods is however a strong dependence
on a correct hyperparameter tuning, which is difficult to obtain in an unsupervised setting.
Moreover, many of these methods cannot naturally be extended to deal with multivariate time
series (MTS), as they only capture the similarities between individual attributes and do not
model the dependencies between multiple attributes. Equally important, these methods are not
designed to handle missing data, an important limitation in many existing scenarios, such
as clinical data where MTS originating from Electronic Health Records (EHRs) often contain
missing data.”
In this work, we propose a surgical site infection detection framework for
patients undergoing colorectal cancer surgery that is completely
unsupervised, hence alleviating the problem of getting access to labelled
training data. The framework is based on powerful kernels for multivariate
time series that account for missing data when computing similarities.
https://arxiv.org/abs/1803.07879
Analysis with Similarity Measures
Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data
Karl Øyvind Mikalsen, Filippo Maria Bianchi, Cristina Soguero-Ruiz, Robert Jenssen (last revised 29 Jun 2017)
https://arxiv.org/abs/1704.00794 | https://github.com/kmi010/Time-series-cluster-kernel-TCK- (the TCK was implemented in R and Matlab)
Similarity-based approaches represent a
promising direction for time series analysis.
However, many such methods rely on
parameter tuning, and some have
shortcomings if the time series are
multivariate (MTS), due to dependencies
between attributes, or the time series
contain missing data.
In this paper, we address these challenges
within the powerful context of kernel
methods by proposing the robust time
series cluster kernel (TCK). The approach
taken leverages the missing data
handling properties of Gaussian
mixture models (GMM) augmented with
informative prior distributions. An ensemble
learning approach is exploited to ensure
robustness to parameters by combining the
clustering results of many GMMs to
form the final kernel.
The experimental results demonstrated that the TCK
(1) is robust to hyperparameter settings, (2) is
competitive to established methods on prediction
tasks without missing data and (3) is better than
established methods on prediction tasks with missing
data.
In future works we plan to investigate whether the
use of more general covariance structures in the
GMM, or the use of HMMs as base probabilistic
models, could improve TCK.
Wavelets → Shapelets: ”1D Gabors” #1
Fast classification of univariate and multivariate time series through shapelet discovery
https://doi.org/10.1007/s10115-015-0905-9
Josif Grabocka, Martin Wistuba, Lars Schmidt-Thieme
A Shapelet Selection Algorithm for Time Series Classification: New Directions
https://doi.org/10.1016/j.procs.2018.03.025
The high time complexity of the shapelet selection process hinders its application in real-time data processing.
To overcome this, in this paper we propose a fast shapelet selection algorithm (FSS), which sharply
reduces the time consumption of shapelet selection.
https://slideplayer.com/slide/8370683/
For example, a class of abnormal ECG measurement may be characterised by an unusual
pattern that only occurs occasionally at any point during the measurement. Shapelets
are subseries that capture this type of characteristic. They allow for the detection of
phase-independent localised similarity between series within the same class.
The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances
Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017)
https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
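To make the shapelet primitive concrete, here is a minimal sketch (my illustration, not code from the papers above): the distance between a shapelet and a series is the minimum z-normalized Euclidean distance over all sliding windows, which is exactly what makes the match phase-independent:

```python
# Minimal sketch of the shapelet-to-series distance.
import numpy as np

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else x - x.mean()

def shapelet_distance(series: np.ndarray, shapelet: np.ndarray) -> float:
    """Minimum z-normalized Euclidean distance over all sliding windows."""
    L = len(shapelet)
    s = znorm(shapelet)
    return min(np.linalg.norm(znorm(series[i:i + L]) - s)
               for i in range(len(series) - L + 1))
```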
Wavelets → Shapelets: ”1D Gabors” #2
A fast shapelet selection algorithm for time series classification
https://doi.org/10.1016/j.comnet.2018.11.031
The training time of shapelet based algorithms is high, even though it is
computed off-line, and the authors aim to make it more efficient.
Shapelet transformation algorithms have attracted a great deal of attention in the last
decade. However, the time complexity of the shapelet selection process in shapelet
transformation algorithms is too high. To accelerate the shapelet selection process with
no reduction in accuracy, we presented FSS for ST.
The experimental results demonstrate that our proposed FSS was thousands of
times faster than the original shapelet transformation method with no reduction
in accuracy. Our results also demonstrate that our method was the fastest method
among shapelet methods that have the leading level of accuracy.
Representation Learning with deep learning #1
Towards a Universal Neural Network Encoder for Time Series
Joan Serrà, Santiago Pascual, Alexandros Karatzoglou (Submitted on 10 May 2018)
https://arxiv.org/abs/1805.03908
We have studied the use of a universal encoder for time
series in the specific case of classifying an out-of-sample data
set of an unseen data type. We have considered the cases of
no-adaptation,mappingadaptation,andfulladaptation.
In all cases we achieve performances that are competitive with
the state-of-the-art that, in addition, involve a compact reusable
representation and few training iterations. We have also studied
the effect of the representation dimensionality, showing that
small representations have an impact on no-adaptation and
mapping adaptation approaches, but not much on full adaptation
ones.
In the future, we plan to refine the encoder architecture, as well
as optimizing some of the parameters we empirically use in our
experiments. A very interesting direction for future research is
the adoption of one-shot learning schemas (Snell et al. 2017;
Sutskever et al. 2014), which we find very suitable for the
current setting in time series classification problems.
A further option to enhance the performance of a universal
encoder is data augmentation, especially considering recent
linear instance/class interpolation approaches (Zhang et al. 2018).
In order to have sufficient knowledge to accomplish any task, and in order to be
applicable in the absence of labeled data or even without adaptation/re-training,
researchers have been increasingly adopting the generic concept of universal
encoders, especially within the text processing domain (note that related concepts also
exist in other domains).
The basic idea is to train a model (the encoder) that learns a common representation
which is useful for a variety of tasks and that, at the same time, can be reused for
novel tasks with minimal or no adaptation. While it would seem that classical
autoencoders and other unsupervised models should perfectly fit this purpose, recent
research in sentence encoding shows that, with current means, encoders learnt with a
sufficiently large set of supervised tasks, or mixing supervised and
unsupervised data, consistently outperform their purely unsupervised counterparts.
Representation Learning with deep learning #2
One Deep Music Representation to Rule Them All?
A comparative analysis of different representation learning strategies
Jaehun Kim, Julian Urbano, Cynthia C. S. Liem, Alan Hanjalic (Submitted on 13 Feb 2018)
https://arxiv.org/abs/1802.04051
Our work will address the following research questions:
– RQ1: Given a set of common learning tasks that can be used to train a network, what is the influence of the number and type of the tasks on the effectiveness of the learned deep representation?
– RQ2: How do various degrees of information sharing in the deep architecture affect the ultimate success of a learned deep representation?
– RQ3: What is the best way to assess the effectiveness of a deep representation?
Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single
learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a
learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is
likely to be informative for the unseen task. At the same time, this representation may not be that informative to another
unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more
learning tasks increases robustness of the learned representation and its usability for a broader set of unseen tasks.
Representation Learning with deep learning #3
Learning Finer-class Networks for Universal Representations
https://arxiv.org/abs/1810.02126
https://arxiv.org/abs/1712.09708
Julien Girard, Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot (Submitted on 4 Oct 2018)
Many real-world visual recognition use-cases cannot directly benefit from
state-of-the-art CNN-based approaches because of the lack of
annotated data. The usual approach to deal with this is to transfer a
representation pre-learned on a large annotated source-task onto a target-
task of interest. This raises the question of how well the original
representation is "universal", that is to say directly adapted to many
different target-tasks. To improve such universality, the state-of-the-art
consists in training networks on a diversified source problem, that is
modified either by adding generic or specific categories to the initial set of
categories.
We propose two methods to improve universality, but pay special attention
to limit the need of annotated data. We also propose a unified
framework of the methods based on the diversifying of the training
problem. Finally, to better match Atkinson's cognitive study about
universal human representations, we proposed to rely on the
transfer-learning scheme as well as a new metric to evaluate universality.
We show that our method learns more universal representations than state-of-the-art,
leading to significantly better results on 10 target-tasks from
multiple domains, using several network architectures, either alone or
combined with networks learned at a coarser semantic level.
Representation Learning with deep learning #4
Improving Clinical Predictions through Unsupervised Time Series Representation Learning
https://arxiv.org/abs/1812.00490
Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch (Submitted on 2 Dec 2018)
Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.
We empirically showed that in scenarios
where labeled medical time series data is
scarce, training classifiers on unsupervised
representations provides performance gains
over end-to-end supervised learning using
raw input signals, thus making effective use
of information available in a separate,
unlabeled training set.
The proposed model, explored for the first
time in the context of unsupervised patient
representation learning, produces
representations with the highest
performance in future signal prediction
and clinical outcome prediction,
exceeding several baselines.
The idea behind applying attention mechanisms to time series forecasting is to enable the
decoder to preferentially “attend” to specific parts of the input sequence
during decoding. This allows particularly relevant events (e.g. drastic changes in heart
rate) to contribute more to the generation of different points in the output sequence.
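As a rough sketch of the mechanism just described (generic dot-product attention in PyTorch; the paper's exact attention variant may differ), the decoder state scores each encoder time step and a softmax-weighted context vector is formed:

```python
# Minimal dot-product attention over encoder outputs.
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_outputs):
    # decoder_state: (batch, hidden); encoder_outputs: (batch, time, hidden)
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)
    weights = F.softmax(scores, dim=1)               # attention over time steps
    context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)
    return context, weights  # context feeds the decoder at this output step
```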
Representation Learning with deep learning #5
Unsupervised Scalable Representation Learning for Multivariate Time Series
https://arxiv.org/abs/1901.10738
https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries (PyTorch)
Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi (Submitted on 30 Jan 2019)
Hence, we propose in the following an unsupervised
method to learn general-purpose representations for
multivariate time series that comply with the issues of
varying and potentially high lengths of the studied time
series. To this end, we adapt recognized deep learning tools
and introduce a novel unsupervised loss. Our
representations are computed by a deep convolutional
neural network with dilated convolutions (i.e. TCNs).
This network is then trained unsupervised, using the first
specifically designed triplet loss in the literature of
time series, taking advantage of the encoder resilience to
time series of unequal lengths.
We leave as future work the applicability of our method to
other tasks like forecasting, and the study of its impact if it
were to be added in powerful ensemble methods.
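A rough paraphrase of that triplet objective (my reading of the abstract; the actual positive/negative sampling and encoder are more involved): subseries of the same series should be encoded closer together than subseries drawn from other series:

```python
# Sketch of a time-series triplet loss with log-sigmoid dot products.
import torch
import torch.nn.functional as F

def triplet_loss(enc_anchor, enc_pos, enc_negs):
    # enc_anchor, enc_pos: (batch, dim); enc_negs: (batch, K, dim)
    pos = F.logsigmoid((enc_anchor * enc_pos).sum(-1))                        # attract
    neg = F.logsigmoid(-(enc_negs * enc_anchor.unsqueeze(1)).sum(-1)).sum(1)  # repel
    return -(pos + neg).mean()
```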
Representation Learning with deep learning #6
Unsupervised speech representation learning using WaveNet autoencoders
https://arxiv.org/abs/1901.08810
Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron van den Oord (Submitted on 25 Jan 2019)
We consider the task of unsupervised extraction of
meaningful latent representations of speech by applying
autoencoding neural networks to speech waveforms. The
goal is to learn a representation able to capture high level
semantic content from the signal, e.g. phoneme identities,
while being invariant to confounding low level details in the
signal such as the underlying pitch contour or background
noise. The behavior of autoencoder models depends on the
kind of constraint that is applied to the latent representation.
Our best models used MFCCs (mel-frequency cepstral
coefficient) as the encoder input, but reconstructed raw
waveforms at the decoder output. We used standard 13
MFCC features extracted every 10ms (i.e., at a rate of 100 Hz)
and augmented with their temporal first and second
derivatives. Such features were originally designed for
speech recognition and are mostly invariant to pitch and
similar confounding detail in the audio signal.
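A sketch of this 13-MFCC + deltas front end using librosa (the file name and sample rate are illustrative; the paper's exact extraction pipeline may differ):

```python
# 13 MFCCs every 10 ms (100 Hz), plus first and second temporal derivatives.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)       # hypothetical input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            hop_length=sr // 100)  # 10 ms hop -> 100 Hz
feats = np.vstack([mfcc,
                   librosa.feature.delta(mfcc),            # first derivative
                   librosa.feature.delta(mfcc, order=2)])  # second derivative
print(feats.shape)  # (39, n_frames)
```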
Representation Learning with deep learning #7
A Tale of Two Time Series Methods: Representation Learning for Improved Distance and Risk Metrics
https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076345253-MIT.pdf
Divya Shanmugam (June 2018)
Architecture of the proposed model. A single convolutional layer
extracts local features from the input, which a strided maxpool
layer reduces to a fixed-size vector. A fully connected layer
with ReLU activation carries out further, nonlinear dimensionality
reduction to yield the embedding. A softmax layer is added at
training time.
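A rough PyTorch re-sketch of that figure (channel counts and pooled length are invented for illustration; the thesis specifies its own): one convolutional layer, a max-pool to a fixed-size vector, a ReLU fully connected embedding, and a softmax head used only at training time:

```python
import torch.nn as nn

class JiffyLike(nn.Module):
    def __init__(self, n_channels, embed_dim, n_classes, pooled_len=16):
        super().__init__()
        self.conv = nn.Conv1d(n_channels, 32, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveMaxPool1d(pooled_len)   # fixed size for any length
        self.embed = nn.Sequential(nn.Flatten(),
                                   nn.Linear(32 * pooled_len, embed_dim),
                                   nn.ReLU())
        self.head = nn.Linear(embed_dim, n_classes)    # discarded after training

    def forward(self, x):                 # x: (batch, channels, time)
        z = self.embed(self.pool(self.conv(x)))
        return z, self.head(z)            # embedding + training-time logits
```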
We introduce the multiple instance learning paradigm to risk
stratification. Risk stratification models aim to identify patients
at high risk for a given outcome so that doctors may intervene, with
the attempt of avoiding that outcome. Machine learning has led to
improved risk stratification models for a number of outcomes,
including stroke, cancer and treatment resistance [55]. To the best of
our knowledge, this is the first application of multiple instance learning
to risk stratification.
The extension of Jiffy to multi-label classification and unsupervised
learning poses a challenging but necessary task. The availability of
unlabeled time series data eclipses the availability of its annotated
counterpart. Thus, a simple network-based method for representation
learning on multivariate time series in the absence of labels is an important
line of work. There is also potential to further increase Jiffy’s speed by
replacing the fully connected layer with a structured [Bojarski et al. 2016]
or binarized [Rastegari et al. 2016] matrix.
The proposed risk stratification model extends naturally to a range of adverse
outcomes. The model is not limited to operating on ECG signals; it is
worth exploring whether the multiple instance learning approach may be
successful in other modalities of medical data, including voice. On a
theoretical level, strong generalization guarantees for distinguishing bags with
relative witness rates do not exist and are worth exploring as these models are
applied in the real world.
Intro to methods #1a
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Intro to methods #1b
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Structure in a library of 8651 time-series analysis operations. (a) A
summary of the four main classes of operations in our library, as determined by
a k-medoids clustering, reflects a crude but intuitive overview of the time-series
analysis literature. (b) A network representation of the operations in our library
that are most similar to the approximate entropy algorithm, ApEn(2,0.2) [7],
which were retrieved from our library automatically. Each node in the network
represents an operation and links encode distances between them (computed
using a normalized mutual information-based distance metric, cf. electronic
supplementary material, §S1.3.1). Annotated scatter plots show the outputs of
ApEn(2,0.2) (horizontal axis) against a representative member of each shaded
community (indicated by a heavily outlined node, vertical axis). Similar pictures
can be produced by targeting any given operation in our library, thereby
connecting different time-series analysis methods that nevertheless display
similar behaviour across empirical time series.
Key scientific questions that can be addressed by representing time series by their properties (measured by many types of analysis
methods) and operations by their behaviour (across many types of time-series data). We show that this representation facilitates a range of
versatile techniques for addressing scientific time-series analysis problems, which are illustrated schematically in this figure.
The representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as
empirical fingerprints, and are shown in the top panel. Coloured borders are used to label different classes of time series and
operations, and other figures in this paper that explicitly demonstrate each technique are given in the bottom right-hand corner of each
panel.
(a) Time-series datasets can be organized automatically, revealing the structure in a given dataset (cf. figures 4a,b and 5a). (b) Collections of
scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures
3a and 5b). (c) Real-world and model-generated data with similar properties to a specific time-series target can be identified (cf. figures 4c,d).
(d) Given a specific operation, alternatives from across science can be retrieved (cf. figure 3b). (e) Regression: the behaviour of operations in
our library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f)
Classification: operations can be selected based on their classification performance to build useful classifiers and gain insights into the
differences between classes of labelled time-series datasets (cf. figure 5e).
Intro to methods #1c
Highly comparative time-series analysis: the empirical structure of time series and their methods
http://doi.org/10.1098/rsif.2013.0048
Ben D. Fulcher, Max A. Little, Nick S. Jones
Highly comparative techniques for time-series analysis tasks. We draw on our full
library of time-series analysis methods to:
(a) structure datasets in meaningful ways, and retrieve and organize useful operations
for (b,e) classification and (c,d) regression tasks.
(a) Five classes of EEG signals are structured meaningfully in a two-dimensional
principal components space of our library of operations. (b) Pairwise linear
correlation coefficients measured between the 60 most successful operations for
classifying congestive heart failure and normal sinus rhythm RR interval series.
Clustering reveals that most operations are organized into one of three groups
(indicated by dashed boxes).
Most of the time when people talk about time series and deep
learning, they are most likely talking about sequences (e.g. language)
instead of unstructured time series (e.g. voice waveform).
“Sequences” vs. “Time Series”
“Dense Time Series” at video frame rate
Ice hockey as a game can be simplified to discrete events (sequences)
https://arxiv.org/abs/1808.04063
Not always so black-and-white, but in our case time series are mainly dense 1D biosignals with ambiguous or missing discrete states
Time Series: RNNs for sequences
The Unreasonable Effectiveness of Recurrent Neural Networks
May 21, 2015 | Andrej Karpathy
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
DanQ:ahybridconvolutionaland
recurrentdeepneuralnetworkfor
quantifyingthefunctionofDNA
sequences 
Daniel Quang XiaohuiXieNucleic AcidsResearch,Volume44,
Issue11,20June2016,Pagese107, 
https://doi.org/10.1093/nar/gkw226
Deep Learning for Understanding Consumer Histories
by Tobias Lang - 25 Oct 2016
https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1
Sequences. Depending on your background you might be wondering:
What makes Recurrent Networks so special?
Time Series: LSTM, upgraded RNNs
Time Series: LSTMs Applied
Deep Air | UC Berkeley School of Information
https://www.ischool.berkeley.edu/projects/2017/deep-air
This project investigates the use of the LSTM recurrent neural network (RNN) as a
framework for forecasting, based on time series data of pollution and
meteorological information in Beijing. Our results show that the LSTM framework
produces equivalent accuracy when predicting future time stamps compared to the
baseline support vector regression for a single time stamp. Using our LSTM framework,
we can now extend the prediction from a single time stamp out to 5 to 10 hours into the
future.
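A minimal sketch of this kind of multi-step LSTM forecaster in PyTorch (class name, feature count and horizon are illustrative, not taken from the project):

```python
import torch
import torch.nn as nn

class Forecaster(nn.Module):
    def __init__(self, n_features, horizon=10, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, horizon)   # e.g. 5-10 hourly steps at once

    def forward(self, x):                 # x: (batch, time, features)
        _, (h, _) = self.lstm(x)
        return self.out(h[-1])            # (batch, horizon)

model = Forecaster(n_features=8)
print(model(torch.randn(4, 48, 8)).shape)  # torch.Size([4, 10])
```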
Overview of our self-supervised approach for posture and sequence representation learning
using CNN-LSTM. After the initial training with motion-based detections we retrain our model to
enhance the learning of the representations. https://doi.org/10.1109/CVPR.2017.399
Piano Genie: An Intelligent Musical Interface
Oct 15, 2018 | https://magenta.tensorflow.org/pianogenie
Chris Donahue, Ian Simon, Sander Dieleman
A bidirectional LSTM encoder maps a sequence of piano notes to a sequence of controller
buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM
decoder then decodes these controller sequences back into piano performances. After
training, the encoder is discarded and controller sequences are provided by user input.
Time Series: RNN/LSTMs are outdated? #1
The fall of RNN/LSTM
Eugenio Culurciello
https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0
Combining multiple neural attention modules, comes the “hierarchical
neural attention encoder”… Notice there is a hierarchy of attention
modules here, very similar to the hierarchy of neural networks. This is also
similar to the Temporal Convolutional Network (TCN).
→ Attention models, e.g. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction
Maha Elbayad, Laurent Besacier, Jakob Verbeek (Submitted on 11 Aug 2018)
https://arxiv.org/abs/1808.03867 | https://github.com/elbayadm/attn2d
Time Series: RNN/LSTMs are outdated? #2
An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling
Shaojie Bai, J. Zico Kolter, Vladlen Koltun (Revised 19 Apr 2018)
https://arxiv.org/abs/1803.01271 | http://github.com/locuslab/TCN
For most deep learning practitioners, sequence modeling is
synonymous with recurrent networks. Yet recent results
indicate that convolutional architectures can outperform recurrent
networks on tasks such as audio synthesis and machine translation.
Given a new sequence modeling task or dataset, which architecture
should one use?
We conduct a systematic evaluation of generic convolutional and
recurrent architectures for sequence modeling. The models are
evaluated across a broad range of standard tasks that are commonly
used to benchmark recurrent networks. Our results indicate that a
simple convolutional architecture outperforms canonical
recurrent networks such as LSTMs across a diverse range of
tasks and datasets, while demonstrating longer effective memory. We
conclude that the common association between sequence modeling
and recurrent networks should be reconsidered, and convolutional
networks should be regarded as a natural starting point for sequence
modeling tasks.
The preeminence enjoyed by recurrent networks in sequence modeling
may be largely a vestige of history. Until recently, before the introduction of
architectural elements such as dilated convolutions and residual
connections, convolutional architectures were indeed weaker. Our
results indicate that with these elements, a simple convolutional
architecture is more effective across diverse sequence modeling tasks
than recurrent architectures such as LSTMs. Due to the comparable
clarity and simplicity of TCNs, we conclude that convolutional
networks should be regarded as a natural starting point and a
powerful toolkit for sequence modeling.
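A minimal sketch of the core TCN ingredient in PyTorch (dilated causal convolutions only; the residual blocks and weight normalization of Bai et al. are omitted here):

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """Left-padded 1D convolution: output at time t sees only inputs up to t."""
    def __init__(self, c_in, c_out, k, dilation):
        super().__init__(c_in, c_out, k, dilation=dilation,
                         padding=(k - 1) * dilation)
    def forward(self, x):
        return super().forward(x)[..., :x.shape[-1]]  # trim right overhang

layers = []
for i in range(4):                       # receptive field grows exponentially
    layers += [CausalConv1d(32 if i else 1, 32, k=3, dilation=2 ** i), nn.ReLU()]
tcn = nn.Sequential(*layers)

x = torch.randn(8, 1, 300)               # (batch, channels, time)
print(tcn(x).shape)                      # torch.Size([8, 32, 300])
```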
Time Series: RNN/LSTMs are outdated? #3
Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data
Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik Herrmann, Han Du, Klaus Fischer, Philipp Slusallek (Submitted on 24 Jun 2018)
https://arxiv.org/abs/1806.09174
Semantic segmentation of motion capture sequences
plays a key part in many data-driven motion synthesis
frameworks. It is a preprocessing step in which long
recordings of motion capture sequences are partitioned
into smaller segments. Afterwards, additional methods like
statistical modeling can be applied to each group of
structurally-similar segments to learn an abstract motion
manifold. The segmentation task however often
remains a manual task, which increases the effort and
cost of generating large-scale motion databases.
We therefore propose an automatic framework for
semantic segmentation of motion capture data using a
dilated temporal fully-convolutional network. Our
model outperforms a state-of-the-art model in action
segmentation, as well as three networks for sequence
modeling.
Time Series: RNN/LSTMs are outdated? #4
Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis
Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (Submitted on 7 Feb 2019)
https://arxiv.org/abs/1902.01659
https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318
For future work, we aim to extend our analysis to more types of data
sources arising from the ICU. Futoma et al. (2017b) already
employed a subset of baseline covariates, medication effects, and
missingness indicator variables. However, a multitude of feature
classes still remain to be explored and properly integrated. For
instance, the combination of sequential and non-sequential
features has previously been handled by feeding non-sequential
data into the sequential model (Futoma et al.,2017a).
We hypothesize that this could be handled more efficiently by
using a more modular architecture that incorporates both
sequential and non-sequential parts. Furthermore, we aim to obtain
a better understanding of the time series features utilized by the
model. Specifically, we are interested in assessing the
interpretability of the learned filters of the MGP-TCN framework
and evaluating how much the activity of an individual filter contributes
to a prediction. This endeavor is somewhat facilitated by our use of a
convolutional architecture. The extraction of short per-channel
signals could prove very relevant for supporting diagnoses made by
clinical practitioners.
Overview of our model. The raw, irregularly spaced time series are provided to the Multi-task Gaussian Process
(MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly
spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which after a forward
pass returns a loss. Its gradient is then computed by backpropagating through the computational graph including
both the TCN and the MGP (green arrows). Both the MGP and TCN parameters are learned end-to-end during
training.
We evaluate all methods using Area under the Precision–Recall Curve
(AUPRC) and additionally display the (less informative) Area under the
Receiver Operator Characteristic (AUC). The current state-of-the-art
method, MGP-RNN, is shown in blue. The two approaches for early
detection of sepsis that were introduced in this paper, i.e. MGP-TCN and
DTW-KNN ensemble, are shown in pink and red, respectively. By using three
random splits for all measures and methods, we depict the mean (line) and
standard deviation error bars (shaded area).
Clinical notes and text report understanding
Words as the sequences
Structuring Clinical Text
Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018)
https://doi.org/10.1016/j.artmed.2018.11.004
Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA
This paper explores cutting-edge deep learning methods for
information extraction from medical imaging free text
reports at a multi-institutional scale and compares them to the
state-of-the-art domain-specific rule-based system – PEFinder –
and traditional machine learning methods – SVM and Adaboost.
Visualization methods have been developed to identify the
impact of input words on the output decision for both
deeplearning models.
Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
Clinical Text + Images
Unsupervised Multimodal Representation Learning across Medical Images and Reports
(Machine Learning for Health (ML4H) Workshop at NeurIPS 2018)
https://arxiv.org/abs/1811.08615 MIT CSAIL
Joint embeddings between medical imaging modalities and
associated radiology reports have the potential to offer
significant benefits to the clinical community, ranging from cross-
domain retrieval to conditional generation of reports to the
broader goals of multimodal representation learning. In this work,
we establish baseline joint embedding results measured via both
local and global retrieval methods on the soon to be released
MIMIC-CXR dataset consisting of both chest X-ray images and
the associated radiology reports.
We establish baseline results using supervised and unsupervised joint embedding
methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval
evaluation metrics. Results show a possibility of incorporating more unsupervised data
into training for minimal-effort performance increase. A further study of joint
embeddings between these modalities may enable significant applications, such as
text/image generation or the incorporation of other EMR modalities.
Electronic Health Records
Visits as sequences; each sequence can contain 1D biosignals
EHR Mining: Risk Prediction model
Risk Prediction on Electronic Health Records with Prior
Medical Knowledge (2018)
https://doi.org/10.1145/3219819.3220020
We propose a novel and general framework called PRIME for
risk prediction task, which can successfully incorporate
discrete prior medical knowledge into all of the state-of-the-
art predictive models using posterior regularization technique.
Different from traditional posterior regularization, we do not need
to manually set a bound for each piece of prior medical
knowledge when modeling desired distribution of the target
disease on patients. Moreover, the proposed PRIME can
automatically learn the importance of different prior knowledge
with a log-linear model.
The limitation of this work is that the proposed PRIME is only
effective for common diseases. For rare and emerging
diseases, since there is little medical knowledge about them, it
is hard to incorporate any prior knowledge into deep learning
predictive models. Thus, the proposed PRIME may achieve
similar performance to the state-of-the-art baselines. In our
future work, we will focus on how to improve predictive
performance of risk prediction for rare diseases.
Preprocessing: Cleaning
Intro to cleaning
In the preprocessing component, the main purpose is to clean the
data, filter the unusual points and make it suitable as the input to the
CNN. Besides the normal steps including timestamp alignment,
normalization and missing data imputation for time series data with
trend, the most important operation to improve the data quality is the
outlier detection, interpolation and filtering, in particular for
clinical data. Because in the clinical data of glucose time series, there
are many missing or outlier data points due to errors in calibration,
measurements, and/or mistakes in the process of data collection and
transmission. Here, several methods are introduced to handle these
scenarios [36].
● Dimension Reduction Model: the time series can be projected into
lower dimensions using linear correlations such as principal component
analysis (PCA), and data with large residual errors can be considered as
outliers (see the sketch after this list).
● Proximity-based Model: the data are determined by nearest
neighbour analysis, cluster or density. Thus the data instances that are
isolated from the majority are considered as outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such
as Gaussian mixture models optimized using expectation-maximization.
In our case the filter can be implemented before the CNN, due to the
continuous characteristic of the input glycaemic time series data.
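A minimal sketch of the dimension-reduction bullet (my illustration with scikit-learn; window construction and thresholds are application-specific):

```python
# Flag points whose PCA reconstruction residual is unusually large.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.randn(500, 10)          # stand-in for windowed time-series features
pca = PCA(n_components=3).fit(X)
residual = np.linalg.norm(X - pca.inverse_transform(pca.transform(X)), axis=1)
outliers = residual > np.percentile(residual, 99)   # top 1% flagged as outliers
```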
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms
Philipp Sodmann et al 2018 Physiol. Meas. in press
https://doi.org/10.1088/1361-6579/aae304
Signal cleaning:
In the data preprocessing, we performed resampling and signal denoising. We
resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG
segments of equal length onto the CNN.
To filter noisy components in the signal such as baseline wandering, respiration effects,
or powerline interference, we applied a discrete wavelet transform (DWT) which works
as a band-pass filter. For this, we used the Daubechies wavelet transform (Db4).
Before re-composition, each coefficient of the transform was multiplied by a factor
according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of
33 samples was applied to remove the persistent baseline.
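A sketch of this DWT band-pass cleaning with PyWavelets (the per-level factors below are placeholders; the paper multiplies by tabulated values):

```python
# Decompose with Daubechies-4, scale coefficients per level, reconstruct.
import numpy as np
import pywt

def dwt_bandpass(ecg, wavelet="db4", level=8, factors=None):
    coeffs = pywt.wavedec(ecg, wavelet, level=level)   # [cA_L, cD_L, ..., cD_1]
    if factors is None:
        # Placeholder: suppress the approximation (baseline wander)
        # and the finest detail band (high-frequency noise).
        factors = [0.0] + [1.0] * (level - 1) + [0.0]
    coeffs = [c * f for c, f in zip(coeffs, factors)]
    return pywt.waverec(coeffs, wavelet)[:len(ecg)]

cleaned = dwt_bandpass(np.random.randn(3000))
```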
https://doi.org/10.3389/fnins.2013.00267
MEG and EEG data analysis with MNE-Python
Preprocessing: Transformations
Time Series Invariances
A complexity-invariant distance measure for time series
https://doi.org/10.1137/1.9781611972818.60
Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J. Keogh.
In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216
Time Series: DTW, the classical method
https://doi.org/10.1145/2888451.2888456
Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors
Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018)
https://doi.org/10.1007/978-3-319-93794-6_7
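For reference, the classical DTW dynamic program fits in a few lines of NumPy (a minimal O(nm) sketch, without the windowing and lower-bounding tricks used in practice):

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return cost[n, m]

# Phase-shifted sine waves: large Euclidean distance, small DTW distance.
t = np.linspace(0, 2 * np.pi, 100)
print(dtw_distance(np.sin(t), np.sin(t + 0.5)))
```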
Learning invariances #1a
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
Recently, researchers have started applying convolutional neural
networks (CNNs) with 1D convolutions to clinical tasks
involving time-series data. This is due, in part, to their
computational efficiency, relative to recurrent neural networks
and their ability to efficiently exploit certain temporal invariances
(e.g., phase invariance).
However, it is well-established that clinical data may exhibit many
other types of invariances (e.g., scaling). While preprocessing
techniques, (e.g., dynamic time warping) may successfully
transform and align inputs, their use often requires one to identify
the types of invariances in advance.
In contrast, we propose the use of Sequence Transformer
Networks, an end-to-end trainable architecture that learns to
identify and account for invariances in clinical time-series data.
Applied to the task of predicting in-hospital mortality, our
proposed approach achieves an improvement in the AUROC.
To address these challenges, we propose Sequence Transformer Networks, an approach for
learning task-specific invariances related to amplitude, offset, and scale directly from
the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and
task-dependent transformations. In contrast to data augmentation approaches, our
proposed approach makes limited assumptions about the presence of invariances in the data.
Learning invariances #1b
Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks
Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (Submitted on 21 Aug 2018)
https://arxiv.org/abs/1808.06725
The proposed approach is not without limitation. More specifically, in its current form the
Sequence Transformer applies the same transformation across all features within an example,
instead of learning feature-specific transformations. Despite this limitation, the learned
transformations still lead to an increase in intra-class similarity. In conclusion, we are
encouraged by these preliminary results. Overall, this work represents a starting point on
which others can build. In particular, we hypothesize that the ability to capture local invariances
and feature-specific invariances could lead to further improvements in performance.
Learning invariances #2
Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders
Abubakar Abid, James Zou, Stanford University (Submitted on 23 Oct 2018)
https://arxiv.org/abs/1810.10107
Domain experts typically hand-craft or manually select a specific metric, such as dynamic time
warping (DTW), to apply on their data. In this paper, we propose Autowarp, an end-to-end
algorithm that optimizes and learns a good metric given unlabeled trajectories.
We define a flexible and differentiable family of warping metrics, which encompasses common
metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation
power of sequence autoencoders to optimize for a member of this warping distance
family. The output is a metric which is easy to interpret and can be robustly learned from relatively
few trajectories.
Future work will extend these results to more challenging time series data, such as those with higher
dimensionality or heterogeneous data.
Learning invariances #3
NeuralWarp: Time-Series Similarity with Warping Networks
Josif Grabocka, Lars Schmidt-Thieme (Submitted on 20 Dec 2018)
https://arxiv.org/abs/1812.08306
In this paper we propose to learn a warping function for
aligning the indices of time series in a deep latent
representation. We compared the suggested architecture
with two types of encoders (CNN, or RNN) and a deep
forward network as a warping function. Experimental
comparisons to non-parametric and un-warped Siamese
networks demonstrated that the proposed elastic deep
similarity measure is more accurate than prior models.
Preprocessing: Class Imbalances
SMOTE for imbalanced classes
SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification
Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354
https://doi.org/10.1007/s13748-017-0128-2
Considering a binary problem with a majority class and a
minority class, it is likely that a learning algorithm ignores the
latter and still achieves a high accuracy. There are three main
ways of dealing with these situations [16]:
● Algorithmic modification: modifying learning algorithms in
order to tackle the problem by design.
● Cost-sensitive learning: introducing costs for
misclassification of the minority class at the data or algorithmic
level.
● Data sampling: preprocessing the data in order to reduce
the breach between the number of instances of each class.
The SMOTE technique is based on the idea of the
neighborhood of the k-nearest neighbor (kNN) rule.
The area under the ROC curve results show that the use of
oversampling methods improves the detection of the minority
class in Big Data datasets. We have also shown how our design can
successfully work on a wide range of devices, including a laptop,
while requiring reasonable times: around 25 min on high-end devices,
and less than 2 h on the laptop, for the most time-demanding
experiment.
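A minimal CPU sketch of SMOTE with imbalanced-learn (the paper above is about a GPU variant for Big Data; the dataset here is synthetic):

```python
from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
print(Counter(y))                         # heavily imbalanced
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(Counter(y_res))                     # balanced via synthetic minority samples
```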
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018)
https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC
(Dong et al., 2016) generate synthetic examples to diminish the
drawbacks produced by the absence of labeled examples.
Several learning techniques were checked and some properties,
such as the common hidden space between labeled samples and
the synthetic samples, were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised
active learning method in which labels are incrementally obtained
and applied using a clustering algorithm.
In the context of the current challenges outlined, we highlighted the need
for enhancing the treatment of small disjuncts, noise, lack of data,
overlapping, dataset shift and the curse of dimensionality. To do so, the
theoretical properties of SMOTE regarding these data
characteristics, and its relationship with the new synthetic
instances, must be further analyzed in depth. Finally, we also posited
that it is important to focus on data sampling and pre-processing
approaches (such as SMOTE and its extensions) within the framework
of Big Data and real-time processing.
Outlier detection: What to impute?
Types of Anomalies
A simple two-dimensional example: global anomalies (x1, x2), a local anomaly (x3), and a micro-cluster (c3).
“This simple example already illustrates that anomalies are not
always obvious and a score is much more useful than a binary
label assignment.”
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Three types of anomaly schemes:
● point anomaly detection
● collective anomalies
● contextual anomalies
State-of-the-art: the 2-year-old cutting edge #1
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
Dozens of algorithms have been proposed in this area, but unfortunately
the research community still lacks a comparative universal evaluation as
well as common publicly available datasets.
These shortcomings are addressed in this study, where 19 different
unsupervised anomaly detection algorithms are evaluated on 10
different datasets from multiple application domains.
By publishing the source code and the datasets, this paper aims to
be a new well-funded basis for unsupervised anomaly detection
research. Additionally, this evaluation reveals the strengths and
weaknesses of the different approaches for the first time.
As a general summary for algorithm selection, we recommend to use
nearest-neighbor based methods, in particular k-NN for global tasks
and LOF for local tasks, instead of clustering-based methods. If
computation time is essential, HBOS is a good candidate, especially for
larger datasets. Special attention should be paid to the nature of the
dataset when applying local algorithms, and whether local anomalies are of
interest at all in this case.
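That recommendation is easy to try with scikit-learn; a minimal sketch (scoring conventions vary across papers):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors, LocalOutlierFactor

X = np.random.randn(1000, 5)              # stand-in multivariate data

# Global anomaly score: mean distance to the k nearest neighbors.
dist, _ = NearestNeighbors(n_neighbors=10).fit(X).kneighbors(X)
global_score = dist[:, 1:].mean(axis=1)   # column 0 is the self-distance

# Local anomaly score: Local Outlier Factor.
lof = LocalOutlierFactor(n_neighbors=10).fit(X)
local_score = -lof.negative_outlier_factor_   # higher = more anomalous
```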
Different anomaly detection modes depending on the availability of labels in the dataset.
(a) Supervised anomaly detection uses a fully labeled dataset for training. (b) Semi-supervised
anomaly detection uses an anomaly-free training dataset. Afterwards, deviations in the test data
from that normal model are used to detect anomalies. (c) Unsupervised anomaly detection
algorithms use only intrinsic information of the data in order to detect instances deviating
from the majority of the data.
State-of-the-art: the 2-year-old cutting edge #2
A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016)
Markus Goldstein, Seiichi Uchida
https://doi.org/10.1371/journal.pone.0152173
A visualization of the results of the k-NN global
anomaly detection algorithm. The anomaly score is
represented by the bubble size whereas the color shows the
labels of the artificially generated dataset.
Comparing Influenced Outlierness (INFLO) with Local Outlier Factor
(LOF) shows the usefulness of the reverse neighborhood set.
For the red instance, LOF takes only the neighbors in the gray
area into account, resulting in a high anomaly score. INFLO
additionally takes the blue instances into account (reverse
neighbors) and thus scores the red instance as more normal.
Anomaly detection: Cyber-physical systems
Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018)
Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng
Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1809.04758
Unsupervised machine learning techniques can be used to model the
system behaviour and classify deviant behaviours as possible attacks.
In this work, we proposed a novel Generative Adversarial Networks-based
Anomaly Detection (GAN-AD) method for such complex networked CPSs.
We used LSTM-RNN in our GAN to capture the distribution of the
multivariate time series of the sensors and actuators under normal
working conditions of a CPS.
Instead of treating each sensor’s and actuator’s time series independently, we model
the time series of multiple sensors and actuators in the CPS concurrently
to take into account potential latent interactions between them.
To exploit both the generator and the discriminator of our GAN, we deployed the
GAN-trained discriminator together with the residuals between generator-reconstructed
data and the actual samples to detect possible anomalies in the complex CPS.
We will also conduct further research on feature selection for multivariate
anomaly detection, and investigate principled methods for choosing the
latent dimension and PC dimension with theoretical guarantees.
Anomaly detection: Financial time series
Modeling approaches for time series forecasting and anomaly detection (2018)
Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun
http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf
This project focuses on prediction of time series data for Wikipedia
page accesses for a period of over twenty-four months. The methods
explored here are K-nearest neighbors (KNN), Long short-term memory
network (LSTM), and Sequence to Sequence with Convolution Neural
Network (CNN), and we will compare predicted values to actual web traffic.
The predictions can help us in anomaly detection in the series.
Pre-processing: “There are many series in which values are zero. This
could be a missing value, or actual lack of web page access. In addition,
there are significant spikes in the data, where values have a broad range
from 1 to hundreds/thousands for several web pages. We normalize this
data by adding 1 to all entries, taking the log of the values, and setting
the mean to zero and variance to one. We have the results of Fourier
analysis for exploring periodicity on a weekly/monthly/quarterly basis.”
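The quoted normalization, as a two-line sketch (my paraphrase):

```python
import numpy as np

def normalize(series):
    x = np.log1p(series)                  # add 1 to all entries, take the log
    return (x - x.mean()) / x.std()       # zero mean, unit variance
```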
Our approaches to time series prediction depend on features extracted
from the time series data itself. Our models learn periodicity, ramp and
other regular trends quite well. However, none of our models are able to
capture spikes or outliers that arise from external sources. Enhancing
the performance of the models will require augmenting our feature set from
other sources such as news events and weather.
“Special Outliers”: Disguised missing values
FAHES: A Robust Disguised Missing Values Detector
Qatar Computing Research Institute, HBKU, Doha, Qatar
https://doi.org/10.1145/3219819.3220109
Missing values are common in real-world data and may
seriously affect data analytics such as simple statistics
and hypothesis testing. Generally speaking, there are
two types of missing values: explicitly missing
values (i.e. NULL values), and implicitly missing values
(a.k.a. disguised missing values (DMVs)) such as
"11111111" for a phone number and "Some college" for
education. While detecting explicitly missing values is
trivial, detecting DMVs is not; the essential challenge is
the lack of standardization about how DMVs are
generated.
One future work we are planning to perform is to improve FAHES to
detect the DMVs that are generated randomly within the range of the
data. For example, when a child tries to create an account on a domain
that has a minimum age restriction, the child fakes her age with a random
value that allows him to create the account. Such random fake values
are hard, if not impossible, to detect. Moreover, although DMVs are the
focus of this paper, there are more types of errors found in the wild.
Many of the principles and techniques we have used to detect DMVs
can be leveraged to detect other types of errors, so a natural next step
is to extend the infrastructure we have built to detect those. This opens
new challenges related to the robust identification of errors that could
be interpreted differently by different modules.
Deep Learning Outlier Detection: overview
Uncertainty and Novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of “Outlier” Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting
is that we cannot model the diversity of out-of-distribution samples in
practice. The distribution of outliers used in training may not be the same as
the distribution of outliers encountered in the application. Therefore,
classical approaches that learn inliers vs. outliers with only two datasets
can yield optimistic results. We introduce OD-test, a three-dataset
evaluation scheme as a practical and more reliable strategy to assess
progress on this problem. The OD-test benchmark provides a
straightforward means of comparison for methods that address the out-of-
distributionsampledetectionproblem.
In real-life deployment of products that use complex machinery such as
deep neural networks (DNNs), we would have very little control over the
input. In the absence of extrapolation guarantees, when the independently
and identically distributed (IID) assumption is violated, the behaviour of the
pipeline may be unpredictable. From a quality assurance
perspective, it is desirable to detect and prevent these scenarios
automatically.
A reliable pipeline would first determine whether it can process a
given sample, then it would use the prediction of the target neural
network. The unfortunate incident that mislabeled people as non-human,
for instance, is a clear example of OOD extrapolation that could have been
prevented by such a decision scheme: the model simply did not know that it did
not know. While incidents of a similar nature have fueled research on
de-biasing the datasets and the deep learning machinery, we still
would need to identify the limitations of our models.
The application is not limited to fortifying large-scale user-
facing products. Successful detection of such violations could
also be used in active learning, unsupervised learning, learning with
noisy data, or simply be a condition to invoking transfer learning
strategies. In this work, we are interested in evaluating mechanisms
that detect OOD samples.
Uncertainty and Novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of “Outlier” Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View. A commonly invoked strategy in addressing
similar problems is to characterize a notion of uncertainty.
The literature distinguishes aleatoric uncertainty, the uncertainty inherent
to the process (the known unknowns, like flipping a coin), from epistemic
uncertainty, the uncertainty that can be eliminated with more information
(the unknown unknowns). The Bayesian approach to epistemic
uncertainty estimation is to measure the degree of disagreement among
the potentially viable models (the posterior).
The MC-Dropout approach is often advertised as a feasible method to
estimate uncertainty for a variety of applications. Similarly, we can adopt a
non-Bayesian approach by training independent models and then
measuring the disagreement. Lakshminarayanan et al. show that an ensemble of
five neural networks (DeepEnsemble) trained with an
adversarial-sample-augmented strategy is sufficient to provide a non-
Bayesian alternative to capturing predictive uncertainty. We evaluate
DeepEnsemble and MC-Dropout.
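A minimal MC-Dropout sketch in PyTorch (my illustration): keep dropout stochastic at test time and read the spread of repeated forward passes as an epistemic-uncertainty signal:

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    model.train()   # keeps nn.Dropout stochastic (caution: also affects BatchNorm)
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    return preds.mean(0), preds.std(0)    # predictive mean and disagreement
```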
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NN SVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE), and a k-way logistic
regression loss (KL). CE loss is the typical choice for k-way classification tasks – it enforces
mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks –
it does not enforce mutual exclusivity of the predictions.
We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect
on the ability to predict OOD samples. CE loss cannot make a None prediction without an
explicitly defined None class, but KL loss can make None predictions through low activations of
all the classes.
Uncertainty and Novelty detection #1c
VGG-backed and Resnet-backed methods significantly differ in accuracy. The gap
indicates the sensitivity of the methods to the underlying networks.
This means that the image classification accuracy may not be the only relevant factor
in the performance of these methods. ODIN is less sensitive to the underlying network.
Despite not enforcing mutual exclusivity, training the networks with KL loss instead of
CE loss consistently reduces the accuracy of OOD detection methods on average.
Uncertainty and Novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of “Outlier” Detectors (2018)
Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions:
(i) in-distribution techniques, and (ii) out-of-distribution techniques.
Guo et al. (2017) observed that
modern neural networks tend to
be overconfident in their
predictions. They show that
temperature scaling in the
softmax operator, also known as
Platt scaling, can be used to
calibrate the output probabilities of
a neural network to empirically
align the accuracy of a prediction
with its probability. Their efforts fall
under the uncertainty estimation
approaches.
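Temperature scaling itself is simple enough to sketch; a hedged PyTorch version that fits a single temperature T on held-out validation logits by minimizing NLL (function name and optimizer settings are our own choices):

import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels):
    # Optimize log T so that T stays positive; minimize NLL on held-out logits.
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=50)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()   # calibrated probabilities: softmax(logits / T)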
Geifman and El-Yaniv (2017) present a framework for selective classification with deep neural networks that follows the abstention view. A selection function decides whether to make a prediction or not. For the choice of selection function, they experiment with MC-Dropout and the softmax output. They provide an analytical trade-off between risk and coverage within their formulation.
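A toy selection function in this abstention spirit, thresholding the softmax response (the threshold value is illustrative and controls the risk/coverage trade-off):

import torch
import torch.nn.functional as F

def selective_predict(logits, threshold=0.9):
    # Predict only when the softmax response exceeds the threshold,
    # otherwise abstain (marked with -1).
    probs = F.softmax(logits, dim=1)
    conf, preds = probs.max(dim=1)
    preds[conf < threshold] = -1
    return preds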
input perturbation serves as a way to assess how the network would behave near the given input. When the temperature is 1 and the perturbation step is 0 we simply recover the PbThreshold method. ODIN, the state-of-the-art at the time of this writing, is reported to outperform the previous work [8] by a significant margin. We also assess the performance of ODIN in our work.
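A minimal ODIN-style scoring sketch, assuming a PyTorch classifier; the temperature and perturbation values below are illustrative defaults, not tuned ones:

import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0012):
    # Temperature-scaled softmax plus a small input perturbation
    # toward higher confidence (Liang et al. 2018, sketched).
    x = x.clone().requires_grad_(True)
    logits = model(x) / temperature
    # Loss is -log(max softmax); stepping against its gradient
    # increases the maximum softmax probability.
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()
    x_perturbed = x - epsilon * x.grad.sign()
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=1)
    return probs.max(dim=1).values   # low score suggests OOD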
These methods provide an abstract idea which depends on the successful training of GANs. To the best of our knowledge, training GANs is itself an active area of research, and it is not apparent what design decisions would be appropriate to implement these ideas in practice. Furthermore, some of these ideas are prohibitively expensive to execute at the time of this writing.
Uncertainty and Novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
"Outlier" Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets.
We extend the previous work by evaluating over a broader set
of datasets with varying levels of complexity. The
variation in complexity allows for a fine-grained evaluation of
the techniques. Since OOD detection is closely related to the
problem of density estimation, the dimensionality of the
input image will be of vital importance in practical
assessments. As the input dimensionality increases, we
expect the task to become much more difficult.
Therefore, to provide a more accurate picture of performance, it is crucial to evaluate the methods on high-dimensional data.
In low-dimensional datasets, K-NN SVM performs similarly or better than the other methods, including MC-Dropout.
The top-performing method, ODIN, is influenced by the number of classes in the dataset. Similar to PbThreshold, ODIN depends on the maximum signal in the class predictions, therefore the increased number of classes would directly affect both of the methods. Furthermore, neither of them consistently prefers VGG over Resnet within all datasets. Overall, ODIN consistently outperforms others in high-dimensional settings, but all the methods have a relatively low average accuracy in the 60%-78% range.
Uncertainty and Novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of
"Outlier" Detectors (2018) Alireza Shafaei, Mark Schmidt, and James J. Little
https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Uncertainty and Novelty detection #2
To Trust Or Not To Trust A Classifier
Heinrich Jiang, Been Kim, Maya Gupta (2018)
Google Research; Google Brain
https://arxiv.org/abs/1805.11783
We propose a new score, called the trust
score, which measures the agreement
between the classifier and a modified
nearest-neighbor classifier on the testing
example. We show empirically that high
(low) trust scores produce surprisingly high
precision at identifying correctly (incorrectly)
classified examples, consistently
outperforming the classifier's confidence
score as well as many other baselines.
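A simplified trust-score sketch, omitting the paper's density-based filtering of low-density training points:

import numpy as np
from sklearn.neighbors import NearestNeighbors

def trust_scores(X_train, y_train, X_test, y_pred):
    # Ratio of the distance to the nearest non-predicted class over
    # the distance to the predicted class (core idea of Jiang et al. 2018).
    classes = np.unique(y_train)
    dists = np.stack(
        [NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
         .kneighbors(X_test)[0].ravel() for c in classes], axis=1)
    rows = np.arange(len(X_test))
    pred_col = np.searchsorted(classes, y_pred)
    d_pred = dists[rows, pred_col]
    dists[rows, pred_col] = np.inf          # mask the predicted class
    return dists.min(axis=1) / (d_pred + 1e-12)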
Two example datasets and models. Predicting correctness (top row) and
incorrectness (bottom). The vertical dotted black line indicates accuracy level of the
classifier. The trust score consistently attains a higher precision for each given percentile
of classifier decision-rejection. Furthermore, the trust score generally shows increasing
precision as the percentile level increases, but surprisingly, many of the comparison
baselines do not.
Uncertainty and Novelty detection #3
Interpreting Neural Networks With Nearest
Neighbors
Eric Wallace, Shi Feng, Jordan Boyd-Graber
https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty.
This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values.
The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
Deep k-Nearest Neighbors: Towards Confident,
Interpretable and Robust Deep Learning
Nicolas Papernot and Patrick D. McDaniel (2018)
https://arxiv.org/abs/1803.04765
Debugging ResNet model biases—This illustrates how the
DkNN algorithm helps to understand a bias identified by Stock and
Cisse [105] in the ResNet model for ImageNet. The image at the
bottom of each column is the test input presented to the DkNN.
Each test input is cropped slightly differently to include (left) or
exclude (right) the football. Images shown at the top are nearest
neighbors in the predicted class according to the representation
output by the last hidden layer. This comparison suggests that the
“basketball” prediction may have been a consequence of the ball
being in the picture. Also note how the white apparel color and
general arm positions of players often match the test image of
Barack Obama.
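A toy Deep k-NN sketch along these lines, pooling nearest-neighbor labels across layer representations (the conformal credibility machinery of the paper is omitted):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def dknn_neighbor_agreement(layer_reps_train, y_train, layer_reps_test, k=10):
    # layer_reps_* are lists of (n_samples, d_layer) arrays, one per layer.
    votes = []
    for train_r, test_r in zip(layer_reps_train, layer_reps_test):
        nn = NearestNeighbors(n_neighbors=k).fit(train_r)
        _, idx = nn.kneighbors(test_r)
        votes.append(y_train[idx])              # (n_test, k) labels per layer
    votes = np.concatenate(votes, axis=1)       # pooled neighbor labels
    n_classes = int(y_train.max()) + 1
    counts = np.stack([(votes == c).sum(axis=1) for c in range(n_classes)], axis=1)
    return counts / votes.shape[1]              # per-class neighbor agreement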
Uncertainty and Novelty detection #4
AND: Autoregressive Novelty Detectors
Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara
(Submitted on 4 Jul 2018)
https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty
detection. The subject is treated as a density estimation
problem, in which a deep neural network is employed to learn a
parametric function that maximizes probabilities of training
samples. This is achieved by equipping an autoencoder with a
novel module, responsible for the maximization of
compressed codes' likelihood by means of autoregression. We
illustrate design choices and proper layers to perform
autoregressive density estimation when dealing with both
image and video inputs. Despite a very general formulation, our
model shows promising results in diverse one-class novelty
detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder. Paired with a standard compression-reconstruction network, a density estimation module learns the distribution of latent codes, via autoregression.
Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN
Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard
(Submitted on 11 Dec 2018)
https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on time series datasets. In order to achieve this goal, a bibliography is made focusing on theoretical properties of GANs and GANs used for anomaly detection. A Wasserstein GAN has been chosen to learn the representation of the normal data distribution, and a stacked encoder with the generator performs the anomaly detection. W-GAN with encoder seems to produce state-of-the-art anomaly detection scores on the MNIST dataset, and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection using a Wasserstein Generative Adversarial Network. The main reason is that the Wasserstein GAN does not collapse, contrary to the classical GAN, which needs to be heavily tuned in order to avoid this problem. Mode collapse can be blocking if we need to perform anomaly detection: if a subset of our data distribution is not learned by the generator, then all samples that are similar to this subset might end up classified as abnormal. Another added value of the Wasserstein GAN version compared to a standard GAN is the possibility of using the loss function of the discriminator to evaluate convergence, since it is an approximation of the Wasserstein distance between P_r and P_θ.
A future improvement consists in considering CNNs for both the generator and discriminator in order to detect anomalies from raw time series data. 1-D convolutions are needed and will be investigated to produce good visual representations of time series samples. A more thorough study of the impact of the architecture should also be done.
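A sketch of what such a 1-D convolutional critic could look like in PyTorch; layer sizes and kernel widths are illustrative assumptions, not the authors' architecture:

import torch.nn as nn

class Critic1D(nn.Module):
    # WGAN critic for raw time series: strided Conv1d blocks and an
    # unbounded scalar output (no sigmoid; the Wasserstein loss uses
    # the raw score).
    def __init__(self, in_channels=1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=7, stride=2, padding=3),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=7, stride=2, padding=3),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=7, stride=2, padding=3),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
            nn.Linear(128, 1),
        )

    def forward(self, x):        # x: (batch, channels, time)
        return self.net(x)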
Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series
Data with Generative Adversarial Networks
Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng
(Submitted on 15 Jan 2019) Institute of Data Science, National University of Singapore
https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection strategy with GAN (MAD-GAN) to model the complex multivariate correlations among the multiple data streams to detect anomalies using both the GAN-trained generator and discriminator. Unlike traditional classification methods, the GAN-trained discriminator learns to detect fake data from real data in an unsupervised fashion, making it an attractive unsupervised machine learning technique for anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on time series data using GANs, there are interesting issues that await further investigation. For example, we have noted the issues of determining the optimal subsequence length as well as the potential model instability of the GAN approaches.
For future work, we plan to conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees. We also hope to perform a detailed study on the stability of the detection model. In terms of applications, we plan to explore the use of MAD-GAN for other anomaly detection applications such as predictive maintenance and fault diagnosis for smart buildings and machineries.
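A generic GAN-based anomaly score in this spirit (closer to the AnoGAN/MAD-GAN family than to any single paper's exact formulation): invert the generator by optimizing the latent code, then combine reconstruction error with the discriminator signal. G, D, z_dim and lam below are assumed/illustrative:

import torch

def gan_anomaly_score(x, G, D, z_dim, lam=0.5, steps=100, lr=0.01):
    # Find the latent code whose generation best reconstructs x.
    z = torch.randn(x.size(0), z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        recon_loss = ((G(z) - x) ** 2).mean()
        recon_loss.backward()
        opt.step()
    with torch.no_grad():
        residual = ((G(z) - x) ** 2).mean(dim=tuple(range(1, x.dim())))
        disc = 1.0 - torch.sigmoid(D(x)).squeeze()   # high when D says "fake"
    return lam * residual + (1 - lam) * disc         # high score = anomalous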
Uncertainty: Insights from NLP
Quantifying Uncertainties in Natural Language
Processing Tasks
Yijun Xiao and William Yang Wang (Submitted on 18 Nov 2018)
https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the
benefits of characterizing model and data
uncertainties for natural language processing (NLP)
tasks. With empirical experiments on sentiment analysis,
named entity recognition, and language modeling using
convolutional and recurrent neural network models, we
show that explicitly modeling uncertainties is not only
necessary to measure output confidence levels, but also
useful for enhancing model performance in various
NLP tasks.
1. We mathematically define model and data uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting for model and data uncertainties, we observe significant improvements in three important NLP tasks;
3. We show that our model outputs higher data uncertainties for more difficult predictions in sentiment analysis and named entity recognition tasks.
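The law-of-total-variance decomposition can be sketched with MC-Dropout, assuming a heteroscedastic model that returns both a predictive mean and variance (the (mean, variance) interface is our own assumption for illustration):

import torch

def total_variance_decomposition(model, x, n_samples=50):
    # Var(y) = Var_theta(E[y|theta]) + E_theta[Var(y|theta)]:
    # model uncertainty is the variance of the mean across stochastic
    # (dropout) passes; data uncertainty is the mean predicted variance.
    model.train()                      # keep dropout stochastic
    means, variances = [], []
    with torch.no_grad():
        for _ in range(n_samples):
            mu, var = model(x)
            means.append(mu)
            variances.append(var)
    means = torch.stack(means)         # (n_samples, batch, ...)
    variances = torch.stack(variances)
    model_uncertainty = means.var(dim=0)
    data_uncertainty = variances.mean(dim=0)
    return model_uncertainty, data_uncertainty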
Uncertainty: CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes
Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio
Filippone (Submitted on 26 May 2018)
https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs with GPs, little attention has been devoted to understanding the implications in terms of the ability of these models to accurately quantify the level of uncertainty in predictions.
This is the first work that highlights the issues of
calibration of these models, showing that GPs cannot
cure the issues of miscalibration in CNNs. We
have proposed a novel combination of CNNs and GPs
where the resulting model becomes a particular form of
a Bayesian CNN for which inference using variational
inference is straightforward.
However, our results also indicate that combining CNNs
and GPs does not significantly improve the
performance of standard CNNs. This can serve as
a motivation for investigating new approximation
methods for scalable inference in GP models and
combinationswithCNNs.
Calibration of Convolutional Networks:
The issue of calibration of classifiers in machine learning was popularized in the 90's with the use of support vector machines for probabilistic classification. Calibration techniques aim to learn a transformation of the output using a validation set in order for the transformed output to give a reliable account of the actual probability of class labels; interestingly, calibration can be applied regardless of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs) have been shown to be well-calibrated. The reason is that the optimization of the cross-entropy loss promotes calibrated output. The same loss is used in Platt scaling and it corresponds to the correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a particular case of DNNs, however, show that depth has a negative impact on calibration, despite the use of a cross-entropy loss, and that regularization improves the calibration properties of classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes:
Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these approaches have not been investigated. In this work, we propose an alternative way to combine CNNs and GPs, where GPs are approximated using random feature expansions. The random feature expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation, turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a practical way of combining CNNs and GPs.
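The random feature expansion step can be sketched with random Fourier features (Rahimi & Recht); the function below approximates an RBF kernel and is a generic sketch, not the paper's exact construction:

import numpy as np

def random_fourier_features(X, n_features=256, lengthscale=1.0, seed=0):
    # phi(x) such that phi(x) . phi(y) approximates k(x, y) for an RBF
    # kernel; replacing the kernel matrix with these features turns a
    # GP into a Bayesian linear model.
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, n_features))
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)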
Uncertainty in timestamps, modeling for clinical use #1
Time-Discounting Convolution for Event Sequences
with Ambiguous Timestamps
(Submitted on 6 Dec 2018)
https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event
sequences with ambiguous timestamps, a time-
discounting convolution. Unlike in ordinary time series,
time intervals are not constant, small time-shifts
have no significant effect, and inputting timestamps or
time durations into a model is not effective. The criteria
that we require for the modeling are providing
robustness against time-shifts or timestamps
uncertainty as well as maintaining the essential
capabilities of time-series models, i.e., forgetting
meaningless past information and handling infinite
sequences.
The proposed method handles them with a
convolutional mechanism across time with specific
parameterizations, which efficiently represents the event
dependencies in a time-shift invariant manner while
discounting the effect of past events, and a dynamic
pooling mechanism, which provides robustness
against the uncertainty in timestamps and enhances the
time-discounting capability by dynamically changing the
pooling window size.
Imputation Literature Review
Types of Missing Values
Feldman et al. (2018): "Rubin (1976) discusses three possible mechanisms for the formation of missing values, each reflecting a different form of missing-data probabilities and relationships between the measured variables, and each may lead to different imputation methods (Luengo et al., 2012)"
Missing Completely at Random (MCAR): a missing value that cannot be related to the value itself or to other variable values in that record. This is a completely unsystematic missing pattern and therefore the observed data can be thought of as a random unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to other variable values in that record, but not to the value itself (e.g., a person with a "marital status" value "single" has a missing value in the "spouse name" attribute). In other words, in MAR scenarios, incomplete data can be partially explained and the actual value can possibly be predicted by other variable values.
Missing Not at Random (MNAR): the missing value is not random and depends on the actual value itself; hence, it cannot be explained by other values (e.g., an overweight person is reluctant to provide the "weight" value in a survey). MNAR scenarios are the most difficult to analyze and handle, as the missing data cannot be associated with other data items that are available in the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data
https://doi.org/10.1016/j.tree.2008.06.014
Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time
Series Data Using Different Interpolation Algorithms
August 2018
https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or other devices, it may be possible to miss several kinds of values of interest. In this paper, we focus on estimating the missing values in IoT time series data using three interpolation algorithms, including (1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3) Adaptive Inverse Distance Weighted."
On the choice of the best imputation methods for missing values
considering three groups of classification methods
June 2011
https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods and fourteen different imputation approaches to missing values treatment that are presented and analyzed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of determined missing values imputation methods could improve the accuracy obtained for these methods. In this study, the convenience of using imputation methods for preprocessing data sets with missing values is stated. The analysis suggests that the use of particular imputation methods conditioned to the groups is required."
We have discovered that the
Combined Multivariate Collapsing
(CMC) and Event Covering (EC)
methods show good behavior for
these two measures, and they are
two methods that provide good
results for an important range of
learning methods, as we have
previously analyzed. In short, these
two approaches introduce less
noise and maintain the mutual
information better.
Class center based approach for missing value
imputation (2018)
https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, which is composed of two modules. Each class center and its distances from the other observed data are measured to identify a threshold. Then, the identified threshold is used for missing value imputation. The proposed approach outperforms the other approaches for both numerical and mixed datasets. It requires much less imputation time than the machine learning based methods.
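A minimal class-center imputation sketch capturing only the core idea (the paper's threshold mechanism is omitted; missing entries are encoded as NaN):

import numpy as np

def class_center_impute(X, y):
    # Replace each missing entry with the mean of that feature over
    # observed values in the same class.
    X = X.copy()
    for c in np.unique(y):
        rows = (y == c)
        centers = np.nanmean(X[rows], axis=0)      # per-class feature means
        missing = np.isnan(X) & rows[:, None]
        X[missing] = np.take(centers, np.where(missing)[1])
    return X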
Imputation with Deep Learning #1
BRITS: Bidirectional Recurrent Imputation for Time
Series
Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li
(Submitted on 27 May 2018) https://arxiv.org/abs/1805.10572
https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong assumptions on the underlying data generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data.
Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. We simultaneously perform missing value imputation and classification/regression of applications jointly in one neural graph.
BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with nonlinear dynamics underlying; (c) it provides a data-driven imputation procedure and applies to general settings with missing data.
We evaluate the imputation performance in terms of
mean absolute error (MAE) and mean relative error
(MRE).
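For reference, MAE and MRE over imputed positions can be computed as follows (the mask convention, 1 where a value was artificially removed, is our own assumption):

import numpy as np

def mae_mre(y_true, y_pred, mask):
    # Evaluate only where values were held out for imputation.
    err = np.abs(y_true - y_pred)[mask == 1]
    mae = err.mean()
    mre = err.sum() / (np.abs(y_true)[mask == 1].sum() + 1e-12)
    return mae, mre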
Imputation with Deep Learning #2
End-to-End Time Series Imputation via Residual Short Paths
Lifeng Shen, Qianli Ma, Sen Li (2018)
http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual short paths, called Residual IMPutation LSTM (RIMP-LSTM), a flexible combination of residual short paths with graph-based temporal dependencies. We construct a residual sum unit (RSU), which enables RIMP-LSTM to make full use of previously revealed information to model incomplete time series and reduce the negative impact of missing values. Moreover, a switch unit is designed to detect the missing values, and a new loss function is then developed to train our model with time series in the presence of missing values in an end-to-end way, which also allows simultaneous imputation and prediction.
RIMP-LSTM combines the merits of graph-based models, with explicitly modeled temporal dependencies via weighted residual connections between nodes, with those of LSTM, which can accumulate historical residual information and learn the underlying patterns of incomplete time series automatically.
On the other hand, compared with IMP-LSTM, RIMP-LSTM has better performance as it is good at modeling temporal dependencies with weighted residual short paths, which demonstrates the reasonability of using these weighted residual paths to model graph-like temporal dependencies for imputation.
Imputation with Deep Learning #3
A context encoder for audio inpainting
Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (Submitted on 29 Oct 2018)
https://arxiv.org/abs/1810.12138
http://www.github.com/andimarafioti/audioContextEncoder
(Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds, a condition which has not received much attention yet. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps.
Here, the STFT features, meant as a reasonable first choice, provided a decent performance. In the future, we expect more hearing-related features to provide even better reconstructions. In particular, an investigation of Audlet frames, i.e., invertible time-frequency systems adapted to perceptual frequency scales, as features for audio inpainting presents intriguing opportunities.
Here, preferred architectures are those not relying on a predetermined target and input feature length, e.g., a recurrent network. Recent advances in generative networks will provide other interesting alternatives for analyzing and processing audio data as well. These approaches are yet to be fully explored.
Finally, music data can be highly complex and it is unreasonable to expect a single trained model to accurately inpaint a large number of musical styles and instruments at once. Thus, instead of training on a very general dataset, we expect significantly improved performance from more specialized networks that could be trained by restricting the training data to specific genres or instrumentation. Applied to a complex mixture and potentially preceded by a source-separation algorithm, the resulting models could be used jointly in a mixture-of-experts approach.
Imputation with Deep Learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation
Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (Submitted on 30 Jan 2019)
https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to learn the conditional distribution given very few known observations and achieves superior performance in various experiments of both deterministic and stochastic dynamics. Future work will investigate how to infer the underlying distribution when complete training data is unavailable. The trade-off between partial observations and external constraints is another direction for deep generative imputation models.
Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in
classification problems
Received 09 Mar 2016, Accepted 22 Dec 2016, Accepted author version posted online: 13 Jan 2017
https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels – data, model, and decision. The general framework is first developed at a high level."
Evolutionary Machine Learning for
Classification with Incomplete Data
Tran, Cao Truong (2018, PhD Thesis)
http://hdl.handle.net/10063/7639
"The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data.
The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data."
Imputation and Classification
Missing Data Imputation for Supervised Learning
August 2018
https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy for imputed models, but not one-hot encoded models. Future work can identify the conditions under which missing-data perturbation can improve prediction accuracy. Interesting extensions of this paper include evaluating the benefits of using missing-data perturbation over more popular regularization techniques such as dropout training.
Error rates on the Adult test set with (bottom) and without (top) missing data imputation, for various levels of MCAR-perturbed categorical training features (x-axis).
The Adult dataset contains N = 48,842 examples and 14 features (6 continuous and 8 categorical). The prediction task is to determine whether a person makes over $50,000 a year.
Decomposition Literature Review
CEEMD Empirical Mode Decomposition
Empirical mode decomposition for
seismic time-frequency analysis
Jiajun Han and Mirko van der Baan
Geophysics (2013) 78 (2): O9-O19.
https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode
decomposition decomposes a
seismic signal into a sum of
oscillatory components, with
guaranteed positive and smoothly
varying instantaneous frequencies.
Analysis on synthetic and real data
demonstrates that this method
promises higher spectral-spatial
resolution than the short-time
Fourier transform or wavelet
transform. Application on field data
thus offers the potential of
highlighting subtle geologic
structures that might otherwise
escape unnoticed.
CEEMD is a robust extension of EMD methods. It solves not only the mode mixing problem, but also leads to complete signal reconstructions. After CEEMD, instantaneous frequency spectra manifest visibly higher time-frequency resolution than short-time Fourier and wavelet transforms on synthetic and field data examples. These characteristics render the technique highly promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the
ensemble empirical mode decomposition (July 2015)
Computational Statistics 31(2):1-13. P.J.J. Luukko, Jouni Helske, E. Räsänen. C, R and Python.
http://doi.org/10.1007/s00180-015-0603-9
https://bitbucket.org/luukko/libeemd
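A minimal decomposition example in Python, assuming the PyEMD package ("EMD-signal" on PyPI) is installed; libeemd exposes similar functionality through its own Python bindings:

import numpy as np
from PyEMD import CEEMDAN

t = np.linspace(0, 1, 1000)
# Two-tone test signal: a slow and a fast oscillation.
signal = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t)

ceemdan = CEEMDAN()
imfs = ceemdan(signal)   # rows: oscillatory components (IMFs), coarsest last
print(imfs.shape)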
Source Separation "signal decomposition" #1
Wave-U-Net: A Multi-Scale Neural Network for
End-to-End Audio Source Separation
Daniel Stoller, Sebastian Ewert, Simon Dixon
Queen Mary University of London, Spotify
(Submitted on 8 Jun 2018)
https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
"Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations.
In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts.
Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data."
75 tracks from the training partition of the MUSDB
multi-track database are randomly assigned to
our training set. For singing voice separation, we
also add the whole CCMixter database to the
training set. No further data preprocessing is performed, only a
conversion to mono (except for stereo models) and downsampling to
22050 Hz.
For future work, we could investigate to what extent our model performs a spectral analysis, and how to incorporate computations similar to those in a multi-scale filterbank, or to explicitly compute a decomposition of the input signal into a hierarchical set of basis signals and weightings on which to perform the separation, similar to the TasNet [12].
Furthermore, better loss functions for raw audio prediction should be investigated, such as the ones provided by generative adversarial networks [3, 21], since the MSE might not reflect the perceived loss of quality well.
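An illustrative Wave-U-Net style down/up block pair in PyTorch, with decimation on the way down and linear-interpolation upsampling plus a skip connection on the way up (a sketch of the resampling idea, not the authors' exact code; channel counts and kernel widths are assumptions):

import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveUNetBlock(nn.Module):
    def __init__(self, ch_in, ch_out, kernel=15):
        super().__init__()
        self.down = nn.Conv1d(ch_in, ch_out, kernel, padding=kernel // 2)
        self.up = nn.Conv1d(ch_out + ch_in, ch_in, kernel, padding=kernel // 2)

    def forward(self, x):                      # x: (batch, ch_in, time)
        skip = F.leaky_relu(self.down(x))
        coarse = skip[:, :, ::2]               # decimate by 2 on the way down
        # ... deeper blocks would process `coarse` here ...
        upsampled = F.interpolate(coarse, size=skip.shape[-1],
                                  mode="linear", align_corners=True)
        # Concatenate upsampled features with the skip input and refine.
        return F.leaky_relu(self.up(torch.cat([upsampled, x], dim=1)))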
Source Separation "signal decomposition" #2
TasNet: Surpassing Ideal Time-Frequency
Masking for Speech Separation
Yi Luo, Nima Mesgarani
(Submitted on 21 Sep 2018)
https://arxiv.org/abs/1809.07454
"TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. A linear deconvolution layer serves as a decoder by inverting the encoder output back to the sound waveform. This encoder-decoder framework is similar to the ICA method when a nonnegative mixing matrix is used [Wang et al. 2009] and to the semi-nonnegative matrix factorization method (semi-NMF) [Ding et al. 2008], where the basis signals are the parameters of the decoder.
The masks are found using a temporal convolutional network (TCN) consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications."
Source Separation "signal decomposition" #3
Disentangling Correlated Speaker and Noise for
Speech Synthesis via Data Augmentation and
Adversarial Factorization
Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang,
Yonghui Wu, James Glass.
32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.
https://openreview.net/pdf?id=Bkg9ZeBB37
"To leverage crowd-sourced data to train multi-speaker text-to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, due to the lack of labels describing the recording conditions of each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users often make many recordings using the same equipment.
This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers."
Decompose High and Low frequencies
Drop an Octave: Reducing Spatial Redundancy in
Convolutional Neural Networks with Octave
Convolution
Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis,
Marcus Rohrbach, Shuicheng Yan, Jiashi Feng
(Submitted on 10 Apr 2019)
https://export.arxiv.org/abs/1904.05049
In this work, we propose to factorize the mixed feature maps by their frequencies and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement of (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost.
Decompose Signal and the Noise
Deep learning of dynamics and signal-noise
decomposition with time-stepping constraints
Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton
Department of Applied Mathematics / Mechanical Engineering, University of Washington, Seattle,
last revised 22 Aug 2018
https://arxiv.org/abs/1808.02578
https://github.com/snagcliffs/RKNN
"We propose a novel paradigm for data-driven modeling that simultaneously learns the dynamics and estimates the measurement noise at each observation. By constraining our learning algorithm, our method explicitly accounts for measurement error in the map between observations, treating both the measurement error and the dynamics as unknowns to be identified, rather than assuming idealized noiseless trajectories.
We also discuss issues with the generalizability of neural network models for dynamical systems and provide open-source code for all examples."
The combination of neural networks and numerical time-stepping
schemes suggests a number of high-priority research
directions in system identification and data-driven forecasting.
Future extensions of this work include considering systems with
process noise, a more rigorous analysis of the specific method for
interpolating f, including time delay coordinates to accommodate
latent variables, and generalizing the method to identify
partial differential equations. Rapid advances in hardware and
the ease of writing software for deep learning will enable these
innovations through fast turnover in developing and testing
methods.
Signal Restoration Literature Review
Super-resolution: Insights from audio
Time-frequency networks for audio super-resolution
Teck Yian Lim et al. (2018)
http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf
http://tlim11.web.engr.illinois.edu/
"Audio super-resolution (a.k.a. bandwidth extension) is the challenging task of increasing the temporal resolution of audio signals. Recent deep network approaches achieved promising results by modeling the task as a regression problem in either the time or frequency domain. In this paper, we introduced the Time-Frequency Network (TFNet), a deep network that utilizes supervision in both the time and frequency domains. We proposed a novel model architecture which allows the two domains to be jointly optimized."
Spectrogram corresponding to the LR input (frequencies above 4 kHz missing), the HR reconstruction, and the HR ground truth. Our approach successfully recovers the high frequency components from the LR audio signal.
GANs Also for time-series denoising #1a
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
"In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data.
Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data."
Consider, for example, a widely used method for time series featurization called Symbolic Aggregate approXimation (SAX) that assumes time series are generated from a single normal distribution. As shown previously, this assumption does not hold in several real-life time series datasets. Other techniques assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse characteristics and originate from different sources. Hence, in this work, we focus on learning the characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high computational complexity and large memory requirements, making it unsuitable for real-time applications.
For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for applications like artifact removal in EEG data as we cannot record clean versions of noisy EEG.
GANs Also for time-series denoising #1b
Denoising Time Series Data Using
Asymmetric Generative Adversarial
Networks
Sunil Gandhi; Tim Oates; Tinoosh Mohsenin and David
Hairston (2018)
https://doi.org/10.1007/978-3-319-93040-4_23
Pre-processing
The DC component in EEG data is different for each recording. We normalize every window of clean and noisy data to remove the DC offset from the data. We remove the DC offset by subtracting the median of the data in the window.
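This per-window DC-offset removal is a one-liner in NumPy (the window-matrix shape convention is our own assumption):

import numpy as np

def remove_dc_offset(windows):
    # windows: array of shape (n_windows, window_length); subtract each
    # window's median, as in the pre-processing described above.
    return windows - np.median(windows, axis=1, keepdims=True)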
Evaluation of EEG data is challenging as the ground-truth noiseless signals are not known. Multiple approaches to evaluation have been proposed in recent years; however, authors do not agree on a single mechanism for evaluating artifact removal.
GANs Also for speech denoising
SEGAN: Speech Enhancement Generative
Adversarial Network
Santiago Pascual, Antonio Bonafonte, and Joan Serra (2017)
https://arxiv.org/abs/1703.09452
https://github.com/santi-pdp/segan
"For the purpose of speech enhancement and denoising, the SEGAN was developed, employing a neural network with an encoder and decoder pathway that successively halves and doubles the resolution of feature maps in each layer, respectively, and features skip connections between encoder and decoder layers.
The model works as an encoder-decoder fully-convolutional structure, which makes it fast to operate for denoising waveform chunks. The results show that not only is the method viable, but it can also represent an effective alternative to current approaches.
Possible future work involves the exploration of better convolutional structures and the inclusion of perceptual weightings in the adversarial training, so that we reduce possible high frequency artifacts that might be introduced by the current model. Further experiments need to be done to compare SEGAN with other competitive approaches."
The dataset is a selection of 30 speakers from the Voice Bank corpus.
Partners Life - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Tech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdfTech Trends Report 2024 Future Today Institute.pdf
Tech Trends Report 2024 Future Today Institute.pdf
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 

Deep Learning for Biomedical Unstructured Time Series

• 6. Representation vs Similarity (continued): “... promising works on deriving kernels from parametric models, such as the probability product kernel, the Fisher kernel, and reservoir-based kernels. Common to all these methods is, however, a strong dependence on correct hyperparameter tuning, which is difficult to obtain in an unsupervised setting. Moreover, many of these methods cannot naturally be extended to deal with multivariate time series (MTS), as they only capture the similarities between individual attributes and do not model the dependencies between multiple attributes. Equally important, these methods are not designed to handle missing data, an important limitation in many existing scenarios, such as clinical data, where MTS originating from Electronic Health Records (EHRs) often contain missing data. In this work, we propose a surgical site infection detection framework for patients undergoing colorectal cancer surgery that is completely unsupervised, hence alleviating the problem of getting access to labelled training data. The framework is based on powerful kernels for multivariate time series that account for missing data when computing similarities.” https://arxiv.org/abs/1803.07879
• 7. Analysis with Similarity Measures: Time Series Cluster Kernel for Learning Similarities between Multivariate Time Series with Missing Data. Karl Øyvind Mikalsen, Filippo Maria Bianchi, Cristina Soguero-Ruiz, Robert Jenssen (last revised 29 Jun 2017) https://arxiv.org/abs/1704.00794 | https://github.com/kmi010/Time-series-cluster-kernel-TCK- (TCK was implemented in R and Matlab). Similarity-based approaches represent a promising direction for time series analysis. However, many such methods rely on parameter tuning, and some have shortcomings if the time series are multivariate (MTS), due to dependencies between attributes, or if the time series contain missing data. In this paper, we address these challenges within the powerful context of kernel methods by proposing the robust time series cluster kernel (TCK). The approach taken leverages the missing-data handling properties of Gaussian mixture models (GMM) augmented with informative prior distributions. An ensemble learning approach is exploited to ensure robustness to parameters by combining the clustering results of many GMMs to form the final kernel. The experimental results demonstrated that TCK (1) is robust to hyperparameter settings, (2) is competitive with established methods on prediction tasks without missing data, and (3) is better than established methods on prediction tasks with missing data. In future work we plan to investigate whether the use of more general covariance structures in the GMM, or the use of HMMs as base probabilistic models, could improve TCK.
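The TCK construction is only described at a high level above, so a loose sketch of the ensemble idea may help: fit several GMMs with varied hyperparameters and accumulate the inner products of their posterior assignment vectors. Everything here (the function name `tck_like_kernel`, the hyperparameter ranges) is illustrative; the real TCK additionally handles missing data via informative priors and time-segment subsampling, which this sketch omits.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def tck_like_kernel(X, n_models=10, max_components=5, seed=0):
    """Loose sketch of the TCK idea for vectorized series X (n_samples, n_timesteps):
    fit an ensemble of GMMs and sum inner products of posterior assignments."""
    rng = np.random.default_rng(seed)
    K = np.zeros((len(X), len(X)))
    for _ in range(n_models):
        g = GaussianMixture(
            n_components=int(rng.integers(2, max_components + 1)),
            random_state=int(rng.integers(1_000_000)),
        ).fit(X)
        Q = g.predict_proba(X)  # posterior cluster assignments per series
        K += Q @ Q.T            # series with similar posteriors become similar
    return K / n_models
```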
• 8. Wavelets → Shapelets: shapelets as “1D Gabors” #1. Fast classification of univariate and multivariate time series through shapelet discovery https://doi.org/10.1007/s10115-015-0905-9 Josif Grabocka, Martin Wistuba, Lars Schmidt-Thieme. A Shapelet Selection Algorithm for Time Series Classification: New Directions https://doi.org/10.1016/j.procs.2018.03.025 The high time complexity of the shapelet selection process hinders its application in real-time data processing. To overcome this, the authors propose a fast shapelet selection algorithm (FSS), which sharply reduces the time consumption of shapelet selection. https://slideplayer.com/slide/8370683/ For example, a class of abnormal ECG measurements may be characterised by an unusual pattern that only occurs occasionally at any point during the measurement. Shapelets are subseries that capture this type of characteristic. They allow for the detection of phase-independent localised similarity between series within the same class. The great time series classification bake off: a review and experimental evaluation of recent algorithmic advances. Anthony Bagnall, Jason Lines, Aaron Bostrom, James Large, Eamonn Keogh (May 2017) https://doi.org/10.1007/s10618-016-0483-9 | https://bitbucket.org/TonyBagnall/time-series-classification
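The core primitive behind all shapelet methods is exactly the phase-independent matching described above: slide the shapelet along the series and keep the best match. A minimal NumPy sketch (the function name `shapelet_distance` is ours; per-window z-normalization, which most shapelet papers apply, is omitted for brevity):

```python
import numpy as np

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between a shapelet and any equal-length
    subsequence of the series, i.e. a phase-independent match score."""
    m = len(shapelet)
    best = np.inf
    for start in range(len(series) - m + 1):
        window = series[start:start + m]
        best = min(best, np.linalg.norm(window - shapelet))
    return best
```

A series is then typically represented by its vector of distances to a set of discovered shapelets (the "shapelet transform"), on which any standard classifier can be trained.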
• 9. Wavelets → Shapelets: shapelets as “1D Gabors” #2. A fast shapelet selection algorithm for time series classification https://doi.org/10.1016/j.comnet.2018.11.031 The training time of shapelet-based algorithms is high, even though it is computed off-line, and the authors aim to make it more efficient. Shapelet transformation algorithms have attracted a great deal of attention in the last decade. However, the time complexity of the shapelet selection process in shapelet transformation algorithms is too high. To accelerate the shapelet selection process with no reduction in accuracy, we presented FSS for ST. The experimental results demonstrate that our proposed FSS was thousands of times faster than the original shapelet transformation method with no reduction in accuracy. Our results also demonstrate that our method was the fastest among shapelet methods that have the leading level of accuracy.
• 10. Representation Learning with deep learning #1: Towards a Universal Neural Network Encoder for Time Series. Joan Serrà, Santiago Pascual, Alexandros Karatzoglou (submitted 10 May 2018) https://arxiv.org/abs/1805.03908 We have studied the use of a universal encoder for time series in the specific case of classifying an out-of-sample data set of an unseen data type. We have considered the cases of no adaptation, mapping adaptation, and full adaptation. In all cases we achieve performances that are competitive with the state of the art and that, in addition, involve a compact reusable representation and few training iterations. We have also studied the effect of the representation dimensionality, showing that small representations have an impact on no-adaptation and mapping-adaptation approaches, but not much on full-adaptation ones. In the future, we plan to refine the encoder architecture, as well as optimize some of the parameters we use empirically in our experiments. A very interesting direction for future research is the adoption of one-shot learning schemas (Snell et al. 2017; Sutskever et al. 2014), which we find very suitable for the current setting in time series classification problems. A further option to enhance the performance of a universal encoder is data augmentation, especially considering recent linear instance/class interpolation approaches (Zhang et al. 2018). In order to have sufficient knowledge to accomplish any task, and in order to be applicable in the absence of labeled data or even without adaptation/re-training, researchers have been increasingly adopting the generic concept of universal encoders, especially within the text processing domain (note that related concepts also exist in other domains). The basic idea is to train a model (the encoder) that learns a common representation which is useful for a variety of tasks and that, at the same time, can be reused for novel tasks with minimal or no adaptation. While it would seem that classical autoencoders and other unsupervised models should perfectly fit this purpose, recent research in sentence encoding shows that, with current means, encoders learnt with a sufficiently large set of supervised tasks, or mixing supervised and unsupervised data, consistently outperform their purely unsupervised counterparts.
• 11. Representation Learning with deep learning #2: One Deep Music Representation to Rule Them All? A comparative analysis of different representation learning strategies. Jaehun Kim, Julian Urbano, Cynthia C. S. Liem, Alan Hanjalic (submitted 13 Feb 2018) https://arxiv.org/abs/1802.04051 Our work will address the following research questions. RQ1: Given a set of common learning tasks that can be used to train a network, what is the influence of the number and type of the tasks on the effectiveness of the learned deep representation? RQ2: How do various degrees of information sharing in the deep architecture affect the ultimate success of a learned deep representation? RQ3: What is the best way to assess the effectiveness of a deep representation? Simplified illustration of the conceptual difference between traditional deep transfer learning (DTL) based on a single learning task (above) and multi-task based deep transfer learning (MTDTL) (below). The same color used for a learning and an unseen task indicates that the tasks have commonalities, which implies that the learned representation is likely to be informative for the unseen task. At the same time, this representation may not be that informative to another unseen task, leading to a low transfer learning performance. The hypothesis behind MTDTL is that relying on more learning tasks increases the robustness of the learned representation and its usability for a broader set of unseen tasks.
• 12. Representation Learning with deep learning #3: Learning Finer-class Networks for Universal Representations https://arxiv.org/abs/1810.02126 https://arxiv.org/abs/1712.09708 Julien Girard, Youssef Tamaazousti, Hervé Le Borgne, Céline Hudelot (submitted 4 Oct 2018). Many real-world visual recognition use-cases cannot directly benefit from state-of-the-art CNN-based approaches because of the lack of many annotated data. The usual approach to deal with this is to transfer a representation pre-learned on a large annotated source task onto a target task of interest. This raises the question of how well the original representation is "universal", that is to say directly adapted to many different target tasks. To improve such universality, the state of the art consists in training networks on a diversified source problem, which is modified either by adding generic or specific categories to the initial set of categories. We propose two methods to improve universality, but pay special attention to limiting the need for annotated data. We also propose a unified framework of the methods based on diversifying the training problem. Finally, to better match Atkinson's cognitive study about universal human representations, we propose to rely on the transfer-learning scheme as well as a new metric to evaluate universality. We show that our method learns more universal representations than the state of the art, leading to significantly better results on 10 target tasks from multiple domains, using several network architectures, either alone or combined with networks learned at a coarser semantic level.
• 13. Representation Learning with deep learning #4: Improving Clinical Predictions through Unsupervised Time Series Representation Learning https://arxiv.org/abs/1812.00490 Xinrui Lyu, Matthias Hüser, Stephanie L. Hyland, George Zerveas, Gunnar Rätsch (submitted 2 Dec 2018). Machine Learning for Health (ML4H) Workshop at NeurIPS 2018. We empirically showed that in scenarios where labeled medical time series data is scarce, training classifiers on unsupervised representations provides performance gains over end-to-end supervised learning using raw input signals, thus making effective use of information available in a separate, unlabeled training set. The proposed model, explored for the first time in the context of unsupervised patient representation learning, produces representations with the highest performance in future signal prediction and clinical outcome prediction, exceeding several baselines. The idea behind applying attention mechanisms to time series forecasting is to enable the decoder to preferentially "attend" to specific parts of the input sequence during decoding. This allows particularly relevant events (e.g. drastic changes in heart rate) to contribute more to the generation of different points in the output sequence.
• 14. Representation Learning with deep learning #5: Unsupervised Scalable Representation Learning for Multivariate Time Series https://arxiv.org/abs/1901.10738 https://github.com/White-Link/UnsupervisedScalableRepresentationLearningTimeSeries (PyTorch). Jean-Yves Franceschi, Aymeric Dieuleveut, Martin Jaggi (submitted 30 Jan 2019). Hence, we propose in the following an unsupervised method to learn general-purpose representations for multivariate time series that comply with the issues of varying and potentially high lengths of the studied time series. To this end, we adapt recognized deep learning tools and introduce a novel unsupervised loss. Our representations are computed by a deep convolutional neural network with dilated convolutions (i.e. TCNs). This network is then trained unsupervised, using the first specifically designed triplet loss in the time series literature, taking advantage of the encoder's resilience to time series of unequal lengths. We leave as future work the applicability of our method to other tasks like forecasting, and the study of its impact if it were to be added to powerful ensemble methods.
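Franceschi et al. train the dilated-convolution encoder with a time-series-specific triplet loss in which positives are subseries of the anchor and negatives are drawn from other series. As a rough PyTorch sketch of the general idea, using a classical margin-based formulation rather than the paper's exact loss, and assuming an `encoder` and the crop sampling exist elsewhere:

```python
import torch
import torch.nn.functional as F

def triplet_loss(z_anchor, z_pos, z_neg, margin=1.0):
    """Pull the embedding of a subseries (positive) towards its source
    series (anchor); push embeddings of other series (negatives) away."""
    d_pos = F.pairwise_distance(z_anchor, z_pos)
    d_neg = F.pairwise_distance(z_anchor, z_neg)
    return F.relu(d_pos - d_neg + margin).mean()

# z_* = encoder(x_*), where x_pos is a random crop of the anchor window and
# x_neg a crop of a different, randomly chosen series; minimize over batches.
```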
• 15. Representation Learning with deep learning #6: Unsupervised speech representation learning using WaveNet autoencoders https://arxiv.org/abs/1812.00490 Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron van den Oord (submitted 25 Jan 2019). We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high-level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low-level details in the signal such as the underlying pitch contour or background noise. The behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. Our best models used MFCCs (mel-frequency cepstral coefficients) as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10 ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similar confounding detail in the audio signal.
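To make the described encoder input concrete, here is one way to compute 13 MFCCs at a 100 Hz frame rate with first and second temporal derivatives using librosa. The helper name `mfcc_features` and the 16 kHz sample rate are our assumptions, not details taken from the paper:

```python
import librosa
import numpy as np

def mfcc_features(path, sr=16000, hop_ms=10):
    """13 MFCCs every 10 ms, stacked with delta and delta-delta features."""
    y, sr = librosa.load(path, sr=sr)
    hop = int(sr * hop_ms / 1000)                 # 10 ms hop -> 100 Hz frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
    d1 = librosa.feature.delta(mfcc, order=1)     # first temporal derivative
    d2 = librosa.feature.delta(mfcc, order=2)     # second temporal derivative
    return np.vstack([mfcc, d1, d2])              # shape (39, n_frames)
```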
• 16. Representation Learning with deep learning #7: A Tale of Two Time Series Methods: Representation Learning for Improved Distance and Risk Metrics https://dspace.mit.edu/bitstream/handle/1721.1/119575/1076345253-MIT.pdf Divya Shanmugam (June 2018). Architecture of the proposed model: a single convolutional layer extracts local features from the input, which a strided maxpool layer reduces to a fixed-size vector. A fully connected layer with ReLU activation carries out further, nonlinear dimensionality reduction to yield the embedding. A softmax layer is added at training time. We introduce the multiple instance learning paradigm to risk stratification. Risk stratification models aim to identify patients at high risk for a given outcome so that doctors may intervene, with the attempt of avoiding that outcome. Machine learning has led to improved risk stratification models for a number of outcomes, including stroke, cancer and treatment resistance [55]. To the best of our knowledge, this is the first application of multiple instance learning to risk stratification. The extension of Jiffy to multi-label classification and unsupervised learning poses a challenging but necessary task. The availability of unlabeled time series data eclipses the availability of its annotated counterpart. Thus, a simple network-based method for representation learning on multivariate time series in the absence of labels is an important line of work. There is also potential to further increase Jiffy's speed by replacing the fully connected layer with a structured [Bojarski et al. 2016] or binarized [Rastegari et al. 2016] matrix. The proposed risk stratification model extends naturally to a range of adverse outcomes. The model is not limited to operating on ECG signals; it is worth exploring whether the multiple instance learning approach may be successful in other modalities of medical data, including voice. On a theoretical level, strong generalization guarantees for distinguishing bags with relative witness rates do not exist and are worth exploring as these models are applied in the real world.
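Based on the architecture description above (one convolutional layer, max-pooling to a fixed-size vector, a ReLU fully connected layer, and a train-time softmax), a minimal PyTorch sketch might look as follows. All layer sizes are illustrative guesses, and `AdaptiveMaxPool1d` stands in for the strided max-pool so that variable-length inputs always yield a fixed-size vector:

```python
import torch.nn as nn

class JiffyLikeEmbedder(nn.Module):
    """Sketch of a Jiffy-style embedder: conv -> fixed-size maxpool ->
    ReLU fully connected embedding, with a softmax head used only at training."""
    def __init__(self, in_channels, n_classes, embed_dim=40):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 32, kernel_size=8, padding=4)
        self.pool = nn.AdaptiveMaxPool1d(16)  # fixed-size output vector
        self.embed = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16, embed_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(embed_dim, n_classes)  # train-time softmax head

    def forward(self, x):                   # x: (batch, channels, time)
        z = self.embed(self.pool(self.conv(x)))
        return self.classifier(z), z        # logits for training, z for distances
```

After training, the classifier head is discarded and distances between the `z` embeddings replace raw-signal distances such as DTW.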
• 17. Intro to methods #1a: Highly comparative time-series analysis: the empirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones
• 18. Intro to methods #1b: Highly comparative time-series analysis: the empirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones. Structure in a library of 8651 time-series analysis operations. (a) A summary of the four main classes of operations in the library, as determined by a k-medoids clustering, reflects a crude but intuitive overview of the time-series analysis literature. (b) A network representation of the operations in the library that are most similar to the approximate entropy algorithm, ApEn(2,0.2) [7], retrieved from the library automatically. Each node in the network represents an operation and links encode distances between them (computed using a normalized mutual-information-based distance metric, cf. electronic supplementary material, §S1.3.1). Annotated scatter plots show the outputs of ApEn(2,0.2) (horizontal axis) against a representative member of each shaded community (indicated by a heavily outlined node, vertical axis). Similar pictures can be produced by targeting any given operation in the library, thereby connecting different time-series analysis methods that nevertheless display similar behaviour across empirical time series. Key scientific questions can be addressed by representing time series by their properties (measured by many types of analysis methods) and operations by their behaviour (across many types of time-series data). This representation facilitates a range of versatile techniques for addressing scientific time-series analysis problems, illustrated schematically in the figure: the representations of time series (rows of the data matrix, figure 1a) and operations (columns of the data matrix, figure 1b) serve as empirical fingerprints. Coloured borders label different classes of time series and operations, and the figures that explicitly demonstrate each technique are given in the bottom right-hand corner of each panel. (a) Time-series datasets can be organized automatically, revealing the structure in a given dataset (cf. figures 4a,b and 5a). (b) Collections of scientific methods can be organized automatically, highlighting relationships between methods developed in different fields (cf. figures 3a and 5b). (c) Real-world and model-generated data with similar properties to a specific time-series target can be identified (cf. figure 4c,d). (d) Given a specific operation, alternatives from across science can be retrieved (cf. figure 3b). (e) Regression: the behaviour of operations in the library can be compared to find operations that vary with a target characteristic assigned to time series in a dataset (cf. figure 5d). (f) Classification: operations can be selected based on their classification performance to build useful classifiers and gain insight into the differences between classes of labelled time-series datasets (cf. figure 5e).
• 19. Intro to methods #1c: Highly comparative time-series analysis: the empirical structure of time series and their methods http://doi.org/10.1098/rsif.2013.0048 Ben D. Fulcher, Max A. Little, Nick S. Jones. Highly comparative techniques for time-series analysis tasks. We draw on our full library of time-series analysis methods to: (a) structure datasets in meaningful ways, and retrieve and organize useful operations for (b,e) classification and (c,d) regression tasks. (a) Five classes of EEG signals are structured meaningfully in a two-dimensional principal components space of our library of operations. (b) Pairwise linear correlation coefficients measured between the 60 most successful operations for classifying congestive heart failure and normal sinus rhythm RR interval series. Clustering reveals that most operations are organized into one of three groups (indicated by dashed boxes).
• 20. Most of the time, when people talk about time series and deep learning, they are most likely talking about sequences (e.g. language) rather than unstructured time series (e.g. a voice waveform).
• 21. “Sequences” vs “Time Series”: “dense time series” at video frame rate. Ice hockey as a game can be simplified to discrete events (sequences) https://arxiv.org/abs/1808.04063 It is not always so black and white, but in our case the time series are mainly dense 1D biosignals with ambiguous or missing discrete states.
• 22. Time Series: RNNs for sequences. The Unreasonable Effectiveness of Recurrent Neural Networks, May 21, 2015 | Andrej Karpathy http://karpathy.github.io/2015/05/21/rnn-effectiveness/ DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Daniel Quang, Xiaohui Xie, Nucleic Acids Research, Volume 44, Issue 11, 20 June 2016, Page e107, https://doi.org/10.1093/nar/gkw226 Deep Learning for Understanding Consumer Histories by Tobias Lang, 25 Oct 2016 https://jobs.zalando.com/tech/blog/deep-learning-for-understanding-consumer-histories/?gh_src=4n3gxh1 Sequences. Depending on your background you might be wondering: What makes Recurrent Networks so special?
• 24. Time Series: LSTMs applied. DeepAir | UC Berkeley School of Information https://www.ischool.berkeley.edu/projects/2017/deep-air This project investigates the use of the LSTM recurrent neural network (RNN) as a framework for forecasting, based on time series data of pollution and meteorological information in Beijing. Our results show that the LSTM framework produces equivalent accuracy when predicting future time stamps compared to the baseline support vector regression for a single time stamp. Using our LSTM framework, we can now extend the prediction from a single time stamp out to 5 to 10 hours into the future. Overview of our self-supervised approach for posture and sequence representation learning using CNN-LSTM: after the initial training with motion-based detections we retrain our model to enhance the learning of the representations. https://doi.org/10.1109/CVPR.2017.399 Piano Genie: An Intelligent Musical Interface, Oct 15, 2018 | https://magenta.tensorflow.org/pianogenie Chris Donahue, Ian Simon, Sander Dieleman. A bidirectional LSTM encoder maps a sequence of piano notes to a sequence of controller buttons (shown as 4 in the above figure, 8 in the actual system). A unidirectional LSTM decoder then decodes these controller sequences back into piano performances. After training, the encoder is discarded and controller sequences are provided by user input.
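For readers who want the minimal version of the forecasting setup described above (map a window of past values to the next time stamp), a small many-to-one LSTM in PyTorch; the class name and layer sizes are ours, not taken from the DeepAir project:

```python
import torch.nn as nn

class LSTMForecaster(nn.Module):
    """Minimal many-to-one LSTM: a window of past observations in,
    a single next-step prediction out."""
    def __init__(self, n_features=1, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):              # x: (batch, time, features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])   # predict from the last hidden state
```

Multi-hour horizons such as the 5 to 10 hours mentioned above are usually handled either by widening the output layer (direct multi-step) or by feeding predictions back in autoregressively.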
• 25. Time Series: RNN/LSTMs are outdated? #1. The fall of RNN / LSTM, Eugenio Culurciello https://towardsdatascience.com/the-fall-of-rnn-lstm-2d1594c74ce0 Combining multiple neural attention modules gives the “hierarchical neural attention encoder”. Notice there is a hierarchy of attention modules here, very similar to the hierarchy of neural networks. This is also similar to the Temporal Convolutional Network (TCN) → shapelets. Attention models, e.g. Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction. Maha Elbayad, Laurent Besacier, Jakob Verbeek (submitted 11 Aug 2018) https://arxiv.org/abs/1808.03867 | https://github.com/elbayadm/attn2d
• 26. Time Series: RNN/LSTMs are outdated? #2. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. Shaojie Bai, J. Zico Kolter, Vladlen Koltun (revised 19 Apr 2018) https://arxiv.org/abs/1803.01271 | http://github.com/locuslab/TCN For most deep learning practitioners, sequence modeling is synonymous with recurrent networks. Yet recent results indicate that convolutional architectures can outperform recurrent networks on tasks such as audio synthesis and machine translation. Given a new sequence modeling task or dataset, which architecture should one use? We conduct a systematic evaluation of generic convolutional and recurrent architectures for sequence modeling. The models are evaluated across a broad range of standard tasks that are commonly used to benchmark recurrent networks. Our results indicate that a simple convolutional architecture outperforms canonical recurrent networks such as LSTMs across a diverse range of tasks and datasets, while demonstrating longer effective memory. We conclude that the common association between sequence modeling and recurrent networks should be reconsidered, and convolutional networks should be regarded as a natural starting point for sequence modeling tasks. The preeminence enjoyed by recurrent networks in sequence modeling may be largely a vestige of history. Until recently, before the introduction of architectural elements such as dilated convolutions and residual connections, convolutional architectures were indeed weaker. Our results indicate that with these elements, a simple convolutional architecture is more effective across diverse sequence modeling tasks than recurrent architectures such as LSTMs. Due to the comparable clarity and simplicity of TCNs, we conclude that convolutional networks should be regarded as a natural starting point and a powerful toolkit for sequence modeling.
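The two architectural elements the authors credit, dilated convolutions and residual connections, combine into the TCN's basic building block. A simplified PyTorch sketch is below; the reference implementation at the linked repository additionally uses weight normalization and dropout, which are omitted here:

```python
import torch.nn as nn

class CausalBlock(nn.Module):
    """One TCN-style residual block of dilated causal 1D convolutions.
    Left-padding by (kernel_size - 1) * dilation keeps the convolution causal:
    the output at time t never sees inputs later than t."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation
        self.pad = nn.ConstantPad1d((pad, 0), 0.0)   # pad on the left only
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):                 # x: (batch, channels, time)
        y = self.relu(self.conv1(self.pad(x)))
        y = self.relu(self.conv2(self.pad(y)))
        return self.relu(x + y)           # residual connection
```

Stacking such blocks with dilations 1, 2, 4, 8, ... grows the receptive field exponentially with depth, which is where the "longer effective memory" claim comes from.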
• 27. Time Series: RNN/LSTMs are outdated? #3. Dilated Temporal Fully-Convolutional Network for Semantic Segmentation of Motion Capture Data. Noshaba Cheema, Somayeh Hosseini, Janis Sprenger, Erik Herrmann, Han Du, Klaus Fischer, Philipp Slusallek (submitted 24 Jun 2018) https://arxiv.org/abs/1806.09174 Semantic segmentation of motion capture sequences plays a key part in many data-driven motion synthesis frameworks. It is a preprocessing step in which long recordings of motion capture sequences are partitioned into smaller segments. Afterwards, additional methods like statistical modeling can be applied to each group of structurally-similar segments to learn an abstract motion manifold. The segmentation task, however, often remains a manual task, which increases the effort and cost of generating large-scale motion databases. We therefore propose an automatic framework for semantic segmentation of motion capture data using a dilated temporal fully-convolutional network. Our model outperforms a state-of-the-art model in action segmentation, as well as three networks for sequence modeling.
• 28. Time Series: RNN/LSTMs are outdated? #4. Temporal Convolutional Networks and Dynamic Time Warping can Drastically Improve the Early Prediction of Sepsis. Michael Moor, Max Horn, Bastian Rieck, Damian Roqueiro and Karsten Borgwardt (submitted 7 Feb 2019) https://arxiv.org/abs/1902.01659 https://osf.io/av5yx/?view_only=a6e3442634b34d53ba6e59c4a956b318 For future work, we aim to extend our analysis to more types of data sources arising from the ICU. Futoma et al. (2017b) already employed a subset of baseline covariates, medication effects, and missingness indicator variables. However, a multitude of feature classes still remain to be explored and properly integrated. For instance, the combination of sequential and non-sequential features has previously been handled by feeding non-sequential data into the sequential model (Futoma et al., 2017a). We hypothesize that this could be handled more efficiently by using a more modular architecture that incorporates both sequential and non-sequential parts. Furthermore, we aim to obtain a better understanding of the time series features utilized by the model. Specifically, we are interested in assessing the interpretability of the learned filters of the MGP-TCN framework and evaluating how much the activity of an individual filter contributes to a prediction. This endeavor is somewhat facilitated by our use of a convolutional architecture. The extraction of short per-channel signals could prove very relevant for supporting diagnoses made by clinical practitioners. Overview of our model: the raw, irregularly spaced time series are provided to the Multi-task Gaussian Process (MGP) patient by patient. The MGP then draws from a posterior distribution (given the observed data) at evenly spaced grid times (each hour). This grid is then fed into a temporal convolutional network (TCN) which, after a forward pass, returns a loss. Its gradient is then computed by backpropagating through the computational graph including both the TCN and the MGP (green arrows). Both the MGP and TCN parameters are learned end-to-end during training. We evaluate all methods using the Area under the Precision–Recall Curve (AUPRC) and additionally display the (less informative) Area under the Receiver Operating Characteristic (AUC). The current state-of-the-art method, MGP-RNN, is shown in blue. The two approaches for early detection of sepsis that were introduced in this paper, i.e. MGP-TCN and the DTW-KNN ensemble, are shown in pink and red, respectively. By using three random splits for all measures and methods, we depict the mean (line) and standard deviation error bars (shaded area).
• 30. Structuring Clinical Text: Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification (2018) https://doi.org/10.1016/j.artmed.2018.11.004 Department of Biomedical Data Science, Stanford University School of Medicine, Stanford, CA, USA. This paper explores cutting-edge deep learning methods for information extraction from medical imaging free-text reports at a multi-institutional scale and compares them to the state-of-the-art domain-specific rule-based system PEFinder and traditional machine learning methods, SVM and AdaBoost. Visualization methods have been developed to identify the impact of input words on the output decision for both deep learning models. Domain Phrase Attention-based Hierarchical Neural Network (DPA-HNN) architecture.
• 31. Clinical Text + Images: Unsupervised Multimodal Representation Learning across Medical Images and Reports (Machine Learning for Health (ML4H) Workshop at NeurIPS 2018) https://arxiv.org/abs/1811.08615 MIT CSAIL. Joint embeddings between medical imaging modalities and associated radiology reports have the potential to offer significant benefits to the clinical community, ranging from cross-domain retrieval to conditional generation of reports to the broader goals of multimodal representation learning. In this work, we establish baseline joint embedding results measured via both local and global retrieval methods on the soon-to-be-released MIMIC-CXR dataset consisting of both chest X-ray images and the associated radiology reports. We establish baseline results using supervised and unsupervised joint embedding methods along with local (direct pairs) and global (ICD-9 code groupings) retrieval evaluation metrics. Results show a possibility of incorporating more unsupervised data into training for a minimal-effort performance increase. A further study of joint embeddings between these modalities may enable significant applications, such as text/image generation or the incorporation of other EMR modalities.
• 33. EHR Mining: risk prediction model. Risk Prediction on Electronic Health Records with Prior Medical Knowledge (2018) https://doi.org/10.1145/3219819.3220020 We propose a novel and general framework called PRIME for the risk prediction task, which can successfully incorporate discrete prior medical knowledge into all of the state-of-the-art predictive models using the posterior regularization technique. Different from traditional posterior regularization, we do not need to manually set a bound for each piece of prior medical knowledge when modeling the desired distribution of the target disease on patients. Moreover, the proposed PRIME can automatically learn the importance of different prior knowledge with a log-linear model. The limitation of this work is that the proposed PRIME is only effective for common diseases. For rare and emerging diseases, since there is little medical knowledge about them, it is hard to incorporate any prior knowledge into deep learning predictive models; thus, the proposed PRIME may achieve similar performance to the state-of-the-art baselines. In our future work, we will focus on how to improve the predictive performance of risk prediction for rare diseases.
• 35. Intro to cleaning. In the preprocessing component, the main purpose is to clean the data, filter the unusual points and make it suitable as input to the CNN. Besides the normal steps, including timestamp alignment, normalization and missing data imputation for time series data with trend, the most important operations to improve the data quality are outlier detection, interpolation and filtering, in particular for clinical data, because in clinical glucose time series there are many missing or outlier data points due to errors in calibration, measurements, and/or mistakes in the process of data collection and transmission. Several methods are introduced to handle these scenarios [36]:
● Dimension Reduction Model: the time series can be projected into lower dimensions using linear correlations such as principal component analysis (PCA), and data with large residual errors can be considered as outliers.
● Proximity-based Model: the data are assessed by nearest-neighbour analysis, cluster or density; data instances that are isolated from the majority are considered outliers.
● Probabilistic Stochastic Filters: different filters for the signals, such as Gaussian mixture models optimized using expectation-maximization. In our case the filter can be implemented before the CNN, due to the continuous characteristic of the input glycaemic time series data.
A convolutional neural network for ECG annotation as the basis for classification of cardiac rhythms. Philipp Sodmann et al. 2018 Physiol. Meas., in press https://doi.org/10.1088/1361-6579/aae304 Signal cleaning: in the data preprocessing, we performed resampling and signal denoising. We resampled all ECGs to 300 Hz using the fast Fourier transform in order to pass ECG segments of equal length onto the CNN. To filter noisy components in the signal, such as baseline wandering, respiration effects, or powerline interference, we applied a discrete wavelet transform (DWT), which works as a band-pass filter; for this, we used the Daubechies wavelet transform (Db4), as sketched below. Before recomposition, each coefficient of the transform was multiplied by a factor according to tabulated values. Afterwards, a 15%-trimmed mean with a window size of 33 samples was applied to remove the persistent baseline. https://doi.org/10.3389/fnins.2013.00267 MEG and EEG data analysis with MNE-Python
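A minimal PyWavelets version of the Db4 band-pass-style cleanup described above. Note that the paper multiplies coefficients by tabulated factors, whereas this sketch substitutes standard soft thresholding with the universal threshold, so it approximates the idea rather than reimplementing the paper's pipeline:

```python
import numpy as np
import pywt

def dwt_denoise(signal, wavelet="db4", level=4):
    """Denoise a 1D signal: decompose with Daubechies-4, soft-threshold the
    detail coefficients, and recompose. Acts as a crude band-pass cleanup."""
    coeffs = pywt.wavedec(signal, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745        # robust noise estimate
    thresh = sigma * np.sqrt(2 * np.log(len(signal)))     # universal threshold
    coeffs[1:] = [pywt.threshold(c, thresh, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(signal)]
```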
• 37. Time Series Invariances. A complexity-invariant distance measure for time series https://doi.org/10.1137/1.9781611972818.60 Gustavo E. A. P. A. Batista, Xiaoyue Wang, and Eamonn J. Keogh. In Proceedings of the 2011 SIAM International Conference on Data Mining (SDM), pages 699–710. SIAM, 2011. Cited by 216.
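The complexity-invariant distance (CID) of Batista et al. is simple enough to state in a few lines: a Euclidean distance multiplied by the ratio of the two series' complexity estimates, where complexity is estimated as the length of the "stretched-out" difference signal. A NumPy sketch:

```python
import numpy as np

def cid_distance(x, y):
    """Complexity-Invariant Distance (Batista et al., 2011): Euclidean distance
    scaled by the ratio of the two series' complexity estimates, so that a
    complex series is not judged spuriously close to a smooth one."""
    ce_x = np.sqrt(np.sum(np.diff(x) ** 2))   # complexity estimate of x
    ce_y = np.sqrt(np.sum(np.diff(y) ** 2))   # complexity estimate of y
    correction = max(ce_x, ce_y) / max(min(ce_x, ce_y), 1e-12)
    return np.linalg.norm(x - y) * correction
```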
• 38. Time Series: DTW, the classical method https://doi.org/10.1145/2888451.2888456 Stock Price Prediction with Fluctuation Patterns Using Indexing Dynamic Time Warping and k*-Nearest Neighbors. Kei Nakagawa, Mitsuyoshi Imamura, Kenichi Yoshida (2018) https://doi.org/10.1007/978-3-319-93794-6_7
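For reference, the classical DTW distance in its plain O(nm) dynamic-programming form (without the warping-window constraints or indexing tricks that papers like the one above add for speed):

```python
import numpy as np

def dtw_distance(x, y):
    """Classic dynamic-programming DTW between two 1D series:
    D[i, j] = local cost + best of the three allowed predecessor cells."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j],      # insertion
                                 D[i, j - 1],      # deletion
                                 D[i - 1, j - 1])  # match
    return D[n, m]
```

Because the alignment can stretch and compress the time axis, DTW is phase- and local-speed-invariant, which is exactly the property the learned-warping papers on the following slides try to acquire in a trainable, differentiable form.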
• 39. Learning invariances #1a: Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted 21 Aug 2018) https://arxiv.org/abs/1808.06725 Recently, researchers have started applying convolutional neural networks (CNNs) with 1D convolutions to clinical tasks involving time-series data. This is due, in part, to their computational efficiency relative to recurrent neural networks, and to their ability to efficiently exploit certain temporal invariances (e.g., phase invariance). However, it is well established that clinical data may exhibit many other types of invariances (e.g., scaling). While preprocessing techniques (e.g., dynamic time warping) may successfully transform and align inputs, their use often requires one to identify the types of invariances in advance. In contrast, we propose the use of Sequence Transformer Networks, an end-to-end trainable architecture that learns to identify and account for invariances in clinical time-series data. Applied to the task of predicting in-hospital mortality, our proposed approach achieves an improvement in the AUROC. To address these challenges, we propose Sequence Transformer Networks, an approach for learning task-specific invariances related to amplitude, offset, and scale directly from the data. Applied to clinical time-series data, Sequence Transformer Networks learn input- and task-dependent transformations. In contrast to data augmentation approaches, our proposed approach makes limited assumptions about the presence of invariances in the data.
• 40. Learning invariances #1b: Learning to Exploit Invariances in Clinical Time-Series Data using Sequence Transformer Networks. Jeeheh Oh, Jiaxuan Wang, Jenna Wiens (submitted 21 Aug 2018) https://arxiv.org/abs/1808.06725 The proposed approach is not without limitation. More specifically, in its current form the Sequence Transformer applies the same transformation across all features within an example, instead of learning feature-specific transformations. Despite this limitation, the learned transformations still lead to an increase in intra-class similarity. In conclusion, we are encouraged by these preliminary results. Overall, this work represents a starting point on which others can build. In particular, we hypothesize that the ability to capture local invariances and feature-specific invariances could lead to further improvements in performance.
• 41. Learning invariances #2: Autowarp: Learning a Warping Distance from Unlabeled Time Series Using Sequence Autoencoders. Abubakar Abid, James Zou, Stanford University (submitted 23 Oct 2018) https://arxiv.org/abs/1810.10107 Domain experts typically hand-craft or manually select a specific metric, such as dynamic time warping (DTW), to apply to their data. In this paper, we propose Autowarp, an end-to-end algorithm that optimizes and learns a good metric given unlabeled trajectories. We define a flexible and differentiable family of warping metrics, which encompasses common metrics such as DTW, Euclidean, and edit distance. Autowarp then leverages the representation power of sequence autoencoders to optimize for a member of this warping distance family. The output is a metric which is easy to interpret and can be robustly learned from relatively few trajectories. Future work will extend these results to more challenging time series data, such as those with higher dimensionality or heterogeneous data.
• 42. Learning invariances #3: NeuralWarp: Time-Series Similarity with Warping Networks. Josif Grabocka, Lars Schmidt-Thieme (submitted 20 Dec 2018) https://arxiv.org/abs/1812.08306 | Related articles. In this paper we propose to learn a warping function for aligning the indices of time series in a deep latent representation. We compared the suggested architecture with two types of encoders (CNN or RNN) and a deep forward network as a warping function. Experimental comparisons to non-parametric and un-warped Siamese networks demonstrated that the proposed elastic deep similarity measure is more accurate than prior models.
• 44. SMOTE for imbalanced classes. SMOTE-GPU: Big Data preprocessing on commodity hardware for imbalanced classification. Progress in Artificial Intelligence, December 2017, Volume 6, Issue 4, pp 347–354 https://doi.org/10.1007/s13748-017-0128-2 Considering a binary problem with a majority class and a minority class, it is likely that a learning algorithm ignores the latter and still achieves a high accuracy. There are three main ways of dealing with these situations [16]:
● Algorithmic modification: modifying learning algorithms in order to tackle the problem by design.
● Cost-sensitive learning: introducing costs for misclassification of the minority class at the data or algorithmic level.
● Data sampling: preprocessing the data in order to reduce the gap between the number of instances of each class.
The SMOTE technique is based on the idea of the neighborhood of the k-nearest neighbor (kNN) rule; a sketch follows below. The area-under-the-ROC-curve results show that the use of oversampling methods improves the detection of the minority class in Big Data datasets. We have also shown how our design can successfully work on a wide range of devices, including a laptop, while requiring reasonable times: around 25 min on high-end devices, and less than 2 h on the laptop, for the most time-demanding experiment. SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary (2018) https://doi.org/10.1613/jair.1.11192
● GS4 (Moutafis & Kakadiaris, 2014), SEG-SSC (Triguero et al., 2015) and OCHS-SSC (Dong et al., 2016) generate synthetic examples to diminish the drawbacks produced by the absence of labeled examples. Several learning techniques were checked and some properties, such as the common hidden space between labeled samples and the synthetic samples, were exploited.
● The technique proposed by Park et al. (2014) is a semi-supervised active learning method in which labels are incrementally obtained and applied using a clustering algorithm.
In the context of the current challenges outlined, we highlighted the need for enhancing the treatment of small disjuncts, noise, lack of data, overlapping, dataset shift and the curse of dimensionality. To do so, the theoretical properties of SMOTE regarding these data characteristics, and its relationship with the new synthetic instances, must be further analyzed in depth. Finally, we also posited that it is important to focus on data sampling and pre-processing approaches (such as SMOTE and its extensions) within the framework of Big Data and real-time processing.
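The kNN-interpolation idea at the heart of SMOTE fits in a few lines of NumPy: pick a minority sample, pick one of its k nearest minority neighbours, and place a synthetic point a random fraction of the way between them. The helper below is a didactic sketch (names are ours, and it assumes more than k minority samples); in practice a library implementation such as imbalanced-learn's SMOTE would normally be used:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, seed=None):
    """Generate n_new synthetic minority samples by interpolating each seed
    point towards one of its k nearest minority-class neighbours."""
    rng = np.random.default_rng(seed)
    d = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                    # exclude self-matches
    neighbours = np.argsort(d, axis=1)[:, :k]      # k-NN within minority class
    seeds = rng.integers(0, len(X_min), n_new)     # which minority points to grow from
    picks = neighbours[seeds, rng.integers(0, k, n_new)]
    gaps = rng.random((n_new, 1))                  # interpolation fractions in [0, 1)
    return X_min[seeds] + gaps * (X_min[picks] - X_min[seeds])
```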
• 47. State of the art, two-year-old cutting edge #1: A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016) Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173 Dozens of algorithms have been proposed in this area, but unfortunately the research community still lacks a comparative universal evaluation as well as common publicly available datasets. These shortcomings are addressed in this study, where 19 different unsupervised anomaly detection algorithms are evaluated on 10 different datasets from multiple application domains. By publishing the source code and the datasets, this paper aims to be a new well-funded basis for unsupervised anomaly detection research. Additionally, this evaluation reveals the strengths and weaknesses of the different approaches for the first time. As a general summary for algorithm selection, we recommend using nearest-neighbor-based methods, in particular k-NN for global tasks and LOF for local tasks, instead of clustering-based methods. If computation time is essential, HBOS is a good candidate, especially for larger datasets. Special attention should be paid to the nature of the dataset when applying local algorithms, and to whether local anomalies are of interest at all in a given case. Different anomaly detection modes depending on the availability of labels in the dataset: (a) supervised anomaly detection uses a fully labeled dataset for training; (b) semi-supervised anomaly detection uses an anomaly-free training dataset, and deviations in the test data from that normal model are then used to detect anomalies; (c) unsupervised anomaly detection algorithms use only intrinsic information of the data in order to detect instances deviating from the majority of the data.
• 48. State of the art, two-year-old cutting edge #2: A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data (2016) Markus Goldstein, Seiichi Uchida https://doi.org/10.1371/journal.pone.0152173 A visualization of the results of the k-NN global anomaly detection algorithm: the anomaly score is represented by the bubble size, whereas the color shows the labels of the artificially generated dataset. Comparing Influenced Outlierness (INFLO) with Local Outlier Factor (LOF) shows the usefulness of the reverse neighborhood set: for the red instance, LOF takes only the neighbors in the gray area into account, resulting in a high anomaly score. INFLO additionally takes the blue instances (reverse neighbors) into account and thus scores the red instance as more normal.
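The recommended global k-NN detector reduces to scoring each point by its distance to its k nearest neighbours. A scikit-learn sketch (the function name is ours; for the local tasks mentioned above, sklearn.neighbors.LocalOutlierFactor plays the role of LOF):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_anomaly_scores(X, k=10):
    """Global k-NN anomaly score: mean distance of each point to its
    k nearest neighbours; larger score means more likely an outlier."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1 because each point
    dist, _ = nn.kneighbors(X)                       # finds itself at distance 0
    return dist[:, 1:].mean(axis=1)

# Usage: scores = knn_anomaly_scores(X); flag the top percentile as anomalies.
```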
• 49. Anomaly detection: cyber-physical systems. Anomaly Detection with Generative Adversarial Networks for Multivariate Time Series (2018) Dan Li, Dacheng Chen, Jonathan Goh, and See-Kiong Ng, Institute of Data Science, National University of Singapore https://arxiv.org/abs/1809.04758 Unsupervised machine learning techniques can be used to model the system behaviour and classify deviant behaviours as possible attacks. In this work, we proposed a novel Generative Adversarial Networks-based Anomaly Detection (GAN-AD) method for such complex networked CPSs. We used an LSTM-RNN in our GAN to capture the distribution of the multivariate time series of the sensors and actuators under normal working conditions of a CPS. Instead of treating each sensor's and actuator's time series independently, we model the time series of multiple sensors and actuators in the CPS concurrently to take into account potential latent interactions between them. To exploit both the generator and the discriminator of our GAN, we deployed the GAN-trained discriminator together with the residuals between generator-reconstructed data and the actual samples to detect possible anomalies in the complex CPS. We will also conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees.
• 50. Anomaly detection: financial time series. Modeling approaches for time series forecasting and anomaly detection (2018) Du, Shuyang; Pandey, Madhulima; Xing, Cuiqun http://cs229.stanford.edu/proj2017/final-reports/5244275.pdf This project focuses on prediction of time series data for Wikipedia page accesses over a period of more than twenty-four months. The methods explored here are k-nearest neighbors (KNN), a long short-term memory network (LSTM), and Sequence-to-Sequence with a Convolutional Neural Network (CNN), and we compare predicted values to actual web traffic. The predictions can help us in anomaly detection in the series. Pre-processing: "There are many series in which values are zero. This could be a missing value, or an actual lack of web page access. In addition, there are significant spikes in the data, where values have a broad range from 1 to hundreds/thousands for several web pages. We normalize this data by adding 1 to all entries, taking the log of the values, and setting the mean to zero and variance to one. We have the results of Fourier analysis for exploring periodicity on a weekly/monthly/quarterly basis." (A minimal sketch of this normalization follows below.) Our approaches to time series prediction depend on features extracted from the time series data itself. Our models learn periodicity, ramps and other regular trends quite well. However, none of our models is able to capture spikes or outliers that arise from external sources. Enhancing the performance of the models will require augmenting our feature set from other sources such as news events and weather.
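The quoted preprocessing maps directly onto a couple of NumPy lines; a sketch, assuming a 1D array of non-negative page-access counts:

```python
import numpy as np

def normalize_counts(series):
    """Add 1, take the log, then standardize to zero mean and unit variance,
    as described in the quoted pre-processing step."""
    z = np.log1p(np.asarray(series, dtype=float))  # log(1 + x) handles zeros
    return (z - z.mean()) / (z.std() + 1e-12)      # epsilon guards constant series
```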
• 51. “Special Outliers”: disguised missing values. FAHES: A Robust Disguised Missing Values Detector. Qatar Computing Research Institute, HBKU, Doha, Qatar https://doi.org/10.1145/3219819.3220109 Missing values are common in real-world data and may seriously affect data analytics such as simple statistics and hypothesis testing. Generally speaking, there are two types of missing values: explicitly missing values (i.e. NULL values), and implicitly missing values (a.k.a. disguised missing values, DMVs) such as "11111111" for a phone number and "Some college" for education. While detecting explicitly missing values is trivial, detecting DMVs is not; the essential challenge is the lack of standardization in how DMVs are generated. One piece of future work we are planning is to improve FAHES to detect DMVs that are generated randomly within the range of the data. For example, when a child tries to create an account on a domain that has a minimum-age restriction, the child fakes her age with a random value that allows her to create the account. Such random fake values are hard, if not impossible, to detect. Moreover, although DMVs are the focus of this paper, more types of errors are found in the wild. Many of the principles and techniques we have used to detect DMVs can be leveraged to detect other types of errors, so a natural next step is to extend the infrastructure we have built to detect those. This opens new challenges related to the robust identification of errors that could be interpreted differently by different modules.
• 53. Uncertainty and novelty detection #1a
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little. https://arxiv.org/abs/1809.04729
What makes this problem different from a typical supervised learning setting is that we cannot model the diversity of out-of-distribution samples in practice. The distribution of outliers used in training may not be the same as the distribution of outliers encountered in the application. Therefore, classical approaches that learn inliers vs. outliers with only two datasets can yield optimistic results. We introduce OD-test, a three-dataset evaluation scheme, as a practical and more reliable strategy to assess progress on this problem. The OD-test benchmark provides a straightforward means of comparison for methods that address the out-of-distribution sample detection problem.
In real-life deployment of products that use complex machinery such as deep neural networks (DNNs), we have very little control over the input. In the absence of extrapolation guarantees, when the independently and identically distributed (IID) assumption is violated, the behaviour of the pipeline may be unpredictable. From a quality assurance perspective, it is desirable to detect and prevent these scenarios automatically. A reliable pipeline would first determine whether it can process a given sample, and only then use the prediction of the target neural network. The unfortunate incident that mislabeled people as non-human, for instance, is a clear example of OOD extrapolation that could have been prevented by such a decision scheme: the model simply did not know that it did not know. While incidents of a similar nature have fueled research on de-biasing datasets and the deep learning machinery, we still need to identify the limitations of our models.
The application is not limited to fortifying large-scale user-facing products. Successful detection of such violations could also be used in active learning, unsupervised learning, learning with noisy data, or simply as a condition for invoking transfer learning strategies. In this work, we are interested in evaluating mechanisms that detect OOD samples.
• 54. Uncertainty and novelty detection #1b
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little. https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
The Uncertainty View. A commonly invoked strategy for addressing similar problems is to characterize a notion of uncertainty. The literature distinguishes aleatoric uncertainty, the uncertainty inherent to the process (the known unknowns, like flipping a coin), from epistemic uncertainty, the uncertainty that can be eliminated with more information (the unknown unknowns). The Bayesian approach to epistemic uncertainty estimation is to measure the degree of disagreement among the potentially viable models (the posterior).
The MC-Dropout approach is often advertised as a feasible method to estimate uncertainty for a variety of applications. Similarly, we can adopt a non-Bayesian approach by training independent models and then measuring the disagreement. Lakshminarayanan et al. show that an ensemble of five neural networks (DeepEnsemble), trained with an adversarial-sample-augmented strategy, is sufficient to provide a non-Bayesian alternative for capturing predictive uncertainty. We evaluate DeepEnsemble and MC-Dropout.
* The Abstention View
* The Anomaly View: AEThreshold, PixelCNN++, K-NNSVM
* The Novelty View: OpenMax
We train these architectures with a cross-entropy loss (CE) and a k-way logistic regression loss (KL). CE loss is the typical choice for k-way classification tasks: it enforces mutual exclusion in the predictions. KL loss is the typical choice for attribute prediction tasks: it does not enforce mutual exclusivity of the predictions. We test these two loss functions to see if the exclusivity assumption of CE has an adverse effect on the ability to predict OOD samples. CE loss cannot make a None prediction without an explicitly defined None class, but KL loss can make None predictions through low activations of all the classes.
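Of the evaluated baselines, MC-Dropout is the easiest to reproduce: keep dropout stochastic at test time and read the disagreement across passes as (epistemic) uncertainty. A minimal PyTorch sketch (assumes a classifier containing nn.Dropout layers):

    import torch

    def mc_dropout(model, x, n_samples=20):
        # keep only the dropout layers stochastic at test time
        model.eval()
        for m in model.modules():
            if isinstance(m, torch.nn.Dropout):
                m.train()
        with torch.no_grad():
            probs = torch.stack([torch.softmax(model(x), dim=-1)
                                 for _ in range(n_samples)])
        mean_probs = probs.mean(dim=0)           # predictive distribution
        disagreement = probs.var(dim=0).sum(-1)  # spread across passes ~ uncertainty
        return mean_probs, disagreement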
• 56. Uncertainty and novelty detection #1d
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little. https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test [PyTorch]
Related work in deep learning can be categorized into two broad groups based on the underlying assumptions: (i) in-distribution techniques, and (ii) out-of-distribution techniques.
Guo et al. (2017) observed that modern neural networks tend to be overconfident in their predictions. They show that temperature scaling in the softmax operator, also known as Platt scaling, can be used to calibrate the output probabilities of a neural network to empirically align the accuracy of a prediction with its probability. Their efforts fall under the uncertainty estimation approaches.
Geifman and El-Yaniv (2017) present a framework for selective classification with deep neural networks that follows the abstention view. A selection function decides whether to make a prediction or not. For the choice of selection function, they experiment with MC-Dropout and the softmax output. They provide an analytical trade-off between risk and coverage within their formulation.
Input perturbation serves as a way to assess how the network would behave near the given input. When the temperature is 1 and the perturbation step is 0, we simply recover the PbThreshold method. ODIN, the state of the art at the time of this writing, is reported to outperform the previous work [8] by a significant margin. We also assess the performance of ODIN in our work.
These methods provide an abstract idea that depends on the successful training of GANs. To the best of our knowledge, training GANs is itself an active area of research, and it is not apparent which design decisions would be appropriate to implement these ideas in practice. Furthermore, some of these ideas are prohibitively expensive to execute at the time of this writing.
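ODIN itself is just temperature scaling plus a one-step input perturbation; a sketch consistent with the reduction stated above (temperature=1, eps=0 recovers PbThreshold; the hyperparameter values here are illustrative):

    import torch
    import torch.nn.functional as F

    def odin_score(model, x, temperature=1000.0, eps=0.0014):
        x = x.clone().requires_grad_(True)
        logits = model(x) / temperature
        # move the input a small step that increases the max class probability
        loss = F.nll_loss(F.log_softmax(logits, dim=-1), logits.argmax(dim=-1))
        loss.backward()
        with torch.no_grad():
            x_pert = x - eps * x.grad.sign()
            probs = F.softmax(model(x_pert) / temperature, dim=-1)
        return probs.max(dim=-1).values  # low score => likely out-of-distribution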
• 57. Uncertainty and novelty detection #1e
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little. https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
Datasets. We extend the previous work by evaluating over a broader set of datasets with varying levels of complexity. The variation in complexity allows for a fine-grained evaluation of the techniques. Since OOD detection is closely related to the problem of density estimation, the dimensionality of the input image is of vital importance in practical assessments. As the input dimensionality increases, we expect the task to become much more difficult. Therefore, to provide a more accurate picture of performance, it is crucial to evaluate the methods on high-dimensional data.
In low-dimensional datasets, K-NNSVM performs similarly to or better than the other methods.
The top-performing method, ODIN, is influenced by the number of classes in the dataset. Similar to PbThreshold, ODIN depends on the maximum signal in the class predictions; therefore the increased number of classes directly affects both methods. Furthermore, neither of them consistently prefers VGG over ResNet across all datasets. Overall, ODIN consistently outperforms the others in high-dimensional settings, but all the methods have a relatively low average accuracy in the 60%-78% range.
• 58. Uncertainty and novelty detection #1f
Does Your Model Know the Digit 6 Is Not a Cat? A Less Biased Evaluation of "Outlier" Detectors (2018). Alireza Shafaei, Mark Schmidt, and James J. Little. https://arxiv.org/abs/1809.04729 | https://github.com/ashafaei/OD-test
• 59. Uncertainty and novelty detection #2
To Trust Or Not To Trust A Classifier. Heinrich Jiang, Been Kim, Maya Gupta (2018). Google Research; Google Brain. https://arxiv.org/abs/1805.11783
We propose a new score, called the trust score, which measures the agreement between the classifier and a modified nearest-neighbor classifier on the testing example. We show empirically that high (low) trust scores produce surprisingly high precision at identifying correctly (incorrectly) classified examples, consistently outperforming the classifier's confidence score as well as many other baselines.
Two example datasets and models: predicting correctness (top row) and incorrectness (bottom). The vertical dotted black line indicates the accuracy level of the classifier. The trust score consistently attains a higher precision for each given percentile of classifier decision-rejection. Furthermore, the trust score generally shows increasing precision as the percentile level increases, but surprisingly, many of the comparison baselines do not.
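The score itself is easy to approximate: the ratio between the distance to the nearest training example of any other class and the distance to the predicted class (the paper additionally filters low-density training points first, which this sketch omits). A simplified version with scikit-learn:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def trust_scores(X_train, y_train, X_test, y_pred):
        # X_*: numpy arrays; y_*: integer class labels
        classes = np.unique(y_train)
        nn = {c: NearestNeighbors(n_neighbors=1).fit(X_train[y_train == c])
              for c in classes}
        # distance from each test point to its nearest neighbour in every class
        d = np.stack([nn[c].kneighbors(X_test)[0][:, 0] for c in classes], axis=1)
        idx = np.searchsorted(classes, y_pred)
        d_pred = d[np.arange(len(X_test)), idx]
        mask = np.eye(len(classes), dtype=bool)[idx]      # one-hot of predicted class
        d_other = np.where(mask, np.inf, d).min(axis=1)   # nearest *other* class
        return d_other / (d_pred + 1e-12)  # high => classifier agrees with k-NN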
• 60. Uncertainty and novelty detection #3
Interpreting Neural Networks With Nearest Neighbors. Eric Wallace, Shi Feng, Jordan Boyd-Graber. https://arxiv.org/abs/1809.02847
Local model interpretation methods explain individual predictions by assigning an importance value to each input feature. This value is often determined by measuring the change in confidence when a feature is removed. However, the confidence of neural networks is not a robust measure of model uncertainty. This issue makes reliably judging the importance of the input features difficult. We address this by changing the test-time behavior of neural networks using Deep k-Nearest Neighbors. Without harming text classification accuracy, this algorithm provides a more robust uncertainty metric which we use to generate feature importance values. The resulting interpretations better align with human perception than baseline methods. Finally, we use our interpretation method to analyze model predictions on dataset annotation artifacts.
Deep k-Nearest Neighbors: Towards Confident, Interpretable and Robust Deep Learning. Nicolas Papernot and Patrick D. McDaniel (2018). https://arxiv.org/abs/1803.04765
Debugging ResNet model biases: this illustrates how the DkNN algorithm helps to understand a bias identified by Stock and Cisse [105] in the ResNet model for ImageNet. The image at the bottom of each column is the test input presented to the DkNN. Each test input is cropped slightly differently to include (left) or exclude (right) the football. Images shown at the top are nearest neighbors in the predicted class according to the representation output by the last hidden layer. This comparison suggests that the "basketball" prediction may have been a consequence of the ball being in the picture. Also note how the white apparel color and general arm positions of the players often match the test image of Barack Obama.
• 61. Uncertainty and novelty detection #4
AND: Autoregressive Novelty Detectors. Davide Abati, Angelo Porrello, Simone Calderara, Rita Cucchiara (submitted 4 Jul 2018). https://arxiv.org/abs/1807.01653
We propose an unsupervised model for novelty detection. The subject is treated as a density estimation problem, in which a deep neural network is employed to learn a parametric function that maximizes probabilities of training samples. This is achieved by equipping an autoencoder with a novel module, responsible for the maximization of compressed codes' likelihood by means of autoregression. We illustrate design choices and proper layers to perform autoregressive density estimation when dealing with both image and video inputs. Despite a very general formulation, our model shows promising results in diverse one-class novelty detection and video anomaly detection benchmarks.
The structure of the proposed autoencoder: paired with a standard compression-reconstruction network, a density estimation module learns the distribution of latent codes via autoregression.
• 62. Anomaly detection with GANs #1
Anomaly detection with Wasserstein GAN. Ilyass Haloui, Jayant Sen Gupta, and Vincent Feuillard (submitted 11 Dec 2018). https://arxiv.org/pdf/1812.02463
In this paper, we investigate GANs to perform anomaly detection on time series datasets. In order to achieve this goal, a bibliography is made focusing on theoretical properties of GANs and GANs used for anomaly detection. A Wasserstein GAN has been chosen to learn the representation of the normal data distribution, and a stacked encoder with the generator performs the anomaly detection. W-GAN with encoder seems to produce state-of-the-art anomaly detection scores on the MNIST dataset, and we investigate its usage on multivariate time series.
Based on this literature review, we chose to perform anomaly detection using a Wasserstein Generative Adversarial Network. The main reason is that the Wasserstein GAN does not collapse, contrary to the classical GAN, which needs to be heavily tuned in order to avoid this problem. Mode collapse can be blocking if we need to perform anomaly detection: if a subset of our data distribution is not learned by the generator, then all samples that are similar to this subset might end up classified as abnormal. Another added value of the Wasserstein GAN compared to a standard GAN is the possibility of using the loss function of the discriminator to evaluate convergence, since it is an approximation of the Wasserstein distance between Pr and Pθ.
A future improvement consists in considering CNNs for both the generator and discriminator in order to detect anomalies from raw time series data. 1-D convolutions are needed and will be investigated to produce good visual representations of time series samples. A more thorough study of the impact of the architecture should also be done.
• 63. Anomaly detection with GANs #2
MAD-GAN: Multivariate Anomaly Detection for Time Series Data with Generative Adversarial Networks. Dan Li, Dacheng Chen, Lei Shi, Baihong Jin, Jonathan Goh, and See-Kiong Ng (submitted 15 Jan 2019). Institute of Data Science, National University of Singapore. https://arxiv.org/abs/1901.04997
In this work, we propose a novel Multivariate Anomaly Detection strategy with GAN (MAD-GAN) to model the complex multivariate correlations among multiple data streams and to detect anomalies using both the GAN-trained generator and discriminator. Unlike traditional classification methods, the GAN-trained discriminator learns to detect fake data from real data in an unsupervised fashion, making it an attractive unsupervised machine learning technique for anomaly detection.
Given that this is an early attempt at multivariate anomaly detection on time series data using GANs, there are interesting issues that await further investigation. For example, we have noted the issues of determining the optimal subsequence length as well as the potential model instability of the GAN approaches.
For future work, we plan to conduct further research on feature selection for multivariate anomaly detection, and investigate principled methods for choosing the latent dimension and PC dimension with theoretical guarantees. We also hope to perform a detailed study on the stability of the detection model. In terms of applications, we plan to explore the use of MAD-GAN for other anomaly detection applications such as predictive maintenance and fault diagnosis for smart buildings and machineries.
• 64. Uncertainty: Insights from NLP uncertainty
Quantifying Uncertainties in Natural Language Processing Tasks. Yijun Xiao and William Yang Wang (2018). https://arxiv.org/abs/1811.07253
In this paper, we propose novel methods to study the benefits of characterizing model and data uncertainties for natural language processing (NLP) tasks. With empirical experiments on sentiment analysis, named entity recognition, and language modeling using convolutional and recurrent neural network models, we show that explicitly modeling uncertainties is not only necessary to measure output confidence levels, but also useful for enhancing model performance in various NLP tasks.
1. We mathematically define model and data uncertainties via the law of total variance;
2. Our empirical experiments show that by accounting for model and data uncertainties, we observe significant improvements in three important NLP tasks;
3. We show that our model outputs higher data uncertainties for more difficult predictions in sentiment analysis and named entity recognition tasks.
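The decomposition in point 1 is the law of total variance: conditioning the prediction y on the model parameters θ splits the predictive variance into a term from parameter disagreement and a term from inherent output noise:

    \operatorname{Var}(y \mid x)
      = \underbrace{\operatorname{Var}_{\theta}\big(\mathbb{E}[y \mid x, \theta]\big)}_{\text{model (epistemic) uncertainty}}
      + \underbrace{\mathbb{E}_{\theta}\big[\operatorname{Var}(y \mid x, \theta)\big]}_{\text{data (aleatoric) uncertainty}}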
• 65. Uncertainty: CNNs + Gaussian Processes
Calibrating Deep Convolutional Gaussian Processes. Gia-Lac Tran, Edwin V. Bonilla, John P. Cunningham, Pietro Michiardi, Maurizio Filippone (submitted 26 May 2018). https://arxiv.org/abs/1805.10522
Despite the considerable interest in combining CNNs with GPs, little attention has been devoted to understanding the implications in terms of the ability of these models to accurately quantify the level of uncertainty in predictions. This is the first work that highlights the calibration issues of these models, showing that GPs cannot cure the miscalibration of CNNs. We have proposed a novel combination of CNNs and GPs where the resulting model becomes a particular form of a Bayesian CNN for which inference using variational inference is straightforward. However, our results also indicate that combining CNNs and GPs does not significantly improve the performance of standard CNNs. This can serve as a motivation for investigating new approximation methods for scalable inference in GP models and combinations with CNNs.
Calibration of convolutional networks: The issue of calibration of classifiers in machine learning was popularized in the 90's with the use of support vector machines for probabilistic classification. Calibration techniques aim to learn a transformation of the output using a validation set, in order for the transformed output to give a reliable account of the actual probability of class labels; interestingly, calibration can be applied regardless of the probabilistic nature of the untransformed output of the classifier. Popular calibration techniques include Platt scaling and isotonic regression. Classifiers based on Deep Neural Networks (DNNs) have been shown to be well calibrated. The reason is that the optimization of the cross-entropy loss promotes calibrated outputs. The same loss is used in Platt scaling, and it corresponds to the correct multinomial likelihood for class labels. Recent studies on the calibration of CNNs, which are a particular case of DNNs, however, show that depth has a negative impact on calibration, despite the use of a cross-entropy loss, and that regularization improves the calibration properties of classifiers [Guo et al. 2017].
Combinations of ConvNets and Gaussian Processes: Thinking of Bayesian priors as a form of regularization, it is natural to assume that Bayesian CNNs can "cure" the miscalibration of modern CNNs. Despite the abundant literature on Bayesian DNNs, far less attention has been devoted to Bayesian CNNs, and the calibration properties of these approaches have not been investigated. In this work, we propose an alternative way to combine CNNs and GPs, where the GPs are approximated using random feature expansions. The random feature expansion approximation amounts to replacing the original kernel matrix with a low-rank approximation, turning GPs into Bayesian linear models. Combining this with CNNs leads to a particular form of Bayesian CNNs, much like GPs and DGPs are particular forms of Bayesian DNNs. Inference in Bayesian CNNs is intractable and requires some form of approximation. In this work, we draw on the interpretation of dropout as variational inference, employing the so-called Monte Carlo Dropout (MCD) to obtain a practical way of combining CNNs and GPs.
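Temperature scaling, the Platt-style calibration referenced above (Guo et al. 2017), fits a single scalar on held-out logits after training; a minimal PyTorch sketch:

    import torch

    def fit_temperature(logits, labels, max_iter=200):
        # logits, labels: validation tensors from an already-trained classifier
        log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
        opt = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)
        nll = torch.nn.CrossEntropyLoss()

        def closure():
            opt.zero_grad()
            loss = nll(logits / log_t.exp(), labels)
            loss.backward()
            return loss

        opt.step(closure)
        return log_t.exp().item()  # divide test logits by T before the softmax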
• 66. Uncertainty in timestamps: modeling for clinical use #1
Time-Discounting Convolution for Event Sequences with Ambiguous Timestamps (submitted 6 Dec 2018). https://arxiv.org/abs/1812.02395
This paper proposes a method for modeling event sequences with ambiguous timestamps: a time-discounting convolution. Unlike in ordinary time series, time intervals are not constant, small time-shifts have no significant effect, and inputting timestamps or time durations into a model is not effective. The criteria that we require for the modeling are providing robustness against time-shifts or timestamp uncertainty, as well as maintaining the essential capabilities of time-series models, i.e., forgetting meaningless past information and handling infinite sequences.
The proposed method handles these with a convolutional mechanism across time with specific parameterizations, which efficiently represents the event dependencies in a time-shift invariant manner while discounting the effect of past events, and a dynamic pooling mechanism, which provides robustness against the uncertainty in timestamps and enhances the time-discounting capability by dynamically changing the pooling window size.
• 68. Types of missing values
Feldman et al. (2018): "Rubin (1976) discusses three possible mechanisms for the formation of missing values, each reflecting a different form of missing-data probabilities and relationships between the measured variables, and each may lead to different imputation methods (Luengo et al., 2012)."
Missing Completely at Random (MCAR): a missing value that cannot be related to the value itself or to other variable values in that record. This is a completely unsystematic missing pattern, and therefore the observed data can be thought of as a random unbiased sample of a complete dataset.
Missing at Random (MAR): cases in which a missing value is related to other variable values in that record, but not to the value itself (e.g., a person with a "marital status" value of "single" has a missing value in the "spouse name" attribute). In other words, in MAR scenarios, incomplete data can be partially explained and the actual value can possibly be predicted from other variable values.
Missing Not at Random (MNAR): the missing value is not random and depends on the actual value itself; hence, it cannot be explained by other values (e.g., an overweight person is reluctant to provide the "weight" value in a survey). MNAR scenarios are the most difficult to analyze and handle, as the missing data cannot be associated with other data items that are available in the dataset.
https://statistical-programming.com/missing-data/
Missing in action: the dangers of ignoring missing data. https://doi.org/10.1016/j.tree.2008.06.014
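A small simulation makes the three mechanisms concrete (variable names and thresholds are illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    age = rng.normal(40, 10, 1000)
    weight = rng.normal(75, 15, 1000)

    # MCAR: missingness independent of everything
    mcar = rng.random(1000) < 0.2
    # MAR: weight more likely missing for older people (depends on observed age)
    mar = rng.random(1000) < 1 / (1 + np.exp(-(age - 50) / 5))
    # MNAR: weight more likely missing when the weight itself is high
    mnar = rng.random(1000) < 1 / (1 + np.exp(-(weight - 90) / 5))

    weight_mcar = np.where(mcar, np.nan, weight)  # likewise for mar / mnar masks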
• 69. Intro to imputation methods
Comparison of Estimating Missing Values in IoT Time Series Data Using Different Interpolation Algorithms (August 2018). https://doi.org/10.1007/s10766-018-0595-5
"When collecting Internet of Things data using various sensors or other devices, it may be possible to miss several kinds of values of interest. In this paper, we focus on estimating the missing values in IoT time series data using three interpolation algorithms: (1) Radial Basis Functions, (2) Moving Least Squares (MLS), and (3) Adaptive Inverse Distance Weighted."
On the choice of the best imputation methods for missing values considering three groups of classification methods (June 2011). https://doi.org/10.1007/s10115-011-0424-2 | https://sci2s.ugr.es/MVDM
"In this work, we focus on a classification task with twenty-three classification methods and fourteen different imputation approaches to missing values treatment that are presented and analyzed. The analysis involves a group-based approach, in which we distinguish between three different categories of classification methods. Each category behaves differently, and the evidence obtained shows that the use of determined missing values imputation methods could improve the accuracy obtained for these methods. In this study, the convenience of using imputation methods for preprocessing data sets with missing values is stated. The analysis suggests that the use of particular imputation methods conditioned on the groups is required."
We have discovered that the Combined Multivariate Collapsing (CMC) and Event Covering (EC) methods show good behavior for these two measures, and they are two methods that provide good results for an important range of learning methods, as we have previously analyzed. In short, these two approaches introduce less noise and maintain the mutual information better.
Class center based approach for missing value imputation (2018). https://doi.org/10.1016/j.knosys.2018.03.026
A novel missing value imputation is introduced, which is composed of two modules. Each class center and its distances from the other observed data are measured to identify a threshold. Then, the identified threshold is used for missing value imputation. The proposed approach outperforms the other approaches for both numerical and mixed datasets. It requires much less imputation time than the machine learning based methods.
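For regularly or irregularly sampled series, simple interpolation baselines like those compared above are a few lines with pandas (toy series for illustration; the spline option needs SciPy installed):

    import numpy as np
    import pandas as pd

    t = pd.date_range("2019-01-01", periods=10, freq="H")
    s = pd.Series([1.0, np.nan, 3.0, 4.0, np.nan, np.nan, 7.0, 8.0, np.nan, 10.0],
                  index=t)

    linear = s.interpolate(method="time")               # time-aware linear interpolation
    spline = s.interpolate(method="spline", order=2)    # smoother local polynomial fit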
• 70. Imputation with deep learning #1
BRITS: Bidirectional Recurrent Imputation for Time Series. Wei Cao, Dong Wang, Jian Li, Hao Zhou, Lei Li, Yitan Li (submitted 27 May 2018). https://arxiv.org/abs/1805.10572 | https://github.com/NIPS-BRITS/BRITS
Existing imputation methods often impose strong assumptions on the underlying data-generating process, such as linear dynamics in the state space. In this paper, we propose BRITS, a novel method based on recurrent neural networks for missing value imputation in time series data. Our proposed method directly learns the missing values in a bidirectional recurrent dynamical system, without any specific assumption. The imputed values are treated as variables of the RNN graph and can be effectively updated during backpropagation. We simultaneously perform missing value imputation and classification/regression jointly in one neural graph.
BRITS has three advantages: (a) it can handle multiple correlated missing values in time series; (b) it generalizes to time series with underlying nonlinear dynamics; (c) it provides a data-driven imputation procedure and applies to general settings with missing data.
We evaluate the imputation performance in terms of mean absolute error (MAE) and mean relative error (MRE).
• 71. Imputation with deep learning #2
End-to-End Time Series Imputation via Residual Short Paths. Lifeng Shen, Qianli Ma, Sen Li (2018). http://proceedings.mlr.press/v95/shen18a.html
We propose an end-to-end imputation network with residual short paths, called Residual IMPutation LSTM (RIMP-LSTM), a flexible combination of residual short paths with graph-based temporal dependencies. We construct a residual sum unit (RSU), which enables RIMP-LSTM to make full use of previously revealed information to model incomplete time series and reduce the negative impact of missing values. Moreover, a switch unit is designed to detect the missing values, and a new loss function is then developed to train our model with time series in the presence of missing values in an end-to-end way, which also allows simultaneous imputation and prediction.
RIMP-LSTM combines the merits of graph-based models, with explicitly modeled temporal dependencies via weighted residual connections between nodes, with those of LSTM, which can accumulate historical residual information and learn the underlying patterns of incomplete time series automatically. On the other hand, compared with IMP-LSTM, RIMP-LSTM has better performance as it is good at modeling temporal dependencies with weighted residual short paths, which demonstrates the reasonability of using these weighted residual paths to model graph-like temporal dependencies for imputation.
• 72. Imputation with deep learning #3
A context encoder for audio inpainting. Andres Marafioti, Nathanael Perraudin, Nicki Holighaus, and Piotr Majdak (submitted 29 Oct 2018). https://arxiv.org/abs/1810.12138 | http://www.github.com/andimarafioti/audioContextEncoder (Python, Matlab)
We studied the ability of deep neural networks (DNNs) to restore missing audio content based on its context, a process usually referred to as audio inpainting. We focused on gaps in the range of tens of milliseconds, a condition which has not received much attention yet. The proposed DNN structure was trained on audio signals containing music and musical instruments, separately, with 64-ms long gaps.
Here, the STFT features, meant as a reasonable first choice, provided a decent performance. In the future, we expect more hearing-related features to provide even better reconstructions. In particular, an investigation of Audlet frames, i.e., invertible time-frequency systems adapted to perceptual frequency scales, as features for audio inpainting presents intriguing opportunities.
Here, preferred architectures are those not relying on a predetermined target and input feature length, e.g., a recurrent network. Recent advances in generative networks will provide other interesting alternatives for analyzing and processing audio data as well. These approaches are yet to be fully explored.
Finally, music data can be highly complex, and it is unreasonable to expect a single trained model to accurately inpaint a large number of musical styles and instruments at once. Thus, instead of training on a very general dataset, we expect significantly improved performance for more specialized networks that could be trained by restricting the training data to specific genres or instrumentations. Applied to a complex mixture and potentially preceded by a source-separation algorithm, the resulting models could be used jointly in a mixture-of-experts approach.
• 73. Imputation with deep learning #4: GANs
NAOMI: Non-Autoregressive Multiresolution Sequence Imputation. Yukai Liu, Rose Yu, Stephan Zheng, Eric Zhan, Yisong Yue (submitted 30 Jan 2019). https://arxiv.org/abs/1901.10946
Leveraging multiresolution modeling and adversarial training, NAOMI is able to learn the conditional distribution given very few known observations, and achieves superior performance in various experiments with both deterministic and stochastic dynamics. Future work will investigate how to infer the underlying distribution when complete training data is unavailable. The trade-off between partial observations and external constraints is another direction for deep generative imputation models.
• 74. Effect of missing values on classification performance
A methodology for quantifying the effect of missing data on decision quality in classification problems (2017). https://doi.org/10.1080/03610926.2016.1277752
"This study suggests that the negative impact of poor data quality (DQ) on decision making is often mediated by biased model estimation. To highlight this perspective, we develop an analytical framework that links three quality levels: data, model, and decision. The general framework is first developed at a high level."
Evolutionary Machine Learning for Classification with Incomplete Data. Tran, Cao Truong (2018, PhD thesis). http://hdl.handle.net/10063/7639
"The thesis develops approaches for improving imputation for classification with incomplete data by integrating clustering and feature selection with imputation. The approaches improve both the effectiveness and the efficiency of using imputation for classification with incomplete data.
The thesis develops interval genetic programming to directly evolve classifiers for incomplete data. The results show that classifiers generated by interval genetic programming can be more effective and efficient than classifiers generated by the combination of imputation and traditional genetic programming. Interval genetic programming is also more effective than common classification algorithms able to work directly with incomplete data."
• 75. Imputation and classification
Missing Data Imputation for Supervised Learning (August 2018). https://doi.org/10.1080/08839514.2018.1448143
"This paper compares methods for imputing missing categorical data for supervised classification tasks."
The results of the present study show that perturbation can help increase predictive accuracy for imputed models, but not for one-hot encoded models. Future work can identify the conditions under which missing-data perturbation can improve prediction accuracy. Interesting extensions of this paper include evaluating the benefits of using missing-data perturbation over more popular regularization techniques such as dropout training.
Figure: error rates on the Adult test set with (bottom) and without (top) missing data imputation, for various levels of MCAR-perturbed categorical training features (x-axis). The Adult dataset contains N = 48,842 examples and 14 features (6 continuous and 8 categorical). The prediction task is to determine whether a person makes over $50,000 a year.
• 77. CEEMD: Empirical Mode Decomposition
Empirical mode decomposition for seismic time-frequency analysis. Jiajun Han and Mirko van der Baan. Geophysics (2013) 78(2): O9-O19. https://doi.org/10.1190/geo2012-0199.1
Complete ensemble empirical mode decomposition (CEEMD) decomposes a seismic signal into a sum of oscillatory components, with guaranteed positive and smoothly varying instantaneous frequencies. Analysis on synthetic and real data demonstrates that this method promises higher spectral-spatial resolution than the short-time Fourier transform or wavelet transform. Application on field data thus offers the potential of highlighting subtle geologic structures that might otherwise escape unnoticed.
CEEMD is a robust extension of EMD methods. It solves not only the mode-mixing problem, but also leads to complete signal reconstructions. After CEEMD, instantaneous frequency spectra manifest visibly higher time-frequency resolution than short-time Fourier and wavelet transforms on synthetic and field data examples. These characteristics render the technique highly promising for seismic processing and interpretation.
Introducing libeemd: A program package for performing the ensemble empirical mode decomposition (July 2015). Computational Statistics 31(2):1-13. P.J.J. Luukko, Jouni Helske, E. Räsänen. C, R and Python. http://doi.org/10.1007/s00180-015-0603-9 | https://bitbucket.org/luukko/libeemd
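libeemd ships C, R and Python interfaces; for a quick look at what ensemble-EMD variants return, the independent Python package PyEMD (pip name EMD-signal) exposes a similar decomposition. A sketch, assuming PyEMD's current API:

    import numpy as np
    from PyEMD import CEEMDAN  # pip install EMD-signal; libeemd's pyeemd is similar

    t = np.linspace(0, 1, 1000)
    s = (np.sin(2 * np.pi * 5 * t)           # slow oscillation
         + 0.5 * np.sin(2 * np.pi * 40 * t)  # fast oscillation
         + 0.1 * np.random.randn(t.size))    # noise

    imfs = CEEMDAN()(s)           # rows are oscillatory modes, ordered fast to slow
    trend = s - imfs.sum(axis=0)  # what remains is the residual trend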
• 78. Source separation: "signal decomposition" #1
Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation. Daniel Stoller, Sebastian Ewert, Simon Dixon. Queen Mary University of London, Spotify (submitted 8 Jun 2018). https://arxiv.org/abs/1806.03185 | https://github.com/f90/Wave-U-Net
"Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high-quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data."
75 tracks from the training partition of the MUSDB multi-track database are randomly assigned to our training set. For singing voice separation, we also add the whole CCMixter database to the training set. No further data preprocessing is performed, only a conversion to mono (except for stereo models) and downsampling to 22050 Hz.
For future work, we could investigate to what extent our model performs a spectral analysis, and how to incorporate computations similar to those in a multi-scale filterbank, or to explicitly compute a decomposition of the input signal into a hierarchical set of basis signals and weightings on which to perform the separation, similar to the TasNet [12]. Furthermore, better loss functions for raw audio prediction should be investigated, such as the ones provided by generative adversarial networks [3, 21], since the MSE might not reflect the perceived loss of quality well.
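A toy PyTorch rendering of the Wave-U-Net idea (decimate by two at each level, then upsample by linear interpolation and concatenate the skip connections on the way back up); layer counts and widths are illustrative, not the authors' configuration:

    import torch
    import torch.nn as nn

    class WaveUNet1d(nn.Module):
        def __init__(self, ch=24, levels=3):
            super().__init__()
            self.down, c_in = nn.ModuleList(), 1
            for i in range(levels):
                self.down.append(nn.Conv1d(c_in, ch * (i + 1), 15, padding=7))
                c_in = ch * (i + 1)
            self.bottleneck = nn.Conv1d(c_in, c_in, 15, padding=7)
            self.up = nn.ModuleList()
            for i in reversed(range(levels)):
                self.up.append(nn.Conv1d(c_in + ch * (i + 1), ch * (i + 1), 5, padding=2))
                c_in = ch * (i + 1)
            self.out = nn.Conv1d(c_in, 1, 1)

        def forward(self, x):                      # x: (batch, 1, samples)
            skips = []
            for conv in self.down:
                x = torch.relu(conv(x))
                skips.append(x)
                x = x[:, :, ::2]                   # decimate: downsample by 2
            x = torch.relu(self.bottleneck(x))
            for conv, skip in zip(self.up, reversed(skips)):
                x = nn.functional.interpolate(x, size=skip.shape[-1],
                                              mode="linear", align_corners=False)
                x = torch.relu(conv(torch.cat([x, skip], dim=1)))
            return self.out(x)

    # usage: separated = WaveUNet1d()(torch.randn(1, 1, 1024))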
• 79. Source separation: "signal decomposition" #2
TasNet: Surpassing Ideal Time-Frequency Masking for Speech Separation. Yi Luo, Nima Mesgarani (submitted 21 Sep 2018). https://arxiv.org/abs/1809.07454
"TasNet uses a convolutional encoder to create a representation of the signal that is optimized for extracting individual speakers. Speaker extraction is achieved by applying a weighting function (mask) to the encoder output. The modified encoder representation is then inverted to the sound waveform using a linear decoder. A linear deconvolution layer serves as a decoder by inverting the encoder output back to the sound waveform. This encoder-decoder framework is similar to the ICA method when a nonnegative mixing matrix is used [Wang et al. 2009] and to the semi-nonnegative matrix factorization method (semi-NMF) [Ding et al. 2008], where the basis signals are the parameters of the decoder.
The masks are found using a temporal convolutional network (TCN) consisting of dilated convolutions, which allow the network to model the long-term dependencies of the speech signal. This end-to-end speech separation algorithm significantly outperforms previous time-frequency methods in terms of separating speakers in mixed audio, even when compared to the separation accuracy achieved with the ideal time-frequency mask of the speakers. In addition, TasNet has a smaller model size and a shorter minimum latency, making it a suitable solution for both offline and real-time speech separation applications."
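The encoder / mask / decoder pipeline described above can be sketched in a few PyTorch layers; this stand-in replaces the paper's deep dilated TCN masker with a two-layer stub, so it shows the data flow, not the published model:

    import torch
    import torch.nn as nn

    class TasNetSketch(nn.Module):
        def __init__(self, n_filters=256, kernel=20, n_speakers=2):
            super().__init__()
            self.n_speakers = n_speakers
            # learned 1-D conv filterbank in place of an STFT front-end
            self.encoder = nn.Conv1d(1, n_filters, kernel, stride=kernel // 2, bias=False)
            self.masker = nn.Sequential(   # stand-in for the dilated TCN stack
                nn.Conv1d(n_filters, n_filters, 3, padding=1), nn.PReLU(),
                nn.Conv1d(n_filters, n_filters * n_speakers, 1))
            # linear "deconvolution" decoder inverts masked features to waveforms
            self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel,
                                              stride=kernel // 2, bias=False)

        def forward(self, mix):                    # mix: (batch, 1, samples)
            w = torch.relu(self.encoder(mix))      # nonnegative mixture representation
            masks = torch.sigmoid(self.masker(w))  # one mask per speaker
            masks = masks.view(mix.shape[0], self.n_speakers, -1, w.shape[-1])
            # output length may differ from the input by a few samples
            return torch.stack([self.decoder(w * masks[:, s])
                                for s in range(self.n_speakers)], dim=1)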
• 80. Source separation: "signal decomposition" #3
Disentangling Correlated Speaker and Noise for Speech Synthesis via Data Augmentation and Adversarial Factorization. Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Yu-An Chung, Yuxuan Wang, Yonghui Wu, James Glass. 32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada. https://openreview.net/pdf?id=Bkg9ZeBB37
"To leverage crowd-sourced data to train multi-speaker text-to-speech (TTS) models that can synthesize clean speech for all speakers, it is essential to learn disentangled representations which can independently control the speaker identity and background noise in generated signals. However, learning such representations can be challenging, due to the lack of labels describing the recording conditions of each training example, and the fact that speakers and recording conditions are often correlated, e.g. since users often make many recordings using the same equipment.
This paper proposes three components to address this problem by: (1) formulating a conditional generative model with factorized latent variables, (2) using data augmentation to add noise that is not correlated with speaker identity and whose label is known during training, and (3) using adversarial factorization to improve disentanglement. Experimental results demonstrate that the proposed method can disentangle speaker and noise attributes even if they are correlated in the training data, and can be used to consistently synthesize clean speech for all speakers."
• 81. Decompose high and low frequencies
Drop an Octave: Reducing Spatial Redundancy in Convolutional Neural Networks with Octave Convolution. Yunpeng Chen, Haoqi Fang, Bing Xu, Zhicheng Yan, Yannis Kalantidis, Marcus Rohrbach, Shuicheng Yan, Jiashi Feng (submitted 10 Apr 2019). https://export.arxiv.org/abs/1904.05049
In this work, we propose to factorize the mixed feature maps by their frequencies, and design a novel Octave Convolution (OctConv) operation to store and process feature maps that vary spatially "slower" at a lower spatial resolution, reducing both memory and computation cost. Unlike existing multi-scale methods, OctConv is formulated as a single, generic, plug-and-play convolutional unit that can be used as a direct replacement for (vanilla) convolutions without any adjustments in the network architecture. It is also orthogonal and complementary to methods that suggest better topologies or reduce channel-wise redundancy, like group or depth-wise convolutions. We experimentally show that by simply replacing convolutions with OctConv, we can consistently boost accuracy for both image and video recognition tasks, while reducing memory and computational cost.
• 82. Decompose the signal and the noise
Deep learning of dynamics and signal-noise decomposition with time-stepping constraints. Samuel H. Rudy, J. Nathan Kutz, Steven L. Brunton. Department of Applied Mathematics / Mechanical Engineering, University of Washington, Seattle (last revised 22 Aug 2018). https://arxiv.org/abs/1808.02578 | https://github.com/snagcliffs/RKNN
"We propose a novel paradigm for data-driven modeling that simultaneously learns the dynamics and estimates the measurement noise at each observation. By constraining our learning algorithm, our method explicitly accounts for measurement error in the map between observations, treating both the measurement error and the dynamics as unknowns to be identified, rather than assuming idealized noiseless trajectories. We also discuss issues with the generalizability of neural network models for dynamical systems and provide open-source code for all examples."
The combination of neural networks and numerical time-stepping schemes suggests a number of high-priority research directions in system identification and data-driven forecasting. Future extensions of this work include considering systems with process noise, a more rigorous analysis of the specific method for interpolating f, including time-delay coordinates to accommodate latent variables, and generalizing the method to identify partial differential equations. Rapid advances in hardware and the ease of writing software for deep learning will enable these innovations through fast turnover in developing and testing methods.
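The "time-stepping constraint" is a classical integrator wrapped around the learned dynamics; a sketch of our reading of the RKNN setup, simplified to an RK4 step and a quadratic consistency loss (network size and state dimension are illustrative):

    import torch
    import torch.nn as nn

    # small network standing in for the unknown dynamics dx/dt = f(x)
    f = nn.Sequential(nn.Linear(3, 64), nn.Tanh(), nn.Linear(64, 3))

    def rk4_step(f, x, dt):
        # classical fourth-order Runge-Kutta step used as the time-stepping constraint
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        return x + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

    # training idea: learn per-observation noise n jointly with f, and require the
    # denoised states to be consistent under the integrator:
    #   loss = || (x_obs[1:] - n[1:]) - rk4_step(f, x_obs[:-1] - n[:-1], dt) ||^2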
• 84. Super-resolution: Insights from audio
Time-frequency networks for audio super-resolution. Teck Yian Lim et al. (2018). http://isle.illinois.edu/sst/pubs/2018/lim18icassp.pdf | http://tlim11.web.engr.illinois.edu/
"Audio super-resolution (a.k.a. bandwidth extension) is the challenging task of increasing the temporal resolution of audio signals. Recent deep network approaches achieved promising results by modeling the task as a regression problem in either the time or frequency domain. In this paper, we introduce the Time-Frequency Network (TFNet), a deep network that utilizes supervision in both the time and frequency domains. We propose a novel model architecture which allows the two domains to be jointly optimized."
Figure: spectrograms of the LR input (frequencies above 4 kHz missing), the HR reconstruction, and the HR ground truth. Our approach successfully recovers the high-frequency components from the LR audio signal.
• 85. GANs: Also for time-series denoising #1a
Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi, Tim Oates, Tinoosh Mohsenin and David Hairston (2018). https://doi.org/10.1007/978-3-319-93040-4_23
"In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data. Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data."
Consider, for example, a widely used method for time series featurization called Symbolic Aggregate approXimation (SAX) that assumes time series are generated from a single normal distribution. As shown in prior work, this assumption does not hold in several real-life time series datasets. Other techniques assume noise comes from a Gaussian distribution and estimate the parameters of that distribution. This assumption does not hold for data sources like electroencephalography (EEG), where noise can have diverse characteristics and originate from different sources. Hence, in this work, we focus on learning the characteristics of noise in EEG data and removing it as a preprocessing step. ICA has high computational complexity and large memory requirements, making it unsuitable for real-time applications.
For training of our network, we only need a set of clean signals and a set of noisy signals. We do not need paired training data, i.e., we do not need clean versions of the noisy data. This is particularly useful for applications like artifact removal in EEG data, as we cannot record clean versions of noisy EEG.
• 86. GANs: Also for time-series denoising #1b
Denoising Time Series Data Using Asymmetric Generative Adversarial Networks. Sunil Gandhi, Tim Oates, Tinoosh Mohsenin and David Hairston (2018). https://doi.org/10.1007/978-3-319-93040-4_23
Pre-processing: The DC component in EEG data is different for each recording. We normalize every window of clean and noisy data to remove the DC offset from the data. We remove the DC offset by subtracting the median of the data in the window.
Evaluation of EEG data is challenging as the ground-truth noiseless signals are not known. Multiple approaches to evaluation have been proposed in recent years; however, authors do not agree on a single mechanism for evaluating artifact removal.
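The described per-window DC correction is a one-liner worth pinning down (median rather than mean, which keeps large transient artifacts from shifting the baseline):

    import numpy as np

    def remove_dc(window):
        # per-window DC removal by median subtraction, robust to spikes/artifacts
        window = np.asarray(window, dtype=float)
        return window - np.median(window)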
• 87. GANs: Also for speech denoising
SEGAN: Speech Enhancement Generative Adversarial Network. Santiago Pascual, Antonio Bonafonte, and Joan Serra (2017). https://arxiv.org/abs/1703.09452 | https://github.com/santi-pdp/segan
"For the purpose of speech enhancement and denoising, the SEGAN was developed, employing a neural network with an encoder and decoder pathway that successively halves and doubles the resolution of feature maps in each layer, respectively, and features skip connections between encoder and decoder layers.
The model works as an encoder-decoder fully-convolutional structure, which makes it fast to operate for denoising waveform chunks. The results show that not only is the method viable, but it can also represent an effective alternative to current approaches. Possible future work involves the exploration of better convolutional structures and the inclusion of perceptual weightings in the adversarial training, so that we reduce possible high-frequency artifacts that might be introduced by the current model. Further experiments need to be done to compare SEGAN with other competitive approaches."
The dataset is a selection of 30 speakers from the Voice Bank corpus.