# Learnable Group Transform For Time-Series

Romain Cosentino¹, Behnaam Aazhang¹

¹ Department of Electrical and Computer Engineering, Rice University, USA. Correspondence to: Romain Cosentino.

Proceedings of the 37th International Conference on Machine Learning, Online, PMLR 119, 2020.

Abstract

We propose a novel approach to filter bank learning for time-series by considering spectral decompositions of signals defined as a Group Transform. This framework allows us to generalize classical time-frequency transformations such as the Wavelet Transform and to efficiently learn the representation of signals. While the creation of the wavelet transform filter bank relies on affine transformations of a mother filter, our approach allows for non-linear transformations. The transformations induced by such maps enable us to span a larger class of signal representations, from wavelet- to chirplet-like filters. We propose a parameterization of such a non-linear map such that its sampling can be optimized for a specific task and signal. The Learnable Group Transform can be cast into a Deep Neural Network. Experiments on diverse time-series datasets demonstrate the expressivity of this framework, which competes with state-of-the-art performance.

1. Introduction

To this day, the front-end processing of time-series remains a keystone toward the improvement of a wealth of applications such as health care (Saritha et al., 2008), environmental sound (Balestriero et al., 2018; Lelandais & Glotin, 2008), and seismic data analysis (Seydoux et al., 2016). The common denominator of the recorded signals in these fields is their undulatory behavior. While these signals share this common behavior, two significant factors imply the need to learn the representation: 1) time-series are intrinsically different because of their physical nature, and 2) the machine learning task can differ even within the same type of data. Therefore, the representation should be induced by both the signal and the task at hand.

A common approach to performing inference on time-series consists of building a Deep Neural Network (DNN) that operates on a spectral decomposition of the time-series, such as the wavelet transform (WT) or Mel Frequency Spectral Coefficients (MFSC). These decompositions represent the signal. While the use of these decompositions is extensive, we show in Section 2 their inherent biases and motivate the development of a generalized framework. The selection of the judicious transform is either performed by an expert on the signal at hand or by considering filter selection methods (Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Gribonval & Bacry, 2003). However, an inherent drawback is that the selection of the filters decomposing the signals is often achieved with criteria that do not align with the task; for instance, a selection based on the sparsity of the representation while the task is the classification of the signals. Besides, these selection methods and transformations require substantial cross-validation of a large number of hyperparameters such as the mother filter family, the number of octaves, the number of wavelets per octave, and the size of the window (Le & Argoul, 2004; Cosentino et al., 2017). In this work, we alleviate these drawbacks with a simple and efficient approach that generalizes these spectral decompositions, which consist of taking inner products between filters and the signals.
From one decomposition to the other, only the filter bank differs. The filters of well-known spectral decompositions, such as the short-time Fourier transform (STFT) and the continuous wavelet transform (CWT), are built following a particular scheme. Each filter is the result of the action of a transformation map on a selected mother filter, e.g., a Gabor filter. If the transformation map is induced by a group, the representation is called a Group Transform (GT), and both the group and the mother filter characterize the decomposition. We propose to enable the learnability of such a scheme. More precisely, our contributions are: 1) we generalize common Group Transforms by proposing the utilization of strictly increasing and non-linear transformations, 2) we draw the connection between filters that can be learned by our framework and filters commonly observed in biological time-series, 3) we show how the equivariance properties of the representation differ from those of traditional affine transformations, Section 3.1, 4) we propose an efficient way of optimizing the sampling of such a functional space, Section 3.2, and 5) we apply our method to three datasets containing complementary challenges: a) artificial data showing the limitations and drawbacks of well-known GTs, b) a large bird detection dataset (20 hours of audio recording, 20 times larger than CIFAR10 in terms of number of scalar values in the dataset) where optimal spectral decompositions are known and developed by experts, and c) a haptic dataset that does not benefit from expert knowledge regarding important features, Section 4.

We can summarize our approach as follows (a minimal sketch of this pipeline is given after the list): given a filter ψ with its analytical formula,
- generate increasing and continuous maps using a 1-layer ReLU network (the number of increasing and continuous maps is the number of filters in the filter bank),
- compose the increasing and continuous maps with the filter ψ,
- convolve the filters obtained with the signal to acquire the representation.
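As a rough, self-contained sketch of these three steps (an illustration with assumed parameter choices such as the grid size, number of filters, and Morlet parameters, not the authors' released implementation; see the repository referenced in Section 4), one could write:

```python
import numpy as np

rng = np.random.default_rng(0)
T, K, H = 512, 8, 256           # warped-time grid, number of filters, hidden units (knots)
t = np.linspace(-1.0, 1.0, T)

def morlet(u, f0=5.0, sigma=0.2):
    # complex Morlet mother filter evaluated at (warped) time u
    return np.pi ** -0.25 * np.exp(2j * np.pi * f0 * u) * np.exp(-0.5 * (u / sigma) ** 2)

filters = []
for _ in range(K):
    # 1-layer ReLU network t -> g(t): a piecewise-affine map of time
    w1, b1 = rng.standard_normal(H), rng.standard_normal(H)
    w2 = rng.standard_normal(H) / H
    g = np.maximum(w1[:, None] * t[None, :] + b1[:, None], 0.0)   # (H, T) hidden ReLU units
    g = w2 @ g                                                    # (T,) piecewise-affine output
    g = np.sort(g)                        # sorting makes the sampled map increasing
    filters.append(morlet(g))             # compose the mother filter with the warp

s = rng.standard_normal(4096)             # a toy signal
rep = np.abs([np.convolve(s, f, mode="same") for f in filters])   # (K, 4096) LGT with modulus
print(rep.shape)
```

Sorting the sampled network outputs is what enforces that each map is increasing on the time grid; Section 3.2 details this construction and how it is learned end-to-end.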
2. Related Work and Background

One approach to representing the data consists of building equivariant-invariant representations. For instance, (Mallat, 2012; Bruna, 2013) propose a translation-invariant representation, the Scattering Transform, which is stable under the action of small diffeomorphisms. (Oyallon et al., 2018; Cohen & Welling, 2016) focus on equivariant-invariant representations for images, which reduce the sample complexity and endow DNN layers with interpretability. The closest work to ours consists of learning the filter bank in an end-to-end fashion. (Cakir et al., 2016; Ravanelli & Bengio, 2018; Balestriero et al., 2018; Zeghidour et al., 2018) investigated the learnability of a mother filter such that it can be jointly optimized with the DNN. In order to build the filter bank, this learnable mother filter is transformed by deterministic affine maps. The representation of the signal is obtained by convolving the filter bank elements with the signals. Recently, (Khan & Yener, 2018) investigated the learnability of the affine transformations, that is, the sampling of the dilation parameter of the affine group inducing the wavelet filter bank. Optimized jointly with the DNN, their method allows for an adaptive transformation of the mother filter. Our work generalizes this approach and provides its theoretical properties and building blocks.

One of the main drawbacks of these approaches using time-frequency representations is that the filter bank induces a bias that might not be adapted to the data. This bias can be understood by considering the time-frequency tiling of each GT. It is known that the spread of a filter and the spread of its Fourier transform are inversely proportional, as per the Heisenberg uncertainty principle (Mallat, 1999). Following this principle, we can observe that in the case of the STFT (respectively, the WT with a Gabor wavelet), at a given time τ, the signal is transformed by a window of constant bandwidth (respectively, proportional bandwidth) modulated by a complex exponential, resulting in a uniform (respectively, proportional) tiling on the frequency axis, Figure 1. This implies that, for instance, in the case of the WT, the precision in frequency degrades as the frequency increases while the precision in time increases (Mallat, 1999). Thus, the WT is not adapted to fast-varying frequency signals (Xu et al., 2016). In the case of the STFT, the uniform tiling implies that the precision is constant along the frequency axis. In our proposed framework, the LGT allows for an adaptive tiling, as illustrated in Figure 1, such that the trade-off between time and frequency precision depends on the task and data.

Figure 1. Time-Frequency Tilings at a given time τ: (left) short-time Fourier transform, i.e., constant bandwidth; (middle) wavelet transform, i.e., proportional bandwidth; (right) Learnable Group Transform, i.e., adaptive bandwidth, where the tiling is induced by the learned non-linear transformation underlying the filter bank decomposition.

3. Learnable Group Transform

Common time-frequency filter banks are built by transforming a mother filter that we denote by $\psi$. We consider the transformations of this mother filter defined as $\psi \circ g$, $g \in \mathcal{F}$, where $\mathcal{F}$ defines the functional space of the transformations and $\circ$ denotes function composition. Note that in signal processing, such a transformation is called warping (Goldenstein & Gomes, 1999; Kerkyacharian et al., 2004). Given a space $\mathcal{F}$, the filter bank with $K$ filters is created by first sampling $K$ transformation maps from $\mathcal{F}$ and then transforming the mother filter, such as $\{\psi \circ g_1, \ldots, \psi \circ g_K \,|\, g_1, \ldots, g_K \in \mathcal{F}\}$.

Figure 2. Learnable Group Transform: (left) generating the strictly increasing continuous functions $g_{\theta_k}$ with parameters $\theta_k$, $k \in \{1, \ldots, K\}$, where $K$ denotes the number of filters in the filter bank; the x-axis is the time variable and the y-axis the amplitude. (middle) The mother filter $\psi$ (presently a Morlet wavelet) is composed with each warping function $g_{\theta_k}$, where the imaginary part is shown in red and the real part in blue; the x-axis represents the time and the y-axis the amplitude of the filter. These transformations lead to the filter bank (only the k-th element is displayed). Then, the convolutions between the filter bank elements and the signal $s_i$ lead to the LGT of the signal. The black box on the LGT representation (right) corresponds to the convolution of the k-th filter with the signal; the horizontal axis corresponds to time, each row corresponds to the convolution with a filter of the filter bank, and the color displays the amplitude of each inner product. Notice that a complex modulus has been applied to the LGT. The strictly increasing and continuous piecewise-linear functions can be learned efficiently by back-propagating the error induced by the generated GT.
Now, let us denote a signal by $s_i \in L^2(\mathbb{R})$. We consider the representation of the signal as the result of its convolution with the filter bank elements and denote it by
$W[s_i, \psi](\mathbf{g}, \cdot) = [W[s_i, \psi](g_1, \cdot), \ldots, W[s_i, \psi](g_K, \cdot)]^T$, with $W[s_i, \psi](g, \cdot) = s_i \star (\psi \circ g)$, $\forall g \in \mathcal{F}$,
where $\star$ denotes the convolution operator and $(\cdot)$ corresponds to the time axis. Therefore, the properties of the representation are carried by the mother filter $\psi$ and the space $\mathcal{F}$. In this work, we focus on warpings that generalize common time-frequency decompositions, as well as on the properties carried by the associated filter bank; in particular, we consider non-linear warpings. We provide a parameterization of such a warping and show how one can efficiently learn these parameters. The decomposition of the signal by this learned filter bank defines a Group Transform. The overall building blocks of the LGT and its application to a signal are depicted in Figure 2.

3.1. Time Warped Filters

We propose to transform the mother filter by means of a subset of invertible maps on $\mathbb{R}$. Instead of the affine warping used in the WT, we propose the use of a more general transformation map space $\mathcal{F}$. In particular, we will use the space of strictly increasing and continuous functions defined as
$C^0_{inc}(\mathbb{R}) = \{ g \in C^0(\mathbb{R}) \,|\, g \text{ is strictly increasing} \}$,
where $C^0(\mathbb{R})$ defines the space of continuous functions on $\mathbb{R}$. This set of functions is composed of invertible maps, which is crucial in order to derive invariance properties as well as to avoid artifacts in the transformed filters. The transformation of a mother filter $\psi$ is defined by the linear operator $\rho_{inc}(g)$ such that
$\rho_{inc}(g)\psi = \psi \circ g, \quad \forall g \in C^0_{inc}(\mathbb{R})$.
By construction, this space allows for non-linear transformations of a mother filter. An example of such a warping can be visualized in Figure 3. In the next paragraph, we introduce some filters that can be recovered using this transformation map. For some of these filters, the estimation of their parameters has been investigated (Gribonval, 2001; Wang & Jiang, 2008; Xu et al., 2016); however, our method provides two benefits: first, the generalization, which alleviates the need to select a specific type of filter bank; second, the scalability of our method, leading to a learnable filter bank.

Recovering Standard Filter Banks: The space $C^0_{inc}(\mathbb{R})$ allows us to span well-known transformations. In particular, a filter can inherit a particular chirpyness¹ from non-linear transformations belonging to $C^0_{inc}(\mathbb{R})$. This property is interesting for the decomposition of non-stationary and fast-varying signals. In fact, various signals include such an intricate feature, such as bird songs, speech, and sonar systems (Flandrin, 2001). Among the possible transformations induced on a mother filter by a mapping $g \in C^0_{inc}(\mathbb{R})$, some correspond to well-known filters described in Table 1. For instance, let us consider the case where $\mathcal{F}$ is the space of linear functions with positive slope, defined as $\forall g \in \mathcal{F}, g(t) = \frac{t}{\lambda}$, where $\lambda$ is positive. In this case, we recover the transformation leading to the dilation or contraction of a wavelet mother filter.

¹ Chirpyness is defined as the rate of change of the instantaneous frequency of the filter (Mann & Haykin, 1995).

Figure 3. Transformation of a Morlet Wavelet: for all the filters, the real part is shown in blue and the imaginary part in red. (left) Morlet wavelet mother filter. (middle) Transformation of the mother filter with respect to an affine transform: dilation parameter $0 < a < 1$, i.e., contraction, and translation $b = 0$, i.e., no translation. (right) Increasing and continuous transformation of the mother filter for a randomly generated function $g \in C^0_{inc}(\mathbb{R})$, leading to a chirplet-like filter.

Table 1. Recovering well-known filters

| $g \in C^0_{inc}(\mathbb{R})$ | $\psi \circ g$ |
|---|---|
| Affine | Wavelet |
| Quadratic convex | Increasing quadratic chirplet |
| Quadratic concave | Decreasing quadratic chirplet |
| Logarithmic | Logarithmic chirplet |
| Exponential | Exponential chirplet |
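As a small illustration of Table 1 (with assumed, illustrative warp parameters), composing a Morlet mother filter with an affine warp reproduces a contracted wavelet, while a convex quadratic warp yields an increasing-chirp, chirplet-like filter as in Figure 3:

```python
import numpy as np

t = np.linspace(-1.0, 1.0, 512)

def morlet(u, f0=5.0, sigma=0.2):
    # complex Morlet mother filter
    return np.pi ** -0.25 * np.exp(2j * np.pi * f0 * u) * np.exp(-0.5 * (u / sigma) ** 2)

lam = 0.5
g_affine = t / lam                 # affine warp g(t) = t / lambda -> contracted wavelet
g_quad = t + 0.4 * t ** 2          # convex quadratic warp (strictly increasing on [-1, 1])

wavelet_filter  = morlet(g_affine)   # row "Affine -> Wavelet" of Table 1
chirplet_filter = morlet(g_quad)     # row "Quadratic convex -> increasing quadratic chirplet"
```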
The filter bank is then generated by sampling a few elements of the group. In the case of the dyadic wavelet transform, the dilation parameters follow a geometric progression with a common ratio equal to 2, such that $\lambda_k = 2^{(k-1)/Q}$, $k = 1, \ldots, K$, where $K = J \cdot Q$, with $J$ and $Q$ the number of octaves and the number of wavelets per octave, respectively. The filter bank obtained is $\{\psi(\frac{t}{\lambda_1}), \ldots, \psi(\frac{t}{\lambda_K})\}$, and the representation of the signal is obtained by convolutions between the filter bank elements and the signal. Equivalently, the space $\mathcal{F}$ can be defined as affine, and the WT is achieved by inner products between the filters and the signal. While the WT filter bank can easily be recovered, our modeling of the filter bank does not allow for elements with a number of oscillations that differs from the mother filter. To enable such a transformation, another function $h$ with a number of oscillations that differs from the mother filter could be multiplied with the mother filter, such that $h \cdot (\psi \circ g)$ provides the elements of the filter bank. Therefore, the STFT is not part of the representations that such a framework encompasses. In this work, we also consider the case where the representation of the signal is computed by convolutions. This representation has equivariance properties that are induced by the convolution operator as well as by the space $C^0_{inc}(\mathbb{R})$.

Equivariance Properties of the Filter Bank: The equivariance-invariance properties of signal representations play a crucial role in the efficiency of the algorithm at hand, as they define how some variations in the signal may or may not be captured (Mallat, 2016). These properties can be intuitively explained and analyzed by considering the representation of the signal as a function of group elements. Details regarding the background of group theory and its link with wavelet analysis are provided in Appendix A. Considering the mapping $\rho_{inc}(g)\psi = \psi \circ g$, $g \in C^0_{inc}(\mathbb{R})$, as a group action on the space of the mother filter, i.e., $L^2(\mathbb{R})$, or more precisely, a representation of a group on $L^2(\mathbb{R})$, we can develop the equivariance properties of the LGT. The proof that $\rho_{inc}$ is in fact a representation is given in Appendix D.1. We can consider the set $C^0_{inc}(\mathbb{R})$, with the operation consisting of the composition of functions, to form the group of strictly increasing and continuous maps denoted by $G_{inc}$. This formulation eases the derivation of the equivariance properties of group transforms, which can be defined for a group $G$, for all $g, g' \in G$, by
$W[\rho(g')s_i, \psi](g, \cdot) = W[s_i, \psi]((g')^{-1} \circ g, \cdot)$.
Transforming the signal with respect to the group $G$ and computing its representation is equal to computing the representation of the signal and then transforming the representation. If $G$ corresponds to the affine group, the associated group transform is the WT, which is equivariant to scalings and translations. One can already notice that, since $W(\cdot, \cdot)$ employs convolutions to decompose the signal, the LGT is translation equivariant for any group $G$.
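This translation equivariance can be checked numerically; the following sketch uses circular convolution (an assumption made here so that the identity is exact, free of boundary effects) and an arbitrary stand-in for a warped filter:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 256
s = rng.standard_normal(T)            # a toy signal
psi = rng.standard_normal(T)          # stands in for a warped filter psi o g

def circ_conv(x, h):
    # circular convolution via FFT, so the equivariance check is exact
    return np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)).real

shift = 17
lhs = circ_conv(np.roll(s, shift), psi)     # representation of the shifted signal
rhs = np.roll(circ_conv(s, psi), shift)     # shifted representation of the signal
print(np.allclose(lhs, rhs))                # True: translation equivariance
```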
We now focus on more specific equivariance properties of the LGT by defining the local equivariance, for all $g, g' \in G$, by
$\forall \tau \in \mathbb{R}, \quad W[\rho(g')s_i, \psi](g, \tau) = W[s_i, \psi]((g')^{-1} \circ g, \tau)$.
That is, the representation of a local transformation of a signal in a window centered at $\tau$ is equal to the transformation of the representation at $\tau$. The size of the window depends on the support of the filter. As a matter of fact, assuming that the representation of $G_{inc}$ is unitary, we have the following proposition.

Proposition 1. The LGT is locally equivariant with respect to the action of the group $G_{inc}$.

The proof is given in Appendix D. As we mentioned, a filter bank of $K$ filters is created by sampling the space $C^0_{inc}(\mathbb{R})$. We now show how this sampling can be achieved efficiently by proposing a parametrization of the functions belonging to such a space.

3.2. Learning the Time Warping

In this work, we are specifically interested in the learnability of such an increasing and continuous map. We provide a way to sample this space via its parameterization. We use piecewise affine functions constrained to belong to the class of strictly increasing and continuous functions, which can be efficiently enforced by sorting the output of a 1-layer ReLU NN.

Adaptive Knot Implementation: To implement the non-linear mapping induced by the representation of the piecewise affine group, we use the fact that a piecewise continuous function can be re-written as a 1-layer ReLU Neural Network (Arora et al., 2016; Yarotsky, 2017). Besides the computational advantages of such a relationship and the differentiability with respect to the weights of the NN, this model is a knot-free piecewise affine mapping, providing more flexibility regarding the warping function. The knot-free mapping implies that instead of having each affine piece of the function with uniform support, the support can vary. As such, this flexibility induces better approximation properties (Jupp, 1978). Then, the increasing constraint on the mapping is implemented by sorting the output of the NN. This operation has $O(n \log n)$ complexity and is applied to the warped time, which is usually of size $2^9$.

Objective Function and Learning: Let $\theta_k$ be the parameters of each increasing piecewise affine map computed by the NN, and denote by $g_{\theta_k}$ the sorted outputs of the NN. The LGT filter bank has the following form: $\{\psi \circ g_{\theta_1}, \ldots, \psi \circ g_{\theta_K}\}$. Given a set of signals $\{s_i \in L^2(\mathbb{R})\}_{i=1}^N$ and a task-specific loss function $\mathcal{L}$, we aim at solving the following optimization problem:
$\min_{\Theta} \sum_{i=1}^{N} \mathcal{L}\big(F\big(W[s_i, \psi](g_\Theta, \cdot)\big)\big)$,
where $\Theta = (\theta_1, \ldots, \theta_K)$, $N$ denotes the number of signals, $K$ the number of filters, $F$ represents a DNN, and we recall that $W[s_i, \psi](g_\Theta, \cdot) = [W[s_i, \psi](g_{\theta_1}, \cdot), \ldots, W[s_i, \psi](g_{\theta_K}, \cdot)]^T$. Since the $g_{\theta_k}$ are computed by sorting the output of the NN, the parameters $\Theta$ can be learned by gradient descent jointly with the parameters of $F$.
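Before turning to the model constraints below, here is a minimal PyTorch-style sketch of this learning scheme (a hypothetical implementation with assumed sizes and Morlet parameters, not the authors' code): each warp $g_{\theta_k}$ is the sorted output of a 1-layer ReLU network evaluated on the time grid, the filter bank is rebuilt from the warps, and gradients reach the network weights through the sort.

```python
import torch
import torch.nn.functional as F

T, K, H = 512, 4, 256            # warped-time grid (2**9), number of filters, knots per map
t = torch.linspace(-1.0, 1.0, T)

# One 1-layer ReLU network per filter, mapping time samples to a piecewise-affine output.
W1 = torch.randn(K, H, 1, requires_grad=True)
b1 = torch.randn(K, H, requires_grad=True)
W2 = torch.randn(K, 1, H, requires_grad=True)

h = torch.relu(W1 @ t.view(1, 1, T) + b1.unsqueeze(-1))     # (K, H, T) hidden ReLU units
g = (W2 @ h).squeeze(1) / H                                 # (K, T) piecewise-affine maps
g, _ = torch.sort(g, dim=-1)                                # enforce strictly increasing samples

f0, sigma = 5.0, 0.2                                        # illustrative Morlet parameters
env = torch.exp(-0.5 * (g / sigma) ** 2)
psi_r = env * torch.cos(2 * torch.pi * f0 * g)              # real part of psi o g_theta
psi_i = env * torch.sin(2 * torch.pi * f0 * g)              # imaginary part of psi o g_theta

s = torch.randn(8, 1, 2048)                                 # a batch of toy signals
rep = F.conv1d(s, psi_r.unsqueeze(1), padding=T // 2) ** 2 \
    + F.conv1d(s, psi_i.unsqueeze(1), padding=T // 2) ** 2  # squared-modulus LGT

loss = rep.mean()            # stand-in for a task loss applied on top of a DNN F
loss.backward()              # gradients flow back through the sort to W1, b1, W2
print(W1.grad.shape)
```

In the experiments of Section 4, the placeholder loss above is replaced by the task loss computed on the classifier applied to the representation, and the warp parameters are optimized jointly with it using Adam.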
Model Constraints to Reduce Aliasing: The non-linearity of the transformation might reduce the localization of the filter in the frequency domain and produce aliasing. For some applications, the localization of each filter in the frequency domain is crucial, e.g., the bird detection task in Section 4.2. In order to limit the possible aliasing induced by the piecewise increasing mappings applied to the mother filter, we propose different settings. Besides, these constraints also impact the type of filter bank our method can reach.

First, we propose a normalization of the frequency of the transformed filter (denoted in the result tables by nLGT). This normalization helps reduce the aliasing induced by the filters. We propose to use $\hat{f}$, the frequency $f$ normalized with respect to the maximum slope of the piecewise affine mapping. For instance, in the case of a Morlet wavelet, the normalization is as follows:
$(\psi \circ g_\theta)(t) = \pi^{-\frac{1}{4}} \exp\big(2\pi j \hat{f} g_\theta(t)\big) \exp\big(-\tfrac{1}{2}(g_\theta(t)/\sigma)^2\big)$, with $\hat{f} = f / \max_{l \in \{1, \ldots, n\}} a_l$,
where $n$ denotes the number of pieces of the piecewise map, $a_l$ the slope of each piece, $j$ the imaginary unit, and $\sigma$ the width parameter defining the localization of the wavelet in time and frequency. This normalization is performed for each sample of the group, and thus for each generated filter $k \in \{1, \ldots, K\}$ of the filter bank.

Second, we constrain the domain of the piecewise affine map (denoted in the result tables by cLGT). In the following experiments, we propose a dyadic constraint on the domain, as in the WT. The support of the filter is then close to the support of a wavelet filter bank; however, the envelope of the filter and the instantaneous frequency still have a learned chirpyness.

4. Experiments

For all the experiments and all the settings, i.e., LGT, nLGT, cLGT, and cnLGT, the increasing and continuous piecewise affine maps are initialized randomly, the optimization is performed with the Adam optimizer, and the number of knots of each piecewise affine map is 256. The mother filter used in our setting is a Morlet wavelet. The code of the LGT framework is provided in the following repository: https://github.com/Koldh/LearnableGroupTransformTimeSeries.

4.1. Artificial Data: Classification of Chirp Signals

We present an artificial dataset that demonstrates how a specific time-frequency tiling might not be adapted to, or would require cross-validation for, a given task and data. To build the dataset, we generate one high-frequency ascending chirp and one high-frequency descending chirp of size 8192 following the chirplet formula provided in (Baraniuk & Jones, 1996). Then, for both chirp signals, we add Gaussian noise samples (100 times for each class); see the figures in Appendix C.1. The task aims at detecting whether the chirp is ascending or descending. Both the training and test sets are composed of 50 instances of each class. For all models, the batch size is set to 10 and the number of epochs to 50. Each experiment was repeated 5 times with randomly sampled train and test sets, and the reported accuracy is the average over these 5 runs.

Figure 4. Learnable Group Transform Filters for the Artificial Data: each row displays two selected filters (left and right sub-figures) for different settings, from top to bottom: nLGT, cLGT, cnLGT. For each sub-figure, the left part corresponds to the filter before training and the right part to the filter after training. Blue and red denote the real and imaginary parts of the filters, respectively.

Table 2. Testing Accuracy for the Chirp Signals Classification Task

| Representation + Non-Linearity + Linear Classifier | Accuracy |
|---|---|
| Wavelet Transform (64 Filters) | 53.01 ± 5.1 |
| Short-Time Fourier Transform (64 Filters) | 65.1 ± 11.9 |
| Short-Time Fourier Transform (128 Filters) | 86.6 ± 9.8 |
| Short-Time Fourier Transform (512 Filters) | 100 ± 0.0 |
| LGT (64 Filters) | 92.9 ± 4.0 |
| nLGT (64 Filters) | 95.7 ± 3.3 |
| cLGT (64 Filters) | 56.8 ± 1.6 |
| cnLGT (64 Filters) | 100.0 ± 0.0 |
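For concreteness, a toy stand-in for the two-class chirp dataset described above could be generated as follows (the paper follows the chirplet formula of Baraniuk & Jones (1996); this sketch assumes plain linear chirps near the Nyquist frequency, keeping the stated signal length and class sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_per_class = 8192, 100
t = np.arange(T) / T

def linear_chirp(f0, f1):
    # instantaneous frequency sweeps linearly from f0 to f1 (in cycles per signal)
    return np.cos(2 * np.pi * (f0 * t + 0.5 * (f1 - f0) * t ** 2))

ascending = linear_chirp(0.35 * T, 0.45 * T)     # high-frequency ascending chirp
descending = linear_chirp(0.45 * T, 0.35 * T)    # high-frequency descending chirp

X = np.concatenate([
    ascending[None, :] + rng.standard_normal((n_per_class, T)),
    descending[None, :] + rng.standard_normal((n_per_class, T)),
])                                               # (200, 8192) noisy signals
y = np.repeat([0, 1], n_per_class)               # 0: ascending, 1: descending
```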
Each GT is composed with a complex modulus, and the inference is performed by a linear classifier. In the case of the WT and the LGT, the size of the filters is 512. As we can observe in Table 2, the WT, as well as the STFT with a small number of filters, perform poorly on this dataset. The chirp signals to be analyzed are localized close to the Nyquist frequency, and in the case of the WT, as illustrated in Figure 1, the wavelet filter bank has a poor frequency resolution at high frequencies while benefiting from a high time resolution. In this experiment, we can see that this characteristic of the WT time-frequency tiling implies that, through time, the small frequency variations of the chirp are not efficiently captured. In the case of the STFT, as the number of filters decreases, the frequency resolution is altered, and thus this frequency variation is not captured. Using a large window for the STFT increases the frequency resolution of the tiling and thus makes it possible to capture the difference between the two classes. In the LGT setting, the tiling has adapted to the task and produces good performance, except for the cLGT model. In fact, the domain of the piecewise linear map is constrained to be dyadic, and thus the adaptivity of the filter bank is reduced, which is not suitable for this specific task. Some of the filters can be visualized in Figure 4, and the representations of the signals in Appendix C.1.2. This experiment shows an example of signals that are not easily classified by either the proportional-bandwidth or the constant-bandwidth tiling without considering cross-validation of hyperparameters.

4.2. Supervised Bird Detection

Table 3. Testing AUC for the Bird Detection Task

| Representation + Non-Linearity + Deep Network | AUC |
|---|---|
| MFSC (80 Filters) | 77.83 ± 1.34 |
| Conv. Filter init. random (80 Filters) | 66.77 ± 1.04 |
| Conv. Filter init. Gabor (80 Filters) | 67.67 ± 0.98 |
| Spline Conv. init. random (80 Filters) (Balestriero et al., 2018) | 78.17 ± 1.48 |
| Spline Conv. init. Gabor (80 Filters) (Balestriero et al., 2018) | 79.32 ± 1.52 |
| LGT (80 Filters) | 78.41 ± 1.38 |
| nLGT (80 Filters) | 75.50 ± 1.39 |
| cLGT (80 Filters) | 79.14 ± 0.83 |
| cnLGT (80 Filters) | 79.68 ± 1.35 |

Figure 5. Learnable Group Transform - visualization of a sample containing a bird song (cLGT): (left) at initialization and (right) after learning. For each sub-figure, the x-axis corresponds to time and the y-axis to the different filters; notice that the y-axis usually corresponds to the scale or the center frequency of the filters. Other representations are displayed in Appendix C.2.2. We can observe that, compared to the initialization, the learned representation is sparser and the SNR is increased. Besides, the representation is less redundant along the frequency axis.

Figure 6. Learnable Group Transform Filters for the Bird Detection Data: each row displays two selected filters (left and right sub-figures) for different settings, from top to bottom: LGT, nLGT, cLGT. For each sub-figure, the left part corresponds to the filter before training and the right part to the filter after training. Blue and red denote the real and imaginary parts of the filters, respectively.

We now propose a large-scale dataset to validate the suitability of our model in a noisy and realistic setting. The dataset is extracted from the Freesound audio archive (Stowell & Plumbley, 2013). It contains about 7,000 field recording signals of 10 seconds sampled at 44 kHz, representing slightly less than 20 hours of audio signals.
The content of these recordings varies from water sounds to city noises. Among these signals, some contain bird songs that are mixed with different background sounds having more energy than the bird song. A visualization of a sample is shown in Appendix C.2.1. The given task is a binary classification where one should predict the presence or absence of a bird song. As the dataset is unbalanced, we use the Area Under Curve (AUC) metric. The results we report for both the benchmarks and our models are evaluated on a test set consisting of 33% of the total dataset. In order to compare with previously used methods, we use the same seeds to sample the train and test sets, the same batch size, i.e., 10, and the same learning rate cross-validation grid as in (Balestriero et al., 2018). For each model, the best hyperparameters are selected, and we train and evaluate the models 10 times with early stopping; the results are shown in Table 3. While the first layer of the architecture has a model-dependent representation (i.e., MFSC, LGT, Conv. filters, ...), we use the state-of-the-art architecture of (Grill & Schlüter, 2017) for the DNN, described in Appendix B.2. Notice that this specific DNN architecture has been designed and optimized for the MFSC representation.

As we can see in Table 3, the case without constraints (LGT) reaches better accuracy than the domain expert benchmark (MFSC). Besides, including more constraints on the model (cnLGT) reduces overfitting and further improves results to outperform the other benchmarks. One can also remark that both the LGT framework and the learnable mother wavelet reach almost the same accuracy, while they both outperform the hand-crafted features as well as the unconstrained convolutional filters. One can notice that all the learned filters in Figure 6 contain either an increasing chirp or a decreasing chirp, corresponding respectively to the convexity or concavity of the instantaneous phase of the filter and thus of the piecewise linear map. Such a feature is used in, and is crucial to, the detection and analysis of bird song (Stowell & Plumbley, 2012).

Figure 7. Learnable Group Transform Filters for the Haptics Data: each row displays two selected filters (left and right sub-figures) for different settings, from top to bottom: nLGT, cLGT, cnLGT. For each sub-figure, the left part corresponds to the filter before training and the right part to the filter after training. Blue and red denote the real and imaginary parts of the filters, respectively.

4.3. Haptics Dataset Classification

Table 4. Testing Accuracy for the Haptics Classification Task

| Model | Accuracy |
|---|---|
| DTW (Al-Naymat et al., 2009) | 37.7 |
| BOSS (Schäfer, 2015) | 46.4 |
| Residual NN (Wang et al., 2017) | 50.5 |
| COTE (Bagnall et al., 2015) | 51.2 |
| Fully Convolutional NN (Wang et al., 2017) | 55.1 |
| WD + Convolutional NN (Khan & Yener, 2018) | 57.5 |
| LGT (96 Filters) + Non-Linearity + Linear Classifier | 53.5 |
| nLGT (96 Filters) + Non-Linearity + Linear Classifier | 50.4 |
| cLGT (96 Filters) + Non-Linearity + Linear Classifier | 58.2 |
| cnLGT (96 Filters) + Non-Linearity + Linear Classifier | 54.3 |

The Haptics dataset is a classification problem with five classes and 155 training and 308 testing samples from the UCR Time Series Repository (Chen et al., 2015), where each time-series has 1092 time samples.
As opposed to the bird dataset, where the features of interest are known and competitive methods have been established, there is no expert knowledge regarding the specific signal features (see Table 4). One can see that our method outperforms the other approaches in the cLGT setting while performing the classification with a linear classifier, as opposed to the other methods, which use DNN algorithms. This demonstrates the capability of our method to transform the data efficiently while requiring neither a further change of basis nor knowledge of the features of interest. Besides, even in a small dataset regime, our approach is capable of learning an efficient transformation of the data. We provide in Figure 7 the visualization of some sampled filters before and after learning, as well as representations in Appendix C.3.2. As opposed to the supervised bird dataset, we can see that the filters do not coincide with well-known filters commonly used in signal processing. This is an example of an application where the features of interest in the signals are unknown, and one requires a learnable representation.

5. Conclusion

In this work, we enable the learnability of Group Transforms and generalize the wavelet transform by introducing non-linear transformations of a mother filter, as well as an efficient way to sample this mapping. We establish the connections with well-known time-frequency filters that are common in diverse biological signals, as well as the derivation of the equivariance properties of the LGT. Also, we have shown a tractable way to learn to sample these transformations using a 1-layer NN, enabling an end-to-end approach. Our approach competes with state-of-the-art methods without a priori knowledge of the signal power spectrum and outperforms classical hand-crafted time-frequency representations. Interestingly, in the bird detection experiment, we recover chirplet filters that are known to be crucial to bird song detection, while in the case of the haptic dataset, where the important features to be captured to perform the classification of the signals are unknown, the learned filters are very dissimilar to classical time-frequency filters and allow us to outperform state-of-the-art methods with a linear classifier.

Acknowledgment

The authors would like to thank Randall Balestriero, Yanis Bahroun, and Anirvan M. Sengupta for their qualitative comments and reviews. The authors were supported by NSF grant SCH-1838873 and NIH grant R01HL144683-CFDA.

References

Al-Naymat, G., Chawla, S., and Taheri, J. SparseDTW: A novel approach to speed up dynamic time warping. In Proceedings of the Eighth Australasian Data Mining Conference - Volume 101, pp. 117-127. Australian Computer Society, Inc., 2009.

Arora, R., Basu, A., Mianjy, P., and Mukherjee, A. Understanding deep neural networks with rectified linear units. arXiv preprint arXiv:1611.01491, 2016.

Bagnall, A., Lines, J., Hills, J., and Bostrom, A. Time-series classification with COTE: the collective of transformation-based ensembles. IEEE Transactions on Knowledge and Data Engineering, 27(9):2522-2535, 2015.

Balestriero, R., Cosentino, R., Glotin, H., and Baraniuk, R. Spline filters for end-to-end deep learning. In International Conference on Machine Learning, pp. 373-382, 2018.

Baraniuk, R. G. Shear madness: signal-dependent and metaplectic time-frequency representations. 1993.

Baraniuk, R. G. and Jones, D. L. Wigner-based formulation of the chirplet transform. IEEE Transactions on Signal Processing, 44(12):3129-3135, 1996.
Bruna, J. Scattering representations for recognition. PhD thesis, 2013.

Cakir, E., Ozan, E. C., and Virtanen, T. Filterbank learning for deep neural network based polyphonic sound event detection. In Neural Networks (IJCNN), 2016 International Joint Conference on, pp. 3399-3406. IEEE, 2016.

Chen, Y., Keogh, E., Hu, B., Begum, N., Bagnall, A., Mueen, A., and Batista, G. The UCR time series classification archive, July 2015. www.cs.ucr.edu/~eamonn/time_series_data/.

Cohen, T. and Welling, M. Group equivariant convolutional networks. In International Conference on Machine Learning, pp. 2990-2999, 2016.

Coifman, R. R. and Wickerhauser, M. V. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713-718, 1992.

Cosentino, R., Balestriero, R., Baraniuk, R. G., and Patel, A. Overcomplete frame thresholding for acoustic scene analysis. arXiv preprint arXiv:1712.09117, 2017.

Daubechies, I. Ten Lectures on Wavelets, volume 61. SIAM, 1992.

Feichtinger, H. G., Kozek, W., and Luef, F. Gabor analysis over finite abelian groups. Applied and Computational Harmonic Analysis, 26(2):230-248, 2009.

Flandrin, P. Time frequency and chirps. In Wavelet Applications VIII, volume 4391, pp. 161-175. International Society for Optics and Photonics, 2001.

Goldenstein, S. and Gomes, J. Time warping of audio signals. In CGI, pp. 52. IEEE, 1999.

Gribonval, R. Fast matching pursuit with a multiscale dictionary of Gaussian chirps. IEEE Transactions on Signal Processing, 49(5):994-1001, 2001.

Gribonval, R. and Bacry, E. Harmonic decomposition of audio signals with matching pursuit. IEEE Transactions on Signal Processing, 51(1):101-111, 2003.

Grill, T. and Schlüter, J. Two convolutional neural networks for bird detection in audio signals. In Proceedings of the 25th European Signal Processing Conference (EUSIPCO), Kos Island, Greece, August 2017. URL http://ofai.at/~jan.schlueter/pubs/2017_eusipco.pdf.

Jupp, D. L. B. Approximation to data by splines with free knots. SIAM Journal on Numerical Analysis, 15(2):328-343, 1978.

Kerkyacharian, G., Picard, D., et al. Regression in random design and warped wavelets. Bernoulli, 10(6):1053-1105, 2004.

Khan, H. and Yener, B. Learning filter widths of spectral decompositions with wavelets. In Advances in Neural Information Processing Systems, pp. 4601-4612, 2018.

Korda, M. and Mezić, I. Linear predictors for nonlinear dynamical systems: Koopman operator meets model predictive control. Automatica, 93:149-160, 2018.

Le, T.-P. and Argoul, P. Continuous wavelet transform for modal identification using free decay response. Journal of Sound and Vibration, 277(1-2):73-100, 2004.

Lelandais, F. and Glotin, H. Mallat's matching pursuit of sperm whale clicks in real-time using Daubechies 15 wavelets. In New Trends for Environmental Monitoring Using Passive Systems, 2008, pp. 1-5. IEEE, 2008.

Mallat, S. A Wavelet Tour of Signal Processing. Elsevier, 1999.

Mallat, S. Group invariant scattering. Communications on Pure and Applied Mathematics, 65(10):1331-1398, 2012.

Mallat, S. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.

Mallat, S. and Zhang, Z. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397-3415, 1993.
Mann, S. and Haykin, S. The chirplet transform: Physical considerations. IEEE Transactions on Signal Processing, 43(11):2745-2761, 1995.

Oyallon, E., Zagoruyko, S., Huang, G., Komodakis, N., Lacoste-Julien, S., Blaschko, M. B., and Belilovsky, E. Scattering networks for hybrid representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

Ravanelli, M. and Bengio, Y. Interpretable convolutional filters with SincNet. arXiv preprint arXiv:1811.09725, 2018.

Saritha, C., Sukanya, V., and Murthy, Y. N. ECG signal analysis using wavelet transforms. Bulg. J. Phys., 35(1):68-77, 2008.

Schäfer, P. The BOSS is concerned with time series classification in the presence of noise. Data Mining and Knowledge Discovery, 29(6):1505-1530, 2015.

Seydoux, L., Shapiro, N. M., de Rosny, J., Brenguier, F., and Landès, M. Detecting seismic activity with a covariance matrix analysis of data recorded on seismic arrays. Geophysical Journal International, 204(3):1430-1442, 2016.

Stowell, D. and Plumbley, M. D. Framewise heterodyne chirp analysis of birdsong. In 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO), pp. 2694-2698. IEEE, 2012.

Stowell, D. and Plumbley, M. D. An open dataset for research on audio field recording archives: freefield1010. CoRR, abs/1309.5275, 2013. URL http://arxiv.org/abs/1309.5275.

Torrésani, B. Wavelets associated with representations of the affine Weyl-Heisenberg group. Journal of Mathematical Physics, 32(5):1273-1279, 1991.

Vilenkin, N. Y. Special Functions and the Theory of Group Representations, volume 22. American Mathematical Soc., 1978.

Wang, Y. and Jiang, Y.-C. Modified adaptive chirplet decomposition with application in ISAR imaging of maneuvering targets. EURASIP Journal on Advances in Signal Processing, 2008(1):456598, 2008.

Wang, Z., Yan, W., and Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 1578-1585. IEEE, 2017.

Xu, C., Wang, C., and Gao, J. Instantaneous frequency identification using adaptive linear chirplet transform and matching pursuit. Shock and Vibration, 2016, 2016.

Yarotsky, D. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103-114, 2017.

Zeghidour, N., Usunier, N., Kokkinos, I., Schaiz, T., Synnaeve, G., and Dupoux, E. Learning filterbanks from raw speech for phone recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5509-5513. IEEE, 2018.