David Barber and Taylan Cemgil
Michael Lewicki (invited talk)
Richard Turner and Maneesh Sahani
The computational principles underpinning auditory processing are not well understood. This fact stands in stark contrast to early visual processing, for which computational theories, especially those built on statistical models, have recently enjoyed great success. We believe one of the reasons for this disparity is the paucity of rich, learnable generative models for natural scenes with an explicit temporal dimension. To that end we introduce a new generative model for the dynamic Fourier components of sounds. This comprises a cascade of modulatory processes which evolve over a wide range of time-scales. We show the model is capable of capturing both the sparse marginal distribution and the prevalence of amplitude modulation in natural sounds, to which the auditory system appears to listen so attentively. Moreover, we demonstrate that it is relatively easy to learn and to do inference in the Gaussian Modulation Cascade Process, due to the structure of its non-linearity. We hope that this provides a first step toward furthering our understanding of auditory computations.
09:05 Poster Session
Score-Guided Music Audio Source Separation
Abstract : Audio source separation seeks to decompose an audio recording into several different recordings corresponding to independent sources, such as speakers, foreground and background, or, in our case, musical parts. Source separation is a formidable task; while the problem has received considerable attention in recent years, it remains completely open. The majority of approaches we know of are deemed blind source separation, meaning that the audio is decomposed without explicit knowledge of its contents. In particular, much recent work has focused on Independent Component Analysis (ICA) as the methodological backbone of various approaches. Work on blind separation also contains work specifically devoted to music audio. While blind separation is, no doubt, broadly useful and deeply interesting, many of the techniques rely on restrictive assumptions about the recording process or audio, often not satisfied in practice. Moreover, blind approaches seem simply wrong-headed for our purposes, since they fail to capitalize on our explicit and detailed knowledge of the audio. The focus of our effort here is on fully incorporating this knowledge in a principled approach to musical source separation.
Analysis of polyphonic audio using source-filter model and NMF poster
T. Virtanen and A. Klapuri
Abstract: This paper proposes a method for analysing polyphonic audio signals based on a signal model where an input spectrogram is modelled as a linear sum of basis functions with time-varying gains. Each basis is further represented as a product of a source spectrum and the magnitude response of a filter. This formulation reduces the number of free parameters needed to represent realistic audio signals and leads to more reliable parameter estimation. Two novel estimation algorithms are proposed, one extended from non-negative matrix factorization and the other from non-negative matrix deconvolution. In preliminary experiments with singing signals, both algorithms have been found to converge towards meaningful analysis results.
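The paper's source-filter parameterization and deconvolution variant are not reproduced here, but the underlying spectrogram factorization can be sketched as follows. This is a minimal illustration using the standard Lee-Seung multiplicative updates for the Euclidean cost; all variable names and the toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def nmf(V, r, n_iter=200, eps=1e-9):
    # Factor a nonnegative spectrogram V (freq x time) as V ~= W @ H,
    # W holding r spectral basis vectors and H their time-varying gains.
    rng = np.random.default_rng(0)
    F, T = V.shape
    W = rng.random((F, r)) + eps
    H = rng.random((r, T)) + eps
    for _ in range(n_iter):
        # Lee-Seung multiplicative updates: nonnegativity is preserved
        # because every factor in the update is nonnegative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy spectrogram mixed from two known spectral bases with random gains.
rng = np.random.default_rng(1)
W_true = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
H_true = rng.random((2, 40))
V = W_true @ H_true
W, H = nmf(V, r=2)
residual = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

On this exactly rank-2 toy input the relative residual drops close to zero; the source-filter model in the paper constrains each column of W further, as a product of a source spectrum and a filter response.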
Multivariate Analysis and Kernel Methods for Music Data Analysis poster
J. A. Garcia, A. Meng, K. Petersen, L. Hansen
Abstract : There is an increasing interest in customizable methods for organizing music collections. Relevant music characterization can be obtained from short-time features, but it is not obvious how to combine them to get useful information. First, the relevant information might not be evident at the short-time level, and these features have to be combined at a larger temporal scale. Second, we need to learn a model for the new features that generalizes well to new data. In this contribution, we will study how multivariate analysis (MVA) and kernel methods can be of great help in this task. More precisely, we will present two modified versions of a MVA method known as Orthonormalized Partial Least Squares (OPLS), one of them being a kernel extension, that are well-suited for discovering relevant dynamics in large music collections. The performance of both schemes will be illustrated in a music genre classification task.
Building a Binaural Source Separator poster
M. I. Mandel, Daniel P. W. Ellis and Tony Jebara
Abstract: We propose a number of cues and a strategy for combining them that could be used by a binaural machine to perform source separation. Our previous work has used the single cue of interaural phase difference (IPD) to segment the time-frequency plane using an EM algorithm. We see this as a first step towards a larger and more complete system that takes advantage of more of the cues available to a listener from the stereo mixture, such as interaural level difference (ILD), monaural cues, and reliability cues. Additionally, these cues could be integrated with one another by extending the existing probabilistic framework.
Modelling Sound Dynamics Using Deformable Spectrograms
Manuel Reyes-Gomez, Nebojsa Jojic and Dan Ellis
Abstract: Speech and other natural sounds show high temporal correlation and smooth spectral evolution punctuated by a few irregular and abrupt changes. We model successive spectra as transformations of their immediate predecessors, capturing the evolution of the signal energy through time. The speech production model is used to decompose the log-spectrogram into two additive layers, which separately explain and model the evolution of the harmonic excitation and the formant filtering of speech and similar sounds. We present results on a speech recognition task suggesting that the model discovers a global structure in the dynamics of the signal's energy that helps to alleviate the problems caused by noise interference. The model is also used to segment mixtures of speech into dominant-speaker regions in an unsupervised source separation task.
9:50 Signal separation by efficient combinatorial optimization
Manuel Reyes-Gomez and Nebojsa Jojic
Abstract : We present a formulation of the source separation problem as the minimization of a symmetric function defined on fragments of the observed signal. We prove the function to be posimodular and propose the use of tractable combinatorial optimization techniques, in particular Queyranne's algorithm, suited to optimization of symmetric submodular and posimodular functions. While these ideas can be applied to any signal segmentation problem (e.g., image or video segmentation), we focus here on unsupervised separation of sources in mixed speech signals recorded by a single microphone. The optimization criterion is the likelihood under a generative model which assumes that each time-frequency bin is assigned to one of the two speakers, and that each speaker's utterance has been generated from the same generic speech model. This assumption is made given that the time-frequency representation of speech signals is very sparse. The optimization can then be performed over all possible assignments of the time-frequency bins to the two speakers. Even though the algorithm requires polynomial time, it is still too slow for large signals. Therefore, we first oversegment the spectrogram into a large number of segments which do not violate the deformable spectrogram model. Queyranne's algorithm is then constrained to search only over unions of these segments, rather than all possible signal fragments. We show that this technique leads to blind separation of mixed signals where both speakers are of the same gender and have very similar spectral characteristics.
Yuanqing Lin and Daniel D. Lee
Abstract : Estimating acoustic room impulse responses is central to many acoustic signal processing tasks such as time delay estimation, echo cancellation, beamforming, blind deconvolution, and so on. We propose to exploit nonnegativity and sparsity priors for estimating FIR filters that model acoustic room impulse responses. The filter estimation problems, with or without knowledge of the source (or input), can then be formulated as nonnegative least-mean-square (LMS) problems penalized by a weighted L1-norm of the filter coefficients. We show how the optimal L1-norm weighting (or regularization parameters) can be inferred in a Bayesian framework and the optimally sparse solution derived in a maximum-likelihood (ML) sense. The resulting algorithm is named the Bayesian regularization and nonnegative deconvolution (BRAND) algorithm. Simulation results demonstrate that the BRAND algorithm is able to accurately resolve nonnegative sparse filters from noisy convolutive microphone signals, and results on real recordings experimentally validate the effectiveness of the nonnegativity assumption on the filters.
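BRAND itself infers the L1 regularization weights in a Bayesian framework; as a rough sketch of the underlying optimization only, the L1-penalized nonnegative least-squares filter estimate can be computed by projected gradient descent. All names, the fixed penalty weight, and the toy data below are illustrative assumptions, not the authors' algorithm.

```python
import numpy as np

def nn_sparse_filter(x, y, L, lam=0.1, n_iter=500):
    # Estimate a nonnegative, sparse length-L FIR filter h from input x
    # and noisy output y, minimizing 0.5*||y - X h||^2 + lam*sum(h)
    # subject to h >= 0 (for h >= 0 the L1 norm is just the sum).
    N = len(y)
    X = np.zeros((N, L))
    for k in range(L):
        X[k:, k] = x[: N - k]          # convolution matrix: y ~= X @ h
    h = np.zeros(L)
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # safe step from spectral norm
    for _ in range(n_iter):
        grad = X.T @ (X @ h - y) + lam
        h = np.maximum(h - step * grad, 0.0)  # project onto h >= 0
    return h

rng = np.random.default_rng(0)
x = rng.standard_normal(400)                     # known source signal
h_true = np.zeros(32)
h_true[[0, 5, 19]] = [1.0, 0.6, 0.3]             # sparse "room response"
y = np.convolve(x, h_true)[:400] + 0.01 * rng.standard_normal(400)
h_hat = nn_sparse_filter(x, y, L=32)
```

With a white input the problem is well conditioned and the three active taps are recovered accurately; the Bayesian step in BRAND replaces the hand-picked `lam` with per-coefficient weights inferred from the data.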
Dan Ellis (invited talk)
Abstract : When extracting information from simultaneous sound sources, listeners successfully exploit many different factors spanning spatial location and source characteristics. I will argue that detailed constraints on the form of particular source signals are being employed, and that this therefore is an important direction for research into automatic sound organization systems, in applications ranging from speech separation to environmental sound classification to music understanding.
Paris Smaragdis, Bhiksha Raj and Madhusudana Shashanka
Abstract : We describe a model developed for the analysis of acoustic spectra. Unlike basis decomposition techniques, which can produce results that are difficult to interpret, this model explicitly treats spectra as distributions and extracts sets of additive and semantically useful components that facilitate a variety of applications, ranging from source separation and denoising to music transcription and sound recognition. The model is probabilistic in nature and is easily extended to produce sparse codes and to discover transform-invariant components that can be optimized for particular applications.
17:05 Poster Session
Cognitive Components of Speech at Different Time Scales
Ling Feng and Lars Kai Hansen
Abstract : We discuss the cognitive components of speech at different time scales. We investigate cognitive features of speech including phoneme, gender, height, and speaker identity, integrating them by stacking short-time MFCC features. Our hypothesis is basically ecological: we assume that features that are essentially independent in a reasonable ensemble can be efficiently coded using a sparse independent component representation. This means that supervised and unsupervised learning should result in similar representations. We do indeed find that supervised and unsupervised learning of models based on identical representations have closely corresponding abilities as classifiers.
Spectrogram Factorization Using Phase Information
Mitchell Parry and Irfan Essa
Abstract: Spectrogram factorization methods have been proposed for single channel source separation including independent subspace analysis and non-negative matrix factorization. These methods assume that the mixture spectrogram is a linear combination of the source spectrograms. However, this is an incorrect assumption because the mixture spectrogram additionally depends on the (unknown) phase of the sources. This paper investigates the role of phase in estimating the source spectrograms from the mixture spectrogram and incorporates a probabilistic representation of phase to improve separation results.
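The failure of the additivity assumption the paper targets is easy to verify numerically: magnitudes of complex STFT bins only add when the sources happen to be in phase. A quick synthetic illustration (random bins, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
# Complex spectrogram bins of two sources with random magnitude and phase.
a = rng.random(1000) * np.exp(1j * rng.uniform(0, 2 * np.pi, 1000))
b = rng.random(1000) * np.exp(1j * rng.uniform(0, 2 * np.pi, 1000))
mix = np.abs(a + b)             # true mixture magnitude (phase-dependent)
linear = np.abs(a) + np.abs(b)  # what additive spectrogram models assume
err = np.mean(np.abs(mix - linear) / linear)
print(err)  # mean relative gap between the two; well above zero here
```

By the triangle inequality the true mixture magnitude never exceeds the linear sum, and with random phases the average gap is substantial, which is the discrepancy a probabilistic phase model can account for.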
Modeling the Semantics of Sound
D. Turnbull, L. Barrington, D. Torres, G. Lanckriet
Abstract : While semantic image/video annotation has received considerable attention in recent years, relatively little work has been done on semantic audio annotation and retrieval. This paper is the first attempt to formulate a rigorous machine learning framework to model the semantics of sound. We combine a supervised multiclass naive Bayes model, which has shown good performance on image annotation, with advanced audio feature extraction techniques. The parameters for this model can be estimated efficiently using the mixture hierarchies algorithm. We consider two heterogeneous audio and text data sets; sound effects with captions, and music with associated reviews. We quantitatively show that this first proposed framework can both annotate a novel audio track with semantically meaningful words and retrieve relevant audio tracks given a text-based query.
Temporal Constraints for Sound Source Formation using the Normalized Cut poster
M. Lagrange, J. Murdoch, G. Tzanetakis
Abstract : In this paper, we explore the use of a graph algorithm called the normalized cut in order to organize prominent components of the auditory scene. We focus specifically on defining a time-constrained similarity metric. We show that such a metric can be successfully expressed in terms of the time and frequency masking phenomena and can be used to solve common problems in auditory scene analysis.
Nonnegative CCA for audiovisual source separation
Christian Sigg, Bernd Fischer and Volker Roth
Abstract : We present a method for finding correlated components in audio and video signals. The concept of canonical correlation analysis is reformulated such that it allows us to incorporate non-negativity constraints on the coefficients. This additional requirement ensures that projection directions obey the non-negativity requirements of energy signals. By finding multiple orthogonal directions we finally obtain a component-based decomposition of both data modalities. Experiments for simultaneous source separation in both video and audio streams effectively demonstrate the benefits of this approach.
Recurrent Timing Neural Networks for Joint F0-Localisation Estimation poster
Stuart N. Wrigley and Guy J. Brown
Abstract : A novel extension to recurrent timing neural networks (RTNNs) is proposed which allows such networks to exploit a joint interaural time difference-fundamental frequency (ITD-F0) auditory cue as opposed to F0 only. This extension involves coupling a second layer of coincidence detectors to a two-dimensional RTNN. The coincidence detectors are tuned to particular ITDs and each feeds excitation to a column in the RTNN. Thus, one axis of the RTNN represents F0 and the other ITD. The resulting behaviour allows sources to be segregated on the basis of their separation in ITD-F0 space. Furthermore, all grouping and segregation activity proceeds within individual frequency channels without recourse to across channel estimates of F0 or ITD that are commonly used in auditory scene analysis approaches. The system has been evaluated using a source separation task operating on spatialised speech signals.
Acoustic Representation and Processing: It is time!
Jean Rouat, Stephane Loiselle, Ramin Pichevar
Abstract : From physiology we learn that the auditory system extracts simultaneous features from the underlying signal, giving rise to simultaneous representations of audible signals. We also learn that pattern analysis and recognition are not separate processes (in contrast to the engineering approach to pattern recognition, where analysis and recognition are usually separate). Furthermore, in the visual system it has been observed that the sequence order of firing is crucial to performing fast visual recognition tasks (Rank Order Coding). The use of Rank Order Coding has also recently been hypothesized in the mammalian auditory system. In a first application we compare a very simplistic speech recognition prototype that uses Rank Order Coding with a conventional Hidden Markov Model speech recognizer. It is also shown that the type of neurons being used should be adapted to the type of phonemes (consonants/transients or vowels/stable) to be recognized. In a second application, we combine a simultaneous auditory image representation with a network of oscillatory spiking neurons to segregate and bind auditory objects for acoustic source separation. We also discuss the importance of time in acoustic processing.
Francis R. Bach and Michael I. Jordan
Abstract : Spectral clustering refers to a class of techniques which rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. In this paper, we derive new cost functions for spectral clustering based on measures of error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem. Minimizing these cost functions with respect to the partition leads to new spectral clustering algorithms. Minimizing with respect to the similarity matrix leads to algorithms for learning the similarity matrix from fully labelled data sets. We apply our learning algorithm to the blind one-microphone speech separation problem, casting the problem as one of segmentation of the spectrogram.
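As background to the paper's learned-similarity approach, the basic spectral relaxation of a two-way normalized cut can be sketched on toy 2-D points. This generic bipartition (Gaussian similarity, normalized Laplacian, sign of the second eigenvector) is an assumption-laden simplification; the paper's cost functions, similarity learning, and spectrogram segmentation are not shown.

```python
import numpy as np

def spectral_bipartition(points, sigma=1.0):
    # Two-way spectral clustering: build a Gaussian similarity matrix,
    # form the symmetric normalized Laplacian, and split by the sign
    # of the eigenvector with the second-smallest eigenvalue.
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    d = W.sum(1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L_sym = np.eye(len(points)) - D_inv_sqrt @ W @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)   # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    return (fiedler > 0).astype(int)

# Two well-separated point clouds standing in for two "sources".
rng = np.random.default_rng(0)
a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
b = rng.normal(loc=[4.0, 4.0], scale=0.3, size=(20, 2))
labels = spectral_bipartition(np.vstack([a, b]), sigma=1.0)
```

For the speech separation task in the paper, the "points" are time-frequency bins of the spectrogram and the similarity matrix itself is learned from labelled data rather than fixed to a Gaussian kernel.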
Lucas C. Parra and Barak Pearlmutter
Abstract : Phenomena resembling tinnitus and Zwicker phantom tone are seen to result from an auditory gain adaptation mechanism that attempts to make full use of a fixed-capacity channel. In the case of tinnitus, the gain adaptation enhances internal noise of a frequency band otherwise silent due to damage. This generates a percept of a phantom sound as a consequence of hearing loss. In the case of Zwicker tone, a frequency band is temporarily silent during the presentation of a notched broadband sound, resulting in a percept of a tone at the notched frequency. The model suggests a link between tinnitus and the Zwicker tone percept, in that it predicts different results for normal and tinnitus subjects due to a loss of instantaneous nonlinear compression. Listening experiments on 44 subjects show that tinnitus subjects (11 of 44) are significantly more likely to hear the Zwicker tone. This psychoacoustic experiment establishes the first empirical link between the Zwicker tone percept and tinnitus. Together with the modeling results, this supports the hypothesis that the phantom percept is a consequence of a central adaptation mechanism confronted with a degraded sensory apparatus.