Print Slides - IfIS - Technische Universität Braunschweig
6/13/2016

Multimedia Databases
Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

Multimedia Databases – Wolf-Tilo Balke – Institut für Informationssysteme – TU Braunschweig

Previous Lecture
• Audio Retrieval
  – Basics of Audio Data
  – Audio Information in Databases
  – Audio Retrieval

7 Audio Retrieval
  7.1 Low Level Audio Features
  7.2 Difference Limen
  7.3 Pitch recognition

7.1 Low-level Audio Features
• Typical low-level features
  – Mean amplitude (loudness)
  – Frequency distribution, bandwidth
  – Energy distribution (brightness)
  – Harmonics
  – Pitch
• Measured
  – In the time domain: each point in time is assigned an amplitude
  – In the frequency domain: each signal frequency is assigned a strength

7.1 Fourier Analysis
• Fourier analysis:
  – Simple characterization by the Fourier transform
  – The Fourier coefficients serve as a descriptive feature vector
• Issues:
  – The time domain does not show the frequency components of a signal
  – The frequency domain does not show when frequencies occur
• Solution: spectrograms

7.1 Spectrogram
• Spectrograms: combined representation of time and frequency domain
  – Raster image
  – X-axis: time
  – Y-axis: frequency components
  – The gray value of a point is the energy of that frequency at that time
• Allows, for example, analysis of how regularly frequencies occur

7.1 Spectrogram
• Spectrogram of the spoken word “durst”

7.1 Classification
• Use of different low-level features for automatic classification of audio files
  – Different audio classes have typical values for various properties
  – Thus, various
typical feature vectors

7.1 Example
• Distinguish speech and music
• The characteristics are hard to predict in any individual case, but there are general trends
• Do not rely on a single feature; evaluate the combination of all features
• Dependent and independent features

7.1 Example
• Bandwidth
  – For speech rather low: 100–7000 Hz
  – For music it tends to be high: 16–20000 Hz
• Brightness (central point of the bandwidth):
  – For speech it is low (mainly due to the low bandwidth)
  – For music it is high

7.1 Example
• Proportion of silence
  – Frequent pauses in speech (between words and sentences)
  – Low percentage of silence for music (except for solo instruments)
• Variance of the zero crossing rate (over time)
  – Speech has a characteristic syllable structure of short and long vowels, hence a fluctuating zero crossing rate
  – Music often has a consistent rhythm, hence a rather uniform zero crossing rate

7.1 Example
• Simple classification algorithm (Lu and Hankinson, 1998)
[Figure: decision tree rooted at “Audio” with tests on Brightness, Portion of silence, and Variance of zero crossing rate (branches high / low) and leaves Music, Solo music, and Speech]

7.1 Example
• Quantitative high / low estimates are highly dependent on the collection

7.1 Classification
• Speech and music
  – Determine a reference vector for each class from a set of training examples
• Assignment of new audio files to classes is based on the minimum distance of the file's feature vector to one of the
reference vectors of the respective class

7.1 Static Coefficients
• Low-level features for audio retrieval
  – For good feature vectors, the audio signal must be divided into time windows
  – Compute a vector for each window
    • Calculate the low-level features in the time window
    • Build statistical characteristics of the low-level features
  – Perceptual comparison of audio files

7.1 Static Coefficients
• Four statistical characteristics (Wold and others, 1996)
  – Loudness (perceived volume)
    • Measured as the root mean square (RMS) of the amplitude values (in dB)
    • More sophisticated methods take into account that parts of the frequency spectrum are perceived differently (<50 Hz, >20 kHz)

7.1 Static Coefficients
  – Brightness (perceived brightness)
    • Defined as the center of gravity of the Fourier spectrum
    • Logarithmic scale
    • Describes the amount of high frequencies in the signal
  – Bandwidth (frequency bandwidth)
    • Defined as a weighted average of the differences of the Fourier coefficients to the center of gravity of the spectrum
    • Amplitudes are used as weights

7.1 Static Coefficients
  – Pitch (perceived pitch)
    • Calculated from the frequencies and amplitudes of the peaks within each interval (pitch tracking)
    • Pitch tracking is a difficult problem and is therefore often approximated by the fundamental frequency in simpler systems

7.1 Static Coefficients
• Time-dependent function
for each quantity in each time window
  – Loudness
  – Brightness
  – Bandwidth
  – Pitch
• E.g., laughter

7.1 Static Coefficients
• Aggregate statistical description of the four functions through:
  – Expected value (average value)
  – Variance (mean square deviation)
  – Autocorrelation (self-similarity of the signal)
• Either for each window or for the whole signal (results in a 12-dimensional feature vector, i.e., three statistics for each of the four functions, such as in IBM's QBIC)

7.1 Static Coefficients
• Example: Laughter
• Each sound has typical values
• Thus we can classify audio files

7.1 Static Classification
• A training set of sounds of a class leads to a perceptual model for each class
  – Compute the mean vector
  – Calculate the covariance matrix

7.1 Static Classification
• For every new audio file, compute the Mahalanobis distance to each class: [equation]
• Order the data of a class (either by threshold or by minimum distance)
• Determine the probability of correct classification as: [equation]

7.1 Application in Classification
• Classification for laughter (Wold and others, 1996)

7.1 Evaluation
• Statistical properties for retrieval and classification work well with short audio data
  – The parameters statistically represent human perception
  – Easy to use, easy to index, query by example
  – The only extension available in commercial databases

7.1 Evaluation
• OK for differentiating between speech and music or laughter and music
  – But purely statistical values are rather unsuitable in order to
classify and differentiate between musical pieces
  – Detecting notes from the audio signal (pitch determination) does not work very well
  – How does one define the term “melody”? (especially for queries, e.g., query by humming)
• Commercial database extensions:
  – DB2 AIV Extenders (development discontinued)
  – Oracle Multimedia

7.1 Problem
• Recognition of notes from the signal
  – Variety of instruments
  – Overlap of different instruments (and possibly voice)
• Simple if the data is available in MIDI format and the audio signal was synthesized from it

7.1 Problem
• Definition of “melody”
  – Melody = sequence of musical notes
  – But querying for a melody has to be:
    • Invariant under pitch shift (soprano and bass)
    • Invariant under time shift
    • Invariant under slight variations
• Often it is not the sequence of notes itself that is used, but the sequence of their differences

7.2 Frequencies and Pitch
• Pitch has something to do with the frequency
  – Only meaningful for periodic signals, not for noise

7.2 Frequencies and Pitch
• Harmonics
  – Harmonic tones have one main oscillation and several harmonics

7.2 Problem
• Interference makes the automatic detection of the dominant pitch difficult
• However, we need the pitch to extract the melody line

7.2 Difference Limen
• Human perception often differs from physical measurements
  – E.g., fundamental frequency ≠ pitch

7.2 Difference Limen
• Exactly how do people perceive frequency differences?
  – The difference limen is the smallest change that is reliably perceived (“just noticeable difference”)
  – Accuracy varies with pitch, duration and volume
  – Experimental determination of average values for sine waves (Jesteadt and others, 1977)

7.2 Difference Limen
• Determined through psychological testing
  – Two tones of 500 ms duration with a small frequency difference are played one after the other
  – Subjects decide whether the second tone was higher or lower
  – This results in a psychometric function relating the frequency difference to the accuracy of the classification (50%–100%)
  – Above 75%, perception is considered reliable

7.2 Difference Limen
• (Jesteadt and others, 1977): difference limen of about 0.2%

7.2 Difference Limen
• A difference limen of 0.2% means that most people can reliably distinguish a 1000 Hz tone from a 1002 Hz tone
• Quality of separation is not uniform across the frequency band
  – Worse at high and low frequencies
• Tone duration is important
  – Improves for tones between 0 and 100 ms, constant for longer tones
• Volume is important
  – Improves for tones between 5 and 40 dB, then constant

7.3 Pitch Definition
• ANSI standard (1994)
  – “Pitch is that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high. Pitch depends mainly on the frequency content of the sound stimulus, but it also depends on the sound pressure and the waveform of the stimulus”
• Typically restricted to the melody line, to distinguish pitch from timbre
  – E.g., “s” and “sh” sounds differ in timbre rather than in pitch

7.3 Pitch Determination
• Experiments on the frequency scale
  – Adjustment of a generated sine wave (at 40 dB) to the perceived pitch by test subjects (Fletcher, 1934)
  – Histograms show the pitch (x Hz) and the agreement among all test subjects (x ± y Hz)
  – Multimodal distributions indicate several pitches
    • E.g., polyphonic music: some listeners concentrate on one instrument, others on another

7.3 Pitch Determination
• Experiments: pitch

7.3 Theoretical Models
• Location-dependent pitch detection
  – The cochlea perceives different frequencies at different places
  – High frequencies at the entrance of the cochlea
  – Low frequencies at the end of the cochlea
  – The brain recognizes which neurons were excited

7.3 Location-dependent Models
• The stimulation of different neurons along the approximately 35 mm long basilar membrane in the cochlea forms a typical pattern for audio coding
• The pitch can be detected from this pattern

7.3 Location-dependent Models
• Formula for the position z (in millimeters) of the excitation (Greenwood, 1990)

7.3 Time-dependent Models
• The sound is coded not purely by location, but rather by temporal synchronization of the neurons
  – All neurons fire spontaneously in random sequence, depending on their refractory characteristics
  – When a sound with some frequency starts, it causes more neurons to fire synchronously
  – The brain determines the pitch based on an “autocorrelation function” of the pattern

7.3 Theoretical Models
• The two models address recognizing the pitches of individual sounds
• Pitch detection is more difficult in the case of complex tones
  – Groups of neurons are excited in several locations or with interfering synchronization patterns
• Which neuron excitement is the pitch?

7.3 Fundamental Frequency
• The lowest frequency generates harmonics … thus pitch as the fundamental frequency?
  – Psychological experiments with and without the fundamental frequency in the same note show that the pitch of the note is still rated the same
  – Since synchrony remains the same with or without the fundamental frequency, we should consider the time-dependent model
• But how do we evaluate the synchrony?

7.3 Auditory Organization
• Hearing analyzes complex sounds in different frequency bands
• Hearing organizes and integrates the different impressions
• The pitch is decided by matching against harmonic templates (Goldstein, 1973)
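The missing-fundamental observation from the Fundamental Frequency slide can be illustrated with a small time-domain experiment. The sketch below is only an illustration under stated assumptions, not the auditory model itself: it uses a plain (unnormalized) autocorrelation over a synthetic waveform, and the helper names `harmonic_tone` and `autocorrelation_pitch`, the 8000 Hz sample rate, the 100 ms window, and the 80–500 Hz search range are all hypothetical choices. It compares a tone with partials at 200, 400, 600 and 800 Hz to the same tone without the 200 Hz partial.

```python
import math


def harmonic_tone(partials_hz, sample_rate, n_samples):
    """Synthesize a sum of equal-amplitude sine partials (illustrative test signal)."""
    return [
        sum(math.sin(2 * math.pi * f * n / sample_rate) for f in partials_hz)
        for n in range(n_samples)
    ]


def autocorrelation_pitch(signal, sample_rate, f_min=80.0, f_max=500.0):
    """Return the pitch (Hz) at the strongest autocorrelation peak in [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)  # shortest period considered, in samples
    max_lag = int(sample_rate / f_min)  # longest period considered, in samples
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a copy of itself shifted by `lag` samples.
        value = sum(signal[n] * signal[n + lag] for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


SAMPLE_RATE = 8000   # assumed sample rate in Hz
N_SAMPLES = 800      # 100 ms analysis window

with_fundamental = harmonic_tone([200, 400, 600, 800], SAMPLE_RATE, N_SAMPLES)
without_fundamental = harmonic_tone([400, 600, 800], SAMPLE_RATE, N_SAMPLES)

pitch_with = autocorrelation_pitch(with_fundamental, SAMPLE_RATE)
pitch_without = autocorrelation_pitch(without_fundamental, SAMPLE_RATE)
print(pitch_with, pitch_without)  # both estimates are 200.0 Hz
```

Both signals repeat every 5 ms, so the strongest autocorrelation peak sits at the same shift whether or not the 200 Hz partial is present. That is the sense in which the periodicity ("synchrony") survives the removal of the fundamental.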
7.3 Auditory Organization
• Experiments favor centralized template matching
  – We can perceive pitches even if the signal is split into disjoint parts that are then heard with both ears
  – The pitch is synthesized (but this does not work for all partitions; they are usually heard as several pitches)
  – Listeners can be misled by ambient noise into perceiving a false template

7.3 Auditory Organization
• The pitch can also be synthesized as a frequency that does not occur in the signal, e.g., 296 Hz for the simultaneous play of the following non-harmonic tones:
[Figure: the non-harmonic tones; label “Apparent 2/3 harmonic”]

7.3 Pitch Tracking Algorithms
• Pitch is a feature of the frequency at a particular time
  – Pitch tracking in the frequency domain
  – Pitch tracking in the time domain
• Length of the time window for the frequency spectrum
  – At least twice the length of the estimated period

7.3 Pitch Tracking Algorithms
• Requirements
  – Frequency resolution in the range of a semitone, with the correct octave
  – Detection of different instruments with well-defined harmonics (e.g., cello, flute)
  – (Recognition of pitches for conversion into symbolic notation in real time for interactive systems)

7.3 Harmonic Product Spectrum
• HPS pitch detection is one of the simplest and most robust methods in the frequency domain (Schroeder, 1968), (Noll, 1969)
  – A certain frequency range is analyzed for the fundamental frequency
    • E.g., 50–250 Hz for male voices
  – All frequencies in this range are analyzed for harmonic overtones

7.3 Harmonic Product Spectrum
• Product spectrum: Y(ω) = |X(ω)| · |X(2ω)| · … · |X(Rω)|
• X(ω): strength of the frequency ω in the current time window
• R: number of harmonics to be checked
  – E.g., R = 5
• The pitch is then the maximum of the product spectrum Y over all frequencies ω1, ω2, ... in the frequency range to be investigated

7.3 Harmonic Product Spectrum
• Problems occur due to noise at frequencies below 50 Hz
• Other problems occur due to the frequent octave errors
  – The pitch is often recognized an octave too high
• Rule-based solution:
  – If the next-largest peak below the pitch candidate lies at approx. half the candidate's frequency and its amplitude is above a threshold, then select the pitch one octave below the candidate
  – In practice, usually sufficient

7.3 Harmonic Product Spectrum
• HPS, example

7.3 Maximum Likelihood Estimator
• The ML algorithm (Noll, 1969) compares possible “ideal” spectra with the observed spectrum and selects the most similar one based on shape
• Ideal spectra are chains of pulses shaped by the damping function of the signal window
• For the damping function
  – The signal section represented by each signal window (e.g., of length 40 ms) is damped (mainly at the edges) by a special function to remove artifacts (mostly false high frequencies)

7.3 Maximum Likelihood Estimator
• Generation of an ideal spectrum

7.3 Maximum Likelihood Estimator
• The error is now determined between the spectrum being studied and the ideal spectrum Ỹω
(with pitch ω): ||Y − Ỹω||² = ||Y||² + ||Ỹω||² − 2·Y·Ỹω
• The terms ||Y||² and ||Ỹω||² remain relatively constant, so the error is determined by the product Y·Ỹω
[Figure, generation of an ideal spectrum: labels “Convolution” and “Dampening function”]

7.3 Maximum Likelihood Estimator
• The pitch is the frequency ω with minimal error, i.e., with the largest product Y·Ỹω
• In essence, this method is a “multiplication” of the input spectrum vector with a matrix of ideal spectra

7.3 Maximum Likelihood Estimator
• Efficiency and error probability vary with the degree of resolution (number of ideal spectra)
• Good results when the pitch of the analyzed signal is close to the pitch of one of the ideal spectra used

7.3 Auto Correlation Functions
• The most popular time-domain procedure is the peak search in the autocorrelation function (ACF)
  – It is given by the equation [equation]
• ACFs measure the correlation of a signal with a shifted version of the same signal, where τ represents the shift

7.3 Auto Correlation Functions
• Since harmonic signals are strongly correlated with themselves when shifted by the pitch period, there is a peak in the ACF for good pitch candidates
• Because of the multiplications, the procedure is rather expensive

7.3 Auto Correlation Functions
• Average Magnitude Difference Function (AMDF)
  – Its minima lie where the ACF has peaks
  – It is more efficient to compute
  – Therefore differences are often used rather than products

7.3 Auto Correlation Functions
• AMDF

7.3 Auto Correlation Functions
• ACF and AMDF are independent and can therefore be combined into a robust estimator (Kobayashi and Shimamura, 2000) Ψ(τ)
• Significantly better fault tolerance against noise (the best results for k=1)

7.3 Melody Recognition
• HPS pitch detection for a cello

7.3 Melody Recognition
• The pitch can be identified relatively robustly for each time slot
• But normally the sequence of pitches alone is not enough for melody detection
  – Windowing introduces errors in the individual recognition (continuous pitch changes)
  – Attack frequency vs.
melody on sustain level

7.3 Melody Recognition
• Filtering of transient faults and spontaneous octave jumps
• Monitor the size of the peaks for pitch allocation across multiple windows
  – Better resolution for spectra in the case of uncertain assignments
• Interference with the original signal

7.3 Melody Recognition
• Post-processed pitch detection for a flute

7.3 Melody Recognition
• Problems of melody detection
  – Strong polyphony
  – Tuning of the instruments changes with temperature, humidity, or time (can be handled in ML by adjusting the ideal spectra)
  – Even minor changes (just for a second) can lead to alternating detection of two notes where only one is played (hysteresis)

This Lecture
• Audio Retrieval
  – Low Level Audio Features
  – Difference Limen
  – Pitch: tracking algorithms

Next lecture
• Query by Humming
• Melody Representation and Matching
  – Parsons-Codes
  – Dynamic Time Warping
• Hidden Markov Models
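As a recap of the post-processing step from the Melody Recognition slides ("filtering of transient faults and spontaneous octave jumps"), the following sketch shows one possible filter. It is an assumption for illustration, not the method used in the lecture: a median filter of width 3 over per-window pitch estimates that are first mapped to semitone numbers (MIDI convention, A4 = 440 Hz = 69). The function names and the example pitch track are hypothetical.

```python
import math
from statistics import median_low


def hz_to_semitone(freq_hz):
    """Map a frequency to the nearest semitone number (MIDI convention, A4 = 69)."""
    return round(69 + 12 * math.log2(freq_hz / 440.0))


def smooth_pitch_track(notes, width=3):
    """Median-filter a per-window note sequence to suppress isolated outliers."""
    half = width // 2
    smoothed = []
    for i in range(len(notes)):
        # Neighbourhood around window i, truncated at the ends of the track.
        window = notes[max(0, i - half): i + half + 1]
        smoothed.append(median_low(window))
    return smoothed


# Hypothetical per-window pitch estimates in Hz with two isolated octave errors
# (440 Hz instead of 220 Hz, and 123.5 Hz instead of 247 Hz).
raw_track_hz = [220.0, 220.0, 440.0, 220.0, 220.0, 247.0, 247.0, 123.5, 247.0, 247.0]

raw_notes = [hz_to_semitone(f) for f in raw_track_hz]
clean_notes = smooth_pitch_track(raw_notes)

print(raw_notes)    # [57, 57, 69, 57, 57, 59, 59, 47, 59, 59]
print(clean_notes)  # [57, 57, 57, 57, 57, 59, 59, 59, 59, 59]
```

A width of 3 removes only single-window outliers. Longer runs of wrong estimates, or the alternating detection of two neighbouring notes mentioned under the problems of melody detection, would need a wider window or an explicit hysteresis rule.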