Print Slides - IfIS - Technische Universität Braunschweig

Multimedia Databases
Wolf-Tilo Balke
Institut für Informationssysteme
Technische Universität Braunschweig
http://www.ifis.cs.tu-bs.de

Previous Lecture
• Audio Retrieval
  – Basics of Audio Data
  – Audio Information in Databases
  – Audio Retrieval
7 Audio Retrieval
7.1 Low-level Audio Features
7.2 Difference Limen
7.3 Pitch Recognition

7.1 Low-level Audio Features
• Typical low-level features
  – Mean amplitude (loudness)
  – Frequency distribution, bandwidth
  – Energy distribution (brightness)
  – Harmonics
  – Pitch
• Measured
  – In the time domain: each point in time is assigned an amplitude
  – In the frequency domain: each signal frequency is assigned a strength
7.1 Fourier Analysis
• Fourier analysis:
  – Simple characterization by the Fourier transform
  – The Fourier coefficients form a descriptive feature vector
• Issues:
  – The time domain does not show the frequency components of a signal
  – The frequency domain does not show when frequencies occur
• Solution: spectrograms

7.1 Spectrogram
• Spectrograms: combined representation of the time and frequency domains
  – Raster image
  – X-axis: time
  – Y-axis: frequency components
  – The gray value of a point is the energy of that frequency at that time
• Allows, for example, analysis of the regularity of occurring frequencies
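As an illustration of how such a spectrogram can be computed, the following is a minimal sketch of a short-time Fourier transform with NumPy; the window length, hop size and Hann window are assumptions, not values prescribed by the slides.

```python
import numpy as np

def spectrogram(signal, window_size=1024, hop=512):
    """Sketch of a magnitude spectrogram via the short-time Fourier transform.

    Rows correspond to frequency bins (y-axis), columns to time windows (x-axis);
    the returned values play the role of the gray values in the raster image.
    """
    window = np.hanning(window_size)          # dampen the edges of each time slot
    frames = []
    for start in range(0, len(signal) - window_size + 1, hop):
        frame = signal[start:start + window_size] * window
        frames.append(np.abs(np.fft.rfft(frame)))   # strength per frequency component
    return np.array(frames).T                  # shape: (frequency bins, time windows)
```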
7.1 Spectrogram
• Spectrogram of the spoken word “Durst”

7.1 Classification
• Use of different low-level features for automatic classification of audio files
  – Different audio classes have typical values for various properties
  – Thus, various typical feature vectors
7.1 Example
• Distinguish speech and music
• The characteristics are in each case difficult to predict, but there are general trends
• Don’t just use a single feature, but evaluate a combination of all features
• Dependent and independent features

7.1 Example
• Bandwidth
  – For speech rather low, 100–7,000 Hz
  – For music it tends to be high, 16–20,000 Hz
• Brightness (central point of the bandwidth):
  – For speech it is low (mainly due to the low bandwidth)
  – For music it is high
7.1 Example
• Proportion of silence
  – Frequent pauses in speech (between words and sentences)
  – Low percentage of silence for music (except for solo instruments)
• Variance of the zero crossing rate (over time)
  – Speech has a characteristic structure of syllables: short and long vowels, therefore a fluctuating zero crossing rate
  – Music often has a consistent rhythm, so a rather uniform zero crossing rate

7.1 Example
• Simple classification algorithm (Lu and Hankinson, 1998)
  [Decision-tree figure: tests on brightness (high/low), portion of silence (high/low) and variance of the zero crossing rate (high/low); leaf classes: Music, Solo music, Speech]
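To make the two features used by such a classifier concrete, here is a minimal sketch of how the proportion of silence and the variance of the zero crossing rate could be computed per window; the window length and the silence threshold are assumptions.

```python
import numpy as np

def frame_signal(signal, window_size=1024, hop=512):
    starts = range(0, len(signal) - window_size + 1, hop)
    return np.array([signal[s:s + window_size] for s in starts])

def silence_proportion(signal, threshold=0.01, window_size=1024, hop=512):
    """Fraction of windows whose RMS amplitude falls below a (hypothetical) threshold."""
    frames = frame_signal(signal, window_size, hop)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.mean(rms < threshold))

def zero_crossing_rate_variance(signal, window_size=1024, hop=512):
    """Variance over time of the per-window zero crossing rate."""
    frames = frame_signal(signal, window_size, hop)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return float(np.var(zcr))
```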
7.1 Example
• Quantitative high/low estimates are highly dependent on the collection
• Speech and music

7.1 Classification
• Determine a reference vector for each class from a set of training examples
• Assignment of new audio files to classes is based on the minimum distance of their feature vector to one of the reference vectors of the respective classes
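A minimal sketch of this minimum-distance assignment, assuming the per-class reference vectors are simply the means of the training feature vectors:

```python
import numpy as np

def reference_vectors(training_data):
    """training_data: dict mapping class name -> array of feature vectors (n_samples, n_features)."""
    return {label: vectors.mean(axis=0) for label, vectors in training_data.items()}

def classify(feature_vector, references):
    """Assign the audio file to the class whose reference vector is closest (Euclidean distance)."""
    return min(references, key=lambda label: np.linalg.norm(feature_vector - references[label]))
```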
7.1 Static Coefficients
• Low-level features for audio retrieval
  – For good feature vectors, the audio signal must be divided into time windows
  – Compute a vector for each window
• Calculate the low-level features in each time window
• Build statistical characteristics of the low-level features
  – Perceptional comparison of audio files

7.1 Static Coefficients
• Four statistical characteristics (Wold and others, 1996)
  – Loudness (perceived volume)
    • Measured as the root mean square (RMS) of the amplitude values (in dB)
    • More sophisticated methods take into account differences in the perceptibility of parts of the frequency spectrum (< 50 Hz, > 20 kHz)
  – Brightness (perceived brightness)
    • Defined as the center of gravity of the Fourier spectrum
    • Logarithmic scale
    • Describes the amount of high frequencies in the signal
  – Bandwidth (frequency bandwidth)
    • Defined as a weighted average of the differences of the Fourier coefficients to the center of gravity of the spectrum
    • Amplitudes are used as weights
  – Pitch (perceived pitch)
    • Calculated from the frequencies and amplitudes of the peaks within each interval (pitch tracking)
    • Pitch tracking is a difficult problem, therefore often approximated in simpler systems by the fundamental frequency
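The following sketch computes the four characteristics for a single windowed frame along the lines described above: RMS loudness in dB, the spectral center of gravity as brightness, the amplitude-weighted spread as bandwidth, and the strongest spectral peak as a crude stand-in for the fundamental frequency. The exact scaling and the peak-based pitch proxy are assumptions, not the methods from Wold and others.

```python
import numpy as np

def frame_features(frame, sample_rate):
    """Loudness, brightness, bandwidth and a simple pitch proxy for one windowed frame."""
    amps = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)

    loudness = 20 * np.log10(np.sqrt(np.mean(frame ** 2)) + 1e-12)            # RMS in dB
    brightness = np.sum(freqs * amps) / (np.sum(amps) + 1e-12)                # center of gravity
    bandwidth = np.sum(np.abs(freqs - brightness) * amps) / (np.sum(amps) + 1e-12)  # weighted spread
    pitch = freqs[np.argmax(amps[1:]) + 1]     # strongest non-DC peak as a crude fundamental estimate
    return np.array([loudness, brightness, bandwidth, pitch])
```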
7.1 Static Coefficients
• Time-dependent function for each of the four quantities in each time window
  – Loudness
  – Brightness
  – Bandwidth
  – Pitch
• E.g., laughter

7.1 Static Coefficients
• Aggregate statistical description of the four functions through:
  – Expected value (average value)
  – Variance (mean square deviation)
  – Autocorrelation (self-similarity of the signal)
• Either for each window or for the whole signal (results in a 12-dimensional feature vector, as in IBM’s QBIC)
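A sketch of this aggregation step: for each of the four per-window trajectories, the mean, the variance and an autocorrelation value are computed, giving the 12-dimensional feature vector mentioned above. Using lag one for the autocorrelation is an assumption; the slides do not fix the lag.

```python
import numpy as np

def aggregate(trajectories):
    """trajectories: array of shape (n_windows, 4) with loudness, brightness, bandwidth, pitch per window.

    Returns a 12-dimensional vector: mean, variance and lag-1 autocorrelation of each trajectory.
    """
    features = []
    for column in trajectories.T:
        mean, var = column.mean(), column.var()
        centered = column - mean
        autocorr = np.sum(centered[:-1] * centered[1:]) / (np.sum(centered ** 2) + 1e-12)
        features.extend([mean, var, autocorr])
    return np.array(features)
```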
7.1 Static Coefficients
• Example: laughter
• Each sound has typical values
• Thus we can classify audio files

7.1 Static Classification
• A training set of sounds of a class leads to a perceptional model for each class
  – Compute the mean vector
  – Calculate the covariance matrix
7.1 Static Classification
• For every new audio file, compute the Mahalanobis distance to each class
• Assign the file to a class (either based on a threshold or on the minimum distance)
• Determine the probability of correct classification

7.1 Application in Classification
• Classification for laughter (Wold and others, 1996)
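A minimal sketch of the Mahalanobis-distance classification, assuming the per-class mean vector and covariance matrix from the previous slide are available; the optional rejection threshold corresponds to the threshold-based variant mentioned above.

```python
import numpy as np

def mahalanobis(x, mean, cov):
    """Mahalanobis distance of feature vector x to a class with the given mean and covariance."""
    diff = x - mean
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def classify(x, models, threshold=None):
    """models: dict mapping class name -> (mean vector, covariance matrix).

    Assigns x to the class with minimum distance; with a threshold, distant sounds are rejected.
    """
    distances = {label: mahalanobis(x, mean, cov) for label, (mean, cov) in models.items()}
    best = min(distances, key=distances.get)
    if threshold is not None and distances[best] > threshold:
        return None
    return best
```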
7.1 Evaluation
• Statistical properties for retrieval and classification work well with short audio data
  – Parameters statistically represent human perception
  – Easy to use, easy to index, query by example
  – The only extension found in commercial databases
    • DB2 AIV Extenders (development discontinued)
    • Oracle Multimedia

7.1 Evaluation
• OK for differentiating between speech and music, or laughter and music
  – But purely statistical values are rather unsuitable for classifying and differentiating between musical pieces
  – Detection of notes from the audio signal (pitch determination) does not work very well
  – How does one define the term “melody”? (especially for queries, query by humming)
7.1 Problem
• Recognition of notes from the signal
  – Variety of instruments
  – Overlap of different instruments (and possibly voice)
• Simple if we have data in MIDI format and the audio signal was synthesized from it

7.1 Problem
• Definition of “melody”
  – Melody = sequence of musical notes
  – But querying for a melody has to be:
    • Invariant under pitch shift (soprano and bass)
    • Invariant under time shift
    • Invariant under slight variations
• Often, not the sequence of notes themselves but the sequence of their differences is used
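To illustrate the last point, a small sketch that turns a note sequence (here assumed to be MIDI note numbers) into the sequence of its differences, i.e., intervals in semitones, which is invariant under pitch shift:

```python
def interval_sequence(midi_notes):
    """Sequence of semitone differences between consecutive notes; transposition-invariant."""
    return [b - a for a, b in zip(midi_notes, midi_notes[1:])]

# The same motif sung by a soprano and a bass yields the same interval sequence.
assert interval_sequence([72, 74, 76, 72]) == interval_sequence([48, 50, 52, 48])
```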
7.2 Frequencies and Pitch
• Pitch has something to do with frequency
  – Only useful for periodic signals, not for noise

7.2 Frequencies and Pitch
• Harmonics
  – Harmonic tones have one main oscillation and several harmonics
7.2 Problem
• Interference makes the automatic detection of the dominant pitch difficult
• However, we need the pitch to extract the melody line

7.2 Difference Limen
• Human perception often differs from physical measurements
  – E.g., fundamental frequency ≠ pitch
• Exactly how do people perceive frequency differences?
  – The Difference Limen is the smallest change that is reliably perceived (“just noticeable difference”)
  – Accuracy varies with pitch, duration and volume
  – Experimental determination of average values for sine waves (Jesteadt and others, 1977)
7.2 Difference Limen
• Determined through psychological testing
  – Two tones of 500 ms duration with a small frequency difference are played one after the other
  – Subjects determine whether the second tone was higher or lower
  – This results in a psychometric function relating the frequency difference to the accuracy of the classification (50%–100%)
  – Above 75%, perception is considered reliable

7.2 Difference Limen
• (Jesteadt and others, 1977): Difference Limen of about 0.2%
7.2 Difference Limen
• A 0.2% Difference Limen means that most people can reliably distinguish a 1000 Hz tone from a 1002 Hz tone
• The quality of separation is not uniform across the frequency band
  – Worse at high and low frequencies
• Tone duration is important
  – Increasingly better for tones between 0–100 ms, constant for longer tones
• Volume is important
  – Increasingly better for tones between 5–40 dB, then constant
7.3 Pitch Definition
• ANSI standard (1994)
  – “Pitch is that attribute of auditory sensation in terms of which sounds may be ordered on a scale extending from low to high. Pitch depends mainly on the frequency content of the sound stimulus, but it also depends on the sound pressure and the waveform of the stimulus.”
• Typically, we limit ourselves to the melody line, to distinguish pitch from timbre
  – E.g., “s” and “sh” sounds differ in timbre rather than in pitch

7.3 Pitch Determination
• Experiments on the frequency scale
  – Adaptation of a generated sine wave (at 40 dB) to the perceived pitch by test subjects (Fletcher, 1934)
  – Histograms show the pitch (x Hz) and the agreement of all test subjects (x ± y Hz)
  – Multimodal distributions indicate several pitches
    • E.g., polyphonic music: some persons concentrate on one instrument, others on another
7.3 Pitch Determination
• Experiments: pitch

7.3 Theoretical Models
• Location-dependent pitch detection
  – The cochlea perceives different frequencies at different places
  – High frequencies at the entrance of the cochlea
  – Low frequencies at the end of the cochlea
  – The brain recognizes which neurons were excited
7.3 Location-dependent Models
• The stimulation of different neurons along the approximately 35 mm long basilar membrane in the cochlea is a typical pattern for audio coding
• The pitch can be detected from these patterns

7.3 Location-dependent Models
• Formula for the place of excitation in millimeters (z) (Greenwood, 1990)
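The formula itself did not survive the transcription; as a hedged reconstruction, the commonly cited form of the Greenwood (1990) function for the human cochlea relates the characteristic frequency F (in Hz) to the place of excitation z (in mm, measured from the apex):

```latex
F(z) \approx 165.4\,\bigl(10^{0.06\,z} - 0.88\bigr)
\qquad\Longleftrightarrow\qquad
z(F) \approx \frac{1}{0.06}\,\log_{10}\!\Bigl(\frac{F}{165.4} + 0.88\Bigr)
```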
7.3 Time-dependent Models
• The coding of the sound is not purely location-dependent, but also relies on temporal synchronization of the neurons
  – All neurons fire spontaneously in random sequence, depending on their refractory characteristic
  – When a sound with some frequency starts, it causes more neurons to fire synchronously
  – The brain determines the pitch based on an “autocorrelation function” of the pattern

7.3 Theoretical Models
• The two models address recognizing the pitches of individual sounds
• Pitch detection is more difficult in the case of complex tones
  – Groups of neurons are excited in several locations or with interfering synchronization patterns
• Which neuron excitation is the pitch?
7.3 Fundamental Frequency
• The lowest frequency generates the harmonics … thus pitch as the fundamental frequency?
  – Psychological experiments with and without the fundamental frequency in the same note show that the pitch of the note is still rated the same
  – Since synchrony remains the same with or without the fundamental frequency, we should consider the time-dependent model
• But how do we evaluate the synchrony?

7.3 Auditory Organization
• The hearing analyzes complex sounds in different frequency bands
• The hearing process organizes and integrates the different impressions
• Decide the pitch by matching against harmonic templates (Goldstein, 1973)
7.3 Auditory Organization
• Experiments favor centralized template matching
  – We can perceive pitches even if we split the signal into disjoint parts which are then heard with the two ears
  – The pitch is synthesized (but it does not work for all partitions; they are usually heard as several pitches)
  – Listeners can be misled by ambient noise to perceive a false template

7.3 Auditory Organization
• The pitch can also be synthesized as a non-occurring frequency, e.g., 296 Hz for the simultaneous play of the following non-harmonic tones
  [Figure: apparent 2/3 harmonic]
7.3 Pitch Tracking Algorithms
• Pitch is a feature of the frequency content at a particular time
• Requirements
  – Frequency resolution in the range of a semitone, with the correct octave
  – Detection of different instruments with well-defined harmonics (e.g., cello, flute)
  – (Recognition of pitches for conversion into symbolic notation in real time for interactive systems)

7.3 Pitch Tracking Algorithms
• Two families of approaches
  – Pitch tracking in the frequency domain
  – Pitch tracking in the time domain
• Length of the time window for the frequency spectrum
  – At least twice the length of the estimated period
7.3 Harmonic Product Spectrum
• HPS pitch detection is one of the simplest and most robust methods in the frequency domain (Schroeder, 1968), (Noll, 1969)
  – A certain frequency range is analyzed for the fundamental frequency
    • E.g., 50–250 Hz for male voices
  – All frequencies in this range are analyzed for harmonic overtones

7.3 Harmonic Product Spectrum
• X(ω): strength of the frequency ω in the current time window
• R: number of harmonics to be checked
  – E.g., R = 5
• Pitch is then the maximum of the product spectrum Y over all frequencies ω1, ω2, … in the frequency range to be investigated
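The product-spectrum formula itself was lost in the transcription; in its usual form Y(ω) = ∏ over r = 1..R of |X(r·ω)|, and the pitch is the ω maximizing Y within the search range. A minimal sketch in that spirit follows; the frame handling and the restriction to FFT bins are assumptions.

```python
import numpy as np

def hps_pitch(frame, sample_rate, fmin=50.0, fmax=250.0, R=5):
    """Estimate the pitch of one windowed frame via the harmonic product spectrum.

    fmin/fmax: frequency range searched for the fundamental (e.g. 50-250 Hz for male voices).
    R: number of harmonics multiplied together.
    """
    spectrum = np.abs(np.fft.rfft(frame))                      # |X(omega)| on a discrete grid
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    candidates = np.where((freqs >= fmin) & (freqs <= fmax))[0]

    best_freq, best_score = None, -np.inf
    for k in candidates:
        harmonic_bins = k * np.arange(1, R + 1)                # candidate and its first R harmonics
        harmonic_bins = harmonic_bins[harmonic_bins < len(spectrum)]
        score = np.prod(spectrum[harmonic_bins])               # product spectrum Y at this candidate
        if score > best_score:
            best_score, best_freq = score, freqs[k]
    return best_freq
```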
7.3 Harmonic Product Spectrum
• Problems occur due to noise at frequencies below 50 Hz

7.3 Harmonic Product Spectrum
• Other problems occur due to frequent octave errors
  – The pitch is often recognized an octave too high
• Rule-based solution:
  – If the closest peak below the pitch candidate has approximately half the frequency of the pitch candidate, and its amplitude is above a threshold, then select the pitch one octave below the pitch candidate
  – In practice, usually sufficient
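A sketch of this rule-based octave correction; the relative frequency tolerance and the amplitude threshold are assumptions.

```python
def correct_octave_error(candidate_freq, spectrum_peaks, tolerance=0.05, amp_threshold=0.2):
    """spectrum_peaks: list of (frequency, amplitude) pairs of the detected spectral peaks.

    If a sufficiently strong peak sits near half the candidate frequency,
    return the pitch one octave below the candidate; otherwise keep the candidate.
    """
    half = candidate_freq / 2.0
    for freq, amp in spectrum_peaks:
        if abs(freq - half) <= tolerance * half and amp >= amp_threshold:
            return half
    return candidate_freq
```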
7.3 Harmonic Product Spectrum
• HPS, example

7.3 Maximum Likelihood Estimator
• The ML algorithm (Noll, 1969) compares possible “ideal” spectra with the observed spectrum and selects the most similar one based on shape
• Ideal spectra are chains of pulses convolved with the dampening function of the signal window
• On the dampening function
  – The signal section represented by each signal window (e.g., length 40 ms) is dampened (mainly at the edges) by a special function to remove artifacts (mostly false high frequencies)
7.3 Maximum Likelihood Estimator
• Generation of an ideal spectrum
  [Figure: pulse train, convolution with the dampening function]

7.3 Maximum Likelihood Estimator
• The error is now determined between the spectrum being studied, Y, and the ideal spectrum Ỹω (with pitch ω)
• The terms ||Y||² and ||Ỹω||² remain relatively constant, so the error is governed by the product YỸω
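The omitted step can be reconstructed from the surrounding text as a hedged derivation: expanding the squared error shows why only the inner product matters, so minimizing the error amounts to maximizing the product.

```latex
\|Y - \tilde{Y}_\omega\|^2
  = \|Y\|^2 + \|\tilde{Y}_\omega\|^2 - 2\,\langle Y, \tilde{Y}_\omega\rangle
  \approx \text{const} - 2\,\langle Y, \tilde{Y}_\omega\rangle
\quad\Longrightarrow\quad
\hat{\omega} = \arg\max_{\omega}\,\langle Y, \tilde{Y}_\omega\rangle
```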
7.3 Maximum Likelihood Estimator
• Pitch is the frequency with minimal error
• In essence, this method is a “multiplication” of the input spectrum vector with a matrix of ideal spectra

7.3 Maximum Likelihood Estimator
• Efficiency and error probability vary with the degree of resolution (number of ideal spectra)
• Good results when the pitch of the analyzed signal is close to the pitch of one of the ideal spectra used
7.3 Auto Correlation Functions
• The most popular procedure in the time domain is the peak search in the autocorrelation function (ACF)
• The ACF measures the correlation of a signal with a shifted version of the same signal, where τ represents the shift
  – It is given by the corresponding equation

7.3 Auto Correlation Functions
• Since harmonic signals are strongly correlated with themselves when shifted by the pitch period, there is a peak in the ACF for good pitch candidates
• Because of the multiplications, the procedure is comparatively expensive

7.3 Auto Correlation Functions
• Average Magnitude Difference Function (AMDF)
  – Its minima are where the ACF has peaks
  – It is more efficient to compute
  – Therefore differences are often used rather than products

7.3 Auto Correlation Functions
• AMDF

7.3 Auto Correlation Functions
• ACF and AMDF are independent and can therefore be combined into a robust estimator Ψ(τ) (Kobayashi and Shinamura, 2000)
• Significantly better fault tolerance against noise (the best results for k = 1)
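The equations themselves were lost in the transcription. In their standard forms, the ACF is r(τ) = Σ_t x(t)·x(t+τ) and the AMDF is D(τ) = Σ_t |x(t) − x(t+τ)|; the combined estimator Ψ(τ) is assumed here to be the ratio r(τ)/(D(τ)+k), consistent with the remark about k = 1. A minimal sketch under these assumptions:

```python
import numpy as np

def acf(frame, tau):
    """Autocorrelation of the frame with itself shifted by tau samples."""
    return float(np.sum(frame[:-tau] * frame[tau:]))

def amdf(frame, tau):
    """Average magnitude difference function: small where the ACF peaks."""
    return float(np.sum(np.abs(frame[:-tau] - frame[tau:])))

def pitch_estimate(frame, sample_rate, fmin=50.0, fmax=500.0, k=1.0):
    """Pick the lag maximizing ACF / (AMDF + k) inside a plausible pitch range."""
    tau_min = max(int(sample_rate / fmax), 1)
    tau_max = int(sample_rate / fmin)
    scores = {tau: acf(frame, tau) / (amdf(frame, tau) + k)
              for tau in range(tau_min, tau_max)}
    best_tau = max(scores, key=scores.get)
    return sample_rate / best_tau          # estimated fundamental frequency in Hz
```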
7.3 Melody Recognition
• Pitch can be identified relatively robustly for each time slot
• But normally it is not enough for melody detection just to use the sequence of pitches
  – Windowing introduces errors in the individual recognitions (continuous pitch changes)
  – Attack frequency vs. melody at sustain level

7.3 Melody Recognition
• HPS pitch detection for a cello
7.3 Melody Recognition
• Filtering of transient faults and spontaneous octave jumps
• Monitor the size of the peaks for pitch allocation across multiple windows
  – Better resolution of the spectra in the case of uncertain assignments
• Interference with the original signal

7.3 Melody Recognition
• Post-processed pitch detection for a flute
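The slides do not prescribe a particular filter; one common, simple choice for removing transient faults and single-window octave jumps from a pitch track is a median filter over neighboring windows, sketched here as one possible post-processing step.

```python
import numpy as np

def smooth_pitch_track(pitches, width=5):
    """Median-filter a sequence of per-window pitch estimates (in Hz).

    Short-lived outliers such as single-window octave jumps are replaced by the
    median of their neighborhood; width is the (odd) size of that neighborhood.
    """
    half = width // 2
    padded = np.pad(np.asarray(pitches, dtype=float), half, mode='edge')
    return np.array([np.median(padded[i:i + width]) for i in range(len(pitches))])
```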
7.3 Melody Recognition
• Problems of melody detection
  – Strong polyphony
  – The tuning of the instruments changes with temperature, humidity, or time (can be accounted for in the ML approach by adjusting the ideal spectra)
  – Even minor changes (just for a second) can lead to alternating detection of two notes where only one is played (hysteresis)

This Lecture
• Audio Retrieval
  – Low-level Audio Features
  – Difference Limen
  – Pitch: tracking algorithms
Next Lecture
• Query by Humming
• Melody Representation and Matching
  – Parsons Codes
  – Dynamic Time Warping
• Hidden Markov Models