Frequently Asked Questions about Similarity Matrices and the Beat Spectrum
Q. In the paper entitled "Automatic Music Summarization via Similarity
Analysis," you mention that you parameterized the audio using a Hamming
window and MFCC's. In the paper entitled "Retrieving Music Rhythmic
Similarity," you describe the process as simply dividing the audio into
frames. I was wondering if the samples in the latter were obtained from
a parameterization of simply framing the audio? (ie. Did you use a
Hamming window?)
A. Sorry to be unclear: the processing in the rhythmic similarity paper is
exactly the same (windowed MFCCs) as in the other papers. It's not
likely to work in the time domain! Another trick that we
sometime neglect to mention is that we typically
zero-mean the MFCC features before we compute the cosine distance. This
makes a much better looking matrix.
Q: I am a bit confused about how you describe autocorrelation on the
similarity matrix, particularly the part about summing over one
variable. I have tried to program the formula into Matlab using a brute
force method, and the calculations simply take too long. I think I am
overlooking some simplifications that I can make. I was wondering if
you could explain this autocorrelation process to me.
A. I won't explain autocorrelation because it's in all the textbooks,
but I will note that Matlab's built-in autocorrelation function is
painfully slow as it was written to work in the time domain. There's a
handy trick for doing autocorrelation in the frequency domain using
FFTs:
X = fft2(x);
Xc = real(ifft2(X .* conj(X)));
This is amazingly faster, and I have no idea why the propellerheads at
the Mathworks
don't do it this way. Note: for this to work you need to zero-pad the original
signal x. This gives you a 2-d result, of which you just take the center
column. (Also note that (i)fft2 may only be in the image processing toolkit).
In general, avoid loops like the plague in Matlab if you are working on large
data sets like similarity matrices. You can do a huge amount with clever indexing: look at the code for specgram.m, which has essentially no loops at all!
Q: I don't understand the connection of the similarity matrix and the beat
spectrum. I also don't understand how to interpret
B(l) = Sum(S(k,k+l))
and its connection to figure 3 in chapter 3.4 of your Beat
Spectrum paper (The Beat Spectrum: A New Approach to Rhythm Analysis). You write that "B(0) is simply the sum along the main
diagonal over some continuous range R [...]". My questions here: What are l and R (meaning, range,...)? Why do you sum along the diagonal at all? What is the idea behind this?
A. Let me try to explain another way.
Imagine a signal that repeats itself exactly at 1 second intervals. The
similarity matrix for this signal will look like diagonal stripes, which are
spaced at 1 second (in both dimensions) from the main diagonal.
The idea behind the beat spectrum is that when you sum diagonally, a particular
diagonal stripes will sum down to a large value, while diagonals between the
stripes will sum to small values.
(Remember that both the similarity matrix and the beat spectrum are functions
of time; and even in though in practice they are discrete it may help to think
of them as continuous functions.)
Pick one axis of the similarity matrix. If we start at t = zero and sum along
the diagonal, we will add up R values on the main diagonal. This is B(0), and
should be a large number, as values on the main diagonal are maximum. Now
continue along the axis you chose, say to t = 1/2 second. If we sum along the
diagonal line that intersects the axis at t = 1/2, we will get a much smaller
number B(1/2), because the similarity values at any point in the matrix along
that off-diagonal will be much less. Now continue on to t = 1 second. Here we
hit the first of those diagonal stripes. If we sum along that diagonal, we will
get a large value again for B(1), because the similarity values along that
diagonal will be large.
So the beat spectrum is a function of time B(t), along the axis that we chose,
whose values are the sums along diagonal lines a distance t from the main
diagonal. So the beat spectrum for this signal will have a peak at t = 0 (the
main diagonal), a valley at t=1/2, another peak at t = 1 , and so forth for
each stripe. B(0) is a scalar -- just the sum along the main diagonal, so we
plot B(t) as a function of t.
The actual range of the sum is not important for the artificial example, as
long as we sum over the same number of values. In fact, even if we take the sum
over just one value we will still see a peak at B(0), a valley at B(1/2) and
more peaks at B(1), B(2), and so forth. In a real signal that changes its
structure over time, you need to pick R to be small enough that it doesn't
average out the changes, and large enough that it does average out the noise.
Q: How you can conclude that "the first 10 seconds of the jazz
composition Take 5" in Figure 3 of your paper (The Beat Spectrum: A New Approach to Rhythm Analysis) are "in the unusual 5/4 time signature", I mean that you may have a approach to get time signature, but it is not very clear to me, you are very appreciated to give me a detailed explanation.
A: I did not mean to indicate that I can extract the time signature. I
know that "Take 5" is in 5/4 by listening, and the annotations in the
figure were added by hand. I don't know of any methods to automatically
determine the time signature. I apologize for not being more clear.
Q: In practice, is the algorithm sufficiently fast to run in real-time
on a stock PC that is occupied with other activity (such as generating
3D graphics) such that only 10-20% of CPU might be available for beat
detection? Have you explored real-time application of the technique?
A: We have not investigated any real-time processing. Real-time beat tracking
involves a different set of problems than what we are looking at, for example
to do it well you need a predictor to anticipate when the next downbeat will
happen, as there will always be a lag due to whatever analysis you are using.
There is no reason why our techniques wouldn't work in real time, but we
haven't implemented anything. I don't think computation is an issue, but
optimizing the code is something that we also have not worked on. Also, the
beat spectrum may not give you exactly what you want for visualization; in
particular it will tell you the repetition times but not the actual instant of
the downbeat. (Though the latter can be detected using pretty simple criteria).
Q Have you tried extracting any other parameters from music apart from
the time signature and beat? For example, have you tried to guess the
musical style, or (getting more fanciful) whether it's happy or sad.
Could one distinguish between un upbeat rock piece and a sad ballad
(there would obviously be a tempo difference, but there are also other
clues).
A: Yes, but I've stayed away from genre detection because it's so messy:
there's little "ground truth" to evaluate how well it actually works.
The techniques we use for music
retrieval would certainly work well for this, if you
can decide on a set of styles that you want to distinguish between. This
could be as simple as getting a collection of music in each style: we
could then measure the similarity of any work to each style.
Q: I know that you can standardize a set of vectors by subtracting the
mean and then dividing by the standard deviation for each data point.
For example, in this case, each row would represent a song's beat
spectrum vector, and each column would represent the set of all data
points for all the songs, so for each column you would subtract the mean
and divide by the standard deviation.
I was wondering if the above method of standardizing a set of vectors applied
in the beat spectrum case, and if it does, have you tried it to good results?
Would I normalize each beat spectrum by scaling it between 0 and 1 before
"standardizing" it? In case you are wondering, I am using a cosine
similarity metric to compare the beat spectra.
A: Commonly, the autocorrelation is normalized so the maximum is 1. This
always occurs at a lag of 0. The beat spectrum is just an
autocorrelation, and like the autocorrelation it can have negative
values. I think it's a mistake to make those positive, so typically we
just divide by the maximum (lag 0) value so everything is scaled
(including the negative regions). Subtracting the mean would change the
zero crossings, which I think are are good information (correlation
vs. anti-correlation) so we don't do that. I don't see a good argument for
normalizing to unity variance.
|
|