BLOG    
 

Frequently Asked Questions about Similarity Matrices and the Beat Spectrum

Q. In the paper entitled "Automatic Music Summarization via Similarity Analysis," you mention that you parameterized the audio using a Hamming window and MFCC's. In the paper entitled "Retrieving Music Rhythmic Similarity," you describe the process as simply dividing the audio into frames. I was wondering if the samples in the latter were obtained from a parameterization of simply framing the audio? (ie. Did you use a Hamming window?)

A. Sorry to be unclear: the processing in the rhythmic similarity paper is exactly the same (windowed MFCCs) as in the other papers. It's not likely to work in the time domain! Another trick that we sometime neglect to mention is that we typically zero-mean the MFCC features before we compute the cosine distance. This makes a much better looking matrix.

Q: I am a bit confused about how you describe autocorrelation on the similarity matrix, particularly the part about summing over one variable. I have tried to program the formula into Matlab using a brute force method, and the calculations simply take too long. I think I am overlooking some simplifications that I can make. I was wondering if you could explain this autocorrelation process to me.

A. I won't explain autocorrelation because it's in all the textbooks, but I will note that Matlab's built-in autocorrelation function is painfully slow as it was written to work in the time domain. There's a handy trick for doing autocorrelation in the frequency domain using FFTs:

X = fft2(x);

Xc = real(ifft2(X .* conj(X)));

This is amazingly faster, and I have no idea why the propellerheads at the Mathworks don't do it this way. Note: for this to work you need to zero-pad the original signal x. This gives you a 2-d result, of which you just take the center column. (Also note that (i)fft2 may only be in the image processing toolkit). In general, avoid loops like the plague in Matlab if you are working on large data sets like similarity matrices. You can do a huge amount with clever indexing: look at the code for specgram.m, which has essentially no loops at all!

Q: I don't understand the connection of the similarity matrix and the beat spectrum. I also don't understand how to interpret

 B(l) =  Sum(S(k,k+l)) 
and its connection to figure 3 in chapter 3.4 of your Beat Spectrum paper (The Beat Spectrum: A New Approach to Rhythm Analysis). You write that "B(0) is simply the sum along the main diagonal over some continuous range R [...]". My questions here: What are l and R (meaning, range,...)? Why do you sum along the diagonal at all? What is the idea behind this?

A. Let me try to explain another way. Imagine a signal that repeats itself exactly at 1 second intervals. The similarity matrix for this signal will look like diagonal stripes, which are spaced at 1 second (in both dimensions) from the main diagonal. The idea behind the beat spectrum is that when you sum diagonally, a particular diagonal stripes will sum down to a large value, while diagonals between the stripes will sum to small values. (Remember that both the similarity matrix and the beat spectrum are functions of time; and even in though in practice they are discrete it may help to think of them as continuous functions.)

Pick one axis of the similarity matrix. If we start at t = zero and sum along the diagonal, we will add up R values on the main diagonal. This is B(0), and should be a large number, as values on the main diagonal are maximum. Now continue along the axis you chose, say to t = 1/2 second. If we sum along the diagonal line that intersects the axis at t = 1/2, we will get a much smaller number B(1/2), because the similarity values at any point in the matrix along that off-diagonal will be much less. Now continue on to t = 1 second. Here we hit the first of those diagonal stripes. If we sum along that diagonal, we will get a large value again for B(1), because the similarity values along that diagonal will be large.

So the beat spectrum is a function of time B(t), along the axis that we chose, whose values are the sums along diagonal lines a distance t from the main diagonal. So the beat spectrum for this signal will have a peak at t = 0 (the main diagonal), a valley at t=1/2, another peak at t = 1 , and so forth for each stripe. B(0) is a scalar -- just the sum along the main diagonal, so we plot B(t) as a function of t.

The actual range of the sum is not important for the artificial example, as long as we sum over the same number of values. In fact, even if we take the sum over just one value we will still see a peak at B(0), a valley at B(1/2) and more peaks at B(1), B(2), and so forth. In a real signal that changes its structure over time, you need to pick R to be small enough that it doesn't average out the changes, and large enough that it does average out the noise.


Q: How you can conclude that "the first 10 seconds of the jazz composition Take 5" in Figure 3 of your paper (The Beat Spectrum: A New Approach to Rhythm Analysis) are "in the unusual 5/4 time signature", I mean that you may have a approach to get time signature, but it is not very clear to me, you are very appreciated to give me a detailed explanation.

A: I did not mean to indicate that I can extract the time signature. I know that "Take 5" is in 5/4 by listening, and the annotations in the figure were added by hand. I don't know of any methods to automatically determine the time signature. I apologize for not being more clear.

Q: In practice, is the algorithm sufficiently fast to run in real-time on a stock PC that is occupied with other activity (such as generating 3D graphics) such that only 10-20% of CPU might be available for beat detection? Have you explored real-time application of the technique?

A: We have not investigated any real-time processing. Real-time beat tracking involves a different set of problems than what we are looking at, for example to do it well you need a predictor to anticipate when the next downbeat will happen, as there will always be a lag due to whatever analysis you are using.

There is no reason why our techniques wouldn't work in real time, but we haven't implemented anything. I don't think computation is an issue, but optimizing the code is something that we also have not worked on. Also, the beat spectrum may not give you exactly what you want for visualization; in particular it will tell you the repetition times but not the actual instant of the downbeat. (Though the latter can be detected using pretty simple criteria).


Q Have you tried extracting any other parameters from music apart from the time signature and beat? For example, have you tried to guess the musical style, or (getting more fanciful) whether it's happy or sad. Could one distinguish between un upbeat rock piece and a sad ballad (there would obviously be a tempo difference, but there are also other clues).

A: Yes, but I've stayed away from genre detection because it's so messy: there's little "ground truth" to evaluate how well it actually works. The techniques we use for music retrieval would certainly work well for this, if you can decide on a set of styles that you want to distinguish between. This could be as simple as getting a collection of music in each style: we could then measure the similarity of any work to each style.

Q: I know that you can standardize a set of vectors by subtracting the mean and then dividing by the standard deviation for each data point. For example, in this case, each row would represent a song's beat spectrum vector, and each column would represent the set of all data points for all the songs, so for each column you would subtract the mean and divide by the standard deviation.

I was wondering if the above method of standardizing a set of vectors applied in the beat spectrum case, and if it does, have you tried it to good results? Would I normalize each beat spectrum by scaling it between 0 and 1 before "standardizing" it? In case you are wondering, I am using a cosine similarity metric to compare the beat spectra.

A: Commonly, the autocorrelation is normalized so the maximum is 1. This always occurs at a lag of 0. The beat spectrum is just an autocorrelation, and like the autocorrelation it can have negative values. I think it's a mistake to make those positive, so typically we just divide by the maximum (lag 0) value so everything is scaled (including the negative regions). Subtracting the mean would change the zero crossings, which I think are are good information (correlation vs. anti-correlation) so we don't do that. I don't see a good argument for normalizing to unity variance.

Copyright ©1999-2010 FX Palo Alto Laboratory | Send feedback to the webmaster