Quick public service announcement for my fellow R nerds:
R has two commonly-used random-Normal generators: rnorm
and MASS::mvrnorm
. I was foolish and assumed that their parameterizations were equivalent when you’re generating univariate data. But nope:
- Base R can generate univariate draws with
rnorm(n, mean, sd)
, which uses the standard deviation for the spread. - The MASS package has a multivariate equivalent,
mvrnorm(n, mu, Sigma)
, which uses the variance-covariance matrix for the spread. In the univariate case,Sigma
is the variance.
I was using mvrnorm
to generate a univariate random variable, but giving it the standard deviation instead of the variance. It took me two weeks of debugging to find this problem.
Dear reader, I hope this cautionary tale reminds you to check R function arguments carefully!
The documentation for mvrnorm says that Sigma must be a positive-definite symmetric matrix. Usually a scalar (sd in rnorm) is not considered to be a degenerate form of a positive-definite symmetric matrix, is it? So I would also remind the reader to check the documentation. Though a lot of it is poorly written so I can sympathize with those who get out of the habit of checking it.
Well, a 1×1 matrix whose single entry is positive seems to fit the definition of positive-definite symmetric matrix.
But I certainly agree that it’s worthwhile to check the documentation regularly!
Looked up your link. Talks about an n x n matrix but that usually assumes n>1. Doesn’t seem to explicitly include n=1, but I didn’t read it that closely. Most sources make a distinction between scalars and arrays for a host of reasons.
Fair enough. And even if n=1 is OK, you are right that a 1×1 matrix is not a scalar—neither in math nor in computing.
But then, R doesn’t even have scalars, only vectors of length 1, which are not 1×1 matrices either!
The bug in my code was definitely my own fault, and I didn’t mean to blame anyone else, just remind others how easy it is to jumble all these subtle differences.
Put yourself in the position of the person writing the software to generate multivariate normal samples. How do you parametrize it? The mean vector is obviously needed for location, but what about the spread? The variance matrix, normally denoted by Sigma, is uniquely defined and will usually be the matrix that the user of the software will have available. The standard deviation, usually denoted by sigma (lower case) has NO unique generalization to the multivariate case. A triangular factor of the variance matrix, as well as a symmetric square root, or any number of other possibilities would serve. More importantly, most users of the software would need to compute one of them as a separate computation before calling the main function. Furthermore if two users of the software chose different generalizations, even setting the same random number seed would not ensure they got the same results.
So you need to be careful, as usual, with using someone else’s software. However there is a very clear explanation of why the authors of mvrnorm needed to choose the parametrization that they (i.e. we…) did.
Absolutely! I didn’t mean in any way to criticize the authors of MASS (…which I use often, so I really should have known better than to make this mistake.) Thank you for creating such a useful package!
If I had to write mvrnorm, I would have used the covariance matrix too. And I can also imagine why the original rnorm used sd—it’s sensible to keep consistent units for center and spread.
I just intended this post as a reminder, as you said, to be careful with using someone else’s code. No offence meant!
No offence taken at all. I just wanted to be clear why we did it as we did.
I’m rather pleased to hear some people are still using MASS, too. You need to be careful there , too. We defined a generic function select() which is a name also used in the popular package dplyr, so there is a tendency for them to become entangled.
Just a little thing to notice…