Mixed Distributions

Suppose we wish to simulate the mix of two or more distributions. I.e., we take a sample of the distribution X[k] with probability p[k]. In a programming language we would

  • define a vector of k functions or objects to hold the k generators,
  • produce an integer random number n between 1 and k according to the given probability vector p,
  • call the n-th generator.

To get the integer random number we can use a uniform distribution in [0,1] and get some t. Most languages have that built-in. Then we add up the p[k] until the sum exceeds t. I.e., we take n such that

\(p_1+\ldots+p_n \le t < p_1+\ldots+p_{n+1}\)

How to do this in the matrix language Euler Math Toolbox?

Of course, it is possible to program in that language. To speed up we could call routines in C or Python. That would certainly be fast enough. But I like to show you how to do it with the matrix language.

For an example, we mix two normal distributions. It is easy to generate a matrix with one line per distribution.

>mu=[-5,5]; sigma=[1,2]; ...
>n=1000000; X=mu'+sigma'*normal(n);

That’s all. The code works, since mu‘ and sigma‘ are column vectors, and normal(n) is a row vector. E.g., both are combined such that the first row of X contains a normal distribution with mean mu[1] and standard deviation sigma[1]. We took one million samples.

Now we can do the selection with

>function pselect (p,X) ...
$  n=cols(X);
$  j=randpint(1,n,p);
$  x=zeros(1,n);
$  loop 1 to n; x[#]=X[j[#],#]; end;
$  return x
$endfunction
>p=[1/3,2/3]; x=pselect(p,X);
>plot2d(x,>distribution):

test-001

The function randpint(n,m,p) is interesting. It makes a nxm matrix of integers according to the probabilities in p. Here is the definition in the utility file statistics.e.

function randpint (n: index, m: index, p: vector)
    return find(0|cumsum(p),random(n,m));
endfunction

It relies on an internal function find() which does an integer bisection to find numbers in the intervals of a sorted vector. It does map to the elements of its second argument, i.e., it applies to all random numbers generated by random(n,m).

By the way, the result has the correct mean and standard deviation.

>mean(x), p.mu'
 1.66577954404
 1.66666666667
>dev(x), sqrt(sum(p*(mu^2+sigma^2))-sum(p*mu)^2)
 5.02147777688
 5.02217305777

The function pselect() takes about 3 seconds on my computer for the one million numbers. If we wrote it in C (with pre-computed random integers), it would be much faster. But there is a way to do this in EMT.

The trick is to use the new function mget().

>function pselect (p,X) ...
$  n=cols(X);
$  j=randpint(1,n,p);
$  return mget(X,j'|(1:n)');
$endfunction

This function selects indices from a matrix. The indices are in the rows of a nx2 matrix. Now the selection takes 0.06 seconds. The factor 100 is indeed typical for matrix stuff which is rewritten in C.

Schreibe einen Kommentar

Deine E-Mail-Adresse wird nicht veröffentlicht.

Diese Website verwendet Akismet, um Spam zu reduzieren. Erfahre mehr darüber, wie deine Kommentardaten verarbeitet werden.