Searching through Wikipedia, I wanted to find an algorithm for the median of n real values. Clearly, sorting the values and taking the value in the middle (or the mean value of the middle two values) is a good algorithm.
>seed(1); >n=1000001; >x=random(n); >xs=sort(x); xs[n/2+1] 0.500499771695 >median(x) 0.500499771695
The complexity of sorting is O(log(n)*n), however. So the question is if there is an algorithm of complexity O(n).
In fact, this is easy if we use the Golden Section search. This is not the best idea as we will see below. But it works. First we observe that the median solves the minimization problem
\(\sum\limits_{i=1}^n |m-x_i| \to \text{Min.}\)
This is a not so easy exercise for the reader.
So we can just use the Golden Section search for the minimization between the maximal and the minimal value. The Golden Section search is like a trisection algorithm, which computes the minimum to any fixed accuracy with a fixed number of steps. For 16 digits these are about 75 steps if the numbers are between 0 and 1. Each steps involves taking the sum of n numbers. So we get O(n) operations.
>function f(m) := sum(abs(m-x)) >fmin("f",min(x),max(x)) 0.500499771748
As you see, the minimum loses accuracy. This is typical for a minimum, since the function f(m) looks like a parabola closed to the minimum. To determine a minimum to single precision then requires double precision. Nevertheless, the complexity is O(n). Moreover, we can improve both the speed and the accuracy by the algorithm below.
By the way, here is a plot of a typical f(). Note that f() consists of line segments.
>x=0:0.2:1; plot2d("f",0,1):
In fact, the slopes are integer values between the points. These values can be computed easily by counting the number of elements larger and smaller than m. To get the median, we seek a point, where both counts are equal or defer by at most 1. The following algorithm implements this idea.
To check for the median m, we count the number of elements <=m, and the number of elements >=m. Moreover, we determine the largest element of the vector <=m, and the smallest element >=m. If the number of elements <=m and >=m defer by at most 1 we have found the median. Otherwise, we use the test value m for a new boundary of our bisection method.
In the following, we implement the algorithm in TinyC. This small C compiler is included in EMT. The C code is compiled into a DLL which is loaded to EMT. The user can then call the function cmedian(). There are some macros to make the communication with EMT easier and to handle type checking. The macro ARG_DOUBLE_MATRIX generates a double vector x[] in C and stores the size of the matrix in integer variables r and c. Moreover, the result must be pushed to the stack of EMT with new_real().
function tinyc cmedian (x) ... $ ARG_DOUBLE_MATRIX(x,r,c); $ CHECK(r==1,"Need a row vector!"); $ double min=x[0],max=x[0]; $ for (int i=1; i<c; i++) { $ if (x[i]<min) min=x[i]; $ if (x[i]>max) max=x[i]; $ } $ double totmin=min,totmax=max; $ int nless,ngreater; $ double med,xless,xgreater,res; $ while (1) { $ med=(min+max)/2; $ // print("%g %g %g\n",min,max,med); $ if (med==min || med==max) { res=med; break; } $ nless=0; ngreater=0; $ xless=totmin-1; xgreater=totmax+1; $ for (int i=0; i<c; i++) { $ if (x[i]<=med) { $ nless++; $ if (x[i]>xless) xless=x[i]; $ } $ if (x[i]>=med) { $ ngreater++; $ if (x[i]<xgreater) xgreater=x[i]; $ } $ } $ // print("%d %d %g %g\n",nless,ngreater,xless,xgreater); $ if (nless==ngreater) { res=(xless+xgreater)/2.0; break; } $ if (ngreater==nless+1) { res=xgreater; break; } $ if (nless==ngreater+1) { res=xless; break; } $ if (nless<ngreater) min=med; $ if (nless>ngreater) max=med; $ } $ new_real(res); $ endfunction >cmedian(x) 0.500499771695
We can use calls to print() for debugging. Note, that TinyC function which are defined like this must not contain a return statement. For more complex C programs with lots of functions or data types, compilation of external files is available in EMT.
Of course, such algorithms can be tested in EMT before they are written in C. The implementation should use the matrix language whenever possible. This makes code in EMT or Matlab or similar systems hard to read, however.
>function emtmedian (x : vector) ... $ min=min(x); max=max(x); $ repeat $ med=(min+max)/2; $ if med==min or med==max then return med; endif; $ i=nonzeros(x<=med); nless=cols(i); xless=max(x[i]); $ i=nonzeros(x>=med); ngreater=cols(i); xgreater=min(x[i]); $ if nless==ngreater then return med; $ elseif nless==ngreater+1 then return xless; $ elseif ngreater==nless+1 then return xgreater; $ endif; $ if nless<ngreater then min=med; $ else max=med; $ endif; $ end; $ endfunction >emtmedian(x) 0.500499771695
The counting of the numbers less and greater then the median is done with a conditional vector and nonzeros(). This is typical for matrix languages. It is the most efficient way to get the job done in EMT besides using C or Python. The algorithm is not much slower than the scripted versions.
There are other algorithms. E.g., variations of the Quicksort algorithm can be used to get the median. But these algorithms are fare more difficult to understand.