## Penney’s Game

I recently stumbled across some YouTube videos about Penney’s game. Here is one of them. The problem was actually new to me, like almost all problems in mathematics. There simply are too many. It is easier to come up with a problem than a a solution.

According to Wikipedia, Walter Penney published the problem in the Journal of Recreational Mathematics in 1969. The article seems to be hard to get, so simply let us start an analysis ourselves. After all, that sounds like a lot of fun, and I have written about a similar problem in this blog before.

Assume we toss coins with outcome 0 and 1 at equal probability. So we get a sequence of tosses like

\(0,1,0,0,1,1,1,0,1,0,\ldots\)

We are studying the following game between two players:

*Player A selects a triplet of tosses (like 0,0,1), and player B another triplet (like 1,0,0).**The player whose triplet appears first in the sequence of tosses wins.*

Given the sequence above and the selected examples of triplets above, player B would win. His triplet 1,0,0 appeared immediately after four tosses.

If both players select their triplet secretly this is obviously a fair game.

There is a counterintuitive fact here: It is an illusion that the choice of the triplet matters. The average waiting time for any triplet is equal, and selecting (1,0,1) is no better than (1,1,1).

To understand this, we need to see the tossing as a sequence of triplets. In the example above the sequence is the following:

\((0,1,0) \to (1,0,0) \to (0,0,1) \to (0,1,1) \to (1,1,1) \to \ldots\)

All triplets have the same chance (namely 1/8) to appear at the start of the sequence. And any triplet T has exactly two other triplets S1, S2 which can lead to this triplet. Since S1 and S2 have the same chance (namely 1/8) to appear in the n-th position of the sequence and T is selected with probability 1/2 from each, the chance for T to appear in the (n+1)-th position is also 1/8.

After this first excursion into the mathematics of random walks, we need to come back to the third rule of Penney’s Game.

- The player B selects his triplet after seeing the selected triplet of player A.

Now, there is no reason why player B shouldn’t have an advantage. After all he can simply take the same triplet and get a draw. If he was forced to take another triplet he might even be at a disadvantage. But, indeed, he has a big advantage of at least 2:1. A wrong selection of player A can make the his chances even worse.

Let us try to understand this. This is a case of non-transitive ordering.

- Let us write S>T if the triplet S is more likely to appear before triplet T in the toss sequence when the tosses are started with a random triplet.

Then it is not true that

\(T \ge S \,{\rm and}\, S \ge U \Longrightarrow T \ge U\)

If it were true, the ordering would be complete and player A could select the best triplet, or at least a triplet that cannot be beaten by the selection of player B.

It turns out that for each selected triplet T of player A, there is a selection S which is better than T (S>T). This makes the game a win for player B.

This is counterintuitive again. A consequence is that there is a sequence of triplets with the property

\(T_1 < T_2, \quad T_2 < T_3, \ldots \quad , T_{n-1}<T_n, \quad T_n < T_1\)

In this game, the simple rule to find a better triplet is the following:

- If player A selects T=(X,Y,Z), player B selects S=(YN,X,Y), where YN is the opposite of Y, i.e., YN=0 if Y=1 and YN=1 if Y=0.

E.g., if player A selects (0,1,0) or (0,1,1), player B selects (0,0,1). For another example, if player A selects (1,1,0) or (1,1,1) player B selects (0,1,1).

How do we understand this? We use a general mathematical model.

A way to understand this is by considering random walks on a directed graph. The graph in this game has 8 vertices, the 8 possible triplets. It has 16 edges, i.e., pairs of vertices, with associated probabilities 1/2. Given any triplet, there are two possible tosses which lead to two possible edges. (Note that some edges go from the triplet to itself!)

E.g., there are two edges leading away from (1,1,0),

\((1,1,0) \to (1,0,0), \quad (1,1,0) \to (1,0,1)\)

or, from (1,1,1)

\((1,1,1) \to (1,1,0), \quad (1,1,1) \to (1,1,1)\)

One mathematical model for uses an adjacency matrix M, containing transition probabilities from vertex to vertex.

- We set M(i,j) to the probability to go from triplet j to triplet i (i.e., to reach triplet i from triplet j) in one toss. So we have a matrix M of transition probabilities.

Now imagine starting a lot of experiments by walking from a start vertex to subsequent vertices with the given probabilities. The experiments select the start vertex according to a probability vector P, i.e., vertex i is selected with probability P(i).

- If the distribution of vertices in the experiments is P at a step. Then the matrix multiplication MP will compute the distribution of probabilities at the next step.

Let us do a test in Euler Math Toolbox. First, we generate the adjacency matrix M for our game. Then we start at one specific node and compute the distribution after three steps.

>function makeM () ... $M = zeros(8,8); $for i1=0 to 1; $ for i2=0 to 1; $ for i3=0 to 1; $ for j1=0 to 1; $ for j2=0 to 1; $ for j3=0 to 1; $ i=4*i1+2*i2+i3; $ j=4*j1+2*j2+j3; $ if i2==j1 and i3==j2 then $ M[j+1,i+1]=1/2; $ endif; $ end; $ end; $ end; $ end; $ end; $end; $return M; $endfunction >M=makeM; >shortest M 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 >p=zeros(8)'; p[1]=1; p' [1, 0, 0, 0, 0, 0, 0, 0] >p=M.p; p' [0.5, 0.5, 0, 0, 0, 0, 0, 0] >p=M.p; p' [0.25, 0.25, 0.25, 0.25, 0, 0, 0, 0] >p=M.p; p' [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]

The function to generate the matrix M is not the most elegant one. It might be better to use a recursion than a loop. After we generate the matrix, we start with a „distribution“ which puts everything into the first triplet (actually that is (0,0,0)). After only three steps, we see that the probability to reach each triplet is equal.

Let us now model Penney’s game.

We have two triplets T and S (vertex numbers i and j) which lead nowhere. I.e., if those states are met we have found a case where T or S are hit.

We modify the adjacency matrix accordingly and delete all edges leading away from T and S and connect T and S to themselves with probability 1. This is a modified adjacency matrix M(i,j). We now have to look at the following limit.

\(P = \lim_{n \to \infty} M_{i,j}^n P_0\)

with

\(P_0 = (1/8,\ldots,1/8)^T\)

which is our start distribution.

Let us simulate that in Euler Math Toolbox. We select the triplets T=(0,1,0) (third element of p) and S=(0,0,1) (second element of p) which according to our rule should be superior.

>M=makeM; >M[,2]=0; M[2,2]=1; >M[,3]=0; M[3,3]=1; >shortest M 0.5 0 0 0 0.5 0 0 0 0.5 1 0 0 0.5 0 0 0 0 0 1 0 0 0.5 0 0 0 0 0 0 0 0.5 0 0 0 0 0 0 0 0 0.5 0 0 0 0 0 0 0 0.5 0 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 0 0 0 0.5 >p=ones(8)'/8; p' [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125] >p=M.p; p' [0.125, 0.25, 0.1875, 0.0625, 0.0625, 0.0625, 0.125, 0.125] >p=M.p; p' [0.09375, 0.34375, 0.21875, 0.03125, 0.0625, 0.0625, 0.09375, 0.09375] >loop 1 to 100; p=M.p; end; >p' [0, 0.666667, 0.333333, 0, 0, 0, 0, 0]

Indeed, the two selected triplets catch all the distribution, and S=(0,0,1) is indeed twice as good as T=(0,1,0). I.e., player B will win twice as often.

Things get worse for T=(0,0,0) and S=(1,0,0). S is seven times as good as T! There are other combinations which are three times as good.

Due to the structure of the problem, the limit P will have only non-zero entries for T and S. The probability to be in any other triplet will tend to 0. T and S will catch everything.

But can we compute the winning chance for S without a simulation? For that we define

- w(i) as the chance to reach S before T when starting at triplet i.

We are using the following properties of w.

- w(i) = 1 for the vertex S with number i.
- w(j) = 0 for the vertex T with number j.
- w satisfies

\(w = M_{i,j}^T \, w\)

We are searching for a specific eigenvector. The following function of EMT computes w given and adjacency matrix M and i,j, and returns the probability to reach vertex i before vertex j.

>function getw (M,i,j) ... $if i==j then return 0; endif; $Mt=M; $Mt[,i]=0; Mt[i,i]=1; $Mt[,j]=0; Mt[j,j]=1; $V = eigenspace(Mt',1); $c = V[[i,j]]\[1;0]; $return sum((V.c)')/cols(Mt); $endfunction >M = makeM(); % Let us try our simulated examples. >fraction getw(M,2,3), fraction getw(M,5,1) 2/3 7/8

The simulated example from above works, as well as another example which takes T=(0,0,0) and S=(1,0,0).

For the next step, we compute all probabilities that the triplet number i appears before the triplet number j when starting from a random triplet.

>function solvePG () ... $M = makeM(); $W=zeros(8,8); $for i=1 to 8; $ for j=1 to 8; $ W[i,j]=getw(M,i,j); $ end; $end; $return W; $endfunction >W=solvePG(); >fracformat(7); W, defformat; 0 1/2 2/5 2/5 1/8 5/12 3/10 1/2 1/2 0 2/3 2/3 1/4 5/8 1/2 7/10 3/5 1/3 0 1/2 1/2 1/2 3/8 7/12 3/5 1/3 1/2 0 1/2 1/2 3/4 7/8 7/8 3/4 1/2 1/2 0 1/2 1/3 3/5 7/12 3/8 1/2 1/2 1/2 0 1/3 3/5 7/10 1/2 5/8 1/4 2/3 2/3 0 1/2 1/2 3/10 5/12 1/8 2/5 2/5 1/2 0

The non-transitivity of the problem means that in each column there is a value >1/2. We can select this value and print the corresponding triplet in human readable form.

>function printtriplet (i) ... $s = ")"; i=i-1; $if mod(i,2) then s=",1"+s; else s=",0"+s; endif; $i = floor(i/2); $if mod(i,2) then s=",1"+s; else s=",0"+s; endif; $i = floor(i/2); $if mod(i,2) then s="(1"+s; else s="(0"+s; endif; $return s; $endfunction >function printPG (W) ... $for j=1 to 8; $ w=W[,j]'; $ e=extrema(w); $ printtriplet(j) + " -> " + printtriplet(e[4]), $end; $endfunction % We get the well known solution of the game. >printPG(W) (0,0,0) -> (1,0,0) (0,0,1) -> (1,0,0) (0,1,0) -> (0,0,1) (0,1,1) -> (0,0,1) (1,0,0) -> (1,1,0) (1,0,1) -> (1,1,0) (1,1,0) -> (0,1,1) (1,1,1) -> (0,1,1)

This is exactly the well-known rule that you find also in the Wikipedia article.

## Another Bayesian Problem

I just stumbled over another problem of probability theory on the channel MindYourDecisions. The video is rather old already, but mathematical problems never age.

*„A nursery has 3 boys and a number of girls at one day, plus an additional birth at night. The next day a child from the day before is picked at random, and it is a boy. What is the probability that the child born at night is a boy?“*

I hate the form that this question for a probability is posed. For the casual audience this does not make much sense. They will just answer 50%, and I would not be angry about this answer. It is sort of correct. You need a mathematical education to understand the problem to start with. Let me explain.

Everyone who has not met this kind of problems before will argue that the probability of the night child being a boy is 50%. That is true. The fact that the survey picks a boy the next day does not change this at all. I mean, the randomly picked boy was a boy. So what?

The problem is that the question is asking something more involved. It asks you to consider all possible random experiments. I.e., you assume a random child at night and a random pick. Then you discard those experiments that do not fit to the story.

An extreme case would be that there are no boys at daytime. Then a girl at night would not fit to the story, and all experiments that fit feature a boy at night. The Bayesian mathematics calls this a 100% probability for a boy.

For the actual case of three boys at daytime, let us assume g girls. I now follow the explanation of the video, but with different wordings and a completely different concept of thinking.

There are two cases now.

Case 1: We have a boy at night. We expect this case to happen in half of the experiments. The chance to pick a boy the next day is then 4/(4+g). We throw away all experiments where a girl is picked, i.e., g/(4+g) of them on average.

Case 2: We have a girl at night. Again this happens in half of the experiments on average. The chance to pick a boy next day is 3/(4+g). We throw away all experiments where a girl is picked.

Now imagine both cases together. In the experiments that we did not throw away we have an overhead of 4 to 3 for case 1. Thus, the Bayesian mathematics says that the chance for case one is 4/7.

There are other ways of writing this up. But they are only shortcuts of the above thought experiment. And for me, defining a precise experiment is the clearest way to think about such problems.

## Depth of Field and Bokeh of an Ideal Lens – Mathematics

I have never written about the optical computations that apply to an „ideal lens“ in the blog, although there is a notebook for Euler Math Toolbox and an Euler file („dof.e“) that can be loaded into the program. The topic, however, is quite interesting from a mathematical point of view. It involves modelling of the lens in a though experiment („Gedankenexperiment“), and determining formulas from the geometry of the model. So it makes a good application of simple Algebra and Geometry for higher school classes.

An ideal lens is able to focus light rays that emerge from a point of the scene (the object) into another point behind the lens (the image of the object). As you see in the figure above, the lens should ideally focus all objects on a plane with distance d (object distance) onto a plane with distance f(d) (focussed image distance) behind the lens. We denote it by f(d) because it depends on d. The line through the object and its image goes through the midpoint of the lens. The figure shows the situation (green and blue) for two objects.

If the object is at infinity, its emitted rays of light are parallel. Then the image is on a plane with distance f from the lens. This distance is called the focal length of the lens. Again, the figure shows two situations (green and blue). Actually, we could write

\(f = f(\infty)\)

In the first figure, the object is not at infinity, but closer. The lens will not be able to focus the object at distance f, but at the larger distance d_f. In a camera, the lens will have to be moved further away from the sensor or film to get a sharp image.

Our first aim is to prove the relation

\(\dfrac{1}{f} = \dfrac{1}{f(d)} + \dfrac{1}{d}\)

For this, we only want to use geometric properties of our „ideal lens“.

Our first observation is that we can find the focal length f in the figure above if we select a ray that goes from an object perpendicular to the lens at distance d and is then focussed to an image at distance f(d). The intersection with the central axis of the lens will have distance f in that case. This is so, because we assumed that parallel rays focus at distance f as in the second figure above.

At that point, we draw the situation in another way, emphasizing the things we need for our geometric argument.

Looking at the blue and the tray rectangle, we easily see

\(\dfrac{f(d)}{f} = \dfrac{d+f(d)}{d} = 1 + \dfrac{f(d)}{d}\)

Dividing by f(d) we get our simple and easy to remember formula.

There are some interesting special cases. One is the case when the object is at infinity. Then the image distance is just f as in the figure above.

\(d = \infty \quad \to \quad f(d)=f\)

Another one is

\(f(d) = d = \dfrac{f}{2}\)

That is a macro lens with 1:1 magnification. The size of an object is then identical to the size of the image of the object.

We can also see, that it is impossible to focus on objects that are closer than the focal length f.

Now, we can start to proceed towards the concept of „depth of field“ (DOF). The first step is to observe that an object will not be focussed correctly if the lens has the wrong distance from the sensor (or film).

So assume the object distance is d_o, but we have focussed on objects at a distance d. The objects at distance d_o will not be in focus. Instead, the rays emitted by the center object will form a small circle on the sensor.

The diameter c of this circle is, by another geometrical observation,

\(c = l \, \dfrac{d-f(d)}{f(d)}\)

Here, l is the diameter of the lens. To see this, consider the stretching factor between the two green triangles behind the lens (i.e., left of the lens in the figure).

For a realistic measure, we use the „aperture“ of a lens. That is usually measured by F-stops which we denote by A_F to minimize the confusion to the focal length f.

\(A_F = \dfrac{f}{l}\)

The aperture measure the amount of light that reaches the sensor per area. For different lenses, this amount will be the same for the same aperture, since f and l both affect the amount of light proportional to f^2 and l^2. Thus, the aperture can be used to get the correct exposure independent of the lens. Unfortunately, it is inverted in an unintuitive way. So larger a will yield less light on the sensor.

Now we are going to use EMT and Maxima to compute a formula for c which depends on the object distance, the focussed distance, the focal length and the aperture.

>equ1 &= 1/F = 1/DF+1/D 1 1 1 - = -- + - F DF D >function f(F,D) &= DF with solve(equ1,DF) D F - ----- F - D >function c(F,D0,D,AF) &= factor(F / AF * (f(F,D)-f(F,DO))/f(F,DO)) 2 (D - DO) F ------------- AF DO (F - D)

Thus

\(c = \dfrac{|d-d_o| \, f^2}{A_F d_o (d-f)}\)

That sound terribly complicated. So, let us study special cases and find some simple relations which we can use for photography.

Assume, we focus on an object at distance d_o. We are interested in the Bokeh of the photographic image. This is the blurriness of objects at infinity, or at a far distance. We need to study the behavior of c for large d. Mathematically, this is the limit as d goes to infinity. We get

\(c_\infty = \dfrac{f^2}{A_F \, d_o}\)

That’s much easier. But for practical purposes we have to consider that we need to double our focal length if we move the object away from us to double the distance. Thus, a more practical rule of thumb is

\(c_\infty \sim \dfrac{f}{A_F}\)

It applies to photographing the same object at the same size on the picture. This is a very natural formula if you think of the focal length as a magnification. The circles of diffusion are simply magnified with a higher focal length.

We also learn that for the same object at the same size on the image, the Bokeh gets twice better with twice the focal length and with half an F-stop. Or, you get the same Bokeh for a 50mm lens at F2 as for a 100mm lens at F4. For the Bokeh effect, it is usually cheaper to use a longer lens than an expensive and fast medium lens.

For depth of field (DOF), we want to know what objects are at a distance where the circle of confusion does not exceed a maximal value. In the old film days, this value was

\(c_\text{max} = 0.03 \text{mm}\)

It is possible to solve this exactly, but we are only interested in rules of thumb.

First, let us assume we take an image of the same object with different focal length such that the object has the same size on the picture. We need to change our distance accordingly. In fact f/d is almost constant. Moreover, we can assume that d is much larger than f, and thus d-f approximately equal to d_o. Then we get simply for object distances d_o which yield the maximal circle of confusion

\(|d-d_o| \sim A_F\)

I.e., the DOF does not change for the same subject at the same size with different lenses unless we change the aperture.

There is also the topic of the „hyperfocal“ distance. We want to determine the shortest distance d_o to focus on such that infinity is reasonably sharp. By our formula above for the radius of diffusion, we get

\(d_o = \dfrac{f^2}{c_\text{max} \, A_F}\)

Can we derive a practical rule of thumb from this? Not really. We can provide some examples, like focussing 3m for F8 and 24mm. The usual advice to „focus 1/3 into the scene“ is obviously only vague advice. Maybe, it yields the most pleasing unsharp regions. But I tend to check the image on the camera.

It is interesting how a „crop factor“ (aka. form factor) changes the mathematics. For an example, let us assume we take a micro four thirds (MFT) instead of a „full frame“ camera. Then we have a crop factor f_c of 2. That means, we need to use a 25mm instead of a 50mm to get the same image. The formula yields 1/4 of the circle of diffusion. But since our sensor is also 1/2 the size, we get a full frame equivalent

\( c_\text{equiv} = \dfrac{|d-d_o| \, f^2}{f_c A_F d_o (d-f)} \)

Effectively, the DOF multiplies by f_c, and the blurriness in the Bokeh at infinity divides by f_c. For the Bokeh, we now get under the condition that we take the same object at the same size in the picture,

\(c_{\infty,\text{equiv}} \sim \dfrac{f}{f_c \, A_F}.\)

To get the same Bokeh for MFT, we need twice the aperture. So you need to replace a 85mm lens on a full frame camera at F2.8 with a 42,5mm lens on MFT at 1.4. The best cheap lens is at 1.7, which is equivalent to F3.5 on the full frame camera.

But there is also an interesting effect which puts MFT into a better light. Assume, you need enough DOF to capture a bird in flight and get it safely into focus. In this case, you might use F5 on a 70mm lens (140mm equivalent) on MFT at 1/250 seconds exposure. On the full frame equivalent, you will have to use F10 (F11 actually) and thus crank up ISO two stops to keep 1/250 still. This loses the advantage of the larger sensor in terms of noise. Note that the Bokeh at infinity remains the same if you do so, since you doubled the F-stop in change of a factor f_c=2. So, MFT is working just as well here. You just use more compact and cheaper equipment.

## Hemmes Mathematische Rätsel

Ich verweise einmal auf die von Heinrich Hemme gesammelten mathematischen Rätsel in Spektrum der Wissenschaft. Manche sind leicht, manche sind schwer, alle sind interessant.

Hier ist ein älteres Beispiel: Wie lange ist die Kurve, die die Spitze des Dreiecks zurücklegt, wenn das Dreieck an der Seite abrollt.

## Constructive Art with EMT?

>n=40; X=random(n); Y=random(n); >function map dist(x,y) := log(prod((x-X)^2+(y-Y)^2)) >x=linspace(0,1,500); y=x'; >z=dist(x,y); levels=linspace(totalmin(z),totalmax(z),100); >fullwindow(); plot2d(z,levels=levels,a=0,b=1,c=0,d=1,<grid):

## The Ethics of AI

I recently have to read a lot about the consequences of AI with respect to ethics and moral. Let me quickly sketch my personal view on the subject.

First, let us talk about science fiction and AI robots that act like humans. In my view, it is rather simple:

*Machines are not human beings, **even when they are equipped with AI or look like humans. **Robots are not part of our society.*

In fact, it is probably better for us if the robots do not look like humans at all. That helps us keep the distinction. Machines, with AI or not, should be marked as machines.

The reason I am so sure and definitive about that is not a religious one. We humans are programmed genetically to be part of a human community. Our very self and the purpose of our life depends on other humans. Being lonely, an outlaw, or even just being pushed around make us sick. Children, friends, and admiration make us happy. We are proud to take responsibility for others. It is a misconception to think that people are basically selfish.

Now you can argue that the social system we live in could be enhanced and improved by AI robots which are on our level or even above, both in moral acting and in intellectual capabilities. These robots could live among us just like our friends. Wouldn’t that possibly be a better society? Aren’t humans inherently unreliable and the machines are not?

But any AI that acts in predictable and reliable ways is not an AI at all. It would act mechanical and can easily be determined to be a machine. The very characteristic of intelligence is that it explores new paths occasionally. This makes it unreliable and not predictable by definition. Soon, most AI will even act in super-human ways that we are incapable of understanding due to our limited capacities.

Thus, my only conclusion is that we need to distinguish ourselves from the AI we have. We need to treat AI as machines that we use. Robots are not a product of nature that can live with us uncontrolled with its own rights. We need to protect mankind, our society, and ourselves.

After all that science fiction nightmares, let us talk about the current way we use AI to improve technical systems. Even that limited form of AI already poses questions of ethics and moral.

Some see a lot of problems arising with AI or fuzzy logic that are built into cars, airplanes or surveillance systems. It is indeed true that these systems have the potential of a collision between a human decision and an AI decision. But that is already true without AI technology! We all experienced a system that seems to act on its own, a computer or even a mechanical device.

Think of an airplane with an AI as co-pilot. As a first example, assume the AI commands a go-around in an unstable approach. The chances are quite high that the AI is right and the pilot has made an error. I also tend to think that the AI makes much fewer mistakes throughout the complete flight as a co-pilot would. I would definitely prefer an AI to a rookie on the right seat. The same applies to nearly all applications of AI in technology. Usually, the AI is better. It can even be used to train the pilot in a simulator, much better and versatile than a human instructor.

It must be clear, however, that the responsibility for the proper working of the AI is at the developer of the plane, car or technical system that uses it. The AI is not a human that we can make responsible or even punish for mistakes. The developer has to test the system thoroughly to make sure it works as intended. But if you think this is too difficult note that it is much more difficult to test a human, and the verdict is much less reliable.

It must also be clear at all times that the AI that makes these decisions is a machine. When we let cars drive on their own the car does not become a being on our level, however intellectually superior it may be. If it does not work it will be trashed or repaired.

The problem with AI and robots should not be that they will be superior to us in many ways. The problem is that we need to treat them as our enslaved machines, and not as part of our society.

By the way, I am not afraid of super-human AI. Indeed, humanity deserves a lesson in humiliation. But let us use AI to our benefit!

## Teaching Python or Java

Whenever I have the pleasure to teach the beginner course in programming I start with the same question: Python or Java? Since colleagues have good success with Python and since Python is a useful language I am tempted ever again. However, I decide to go with Java every time. Let me explain the reasons.

This year, I even started a discussion on a Python forum to push me towards Python and tell me why I should prefer it. The discussion was an interesting read, but nothing that convinced me.

So I did my own research on Python sites designed for beginners and see how they handle the teaching. As expected, they dive into Python and use its data collections almost from the start, at least shortly after using the Python command line for interactive computations. Some even go quickly into real world usage with Python packages for graphics, artificial intelligence or numerical mathematics.

If Python is that useful and if there are so many libraries available and also quite nice IDEs (my favorite is Spyder) why do I shy back to use it as first language? Here are some reasons.

- I am teaching mathematicians. Mathematicians are meant to program, to test and to improve basic algorithms. They may also use high level libraries for research, but they should at least be able to start on the roots of the code, and often that is the only way to go. Python is not designed for this use. First, it is a lot slower than Java (which is on the level of C code, by the way). The difference can be a factor of 10 to 50. Then, it hides the basic data types (such byte, short, int, long, float, double) and their problems from the user and encourages to use its infinite arithmetic. I feel that mathematicians should understand the bits and bytes, and get full, speedy control of their code.
- Python extends its language to compactify the handling of data sets at the cost of extending the language with almost cryptic syntax. E.g., Python can automatically and in one expression create a list of integers satisfying a certain condition or select from another list under a condition. I am convinced that this kind of behind-the-scene operation is not suited to teach beginners how things work and how to write good and clean code. The situation is a bit like Matlab where everything is vectored and loops are costly. You learn how to use the high-level constructs of your language, but not the basics. If you only have a hammer everything looks like a nail.
- Python is as multi-platform or open-source as Java. Both need an interpreter and cannot be translated to native code to ease implementation of bigger programs on the user systems. Java is even a bit better in this respect. Both languages cannot easily use the native GUI library or other more hardware bound libraries. Python is a bit better here. But this is nothing to worry about as a first language anyway.
- I also need to answer the question: Is it easier to go from Java to Python or the other way? And, for me, the answer is clearly that Java provides the more basic knowledge on programming. Python is on a higher level. It is more a scripting language than a basic programming language. One may argue that this is good for beginners since they get a mighty system right at the start. After all, you learn to drive a car without knowing how the motor works. But I have seen enough programs in Python to know that it spoils the programmer and drives the thoughts away from a fresh, more efficient solution.

This said, I would very much welcome a second course building on the basics that uses Python and its advanced libraries. Many students are taught R and Matlab in the syllabus. I think Python would be a better choice than Matlab for numerical programming and data visualization. And, in contrast to Matlab, it is open and free.

## The Riemann Sphere

Since I am currently teaching complex function theory I am interested in visualizations of Möbius transformations as well as other analytic functions. The Riemann Sphere together with the Stereographic Projection is a good tool for this.

Above you see the basic idea. The points on the sphere are projected along a line through the north pole to the x-y-plane which represents the complex plane. The north pole associates to the point infinity. The formulas for the projection and its inverse are easy to compute with a bit of geometry. I have done so on this page using EMT and Maxima:

It is quite interesting to see what a Möbius transformation does to the Riemann sphere.

In this image, the north and south poles are mapped to the two black points. The other lines are the images of the circles of latitude and longitude. The image can be created by projecting the usual circles (with the poles in place) to the complex plane, applying the Möbius transformation, and projecting back to the sphere.

In fact, the projection and the Möbius transformation map circles or lines to circles or lines. They preserve angles. This is why the Stereographic Projection is often used for maps that should show the correct angles between paths. But, note that great circles that do not path to the pole are not mapped to lines, but to circles. So the shortest path on the surface of the Earth between two points is a circle on the projected map. Locally, this is only approximated by a line segment.

## A Simple Combinatorial Approximation Problem

This is certainly a well-known problem. But I pose it here because there is a geometric solution, and geometric solutions are always nice.

Assume you have two sets of n points

\(x_1,\ldots,x_n, \quad y_1,\ldots,y_n\)

The problem is to find a permutation p such that

\(\sum_{k=1}^n (x_k-y_{p(k)})^2\)

is minimized. Of course, we can assume that the x’s are sorted. And it is easy to guess that the minimum is achieved if the permutation sorts the y’s. But how to prove that?

Here is a simple idea. We claim that swapping two of the y’s into the right order makes the sum of squares smaller. I.e.,

\(x_1 < x_2, \, y_1<y_2 \Longleftrightarrow (x_1-y_1)^2+(x_2-y_2)^2 < (x_1-y_2)^2+(x_2-y_1)^2\)

If you try that algebraically the right hand side is equivalent to

\(-x_1y_1-x_2y_2 < -x_1y_2-x_2y_1\)

Putting everything one side and factorizing, you end up with

\((x_2-x_1)(y_2-y_1) > 0\)

and the equivalence is proved.

But there is a geometric solution. We plot the points into the plane. Two points are on the same side of the symmetry line between the x- and y-axis if they have the same order. And the connection to the mirrored point will always be larger.

We can extend this result to more norms. E.g., the sorting of the y’s will also minimize

\(\sum_{k=1}^n |x_k-y_{p(k)}|, \, \max_k |x_k-y_{p(k)}|\)

As long as the underlying norm is symmetric with respect to the variables, i.e.,

\(\|(\ldots,t,\ldots,s,\ldots)\|=\|(\ldots,s,\ldots,t,\ldots)\|\)

The prove for this can be based on the same geometrical argument.

## C vs Java vs Python vs EMT

I recently tested Python to see if it is good as a first programming language. The answer is a limited „yes“. Python has certainly a lot to offer. It is a mainstream language and thus there is a lot of support in the net, and there are tons of libraries for all sorts of applications. A modern version like Anaconda even installs all these tools effortlessly.

However, I keep asking myself if it isn’t more important to learn the basics before flying high. Or to put it another way: Should we really start to teach the usage of something like a machine learning toolbox before the student has understood data types and simple loops? This question is important because most high-level libraries take away the tedious work. If you know the right commands you can start problem-solving without knowing much of the underlying programming language.

I cannot decide for all teachers. It depends on the type of students and the type of class. But I can give you an idea of how much a high-level language like Python is holding you back if you want to implement some algorithm yourself. You have to rely on a well-implemented solution being available, or you will be lost.

So here is some test code for the following simple problem. We want to count how many pairs (i,j) of numbers exist such that i^2+j^2 is prime. For the count, we restrict i and j to 1000.

#include#include int isprime (int n) { if (n == 1 || n == 2) return 1; if (n % 2 == 0) return 0; int i = 3; while (i * i <= n) { if (n % i == 0) return 0; i = i + 2; } return 1; } int count(int n) { int c = 0; for (int i = 1; i <= n; i++) for (int j = 1; j <= n; j++) if (isprime(i * i + j * j)) c = c + 1; return c; } int main() { clock_t t = clock(); printf("%d\n",count(1000)); printf("%g", (clock() - t) / 1000.0); } /* Result was: 98023 0.132 */

So this takes 0.132 seconds. That is okay for one million prime-number checks.

Below is a direct translation in Java. Surprisingly, this is even faster than C, taking 0.127 seconds. The reason is not clear. But I might not have switched on every optimization of C.

public class Test { static boolean isprime (int n) { if (n == 1 || n == 2) return true; if (n % 2 == 0) return false; int i = 3; while (i * i <= n) { if (n % i == 0) return false; i = i + 2; } return true; } static int count(int n) { int c = 0; for (int i = 1; i <= n; i++) for (int j = 1; j <= n; j++) if (isprime(i * i + j * j)) c = c + 1; return c; } static double clock () { return System.currentTimeMillis(); } public static void main (String args[]) { double t = clock(); System.out.println(count(1000)); System.out.println((clock() - t) / 1000.0); } /* Result was: 98023 0.127 */ }

Below is the same code in Python. It is more than 30 times slower. This is unexpected even to me. I have to admit that I have no idea if some compilation trick can speed that up. But in any case, it is the behavior that an innocent student will see. Python should be seen as an interactive interpreter language, not a basic language to implement fast algorithms.

def isprime (n): if n==2 or n==1: return True if n%2==0: return False i=3 while i*i<=n: if n%i==0: return False i=i+2 return True def count (n): c=0 for k in range(1,n+1): for l in range(1,n+1): if isprime(k*k+l*l): ## print(k*k+l*l) c=c+1 return c import time sec=time.time() print(count(1000)) print(time.time()-sec) ## Result was ## 98023 ## 4.791218519210815

I have a version in Euler Math Toolbox too. It uses matrix tricks and is thus very short, using built-in loops in C. But it still cannot compete with the versions above. It is about 3 times slower than Python and 100 times slower than C. The reason is that the isprime() function is not implemented in C but in the EMT language.

>n=1000; k=1:n; l=k'; tic; totalsum(isprime(k^2+l^2)), toc; 98023 Used 13.352 seconds

Below is a version where I use TinyC to check for primes. The function isprimc() can only handle real numbers, no vectors. Thus I vectorize it with another function isprimecv(). This takes a bit of performance.

>function tinyc isprimec (x) ... $ ARG_DOUBLE(x); $ int n = (int)x; $ int res = 1; $ if (n > 2) $ { $ if (n%2 == 0) res=0; $ int i=3; $ while (i*i<=n) $ { $ if (n%i==0) { res=0; break; } $ i = i+2; $ } $ } $ new_real(res); $ endfunction >function map isprimecv (x) := isprimec(x); >n=1000; k=1:n; l=k'; tic; totalsum(isprimecv(k^2+l^2)), toc; 98023 Used 0.849 seconds