What tool would you choose for the following task? Train a Bayesian spam filter on a few hundred spam mails and non-spam ("ham") mails, then check its success on some more mails of each kind. You have to write a training function, which extracts the words from the training mails, sorts them into a collection of some kind, and counts their frequencies. Then you have to write a function which looks up the words of a new mail in this collection and classifies the mail.
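To make the task concrete, here is a minimal sketch of the two functions in Java. It is not the code I actually wrote; the class name `NaiveBayesSpamFilter`, the method names, the tokenization by non-alphanumeric characters, and the add-one smoothing are all assumptions chosen for illustration. It counts in how many spam and ham mails each word occurs, and classifies a mail by comparing summed log-probabilities of the words it contains.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Illustrative sketch only: names and smoothing are assumptions, not the original code.
class NaiveBayesSpamFilter {
    // For each word: the number of training mails of that kind containing it.
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamMails = 0;
    private int hamMails = 0;

    // Extract the distinct lower-case words of a mail (assumed tokenization).
    static Set<String> words(String mail) {
        Set<String> result = new HashSet<>();
        for (String w : mail.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                result.add(w);
            }
        }
        return result;
    }

    // Training: count each word once per mail in the matching collection.
    void train(String mail, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String w : words(mail)) {
            counts.merge(w, 1, Integer::sum);
        }
        if (isSpam) {
            spamMails++;
        } else {
            hamMails++;
        }
    }

    // Classification: compare log-probabilities, with add-one smoothing
    // so that unseen words do not produce log(0).
    boolean isSpam(String mail) {
        double total = spamMails + hamMails + 2.0;
        double logSpam = Math.log((spamMails + 1.0) / total);
        double logHam = Math.log((hamMails + 1.0) / total);
        for (String w : words(mail)) {
            logSpam += Math.log((spamCounts.getOrDefault(w, 0) + 1.0) / (spamMails + 2.0));
            logHam += Math.log((hamCounts.getOrDefault(w, 0) + 1.0) / (hamMails + 2.0));
        }
        return logSpam > logHam;
    }
}
```

You would call `train` once for every training mail and then `isSpam` for every test mail, counting how often the answer is right. The hash maps do the work that needs `regexp` and `ismember` gymnastics in Matlab.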
This is all possible in Matlab (using regexp, ismember and the like). But Matlab is clearly the wrong choice for such a project. It can be done for a few hundred mails in less than a minute. But to achieve this, you need to apply Matlab in really clever ways, which produces obscure code, and you need a pretty fast computer. With Matlab, I could not beat the code of the students who had done the problem before. My running time would not drop below 60 seconds. Debugging was a horror, at least when I wanted to test on the complete set of mails.
So I tried it in Java. And indeed, I produced much cleaner and more flexible code in much less time, and it ran much faster. While it took me half a day in Matlab to come up with a sub-optimal solution, Java took only about an hour. Okay, I am rather experienced in Java and not so much in Matlab, but I did not use any deep Java tricks. I had expected the code to be faster, but I was surprised to see the running time shrink to a fraction of a second.
The first conclusion is that it can become ridiculous to use the wrong tool for a job. And Matlab clearly is the wrong tool for this.
The second conclusion is that we must combine tools. For example, it looks like a good idea to write a Java program which produces data for Matlab. Then we are able to draw histograms of frequencies or do whatever else Matlab can do with the data. Many of the things Matlab provides out of the box are more difficult to do in Java, unless you have a very good library at hand.
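As a rough sketch of such a combination, the Java side could simply write the word counts to a CSV file. The class name `FrequencyExport`, the column names, and the file layout are assumptions for illustration. The sketch also assumes that the words contain no commas, which holds if they consist of letters and digits only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: class name and CSV layout are assumptions.
class FrequencyExport {
    // Build CSV lines "word,count", sorted alphabetically by word,
    // preceded by a header line.
    static List<String> toCsvLines(Map<String, Integer> counts) {
        List<String> lines = new ArrayList<>();
        lines.add("word,count");
        for (Map.Entry<String, Integer> e : new TreeMap<>(counts).entrySet()) {
            lines.add(e.getKey() + "," + e.getValue());
        }
        return lines;
    }

    // Write the lines to a file (UTF-8 by default).
    static void write(Map<String, Integer> counts, Path file) throws IOException {
        Files.write(file, toCsvLines(counts));
    }
}
```

On the Matlab side, a file like this could then be read with something like `readtable` and plotted from there.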
I added Euler to the title line of this blog entry because I want to draw a third conclusion. Euler should stick to its aim of providing nice, beautiful, and easy-to-use mathematical tools to the user. In contrast to Matlab, Euler never tried to be at the cutting edge of numerical computing. For this, there are other tools. And I am not sure whether Matlab is always the best answer.