Great Answers to
Questions About Everything


It is a very common problem that given a distance function $d(p_1,p_2)$ and a set of points pts, we need to construct a matrix mat so that mat[[i,j]] == d[ pts[[i]], pts[[j]] ].

What is the most efficient way to do this in Mathematica?

Let's assume that the points are in $\mathbb{R}^n$ for simplicity, and because that's the case I'm dealing with now, but theoretically the points could be any type of object, e.g. strings with $d$ being an edit distance.

For the specific problem I have right now I need to calculate the EuclideanDistance and ManhattanDistance of 2D points.

The simplest way to do this is

pts = RandomVariate[NormalDistribution[], {1000, 2}];

mat = Outer[ManhattanDistance, pts, pts, 1]; // AbsoluteTiming

(* ==> {0.595327, Null} *)

This obviously calculates all distances twice, which is wasteful. So I was hoping for an easy $2\times$ speedup, but it isn't as easy as one would hope. Doing the same operation the same number of times in a Do loop takes considerably longer (probably because of indexing):

Do[ManhattanDistance[pts[[10]], pts[[20]]], {Length[pts]^2}]; // AbsoluteTiming

(* ==> {1.902417, Null} *)

So what programming pattern do you typically use when calculating such a distance matrix and which one would you recommend for this specific problem?

{ asked by Szabolcs }


Using Outer is here one of the worst methods, and not just because it computes the distance twice, but because you can't leverage vectorization in this approach. This is actually a common issue and an important point to stress: Outer works pairwise and is unable to utilize the possible vectorized nature of the operation it is performing on an element-by-element basis.

Here is the code I will adopt from this answer:


(*  {0.046875,Null} *)

which is an order of magnitude faster. You can Compile it with a C target which may improve the performance further. Also, essentially the same approach I used in this recent answer, with good performance.

For Manhattan distance, use

distances = 
   With[{tr = Transpose[pts]}, 
      Function[point, Total@Abs[(point - tr)]] /@ pts];


As noted by Ray Koopman in comments, the function DistanceMatrix from the package HierarchicalClustering` may be faster for Euclidean distance, for small and medium data size (up to a couple of thousands):

<< HierarchicalClustering`
Sqrt[DistanceMatrix[pts]];// AbsoluteTiming

(* {0.019351, Null} *)

Note, however, that this is only true for the particular case of Euclidean distance, or perhaps other distances which don't require to set the DistanceFunction option explicitly on the top-level. In other cases (for example, for Manhattan distance), it will be quite slow, because when DistanceFunction is set explicitly, one can not leverage vectorization any more, once again.

{ answered by Leonid Shifrin }