FAQ
overflow

Great Answers to
Questions About Everything

QUESTION

It is a very common problem that given a distance function $d(p_1,p_2)$ and a set of points pts, we need to construct a matrix mat so that mat[[i,j]] == d[ pts[[i]], pts[[j]] ].

What is the most efficient way to do this in Mathematica?


Let's assume that the points are in $\mathbb{R}^n$ for simplicity, and because that's the case I'm dealing with now, but theoretically the points could be any type of object, e.g. strings with $d$ being an edit distance.

For the specific problem I have right now I need to calculate the EuclideanDistance and ManhattanDistance of 2D points.

The simplest way to do this is

pts = RandomVariate[NormalDistribution[], {1000, 2}];

mat = Outer[ManhattanDistance, pts, pts, 1]; // AbsoluteTiming

(* ==> {0.595327, Null} *)

This obviously calculates all distances twice, which is wasteful. So I was hoping for an easy $2\times$ speedup, but it isn't as easy as one would hope. Doing the same operation the same number of times in a Do loop takes considerably longer (probably because of indexing):

Do[ManhattanDistance[pts[[10]], pts[[20]]], {Length[pts]^2}]; // AbsoluteTiming

(* ==> {1.902417, Null} *)

So what programming pattern do you typically use when calculating such a distance matrix and which one would you recommend for this specific problem?

{ asked by Szabolcs }

ANSWER

Using Outer is here one of the worst methods, and not just because it computes the distance twice, but because you can't leverage vectorization in this approach. This is actually a common issue and an important point to stress: Outer works pairwise and is unable to utilize the possible vectorized nature of the operation it is performing on an element-by-element basis.

Here is the code I will adopt from this answer:

distances=
   With[{tr=Transpose[pts]},
     Function[point,Sqrt[Total[(point-tr)^2]]]/@pts
   ];//Timing

(*  {0.046875,Null} *)

which is an order of magnitude faster. You can Compile it with a C target which may improve the performance further. Also, essentially the same approach I used in this recent answer, with good performance.

For Manhattan distance, use

distances = 
   With[{tr = Transpose[pts]}, 
      Function[point, Total@Abs[(point - tr)]] /@ pts];

EDIT

As noted by Ray Koopman in comments, the function DistanceMatrix from the package HierarchicalClustering` may be faster for Euclidean distance, for small and medium data size (up to a couple of thousands):

<< HierarchicalClustering`
Sqrt[DistanceMatrix[pts]];// AbsoluteTiming

(* {0.019351, Null} *)

Note, however, that this is only true for the particular case of Euclidean distance, or perhaps other distances which don't require to set the DistanceFunction option explicitly on the top-level. In other cases (for example, for Manhattan distance), it will be quite slow, because when DistanceFunction is set explicitly, one can not leverage vectorization any more, once again.

{ answered by Leonid Shifrin }
Tweet