QUESTION
It is a very common problem that given a distance function $d(p_1,p_2)$ and a set of points pts
, we need to construct a matrix mat
so that mat[[i,j]] == d[ pts[[i]], pts[[j]] ]
.
What is the most efficient way to do this in Mathematica?
Let's assume that the points are in $\mathbb{R}^n$ for simplicity, and because that's the case I'm dealing with now, but theoretically the points could be any type of object, e.g. strings with $d$ being an edit distance.
For the specific problem I have right now I need to calculate the EuclideanDistance
and ManhattanDistance
of 2D points.
The simplest way to do this is
pts = RandomVariate[NormalDistribution[], {1000, 2}];
mat = Outer[ManhattanDistance, pts, pts, 1]; // AbsoluteTiming
(* ==> {0.595327, Null} *)
This obviously calculates all distances twice, which is wasteful. So I was hoping for an easy $2\times$ speedup, but it isn't as easy as one would hope. Doing the same operation the same number of times in a Do
loop takes considerably longer (probably because of indexing):
Do[ManhattanDistance[pts[[10]], pts[[20]]], {Length[pts]^2}]; // AbsoluteTiming
(* ==> {1.902417, Null} *)
So what programming pattern do you typically use when calculating such a distance matrix and which one would you recommend for this specific problem?
ANSWER
Using Outer
is here one of the worst methods, and not just because it computes the distance twice, but because you can't leverage vectorization in this approach. This is actually a common issue and an important point to stress: Outer
works pairwise and is unable to utilize the possible vectorized nature of the operation it is performing on an element-by-element basis.
Here is the code I will adopt from this answer:
distances=
With[{tr=Transpose[pts]},
Function[point,Sqrt[Total[(point-tr)^2]]]/@pts
];//Timing
(* {0.046875,Null} *)
which is an order of magnitude faster. You can Compile
it with a C target which may improve the performance further. Also, essentially the same approach I used in this recent answer, with good performance.
For Manhattan distance, use
distances =
With[{tr = Transpose[pts]},
Function[point, Total@Abs[(point - tr)]] /@ pts];
EDIT
As noted by Ray Koopman in comments, the function DistanceMatrix
from the package HierarchicalClustering`
may be faster for Euclidean distance, for small and medium data size (up to a couple of thousands):
<< HierarchicalClustering`
Sqrt[DistanceMatrix[pts]];// AbsoluteTiming
(* {0.019351, Null} *)
Note, however, that this is only true for the particular case of Euclidean distance, or perhaps other distances which don't require to set the DistanceFunction
option explicitly on the top-level. In other cases (for example, for Manhattan distance), it will be quite slow, because when DistanceFunction
is set explicitly, one can not leverage vectorization any more, once again.