I’ve been a professional software developer for about 7 years now, and I can count the number of times I’ve had to design an algorithm with the fingers of one hand. Even when I’m generous with what counts as “designing an algorithm”. But it has happened, and here is the story of one case I think was nice and successful.
The application I was working on was recording location traces. For display purposes, we could sometimes work with the full trace, but a full trace could contain thousands of points, so in most cases we needed a simplified version that would still retain the same “shape” as the original. After a bit of searching, I discovered this is called line simplification or path simplification.
My usual approach when encountering a problem like this is to look at existing algorithms for the problem. But in this case most of the existing algorithms required the whole trace, and we needed to produce this simplified version during recording for the trace up to each point near-constantly. In short, we needed an online algorithm.
Searching through the literature (I come from a research background and I find it useful enough to maintain access to the ACM Digital Library), I did find a couple of online simplification algorithms. But reading through the papers, they felt just way too complex to implement, always a danger with current algorithm literature.
The Data Structure
With the literature failing (or me not being good enough at searching), I needed to design the algorithm myself. Also unlike how this problem is often considered, our constraint wasn’t any sort of “closeness” to the actual path but the number of points in the simplified path.
This is what I decided on as the general structure: I would keep the points of the simplified path in a priority queue, with the priority being the desirableness of the point in the simplified path. I would always have the first and last points in the simplified path. And when the queue got over the maximum allowed size, I would evict the least needed point. Then, at any time, I could get the simplified path by just taking the points in the priority queue, and maintaining the path when new points were added should be as efficient as priority queue operations generally are.
This is in fact a data structure that I’ve found useful every now and then. Not just a priority queue, but a priority queue with adjustable priorities (see below why this is needed). It’s not generally available in libraries, but it’s not too hard to implement, and I’ve done it in a few programming languages. (I should try to find that code and publish it someday…)
What I do is, I have a binary heap for the priority queue with a separate map structure pointing to the heap. Then the queue can provide an operation to adjust the priority of an element, which is implemented by updating the priority in the heap of the element found in the map, and then bubbling the element up and/or down as appropriate. This requires a bit of bookkeeping, but nothing too complicated.
The Priority Function
The above is of course just the general structure of an algorithm for this problem. The key is in the definition of the priority function, as that is what determines which points are included in the simplified trace. And naturally that has the constraints that it cannot be too expensive to compute and it shouldn’t depend on too “global” information to avoid requiring too much recomputation when the simplified trace changes.
I went through a couple of failed attempts, which I luckily could reject with very little testing. To be honest, I don’t remember the details of these attempts. I vaguely remember one to be measuring the distances to the two neighboring points in the simplified trace (I don’t remember what it did with them), but that ended up starting to reject too many points as the trace grew.
What I settled on as the priority of a point was the area of the triangle formed by the point and its two neighboring points in the simplified trace, with larger areas having higher priority, and start and end points considered as having infinite area. Intuitively, this works because a big triangle corresponds to a large deviation from a straight path, so including such points will show the major curves of a path.
This priority function fulfills the requirements. It is simple to compute and it depends only on local information. And when a point is removed from the simplified trace, only its neighbors need their priorities recomputed, so maintaining the simplified trace also does not take too much time.
I did not do any sort of formal analysis of this algorithm. I have no idea how it performs according to any accepted metrics for path simplification algorithms. There are probably pathological cases where it performs poorly. What I did do was to take a number of real traces we had, recorded in very different situations, and compared the drawn path to the path drawn from the full trace. And we all found it to perform quite well. So that was that, algorithm designed and implemented, performs acceptably well in real situations. I was very pleased with getting the chance to do a bit of algorithm design and ending up with something pretty elegant that suited our needs.
Originally published at futurice.com.