Loess and Clark

Apologies for the awful pun in the title, but it seemed to befit an exploration of the history of loess local regression, particularly its name and codebase.

If you’re not familiar with loess, it’s a nonparametric smoother: at each x value it estimates the local mean of y by fitting a simple weighted regression to the nearby points. Even if you want to end up with a more traditional regression, loess can still be a useful starting point for visually finding trends in the data. Earl Glynn shows a worked example with R code that illustrates the loess fit for different values of the bandwidth.
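To make the "local mean of y at each x" idea concrete, here is a minimal sketch in plain Python. It is not the original Fortran code or R's loess(): it assumes a straight-line local fit, tricube weights, and no robustness iterations, and the names tricube, loess_point, loess and the span parameter are my own illustrative choices. The toy data at the end is made up.

```python
def tricube(u):
    """Tricube kernel: weight is 1 at distance 0 and falls to 0 at |u| >= 1."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def loess_point(x0, xs, ys, span=0.5):
    """Estimate the local mean of y at x0 via a weighted straight-line fit.

    `span` is the fraction of points in each local window (an assumed
    stand-in for the bandwidth parameter).
    """
    n = len(xs)
    k = min(n, max(2, round(span * n)))        # neighbours per window
    dists = [abs(x - x0) for x in xs]
    h = sorted(dists)[k - 1] or 1e-12          # window half-width
    w = [tricube(d / h) for d in dists]
    if sum(w) == 0.0:                          # every neighbour sits on the window edge
        w = [1.0 if d <= h else 0.0 for d in dists]

    # Weighted least-squares line through the neighbours, evaluated at x0.
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0.0 else 0.0    # fall back to the weighted mean
    return my + slope * (x0 - mx)


def loess(xs, ys, span=0.5):
    """Smooth the whole series by fitting one local line per x value."""
    return [loess_point(x, xs, ys, span) for x in xs]


# Toy data (invented for illustration): a noisy line.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1.0, 2.9, 5.2, 7.1, 8.8, 11.1, 13.0, 15.2, 16.9, 19.0]
smooth = loess(xs, ys, span=0.5)
```

A larger span averages over more neighbours and gives a flatter curve; a smaller one follows the data more closely. Even in this toy version the fallbacks for tied or edge-of-window points take up a fair share of the code.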

Today was the first session of a Machine Learning study group with my colleagues. (We’re following along with Andrew Ng’s course notes for Stanford’s CS 229, also available on Coursera.) In the first chapter, Ng mentions loess regression, and two colleagues had interesting historical comments about it.

First, the name “loess,” for which I’ve heard several conflicting explanations or acronyms… My colleague Ben cited the original 1992 manual by Cleveland et al.:

The method we will use to fit local regression models is called loess, which is short for local regression, and was chosen as the name since a loess is a deposit of fine clay or silt along a river valley, and thus is a surface of sorts. The word comes from the German löss, and is pronounced löíss.

Not a bad name for a tool whose output often looks like a meandering river.

Second, Ben discovered that nearly all major implementations of loess (including the one in R) are direct ports of the original codebase. (Ben found this while making a FORTRAN-to-C translation of the original code for his excellent open-source statistical library Apophenia.) This is quite unusual for such a commonly used statistical method; many others are simply re-programmed from scratch in each new language. Ben supposed this was because, even though the core idea is simple, there are many possible edge cases that need to be dealt with to keep the results clean.

At this point another colleague, Bill, explained that he had heard from Cleveland himself that the code was intentionally written to be obscure and poorly commented, so they could sell it as proprietary software if they wanted. Indeed, look at the code: many of the functions have awful, uninformative names such as ehg127. Bill doubted that anyone else could really understand the code as it is without Herculean effort, but it does work… hence there haven’t been many attempts to rewrite it.

I’m constantly impressed by my colleagues’ institutional memory and broad historical knowledge. Hat tips to Ben Klemens and Bill Winkler.