This fall I’m enjoying auditing Jordan Rodu’s mini course on Deep Learning. He’s had us read parts of the forthcoming Deep Learning book (free draft online), finished just this year and thus presumably up-to-date.
It’s fascinating to see how the core advice has changed from the literature we covered in Journal Club just a few years ago. Back then, my team was assigned a 2010 paper by Erhan et al.: “Why Does Unsupervised Pre-training Help Deep Learning?” Unsupervised pre-training[1] seems to have sparked the latest neural network / deep learning renaissance in 2006, underlying some dramatic performance improvements that got people interested in this methodology again after a decade-long “neural network winter.” So, we spent a lot of time reading this paper and writing simulations to help us understand how/why/when pre-training helps. (Here are our notes on the paper, simulations, and class discussion.)
But now in 2016, the Deep Learning book’s Chapter 15 says that “Today, unsupervised pretraining has been largely abandoned” (p.535). It seems to be used only in a few specific fields where there are good reasons for it to work, such as natural language processing. How quickly this field has changed!
Obviously, larger datasets and more raw computing power helped make deep neural networks feasible and interesting again in the 2000s. But algorithmic developments have helped too. Although unsupervised pre-training is what sparked renewed interest, the recent book claims (p.226) that the most important improvements have been: (1) using cross-entropy loss functions (i.e. minimizing the negative log-likelihood) instead of always using mean squared error, and (2) using rectified linear activation functions in hidden units instead of sigmoid activation functions. Chapter 6 explains what these things mean and why they make a difference. But basically, these small tweaks (to the loss function you optimize, and to the non-linearities you work with) make large models much easier to fit, because they help give you steeper gradients when your model fits poorly, so you don’t get stuck in regions of poor fit quite as often.
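To make the “steeper gradients” point concrete for myself, here is a small sketch (mine, not from the book). It assumes a single sigmoid output unit with target y = 1 and looks at the gradient with respect to the pre-activation z. All function names are made up for illustration. With squared error the gradient picks up an extra sigmoid-slope factor, which shrinks toward zero exactly when the unit is confidently wrong; with cross-entropy that factor cancels. The last two helpers make the analogous comparison for hidden units: the sigmoid’s slope is at most 0.25 and vanishes for large |z|, while the rectified linear unit’s slope is 1 whenever the unit is active.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse_wrt_z(z, y):
    """Derivative w.r.t. z of 0.5 * (sigmoid(z) - y)**2."""
    s = sigmoid(z)
    return (s - y) * s * (1.0 - s)

def grad_xent_wrt_z(z, y):
    """Derivative w.r.t. z of -[y*log(s) + (1-y)*log(1-s)], s = sigmoid(z)."""
    return sigmoid(z) - y

def sigmoid_slope(z):
    """Slope of the sigmoid activation at z."""
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_slope(z):
    """Slope of the rectified linear activation at z (taking 0 at z = 0)."""
    return 1.0 if z > 0 else 0.0

if __name__ == "__main__":
    y = 1.0  # true label; a very negative z means the unit is badly wrong
    for z in (-8.0, -2.0, 0.0, 2.0):
        print(f"z={z:5.1f}  MSE grad={grad_mse_wrt_z(z, y):+.6f}  "
              f"cross-entropy grad={grad_xent_wrt_z(z, y):+.6f}")
    for z in (0.5, 4.0, 8.0):
        print(f"z={z:4.1f}  sigmoid slope={sigmoid_slope(z):.6f}  "
              f"ReLU slope={relu_slope(z):.1f}")
```

At z = -8 the squared-error gradient is several orders of magnitude smaller than the cross-entropy one, even though the prediction could hardly be worse. That is the “stuck in regions of poor fit” problem in miniature.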
I look forward to learning more as Jordan’s class progresses. Meanwhile, if you want to try building a deep neural network from scratch yourself, I found the Stanford Deep Learning Tutorial helpful. Here are my solutions to some of the exercises. (This doesn’t teach you to use the well-designed, optimized, pre-made Deep Learning libraries that you’d want for a real application—just to practice building their core components from scratch so you understand how they work in principle. Your resulting code isn’t meant to be optimal and you wouldn’t use it to deploy something real.)
PS—here’s also a nice post on Deep Learning from Michael Jordan (the ML expert, not the athlete). Instead of claiming ML will take over Statistics, I was glad to hear him reinforcing the importance of traditionally statistical questions:
…while I do think of neural networks as one important tool in the toolbox, I find myself surprisingly rarely going to that tool when I’m consulting out in industry. I find that industry people are often looking to solve a range of other problems, often not involving “pattern recognition” problems of the kind I associate with neural networks. E.g.,
1. How can I build and serve models within a certain time budget so that I get answers with a desired level of accuracy, no matter how much data I have?
2. How can I get meaningful error bars or other measures of performance on all of the queries to my database?
3. How do I merge statistical thinking with database thinking (e.g., joins) so that I can clean data effectively and merge heterogeneous data sources?
4. How do I visualize data, and in general how do I reduce my data and present my inferences so that humans can understand what’s going on?
5. How can I do diagnostics so that I don’t roll out a system that’s flawed or figure out that an existing system is now broken?
6. How do I deal with non-stationarity?
7. How do I do some targeted experiments, merged with my huge existing datasets, so that I can assert that some variables have a causal effect?
Although I could possibly investigate such issues in the context of deep learning ideas, I generally find it a whole lot more transparent to investigate them in the context of simpler building blocks.
[1] Unsupervised pre-training, roughly: train one neural-network layer to reproduce the input, i.e. train an autoencoder. Then add another layer and repeat. After several such layers, (we hope) the network will learn several levels of features common to the data inputs, essentially automating the feature-engineering process. E.g. in images you might learn edge-detectors and so on from the data, instead of manually curating them. Finally, use the structure/weights learned this way to initialize a supervised network, classifying the inputs using supervised labels.
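The recipe in this footnote can be sketched in a few lines of NumPy. This is only a toy illustration under my own assumptions, not the setup from Erhan et al. or the book: sigmoid encoders, a linear decoder, squared reconstruction error, plain full-batch gradient descent, and random toy data. The names `train_autoencoder` and `pretrain_stack` are hypothetical. The supervised fine-tuning step is left out; the returned weights are what you would use to initialize it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_autoencoder(X, n_hidden, epochs=300, lr=0.1, seed=0):
    """Fit a one-hidden-layer autoencoder to X by gradient descent.

    Returns the encoder weights, encoder bias, and the per-epoch mean
    squared reconstruction error.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W1 = rng.normal(0.0, 0.1, (d, n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.1, (n_hidden, d))
    b2 = np.zeros(d)
    losses = []
    for _ in range(epochs):
        H = sigmoid(X @ W1 + b1)      # encode
        R = H @ W2 + b2               # decode (linear)
        err = R - X
        losses.append(float(np.mean(err ** 2)))
        # Backpropagate the loss sum(err**2) / (2 * n).
        dW2 = H.T @ err / n
        db2 = err.mean(axis=0)
        dZ = (err @ W2.T) * H * (1.0 - H)
        dW1 = X.T @ dZ / n
        db1 = dZ.mean(axis=0)
        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2
    return W1, b1, losses

def pretrain_stack(X, layer_sizes):
    """Greedy layer-wise pre-training: each layer is an autoencoder
    trained on the features produced by the layers below it."""
    params, histories, inputs = [], [], X
    for n_hidden in layer_sizes:
        W, b, losses = train_autoencoder(inputs, n_hidden)
        params.append((W, b))
        histories.append(losses)
        inputs = sigmoid(inputs @ W + b)  # features for the next layer
    return params, histories

# Toy data: 200 unlabeled examples with 8 inputs each (made up).
X = np.random.default_rng(1).uniform(0.0, 1.0, size=(200, 8))
params, histories = pretrain_stack(X, layer_sizes=[4, 2])

if __name__ == "__main__":
    for i, ((W, b), h) in enumerate(zip(params, histories), start=1):
        print(f"layer {i}: weights {W.shape}, "
              f"reconstruction error {h[0]:.4f} -> {h[-1]:.4f}")
```

Note that no labels appear anywhere above. They only enter in the final supervised stage, where `params` would serve as the starting weights instead of a random initialization.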