Neural nets can now summarize long texts, or generate them given a query

I finally got around to writing another post after an unfortunately long break.
This one briefly summarizes the papers I found most interesting over the last month or so. What I find noteworthy about this review is that in writing it, I came across two different works, both interesting, that actually solve opposite tasks, as mentioned in the title.

Deep Learning Markov Random Field for Semantic Segmentation – This one introduces the deep parsing network (DPN) for semantic segmentation. A single CNN jointly learns and infers the unary (per-pixel) and pairwise (inter-pixel) terms of an MRF model. Since it is based on CNNs, it is easily parallelized and sped up. It achieves state-of-the-art results on the VOC12, Cityscapes and CamVid datasets.
Deep Learning Markov Random Field for Semantic Segmentation

DropNeuron: Simplifying the structure of deep neural networks – A novel approach to optimizing a deep neural network by regularizing its architecture. This is essentially a mechanism for dropping neurons during training, which allows one to construct simpler deep nets with comparable performance that are considerably faster and lighter. Code is provided, including examples.
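
The paper's own regularizers aren't reproduced here; as a rough sketch of the general idea of driving whole neurons to zero so they can be pruned, here is a group-sparsity penalty over each hidden neuron's incoming and outgoing weights (an illustration under that assumption, not DropNeuron's exact formulation):

```python
import torch
import torch.nn as nn

# Toy fully connected net; we penalize whole rows/columns of the weight
# matrices so entire hidden neurons can be driven towards zero and pruned.
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

def group_sparsity_penalty(layer_in, layer_out, lam=1e-4):
    # One "group" per hidden neuron: its incoming weights (a row of the first
    # weight matrix) and its outgoing weights (a column of the second).
    incoming = layer_in.weight.pow(2).sum(dim=1).sqrt()   # shape: [256]
    outgoing = layer_out.weight.pow(2).sum(dim=0).sqrt()  # shape: [256]
    return lam * (incoming.sum() + outgoing.sum())

x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(net(x), y) + group_sparsity_penalty(net[0], net[2])
loss.backward()  # train as usual; groups that end up near zero can be pruned afterwards
```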

Deep Learning Relevance: Creating Relevant Information (as Opposed to Retrieving it) – This work trains an RNN that, given a query, synthesizes a relevant document containing information on that query. In a user experiment, the synthetic documents created by this approach were rated the most relevant.

Learning Fine-Scaled Depth Maps from Single RGB Images – This work presents a multi-scale ConvNet with skip-fusion layers for inferring a depth map from a single RGB image. The depth maps are competitive with the state of the art and also lead to accurate, detail-rich 3D reconstructions, as you can see in the rightmost example below:
Learning Fine-Scaled Depth Maps from Single RGB Images

Exploring the Depths of Recurrent Neural Networks with Stochastic Residual Learning – This is a technical report, but it is so filled with innovations that I assume that status is only temporary. They introduce what they call a Res-ENN, an equivalent for RNNs of ResNets (a highly successful network structure that facilitates the training of ultra-deep CNNs). In addition, to allow ultra-deep RNNs, they make use of two regularization techniques: one that drops layers at random (stochastic depth), and one that drops timesteps (i.e. words in a sentence), namely stochastic timesteps. This new type of network, with its corresponding regularization techniques, is used for the sentiment classification task in NLP.
Exploring the Depths of Recurrent Neural Networks with Stochastic Residual Learning

A Hierarchical Model for Text Autosummarization – A hierarchical LSTM encoder-decoder model is used for the task of text summarization, with good results. Here’s an example:

Original document: official says number ’ of emails copyright 2015 cable news network/turner broadcasting system , inc. all rights reserved . this material may not be published , broadcast , rewritten , or redistributed . an email chain between former secretary of state hillary clinton and of u.s. central command david petraeus from january and february 2009 is raising questions about whether some of the emails on clinton ’s private email server are mistakenly deemed personal and not included among the 55,000 pages of emails she turned over to the state department .
Original title: new hillary clinton email chain discovered
Generated summary: hillary clinton email service discovered

Pretty cool, right?


Deep Depth Super-Resolution: Learning Depth Super-Resolution using Deep Convolutional Neural Network – An end-to-end CNN is trained to map low-resolution depth images to high-resolution ones. State-of-the-art performance is achieved on various benchmarks. Depth statistics are used as a prior in the regularization of the network.
Deep Depth Super-Resolution

Fast Robust Monocular Depth Estimation for Obstacle Detection with Fully Convolutional Networks – The authors propose a fully convolutional encoder-decoder network that extracts a depth map from an image plus an optical flow estimate computed by the Brox algorithm. In addition to real training data, simulated data generated with the Unreal engine is used. This approach is intended as a basis for an obstacle detection system.
Fast Robust Monocular Depth Estimation


Geo-location and domain adaptation are the focus of the last two weeks’ paper review

This paper introduces Convolutional Channel Features (CCF) – an integrated approach that mixes CNNs with Filtered Channel Features, an older approach that uses boosting to train a decision forest. The idea is to take only a few “low level” layers from a pre-trained CNN and feed their output to the forest model. The authors claim that this approach can generalize to a new task better than the fine-tuning strategy commonly used in deep learning. It is also less prone to overfitting, since the forest model is small, with only on the order of tens of thousands of parameters, a property that also enables real-time performance and memory efficiency.
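
As a rough sketch of that pipeline (the layer cut-off, the VGG-16 backbone and the random forest below are my assumptions for illustration; the paper's exact channel features and boosted forest differ):

```python
import torch
import torchvision.models as models
from sklearn.ensemble import RandomForestClassifier

# Freeze only the first few convolutional layers of a pre-trained CNN
# and use them as a fixed "low level" feature extractor.
backbone = models.vgg16(pretrained=True).features[:5].eval()

def extract_features(images):            # images: [N, 3, H, W] float tensor
    with torch.no_grad():
        fmaps = backbone(images)         # low-level feature maps
    return fmaps.flatten(start_dim=1).numpy()

# A small forest on top of the frozen features (the paper uses boosted trees;
# a random forest is used here just to keep the sketch short).
clf = RandomForestClassifier(n_estimators=100)
# clf.fit(extract_features(train_images), train_labels)
# preds = clf.predict(extract_features(test_images))
```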

Convolutional Channel Features – A mix between CNNs and boosting forest model


This paper addresses the problem of domain adaptation: transferring both a classifier and its learned features from one problem domain to another. They further assume that labeled examples are available in the source domain, while learning in the target domain is unsupervised. They take inspiration from residual networks, a concept used in the winning submission of ILSVRC 2015. The residual unit serves as the building block of the proposed Residual Transfer Network (RTN), a deep neural network for domain adaptation. It is trained end-to-end on the source domain’s labeled data, and the resulting classifier can then predict labels for data from the new domain, where training labels were unavailable.


“Learning over long time lags” is a review paper centered on RNNs, LSTMs, and the various learning methods and building blocks used extensively in problems involving time-series data. It should definitely be on your reading list if you plan to enter this field.


PlaNet is a deep CNN trained by Google to tell where in the world a particular image was taken, just from its pixel values. They used millions of geo-tagged images for training. They treat this as a classification problem by dividing the earth into small discrete cells: the network is trained to output, for a given image, a probability distribution over the geographic cells in which it might have been taken. The results are impressive, as you can see in the example below.
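
A toy version of the geo-cell idea (the actual paper uses an adaptive partitioning of the globe; the fixed 5-degree grid below is only an assumption for illustration):

```python
import numpy as np

LAT_BINS, LON_BINS = 36, 72   # a fixed 5-degree grid; PlaNet's cells are adaptive

def latlon_to_cell(lat, lon):
    """Map a geo-tag to a discrete cell id used as the classification label."""
    lat_idx = int(np.clip((lat + 90) / 180 * LAT_BINS, 0, LAT_BINS - 1))
    lon_idx = int(np.clip((lon + 180) / 360 * LON_BINS, 0, LON_BINS - 1))
    return lat_idx * LON_BINS + lon_idx

# e.g. an image geo-tagged near the Eiffel Tower (48.858 N, 2.294 E)
label = latlon_to_cell(48.858, 2.294)
# A CNN is then trained with a softmax over LAT_BINS * LON_BINS classes, and its
# output for a new image is a probability distribution over the geographic cells.
```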

On the left is the input image, and on the right the probability of it being taken at each location. The Eiffel Tower is placed in Paris with certainty, while the beach, for instance, is more ambiguous; still, all of the possible locations make sense.


This paper presents an interesting application of deep learning: image quality assessment. Several design choices are compared, and eventually a good correlation with human image quality judgments is reached.


A decent deep learning PC in under $1000

In this post I will try to give you an idea of how to do deep learning on a shallow budget. I will take a shot at coming up with a decent computer build that should allow a beginner to start out in deep learning, and perhaps become serious about it. Not every deep learning enthusiast has access to a huge Tesla GPU cluster. Even those who do have access to good hardware at work or at university might want to do side projects or play around with ideas on their own time. To do this, you need your own PC. Most people don’t spend more than $1000 on a PC. That is a very tight budget for deep learning, but I will try to show that it’s possible.

The single most important component of your new deep learning machine is definitely the GPU. It’s also the most expensive one, so naturally this post will focus on it quite a bit. Unless you’re an OpenCL guru who wants to write the next optimized deep learning framework, you probably want an NVIDIA GPU. I know AMD makes great graphics cards for gamers, but so far it has missed the boat when it comes to deep learning. There is some work on releasing OpenCL versions of deep learning frameworks, but it’s nowhere near as developed as CUDA. My feeling is that 99% of researchers and developers in the field currently use NVIDIA GPUs, so until this changes, you should probably be no different.

Which GPU specs matter most?

The way I see it, memory size is what you should be most careful about. If your GPU doesn’t have enough memory for the network that you want to train (or deploy), then you’re out of luck, and you will have to either:

  • Change something fundamental in the way you were going to solve your problem, or
  • Get your hands dirty and manually change the implementation of your deep learning framework to favor memory efficiency over speed (most people don’t want to get into this).

Unfortunately, until NVIDIA releases its new Pascal architecture, we’re all stuck with fairly low-memory GPUs, unless of course you’re willing to spend thousands of dollars on server-grade cards, but that’s definitely not the focus of this post. Hopefully GPU memory will become more abundant as NVIDIA realizes its huge importance for deep learning and AI. Meanwhile, 6GB is probably the absolute best you can do within the highly restrictive budget I’ve set. 4GB is not bad either; I have done a lot of work on a 4GB GTX 980. Using Caffe with that card, I managed to train GoogLeNet with some minor modifications on small images, but there was not much room left to grow.

How about processing power?

The GPU’s processing power, which ultimately determines how fast you can train and run your models, is measured in floating point operations per second (FLOPS). The theoretical single-precision FLOPS of modern GPU architectures is directly proportional to the number of CUDA cores times the core clock speed. A very nice GPU like the GTX Titan X, which alone costs above $1000 (and by the way has a generous 12GB of memory), does 6144 GFLOPS (giga-FLOPS). The GTX 980 that I mentioned I use often does 4612, and goes for about $500. This page can come in very handy for comparing the specs of different NVIDIA GPUs.

GTX Titan X – the wet dream of every deep learning enthusiast.
Another important property is memory bandwidth. It defines how fast data can be transferred to and from the GPU memory. The Titan X has a bandwidth of 336 GB/sec, while the GTX 980 has 224 GB/sec.
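
For the curious, the theoretical single-precision numbers quoted above follow directly from the spec sheets: two floating point operations (one fused multiply-add) per CUDA core per clock cycle. A quick sanity check using the cards' base clocks:

```python
def peak_gflops(cuda_cores, clock_ghz):
    # 2 floating point ops (one fused multiply-add) per core per clock cycle
    return 2 * cuda_cores * clock_ghz

print(peak_gflops(3072, 1.000))  # GTX Titan X: ~6144 GFLOPS
print(peak_gflops(2048, 1.126))  # GTX 980:     ~4612 GFLOPS
print(peak_gflops(1024, 1.127))  # GTX 960:     ~2308 GFLOPS
```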

What’s important besides the GPU?

One thing that’s perhaps not obvious is an SSD drive. The speed at which you read and write data to disk can be a significant bottleneck when dealing with large databases, so I definitely recommend getting an SSD with at least 200GB of storage space. Usually people install their operating system and most-used software on the SSD, so that they feel the speed benefit every time they turn the computer on. What I usually do when SSD storage is limited is actually to keep it empty, install the OS on a regular HDD, and use the SSD only for reading and writing data for deep learning purposes. These days, however, the price of SSD storage has dropped significantly, so you shouldn’t worry about it too much.

Besides the SSD, you’d probably want at least 16GB of RAM and a decent Intel Core i7 processor, though I would go for one from one or two generations back. The performance improvement of each new processor generation is usually marginal, and besides, you will do most of your heavy computation on the GPU anyway.

Let’s go shopping

Let’s see how much you would spend on everything besides the GPU; then we’ll know how much of our designated $1000 budget is left for GPU shopping:

Central processing unit (CPU)

A very good choice for a CPU, which I recommend because I own it myself after doing some research, is the 4-core/8-thread Intel Core i7-4790K, which you can get on Amazon starting from $320.

Intel core-i7 CPU [Amazon]

Motherboard

I recommend the Gigabyte GA-Z97X-SLI, which you can get for around $120 on Amazon or Newegg. This is what I happily own myself. It has an ATX form factor, which is the most common size (large, but not unusually so). It is compatible with the Core i7-4790K CPU, and it supports an extra GPU card, in case you decide to buy another one later and boost your deep learning performance.

Gigabyte GA-Z97X-SLI. [Amazon]
Hard drives

A Crucial BX200 240GB SSD drive can be bought for about $60 today. Add a 1TB HDD drive for $50 and you’re good to go.

Crucial BX200 240GB SSD drive [Newegg]

Random access memory (RAM)

A Crucial 16GB DDR3 memory kit can be bought for about $60 on Amazon at the time of writing.

Power supply unit (PSU)

Using this nice power supply calculator, I entered the components recommended so far, with two GTX 980 GPUs just in case. The result is a recommended power supply of 659 watts, so a 700W PSU should be well on the safe side. I’ll let you do your own choosing, since I don’t have much to contribute in terms of picking a specific power supply unit, but the cost should be anywhere between $35 and $80. Let’s assume $60 off our budget as a fair intermediate value.


Case

This one I’m definitely leaving for you to choose. Costs vary from $20 to hundreds of dollars, because there’s no limit to how crazy you can go about choosing a case. I’ll assume that since we’re dealing with a budget PC, you won’t want to spend more than $50 on one.
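
Before picking the GPU, here's the running total of the parts chosen so far (using the middle-of-the-road estimates for the PSU and the case):

```python
parts = {
    "CPU (Intel Core i7-4790K)":          320,
    "Motherboard (Gigabyte GA-Z97X-SLI)": 120,
    "SSD (Crucial BX200 240GB)":           60,
    "HDD (1TB)":                           50,
    "RAM (16GB DDR3)":                     60,
    "PSU (700W, mid estimate)":            60,
    "Case (mid estimate)":                 50,
}
total = sum(parts.values())
print(total)         # 720
print(1000 - total)  # 280 left for the GPU
```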

Now for the fun part – GPU

Everything above adds up to $720, which means you’re left with only $280 for the GPU. That is not much at all. There is no way it will buy you a 6GB GPU, so 4GB it is. The 4GB GPU with the most FLOPS that can be bought new for under $280 is the GeForce GTX 960. It does 2308 GFLOPS, about half of the GTX 980, which naturally costs about twice as much.

Under such strict budget constraints, it actually makes a lot of sense to try and find a used GPU. GPUs from older generations can actually outperform (at least theoretically, in terms of specs) some of the newer GPUs.

Have a look at the GTX 770 that is sold on Amazon from $278. It does 3213 GFLOPS and has memory bandwidth of 224 GB/s. Not bad at all. And there’s your brand new (well almost) $1000 deep learning PC!

If you’re willing to go above $1000 and spend $500 on a GPU, then my most recommended options are either:

  • A new GTX 980 – 4GB memory, 4612 GFLOPS, 224 GB/s memory bandwidth.
  • This used GTX 780 that is an older generation but has much better specs: a whopping 6GB memory, 3977 GFLOPS and 288 GB/s memory bandwidth. Very nice, and in my opinion, well worth buying a used one. A new one costs $200 more.

From improved dropout to a brand new training algorithm – notable papers from the last 30 days

A new algorithm for training RNNs has been proposed in this paper, titled “Training recurrent neural networks by diffusion”. Based on preliminary results, the proposed algorithm achieves performance similar to SGD in up to 25% fewer training epochs. The basic claim is that if you take a complex objective function, an optimal way of gradually walking towards a minimum is to diffuse that function according to the heat equation, whose solution is known analytically. Another nice property of the new approach is that several principles used separately in deep learning to improve convergence, such as smart initialization, noise injection (dropout, SGD), and learning rate annealing, arise naturally from the proposed method without being applied explicitly.
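
The link to the heat equation is that diffusing the objective for time t is equivalent to convolving it with a Gaussian, so one can descend on progressively less-smoothed versions of the function. Here is a toy 1-D sketch of that continuation idea (my own illustration of the principle, not the paper's algorithm):

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

xs = np.linspace(-5, 5, 2001)
f = np.sin(5 * xs) + 0.1 * xs ** 2         # a wiggly objective with many local minima

x = -4.0                                    # start far from the global basin
for sigma in [1.5, 0.75, 0.3, 0.1, 0.0]:    # smoothing widths, annealed down to zero
    pts = sigma / (xs[1] - xs[0])           # convert width to grid points for the filter
    f_s = gaussian_filter1d(f, pts) if pts > 0 else f   # diffusion = Gaussian blur
    grad = np.gradient(f_s, xs)             # gradient of the diffused objective
    for _ in range(500):                    # plain gradient descent at this level
        x -= 0.05 * np.interp(x, xs, grad)
print(x)  # should land near the global minimum around x = -0.3, not a nearby local one
```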

PN-Net is a new approach for deciding whether two image patches are similar or not. The CNN takes three image patches as input: two similar ones, depicting the same point in space from different viewpoints, and a third negative example, a patch that is not similar to the other two. The proposed loss function, SoftPN, is a function of the distances between each pair of the three input patches:
PN-Net
They use a CNN of relatively moderate depth, which results in a very efficient and fast approach, and still achieves state-of-the-art performance!
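
The exact SoftPN formulation is given in the paper; as a rough sketch of the setup, here is a plain margin-based loss over the three patch descriptors, where the matching pair should end up closer than either patch is to the negative (a generic triplet-style loss standing in for SoftPN, not the paper's function):

```python
import torch
import torch.nn.functional as F

def patch_triplet_loss(desc_p1, desc_p2, desc_neg, margin=1.0):
    """desc_*: [batch, D] descriptors produced by the patch CNN."""
    d_pos = F.pairwise_distance(desc_p1, desc_p2)             # matching pair
    d_neg = torch.min(F.pairwise_distance(desc_p1, desc_neg),
                      F.pairwise_distance(desc_p2, desc_neg))  # harder of the two negatives
    return F.relu(d_pos - d_neg + margin).mean()
```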

This paper presents an end-to-end learning algorithm based on CNNs and LSTMs which significantly outperforms traditional approaches in recognizing emotions from speech.

Google DeepMind proposes what they call Pixel RNNs – a state-of-the-art tool for image completion.
They train their model to predict the RGB values of occluded pixels given the visible ones. Learning is unsupervised, since all they need are images; the data is virtually unlimited, so there is room for improvement simply by using larger models, assuming the necessary computing power.

In this paper, a deep learning based saliency map detector is proposed. Saliency maps are essentially binary images that contain only the “important” objects in an image, as a human would judge them. The idea in this paper is elegant: they take a CNN that can classify an input image into an object category, then use backpropagation to modify the input image until the CNN no longer outputs the correct classification. The difference between the original image and the newly generated one can be regarded as a good candidate for a saliency map. I like this simple and sensible idea.
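
A minimal sketch of that idea (assuming a PyTorch classifier that outputs logits; the function name, update rule and stopping criterion are my simplifications, not the paper's exact procedure):

```python
import torch

def saliency_by_unclassifying(model, image, label, lr=0.01, steps=100):
    """image: [1, 3, H, W] tensor; label: the correct class index."""
    x = image.clone().requires_grad_(True)
    for _ in range(steps):
        score = model(x)[0, label]        # score the network gives the correct class
        score.backward()                  # gradient of that score w.r.t. the pixels
        with torch.no_grad():
            x -= lr * x.grad              # nudge pixels to *lower* the correct-class score
            x.grad.zero_()
            if model(x).argmax(dim=1).item() != label:
                break                     # stop once the prediction flips
    return (image - x.detach()).abs().sum(dim=1)  # per-pixel change = saliency candidate
```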

An improvement to the dropout regularization method is proposed in this paper. The proposed method achieved a 10% accuracy improvement and a 50% reduction in training time on the CIFAR-100 dataset. The claim is that a constant dropout probability for every neuron is sub-optimal, so the authors propose a dynamic approach for deciding on the probability with which to drop neurons in each training iteration.
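
As a trivial illustration of what a non-constant dropout probability could look like (a linear ramp of my own choosing, not the schedule proposed in the paper):

```python
def dropout_prob(iteration, total_iters, p_min=0.1, p_max=0.5):
    # Anneal the drop probability over training instead of keeping it fixed.
    frac = min(iteration / float(total_iters), 1.0)
    return p_min + frac * (p_max - p_min)

# e.g. before each training step: dropout_layer.p = dropout_prob(it, total_iters)
```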

Deep learning – a brief catch up

You may have heard that the term “deep learning” is mostly a buzzword for re-branded artificial neural networks, with some changes. One of the more important changes is the introduction of a new type of neural network layer: the convolutional layer. Inspired by the Neocognitron of 1980, convolutional layers were developed in this paper by T. Homma in 1988, and further developed by Yann LeCun, who in 1998 successfully applied the LeNet-5 architecture to hand-written digit recognition on the MNIST database. This concept is used to this day in practical applications.

In the 90’s and early 2000’s some work was done on convolutional neural networks (CNNs), but it was somewhat modest in scope, and nowhere near the amount of successful work being done since 2012.

What happened in 2012 to enable the whole deep learning “revolution”, besides the continuous advancement of computing power and the availability of huge databases, is the work by Alex Krizhevsky, which won the ImageNet classification challenge by a significant margin using a CNN later termed “AlexNet”. Here are the results for 2012:

Imagenet 2012 results

In order to successfully train his deep CNN, Krizhevsky applied two simple techniques that proved to be powerful. One is the ReLU nonlinearity, and the other is Dropout regularization. This lecture by Andrej Karpathy nicely explains the benefits of ReLU, and this paper introduced the dropout technique.
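
Both techniques are one-liners in any modern framework. Here is what the fully connected part of an AlexNet-style classifier looks like with them (a PyTorch sketch of my own, not the original implementation):

```python
import torch.nn as nn

# ReLU keeps gradients from dying through deep stacks of layers, and Dropout
# randomly zeroes activations during training to fight overfitting.
classifier = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # the rate used in AlexNet's fully connected layers
    nn.Linear(4096, 1000),  # 1000 ImageNet classes
)
```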

The 2012 results clearly demonstrate the power of CNNs. It’s also interesting to see the top scorers of the following year:

Imagenet 2013 results

It became a much closer battle at the top, as everyone realized that CNNs were the new state of the art for image classification.

In the latest ImageNet challenge, held in 2015, the winning team’s top-5 classification error was as low as 3.5%. This was achieved by training a VERY deep CNN with roughly 150 layers, which was made possible thanks to a nice trick proposed by Kaiming He in this paper.
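
The trick is the residual (skip) connection: each block only has to learn a correction on top of the identity mapping, which keeps gradients flowing through very deep stacks. A bare-bones sketch (batch normalization omitted for brevity):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = x + F(x): the block only learns the residual F on top of the identity."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(x + out)   # the identity shortcut is what makes very deep nets trainable
```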

In addition to image classification, object recognition and localization, deep learning has quickly enabled success in many other computer vision applications, including semantic segmentation, optical flow estimation, image captioning, image matching and measuring similarity between image patches, and even “lower level” tasks such as super-resolution and denoising.

Deep learning has also been applied very successfully to tasks outside computer vision, such as natural language processing, speech understanding and translation, and others.

Here I’ve mentioned mostly convolutional neural networks. While they are probably the primary approach that gave rise to the whole “deep learning” buzz in recent years, they are not the only learning algorithm considered “deep”. Here is a good place to start reading about the various types of deep learning algorithms, both supervised and unsupervised/semi-supervised.

Deep learning is a rapidly evolving technology, in terms of algorithmic progress, software development, and hardware alike. Searching the phrase “deep learning neural networks” in Google Scholar just now returned around 130 publications from the last 30 days alone, over 4 papers per day on average. As a computer vision algorithms developer, it is obvious why I became highly interested in deep learning. I decided to start this blog as a platform for sharing my thoughts, findings, and tips on this fascinating field. Hopefully writing it will also give me another reason to stay up to date with the rapidly evolving academic literature and industrial progress.

Thanks for reading and I hope you will enjoy coming back!