<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/"
    xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/" version="2.0">
    <channel>
        
        <title>
            <![CDATA[ self-driving cars - freeCodeCamp.org ]]>
        </title>
        <description>
            <![CDATA[ Browse thousands of programming tutorials written by experts. Learn Web Development, Data Science, DevOps, Security, and get developer career advice. ]]>
        </description>
        <link>https://www.freecodecamp.org/news/</link>
        <image>
            <url>https://cdn.freecodecamp.org/universal/favicons/favicon.png</url>
            <title>
                <![CDATA[ self-driving cars - freeCodeCamp.org ]]>
            </title>
            <link>https://www.freecodecamp.org/news/</link>
        </image>
        <generator>Eleventy</generator>
        <lastBuildDate>Sun, 10 May 2026 14:16:07 +0000</lastBuildDate>
        <atom:link href="https://www.freecodecamp.org/news/tag/self-driving-cars/rss.xml" rel="self" type="application/rss+xml" />
        <ttl>60</ttl>
        
            <item>
                <title>
                    <![CDATA[ Image Augmentation: Make it rain, make it snow. How to modify photos to train self-driving cars ]]>
                </title>
                <description>
                    <![CDATA[ By Ujjwal Saxena Image Augmentation is a technique for taking an image and using it to generate new ones. It’s useful for doing things like training a self-driving car. Think of a person driving a car on a sunny day. If it starts raining, they may ... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/image-augmentation-make-it-rain-make-it-snow-how-to-modify-a-photo-with-machine-learning-163c0cb3843f/</link>
                <guid isPermaLink="false">66c357bccf1314a450f0d6b8</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ image processing ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-driving cars ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 09 Apr 2018 04:02:55 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*WIFnuUgYya_oEEGrx650DQ.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Ujjwal Saxena</p>
<p>Image Augmentation is a technique for taking an image and using it to generate new ones. It’s useful for doing things like training a self-driving car.</p>
<p>Think of a person driving a car on a sunny day. If it starts raining, they may initially find it difficult to drive in rain. But slowly they get accustomed to it.</p>
<p>An artificial neural network, too, finds it confusing to drive in an environment it hasn’t seen before. There are various augmentation techniques like flipping, translating, adding noise, or changing color channels.</p>
<p>In this article, I’ll explore the weather part of this. I used the <strong>OpenCV</strong> library for processing images. I found it pretty easy after a while, and was able to introduce various weather scenarios into an image.</p>
<p>I’ve pushed a fully implemented <strong>Jupyter Notebook</strong> you can play with on <a target="_blank" href="https://github.com/ujjwalsaxena">GitHub</a>.</p>
<p>Let’s have a look.</p>
<p>I’ll first show you an original test image and will then augment it.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/DPVOfe-5jaoOME91KftyK1dlzvHu2FYMyzrO" alt="Image" width="600" height="400" loading="lazy"></p>
<h3 id="heading-sunny-and-shady"><strong>Sunny and Shady</strong></h3>
<p>Adding a random sunny or shady effect changes the image’s brightness. This is a quick and easy transformation to perform.</p>
<pre><code>import cv2
import numpy as np

def add_brightness(image):
    image_HLS = cv2.cvtColor(image, cv2.COLOR_RGB2HLS) ## Conversion to HLS
    image_HLS = np.array(image_HLS, dtype=np.float64)
    random_brightness_coefficient = np.random.uniform() + 0.5 ## generates a value between 0.5 and 1.5
    image_HLS[:,:,1] = image_HLS[:,:,1]*random_brightness_coefficient ## scale pixel values up or down for channel 1 (Lightness)
    image_HLS[:,:,1][image_HLS[:,:,1]&gt;255] = 255 ## sets all values above 255 to 255
    image_HLS = np.array(image_HLS, dtype=np.uint8)
    image_RGB = cv2.cvtColor(image_HLS, cv2.COLOR_HLS2RGB) ## Conversion to RGB
    return image_RGB
</code></pre><p>The brightness of an image can be changed by scaling the pixel values of the “Lightness” channel (channel 1) of the image in HLS color space. Converting the image back to RGB gives the same scene with enhanced or suppressed lighting.</p>
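<p>If you want to try these functions on your own images, here’s a minimal sketch of how they can be called. It assumes a local file named <code>test_image.jpg</code> (a hypothetical path) and that OpenCV and NumPy are installed:</p>
<pre><code>import cv2

## OpenCV loads images as BGR, so convert to RGB first,
## since the functions in this article expect RGB input
image = cv2.cvtColor(cv2.imread('test_image.jpg'), cv2.COLOR_BGR2RGB)
bright_image = add_brightness(image)

## convert back to BGR before saving the result to disk
cv2.imwrite('bright_image.jpg', cv2.cvtColor(bright_image, cv2.COLOR_RGB2BGR))
</code></pre>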
<p><img src="https://cdn-media-1.freecodecamp.org/images/tny-s9fdRMRzn1zfmy1e3OIK82csRqGJ5Yv1" alt="Image" width="600" height="400" loading="lazy">
<em>Sunny</em></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/D-cHi--aKE1HWME2vjtrkshXS8JIbAvljuOx" alt="Image" width="600" height="400" loading="lazy">
<em>Shady</em></p>
<h3 id="heading-shadows"><strong>Shadows</strong></h3>
<p>To a car, a shadow is nothing but a dark portion of an image, which can also be bright at times. So a self-driving car should always learn to drive with or without shadows. Random brightness changes on the hills or in the woods often boggle a car’s perception if it’s not trained properly. This is even more prevalent on sunny days in cities, where buildings of different heights let beams of light peep through.</p>
<p>Brightness is good for perception, but uneven, sudden, or excessive brightness creates perception issues. Let’s generate some fake shadows.</p>
<pre><code>def generate_shadow_coordinates(imshape, no_of_shadows=1):
    vertices_list = []
    for index in range(no_of_shadows):
        vertex = []
        for dimensions in range(np.random.randint(3, 15)): ## number of vertices of the shadow polygon
            vertex.append((imshape[1]*np.random.uniform(), imshape[0]//3 + imshape[0]*np.random.uniform()))
        vertices = np.array([vertex], dtype=np.int32) ## single shadow vertices
        vertices_list.append(vertices)
    return vertices_list ## list of shadow vertices
</code></pre><pre><code>def add_shadow(image, no_of_shadows=1):
    image_HLS = cv2.cvtColor(image, cv2.COLOR_RGB2HLS) ## Conversion to HLS
    mask = np.zeros_like(image)
    imshape = image.shape
    vertices_list = generate_shadow_coordinates(imshape, no_of_shadows) ## getting the list of shadow vertices
    for vertices in vertices_list:
        cv2.fillPoly(mask, vertices, 255) ## adding all shadow polygons onto an empty mask; a single 255 denotes the red channel only
    image_HLS[:,:,1][mask[:,:,0]==255] = image_HLS[:,:,1][mask[:,:,0]==255]*0.5 ## wherever the red channel is hot, lower the "Lightness" channel's brightness
    image_RGB = cv2.cvtColor(image_HLS, cv2.COLOR_HLS2RGB) ## Conversion to RGB
    return image_RGB
</code></pre><p>OpenCV’s <code>fillPoly()</code> function is really handy in this case. Let’s create some random vertices and impose the polygon on an empty mask using <code>fillPoly()</code>. Having done this, the only thing left to do is to check the mask for hot pixels and reduce the “Lightness” in the HLS image wherever these hot pixels are found.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/uUsWjNO5bi7SPGP6DsfUdmtY-onV4tblz7eG" alt="Image" width="600" height="400" loading="lazy">
<em>Random shadow polygon on the road</em></p>
<h3 id="heading-snow"><strong>Snow</strong></h3>
<p>Well, this is something new. We often wonder how our vehicle would behave on snowy roads. One way to test that is to get pictures of snow-clad roads, or to modify existing images to get a similar effect. This effect is not a complete substitute for real snowy roads, but it’s an approach worth trying.</p>
<pre><code>def add_snow(image):
    image_HLS = cv2.cvtColor(image, cv2.COLOR_RGB2HLS) ## Conversion to HLS
    image_HLS = np.array(image_HLS, dtype=np.float64)
    brightness_coefficient = 2.5
    snow_point = 140 ## increase this for more snow
    image_HLS[:,:,1][image_HLS[:,:,1]&lt;snow_point] = image_HLS[:,:,1][image_HLS[:,:,1]&lt;snow_point]*brightness_coefficient ## scale pixel values up for channel 1 (Lightness)
    image_HLS[:,:,1][image_HLS[:,:,1]&gt;255] = 255 ## sets all values above 255 to 255
    image_HLS = np.array(image_HLS, dtype=np.uint8)
    image_RGB = cv2.cvtColor(image_HLS, cv2.COLOR_HLS2RGB) ## Conversion to RGB
    return image_RGB
</code></pre><p>Yup! That’s it. This code generally whitens the darkest parts of the image, which are mostly roads, trees, mountains and other landscape features, using the same HLS “Lightness” increase method used in the other approaches above. This technique doesn’t work well for dark images, but you can modify it to do so. Here’s what you get:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/6ZAXeTp2IK9QmJN8f9hUTqwrVMWNbVuOSANj" alt="Image" width="600" height="400" loading="lazy">
<em>winter is here</em></p>
<p>You can tweak some parameters in the code for more or less snow than this. I have tested this on other images too, and this technique gives me chills.</p>
<h3 id="heading-rain"><strong>Rain</strong></h3>
<p>Yes, you heard that right. Why not rain? When humans experience difficulty driving in rain, why should vehicles be spared? In fact, this is one of the situations for which I want my self-driving car to be trained the most. Slippery roads and blurred vision are risky, and cars should know how to handle them.</p>
<pre><code>def generate_random_lines(imshape, slant, drop_length):
    drops = []
    for i in range(1500): ## if you want heavy rain, try increasing this
        if slant &lt; 0:
            x = np.random.randint(slant, imshape[1])
        else:
            x = np.random.randint(0, imshape[1] - slant)
        y = np.random.randint(0, imshape[0] - drop_length)
        drops.append((x, y))
    return drops

def add_rain(image):
    imshape = image.shape
    slant_extreme = 10
    slant = np.random.randint(-slant_extreme, slant_extreme)
    drop_length = 20
    drop_width = 2
    drop_color = (200, 200, 200) ## a shade of gray
    rain_drops = generate_random_lines(imshape, slant, drop_length)
    for rain_drop in rain_drops:
        cv2.line(image, (rain_drop[0], rain_drop[1]), (rain_drop[0] + slant, rain_drop[1] + drop_length), drop_color, drop_width)
    image = cv2.blur(image, (7, 7)) ## rainy views are blurry
    brightness_coefficient = 0.7 ## rainy days are usually shady
    image_HLS = cv2.cvtColor(image, cv2.COLOR_RGB2HLS) ## Conversion to HLS
    image_HLS[:,:,1] = image_HLS[:,:,1]*brightness_coefficient ## scale pixel values down for channel 1 (Lightness)
    image_RGB = cv2.cvtColor(image_HLS, cv2.COLOR_HLS2RGB) ## Conversion to RGB
    return image_RGB
</code></pre><p>Here I again generated random points all over the image, then used OpenCV’s <code>line()</code> function to draw small lines at each of them. I added a random slant to the rain drops for the feel of actual rain, reduced the image’s brightness because rainy days are usually shady, and blurred the image because of the rain. You can change the dimensions of the blur filter and the number of rain drops for the desired effect.</p>
<p>Here is the result:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/buoiIOE1acHFNb6-nFEqEMGH7Tlq7fE82mEV" alt="Image" width="600" height="400" loading="lazy">
<em>Fake rain but not much blur</em></p>
<h3 id="heading-fog"><strong>Fog</strong></h3>
<p>This is yet another scenario that greatly hampers the vision of a self-driving car. Blurry white fluff makes it very difficult to see beyond a certain stretch and reduces the sharpness of the image.</p>
<p>Fog intensity is an important parameter when training a car to decide how much throttle to give. To code such a function, you can take random patches from all over the image and increase the image’s lightness within those patches. With a simple blur, this gives a nice hazy effect.</p>
<pre><code>def add_blur(image, x, y, hw):
    image[y:y+hw, x:x+hw, 1] = image[y:y+hw, x:x+hw, 1] + 1
    image[:,:,1][image[:,:,1]&gt;255] = 255 ## sets all values above 255 to 255
    image[y:y+hw, x:x+hw, 1] = cv2.blur(image[y:y+hw, x:x+hw, 1], (10, 10))
    return image
</code></pre><pre><code>def generate_random_blur_coordinates(imshape, hw):
    blur_points = []
    midx = imshape[1]//2 - hw - 100
    midy = imshape[0]//2 - hw - 100
    index = 1
    while midx &gt; -100 or midy &gt; -100: ## radially generating coordinates
        for i in range(250*index):
            x = np.random.randint(midx, imshape[1] - midx - hw)
            y = np.random.randint(midy, imshape[0] - midy - hw)
            blur_points.append((x, y))
        midx -= 250*imshape[1]//sum(imshape)
        midy -= 250*imshape[0]//sum(imshape)
        index += 1
    return blur_points

def add_fog(image):
    image_HLS = cv2.cvtColor(image, cv2.COLOR_RGB2HLS) ## Conversion to HLS
    imshape = image.shape
    hw = 100
    image_HLS[:,:,1] = image_HLS[:,:,1]*0.8
    haze_list = generate_random_blur_coordinates(imshape, hw)
    for haze_points in haze_list:
        image_HLS[:,:,1][image_HLS[:,:,1]&gt;255] = 255 ## sets all values above 255 to 255
        image_HLS = add_blur(image_HLS, haze_points[0], haze_points[1], hw) ## brighten and blur a haze patch at each coordinate
    image_RGB = cv2.cvtColor(image_HLS, cv2.COLOR_HLS2RGB) ## Conversion to RGB
    return image_RGB
</code></pre><p>Coding this was the hardest of all the functions above. I tried a radial approach to generating patches, since on a foggy day most of the fog is usually at the far end of the road, and vision keeps clearing as we come nearer.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Wb0JBy40QWvfm65-n0GHn24MOBHXnSrPRRzt" alt="Image" width="600" height="400" loading="lazy">
<em>Foggy Highway</em></p>
<p>It’s a really difficult task for a machine to detect nearby cars and lanes in such foggy conditions, which makes this a good way to train and test the robustness of the driving model.</p>
<h3 id="heading-torrential-rain">Torrential rain</h3>
<p>I thought of making the rain part a little better by combining fog and rain, as there is always some haze during rain, and it’s good to train the car for that too. No new function is required for this. We can achieve the effect by calling both functions sequentially, as in the sketch below.</p>
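<p>As a rough sketch, using the functions defined above, the combined effect is just one call applied to the output of the other (the function name here is my own):</p>
<pre><code>def add_torrential_rain(image):
    ## haze first, then rain streaks drawn on top of the hazy image;
    ## the order is a matter of taste
    return add_rain(add_fog(image))
</code></pre>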
<p><img src="https://cdn-media-1.freecodecamp.org/images/MzAyhI05YhGfg9hN7Adb-40MM2iZ3pCtqvtn" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The car on the right is barely visible in this image, and this is a real world scenario. We can hardly make out anything on the road in heavy rain.</p>
<p>I hope this article will help you train the model in various weather conditions. For my complete code, you can visit my <a target="_blank" href="https://github.com/UjjwalSaxena">GitHub profile</a>. And I’ve written a lot of other articles, which you can read on <a target="_blank" href="https://medium.com/@er.ujjwalsaxena">Medium</a> and on my <a target="_blank" href="https://erujjwalsaxena.wordpress.com/">WordPress site</a>.</p>
<p>Enjoy!</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Transportation is about to hit exponential changes unlike anything we’ve ever seen before. ]]>
                </title>
                <description>
                    <![CDATA[ By Adam Kell Today’s car is essentially a computer on wheels. Under the hood, you’ll find a complex computer network communicating with several sensors. These can detect a variety of issues like tire pressure, acceleration, and engine oil quality, wh... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/the-transportation-industry-is-changing-in-more-ways-than-you-expect-e6a5b7deaf38/</link>
                <guid isPermaLink="false">66c362acaf2b7c40e7d7eb33</guid>
                
                    <category>
                        <![CDATA[ Computer Vision ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-driving cars ]]>
                    </category>
                
                    <category>
                        <![CDATA[ startup ]]>
                    </category>
                
                    <category>
                        <![CDATA[ tech  ]]>
                    </category>
                
                    <category>
                        <![CDATA[ transportation  ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Mon, 27 Feb 2017 20:36:51 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*kyffL7IbxnSyyTygC2evGw.jpeg" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By Adam Kell</p>
<p>Today’s car is essentially a computer on wheels. Under the hood, you’ll find a complex computer network communicating with several sensors. These can detect a variety of issues like tire pressure, acceleration, and engine oil quality, while also allowing controls for things like speed, temperature, power doors and power windows.</p>
<p>Emissions sensors in automobiles got their start as a direct consequence of government regulation. After the EPA set forth more stringent policies for exhaust emissions, it became standard for cars in the United States to be equipped with catalytic converters. By the 1980s, oxygen sensors were pivotal in making modern-day emissions control possible by detecting and diagnosing the oxygen-to-fuel ratio expelled through the exhaust.</p>
<p>Soon after, similar components such as the oil level sensor, tire pressure sensor, fasten seatbelt light, and the check engine light were invented — designed to notify the driver when an issue was present. These sensors and indicators have been instrumental in making automobiles more reliable, mainstream, and affordable by helping consumers avoid many destructive, costly, or dangerous maintenance issues.</p>
<p><strong>The advances in the transportation industry over the next 10 years will vastly eclipse the changes over the past half century.</strong></p>
<p>These changes will not only improve the overall driving experience — the next 10 years will involve an ecosystem overhaul, such that getting from point A to point B will be unrecognizable from driving today. The intelligent autos and infrastructure of the 21st century will <em>actively suggest,</em> or in many cases <em>take control</em> of the car to protect against potential accidents and distracted driving, while providing real-time route planning and active traffic management.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/H9iwdHkdqQKPysYkpc0VN5dRFWXVvGBTLsuv" alt="Image" width="600" height="400" loading="lazy">
<em>How <a target="_blank" href="https://www.youtube.com/watch?v=tiwVMrTLUWg">driverless cars see the world</a></em></p>
<p>So what’s changed?</p>
<blockquote>
<p><strong>Onboard automotive sensors</strong> are pushing boundaries of perception, and doing so cheaper than ever — <em>cars can sense more.</em></p>
</blockquote>
<p>In previous years, the automotive industry has been hacking together hand-me-down sensors from other industries. But the scale and promise of autonomous cars has made even the most conservative automotive suppliers <a target="_blank" href="https://www.cbinsights.com/blog/auto-corporates-investing-startups/">invest heavily</a> into the dedicated autonomous vehicle sensor supply chain. <a target="_blank" href="http://pointonenav.com/">Precision location sensors</a> and services are becoming increasingly prevalent and pushing the boundaries of accuracy. Lidar, cameras, depth sensors, and radar are also fundamentally changing the perception benchmarks of vehicles. These sensors, together, will be the key to unlocking new levels of autonomy in the coming years.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/pxOb2BxtYViz1tApqoTbkUduykpaLEiTxInV" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="http://www.tesla.com/autopilot">The perception range of the sensor package in Tesla’s Autopilot system</a></em></p>
<blockquote>
<p><strong>Autonomy systems</strong> and the related infrastructure are getting a lot more sophisticated — cars can know more.</p>
</blockquote>
<p>Waymo’s autonomous miles driven now number in the <a target="_blank" href="http://www.recode.net/2017/2/2/14474800/waymo-self-driving-dmv-disengagements">millions</a>, and at a high level of autonomy. Tesla’s autopilot odometer (depending on who you ask) reads in the hundreds of millions or <a target="_blank" href="https://www.bloomberg.com/news/articles/2016-12-20/the-tesla-advantage-1-3-billion-miles-of-data">billions of miles</a>, albeit at a lower level of autonomy. These companies are leading the charge to the autonomous future, but there is a large supporting cast that will make autonomy possible.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/XOAAYyLQT-3sS4QYMbd0NbW6efUT5Rhg64e9" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="https://medium.com/waymo/accelerating-the-pace-of-learning-36f6bc2ee1d5#.z5b8c2dil">Waymo’s reported disengagements per 1000 miles driven</a></em></p>
<p>Companies focused on generating and compressing high-resolution, high-definition maps are making it less computationally intensive to solve perception problems onboard the vehicle. Tools and frameworks are being developed to make it easier to tag and annotate images.</p>
<p>Computer simulations are getting closer to having the ability to train the underlying neural networks <em>without</em> fully relying on real cars in the physical world to uncover edge cases.</p>
<p>Kits are being developed to add features like adaptive cruise control and lane keeping to stock vehicles, as well as to build a corpus of training data that teaches AI how humans actually drive.</p>
<p>Dedicated hardware, designed to run specific algorithms, is making tasks like vision far more efficient.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/4FHWhZ2aRy7DOH9CRYKOIUeG24kGfk6mqiS6" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="http://360.here.com/2015/07/20/here-introduces-hd-maps-for-highly-automated-vehicle-testing/">HD Maps include lane information and sign/signal information</a></em></p>
<blockquote>
<p><strong>Infrastructure upgrades</strong> for new requirements in connectivity and communication — cars can talk to each other and the environment.</p>
</blockquote>
<p>As autonomous vehicles move to become a viable option for mainstream adoption, major infrastructure enhancements need to be considered and implemented. Infrastructure upgrades range in scope — from painting new, clearer lines that designate lane separation, all the way to integrating new sensors and communication modules. Autonomous cars need to be able to perceive enough information about their environment in order to assess, make a plan, and then react. The way infrastructure currently portrays information to human drivers isn’t necessarily the best way to portray it directly to vehicles. For humans, we use paint in different colors, signs and signals, cones, and flares. For autonomous vehicles, these inputs will involve an <a target="_blank" href="https://www.wired.com/2016/03/self-driving-cars-wont-work-change-roads-attitudes/">environment</a> which can know a lot more about the conditions on the road — sensors monitoring traffic, optimizing traffic flow, and even cars that can <a target="_blank" href="https://www.technologyreview.com/s/602463/your-cars-sensors-are-about-to-shorten-your-commute/">communicate with each other</a>. How much of the autonomous future will have infrastructure 2.0, and how much of our autonomous systems will adapt to something closer to current infrastructure?</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/gFMnIr3tl4i7vqpNtypigpNJdHkQpx90q3vO" alt="Image" width="600" height="400" loading="lazy">
<em><a target="_blank" href="http://gizmodo.com/will-autonomous-cars-kill-the-traffic-light-1624440289">An example of a particularly difficult infrastructure to navigate</a></em></p>
<blockquote>
<p><strong>Intelligent manufacturing</strong> is making it possible to integrate new technology, build in new ways, and with new materials.</p>
</blockquote>
<p>The way cars themselves are made is also being transformed by automation. New materials can be selected with optimal characteristics based on their physical, chemical, and thermal constraints. These new materials can be put together in more clever ways using <a target="_blank" href="https://www.fastcompany.com/3054028/inside-the-hack-rod-the-worlds-first-ai-designed-car">AI to design structural elements</a> in very non-intuitive ways. New developments are making it cheaper and faster to build prototype parts from production materials. New applications of computer vision and machine learning techniques, like reinforcement learning, are pushing the boundaries of the types of parts that can be automated in the factory. The confluence of these factors is changing the way that automobiles are built and tested.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/M0GpmgtqPZScGfAbmGIMmMPrT8pUretJaI1o" alt="Image" width="600" height="400" loading="lazy">
<em>Advanced <a target="_blank" href="http://www.goauto.com.au/mellor/mellor.nsf/story2/6A99C53ED0256365CA257CAB007C083D">BMW manufacturing line</a></em></p>
<blockquote>
<p><strong>Driver-focused sensors</strong> are immensely important in the transition from level 0 to level 5 autonomous systems.</p>
</blockquote>
<p>As an increasing number of conditions are created in which the driver may not be the main operator of the vehicle, systems are necessary to make sure the driver is <a target="_blank" href="https://www.technologyreview.com/s/602441/semi-autonomous-cars-could-increase-distracted-driving-deaths/">paying attention when they need to be</a>. Monitoring distractions, emotional state, sobriety, wakefulness, and health are all things becoming possible to track using only a camera and software. Other sensors measuring biometrics of the driver can also provide insights about what the car should do. Integrating other health data (risk for heart attack or stroke, for example) may change the way the car behaves in certain situations. All of these capabilities focus on keeping people safer, and the cars being more contextually aware.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/JX-sYKuo9tAQXCOfV67LtCaBIPsTxURYyjjA" alt="Image" width="600" height="400" loading="lazy">
<em>Tools like <a target="_blank" href="http://www.learnopencv.com/facial-landmark-detection/">OpenCV</a> make it much easier to build computer vision systems</em></p>
<blockquote>
<p>New services in <strong>fleet management, ridesharing, and repair</strong> will become possible with so many connected and intelligent automobiles.</p>
</blockquote>
<p>The combination of cars being largely autonomous as well as sensor-laden will enable many new business models. Fleet management will become much more efficient since the cars will be able to communicate real-time status and be rerouted as situations change.</p>
<p>Ridesharing will continue to make strides in the efficiency of route planning started by companies like Uber. The car repair industry will be similarly transformed because we will know so much more about what is happening inside a car.</p>
<p>Car ownership itself may change. On average, car owners currently leave their car parked for <a target="_blank" href="http://fortune.com/2016/03/13/cars-parked-95-percent-of-time/">95% of the time</a>. A ride sharing company dispatching autonomous vehicles could displace the need or desire to own a car. Ubiquitous real-time ridesharing (even pre-automation) is already yielding huge conversions away from car ownership in urban areas.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/PM9kF5HmOVz3G6aGluEPePkpdvN7CH0T082W" alt="Image" width="600" height="400" loading="lazy">
<em>Uber’s <a target="_blank" href="https://www.engadget.com/2016/09/14/uber-pittsburgh-self-driving-cars-experience/">self-driving fleet</a> in Pittsburgh</em></p>
<p>All these factors are leading to a transportation revolution. However, new entrants to the transportation industry face a much more complicated regulatory system, customer development process, and supply chain. Comet Labs’ <a target="_blank" href="http://cometlabs.io/transportation-lab">Transportation Lab</a> helps startups developing the <strong>core technologies</strong> that will <strong>transform the transportation industry</strong>, accelerating their customer development by <strong>providing resources that money can’t buy</strong> (such as HD mapping data, autonomous test vehicles, and space to pilot).</p>
<p>Are you building transportation technology? We’d love to <a target="_blank" href="https://cometlabs.wufoo.com/forms/rpfir0m0f8l5g9/">hear from you</a>.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ The world through the eyes of a self-driving car ]]>
                </title>
                <description>
                    <![CDATA[ By David Brailovsky Visualizing which part of an image a neural network uses to recognize traffic lights In my last post I described how I trained a ConvNet (Convolutional Neural Network) to recognize traffic lights in dash-cam images. The best perf... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/what-is-my-convnet-looking-at-7b0533e4d20e/</link>
                <guid isPermaLink="false">66c365b4693ce41cd86e7997</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Deep Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-driving cars ]]>
                    </category>
                
                    <category>
                        <![CDATA[ technology ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Wed, 08 Feb 2017 01:43:10 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*N6jPYd98Pb5s5j2-2oiUbA.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By David Brailovsky</p>
<h4 id="heading-visualizing-which-part-of-an-image-a-neural-network-uses-to-recognize-traffic-lights">Visualizing which part of an image a neural network uses to recognize traffic lights</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/vK9KkDwVqbLFvPzIGqGhxJGE0xUjsRJfAp5p" alt="Image" width="600" height="400" loading="lazy"></p>
<p>In my <a target="_blank" href="https://medium.com/@davidbrai/recognizing-traffic-lights-with-deep-learning-23dae23287cc">last post</a> I described how I trained a ConvNet (Convolutional Neural Network) to recognize traffic lights in dash-cam images. The best performing single network achieved an impressive accuracy of &gt;94%.</p>
<p>While ConvNets are very good at learning to classify images, they are also somewhat of a black box. It’s hard to tell what they’re doing once they’re trained. Since I never explicitly “told” the network to focus on traffic lights, it’s possible that it’s using some other visual cues in the images to predict the correct class. Maybe it’s looking for static cars to predict a red light?</p>
<p>In this post I describe a very simple and useful method for visualizing what part of an image the network uses for its prediction. The approach involves occluding parts of the image and seeing how that changes the network’s prediction. This approach has been described in “<a target="_blank" href="https://arxiv.org/abs/1311.2901">Visualizing and Understanding Convolutional Networks</a>”.</p>
<p>Self-driving cars today use much more sophisticated methods for detecting objects in a scene, as well as many more sensors as inputs. The ConvNet we examine throughout the post should be seen as a simplified version of what self-driving cars actually use. Nonetheless, the visualization method described in this post can be useful and adapted for different kinds of neural network applications.</p>
<p>You can download a notebook file with the code I used from <a target="_blank" href="https://github.com/davidbrai/deep-learning-traffic-lights/blob/master/analysis/sliding_patch.ipynb">here</a>.</p>
<h3 id="heading-example-1">Example #1</h3>
<p>I started with the following image which has a red traffic light:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/KTlbrjx6JvRHMd8w82EwsyKtSwnW0aCpop7G" alt="Image" width="600" height="400" loading="lazy">
<em>Source: <a target="_blank" href="https://challenge.getnexar.com/challenge-1">Nexar challenge</a></em></p>
<p>The network predicts this image has a red traffic light with 99.99% probability. Next I generated many versions of this image with a grey square patch in different positions. More specifically, a 64 x 64 sliding square with a step size of 16 pixels.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/OUhtNgvyDE0AlualytC2cpZt4Y7gVQXXQzeq" alt="Image" width="600" height="400" loading="lazy">
<em>Example of image with 64x64 grey square patch</em></p>
<p>I ran each image through the network and recorded the probability it predicted for the class “red”. Below you can see a plot of a heat-map of those recorded probabilities.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/P2cTCJcoi3b3-CSFNp3W-W14MHfFwnV2sUgJ" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The color represents the probability of the class “red” when there was a square patch covering that position. Darker color means lower probability. There’s a smoothing effect because I averaged the probabilities each pixel got for all the patches that covered it.</p>
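<p>For reference, here’s a rough NumPy sketch of this patch-and-average procedure. The classifier is abstracted behind a hypothetical <code>predict_red_prob(image)</code> function that returns the probability of the class “red” for a single image:</p>
<pre><code>import numpy as np

def occlusion_heatmap(image, predict_red_prob, patch=64, step=16):
    ## accumulate per-pixel probability sums and counts, then average
    h, w = image.shape[:2]
    prob_sum = np.zeros((h, w))
    counts = np.zeros((h, w))
    for y in range(0, h - patch + 1, step):
        for x in range(0, w - patch + 1, step):
            occluded = image.copy()
            occluded[y:y+patch, x:x+patch] = 128  ## grey square patch
            p = predict_red_prob(occluded)        ## probability of "red" with this region hidden
            prob_sum[y:y+patch, x:x+patch] += p
            counts[y:y+patch, x:x+patch] += 1
    counts[counts == 0] = 1  ## avoid dividing by zero at uncovered borders
    return prob_sum / counts ## per-pixel average, giving the smoothing described above
</code></pre>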
<p>Then I plotted the heat-map on top of the original image:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/Y0gmlQIpJg-f9If3OZKQjepei0gNMBhcDDlt" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Very cool! The lowest probability is exactly when covering the traffic light. I then repeated this process with a smaller patch size of 16x16:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/IxSrdqG7RqAjBhV7BUfXIGXUShVYDQksgHGW" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Exactly on the traffic light!</p>
<h3 id="heading-example-2">Example #2</h3>
<p>I kept examining more images and came across this interesting example:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/dimBkYiFCX-xeEq3uUqqMqECBIayDn8Rj712" alt="Image" width="600" height="400" loading="lazy">
<em>Source: <a target="_blank" href="https://challenge.getnexar.com/challenge-1">Nexar challenge</a></em></p>
<p>The ConvNet predicted the class “green” with 99.99% probability for this image. I generated another heat-map by sliding a patch of size 32x32 and a step size of 16 pixels:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/uDhxIZmJAzm8gqbkq5cx3cTt-5xlDtvIeB-8" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Hmm… something’s not right. The lowest probability for “green” that any patched image got was 99.909%, which is still very high. The image with the lowest probability was:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/baJUE9N20Z0w034a99WHzOyKEWkQMJU9gkTl" alt="Image" width="600" height="400" loading="lazy"></p>
<p>That actually looks fine; it covers the traffic light perfectly. So why was the network still predicting “green” with high probability? It could be because of the second green traffic light in the image. I repeated the sliding patch process on the patched image above and plotted the heat-map:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/T0zbx1OZ4IrHsUHAmsFqLMu3gi-9ZnLOa-ah" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Much better! After hiding the second traffic light, the probability for “green” dropped close to zero, 0.25% to be exact.</p>
<h3 id="heading-looking-at-mistakes">Looking at mistakes</h3>
<p>Next I wanted to see if I could learn anything interesting by using this technique to understand some of the network’s misclassifications. Many of the mistakes were caused by having two traffic lights in the scene, one green and one red. In those cases it was pretty obvious that the other traffic light was the part of the image that caused the mistake.</p>
<p>Another type of mistake was when the network predicted there’s no traffic light in the scene when there actually was. Unfortunately this technique was not very useful for understanding the reason the network got it wrong since there was no specific part of the image it focused on.</p>
<p>The last kind of mistake I looked at was when the network predicted a traffic light when there actually wasn’t one. See the example below:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/ZImlljaGDa1q-bwDeW17Lqo5X4tC6ccX6-zE" alt="Image" width="600" height="400" loading="lazy"></p>
<p>And with the heat-map plotted on top:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/h5fbcX514Ny2A4jlbc4hYGG0q3YpoDDz8Gvw" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Looks like the network mistook the parking sign light for a traffic light. It’s interesting that it was just the right parking sign and not the left one.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>This method is very simple yet effective for gaining insight into what a ConvNet is focusing on in an image. Unfortunately, it doesn’t tell us <em>why</em> it’s focusing on that part.</p>
<p>I also experimented a little with generating a saliency map as described in “<a target="_blank" href="https://arxiv.org/abs/1312.6034">Deep Inside Convolutional Networks</a>”, but didn’t get any visually pleasing results.</p>
<p>If you know of any other interesting ways to understand what ConvNets are doing, please write a comment below.</p>
 ]]>
                </content:encoded>
            </item>
        
            <item>
                <title>
                    <![CDATA[ Recognizing Traffic Lights With Deep Learning ]]>
                </title>
                <description>
                    <![CDATA[ By David Brailovsky How I learned deep learning in 10 weeks and won $5,000 I recently won first place in the Nexar Traffic Light Recognition Challenge, a computer vision competition organized by a company that’s building an AI dash cam app. In this po... ]]>
                </description>
                <link>https://www.freecodecamp.org/news/recognizing-traffic-lights-with-deep-learning-23dae23287cc/</link>
                <guid isPermaLink="false">66c35da878cd60366e354630</guid>
                
                    <category>
                        <![CDATA[ Artificial Intelligence ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Data Science ]]>
                    </category>
                
                    <category>
                        <![CDATA[ Machine Learning ]]>
                    </category>
                
                    <category>
                        <![CDATA[ General Programming ]]>
                    </category>
                
                    <category>
                        <![CDATA[ self-driving cars ]]>
                    </category>
                
                <dc:creator>
                    <![CDATA[ freeCodeCamp ]]>
                </dc:creator>
                <pubDate>Thu, 12 Jan 2017 15:53:20 +0000</pubDate>
                <media:content url="https://cdn-media-1.freecodecamp.org/images/1*X7aV0pK2krETntjlmIxQhg.png" medium="image" />
                <content:encoded>
                    <![CDATA[ <p>By David Brailovsky</p>
<h4 id="heading-how-i-learned-deep-learning-in-10-weeks-and-won-5000">How I learned deep learning in 10 weeks and won $5,000</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/w1rKRdS-MznDPLqurd098N0RP8Y8A-2BKjzO" alt="Image" width="600" height="400" loading="lazy"></p>
<p>I recently won first place in the <a target="_blank" href="https://challenge.getnexar.com/challenge-1">Nexar Traffic Light Recognition Challenge</a>, a computer vision competition organized by a company that’s building an AI dash cam app.</p>
<p>In this post, I’ll describe the solution I used. I’ll also explore approaches that did and did not work in my effort to improve my model.</p>
<p>Don’t worry — you don’t need to be an AI expert to understand this post. I’ll focus on the ideas and methods I used as opposed to the technical implementation.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/gGdsB4CeozhR1HEoQz6CKkhAKJzDnEla9OSY" alt="Image" width="600" height="400" loading="lazy">
<em>Demo of a deep learning based classifier for recognizing traffic lights</em></p>
<h3 id="heading-the-challenge">The challenge</h3>
<p>The goal of the challenge was to recognize the traffic light state in images taken by drivers using the Nexar app. In any given image, the classifier needed to output whether there was a traffic light in the scene, and whether it was red or green. More specifically, it should only identify traffic lights in the driving direction.</p>
<p>Here are a few examples to make it clearer:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/uMBGy8ksr1LYDX58wPHBez6CWlKjQcwBtllz" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/lcrIwmOWIMFAagY2r-rO6jukAF2FKX1vuRwN" alt="Image" width="600" height="400" loading="lazy"></p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7Z5Bs3bddQK0viGK4ZcR1rnh3CU1W5q4hqeO" alt="Image" width="600" height="400" loading="lazy">
<em>Source: <a target="_blank" href="https://challenge.getnexar.com/challenge-1">Nexar challenge</a></em></p>
<p>The images above are examples of the three possible classes I needed to predict: no traffic light (left), red traffic light (center) and green traffic light (right).</p>
<p>The challenge required the solution to be based on <a target="_blank" href="https://en.wikipedia.org/wiki/Convolutional_neural_network">Convolutional Neural Networks</a>, a very popular method used in image recognition with deep neural networks. The submissions were scored based on the model’s accuracy along with the model’s size (in megabytes). Smaller models got higher scores. In addition, the minimum accuracy required to win was 95%.</p>
<p>Nexar provided 18,659 labeled images as training data. Each image was labeled with one of the three classes mentioned above (no traffic light / red / green).</p>
<h3 id="heading-software-and-hardware">Software and hardware</h3>
<p>I used <a target="_blank" href="http://caffe.berkeleyvision.org/">Caffe</a> to train the models. The main reason I chose Caffe was because of the large variety of pre-trained models.</p>
<p>Python, NumPy &amp; Jupyter Notebook were used for analyzing results, data exploration and ad-hoc scripts.</p>
<p>Amazon’s GPU instances (g2.2xlarge) were used to train the models. My AWS bill ended up being <strong>$263</strong> (!). Not cheap.</p>
<p>The code and files I used to train and run the model are on <a target="_blank" href="https://github.com/davidbrai/deep-learning-traffic-lights">GitHub</a>.</p>
<h3 id="heading-the-final-classifier">The final classifier</h3>
<p>The final classifier achieved an accuracy of <strong>94.955%</strong> on Nexar’s test set, with a model size of ~<strong>7.84 MB</strong>. To compare, <a target="_blank" href="https://arxiv.org/abs/1409.4842">GoogLeNet</a> uses a model size of 41 MB, and <a target="_blank" href="http://www.robots.ox.ac.uk/~vgg/research/very_deep/">VGG-16</a> uses a model size of 528 MB.</p>
<p>Nexar was kind enough to accept 94.955% as 95% to pass the minimum requirement.</p>
<p>The process of getting higher accuracy involved a LOT of trial and error. Some of it had some logic behind it, and some was just “maybe this will work”. I’ll describe some of the things I tried to improve the model that did and didn’t help. The final classifier details are described right after.</p>
<h3 id="heading-what-worked">What worked?</h3>
<h4 id="heading-transfer-learninghttpcs231ngithubiotransfer-learning"><a target="_blank" href="http://cs231n.github.io/transfer-learning/">Transfer learning</a></h4>
<p>I started off by trying to fine-tune a model pre-trained on ImageNet with the <a target="_blank" href="https://github.com/BVLC/caffe/tree/master/models/bvlc_googlenet">GoogLeNet</a> architecture. Pretty quickly this got me to &gt;90% accuracy!</p>
<p>Nexar mentioned on the <a target="_blank" href="https://challenge.getnexar.com/challenge-1">challenge page</a> that it should be possible to reach 93% by fine-tuning GoogLeNet. I’m not exactly sure what I did wrong there; I might look into it.</p>
<h4 id="heading-squeezenethttpsarxivorgabs160207360"><a target="_blank" href="https://arxiv.org/abs/1602.07360">SqueezeNet</a></h4>
<blockquote>
<p>SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and &lt;0.5MB model size.</p>
</blockquote>
<p>Since the competition rewards solutions that use small models, early on I decided to look for a compact network with as few parameters as possible that can still produce good results. Most of the recently published networks are <em>very</em> deep and have <em>a lot</em> of parameters. <a target="_blank" href="https://arxiv.org/abs/1602.07360">SqueezeNet</a> seemed to be a very good fit, and it also had a pre-trained model trained on ImageNet available in <a target="_blank" href="http://caffe.berkeleyvision.org/">Caffe</a>’s <a target="_blank" href="https://github.com/BVLC/caffe/wiki/Model-Zoo">Model Zoo</a> which came in handy.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/heQAr-opHTSMqRHKQ4eevvLQfizGPmPAJkfq" alt="Image" width="600" height="400" loading="lazy">
<em>SqueezeNet network architecture. <a target="_blank" href="http://www.slideshare.net/embeddedvision/techniques-for-efficient-implementation-of-deep-neural-networks-a-presentation-from-stanford">Slides</a></em></p>
<p>The network manages to stay compact by:</p>
<ul>
<li>Using mostly 1x1 convolution filters and some 3x3</li>
<li>Reducing number of input channels into the 3x3 filters</li>
</ul>
<p>For more details, I recommend reading this <a target="_blank" href="https://gab41.lab41.org/lab41-reading-group-squeezenet-9b9d1d754c75#.oprbydtxv">blog post</a> by Lab41 or the <a target="_blank" href="https://arxiv.org/abs/1602.07360">original paper</a>.</p>
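<p>To make the idea concrete, here is a rough sketch of the paper’s “fire” module in PyTorch. This is purely illustrative; my actual models were trained in Caffe:</p>
<pre><code>import torch
import torch.nn as nn

class Fire(nn.Module):
    ## a "fire" module: a 1x1 squeeze layer feeding parallel 1x1 and 3x3 expand layers
    def __init__(self, in_ch, squeeze_ch, expand_ch):
        super().__init__()
        self.squeeze = nn.Conv2d(in_ch, squeeze_ch, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_ch, expand_ch, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))
        ## concatenate the two expand branches along the channel dimension
        return torch.cat([self.relu(self.expand1x1(x)),
                          self.relu(self.expand3x3(x))], dim=1)
</code></pre>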
<p>After some back and forth adjusting the learning rate, I was able to fine-tune the pre-trained model, as well as train from scratch, with good accuracy: 92%! Very cool!</p>
<h4 id="heading-rotating-images">Rotating images</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/seb5umUmWtSeKYVEnJPxUFNAj7fpLIuo83oU" alt="Image" width="600" height="400" loading="lazy">
<em>Source: Nexar</em></p>
<p>Most of the images were horizontal like the one above, but about 2.4% were vertical, and with all kinds of directions for “up”. See below.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/-ef27RhMID9q0P1-YmrTYbEVOwOzEPDPQ3Il" alt="Image" width="600" height="400" loading="lazy">
<em>Different orientations of vertical images. Source: <a target="_blank" href="https://challenge.getnexar.com/challenge-1">Nexar challenge</a></em></p>
<p>Although it’s not a big part of the data-set, I wanted the model to classify them correctly too.</p>
<p>Unfortunately, there was no EXIF data in the jpeg images specifying the orientation. At first I considered doing some heuristic to identify the sky and flip the image accordingly, but that did not seem straightforward.</p>
<p>Instead, I tried to make the model invariant to rotations. My first attempt was to train the network with random rotations of 0°, 90°, 180°, and 270°. That didn’t help. But when averaging the predictions of 4 rotations for each image, there was improvement!</p>
<p>92% → 92.6%</p>
<p>To clarify: by “averaging the predictions” I mean averaging the probabilities the model produced of each class across the 4 image variations.</p>
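<p>As an illustrative sketch, assuming the trained network is wrapped in a hypothetical <code>predict_probs(image)</code> function that returns a probability vector over the three classes:</p>
<pre><code>import numpy as np

def predict_with_rotations(image, predict_probs):
    ## run the classifier on all 4 right-angle rotations of the image
    probs = [predict_probs(np.rot90(image, k)) for k in range(4)]
    ## average the per-class probabilities across the 4 variations
    return np.mean(probs, axis=0)
</code></pre>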
<h4 id="heading-oversampling-crops">Oversampling crops</h4>
<p>During training, the SqueezeNet network performed random cropping on the input images by default, and I didn’t change that. This type of data augmentation helps the network generalize better.</p>
<p>Similarly, when generating predictions, I took several crops of the input image and averaged the results. I used 5 crops: 4 corners and a center crop. I got the implementation for free by using existing <a target="_blank" href="https://github.com/BVLC/caffe/blob/master/python/caffe/classifier.py">caffe code</a> for this.</p>
<p>92% → 92.46%</p>
<p>Rotating images together with oversampling crops showed very slight improvement.</p>
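<p>For illustration, the five crops described above can be produced with plain NumPy slicing (Caffe handled this for me, so this sketch just makes the idea concrete):</p>
<pre><code>def five_crops(image, crop_h, crop_w):
    ## the 4 corner crops plus a center crop used for oversampling at prediction time
    h, w = image.shape[:2]
    cy, cx = (h - crop_h) // 2, (w - crop_w) // 2
    return [image[:crop_h, :crop_w],            ## top-left
            image[:crop_h, w-crop_w:],          ## top-right
            image[h-crop_h:, :crop_w],          ## bottom-left
            image[h-crop_h:, w-crop_w:],        ## bottom-right
            image[cy:cy+crop_h, cx:cx+crop_w]]  ## center
</code></pre>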
<h4 id="heading-additional-training-with-lower-learning-rate">Additional training with lower learning rate</h4>
<p>All models started to overfit after a certain point. I noticed this by watching the validation-set loss begin to rise.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7DREKKQRtlLrsNQ87SvcoAMZH29Hh0uC3o4Z" alt="Image" width="600" height="400" loading="lazy">
<em>Validation loss rising from around iteration 40,000</em></p>
<p>I stopped the training at that point because the model was probably not generalizing any more. This meant that the learning rate didn’t have time to decay all the way to zero. I tried resuming the training process at the point where the model started overfitting with a learning rate 10 times lower than the original one. This usually improved the accuracy by 0-0.5%.</p>
<h4 id="heading-more-training-data">More training data</h4>
<p>At first, I split my data into 3 sets: training (64%), validation (16%) &amp; test (20%). After a few days, I thought that giving up 36% of the data might be too much. I merged the training &amp; validation sets and used the test-set to check my results.</p>
<p>I retrained a model with “image rotations” and “additional training at lower rate” and saw improvement:</p>
<p>92.6% → 93.5%</p>
<h4 id="heading-relabeling-mistakes-in-the-training-data">Relabeling mistakes in the training data</h4>
<p>When analyzing the mistakes the classifier made on the validation set, I noticed that some of the mistakes had very high confidence. In other words, the model was certain it was one thing (e.g. green light) while the training data said another (e.g. red light).</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/7ONBYOL6Nz1GJP7IbUjHkHta-NuhLVvWm0H8" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Notice that in the plot above, the right-most bar is pretty high. That means there’s a high number of mistakes with &gt;95% confidence. When examining these cases up close I saw these were usually mistakes in the ground-truth of the training set rather than in the trained model.</p>
<p>I decided to fix these errors in the training set. The reasoning was that these mistakes confuse the model, making it harder for it to generalize. Even if the final testing-set has mistakes in the ground-truth, a more generalized model has a better chance of high accuracy across all the images.</p>
<p>I manually labeled 709 images that one of my models got wrong. This changed the ground-truth for 337 out of the 709 images. It took about an hour of manual work with a <a target="_blank" href="https://github.com/davidbrai/deep-learning-traffic-lights/blob/14749dacf75318842f45fc5a9900c300eb83755f/analysis/label_misses.py">python script</a> to help me be efficient.</p>
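<p>A sketch of how such suspicious cases can be flagged, given the model’s predicted probabilities and the ground-truth labels (the names are illustrative):</p>
<pre><code class="language-python">
import numpy as np

def flag_suspicious(probs, labels, threshold=0.95):
    """Return indices where the model disagrees with the ground truth
    while being very confident: likely labeling errors worth reviewing.

    probs:  (N, num_classes) array of predicted probabilities
    labels: (N,) array of ground-truth class indices
    """
    preds = probs.argmax(axis=1)
    confidence = probs.max(axis=1)
    suspicious = np.logical_and(preds != labels, confidence &gt;= threshold)
    return np.where(suspicious)[0]
</code></pre>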
<p><img src="https://cdn-media-1.freecodecamp.org/images/znnZb0xzwDNsGkEL1mQchvZW3UKmKhX7g0jQ" alt="Image" width="600" height="400" loading="lazy"></p>
<p>Above is the same plot after re-labeling and retraining the model. Looks better!</p>
<p>This improved the previous model by:</p>
<p>93.5% → 94.1% ✌️</p>
<h4 id="heading-ensemble-of-models">Ensemble of models</h4>
<p>Using several models together and averaging their results improved the accuracy as well. I experimented with different kinds of modifications to the training process of the models in the ensemble. A noticeable improvement came from including a model trained from scratch, even though it had lower accuracy on its own, alongside the models fine-tuned from pre-trained weights. Perhaps this is because it learned different features than the fine-tuned models.</p>
<p>The ensemble used 3 models with accuracies of 94.1%, 94.2% and 92.9%, and together they reached an accuracy of 94.8%.</p>
<h3 id="heading-what-didnt-work">What didn’t work?</h3>
<p>Lots of things! Hopefully some of these ideas can be useful in other settings.</p>
<h4 id="heading-combatting-overfitting">Combatting overfitting</h4>
<p>While trying to deal with overfitting I tried several things, none of which produced significant improvements:</p>
<ul>
<li>increasing the dropout ratio in the network</li>
<li>more data augmentation (random shifts, zooms, skews)</li>
<li>training on more data: using 90/10 split instead of 80/20</li>
</ul>
<h4 id="heading-balancing-the-dataset">Balancing the dataset</h4>
<p>The dataset wasn’t very balanced:</p>
<ul>
<li>19% of images were labeled with no traffic light</li>
<li>53% red light</li>
<li>28% green light</li>
</ul>
<p>I tried balancing the dataset by oversampling the less common classes but didn’t notice any improvement.</p>
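<p>A minimal sketch of that kind of oversampling, duplicating examples of the rarer classes until each class matches the most common one:</p>
<pre><code class="language-python">
import numpy as np

def oversample_minorities(paths, labels, seed=0):
    """Duplicate examples of rarer classes so every class ends up with
    roughly as many examples as the most common one."""
    rng = np.random.RandomState(seed)
    paths, labels = np.asarray(paths), np.asarray(labels)
    target = np.bincount(labels).max()
    out_paths, out_labels = [], []
    for cls in np.unique(labels):
        idx = np.where(labels == cls)[0]
        picks = rng.choice(idx, size=target, replace=True)
        out_paths.extend(paths[picks])
        out_labels.extend([cls] * target)
    return out_paths, out_labels
</code></pre>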
<h4 id="heading-separating-day-amp-night">Separating day &amp; night</h4>
<p>My intuition was that recognizing traffic lights in daylight and at night are very different problems. I thought maybe I could help the model by splitting the problem into two simpler ones.</p>
<p>It was fairly easy to separate the images into day and night by looking at their average pixel intensity:</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/cqz5z5omsUOd44Fs0gGvffsVkkzM5VCrBx4u" alt="Image" width="600" height="400" loading="lazy"></p>
<p>You can see a natural separation between images with low average intensity, i.e. dark images taken at night, and bright images taken during the day.</p>
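<p>A minimal sketch of such a day/night split; the threshold value is illustrative and would in practice be read off the histogram above:</p>
<pre><code class="language-python">
import numpy as np
from PIL import Image

def is_daytime(path, threshold=60.0):
    """Label an image as daytime if its mean grayscale intensity is
    above a threshold picked from the dataset's intensity histogram."""
    gray = np.asarray(Image.open(path).convert('L'), dtype=np.float32)
    return gray.mean() &gt; threshold
</code></pre>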
<p>I tried two approaches; neither improved the results:</p>
<ul>
<li>Training two separate models for day images and night images</li>
<li>Training the network to predict 6 classes instead of 3 by also predicting whether it’s day or night</li>
</ul>
<h4 id="heading-using-better-variants-of-squeezenet">Using better variants of SqueezeNet</h4>
<p>I experimented a little with two improved variants of SqueezeNet. The first used <a target="_blank" href="https://github.com/songhan/SqueezeNet-Residual">residual connections</a> and the second was trained with <a target="_blank" href="https://github.com/songhan/SqueezeNet-DSD-Training">dense→sparse→dense</a> training (more details in the paper). No luck.</p>
<h4 id="heading-localization-of-traffic-lights">Localization of traffic lights</h4>
<p>After reading a great <a target="_blank" href="http://deepsense.io/deep-learning-right-whale-recognition-kaggle/">post</a> by deepsense.io on how they won the whale recognition challenge, I tried to train a localizer, i.e. identify the location of the traffic light in the image first, and then identify the traffic light state on a small region of the image.</p>
<p>I used <a target="_blank" href="http://sloth.readthedocs.io/en/latest/">sloth</a> to annotate about 2,000 images, which took a few hours. When I tried to train a model, it overfit very quickly, probably because there was not enough labeled data. Perhaps this could have worked if I had annotated many more images.</p>
<h4 id="heading-training-a-classifier-on-the-hard-cases">Training a classifier on the hard cases</h4>
<p>I chose 30% of the “harder” images by selecting the images my classifier was less than 97% confident about. I then tried to train a classifier just on these images. No improvement.</p>
<h4 id="heading-different-optimization-algorithm">Different optimization algorithm</h4>
<p>I experimented briefly with using Caffe’s Adam solver instead of SGD with a linearly decreasing learning rate, but didn’t see any improvement.</p>
<h4 id="heading-adding-more-models-to-ensemble">Adding more models to ensemble</h4>
<p>Since the ensemble method proved helpful, I tried to double down on it. I varied different parameters to produce different models and add them to the ensemble: the initial seed, the dropout rate, the training data (a different split), and the checkpoint used from training. None of these made any significant improvement.</p>
<h3 id="heading-final-classifier-details">Final classifier details</h3>
<p>The classifier uses an ensemble of 3 separately trained networks. A weighted average of the probabilities they assign to each class is used as the output. All three networks were based on the <a target="_blank" href="https://arxiv.org/abs/1602.07360">SqueezeNet</a> architecture, but each one was trained differently.</p>
<h4 id="heading-model-1-pre-trained-network-with-oversampling">Model #1 — Pre-trained network with oversampling</h4>
<p>Trained on the re-labeled training set (after fixing the ground-truth mistakes). The model was fine-tuned from a SqueezeNet model pre-trained on ImageNet.</p>
<p>Data augmentation during training:</p>
<ul>
<li>Random horizontal mirroring</li>
<li>Randomly cropping patches of size 227 x 227 before feeding into the network</li>
</ul>
<p>At test time, the predictions of 10 variations of each image were averaged to calculate the final prediction. The 10 variations were made of:</p>
<ul>
<li>5 crops of size 227 x 227: 1 for each corner and 1 in the center of the image</li>
<li>for each crop, a horizontally mirrored version was also used</li>
</ul>
<p>Model accuracy on validation set: 94.21%<br>Model size: ~2.6 MB</p>
<h4 id="heading-model-2-adding-rotation-invariance">Model #2 — Adding rotation invariance</h4>
<p>Very similar to Model #1, with the addition of image rotations. During training, images were randomly rotated by 90°, 180°, 270°, or not at all. At test time, each of the 10 variations described in Model #1 produced three more variations by rotating it by 90°, 180° and 270°. A total of 40 variations were classified by the model and averaged together.</p>
<p>Model accuracy on validation set: 94.1%<br>Model size: ~2.6 MB</p>
<h4 id="heading-model-3-trained-from-scratch">Model #3 — Trained from scratch</h4>
<p>This model was not fine-tuned, but instead <em>trained from scratch</em>. The rationale behind it was that even though it achieves lower accuracy, it learns different features on the training set than the previous two models, which could be useful when used in an ensemble.</p>
<p>Data augmentation during training and testing are the same as Model #1: mirroring and cropping.</p>
<p>Model accuracy on validation set: 92.92%<br>Model size: ~2.6 MB</p>
<h4 id="heading-combining-the-models-together">Combining the models together</h4>
<p>Each model output three values, representing the probability that the image belongs to each of the three classes. We averaged their outputs with the following weights:</p>
<ul>
<li>Model #1: 0.28</li>
<li>Model #2: 0.49</li>
<li>Model #3: 0.23</li>
</ul>
<p>The values for the weights were found by doing a grid-search over possible values and testing them on the validation set. They are probably a little overfitted to the validation set, but perhaps not too much, since this is a very simple operation.</p>
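<p>A sketch of that grid search, assuming each model’s validation-set probabilities have already been computed (function and variable names are illustrative):</p>
<pre><code class="language-python">
import numpy as np

def search_ensemble_weights(probs_a, probs_b, probs_c, labels, step=0.01):
    """Find weights (summing to 1) that maximize validation accuracy.

    Each probs_* is an (N, 3) array of class probabilities and labels
    is an (N,) array of ground-truth class indices.
    """
    best_acc, best_w = 0.0, (1.0, 0.0, 0.0)
    for w1 in np.arange(0.0, 1.0 + step, step):
        for w2 in np.arange(0.0, 1.0 - w1 + step, step):
            w3 = 1.0 - w1 - w2
            avg = w1 * probs_a + w2 * probs_b + w3 * probs_c
            acc = float((avg.argmax(axis=1) == labels).mean())
            if acc &gt; best_acc:
                best_acc, best_w = acc, (w1, w2, w3)
    return best_w, best_acc
</code></pre>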
<p>Model accuracy on validation set: 94.83%<br>Model size: ~7.84 MB<br>Model accuracy on Nexar’s test set: 94.955%</p>
<h4 id="heading-examples-of-the-model-mistakes">Examples of the model mistakes</h4>
<p><img src="https://cdn-media-1.freecodecamp.org/images/NEEbzZb45ZlZuRb33-lG1Xik4OTePSa8cCXs" alt="Image" width="600" height="400" loading="lazy">
<em>Source: Nexar</em></p>
<p>The green dot in the palm tree, produced by glare, probably made the model mistakenly predict a green light.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/nIcOpLXRNdp5Nef8TDaXIBUumhs1sVHSZBXP" alt="Image" width="600" height="400" loading="lazy">
<em>Source: Nexar</em></p>
<p>The model predicted red instead of green. This is a tricky case when there is more than one traffic light in the scene.</p>
<p><img src="https://cdn-media-1.freecodecamp.org/images/j8TesyVpgHE2GCrzw5qT5wWJ63yB1tc5N071" alt="Image" width="600" height="400" loading="lazy"></p>
<p>The model said there was no traffic light, even though there is a green traffic light ahead.</p>
<h3 id="heading-conclusion">Conclusion</h3>
<p>This was the first time I applied deep learning to a real problem! I was happy to see it worked so well. I learned a LOT during the process and will probably write another post that will hopefully help newcomers waste less time on some of the mistakes and technical challenges I faced.</p>
<p>I want to thank Nexar for providing this great challenge and hope they organize more of these in the future!</p>
<p><em>If you enjoyed reading this post, please <strong>share it on social media!</strong></em></p>
<p><em>Would love to get your feedback and questions below!</em></p>
 ]]>
                </content:encoded>
            </item>
        
    </channel>
</rss>
