
blueprints for machine learning robots

28 November 2015

It is an exciting time right now to be building machine learning models. There have been awesome breakthroughs with deep learning, Google has open sourced its internal machine learning tools, Baidu is setting new records in voice recognition, and there are so many other innovations. I've been thinking a lot about how I would re-engineer our training systems for these computational models.
Currently, machine learning models are trained on hand-crafted data sets that people have painstakingly labeled. This training data can be millions of pictures with captions written out, pictures of males versus females, pictures of dogs versus non-dogs, etc. It is not just images, either; it can be any kind of data, such as video and text.

Creating this data by hand is very painful and takes a lot of time and effort; some of these data sets took almost 10 years to build!
More importantly, it is not how living organisms learn, let alone human beings. We are not fed millions of images while something yells out the name of what is in each one. We learn from interacting with our environment. The closer our machine learning models come to learning the way humans learn, the more progress we will make towards building intelligent computers. We need to train computers in an unsupervised setting where the computer can learn from and interact with its environment.

A human baby starts learning every day from the day it's born, even while it's in the womb. A baby's eyes are constantly taking in images, processing roughly 10–12 "frames" per second due to saccades. So after 3 years, a baby has taken in approximately 800 million images, calculated as (12 images per second * 3600 seconds per hour * 16 waking hours per day * 365 days per year * 3 years), via: https://en.wikipedia.org/wiki/Frame_rate
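That back-of-the-envelope estimate is easy to check in a few lines (using the 12 frames per second figure from the calculation above):

```python
# Rough estimate of images a baby "captures" by age 3,
# assuming ~12 saccadic fixations per second and 16 waking hours per day.
frames_per_second = 12
seconds_per_hour = 3600
waking_hours_per_day = 16
days_per_year = 365
years = 3

total_images = (frames_per_second * seconds_per_hour
                * waking_hours_per_day * days_per_year * years)
print(total_images)  # 756864000 -- roughly 800 million
```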

If you wanted to train a computer with images today, there are a few data sets already on the internet, such as ImageNet, MNIST, and MS COCO, among others.

The problem with static image training sets is that they contain only small snapshots of life: a picture of horses, a house, a beach, etc. But life includes all kinds of other concepts that can't be captured in images. We need to train computers to understand concepts like "spilling a drink", "helicopter landings", "doggy paddling", "returning a book", "running in circles", "shooting a basketball", etc. These all require motion to understand what is going on. Images don't capture the "full fidelity" of these situations; they are 2D snapshots, but we navigate the world in higher dimensions. We need a different system to capture higher-fidelity data.

on robots

What if instead we built little robots with cameras that follow you around in your day-to-day life, constantly analyzing all the video data they receive? They would be programmed with a disposition to learn, meaning that if a robot saw things in its environment it did not recognize, it would want to learn about these new objects and concepts. All of its training data would come from the real live interactions it was having, instead of static images that other people labeled.

If you had a network of these robots, all the data could be collected on a central server so the whole fleet could learn even faster.


To build this, we need actual hardware, which can get expensive. Building the robot from scratch would be hard, so I would build a prototype out of one of the many toy robots on the market. I would probably use something like the Sphero: it's cheap ($100), and the mechanics on it seem simple; it's just a ball! It has inductive charging and an SDK, but it doesn't have cameras. I would avoid the robots with legs for the earlier versions, as they make the hardware and software harder to deal with.
We need video cameras and microphones that the robot would use for navigation and data input. Most modern video cameras capture around 30 to 60 frames per second, which is more than a human baby captures. There would need to be a server nearby to store all the sensory input data for processing; this system would be analogous to "a brain in a vat". A wireless data transmission system would be required, though I'm not sure wifi latency is low enough to allow for this realtime communication. Normally the robot would transmit all its input to the local computer, but if it's out of reach of the server, it needs to be able to store data on its internal system for later processing. The local server would need high-end GPU cards to process the data and build up the model. There would also need to be some kind of basic speaker so the robot could "speak" and transmit sound. I bet a robot like this already exists; the software is the key piece that is missing.
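The store-and-forward behavior could look something like this sketch: stream frames to the local server while connected, buffer them on the robot when the link drops, and flush the backlog once it comes back. All the names here are made up for illustration; a real version would wrap the actual wifi and storage layers.

```python
from collections import deque

class SensorUploader:
    """Toy sketch of store-and-forward frame handling for the robot."""

    def __init__(self):
        self.buffer = deque()   # on-robot storage used while offline
        self.server = []        # stands in for the local GPU server

    def handle_frame(self, frame, connected):
        if connected:
            # flush any backlog first so the server sees frames in order
            while self.buffer:
                self.server.append(self.buffer.popleft())
            self.server.append(frame)
        else:
            self.buffer.append(frame)

uploader = SensorUploader()
uploader.handle_frame("f1", connected=True)
uploader.handle_frame("f2", connected=False)  # out of wifi range
uploader.handle_frame("f3", connected=False)
uploader.handle_frame("f4", connected=True)   # back in range: backlog flushes
print(uploader.server)  # ['f1', 'f2', 'f3', 'f4']
```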


The learning algorithm would need to be some kind of online algorithm that lets the robot learn from data in realtime. Current algorithms that could serve as the base of this system would probably be reinforcement learning algorithms such as Q-learning.
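To make the Q-learning idea concrete, here is a minimal tabular sketch on a toy problem (nothing like the full robot system): an agent on a 5-cell line learns to walk right to reach a reward at the far end. The environment and all the parameters are invented for the example.

```python
import random

random.seed(0)
n_states, actions = 5, [-1, +1]          # move left or right on the line
Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2    # learning rate, discount, exploration

for episode in range(500):
    s = 0
    while s != n_states - 1:             # episode ends at the rightmost cell
        a = random.choice(actions) if random.random() < epsilon \
            else max(actions, key=lambda a: Q[(s, a)])
        s2 = min(max(s + a, 0), n_states - 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # standard Q-learning update toward the bootstrapped target
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in actions)
                              - Q[(s, a)])
        s = s2

# after training, the greedy action in every non-terminal state is "right"
policy = [max(actions, key=lambda a: Q[(s, a)]) for s in range(n_states - 1)]
print(policy)  # [1, 1, 1, 1]
```

The same update rule scales up when the table is replaced by a neural network over camera input, which is what makes it a plausible base for the robot's learning loop.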
The system would need a voice recognition system prebuilt and trained, so that it could ingest all the words being said and associate them with the video it is seeing. The voice recognition system would improve as it collected more data. The system would also learn audio commands you could say to tell it to focus on certain concepts to learn about.

For the cameras, I would actually start off by lowering the resolution and fidelity, maybe even using black and white. There are two reasons for this: first, it is less data to store and process; but more importantly, we want a simulated world in which both the robot and its environment can function. This simulation would need to be as close to the real world as possible, and it is much easier for us to construct simpler worlds. While we are sleeping, we could take the machine learning model and put it into the simulation to continue learning. While it is in the simulation, we can use all the new data it has obtained to run the robot through all kinds of tests and diagnostics to see if it is improving. We could constantly update the learning algorithms and run simulations and tests in the real world. With the combination of recent breakthroughs in artificial neural networks, fast NVIDIA GPU cards, cheap toy robots, open source software, and the internet to share all of this, I truly believe now is the time to build something like this.
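The "lower the fidelity first" step is just grayscale conversion plus downsampling. A sketch with NumPy (the frame size and 8x reduction factor are arbitrary choices for the example):

```python
import numpy as np

rgb = np.random.rand(480, 640, 3)              # a fake 480x640 camera frame

# standard luminance weights for RGB -> grayscale
gray = rgb @ np.array([0.299, 0.587, 0.114])   # shape (480, 640)

# downsample 8x in each dimension by averaging 8x8 blocks
block = 8
small = gray.reshape(480 // block, block,
                     640 // block, block).mean(axis=(1, 3))

print(gray.shape, small.shape)  # (480, 640) (60, 80)
```

Going from a 480x640x3 frame to a 60x80 grayscale one is nearly 200x less data per frame, which is what makes both storage and a matching simulated world tractable.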