It is an exciting time to be building machine learning models. There have been major breakthroughs in deep learning: Google has open sourced its internal machine learning tools, Baidu is setting new records in speech recognition, and many other innovations keep arriving. I’ve been thinking a lot about how I would re-engineer our training systems for these computational models.
Currently, machine learning models are trained on hand-crafted data sets that people painstakingly label by hand. This training data can be millions of pictures with written captions, pictures of males versus females, pictures of dogs versus non-dogs, and so on. It is not just images; it can be any kind of data, such as video and text.
Creating this data by hand is painful and takes enormous time and effort; some of these data sets took almost 10 years to build!
More importantly, it is not how living organisms learn, let alone human beings. We are not fed millions of images while something yells out the name of whatever is in each one. We learn by interacting with our environment. The closer our machine learning models come to learning the way humans learn, the more progress we will make towards building intelligent computers. We need to train computers in an unsupervised setting where the computer can learn from and interact with its environment.
A human baby starts learning every day from the day it’s born, even while it’s in the womb. A baby’s eyes are constantly taking in images: the eye processes roughly 10–12 “frames” per second due to saccades (see https://en.wikipedia.org/wiki/Frame_rate). So after 3 years, a baby has taken in approximately 800 million images, calculated as 12 images * 3600 seconds per hour * 16 hours awake * 365 days * 3 years. If you wanted to train a computer with images today, there are a few data sets already on the internet, such as ImageNet, MNIST, and MS COCO, among others.
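The arithmetic behind that estimate is easy to check:

```python
# Back-of-envelope estimate of how many images a child "sees" by age 3,
# using the ~12 images/second figure from the calculation above.
frames_per_second = 12
seconds_per_hour = 3600
waking_hours_per_day = 16
days_per_year = 365
years = 3

total_images = (frames_per_second * seconds_per_hour
                * waking_hours_per_day * days_per_year * years)
print(total_images)  # 756864000, roughly 800 million
```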
The problem with static image training sets is that they contain only small snapshots of life: a picture of horses, a house, a beach, etc. But life includes all kinds of concepts that can’t be captured in still images. We need to train computers to understand concepts like “spilling a drink”, “helicopter landings”, “doggy paddling”, “returning a book”, “running in circles”, “shooting a basketball”, etc. These all require motion to understand what is going on. Images don’t capture the “full fidelity” of these situations; they are 2D snapshots, but we navigate the world in higher dimensions. We need a different system to capture higher-fidelity data.
If you had a network of these robots, all their data could be collected on a central server so the system could learn even faster.
To build this, we need actual hardware, which can get expensive. Building the robot from scratch would be hard, so I would build a prototype out of one of the many toy robots on the market. I would probably use something like the Sphero: it’s cheap ($100), and the mechanics are simple, since it’s just a ball! It has inductive charging and an SDK, but it doesn’t have cameras. I would not use legged robots for the earlier versions, as they make the hardware and software harder to deal with.
We need video cameras and microphones that the robot would use for navigation and data input. Most modern video cameras capture around 30 to 60 frames per second, which is more than a human baby takes in. There would need to be a server nearby to store all the sensory input data for processing; this system would be analogous to “a brain in a vat”. A wireless data transmission system would be required, though I’m not sure wifi latency is low enough for this realtime communication. Normally the robot would transmit all its input to a local computer, but if it is out of reach of the server, it needs to be able to store data on its internal system for later processing. The local server would need high-end GPU cards to process the data and build up the data model. The robot would also need some kind of basic speaker so that it could “speak” and transmit sound. I bet a robot like this already exists; the software is the key piece that is missing.
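The store-locally-when-out-of-range behavior could look something like this minimal sketch; `SensorBuffer`, `record`, and `send` are all hypothetical names, not a real API:

```python
import queue

class SensorBuffer:
    """Minimal sketch: buffer sensor frames on the robot and flush them
    to the server whenever a connection is available."""
    def __init__(self):
        self.pending = queue.Queue()

    def record(self, frame, connected, send):
        if connected:
            # Flush any backlog first so frames arrive in order.
            while not self.pending.empty():
                send(self.pending.get())
            send(frame)
        else:
            # Out of range of the server: keep the frame for later.
            self.pending.put(frame)

# Simulate a brief dropout: frame 1 is buffered, then both arrive in order.
sent = []
buf = SensorBuffer()
buf.record("frame1", connected=False, send=sent.append)
buf.record("frame2", connected=True, send=sent.append)
print(sent)  # ['frame1', 'frame2']
```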
The learning algorithm would need to be some kind of online, realtime learning algorithm that lets the robot process data as it arrives. Current algorithms that could serve as the base of this system would probably be reinforcement learning algorithms such as Q-learning.
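For concreteness, here is a minimal tabular Q-learning sketch; the states, actions, and hyperparameters are toy assumptions, not a proposal for the real system:

```python
import random
from collections import defaultdict

# Tabular Q-learning: the robot learns a value for each (state, action)
# pair purely from the rewards it receives while interacting.
ALPHA, GAMMA, EPSILON = 0.5, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ["forward", "left", "right"]  # hypothetical robot actions
Q = defaultdict(float)                  # (state, action) -> estimated value

def choose_action(state):
    # Epsilon-greedy: mostly exploit the best-known action, sometimes explore.
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])

def update(state, action, reward, next_state):
    # Standard Q-learning update rule.
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

# One update from an empty table: Q moves halfway toward the reward of 1.0.
update("hallway", "forward", reward=1.0, next_state="doorway")
print(Q[("hallway", "forward")])  # 0.5
```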
The system would need a voice recognition system, pre-built and pre-trained, so that it could ingest the words being said and associate them with the video it is seeing. The voice recognition system would improve as it collected more data. The system would also learn audio commands you could say to it, telling it to focus on certain concepts to learn about.
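The word-to-video association could start as simply as pairing each recognized word with the frame nearest in time, turning speech into weak labels. This is a sketch under that assumption; `align` and its inputs are hypothetical:

```python
# Pair each recognized word with the video frame seen closest in time,
# so spoken words become weak supervision for what the camera captured.
def align(transcript_words, frames):
    """transcript_words: list of (word, timestamp);
    frames: list of (frame_id, timestamp)."""
    pairs = []
    for word, t in transcript_words:
        nearest = min(frames, key=lambda f: abs(f[1] - t))
        pairs.append((word, nearest[0]))
    return pairs

# "dog" was said at t=1.2s, closest to the frame captured at t=1.0s.
result = align([("dog", 1.2)], [("f0", 0.0), ("f1", 1.0), ("f2", 2.0)])
print(result)  # [('dog', 'f1')]
```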