Curiosity is fundamental to intelligence: the individuals who approach life questioningly, seeking to understand themselves, others and the world around them, are our premier sources of insight and innovation. Scientists have long tried to build algorithms for curiosity, but recreating human inquisitiveness has proved elusive. Most methods to date cannot assess an artificial intelligence's own knowledge gaps, and AI has struggled to formulate predictive hypotheses.
“Developing curiosity is a problem core to (robot) intelligence” sez George Konidaris @BrownUniversity @BrownCSDept https://t.co/vBWTx3oESv
— Brown Research (@BrownUResearch) June 2, 2017
In essence, while most humans can tell good ideas from bad ones from the outset, intuitively judging what is worth investigating and what isn't, machines have struggled here, wasting a great deal of time on obvious dead ends.
However, Todd Hester and Peter Stone, computer scientists at Google DeepMind and the University of Texas at Austin respectively, set out to tackle the problem.
The pair developed a new algorithm, Targeted Exploration with Variance-And-Novelty-Intrinsic-Rewards (TEXPLORE-VENIR), that relies on a technique called "reinforcement learning" to circumvent the issue.
Find out how we've improved initial performance and learning speed on Atari games using minimal demonstration data https://t.co/BCI1jqrLAZ pic.twitter.com/AHFS23ybXJ
— DeepMind (@DeepMindAI) April 13, 2017
In reinforcement learning, an AI program is rewarded when a path it takes brings it closer to some predefined goal, such as the answer to a difficult maths problem. Once a path earns a reward, the program becomes more likely to follow it again in future.
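To make that concrete, here is a minimal sketch of the standard reinforcement-learning idea described above, using tabular Q-learning; it is not the authors' implementation, and the learning rate, discount factor and exploration rate are illustrative assumptions.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch: the agent learns to prefer actions
# that have previously led to reward. Parameter values are illustrative
# assumptions, not taken from the research described here.

ALPHA = 0.1    # learning rate
GAMMA = 0.95   # discount factor
EPSILON = 0.1  # chance of trying a random action

q_table = defaultdict(float)  # (state, action) -> estimated value

def choose_action(state, actions):
    # Mostly pick the best-known action, occasionally explore at random.
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: q_table[(state, a)])

def update(state, action, reward, next_state, actions):
    # Nudge the estimate toward the reward plus the best value of the next
    # state, so rewarded paths become more likely to be chosen again.
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + GAMMA * best_next
    q_table[(state, action)] += ALPHA * (target - q_table[(state, action)])
```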
DeepMind researchers have previously run experiments that challenged software bots to complete a series of tasks, such as moving to a specific location, in a simple two-dimensional virtual world; framing the challenge as cooperative rather than competitive encouraged the bots to collaborate.
TEXPLORE-VENIR, however, sets an internal goal for the program, and the program rewards itself for comprehending something new, even if the knowledge doesn't bring it closer to an ultimate goal; serendipitous discoveries and understanding are just as valuable as defined targets. It also rewards itself for reducing uncertainty, that is, for becoming familiar with new things.
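As a rough illustration of how novelty and uncertainty bonuses can be folded into the reward an agent optimises, consider the sketch below; the distance measure, the ensemble-variance estimate and the weighting coefficients are assumptions made for illustration, not the formula from the TEXPLORE-VENIR paper.

```python
import numpy as np

# Curiosity-style intrinsic rewards: a bonus for visiting novel states and a
# bonus for states where the agent's learned model is still uncertain.
# NOVELTY_WEIGHT, VARIANCE_WEIGHT and the measures below are illustrative
# assumptions, not the published method.

NOVELTY_WEIGHT = 1.0
VARIANCE_WEIGHT = 1.0

visited_states = []  # feature vectors of states seen so far

def novelty_bonus(state):
    # Reward states that are far from anything visited before.
    if not visited_states:
        return NOVELTY_WEIGHT
    nearest = min(np.linalg.norm(state - s) for s in visited_states)
    return NOVELTY_WEIGHT * nearest

def variance_bonus(model_predictions):
    # Reward states where an ensemble of learned models disagrees,
    # i.e. where the agent is uncertain about what will happen next.
    return VARIANCE_WEIGHT * float(np.var(model_predictions, axis=0).mean())

def total_reward(extrinsic, state, model_predictions):
    # The agent optimises the task reward plus its own curiosity bonuses,
    # so learning something new is valuable even without external payoff.
    return extrinsic + novelty_bonus(state) + variance_bonus(model_predictions)
```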
The computer scientist pair tested their method in two scenarios. The first was a virtual maze consisting of a circuit of four rooms connected by locked doors, in which the bot had to locate a key, pick it up, and unlock a door. Each time it passed through a door it earned 10 points, and it had 3000 steps in which to rack up a high score. When the researchers let the bot explore for 1000 steps guided only by TEXPLORE-VENIR, it earned 55 door points on average during the 3000-step test phase; with other curiosity algorithms guiding its exploration, its test-phase score ranged from zero to 35. In a variant of the task, in which the bot had to explore and pass through doors at the same time, TEXPLORE-VENIR earned about 70 points, R-Max (another established exploration algorithm) earned about 35, and the rest earned fewer than five.
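The two-phase protocol, free exploration followed by a scored test, could be sketched as follows; the environment and agent interfaces here are hypothetical stand-ins rather than the researchers' code.

```python
# Schematic sketch of the evaluation described above: 1000 steps of
# curiosity-driven exploration, then a 3000-step test phase in which each
# door passage is worth 10 points. The agent and environment interfaces are
# hypothetical stand-ins, not the authors' implementation.

EXPLORE_STEPS = 1000
TEST_STEPS = 3000
POINTS_PER_DOOR = 10

def evaluate(agent, env):
    # Phase 1: exploration guided only by the agent's intrinsic rewards.
    state = env.reset()
    for _ in range(EXPLORE_STEPS):
        action = agent.act(state, intrinsic_only=True)
        state, _, door_opened = env.step(action)

    # Phase 2: scored test phase; every door the agent passes through counts.
    score = 0
    state = env.reset()
    for _ in range(TEST_STEPS):
        action = agent.act(state)
        state, _, door_opened = env.step(action)
        if door_opened:
            score += POINTS_PER_DOOR
    return score
```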
In the second, the algorithm was loaded onto a toy humanoid robot, the Nao. In three separate tasks, the machine earned points for hitting a cymbal, holding pink tape in front of its eyes, and pressing a button on its foot. Averaged over 13 trials, the Nao was better at finding the pink tape on its hand when exploring with TEXPLORE-VENIR than when exploring randomly. It pressed the button in seven of 13 trials using TEXPLORE-VENIR but never while exploring randomly, and hit the cymbal in one of five trials using TEXPLORE-VENIR, but never while exploring randomly. By experimenting with its own body and environment, the robot came to the assigned tasks well prepared; the researchers liken it to a baby learning how its limbs work before learning to crawl.
Nonetheless, curiosity can have a deleterious effect on a robot's productivity: if the rewards for gaining insight are greater than those for completing its basic, core tasks, the latter may be neglected in favor of an obsessive focus on the former.
R-Max earned fewer points when exploration was layered on top of door-unlocking precisely because it was distracted by its own curiosity: AI attention deficit disorder, in other words.
On the other hand, external rewards can also interfere with learning, much like a student chasing high grades or gold stars rather than learning for its own sake. The challenge now is to train robots to strike the right balance of internal and external rewards.