When Google open sourced its TensorFlow artificial intelligence engine last week, freely sharing the code with the world at large, Lukas Biewald didn’t see it as a triumph of the free software movement. He saw it as a triumph of data.
That’s how you’d expect him to see it. He’s the CEO of the San Francisco startup CrowdFlower, which helps online companies like Twitter juggle massive amounts of data. But after spending time at the Stanford AI Lab, he knows artificial intelligence. And his point is a good one.
In open sourcing the TensorFlow AI engine, Biewald says, Google showed that, when it comes to AI, the real value lies not so much in the software or the algorithms as in the data needed to make it all smarter. Google is giving away the other stuff, but keeping the data.
“As companies become more data-driven, they feel more comfortable open sourcing lots of [software]. They know they’re sitting on lots of proprietary data that nobody else has access to,” says Biewald, who also worked at Yahoo as a search engineer and helped bootstrap a notable search startup called Powerset, now owned by Microsoft. “What they’re not opening up is their data. They would never do that.”
Making Machines Smarter
Biewald compares this to IBM’s recent purchase of The Weather Company’s digital assets, where Big Blue paid millions largely to acquire data it could use to feed its AI ambitions. “It’s interesting that while companies are buying data, they’re open-sourcing their algorithms,” he says. “It’s pretty clear where these companies’ bets are, in terms of what matters for machine learning.”
TensorFlow, you see, deals in a form of AI called deep learning. With deep learning, you teach systems to perform tasks such as recognizing images, identifying spoken words, and even understanding natural language by feeding data into vast neural networks: webs of connected machines that approximate the web of neurons within the human brain. If you feed photos of cats into a neural net, you can teach it to recognize cats. If you feed it conversational data, you can teach it to carry on conversations.
The algorithms that drive these neural networks aren’t new. They date to the 1980s. What’s new is that, thanks to the Internet, their creators have the processing power and the enormous amounts of data to make these algorithms viable. To teach a system to recognize a cat, you need an awful lot of machines and an awful lot of cat photos.
After the rise of cloud computing, in which companies like Amazon and Microsoft rent access to the vast processing power of the net, we all have access to vast arrays of machines. But the richest data sits inside massive companies like Google and Facebook. Billions of people use their services, which trade in a rich trove of information, from text to photos to videos to speech and beyond. Both companies are hard at work building powerful AI software. But their real competitive edge comes from having a vast quantity of high-quality data they can use to teach this software to “think” more like a human.
Talent Needs Data
To be sure, Biewald is exaggerating (at least a bit) to make a point. Though Google has open sourced some very important pieces of its AI engine, it’s keeping other pieces to itself (at least for now). Talent also matters in the competitive space. Though the algorithms that drive this technology are old, they evolve at a rapid pace, moving into more and more areas, and this evolution is driven by some very smart people.
That’s one of the reasons Google open sourced TensorFlow. If people beyond the company can use its software, Google can more easily bring talent and ideas into the company—and its software. It also can continue to work with people who have left the company. “We have a lot of summer interns coming in and they do a lot of interesting research while they are here at Google,” says Jeff Dean, one of the Google engineers at the heart of the company’s AI work. “For some kinds of problems, they can basically just take their work and continue developing it on the open source release of TensorFlow.”
But there’s another reason Google can attract the top deep learning researchers: its data. The same goes for Facebook and other Internet giants. In recent years, many of the field’s top researchers have already joined these companies, including University of Toronto professor Geoff Hinton (now at Google), New York University professor Yann LeCun (now at Facebook), and Stanford professor Andrew Ng (now at Chinese search giant Baidu).
As Biewald points out, you can’t necessarily get access to the same data if you’re an academic. “It’s kinda hard for academics and startups to do really meaningful machine learning work,” he says, “because they don’t have access to the same kind of datasets that a Google or an Apple would have.”
Yes, Apple generates lots of data too, through services like Siri. But some feel Apple could be at a disadvantage because, after taking a more extreme stance on privacy than Google and Facebook, it more tightly restricts how its engineers can make use of the data they do have. That’s how important digital information is to this movement. Ken Forbus, a professor of computer science at Northwestern University who specializes in AI, believes Apple may have to rely more heavily on technologies beyond the deep learning realm because of its stance on privacy.
There are many ways Apple can work around this, including changing its privacy policies. Like Google and others, it has acquired its own deep learning startups, and it has attracted AI talent in other ways. But one thing is indisputable: The future of AI can’t happen without the data.