Teaching Robots to Feel: Emoji & Deep Learning :space_invader: :thought_balloon: :two_hearts:
Recently, neural networks have become the tool of choice for a variety of tough computer-science problems: Facebook uses them to identify faces in photos, Google uses them to identify everything in photos. Apple uses them to figure out what you’re saying to Siri, and IBM uses them for operationalizing business unit synergies.
It’s all very impressive. But what about the real problems? Can neural networks help you find the :100: emoji when you really need it?
Why, yes. Yes they can. :smirk:
This post will outline some of the engineering behind Dango: how we automatically learn from hundreds of millions of real-world uses of emoji, and distill this down to a tool small and fast enough to predict emoji for you in real time on your phone. :iphone: :thought_balloon: :bulb: :calling: :dango:
What is Dango?
Dango is a floating assistant that runs on your phone and predicts emoji, stickers and GIFs based on what you and your friends are writing in any app. This lets you have the same rich conversations everywhere: Messenger, Kik, WhatsApp, Snapchat, whatever. (Just making this possible in every app is an engineering challenge of its own, but that’s another story.)
Suggesting emoji is hard: it means Dango needs to understand the meaning of what you’re writing in order to suggest emoji you might want to use. At its core, Dango’s predictions are powered by a neural network. Neural nets are computational structures with millions of adjustable parameters connected in ways that are loosely inspired by neurons in the brain.
A neural network is taught by randomly initializing these parameters and then showing the network millions of real-world examples of emoji use taken from across the web, like Hey how’s it going :wave: , Want to grab a :beers: tonight? , Ugh :rage: , and so on. At first the network just guesses randomly, but over time with each new training example, it slightly adjusts its millions of parameters so it performs better on that example. After a few days on a top-of-the-line GPU, the network starts outputting more meaningful suggestions:
Want to grab a drink tonight? :tropical_drink: :beer: :wine_glass: :cocktail: :smirk:
Things we’ve learned about emoji
The data-driven approach to emoji prediction means that Dango is smarter about emoji than we are. Dango has taught us new slang, and inventive ways that people around the world tell stories with emoji.
For instance: if you write "Kanye is the", Dango will predict the :goat: emoji. This goat of course represents Greatest of All Time (G.O.A.T.), a title Kanye bestowed upon himself earlier this year:
after realizing he is the greatest living artist and greatest artist of all time.
— KANYE WEST (@kanyewest) February 14, 2016
Dango can express things that aren’t represented by any single emoji. For instance, if you’re a resident of B.C. or Colorado and enjoy "relaxing", Dango speaks your language.
420 tonight? :kissing_smiling_eyes: :dash: :smoking: :maple_leaf:
If you’re mad at someone and just want them to GTFO, Dango will helpfully show them the door:
Dango has also learned plenty from internet culture. It understands memes and trends. For instance, if you’ve seen the "but that’s none of my business" image of Kermit the Frog sipping tea:
but that’s none of my business :frog: :coffee:
There are many other subtle references and jokes that Dango understands, and certainly many we’ve not yet discovered. It’s always learning, to make sure that it keeps up to date.
More than just emoji
Given that Dango is trained on emoji, it might at first seem that the number of concepts it can understand and represent is small. As of this writing, the Unicode Consortium has standardized 1624 emoji which, despite being a headache for font designers, is still a relatively small number.
However, this doesn’t mean that there are only 1624 meanings. When you use emoji, their meaning is determined by how they look and the context of their usage, which can be highly diverse. :pray: can mean "high-five" or "thank you" or "please". :eggplant: can mean… eggplant, exclusively. What’s more, emoji can be combined to express new concepts. :kissing_smiling_eyes: is a kissing face, but :kissing_smiling_eyes::notes: is whistling, and :kissing_smiling_eyes::dash: is exhaling smoke. These emoji combos can become quite elaborate:
at the dentist :grimacing: :syringe: :mask:
stuck in traffic :vertical_traffic_light: :car: :taxi: :blue_car:
All this means that the number of semantic concepts that Dango can represent is much greater than simply the number of individual emoji. This is a powerful concept, because it gives Dango a way of understanding a wide variety of general concepts, regardless of whether the Consortium has recognized them with their own symbols.
Dango is therefore also able to suggest stickers and GIFs. Since, as shown earlier, Dango knows about get out :point_right: :door: :point_left: , it can suggest this GIF for you as well:
Let’s dig a little deeper into how that works.
A naïve approach to suggesting emoji (and the approach we first tried with Dango) would be to directly map some words to emoji: pizza :pizza: , dog :dog: , etc. But this approach is limited and doesn’t reflect how emoji (and language) are actually used. There are many examples where a subtle combination of words determines meaning in a way that is impossible to concisely describe with a simple mapping.
My girlfriend left :broken_heart:
you got it :ok_hand::fist:
you know it :smirk:
he’s the one ❤ :couple:
She said yes! :heart_eyes: :ring: :raised_hands:
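To make the limitation concrete, here is a minimal sketch of the direct word-to-emoji mapping. The dictionary and the `naive_suggest` function are made up for illustration; this is not Dango's actual code.

```python
# Hypothetical lookup table: one word maps to one emoji.
WORD_TO_EMOJI = {
    "pizza": ":pizza:",
    "dog": ":dog:",
    "happy": ":smiley:",
}


def naive_suggest(text):
    """Suggest an emoji for every word that appears in the lookup table."""
    words = [word.strip(".,!?") for word in text.lower().split()]
    return [WORD_TO_EMOJI[word] for word in words if word in WORD_TO_EMOJI]


print(naive_suggest("I want pizza"))        # [':pizza:']
print(naive_suggest("My girlfriend left"))  # []
print(naive_suggest("I'm not very happy"))  # [':smiley:']
```

The mapping has nothing to say about "My girlfriend left" because no single word carries the meaning, and it cheerfully ignores the "not" in "I'm not very happy".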
To handle these cases, Dango uses a recurrent neural network (RNN). An RNN is a particular neural network architecture that is well suited to sequential input, and is therefore used in areas as diverse as natural language processing, speech processing, and financial time-series analysis. We’ll quickly go over what an RNN is at a high level here, but for a more in-depth introduction take a look at Andrej Karpathy’s great overview.
RNNs handle sequential input by maintaining an internal state, a memory which lets them keep track of what they saw earlier. This is important to be able to tell the difference between I’m very happy :blush: :smiley: and I’m not very happy :pensive: :disappointed: :unamused: .
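As a rough illustration of what "maintaining an internal state" means, here is a toy recurrent step in plain Python. Everything in it is an assumption made for the sketch: the four-word vocabulary, the state size, and the random, untrained weights. It only shows the mechanics, not Dango's real network.

```python
import math
import random

random.seed(0)  # fixed seed so the toy weights are reproducible

HIDDEN = 4  # size of the internal state (hypothetical)
VOCAB = ["i'm", "not", "very", "happy"]

# Random, untrained parameters: a word vector per word plus two weight matrices.
embed = {w: [random.uniform(-1, 1) for _ in range(HIDDEN)] for w in VOCAB}
W_in = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]
W_state = [[random.uniform(-1, 1) for _ in range(HIDDEN)] for _ in range(HIDDEN)]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(state, word):
    """Mix the previous state with the current word to produce the new state."""
    from_input = matvec(W_in, embed[word])
    from_state = matvec(W_state, state)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]


def encode(words):
    """Read the words one at a time, carrying the state forward."""
    state = [0.0] * HIDDEN  # empty memory before any word is read
    for word in words:
        state = rnn_step(state, word)
    return state


happy = encode(["i'm", "very", "happy"])
not_happy = encode(["i'm", "not", "very", "happy"])
print(happy != not_happy)  # the extra "not" leaves its mark on the final state
```

Because each step feeds the previous state back in, the final state depends on every word and on the order in which the words arrived, which is exactly what a word-by-word lookup cannot capture.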
Multiple RNNs can also be stacked on top of each other: each RNN layer takes its input sequence and transforms it into a new, more abstracted representation that is then fed into the next layer, and so on. The deeper you stack these networks, the more complex the sorts of functions they can represent. Incidentally, this is where the now popular term “deep learning” comes from. Major breakthroughs on hard problems like computer vision have come partly from simply using deeper and deeper stacks of network layers.
Dango’s neural network ultimately spits out a list of hundreds of numbers. The list can be interpreted as a point in a higher-dimensional space, just as a list of three numbers can be interpreted as the x-, y-, and z-coordinates of a point in three-dimensional space.
We call this high-dimensional space semantic space. Think of it as a multi-dimensional grid where various ideas exist at various points. In this space, similar ideas are close together. Deep learning pioneer Geoff Hinton evocatively refers to points in this space as “thought vectors”. What Dango learned during the training process was how to convert both natural language sentences and emoji into individual vectors in this semantic space.
So when Dango receives some text, it maps it into this semantic space. To decide which emoji to suggest, it then projects each emoji’s vector onto this sentence vector. Projection is a simple operation which gives a measure of similarity between two vectors. Dango then suggests the emoji with the longest projections: these are the ones closest in meaning to the input text.
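Here is a small sketch of that ranking step. The three-dimensional vectors are invented for the example (the real ones have hundreds of dimensions and are learned, not hand-written), and `projection_length` and `suggest` are hypothetical names.

```python
import math

# Hypothetical emoji vectors in a tiny 3-dimensional "semantic space".
EMOJI_VECTORS = {
    ":beer:": [0.9, 0.1, 0.0],
    ":wine_glass:": [0.8, 0.3, 0.1],
    ":broken_heart:": [-0.2, -0.9, 0.1],
    ":pizza:": [0.1, 0.2, 0.9],
}


def projection_length(emoji_vec, sentence_vec):
    """Scalar projection of the emoji vector onto the sentence vector."""
    dot = sum(e * s for e, s in zip(emoji_vec, sentence_vec))
    norm = math.sqrt(sum(s * s for s in sentence_vec))
    return dot / norm


def suggest(sentence_vec, top_n=2):
    """Return the names of the emoji with the longest projections."""
    ranked = sorted(
        EMOJI_VECTORS,
        key=lambda name: projection_length(EMOJI_VECTORS[name], sentence_vec),
        reverse=True,
    )
    return ranked[:top_n]


# Pretend the network mapped "Want to grab a drink tonight?" to this point.
drink_sentence = [1.0, 0.2, 0.0]
print(suggest(drink_sentence))  # [':beer:', ':wine_glass:']
```

Since every emoji is projected onto the same sentence vector, dividing by the sentence's length does not change the ranking; it just keeps the scores comparable across sentences.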
Visualizing semantic space :open_mouth::thought_balloon::milky_way:
For those of us who are visual thinkers, this spatial metaphor is a powerful tool to help us intuit and talk about neural networks. (At Whirlscape we are addicted to spatial metaphors; see our earlier post about the algorithms of the Minuum keyboard.)
To help us visualize Dango’s semantic space, we can use a popular technique for visualizing high-dimensional spaces called t-distributed stochastic neighbour embedding, or t-SNE. This technique tries to place each high-dimensional point into two dimensions in such a way as to make sure that points that were close to each other in the original space remain close in the two-dimensional space. Although this mapping will be imperfect, it can still tell us a lot. Let’s use t-SNE to visualize the emoji floating in semantic space:
Notice here how semantically similar emoji are clustered together automatically in this space. For example, most of the faces are clustered together in “Face peninsula”. The faces arrange themselves with the happy :grinning::stuck_out_tongue_closed_eyes::blush: in one region, the angry :angry::rage::confounded: in another. All of the heart emoji are also clustered right nearby at the peak that we call “Point Love”.
Further along the tail of the shape you can see other interesting groupings: :basketball::football:⚾:soccer: are all near each other, emoji faces-with-hair :man::woman::girl::boy: are clustered in isolation away from the faces-without-hair (because why would they want to hang out?). Right towards the end you see a number of flags and less popular emoji like the filing cabinet and the fast-forward sign.
Again, Dango was never explicitly told that faces are somehow different from hearts, or beers, or farm animals. Dango generated this semantic map by training on hundreds of millions of examples of real-world emoji use taken from across the web. So what do we mean by training?
Before training, a neural network is initialized: it is given a set of more-or-less random values. It is, essentially, a clean slate. Sentences map randomly into semantic space, wherein the emoji are randomly scattered.
To train a neural network, we define an objective function; this is essentially a way of grading the network’s performance on a given example. The objective function outputs a score telling us how well or badly Dango did predicting a given example. The smaller the score, the better. We then use a very simple algorithm called gradient descent. With each training example, gradient descent slightly modifies the values of all of the millions of parameters in the neural network in whatever direction most reduces the objective function.
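To see the loop in miniature, here is a sketch with three parameters instead of millions. It assumes a softmax cross-entropy objective, a common choice that the post does not confirm for Dango, and it treats one score per emoji as the only parameters. All names are hypothetical.

```python
import math

EMOJI = [":beers:", ":rage:", ":wave:"]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    biggest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def objective(scores, target_index):
    """Cross-entropy: small when the correct emoji gets high probability."""
    return -math.log(softmax(scores)[target_index])


def gradient(scores, target_index):
    """Derivative of the objective with respect to each score."""
    probs = softmax(scores)
    return [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]


def train(scores, target_index, learning_rate=0.5, steps=50):
    """Repeatedly nudge the scores in the direction that lowers the objective."""
    history = [objective(scores, target_index)]
    for _ in range(steps):
        grad = gradient(scores, target_index)
        scores = [s - learning_rate * g for s, g in zip(scores, grad)]
        history.append(objective(scores, target_index))
    return scores, history


start = [0.0, 0.0, 0.0]  # a clean slate: every emoji is equally likely
final_scores, losses = train(start, EMOJI.index(":beers:"))
print(losses[0] > losses[-1])  # the objective went down during training
```

Each step moves the parameters a small distance against the gradient, so the score for the correct emoji rises and the objective shrinks. A real network does the same thing, only with the gradient flowing back through every layer.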
After several days of this procedure taking place on GPUs, the objective function cannot be improved any further — Dango is fully trained and ready to take on the world!
The future of language
Language is becoming visual. Emoji, stickers, and GIFs are exploding in popularity, despite the fact that it’s still labour-intensive to use them in an advanced way. Enthusiasts create personal collections of images for every situation and have memorized every page of the emoji keyboard, but the rest of us rely on using emoji immediately accessible on our “most used” menu and sometimes forward a GIF here and there.
This visual language has matured alongside technology, and this symbiotic relationship will continue, with new technology informing new language, which in turn informs the technology again. Communication in the future will have artificial intelligence tools adapted to you, helping you seamlessly weave imagery with text, and Dango is proud to be at the cutting edge of this progression.
Hopefully you’ve been inspired by this under-the-hood look, and now, like us, you’ll picture your every sentence floating somewhere in semantic space, surrounded by hundreds of emoji. Maybe you’ll start playing around with neural networks yourself. Let us know!
And, of course, please try Dango and give us feedback, so that whenever you ask yourself "What emoji should I use?", Dango will be there with the answer.