Add emoji support on your Elasticsearch

Since 2011 and Unicode 6.0, emoji have been an integral and standardized part of the computer environment. They are more and more available on users’ keyboards (thanks to Android and iPhone), and as web developers and engineers it is our duty to support them everywhere we can, even if we don’t like them.

There is no reason to believe that users will not want to use an emoji inside their usernames, biographies or even passwords, as they are valid characters in the same way as © or é.

A great many websites and applications are broken right now because MySQL’s utf8 character set can only store a subset of Unicode characters (those that fit in three bytes). So if you try to save an emoji in a utf8_general_ci table, it can go pretty badly wrong. The first thing to do is to migrate to the utf8mb4 character set, introduced in MySQL 5.5.3.
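To see why the legacy character set falls short, it helps to count bytes. The sketch below is plain Python and does not talk to MySQL; it only relies on the fact that MySQL’s utf8 stores at most three bytes per character. The helper name fits_in_mysql_utf8 is made up for this illustration.

```python
# Illustration only: MySQL's legacy "utf8" charset stores at most 3 bytes
# per character, while utf8mb4 stores up to 4.
MYSQL_UTF8_MAX_BYTES = 3


def fits_in_mysql_utf8(text: str) -> bool:
    """Return True if every character of `text` needs at most 3 UTF-8 bytes."""
    return all(len(ch.encode("utf-8")) <= MYSQL_UTF8_MAX_BYTES for ch in text)


# e-acute, copyright sign, doughnut emoji (written as escapes on purpose)
for sample in ["\u00e9", "\u00a9", "\U0001F369"]:
    size = len(sample.encode("utf-8"))
    print(f"U+{ord(sample):04X}", size, fits_in_mysql_utf8(sample))
```

The first two characters need two bytes each, while the emoji needs four, which is exactly what utf8mb4 was introduced for.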

In this article, I would like to present my solution to index emoji and search for them in Elasticsearch. The folks at Yelp do it already and it’s pretty sweet: you can search for donuts using an emoji!

Let’s dive into Lucene and emoji!

What is an emoji

What I am talking about here are the pictorial symbols presented in a colorful form and used inline in text, as we see them in Twitter, Slack, WhatsApp, git commit messages…

We must distinguish:

  • emoticon: text supposed to represent an expression, like ;-) or ¯\_(ツ)_/¯. There is not much we can do here as there is no standard or semantics to extract from them;
  • emoji: pictograms that can be used in text. They can be displayed as image or text:
    • emoji characters are the text representation of emoji, normal glyphs encoded in fonts like any other character. That’s what we will be searching for;
    • emoji presentation is the graphical representation, only to be considered by the display side. Each system can bring its own images, so they don’t look identical everywhere, but all are supposed to mean the same thing.

The specification is complex and introduces more than just glyphs:

  • there is a variation selector character that lets you choose between the text and graphical representations (U+FE0E and U+FE0F);
  • some emoji can be modified via an Emoji Modifier, for example to change skin tone (U+1F3FB..1F3FF);
  • you can combine some emoji with a zero-width joiner (ZWJ) character to display them as a single image (for example, a man, a heart and a man can be displayed as one “couple” image).
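These three mechanisms are easier to grasp when looking at the code points behind a combined emoji. The sketch below uses Python’s standard unicodedata module to list the characters of one ZWJ sequence (man, heart, man); the sequence is written with escapes so the hidden characters stay visible, and describe is just an illustrative helper name.

```python
import unicodedata
from typing import List

# A man + heart + man ZWJ sequence. The heart carries VARIATION SELECTOR-16
# to request the graphical presentation.
couple = "\U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"


def describe(text: str) -> List[str]:
    """Return one 'U+XXXX NAME' line per code point in `text`."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}" for ch in text]


for line in describe(couple):
    print(line)
```

What a user perceives as one picture is actually six code points, two of which are joiners and one a variation selector.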

In terms of availability, MacOS 10 and iOS 5 were the first to implement the specification (badly). Today you can expect users to have emoji support on Windows 8+, Android, iPhone, iPad, MacOS… and on Linux of course. This means emoji can be displayed anywhere as emoji characters, and in some places as emoji presentation. I strongly recommend having a look at caniemoji.com.

Users can type emoji everywhere. Be ready!

How are we supposed to search them

The specification describes how to search with emoji:

Searching includes both searching for emoji characters in queries, and finding emoji characters in the target. These are most useful when they include the annotations as synonyms or hints. For example, when someone searches for ⛽︎ on yelp.com, they see matches for “gas station”. Conversely, searching for “gas pump” in a search engine could find pages containing ⛽︎.

Annotations are language-specific: searching on yelp.de, someone would expect a search for ⛽︎ to result in matches for “Tankstelle”.

That’s why we can search for donuts on Yelp by typing 🍩: it is translated to annotations, for example dessert, donut and sweet. Then it’s just matching on text like a normal search. Conversely, if I store a tweet containing 🍩, I must be able to find it when searching for dessert or donut. So each supported emoji must have a textual equivalent, which of course needs to be translated depending on your content language.

Skin tone modifiers and variation selectors can safely be ignored in search because we are only interested in the glyphs, not the way they are displayed.

But combined emoji should have their own annotations:

  • a man, heart, man sequence (or its joined, single-image form) must not only match for “man” and “love”, but also “couple”;
  • a joined family sequence containing a man and a girl must not only match for “man” and “girl”, but also “family”…

And finally, searching for 🍩 should return documents containing the emoji 🍩 with a higher rank than the ones merely talking about desserts: the glyph itself must be a search criterion too.
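One possible way to get this ranking, sketched here as a Python function that only builds the request body, is a bool query: a must clause on the field analyzed with emoji synonyms, plus a should clause on a sub-field analyzed without synonyms. The field names tweet and tweet.exact and the boost value are assumptions for the example, not something this article’s mapping defines.

```python
import json


def glyph_boosted_query(user_input: str) -> dict:
    """Match on the synonym-expanded field and boost literal matches.

    "tweet" is assumed to be analyzed with emoji synonyms, while
    "tweet.exact" is a hypothetical sub-field analyzed without them.
    """
    return {
        "query": {
            "bool": {
                "must": {"match": {"tweet": user_input}},
                "should": {
                    "match": {"tweet.exact": {"query": user_input, "boost": 2}}
                },
            }
        }
    }


# Doughnut emoji, written as an escape.
print(json.dumps(glyph_boosted_query("\U0001F369"), ensure_ascii=False, indent=2))
```

Documents about desserts still match through the must clause, while those containing the glyph itself also satisfy the should clause and score higher.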

Elasticsearch analyzer for emoji

The default analyzer in Elasticsearch is called standard , it behaves like this when given a text containing an emoji:

    GET /_analyze?analyzer=standard
    {
      "text": "Give me a 🍩 please."
    }

    {
      "tokens": [
        { "token": "give", ... },
        { "token": "me", ... },
        { "token": "a", ... },
        { "token": "please", ... }
      ]
    }

The emoji is considered as “other” and is removed from the tokens. We are going to need a custom analyzer with a tokenizer that keeps 🍩, and whitespace is one of them.

As you can guess, it breaks your content on every whitespace; the produced tokens are Give, me, a, 🍩 and please. (with its trailing dot). We could think our job is done, but it’s not: punctuation is not removed by the whitespace tokenizer!
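The behaviour is easy to reproduce outside Elasticsearch. This Python sketch approximates the whitespace tokenizer with str.split(), which is close enough for this sentence but is not Lucene’s implementation.

```python
def whitespace_tokenize(text: str) -> list:
    # str.split() with no argument splits on runs of whitespace, which is a
    # close approximation of the whitespace tokenizer for this sentence.
    return text.split()


# "\U0001F369" is the doughnut emoji.
tokens = whitespace_tokenize("Give me a \U0001F369 please.")
print(tokens)
```

The emoji survives as its own token, but the last token still carries the trailing dot.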

Sadly, there is no other tokenizer as smart as standard and as permissive as whitespace. If you feel like writing one, be my guest!

In the meantime, we are going to remove punctuation in our analyzer by adding two token filters. This is not ideal because we will not be very smart about it (sometimes punctuation is part of the token), but this is the best solution I could come up with:

    "punctuation_filter": {
      "type": "pattern_replace",
      "pattern": "\\p{Punct}",
      "replacement": ""
    },
    "remove_empty_filter": {
      "type": "length",
      "min": 1
    }

The first token filter removes any punctuation sign (including: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) and the second one removes empty tokens.
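The effect of these two filters can be mimicked in a few lines of Python. This is only a simulation of the behaviour, not Elasticsearch code: it relies on Java’s \p{Punct} covering the ASCII punctuation characters, which is the same set as Python’s string.punctuation, and strip_punctuation is an illustrative name.

```python
import re
import string

# Java's \p{Punct} covers the ASCII punctuation characters, which is the same
# set as Python's string.punctuation.
PUNCT = re.compile("[" + re.escape(string.punctuation) + "]")


def strip_punctuation(tokens):
    """Mimic pattern_replace (remove punctuation), then length (min 1)."""
    cleaned = (PUNCT.sub("", token) for token in tokens)
    return [token for token in cleaned if len(token) >= 1]


print(strip_punctuation(["give", "me", "a", "\U0001F369", "please.", "!!!"]))
```

The token made only of punctuation disappears entirely, and a token such as don't would become dont, which is the lack of smartness mentioned above.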

We also need to clean up modifiers and variation selectors at this stage, before the synonym filter runs, as some hidden characters can be stuck to the real emoji. For example, here is a “Smiling Face With Sunglasses” emoji: 😎.

It is composed like this: \uD83D\uDE0E. Now if I add a variation selector to force the display as text (\uFE0E), my whole token in the analysis process will be \uD83D\uDE0E\uFE0E. The selector is not a whitespace, so our whitespace tokenizer did not remove it. And the resulting token does not match our synonym either, so we need to get rid of the selector.
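The \uXXXX notation used here is the UTF-16 view of the text, where characters above U+FFFF are written as a surrogate pair. The Python sketch below derives those escapes from the actual characters; utf16_escapes is an illustrative helper, not part of any library.

```python
def utf16_escapes(text: str) -> str:
    """Return `text` as a string of \\uXXXX escapes, one per UTF-16 code unit."""
    data = text.encode("utf-16-be")
    units = [data[i:i + 2].hex().upper() for i in range(0, len(data), 2)]
    return "".join("\\u" + unit for unit in units)


print(utf16_escapes("\U0001F60E"))        # smiling face with sunglasses
print(utf16_escapes("\U0001F60E\uFE0E"))  # same, followed by VARIATION SELECTOR-15
```

The second line shows the extra code unit that ends up glued to the token.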

Here is the list of all the characters likely to get attached to an emoji, and that could break things for our synonym filter:

  • \uFE0E: VARIATION SELECTOR-15 (force text representation);
  • \uFE0F: VARIATION SELECTOR-16 (force graphic representation);
  • \uD83C\uDFFB: EMOJI MODIFIER FITZPATRICK TYPE-1-2 (skin tone);
  • \uD83C\uDFFC: EMOJI MODIFIER FITZPATRICK TYPE-3 (skin tone);
  • \uD83C\uDFFD: EMOJI MODIFIER FITZPATRICK TYPE-4 (skin tone);
  • \uD83C\uDFFE: EMOJI MODIFIER FITZPATRICK TYPE-5 (skin tone);
  • \uD83C\uDFFF: EMOJI MODIFIER FITZPATRICK TYPE-6 (skin tone).

There is also \u200D, the ZERO WIDTH JOINER, which is used to merge compatible emoji. We are going to replace it with a space before the tokenization. This way, we can index all the members of a grouped emoji separately.
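Put together, the clean-up described so far can be simulated in Python as follows. This is a sketch of the intended behaviour rather than what Elasticsearch runs internally; the sample text and the emoji_tokens name are made up for the illustration.

```python
import re

ZWJ = "\u200D"
# Variation selectors 15/16 and the five Fitzpatrick skin tone modifiers.
HIDDEN = re.compile("[\uFE0E\uFE0F\U0001F3FB-\U0001F3FF]")


def emoji_tokens(text: str) -> list:
    """Replace ZWJ by a space, split on whitespace, strip hidden characters."""
    tokens = text.replace(ZWJ, " ").split()
    cleaned = (HIDDEN.sub("", token) for token in tokens)
    return [token for token in cleaned if token]


# A waving hand with a skin tone, then a man + heart + man ZWJ sequence.
sample = "\U0001F44B\U0001F3FD \U0001F468\u200D\u2764\uFE0F\u200D\U0001F468"
print(emoji_tokens(sample))
```

Each member of the joined sequence becomes its own token, and the skin tone and variation selector are gone, so every token can match a synonym rule.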

Our final pattern now looks impressive!

    "punctuation_filter": {
      "type": "pattern_replace",
      "pattern": "\\p{Punct}|\\uFE0E|\\uFE0F|\\uD83C\\uDFFB|\\uD83C\\uDFFC|\\uD83C\\uDFFD|\\uD83C\\uDFFE|\\uD83C\\uDFFF",
      "replacement": ""
    }

We then add the filter for our ZWJ and our emoji synonyms, and we are good to go!

    PUT /en-emoji
    {
      "settings": {
        "analysis": {
          "char_filter": {
            "zwj_char_filter": {
              "type": "mapping",
              "mappings": [
                "\\u200D=>\\u0020"
              ]
            }
          },
          "filter": {
            "english_emoji": {
              "type": "synonym",
              "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
            },
            "punctuation_filter": {
              "type": "pattern_replace",
              "pattern": "\\p{Punct}|\\uFE0E|\\uFE0F|\\uD83C\\uDFFB|\\uD83C\\uDFFC|\\uD83C\\uDFFD|\\uD83C\\uDFFE|\\uD83C\\uDFFF",
              "replacement": ""
            },
            "remove_empty_filter": {
              "type": "length",
              "min": 1
            }
          },
          "analyzer": {
            "english_with_emoji": {
              "char_filter": "zwj_char_filter",
              "tokenizer": "whitespace",
              "filter": [
                "lowercase",
                "punctuation_filter",
                "remove_empty_filter",
                "english_emoji"
              ]
            }
          }
        }
      }
    }

Of course, you should add stemming, stop words… as you please (look at the built-in English analyzer if you need inspiration).

Our new english_emoji synonym token filter reads a file called analysis/cldr-emoji-annotation-synonyms-en.txt. I’m using the Solr format here to tell Elasticsearch that 🇫🇷 (the regional indicator pair F and R) translates to 🇫🇷 and france.

    # We use explicit mapping
    # Because "dessert" is not supposed to index "donut".
    🍩 => 🍩, dessert, donut, sweet
    🇫🇷 => 🇫🇷, france
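To make the explicit mapping format concrete, here is a small Python sketch that parses such rules and applies them to a token list. It is a simplification and not the Solr or Lucene parser: it only handles single-token “a => b, c” rules, skips everything else, and ignores token positions. Both function names are illustrative.

```python
def parse_synonyms(lines):
    """Parse explicit "a => b, c" rules; other lines are skipped."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue
        source, _, targets = line.partition("=>")
        rules[source.strip()] = [t.strip() for t in targets.split(",")]
    return rules


def expand(tokens, rules):
    """Replace each token by its mapped tokens, or keep it unchanged."""
    out = []
    for token in tokens:
        out.extend(rules.get(token, [token]))
    return out


rules = parse_synonyms([
    "# We use explicit mapping",
    "\U0001F369 => \U0001F369, dessert, donut, sweet",   # doughnut
    "\U0001F1EB\U0001F1F7 => \U0001F1EB\U0001F1F7, france",  # flag of France
])
print(expand(["give", "me", "a", "\U0001F369", "please"], rules))
```

Because the mapping is explicit (one direction only), the word dessert is left alone and does not pull in donut.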

Here are some examples of what is going to be indexed with this sample:

    GET /en-emoji/_analyze?analyzer=english_with_emoji
    {
      "text": "Eat dessert in 🇫🇷"
    }
    # eat dessert in 🇫🇷 france

    GET /en-emoji/_analyze?analyzer=english_with_emoji
    {
      "text": "Eat dessert in france"
    }
    # eat dessert in france

    GET /en-emoji/_analyze?analyzer=english_with_emoji
    {
      "text": "Give me a 🍩 please."
    }
    # give me a 🍩 dessert donut sweet please

So if I search for “France”, I get both the documents containing the word “France” and the ones containing the emoji 🇫🇷!

As you may guess, the hardest part here is building the synonyms file. As I did not find any on the great internet, I started a repository where I provide synonym dictionaries for all the languages included in the Unicode Common Locale Data Repository!

Version 27 of the CLDR started including emoji annotations, but in a provisional state (not supposed to be used), and we are currently waiting for the stable version 29, with much better content.

You can get all of those emoji synonym dictionaries on GitHub, alongside the scripts used to build them.

I hope it becomes the “go to” destination to build an emoji-capable search, in any language. You will also find more complete examples of Elasticsearch implementation there.

Highlight emoji in search results?

With our synonyms and tokenizer, highlighting will work as expected too:

    GET en-emoji/tweet/_search
    {
      "query": {
        "match": { "tweet": "donut" }
      },
      "highlight": {
        "fields": { "tweet": {} }
      }
    }

Will answer:

    "highlight": {
      "tweet": [
        "I love <em>🍩!</em>"
      ]
    }

Twitter does not support this, now you do!

Support for emoticon

As I said earlier, emoticons can’t really be supported as they are only made of punctuation, but what if we used an Elasticsearch char_filter to translate :) to a smiling emoji such as 🙂 before any tokenization even takes place?

Yes, we would be able to search for “smile” in documents containing a simple punctuation smiley!

This can be done like this:

    "char_filter": {
      "emoticons_char_filter": {
        "type": "mapping",
        "mappings": [
          ":)=>🙂",
          ":(=>🙁"
        ]
      }
    }

This can be a nice addition to your emoji-enabled search engine! You can head over to GitHub to get a more complete, pre-configured list of emoticon-to-emoji mappings for your analyzers. Of course, the mapping from emoticon to emoji is subject to different interpretations and may need customization. The mappings I suggest are based on this package.
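What such a character filter does to the text can be sketched in Python. The table below is hypothetical: the emoji chosen for each emoticon are an assumption made for the example, and the function only imitates the idea of a mapping char_filter (replace before tokenizing), not its exact matching rules.

```python
# Hypothetical emoticon table; the emoji on the right are an assumption.
EMOTICONS = {
    ":)": "\U0001F642",   # slightly smiling face
    ":-)": "\U0001F642",
    ":(": "\U0001F641",   # slightly frowning face
}


def map_emoticons(text: str) -> str:
    """Replace emoticons by emoji, trying longer emoticons first."""
    for emoticon in sorted(EMOTICONS, key=len, reverse=True):
        text = text.replace(emoticon, EMOTICONS[emoticon])
    return text


print(map_emoticons("Great talk :) but the room was cold :("))
```

Once the emoticons are turned into emoji, the synonym filter can attach words such as “smile” to them like for any other emoji.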

Some vendors also chose to store :alarm_clock: instead of ⏰︎ in their databases, and the same recommendations apply: you have to introduce more context into your index to be able to search efficiently with this glyph.

Conclusion

Emoji search is as easy as using synonyms, and it can be a great addition to any website or product you may be building. You may think “who’s lazy enough to type 🍕 instead of pizza”, but it’s about much more than just search:

  • you can now highlight emoji matching a text search;
  • you can find documents similar to other documents composed of emoji;
  • you can add a real meaning to textual emoticons;
  • you don’t just ignore such glyphs but instead can build on their real meaning.

Head over to the emoji-search repository to find all the synonyms and emoticons and build a better search engine for your users. The analyzer we built here may not be the greatest, but it’s a start. If you have any suggestions, feel free to help!

I would like to finish with a Vulcan salute 🖖 (because yes, this is an official emoji); happy searching!

PS: Oh, and if you need some French Elasticsearch consulting or training… we can help.
