How Cosmopolitan Are Emojis?

Exploring Emojis Usage and Meaning over Different Languages
with Distributional Semantics

Francesco Barbieri*, German Kruszewski+, Francesco Ronzano* and Horacio Saggion*
*Natural Language Processing Group (TALN)
Universitat Pompeu Fabra, Barcelona, Spain
+Language, Interaction and Computation Laboratory
University of Trento, Trento, Italy
In Proceedings of the 2016 ACM Multimedia Conference
Download the paper

Choosing the right emoji to visually complement or condense the meaning of a message has become part of our daily life. Emojis are pictures, which are naturally combined with plain text, thus creating a new form of language. These pictures are the same independently of where we live, but they can be interpreted and used in different ways. In this paper we compare the meaning and the usage of emojis across different languages. Our results suggest that the overall semantics of the subset of the emojis we studied is preserved across all the languages we analysed. However, some emojis are interpreted in a different way from language to language, and this could be related to socio-geographical differences.

We gathered a dataset of about 4.5 million tweets, each containing at least one emoji and geolocated in the USA, the United Kingdom, Spain or Italy. To compare the use and meaning of emojis across languages, we created semantic models of emojis for American English, British English, Spanish and Italian using the skip-gram algorithm. The following figure shows the t-SNE visualisation (300 dimensions reduced to 2) of the 150 most frequent emojis for each language. Similar emojis get clustered together, although t-SNE can sometimes be misleading...
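To make the skip-gram step more concrete, here is a minimal sketch of how (target, context) training pairs can be drawn from one tokenised tweet. It is not the training code used for the paper: the helper `skipgram_pairs`, the window size of 2, and the example tweet are all assumptions made for illustration.

```python
from typing import Iterator, List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> Iterator[Tuple[str, str]]:
    """Yield (target, context) pairs for each token and its neighbours.

    `window` is an illustrative value, not the setting used in the paper.
    """
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]


# Invented example tweet; the emoji is already written as an eojiCODE token.
tweet = ["sunny", "day", "at", "the", "beach", "eoji1f60e"]
pairs = list(skipgram_pairs(tweet, window=2))

# Context words the model would learn to predict from the emoji token.
emoji_contexts = [context for target, context in pairs if target == "eoji1f60e"]
```

Skip-gram learns a vector for each token by predicting such context tokens, so an emoji that appears in similar contexts in two languages should end up with similar neighbours in both vector spaces.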

In the paper we compare the four vector spaces more formally, but even just looking at the t-SNE plots we can see interesting patterns. For example, it seems that Twitter users from the UK associate the emoji with beaches and weather (like , , and ), while users from other (Mediterranean) countries like Spain and Italy don't. Also, the emoji is used differently by UK users, as it is similar to Christmas emojis, while in the other languages it is just another tree (clustered close to the vegetation-like emojis). There are many emojis that are used differently, like , and you can spot the differences in the graphs!
However, it seems that the overall semantics of the emojis is similar, and that most of the emojis are used in the same way across the four languages, like and , but also and !

(You might also be interested in this work, where we study how to build good vector spaces for emojis.)
Downloads
Here you can find a Google Sheets document containing all the results of our experiments that we couldn't fit in the paper.
The dimension of the emoji vectors is 300. We trained the embeddings using all the tokens: words, emojis and punctuation (only links were removed). In these models, the emoji characters have been replaced by eojiCODE, where CODE is the Unicode code point of the emoji (for instance, eoji1f60e is the 😎 emoji). See this for the possible Twitter emojis.
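The token convention above can be sketched as a pair of small helpers. The names `emoji_to_token` and `token_to_emoji` are hypothetical. The sketch assumes a single-code-point emoji and lowercase hexadecimal, as in the eoji1f60e example. How multi-code-point emojis (flags, skin-tone variants) are encoded is not stated here, so they are not handled.

```python
PREFIX = "eoji"  # prefix used by the released models, as described above


def emoji_to_token(emoji: str) -> str:
    """Map a single-code-point emoji to its eojiCODE token (assumed lowercase hex)."""
    if len(emoji) != 1:
        raise ValueError("this sketch only handles single-code-point emojis")
    return PREFIX + format(ord(emoji), "x")


def token_to_emoji(token: str) -> str:
    """Map an eojiCODE token back to the emoji character."""
    if not token.startswith(PREFIX):
        raise ValueError("not an eojiCODE token: " + token)
    return chr(int(token[len(PREFIX):], 16))


token = emoji_to_token("\U0001F60E")  # smiling face with sunglasses
emoji = token_to_emoji("eoji1f60e")
```

These helpers are handy for looking up an emoji in the vocabulary of a model and for turning nearest-neighbour tokens back into readable characters.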

Word2Vec Text Format

One line per vector, tab-separated; the first column is the token.

United States of America

United Kingdom

Spain

Italy
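A minimal sketch of reading this text format with the standard library only. `parse_vectors` and `cosine` are hypothetical helper names. The sample lines are invented and use 3 values instead of 300 to keep the example short. Since the exact layout after the token is not spelled out, the parser splits the remainder of each line on any whitespace and skips lines that carry no values.

```python
import math
from typing import Dict, Iterable, List


def parse_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'token<TAB>v1<TAB>v2...' lines into a token -> vector mapping."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        token, _, rest = line.rstrip("\n").partition("\t")
        values = rest.split()  # splits on tabs or spaces
        if not token or not values:
            continue  # skip blank lines or lines without values
        vectors[token] = [float(v) for v in values]
    return vectors


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented toy data, NOT taken from the released models.
sample = [
    "eoji1f60e\t0.1\t0.3\t-0.2\n",
    "beach\t0.2\t0.6\t-0.4\n",
    "rain\t-0.3\t0.1\t0.5\n",
]
vectors = parse_vectors(sample)
sim_beach = cosine(vectors["eoji1f60e"], vectors["beach"])
sim_rain = cosine(vectors["eoji1f60e"], vectors["rain"])

# With a downloaded file (hypothetical path):
# with open("usa_vectors.txt", encoding="utf-8") as f:
#     vectors = parse_vectors(f)
```

Ranking all tokens by cosine similarity to an emoji token in each of the four files gives a simple way to inspect how the neighbours of that emoji differ between languages.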

Gensim binaries

Use the gensim library to load them.

United States of America

United Kingdom

Spain

Italy
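The page does not say how the binaries were saved, so the following is only a sketch. It assumes the files are either models written with gensim's own `save` method or vectors in the word2vec binary format, and it tries both. `load_emoji_vectors` is a hypothetical helper and the path in the usage comment is a placeholder. Files written by an old gensim release may not load in a current one.

```python
def load_emoji_vectors(path: str):
    """Return gensim KeyedVectors loaded from `path` (sketch, see assumptions above)."""
    # Imported lazily so that defining this helper does not require gensim.
    from gensim.models import KeyedVectors, Word2Vec

    try:
        # Case 1 (assumption): a full model saved with model.save(path).
        return Word2Vec.load(path).wv
    except Exception:
        # Case 2 (assumption): vectors stored in the word2vec binary format.
        return KeyedVectors.load_word2vec_format(path, binary=True)


# Usage (hypothetical file name):
# vectors = load_emoji_vectors("usa_model.bin")
# print(vectors.most_similar("eoji1f60e", topn=5))
```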



All these materials are freely available under the Creative Commons CC BY 3.0 license; please use the reference below for attribution.


If you use these resources, please cite us:

		
@inproceedings{barbieri2016cosmopolitan,
  title={How Cosmopolitan Are Emojis?: Exploring Emojis Usage and Meaning over Different Languages with Distributional Semantics},
  author={Barbieri, Francesco and Kruszewski, German and Ronzano, Francesco and Saggion, Horacio},
  booktitle={Proceedings of the 2016 ACM on Multimedia Conference},
  pages={531--535},
  year={2016},
  organization={ACM}
}