What does this Emoji Mean?

A Vector Space Skip-Gram Model for Twitter Emojis

Natural Language Processing Group (TALN)
Universitat Pompeu Fabra, Barcelona, Spain

Emojis allow us to describe objects, situations and even feelings with small images, providing a visual and quick way to communicate.
We analyse the emojis used on Twitter with distributional semantic models. We retrieved 10 million tweets posted by users in the USA and built several skip-gram word embedding models that map words and emojis into the same vector space. We test our models with semantic similarity experiments, comparing their output with human assessments. We also carry out an exhaustive qualitative evaluation.

Our approach, results and evaluation are described in detail in the paper:

What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis
Francesco Barbieri, Francesco Ronzano and Horacio Saggion
In Proceedings of the Language Resources and Evaluation Conference (LREC) 2016
Download
Here you can access the interactive visualization of the semantic vector space of Twitter emojis and download the embedding models we generated.
Explore the emoji space yourself!
(scroll with the mouse to zoom)
Visualisation of 100 emoji vectors, reduced to two dimensions with t-SNE. Similar emojis cluster together.

Download the emoji embedding models
Each vector has 300 dimensions. Three models are available. raw is the raw model, trained on the whole dataset (only links were removed). clean was trained on a cleaner dataset from which stopwords, punctuation and links were removed. onlyemo was trained only on emojis. In these models, each emoji character has been replaced by the token eojiCODE, where CODE is the Unicode code point of the emoji (for instance, eoji1f60e is the emoji 😎, U+1F60E). See this for the possible Twitter emojis.
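
The token convention can be illustrated with a small sketch. It assumes, based only on the eoji1f60e example, that CODE is the lowercase hexadecimal code point without padding, and it handles only emojis made of a single code point; how multi-code-point sequences (flags, skin-tone modifiers) are encoded in the models is not stated here. The names emoji_to_token and token_to_emoji are hypothetical helpers, not part of the released resources.

```python
PREFIX = "eoji"  # prefix used for emoji tokens in the models


def emoji_to_token(emoji: str) -> str:
    """Map a single-code-point emoji to its assumed model token."""
    if len(emoji) != 1:
        raise ValueError("only single-code-point emojis are handled in this sketch")
    # Assumption: lowercase hex code point, no zero padding (as in eoji1f60e).
    return PREFIX + format(ord(emoji), "x")


def token_to_emoji(token: str) -> str:
    """Inverse mapping: recover the emoji character from a token."""
    if not token.startswith(PREFIX):
        raise ValueError("not an emoji token: " + token)
    return chr(int(token[len(PREFIX):], 16))


print(emoji_to_token("\U0001F60E"))
```

A lookup in any of the models would then use the returned token as the key, in the same way as an ordinary word.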

Word2Vec Text Format

One line per vector, tab-separated; the first column is the token.

Raw Vectors

Clean Vectors

Onlyemo Vectors
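
A minimal sketch of reading the text format, assuming exactly what is stated above: one vector per line, fields separated by tabs, token in the first column. Whether the files start with a header line is not stated, so lines without a tab are simply skipped; that behaviour is an assumption. The function name parse_vectors is hypothetical.

```python
from typing import Dict, Iterable, List


def parse_vectors(lines: Iterable[str]) -> Dict[str, List[float]]:
    """Parse 'token<TAB>v1<TAB>v2...' lines into a token -> vector dict."""
    vectors: Dict[str, List[float]] = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Skip blank lines and lines with no tab (e.g. a possible header).
        if len(fields) < 2 or not fields[0]:
            continue
        vectors[fields[0]] = [float(value) for value in fields[1:]]
    return vectors


# Usage with a downloaded file (the path is a placeholder):
# with open("path/to/vectors.txt", encoding="utf-8") as handle:
#     vectors = parse_vectors(handle)
```

Because the function accepts any iterable of lines, it works both on an open file and on a list of strings, which makes it easy to check on a few lines before loading a full model.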

Gensim binaries

Use the gensim library to load them.

Raw Vectors

Clean Vectors

Onlyemo Vectors

Download the EmoTwi50 dataset
Click here to download the EmoTwi50 dataset. The dataset is a TSV (tab-separated values) file with five columns: the first two columns contain the codes of the pair of emojis evaluated, the third column their gold standard similarity, the fourth column their gold standard relatedness, and the fifth column the average of the previous two values. Each row of the file holds the gold standard evaluation of one pair of emojis. Remember that, to retrieve the embedding vector of an emoji from our models, you need to prefix the emoji code with the token "eoji".
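
The column layout above can be combined with the embeddings roughly as follows. This is a sketch under assumptions, not the evaluation procedure of the paper: it assumes the codes in the first two columns match the CODE part of the model tokens, that the vectors are already loaded into a plain dict, and it uses cosine similarity as the model score. The names cosine and score_pairs are hypothetical.

```python
import math
from typing import Dict, Iterable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def score_pairs(
    rows: Iterable[str],
    vectors: Dict[str, List[float]],
    gold_column: int = 2,  # 2 = similarity, 3 = relatedness, 4 = average
) -> List[Tuple[float, float]]:
    """Return (gold score, model cosine) for every pair found in the model."""
    scored = []
    for row in rows:
        fields = row.rstrip("\n").split("\t")
        if len(fields) != 5:
            continue
        try:
            gold = float(fields[gold_column])
        except ValueError:
            continue  # e.g. a header row, if the file has one
        # Prefix each emoji code with "eoji" to obtain the model token.
        token_a, token_b = "eoji" + fields[0], "eoji" + fields[1]
        if token_a in vectors and token_b in vectors:
            scored.append((gold, cosine(vectors[token_a], vectors[token_b])))
    return scored
```

The resulting list of pairs can then be passed to a correlation measure of your choice to compare the model with the human judgements.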

All these materials are freely available under the Creative Commons CC BY 3.0 license; please use the reference below for attribution.


If you use these resources, please cite us

		
@InProceedings{emoji2016LREC,
  title     = {What does this Emoji Mean? A Vector Space Skip-Gram Model for Twitter Emojis},
  author    = {Barbieri, Francesco and Ronzano, Francesco and Saggion, Horacio},
  booktitle = {Language Resources and Evaluation Conference, LREC},
  month     = {May},
  year      = {2016},
  address   = {Portoroz, Slovenia},
}