10 Open Source Multimodal Emotion Recognition Datasets

Jun 12, 2022

In recent years, multimodal emotion recognition has become a research hotspot in human-computer interaction. Analyzing sentiment in multimodal data is both an opportunity and a challenge for the sentiment analysis field. In this article, I introduce datasets commonly used in multimodal sentiment analysis.

1. Yelp Dataset

The Yelp dataset is collected from the Yelp.com review website and covers restaurant and food reviews from five cities: Boston, Chicago, Los Angeles, New York, and San Francisco. It contains 44,305 reviews and 244,569 pictures (each review has multiple pictures), and each review averages 13 sentences and 230 words. Sentiment is labeled by assigning each review a score of 1, 2, 3, 4, or 5 according to its sentiment tendency.

2. Tumblr Dataset

The Tumblr dataset is a multimodal sentiment dataset collected from Tumblr, a micro-blogging service where user posts typically combine pictures, text, and tags. The dataset was built by searching for posts tagged with one of fifteen selected emotions, keeping only posts containing both text and a picture, and then removing posts whose text explicitly contains the corresponding emotion word, as well as posts that are not primarily in English. The final dataset contains 256,897 multimodal posts, annotated with fifteen emotions including happiness, sadness, and disgust.
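The curation steps described above (require text plus image, drop posts that spell out the emotion word, drop non-English posts) can be sketched as a simple filter. The helper below is a hypothetical illustration, not the dataset authors' code; the argument names are assumptions.

```python
def keep_tweet(text, image, emotion_tag, is_english):
    """Filter following the curation described above: keep only posts
    that have both text and an image, drop posts whose text already
    contains the emotion word itself, and drop non-English posts.
    Hypothetical helper, not the authors' actual pipeline."""
    if image is None or not text:
        return False  # must have both modalities
    if emotion_tag.lower() in text.lower():
        return False  # text leaks the label word
    return is_english

# Hypothetical posts tagged "happy":
print(keep_tweet("what a lovely day", "img.jpg", "happy", True))      # True
print(keep_tweet("feeling so happy today", "img.jpg", "happy", True)) # False
```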

3. Twitter Irony Dataset

The Twitter Irony dataset is built from the Twitter platform. It collects English-language tweets that contain images and specific hashtags (e.g. #sarcasm) as positive examples, and tweets with images but without such tags as negative examples. The dataset was further curated by removing tweets containing related conventional words such as sarcasm and irony, tweets with URLs (to avoid introducing extra information), and tweets with words that often accompany satirical content, such as jokes and humor. The dataset is split into a training set, a development set, and a test set of 19,816, 2,410, and 2,409 image-tweet pairs respectively, labeled with a binary sarcastic/not-sarcastic classification.

4. Youtube Dataset

The YouTube dataset collects 47 videos from YouTube covering a range of diverse topics, such as toothpaste, camera reviews, and baby products, rather than a single subject. Each video shows a single speaker talking in front of the camera; there are 20 female and 27 male narrators, roughly 14 to 60 years old, from different ethnic backgrounds. The videos vary in length from 2 to 5 minutes, and all video sequences are normalized to 30 seconds. Three annotators watched the videos in random order and labeled each as positive, negative, or neutral. Note that the annotation reflects the sentiment expressed by the speaker in the video, not the viewer's emotional reaction to it. Of the 47 videos, 13 were labeled positive, 22 neutral, and 12 negative.
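With three annotators labeling each video, a common way to obtain a single label is a simple majority vote. The sketch below uses hypothetical annotations; the source does not say how (or whether) the authors aggregated the three labels.

```python
from collections import Counter

def majority_vote(labels):
    """Return the most frequent label among annotators.

    Ties are broken by whichever label appears first; real
    annotation pipelines often resolve ties with an extra
    adjudicator instead."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical annotations from three annotators for two clips:
print(majority_vote(["positive", "positive", "neutral"]))  # positive
print(majority_vote(["negative", "neutral", "negative"]))  # negative
```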


5. ICT-MMMO Dataset

The ICT-MMMO dataset collects movie review videos from the social media sites YouTube and ExpoTV. It contains 370 multimodal review videos in which a person speaks directly to the camera, expressing an opinion about a movie or stating facts related to it. All narrators speak English, and the videos range from 1 to 3 minutes in length. Of the 370 videos, 308 come from YouTube and 62 (all negative reviews) from ExpoTV, for a total of 228 positive, 23 neutral, and 119 negative reviews. Note that the labels reflect the narrator's sentiment in the video, not the viewer's feelings about it.


6. MOSI Dataset

The MOSI dataset collects video blogs (vlogs) from YouTube, most of which are movie reviews. The videos range from 2 to 5 minutes in length; 93 videos were randomly collected from 89 different narrators, 41 women and 48 men, most between about 20 and 30 years old and from different ethnic backgrounds. Each video was annotated by five annotators from the Amazon crowdsourcing platform, and their scores were averaged to give one of seven sentiment categories ranging from -3 to +3. Again, the sentiment label reflects the commenter's sentiment in the video, not the viewer's feelings.
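The averaging step can be illustrated as follows: average the five annotators' continuous scores and round to the nearest integer class in [-3, +3]. This is a sketch of the idea, not the authors' exact procedure, and the example scores are made up.

```python
def mosi_class(scores):
    """Average several annotators' continuous sentiment scores
    (each in [-3, +3]) and round to the nearest integer, giving
    one of the seven categories -3 .. +3. Illustrative only."""
    mean = sum(scores) / len(scores)
    return max(-3, min(3, round(mean)))

# Hypothetical scores from five annotators for one video segment:
print(mosi_class([2.0, 1.0, 2.0, 3.0, 1.0]))       # 2
print(mosi_class([-0.4, -1.2, 0.0, -0.8, -1.0]))   # -1
```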


7. CMU-MOSEI Dataset

The CMU-MOSEI data comes from YouTube monologue videos; videos containing too many speakers were removed. The final dataset contains 3,228 videos, 23,453 sentences, 1,000 narrators, and 250 topics, with a total duration of 65 hours. The dataset carries both sentiment and emotion annotations. Sentiment is annotated per sentence on a 7-point scale, and the authors also provide 2-, 5-, and 7-class versions. Emotion is annotated along six dimensions: happiness, sadness, anger, fear, disgust, and surprise.
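The 2-, 5-, and 7-class sentiment views can be related by collapsing the finest labels into coarser ones. The grouping below is an illustrative assumption; the dataset's own binning may differ (in particular, how a score of exactly 0 is handled in the binary case).

```python
def coarser_labels(seven_class):
    """Collapse a 7-class label (-3 .. +3) into assumed 5-class
    and 2-class schemes. Not the authors' official grouping."""
    five_class = max(-2, min(2, seven_class))  # merge -3 with -2, +3 with +2
    # Zero is treated as positive here -- an arbitrary choice.
    two_class = "positive" if seven_class >= 0 else "negative"
    return five_class, two_class

print(coarser_labels(3))   # (2, 'positive')
print(coarser_labels(-1))  # (-1, 'negative')
```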


8. MELD Dataset

The MELD dataset is derived from the EmotionLines dataset, a plain-text dialogue dataset built from the classic TV series Friends; MELD extends it into a multimodal dataset containing video, text, and audio. The final dataset contains 13,709 clips, and each clip carries both one of seven emotion labels (including fear) and a three-way sentiment label of positive, negative, or neutral.
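Since each clip carries both a 7-way emotion label and a 3-way sentiment label, the two can be related by a simple lookup. The mapping below is an illustrative assumption, not the dataset's official mapping (the released data stores both labels explicitly, and surprise in particular can occur with either polarity).

```python
# Illustrative emotion -> sentiment lookup; an assumption,
# not the dataset's official mapping (both labels ship with the data).
EMOTION_TO_SENTIMENT = {
    "joy": "positive",
    "neutral": "neutral",
    "anger": "negative",
    "disgust": "negative",
    "fear": "negative",
    "sadness": "negative",
    "surprise": "positive",  # ambiguous in practice; chosen arbitrarily here
}

def sentiment_of(emotion):
    return EMOTION_TO_SENTIMENT[emotion]

print(sentiment_of("fear"))  # negative
print(sentiment_of("joy"))   # positive
```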


9. IEMOCAP Dataset

The IEMOCAP dataset is rather special: it is not collected from videos uploaded to platforms such as YouTube, nor from a well-known TV program such as Friends, but is a multimodal dataset performed and recorded by 10 actors around specific themes. The dataset contains videos of 5 professional actors and 5 professional actresses in conversational performances, comprising 4,787 improvised sessions and 5,255 scripted sessions, with an average duration of 4.5 seconds per session and a total duration of about 11 hours. The final annotation is an emotion annotation with ten categories, including fear and sadness.

10. News Rover Sentiment

The News Rover Sentiment dataset is a news-domain dataset. Its videos come from various news programs and channels in the United States recorded between August 13, 2013 and December 25, 2013. The dataset is categorized by person and occupation, and video length is limited to between 4 and 15 seconds, because the authors believe that emotions are hard to read in very short videos, while videos longer than 15 seconds may contain multiple sentences with different sentiments. The final dataset contains 929 clips, each with a three-category sentiment annotation.

About Datatang

Founded in 2011, Datatang is a professional AI data service provider committed to providing high-quality training data and data services for global AI companies. Relying on its own data resources, technical advantages, and extensive data processing experience, Datatang provides data services to more than 1,000 companies and institutions worldwide.

If you need data services, please feel free to contact us: info@datatang.com
