We gathered transcript lines and their associated characters for as many episodes as possible.
To proceed, we started from the dataset built by Andrada Olteanu in her Kaggle Sentiment Analysis of Rick and Morty, which already contains 1,905 lines from the most important episodes.
Then we completed this initial database by downloading all available episode transcripts from the Rick and Morty Wiki on Fandom. First, we rejected the transcripts that contained only the lines, without any indication of the speaking character's name. We parsed the remaining transcript text files using regular expressions. We also had to clean the dataset and remove irrelevant lines, such as scene descriptions or behavior comments. Some adjustments were made manually, since some of the transcripts contained typos or unusual formatting.
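As an illustration of the parsing and cleaning steps described above, here is a minimal sketch in Python. It is not the exact code we ran: it assumes a hypothetical `Speaker: spoken text` line format with scene descriptions in square brackets or parentheses, and the names `SPEAKER_LINE`, `STAGE_DIRECTION`, `parse_transcript` and `has_speaker_names` are made up for this example. Real transcripts vary and needed manual fixes.

```python
import re

# Assumed format: "Speaker: spoken text". The name starts with a capital letter.
SPEAKER_LINE = re.compile(
    r"^\s*(?P<name>[A-Z][\w .'\-]{0,40}?)\s*:\s*(?P<line>\S.*)$"
)
# Assumed convention: scene descriptions sit in [...] or (...).
STAGE_DIRECTION = re.compile(r"\[[^\]]*\]|\([^)]*\)")


def parse_transcript(text):
    """Return a list of (speaker, line) pairs found in a transcript."""
    rows = []
    for raw in text.splitlines():
        # Drop scene descriptions and behavior comments first.
        cleaned = STAGE_DIRECTION.sub("", raw).strip()
        match = SPEAKER_LINE.match(cleaned)
        if not match:
            continue  # blank line, pure description, or no speaker name
        # Collapse the double spaces left behind by removed descriptions.
        spoken = re.sub(r"\s+", " ", match.group("line")).strip()
        rows.append((match.group("name").strip(), spoken))
    return rows


def has_speaker_names(text):
    """A transcript with no attributable line at all is rejected."""
    return len(parse_transcript(text)) > 0


if __name__ == "__main__":
    # Invented sample text, only to show the expected input shape.
    sample = (
        "[Open on a bedroom at night]\n"
        "Rick: Wake up, Morty. [burps] We have to go.\n"
        "Morty: (rubs his eyes) What is going on?\n"
    )
    for speaker, line in parse_transcript(sample):
        print(f"{speaker} -> {line}")
```

A transcript for which `has_speaker_names` returns `False` corresponds to the case we rejected, where only the lines are present without any speaker.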
As a result, we obtained a dataset of 6,017 lines. For each transcript line, we provide the speaking character, along with the season, number, and title of the associated episode.
We cover about 50% of all episodes: 10/11 episodes for the first season, 8/10 episodes for the second season, 6/10 episodes for the third season, and 2/8 episodes for the fifth and (for now!) final season. Unfortunately, we found nothing relevant to cover the fourth season.
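The overall coverage can be checked from the per-season fractions with a few lines of Python. This is only a sketch: the fourth-season total of 10 episodes is an assumption that is not stated above, and the variable names are hypothetical.

```python
# (covered, total) episodes per season, taken from the fractions above.
# Season 4: nothing covered; its total of 10 episodes is an assumption.
coverage = {
    1: (10, 11),
    2: (8, 10),
    3: (6, 10),
    4: (0, 10),
    5: (2, 8),
}

covered = sum(c for c, _ in coverage.values())
total = sum(t for _, t in coverage.values())
ratio = covered / total

print(f"{covered}/{total} episodes covered ({ratio:.0%})")
```

Under that assumption, the result comes out at roughly half of all episodes.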