Data was scraped from metrolyrics.com. We wrote a simple PHP scraper that first gathered every artist from the artist list that metrolyrics provides. Then, following the link to each artist’s page, it added the lyrics of that artist’s top 65 songs (or fewer) to the dataset. In total, we gathered over 400,000 songs. In addition to the song lyrics and genre, we also recorded the song title, artist, year released, and artist popularity (as determined by metrolyrics when scraped on May 9th, 2018).
We then filtered our dataset. We first eliminated songs from unpopular artists, which we defined as artists whose popularity was under 10 out of 100. We also eliminated songs that did not specify a genre or were labeled as “other,” as this label seemed to be a catch-all for songs of niche genres. We believed that keeping these songs would add unnecessary noise to our dataset, or possibly cause models to label an unusually high percentage of songs as “other.”
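The two filtering rules above can be sketched in a few lines of Python. This is only an illustration, not the code we ran: the field names `artist_popularity` and `genre`, and the helper `keep_song`, are hypothetical and assume each song is stored as a dictionary.

```python
from typing import Dict, List

MIN_POPULARITY = 10  # artists under 10 out of 100 are dropped


def keep_song(song: Dict, min_popularity: int = MIN_POPULARITY) -> bool:
    """Return True if a song passes both filters described above.

    The keys "artist_popularity" and "genre" are assumed names,
    not the actual column names of the dataset.
    """
    # Normalize the genre so "Other", " other " and None are handled alike.
    genre = (song.get("genre") or "").strip().lower()
    if genre in ("", "other"):
        return False
    # A missing popularity score is treated as 0, so the song is dropped.
    return song.get("artist_popularity", 0) >= min_popularity


def filter_songs(songs: List[Dict]) -> List[Dict]:
    """Keep only the songs that pass keep_song."""
    return [song for song in songs if keep_song(song)]


# Made-up rows, for illustration only.
sample = [
    {"title": "A", "genre": "Rock", "artist_popularity": 42},
    {"title": "B", "genre": "Other", "artist_popularity": 80},
    {"title": "C", "genre": None, "artist_popularity": 55},
    {"title": "D", "genre": "Pop", "artist_popularity": 9},
]
kept_titles = [song["title"] for song in filter_songs(sample)]
```

In this sketch only song "A" survives: "B" is labeled “other,” "C" has no genre, and "D" comes from an artist below the popularity threshold.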
We still had some problems with the dataset. The first issue was removing songs that were not entirely in English. We used the Python library “langdetect,” which is ported from Google’s language-detection library, to filter out non-English songs, but this only removed songs that contained no English; many songs were half in English and half in another language, and these were not filtered. We were also concerned about the limited variety of genre labels. One of the reasons we chose metrolyrics over other popular lyrics websites was the diversity of its labels, although it is not as diverse as we would have liked.
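One possible way to catch mixed-language songs, which we did not apply to this dataset, is to run detection on each line rather than on the whole song and to keep only songs whose lines are mostly English. The sketch below is an assumption-laden illustration: the helper names and the 0.9 threshold are hypothetical, and detection on short lines is unreliable. By default it uses `detect` from langdetect, but any function that maps a string to a language code can be passed in.

```python
from typing import Callable, Optional


def english_line_fraction(
    lyrics: str, detect: Optional[Callable[[str], str]] = None
) -> float:
    """Return the fraction of lyric lines detected as English ("en").

    Lines without any alphabetic character are skipped, since a
    detector has nothing to work with on them.
    """
    if detect is None:
        # Imported lazily so that another detector can be injected instead.
        from langdetect import DetectorFactory, detect

        DetectorFactory.seed = 0  # langdetect is non-deterministic without a seed

    lines = [
        line.strip()
        for line in lyrics.splitlines()
        if any(ch.isalpha() for ch in line)
    ]
    if not lines:
        return 0.0
    english = sum(1 for line in lines if detect(line) == "en")
    return english / len(lines)


def is_mostly_english(
    lyrics: str,
    threshold: float = 0.9,  # hypothetical cutoff, not a tuned value
    detect: Optional[Callable[[str], str]] = None,
) -> bool:
    """Keep a song only if at least `threshold` of its lines are English."""
    return english_line_fraction(lyrics, detect) >= threshold
```

A song that alternates English and Spanish lines would score about 0.5 and be dropped, whereas a whole-song check might still label it as English.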