Earlier this year, data scientist and ex-physicist Iain published a study on his personal blog, Degenerate State, titled Heavy Metal and Natural Language Processing – Part 1. In this study, Iain uses a corpus of lyrics from 222,623 “metal” songs to perform a linguistic analysis. The results are surprising and lead to some interesting questions about genre and cultural influence. Plus, Iain crowns “burn” the most metal word around. Is your curiosity burning (see what I did there)? Then grab your dictionaries and join me on a wild ride as we uncover some of the most fascinating results of Iain’s study.
Note: I definitely recommend that you read Iain’s original article. He’s the one who put in all the hard work, after all. However, things do get a bit heady near the end, and though you don’t need a degree in data science or statistics to get the gist of what he’s saying, having some understanding of mathematical modeling would certainly enhance comprehension.
As mentioned in the introduction, Iain set out to test natural language processing techniques on metal lyrics and to see whether an algorithmic approach could measure both lyrical complexity and lyrical similarity. Like any good scientist, Iain began by acquiring a large, representative pool of data. He used the renowned heavy metal/hard rock lyrics database Dark Lyrics to build a corpus of 222,623 songs from 7,364 bands and 22,314 albums. Dark Lyrics is an interesting choice because the site A) is updated slowly and B) lacks lyrics for a huge number of bands. For the sake of comparison, Encyclopaedia Metallum boasts a database of 110,134 bands, though obviously not all of these (or perhaps even the majority) have lyrics listed on that site. Limits aside, Iain’s dataset is as good as any for statistical modeling, and though the results are not all-inclusive, it would be fair to call them representative.
Dataset in hand, Iain then walked through a series of modeling techniques to explore complexity and the interconnections between bands and their lyrics. The first fruit of Iain’s labor was the figure shown below, which plots the proportion of swear words among all words against readability for 100 of the most popular bands on Dark Lyrics. Readability is measured with the SMOG grade, which estimates the reading level required to comprehend a text from the square root of its count of polysyllabic words. The results, as we can see, are predictable in some respects but surprising in others.
Some key findings from the figure:
- Unsurprisingly, Five Finger Death Punch swears more than anyone.
- Unsurprisingly, Pig Destroyer, of all the bands used in the model, has the most complex lyrics.
- Napalm Death and Bolt Thrower require a higher reading level than Bruce Dickinson and Morbid Angel.
- Twisted Sister seems to use both the fewest swear words and the fewest polysyllabic words.
- Bands that swear more also tend to use more polysyllabic words.
That last fact in particular is exceptionally interesting, but there may be a simple explanation. There are quite a few syllables in the word “motherfucker.”
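For the curious, here is a rough sketch of how a SMOG grade can be computed. It uses the standard published SMOG formula, but the syllable counter is my own crude vowel-group heuristic and the function names are made up for illustration, so don't expect it to reproduce Iain's numbers.

```python
import math
import re


def count_syllables(word):
    """Crude estimate (an assumption, not a dictionary lookup):
    count runs of consecutive vowels, with a minimum of one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def smog_grade(text):
    """SMOG grade = 1.0430 * sqrt(polysyllables * 30 / sentences) + 3.1291,
    where a polysyllable is a word with three or more syllables."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        raise ValueError("text must contain at least one sentence")
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return 1.0430 * math.sqrt(polysyllables * 30 / len(sentences)) + 3.1291


# Toy comparison: short words versus polysyllabic ones.
simple = "Burn it down. Run and hide. We ride at dawn."
ornate = "Eternity devours civilization. Motherfucker."
print(smog_grade(simple), smog_grade(ornate))
```

Because the score grows with the number of polysyllabic words per sentence, a few long swear words in short lines are enough to nudge a band's grade upward, which fits the explanation above.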
After performing this simple analysis, Iain moves on to so-called “bag-of-words” models, which ignore word order and look only at how often each word appears. Here he compares the relative frequency of words in metal lyrics against their frequency in a corpus of non-metal text developed at Brown University. By putting that comparison on a logarithmic scale, Iain was able to determine the 20 most metal words. You should definitely check out the full list on the blog, but the top five most metal words in Iain’s corpus are: burn, cries, veins, eternity, and breathe. By comparison, the five least metal words are: particularly, indicated, secretary, committee, and university.
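To make the idea concrete, here is a minimal sketch of a log-ratio word score. I'm assuming the score is the logarithm of a word's relative frequency in the metal corpus divided by its relative frequency in the reference corpus; the function name, the `min_count` cutoff, and the tiny made-up corpora are mine, so check Iain's article for his exact definition.

```python
import math
from collections import Counter


def log_ratio_scores(target_tokens, reference_tokens, min_count=1):
    """Score each word shared by both corpora as
    log(relative frequency in target / relative frequency in reference).
    Words missing from either corpus are skipped, since the ratio
    would be zero or undefined."""
    target = Counter(target_tokens)
    reference = Counter(reference_tokens)
    n_target = sum(target.values())
    n_reference = sum(reference.values())
    scores = {}
    for word in target.keys() & reference.keys():
        if target[word] < min_count or reference[word] < min_count:
            continue
        ratio = (target[word] / n_target) / (reference[word] / n_reference)
        scores[word] = math.log(ratio)
    return scores


# Invented toy corpora, purely for illustration.
metal = "burn burn burn the veins the committee".split()
reference = "the committee the committee the university burn".split()

scores = log_ratio_scores(metal, reference)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Positive scores mark words that are over-represented in the metal corpus and negative scores mark words that are more at home in the reference text. A real analysis would likely raise `min_count` so that rare words don't dominate the ranking.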
On casual examination, the most metal words appear to skew toward the traditional styles of metal, namely thrash, power, NWOBHM, and classic doom. This makes sense when you consider that Iain understandably culled his corpus from a site that caters to more well-known metal bands. It’s not hard to imagine that the lists might look slightly different if more underground bands were included in the corpus.
After determining the most metal words, Iain ended his analysis by attempting to algorithmically categorize bands using the log-likelihoods of certain words appearing together and the distances between those words. These relationships, derived from a set of probability distributions and some clever computing, were then compared across the whole corpus, so that bands with similar likelihoods could be grouped together. Like a good statistician, Iain trained his model on 90% of his dataset and evaluated it on the remaining 10%. The figure below presents these findings as a tree graph in which similar bands are color-coded.
As with the first figure, some interesting relationships can be seen in the results:
- Bands within the same genre (e.g. power metal/symphonic metal in green) tend to be lyrically similar.
- Bands formed in similar time periods (e.g. the yellow group, which features a number of Florida death metal bands and other early ’90s bands) tend to group together.
- Bands with more complex lyrics (e.g. Pig Destroyer and Carcass) are harder to classify.
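The train-and-evaluate step behind that tree can be sketched in a few lines. This is a deliberately simplified stand-in: it uses a unigram (single-word) model with add-one smoothing, which is my assumption and almost certainly cruder than whatever Iain actually built, and the two "bands" and their lyrics are invented. It only shows the 90/10 split and how held-out log-likelihood can signal lyrical similarity.

```python
import math
import random
from collections import Counter


def split_songs(songs, test_fraction=0.1, seed=0):
    """Shuffle songs and hold out a fraction (10% by default) for evaluation."""
    shuffled = songs[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def train_unigram(songs, vocab):
    """Word probabilities with add-one smoothing over a fixed vocabulary."""
    counts = Counter(w for song in songs for w in song.split() if w in vocab)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def avg_log_likelihood(model, songs):
    """Mean log-probability per word of the songs under the model."""
    tokens = [w for song in songs for w in song.split() if w in model]
    if not tokens:
        raise ValueError("no in-vocabulary words to score")
    return sum(math.log(model[w]) for w in tokens) / len(tokens)


# Two invented bands with invented lyrics.
power_songs = ["dragon sword glory dragon"] * 10
death_songs = ["blood gore rot blood"] * 10
vocab = {w for song in power_songs + death_songs for w in song.split()}

power_train, power_test = split_songs(power_songs)
death_train, death_test = split_songs(death_songs)
power_model = train_unigram(power_train, vocab)
death_model = train_unigram(death_train, vocab)

# Held-out power metal lyrics should look likelier under the power metal model.
same = avg_log_likelihood(power_model, power_test)
cross = avg_log_likelihood(death_model, power_test)
print(same, cross)
```

Scoring every band's held-out lyrics under every other band's model would give a table of similarities, and a table like that is the sort of input a hierarchical clustering routine could turn into a tree.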
The finding that lyrics seem to reveal when a band was formed is quite curious. It suggests, as Iain himself hints near the end of his article, that the socio-political conditions of a given period affect both the birth of genres and the lyrical topics of those genres. Perhaps the ’90s bands can all be grouped together because their violent, gory lyrics were a statement against the Reagan-era excesses of the previous decade.
More interesting, though, is that this study, performed purely through quantitative modeling, seems to offer some basis for the conclusion that specific genres are rooted in specific lyrical topics. This may seem a foregone conclusion, but most of us, when pressed, would admit that we don’t pay much attention to lyrics and would say that musical style is the only determinant of genre. While stylistic differences are surely the key factor, Iain’s brilliant analysis also seems to show that something else is at play; perhaps there is a sort of subconscious zeitgeist, rooted in the language of a particular era, that informs the formation of new genres as much as evolutions in musical style do. If this conclusion is true, I can’t wait to see what new genres emerge as a result of this election season.
As I said before, you should definitely take the time to read all of Iain’s article. It’s both fascinating and thought-provoking, and it offers the kind of meta-analysis of the genre we love that is so often missing from our own discussions. Iain goes into much greater depth than I’ve covered here, and I look forward to Part 2.