There are many different kinds of networks and fields where this sort of analysis can take place, and today’s post will be on literature and, in particular, the social network structures within the masterpiece of Spanish literature: El ingenioso hidalgo don Quijote de la Mancha.
As an interesting anecdote about the qualities of this novel, Sigmund Freud first came to Don Quijote as a boy and loved the novel so much that he learnt Spanish so as to read it in its original language, keeping it a secret from his parents, who might have disapproved of the hobby. So if you want to fully enjoy the book and not lose anything in translation, go Freud on it.
So let’s follow Don Quijote through the Network and, in case it is not obvious enough, doing this sort of analysis on a book implies major spoilers ahead.
Gephi and Les Misérables
It turns out the amazing Gephi tool has some example social networks for us to play with but ¡Ay!, not the Spanish literature masterpiece, but another masterpiece, a French one: Les Misérables by Victor Hugo. Let’s have a look at it:
This network is the weighted co-appearance network of characters throughout the book, built from a data set developed by Donald E. Knuth (The Stanford GraphBase: A Platform for Combinatorial Computing. Addison-Wesley, Reading, MA, 1993).
What the Les Misérables social network tells us
The first thing we can see is that Jean Valjean is the central character, not only because he has the largest degree but because he also has the largest betweenness centrality by a long shot. In other words, the whole story spins mainly around him with the exception of… Gavroche!
I place an exclamation mark because what I know about this novel comes mainly from the musical (what an amazing musical, by the way), where Gavroche plays a small, though very emotional, part. In the book, however, Gavroche seems to be a heavyweight character with his own story detached from Jean Valjean.
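For readers who have not met the two measures mentioned above, here is a minimal Python sketch of degree and betweenness centrality on an unweighted, undirected graph. The toy edge list (a hub `H` with a short chain hanging off it) is made up for illustration and is not Knuth's data; Gephi computes these metrics for you, so this is only meant to show what the numbers mean.

```python
from collections import deque

def build_adjacency(edges):
    """Turn an undirected edge list into a dict of neighbour sets."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def degree(adj):
    """Degree = number of distinct neighbours of each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}

def betweenness(adj):
    """Unnormalised betweenness (Brandes' algorithm, unweighted graph):
    how many shortest paths between other pairs pass through each node."""
    score = {v: 0.0 for v in adj}
    for source in adj:
        # Breadth-first search counting shortest paths from `source`.
        order = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}   # number of shortest paths to v
        dist = {v: -1 for v in adj}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating dependencies.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    # Every undirected pair was visited from both ends, so halve the totals.
    return {v: s / 2 for v, s in score.items()}

# Hypothetical toy graph: hub H, leaves A and B, and a chain H-C-D-E.
toy_edges = [("H", "A"), ("H", "B"), ("H", "C"), ("C", "D"), ("D", "E")]
toy_adj = build_adjacency(toy_edges)
toy_degree = degree(toy_adj)
toy_betweenness = betweenness(toy_adj)
```

In this toy graph `H` has both the highest degree and the highest betweenness, while `C` has a modest degree but still sits on many shortest paths, which is the kind of pattern that makes a character like Gavroche stand out.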
The modularity colors clearly identify the main story plots in the book, that is:
- Indigo: The bishop Myriel saving Jean Valjean’s soul with his generosity.
- Sea Green: The mean Javert and the Thénardiers messing with Jean Valjean’s life.
- Purple: The classic love triangle of Cosette, Marius & Jean Valjean.
- Lime: Gavroche and his French revolution team.
- Royal Blue: Fantine’s struggle in Jean Valjean’s factory and later as a prostitute.
- Yellow Green: Valjean’s moral test, saving an innocent man by revealing his true identity and becoming a fugitive.
It is quite remarkable that a quick network analysis can dissect the structure of a novel so well and, quite frankly, I don’t see how modern literature degrees can go on without specific courses in social network analysis. Science rules even in the arts!
Building Don Quijote de la Network
So let’s try to use the same principles used above to build a similar network for Don Quijote de la Mancha. For now we’ll look only at the first book of Don Quijote. Why Miguel de Cervantes decided to write a second book is quite an interesting story in itself; he was kind of forced into it, which is why the first book is a work of art in itself and it makes sense to analyze it on its own.
The other reason why I am not analyzing the second part is time; identifying characters is something that I need to do manually and I want to take my time. And yes, I know that there are NLP algorithms to recognize entities but they’re not good enough. Anyway, here we go:
Finding Don Quijote’s Characters
Since each character will be a node in our network, we need to identify them first. For this task we have at our disposal nice Named Entity Recognizer (NER) tools freely available, like the one shared by the Stanford Natural Language Processing Group.
However, even if the NER does a perfect job recognizing entities, characters can use different names throughout the book. That is the case for Jean Valjean when he hides his true identity, or for Don Quijote when he renames himself after he is “knighted”. This means that the only way to match the same character with different entities is having an understanding of the story and, so far, only humans can do that.
That is why for Don Quijote’s network I have manually selected the most important characters and their name variants, plus only a handful of secondary characters.
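To illustrate the idea of folding name variants into a single character, here is a small Python sketch. The alias table is a hypothetical example I made up for this post, not the actual list used to build the network, and the function names are illustrative too.

```python
import re

# Hypothetical alias table: canonical character -> name variants in the text.
ALIASES = {
    "Don Quijote": ["don Quijote", "Caballero de la Triste Figura"],
    "Sancho Panza": ["Sancho Panza", "Sancho"],
}

def build_lookup(aliases):
    """Map every lower-cased variant back to its canonical character."""
    return {
        variant.lower(): canonical
        for canonical, variants in aliases.items()
        for variant in variants
    }

def appearances(text, aliases):
    """Return canonical character names in the order they appear in `text`."""
    lookup = build_lookup(aliases)
    # Try the longest variants first so "Sancho Panza" wins over "Sancho".
    variants = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(v) for v in variants) + r")\b",
        re.IGNORECASE,
    )
    return [lookup[m.group(0).lower()] for m in pattern.finditer(text)]

sample = ("Sancho siguió a don Quijote; el Caballero de la Triste Figura "
          "miró a Sancho Panza.")
sample_appearances = appearances(sample, ALIASES)
```

The hard part, deciding which variants belong to which character, still has to be done by someone who knows the story; the code only applies that decision consistently.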
Once we have our nodes we need to connect them. In this case it makes sense to calculate a co-appearance weight, counting each pair of consecutive appearances as one step from the first character to the second. So if characters A and B appear in the order A,B,A,B,A,A,A,B then we will have two nodes A and B where:
- A connects with itself with weight 2
- A connects with B with weight 3
- B connects with A with weight 2
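The weights in the example above can be reproduced with a few lines of Python. This is only a sketch of the counting rule as the example suggests it (each consecutive pair of appearances adds one to a directed edge); the function name is mine and the real scripts may differ.

```python
from collections import Counter

def coappearance_weights(sequence):
    """Count directed edges between consecutive appearances."""
    return Counter(zip(sequence, sequence[1:]))

example_weights = coappearance_weights(list("ABABAAAB"))
# A -> A: 2, A -> B: 3, B -> A: 2
```

Feeding it the ordered list of canonical character names for a whole book gives a weighted, directed edge list.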
This is Don Quijote’s network when considering co-appearance among its characters. The nodes’ size is proportional to their degree (10 to 20) and the edges’ thickness to the co-appearance weight. The node colors are the result of running a modularity algorithm looking for the eight most relevant clusters.
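Assuming the “10 to 20” above refers to the minimum and maximum node sizes, the mapping from degree to size is just a linear rescale. Gephi does this through its ranking settings; the Python sketch below, with made-up degrees and a hypothetical function name, only shows the arithmetic.

```python
def scale_sizes(degrees, min_size=10.0, max_size=20.0):
    """Linearly map degrees onto the range [min_size, max_size]."""
    lowest, highest = min(degrees.values()), max(degrees.values())
    if highest == lowest:
        # All nodes have the same degree: give them all the middle size.
        return {node: (min_size + max_size) / 2 for node in degrees}
    span = highest - lowest
    return {
        node: min_size + (d - lowest) / span * (max_size - min_size)
        for node, d in degrees.items()
    }

# Made-up degrees for three placeholder nodes.
toy_sizes = scale_sizes({"a": 1, "b": 3, "c": 5})
```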
Since Don Quijote de La Mancha is, among many other things, a travel and adventure book, there are many characters (real and imaginary) that don’t last more than a few lines, and I did not include any of those. If I had, they would mainly be small satellites around Don Quijote and Sancho Panza.
What the Don Quijote social network tells us
It is no surprise that Don Quijote & Sancho Panza are the main characters. In fact, Sancho plays the role of Quijote’s connection to reality, and they are so inseparable that I placed them one on top of the other in the network, since everything makes more sense when considering them as one entity.
Then we have El Cura (The Priest) and El Ventero (The Innkeeper). Actually, there is more than one of each throughout the book and, though it might be a good idea to separate them, it also makes a lot of sense to keep them as the same entity so that we can appropriately weigh how relevant the role is. Indeed, it is this gang of four (Don Quijote, Sancho, El Cura & El Ventero) that contains the usual suspects in Don Quijote’s adventures.
The tremendous weight of the edge joining Don Quijote and Sancho Panza shows that this book relies less on a complex plot, as Les Misérables does, than on a complex relationship between the two main characters and the clash of their two opposing worlds and realities.
El Cura, though, is fundamental to the story since he is the one taking us away from the exhausting battle between the two opposing forces represented by Don Quijote & Sancho Panza, giving readers a rest with other stories.
The modularity colors, once again, can help us identify story plots in the book, the most dramatic example being the novel within the novel ‘El Curioso Impertinente’, which tells of the love triangle among Camila, Anselmo and Lotario.
Gephi also allows for dynamic time series, so I added the chapters in which each character appears as a dynamic feature in the network, and this is the result for an animation with a three-chapter window:
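To give an idea of what a three-chapter window means, here is a rough Python sketch that lists which characters are visible in each frame. Gephi handles this internally through its timeline; the function name and the placeholder characters and chapter numbers below are invented for illustration.

```python
def window_frames(chapters_by_character, window=3):
    """Slide a window of `window` chapters and list who is visible in each."""
    all_chapters = [c for chs in chapters_by_character.values() for c in chs]
    first, last = min(all_chapters), max(all_chapters)
    # Always produce at least one frame, even if the book is shorter
    # than the window.
    last_start = max(first, last - window + 1)
    frames = []
    for start in range(first, last_start + 1):
        end = start + window - 1
        visible = sorted(
            name
            for name, chs in chapters_by_character.items()
            if any(start <= c <= end for c in chs)
        )
        frames.append((start, end, visible))
    return frames

# Placeholder data: character -> chapters in which it appears.
toy_chapters = {"A": [1, 2, 3, 4, 5], "B": [1], "C": [4, 5]}
toy_frames = window_frames(toy_chapters)
```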
Next… the ongoing process
So I am going to end this post, which is perhaps way too long for a blog already, sharing the Gephi network file that I used for the analysis and welcoming anyone who wants to improve the network:
As for the code that I used to build the network, allow me to bundle the scripts into a nice R package for network literature analysis, FOSS it and make another post about it.
Anyway, I would really love to see this network finished for the two books so, in the meantime, if you like literature in general and Don Quijote in particular, and you feel like helping to brush up this network, please reach out. There are many things that could be improved since I don’t know Don Quijote as well as I would like:
- Identifying more minor characters for the first book.
- Identifying all characters for the second book.
- Checking on the accuracy of the network.
- Interpreting the network results.
- Translating this post into Spanish (note to myself)
- Redoing this network analysis using an undirected network, since most nodes connect in a symmetric fashion.
- you tell me.
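For the undirected redo suggested in the list above, one simple option (an assumption on my side, not something the current network file does) is to add up the two directions of every edge. A minimal Python sketch, reusing the A and B weights from the earlier example:

```python
from collections import Counter

def to_undirected(directed_weights):
    """Merge A -> B and B -> A into a single undirected edge weight."""
    undirected = Counter()
    for (source, target), weight in directed_weights.items():
        # Sorting the pair makes (A, B) and (B, A) share one key.
        undirected[tuple(sorted((source, target)))] += weight
    return undirected

directed_example = {("A", "A"): 2, ("A", "B"): 3, ("B", "A"): 2}
undirected_example = to_undirected(directed_example)
# A - A: 2, A - B: 5
```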