Social Networking Analysis Final Project:
The Marvel Universe Social Network
In 2017, the Marvel Cinematic Universe (MCU) topped a 12-billion-dollar milestone in box-office revenue [6], and the recently released trailer for Avengers: Endgame garnered 48.7 million views on YouTube [7] within a day of release. The franchise is a passion for many comic-book fans and a weekend's entertainment for the general public. Our final project is motivated by our inner geek: to explore the universe of Marvel heroes and quantify our pre-existing notions about it.
Our research question grew out of these interests: we wanted to analyze the role and influence of each hero in the network, and how heroes drive strategic decision-making for the studio. In the upcoming sections, we quantify the power of the network and differentiate the influence of individual heroes.
The Marvel universe has a little over 6,400 heroes (nodes) in the comic world and close to 150 characters in the cinematic universe. This creates a dense network of heroes, with added complexity in understanding the influence of each one. We began our analysis with the assumption that two heroes are connected if they appear in the same comic book. Our first step was to examine the properties and characteristics of the Marvel network by calculating several network dimensions. We then created network visualizations of the ties among famous Marvel heroes using Gephi, a network visualization and exploration tool. Our analysis also includes the collection of Twitter data and topic modeling.
Collecting, cleaning, filtering, and organizing the data was critical for the network analysis. The data were obtained from a source file on Kaggle that records the network among all heroes in the Marvel universe (574,467 ties) and the comics each has appeared in. To look into the MCU specifically, we also mined Twitter data and compared it with the pre-obtained data.
To understand the Marvel universe, we constructed one undirected network to explore the social network of the heroes: the Marvel comic network, which includes 6,426 unique comic heroes and 574,467 edges. Our working assumption was that the more often a hero appears in comics, the more popular that hero is. By adopting this methodology on the Marvel comic network, we were able to obtain results for analysis.
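The co-appearance assumption can be sketched directly from a hero–comic edge list. The pairs below are toy stand-ins for the Kaggle file, whose exact schema is not reproduced here:

```python
# Sketch: build the hero co-appearance network from (hero, comic) pairs.
# The pair list is an invented stand-in for the Kaggle edge file.
from itertools import combinations
from collections import defaultdict

hero_comic = [
    ("Iron Man", "AVF 4"), ("Captain America", "AVF 4"),
    ("Spider-Man", "ASM 1"), ("Iron Man", "ASM 1"),
]

# Group heroes by comic, then connect every pair that shares an issue;
# the edge weight counts how many issues the pair co-appeared in.
by_comic = defaultdict(set)
for hero, comic in hero_comic:
    by_comic[comic].add(hero)

weights = defaultdict(int)
for heroes in by_comic.values():
    for a, b in combinations(sorted(heroes), 2):
        weights[(a, b)] += 1
```

Each weighted edge then feeds the undirected network described above.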
The Marvel comic network is a unique network: it is constructed from interactions of fictional characters, yet it is also complex and enormous. To convey these complexities, we provide some network dimensions and descriptive statistics. Nodes depict superheroes and edges depict co-occurrences of superheroes within a single Marvel Comics issue. The comic network has a diameter (the largest distance between any two nodes in the network) of 7 steps. The average degree (how connected a node is) is 34.027. The clustering coefficient is 0.53, showing that two heroes who have collaborated with a common hero are much more likely to collaborate than a randomly chosen pair. The average path length (the characteristic path length computed over geodesics, i.e., shortest paths) is 2.889; thus, any pair of heroes can be connected through an average of about 3 collaborations. The network has a relatively high modularity of 0.49, showing dense connections within communities of heroes. The graph density (the fraction of edges present out of all possible edges) is 0.003, indicating that the comic network is quite sparse and that a few heroes account for most of the edge connections; most comic heroes are peripheral nodes with only a few appearances. Table 1 shows the network dimensions.
Measure                   Hero social network
Network diameter          7
Average degree            34.027
Clustering coefficient    0.53
Average path length       2.889
Modularity                0.49
Graph density             0.003

Table 1 – Marvel hero network descriptive statistics
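The dimensions in Table 1 were computed in Gephi; as an illustrative sketch, the same measures can be reproduced with networkx on a toy graph (the real 6,426-hero edge list would be loaded instead):

```python
import networkx as nx

# Toy graph standing in for the 6,426-hero comic network; the real edge
# list would be loaded instead of these four hand-picked ties.
G = nx.Graph([("IronMan", "Cap"), ("Cap", "Thor"),
              ("Thor", "IronMan"), ("Thor", "Hulk")])

diameter = nx.diameter(G)                       # largest geodesic distance
avg_degree = sum(d for _, d in G.degree()) / G.number_of_nodes()
clustering = nx.average_clustering(G)           # local triangle density
avg_path = nx.average_shortest_path_length(G)   # mean geodesic length
density = nx.density(G)                         # edges present / possible
```

On the full network these calls return the Table 1 values (7, 34.027, 0.53, 2.889, and 0.003 respectively).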
Moving forward, we reviewed the individual influencers of the Marvel network. Our review involved a detailed analysis of specific nodes to garner some powerful insights. In terms of degree centrality (how well a node is connected through direct connections), Captain America, Spider-Man, and Iron Man are the top three heroes, consistent with their being among the oldest and most successful characters. Betweenness centrality (how well situated a node is in terms of the paths that it lies on) can also be defined as the number of shortest paths between pairs of nodes that pass through a given node. In terms of betweenness, Wolverine plays the most important role in conveying collaborations (information) to heroes in the network who are not connected with each other; he controls the collaborations between different clusters since he is the main connection between the X-Men and the Avengers. Closeness centrality (how close a given node is to every other node) represents the ability to spread information to the whole network within the shortest time frame. Mister Fantastic is the character able to reach all other heroes the quickest, since he is the leader of the Fantastic Four and the center of the comic universe. Finally, the PageRank algorithm (first used by Google to score a web page's relative authority and importance) shows, unsurprisingly, that Spider-Man has the highest PageRank centrality, since he is directly connected to other influential heroes such as Iron Man and Captain America. We also notice that the heroes connected to Spider-Man have relatively higher PageRank scores.
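All four centrality measures discussed here are available in networkx. The graph below is invented for illustration, with Wolverine deliberately placed as the bridge between two clusters to mirror the betweenness discussion:

```python
import networkx as nx

# Illustrative graph: Wolverine bridges an X-Men-like cluster and an
# Avengers-like cluster; names and ties are invented for the sketch.
G = nx.Graph([
    ("Cyclops", "Storm"), ("Jean", "Cyclops"), ("Jean", "Storm"),
    ("Storm", "Wolverine"), ("Cyclops", "Wolverine"),
    ("Wolverine", "IronMan"),
    ("IronMan", "Cap"), ("Cap", "SpiderMan"), ("IronMan", "SpiderMan"),
])

degree = nx.degree_centrality(G)        # share of direct connections
between = nx.betweenness_centrality(G)  # shortest paths passing through
close = nx.closeness_centrality(G)      # inverse mean distance to all nodes
pagerank = nx.pagerank(G)               # link-weighted importance score

# The bridge between the two clusters has the highest betweenness.
top_broker = max(between, key=between.get)
```

Running the same calls on the full comic network yields the rankings reported above (Wolverine for betweenness, Mister Fantastic for closeness, Spider-Man for PageRank).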
To gain a further understanding of the comic network, we created network visualizations with the help of external software, Gephi. With the assumption that edge strength is based on whether two nodes appear in the same comic, we created the following figure.
Figure 1 – Marvel comic universe network
Figure 1 represents the Marvel comic universe network with ties of different strengths. White ties represent weak connections between comic-only heroes, blue ties represent medium connections between comic-only heroes and movie heroes, and red ties represent strong connections between movie heroes.
Twitter Data Collection
Data collection was done with the Tweepy library, which wraps the Twitter API behind the scenes and allowed us to scrape tweets by keyword and by the earliest posting time of interest. This required applying for a Twitter developer account to obtain an API key and access token, which are mandatory for legitimate scraping. Because of limits on the number of requests allowed within a time window, we used five runs with a sleep time of 15 minutes between them, so as not to abuse the scraping process or get blocked. In the end we collected 10,000 tweets for the keyword #marvel.
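A sketch of this collection loop, assuming Tweepy v4 and credentials in environment variables (the variable names are our assumption, not the project's actual configuration):

```python
import os
import time

def dedupe_by_id(tweets):
    """Keep one record per tweet id; overlapping runs can return duplicates."""
    seen, unique = set(), []
    for t in tweets:
        if t["id"] not in seen:
            seen.add(t["id"])
            unique.append(t)
    return unique

def collect_tweets(keyword="#marvel", runs=5, per_run=2000, pause=15 * 60):
    """Five batched runs with a 15-minute sleep between them, as described
    above. Credential variable names below are assumptions."""
    import tweepy  # deferred import so dedupe_by_id works without Tweepy
    auth = tweepy.OAuthHandler(os.environ["TW_API_KEY"],
                               os.environ["TW_API_SECRET"])
    auth.set_access_token(os.environ["TW_ACCESS_TOKEN"],
                          os.environ["TW_ACCESS_SECRET"])
    api = tweepy.API(auth, wait_on_rate_limit=True)
    collected = []
    for _ in range(runs):
        # search_tweets is the Tweepy v4 name; older versions used api.search
        for status in tweepy.Cursor(api.search_tweets, q=keyword,
                                    tweet_mode="extended").items(per_run):
            collected.append({"id": status.id, "text": status.full_text})
        time.sleep(pause)  # respect the rate-limit window between runs
    return dedupe_by_id(collected)
```

Deduplicating by tweet id guards against overlap between consecutive runs.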
Twitter Data preprocessing
The data preprocessing step prepared the collected tweets to be fed to the models, whether LDA for topic modeling or sentiment analysis. The process transforms the raw data into clean data through the following steps:
– lowercase text
– remove whitespace
– remove numbers
– remove special characters
– remove emails
– remove stop words
– remove additional stopwords
– remove weblinks and mentions
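The steps above can be sketched as a single cleaning function; the stop-word sets shown are small excerpts assumed for illustration, not the full lists used in the project:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in", "for"}  # excerpt
EXTRA_STOPWORDS = {"marvel", "movie"}  # example domain-specific additions

def clean_tweet(text):
    """Apply the cleaning steps listed above to one raw tweet."""
    text = text.lower()                                 # lowercase text
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove weblinks
    text = re.sub(r"\S+@\S+\.\S+", " ", text)           # remove emails
    text = re.sub(r"@\w+", " ", text)                   # remove mentions
    text = re.sub(r"\d+", " ", text)                    # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)               # special characters
    tokens = [w for w in text.split()                   # also trims whitespace
              if w not in STOPWORDS and w not in EXTRA_STOPWORDS]
    return " ".join(tokens)
```

Splitting on whitespace at the end collapses the extra spaces left by each substitution.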
After removing all the noise surrounding our tweets, we moved to the most important step: topic modeling.
Topic modeling aims to cluster our dataset of tweets into homogeneous topics (clusters) that share underlying semantics, using Latent Dirichlet Allocation (LDA). The algorithm takes as parameters the text corpus collected from the tweets and the number of topics to cluster the data into. There is still no hard science for determining how many topics a corpus should be split into, so the idea is to hypothesize a set of candidate values of k (for example, from 2 to 17) and measure the coherence at each one. This yields a mapping from each candidate k to its "UMass" coherence score, and we pick the k with the highest coherence: that is the best number of clusters the corpus can be split into. In our case, the best k was around 3 or 4, depending on the execution run.
Twitter Data Analysis
1. Increasing dataset effect:
Increasing the number of tweets would make our corpus larger than the current one, so the number of topics covered in the corpus would increase and diversify. From a sentiment-analysis perspective, this would also make the sentiment distribution over the text more diverse and less balanced, since the data collection follows no rule-based approach.
2. Sentiment Analysis
Figure 2 – Sentiment Scores
As figure 2 shows, sentiment in Marvel-related tweets is strongly positive, with over 8,000 positive tweets, while the negative and neutral classes each contain around 300.
Figure 3 – Topic Sentiment
As we can see, positive tweets strongly dominate the topic_1, topic_2, and topic_3 clusters. This speaks to the quality of the LDA clustering, since it grouped positive tweets together without knowing their sentiment beforehand.
3. Top Hashtags
Figure 4 – Topic Hashtags
The plots above show considerable similarity among the most common hashtags, especially between the top-3 unigrams of topic_2 and topic_3, while topic_1 remains somewhat farther from those two contexts, as shown in the pyLDAvis figure (refer to the notebook). We can therefore conclude that there is a correlation between the most used hashtags and the likelihood that a given document belongs to a topic.
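The per-topic hashtag counts behind these plots can be reproduced with a small helper; the example tweets below are invented for illustration:

```python
import re
from collections import Counter

def top_hashtags(tweets, n=3):
    """Most common hashtags over one topic's tweets (case-folded)."""
    tags = [tag.lower() for tweet in tweets
            for tag in re.findall(r"#(\w+)", tweet)]
    return Counter(tags).most_common(n)

# Invented tweets standing in for one topic's documents.
topic_2 = ["Loved it #Marvel #Avengers", "#avengers forever #Endgame"]
```

Applying this to each topic's documents gives the unigram rankings compared in the plots.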
5. Stop-word removal effect:
Stop-word removal is an essential preprocessing step for NLP text analysis. The main goal is to remove noisy words that will not help the machine-learning models in their predictions, keeping only pertinent and meaningful words. In addition, stop-word counts are sometimes very large, which could bias the model toward a specific class or topic.
If stop-words were among the most used unigrams on Twitter, removing them would leave a very refined corpus, and the extracted topics would be more meaningful since they would not rely on the most common words. On the other hand, we would risk losing the context of a topic, so removal is a good approach only up to a certain threshold.
5. Punctuation removal effect:
Another important NLP preprocessing step is removing punctuation marks. These marks, used to divide text into sentences, paragraphs, and phrases, affect the results of any text-processing approach, especially approaches that depend on the occurrence frequencies of words and phrases, since punctuation is used frequently in text. Keeping them could bias results, because this noisy data would remain in the corpus without contributing much to either the clustering process for topic modeling or the predictions of sentiment analysis.
6. Word Stemming effect:
Both stemming and lemmatization improve results by removing semantic duplicates, returning more words related to each topic so the user can understand it better. However, stemming adds noise to the results because it produces stems that are not real words. That is why we used only lemmatization in this topic-modeling process, giving a better interpretation of the extracted topics.
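A minimal illustration of why stems add noise. The crude stemmer below is far simpler than Porter, and the lemma table is a toy stand-in for a WordNet-based lemmatizer:

```python
# Toy lemma table standing in for a WordNet lemmatizer (illustration only).
LEMMAS = {"studies": "study", "heroes": "hero", "strongest": "strong"}

def crude_stem(word):
    """Tiny suffix-stripping stemmer, far simpler than Porter."""
    for suffix in ("ies", "ing", "es", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            stem = word[: -len(suffix)]
            return stem + "i" if suffix == "ies" else stem
    return word

print(crude_stem("studies"))             # "studi" -- not an English word
print(LEMMAS.get("studies", "studies"))  # "study" -- a real, readable word
```

The lemma is directly interpretable in a topic's word list, while the stem is not.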
7. Future implementation
As future improvements to this process, we could augment the dataset and try other topic-modeling algorithms such as LSI (Latent Semantic Indexing), which is powerful at preserving the semantics shared among clustered topics. For sentiment analysis, we could use transformer models, which perform very well on these tasks since they are pre-trained on very large corpora.
Our topic modeling results provide valuable information on heroes and their interactions. Notably, some heroes are very close in the comics but not in the movies; Marvel should focus on such interactions and do something remarkable with them to gain audience attention. As phase four of the MCU rolls out beginning in 2019, there is scope to give the audience some of these interactions. For example, Falcon and Captain America are considered a strong duo in the comics, but in the MCU their relationship has not been strong.
Based on our Marvel network analysis, we recommend that stakeholders and producers focus on interactions across the different MCU movie series and bring more new, central characters from the comic universe into the movie world. Implementing different strategies for central versus peripheral heroes in the MCU is vital. Building on the network analysis, a breakdown of villains versus heroes would also help reveal potentially interesting effects.
Team 9 December 10, 2018