April 5, 2016
Insight Fellow 2016
Ruth Toner was a Fellow in our most recent Data Science session in Silicon Valley. She’s since joined the Community team at Twitch as a Data Scientist. In this post she describes FanGuard, the tool she built at Insight to protect Tumblr readers from spoilers for blockbuster movies and popular TV shows.
Please note: this post will contain absolutely NO spoilers.
Before attending Insight Data Science, I spent eight years of my life in the field of particle physics. Like many postdocs and grad students, when I wasn’t trying to discover the basic laws of matter (i.e., debugging my code), I spent a lot of time surfing the Internet. I discovered a few things during these (brief!) voyages. First, people on the Internet are angry all the time. Second, if you really want to make people on the Internet angry, show them spoilers for that one movie they haven’t seen yet.
For my Insight project, I decided to make a product to protect people from spoilers for a given movie, game, or TV show. Basically, I wanted to make a spam filter, but for spoilers instead.
The Project, and What a Spoiler Looks Like on Tumblr
To simplify this problem, I decided to narrow my focus to spoilers on a single website: Tumblr. I chose Tumblr for three reasons:
- I was already familiar with Tumblr and how its community works.
- The API is easy to use and has a generous rate limit, allowing me to gather a large historical dataset.
- Tumblr is a labeled dataset. In other words, I knew which posts were spoiler posts, because people had already labeled the data for me.
To explain what I mean by the third point, here is what a Star Wars: The Force Awakens post might look like on Tumblr, with the spoiler text redacted:
Despite posting spoilers, this Tumblr user has followed an important informal community rule: in the hashtags below the post, they’ve added "#star wars spoilers" and "#tfa spoilers". This allows other users to avoid spoilers, usually via a Chrome Extension like Tumblr Savior that can block pages containing user-specified phrases (in this case, obviously, "star wars spoilers" and variants).
So, spoilers avoided, right?
Problems occur when people break these rules. Here’s another post, for instance. Note the proliferation of tags, not one of which is the word "spoiler". Unless you block every single Star Wars-related word, Tumblr Savior won’t help with this particular post.
My goal was to find a set of features that best describe these spoiler-labeled posts, and then to train a model to pick up the "spoileriness" of unlabeled posts like the one above.
What’s In the Data?
Before I could make a model, I needed to collect data. I’ll keep focusing on Star Wars for the rest of this post, but I actually ended up collecting data and training models for eight different TV shows, games, and movies.
As mentioned earlier, Tumblr has a very easy-to-use API. Accessing this API is made even easier using the pytumblr API client. I searched for several months’ worth of Tumblr blog posts using two separate tags: #star-wars-spoilers and just #star-wars. The first search formed my "spoiler" dataset, of things that were definitely spoilers. The second search, after I cleaned out any posts that also appeared in the spoiler dataset or contained the word "spoiler", was my "non-spoiler" dataset. This labeling is likely imperfect, and this "non-spoiler" dataset almost certainly contained at least some unlabeled spoilers, but I now had both a sample of posts that were definitely spoilers and another that were mostly not. The final Star Wars dataset contained about 25,000 posts, split equally between spoiler and non-spoiler posts.
In its raw form, this data was a mess, so I cleaned it by removing HTML formatting, excising common English "stop words," and then lemmatizing the remaining words. From these words, I could now make a set of features. Using the scikit-learn Python library, I made my first set of features by calculating the frequency of each of the 500 most common words in the body of the posts (also weighted by the words’ inverse frequencies in the dataset, to reduce the importance of words like "Jedi" that appear in nearly every post in both classes).
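The labeling rule described above can be sketched in a few lines of plain Python. This is only an illustration, not the code behind FanGuard: the function name `build_datasets` and the post layout (dicts with `id`, `body`, and `tags` keys) are assumptions made for the example, and the posts are assumed to have been fetched already.

```python
def build_datasets(spoiler_tag_posts, general_tag_posts):
    """Split posts into a spoiler set and a (mostly) non-spoiler set.

    Each post is assumed to be a dict with hypothetical keys
    "id", "body", and "tags" (a list of strings).
    """
    # Everything returned by the spoiler-tag search counts as a spoiler.
    spoilers = {post["id"]: post for post in spoiler_tag_posts}

    def mentions_spoiler(post):
        text = " ".join([post.get("body", "")] + list(post.get("tags", [])))
        return "spoiler" in text.lower()

    # Keep general-tag posts only if they are not already in the spoiler
    # set and do not contain the word "spoiler" anywhere.
    non_spoilers = [
        post for post in general_tag_posts
        if post["id"] not in spoilers and not mentions_spoiler(post)
    ]
    return list(spoilers.values()), non_spoilers


# Invented example posts.
spoiler_search = [
    {"id": 1, "body": "The ending explained", "tags": ["star wars spoilers"]},
]
general_search = [
    {"id": 1, "body": "The ending explained", "tags": ["star wars spoilers"]},
    {"id": 2, "body": "SPOILER: do not read", "tags": ["star wars"]},
    {"id": 3, "body": "New poster looks great", "tags": ["star wars"]},
]
spoilers, non_spoilers = build_datasets(spoiler_search, general_search)
print([p["id"] for p in spoilers], [p["id"] for p in non_spoilers])
```

Post 2 is dropped from both sets because it mentions the word "spoiler" without carrying the spoiler tag, which mirrors the cleaning step described above.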
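To make the weighting concrete, here is a rough standard-library sketch of the cleaning and inverse-frequency weighting steps. It is an assumption-laden illustration rather than the original pipeline: the stop-word list is a tiny stand-in, lemmatization is skipped, the helper names `clean` and `tfidf_features` are invented, and the weighting formula (term frequency times the log of inverse document frequency) is one common variant. In practice scikit-learn's `TfidfVectorizer` with `max_features=500` would handle most of this.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "is", "in", "of", "to", "was"}


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment, dropping the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean(html):
    """Strip HTML, lowercase, tokenize, and remove stop words."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    return [w for w in re.findall(r"[a-z']+", text) if w not in STOP_WORDS]


def tfidf_features(docs, vocab_size=500):
    """Weight the most common words by term frequency times log(N / df)."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    total = Counter(word for doc in docs for word in doc)
    vocab = [w for w, _ in total.most_common(vocab_size)]
    rows = []
    for doc in docs:
        counts = Counter(doc)
        length = max(len(doc), 1)
        rows.append([
            (counts[w] / length) * math.log(n_docs / doc_freq[w])
            for w in vocab
        ])
    return vocab, rows


posts = [
    "<p>The Jedi returns</p>",
    "<b>Jedi</b> betrayal in the finale",
    "A Jedi poster",
]
docs = [clean(p) for p in posts]
vocab, rows = tfidf_features(docs)
```

Because "jedi" appears in every toy post, its inverse-frequency factor is log(1) = 0 and its weight vanishes, while a rarer word such as "betrayal" keeps a positive weight.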
I constructed a similar feature vector for the top 200 words used in the tags. Many Tumblr posts are image-only, with little or no body text, so the only usable words are tags. In this case, my features were binary – a simple "yes" or "no" for whether the word appears in the tags. Note that in both cases, the word "spoiler" was not used as a feature at all, since I was specifically trying to catch unlabeled spoilers. Finally, I also used the total number of words in each post as an extra feature.
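A minimal sketch of the tag features might look like the following. The helper names are hypothetical, splitting tags on whitespace is an assumption, and excluding the plural "spoilers" alongside "spoiler" stands in for the lemmatization step.

```python
from collections import Counter


def top_tag_words(tagged_posts, size=200, excluded=("spoiler", "spoilers")):
    """Return the most common tag words, counting each word once per post."""
    counts = Counter(
        word
        for tags in tagged_posts
        for word in {w for tag in tags for w in tag.lower().split()}
        if word not in excluded
    )
    return [w for w, _ in counts.most_common(size)]


def post_features(tags, word_count, vocab):
    """Binary yes/no per vocabulary word, plus the post's word count."""
    present = {w for tag in tags for w in tag.lower().split()}
    binary = [1 if w in present else 0 for w in vocab]
    return binary + [word_count]


# Invented example tags.
all_tags = [
    ["star wars spoilers", "kylo ren"],
    ["star wars", "rey"],
    ["kylo ren", "fan art"],
]
tag_vocab = top_tag_words(all_tags)
features = post_features(["kylo ren"], 120, ["kylo", "rey"])
print(features)  # [1, 0, 120]
```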
Making a Spoiler Filter and Testing It
Now I had a dataset of features, separated into two classes – "spoiler" and "non-spoiler." It was time to train a model to separate them. I toyed with several different machine learning classifier models. Random Forest classifiers do a good job with sparse, non-linearly-separable datasets with lots of correlations, so that was a strong first candidate. Naïve Bayes and Support Vector Machine models are among the other classifiers most commonly used for detecting spam. Ultimately, however, the Random Forest model gave me the best speed and separation performance.
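For readers who want to see the shape of this step, here is a toy sketch using scikit-learn's `RandomForestClassifier`. The feature rows are invented, and the hyperparameters (`n_estimators=100`, `random_state=0`) are placeholders, since the post does not say which settings were used.

```python
from sklearn.ensemble import RandomForestClassifier

# Invented toy rows: [tag has "kylo", tag has "poster", word count].
X_train = [
    [1, 0, 420], [1, 0, 380], [1, 1, 510], [0, 0, 450],
    [0, 1, 40], [0, 1, 65], [0, 0, 30], [1, 1, 55],
]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spoiler, 0 = non-spoiler

# Hyperparameters here are illustrative assumptions.
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Column 1 of predict_proba is the estimated probability of class 1,
# i.e. a spoiler score between 0 and 1 for each post.
scores = model.predict_proba([[1, 0, 400], [0, 1, 50]])[:, 1]
print(scores)
```

The first test row resembles the long, plot-heavy training posts and the second resembles the short ones, so the first should receive the higher spoiler score.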
To validate and test this performance, I held back a sample of both spoiler and non-spoiler posts, with all spoiler labeling removed – in other words, I made my own unmarked spoilers. The output of the Random Forest is a predictor between 0 and 1.0, reflecting how likely a given Tumblr post was to have contained a spoiler. By selecting posts using various cutoff values for this predictor, here’s what the resulting True Positive and False Positive rates look like for the test sample:
For a cutoff which picks up 80% of the original spoilers (the true positives), the algorithm misidentifies 25% of the non-spoiler posts as spoilers (the false positives). Remember, however, that the purpose of this algorithm is to save people from unlabeled spoilers. Many of these "non-spoiler" posts may have actually been spoilers. For the remaining 20% of spoilers missed by the classifier (the false negatives), you can of course still rely on services like Tumblr Savior to pick up stuff which slips through the cracks. Using both products together, you should be well protected.
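The trade-off between true and false positives can be reproduced with a short helper that scans cutoffs from strictest to loosest. This is a sketch under assumptions: `cutoff_for_recall` is a hypothetical name, and the scores and labels are invented, not taken from the FanGuard test sample.

```python
def cutoff_for_recall(scores, labels, target_tpr):
    """Return (cutoff, tpr, fpr) for the highest cutoff reaching target_tpr.

    A post is flagged as a spoiler when its score is >= cutoff.
    Assumes labels contains at least one 1 and at least one 0.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    for cutoff in sorted(set(scores), reverse=True):
        flagged = [label for s, label in zip(scores, labels) if s >= cutoff]
        tpr = sum(flagged) / positives
        fpr = (len(flagged) - sum(flagged)) / negatives
        if tpr >= target_tpr:
            return cutoff, tpr, fpr
    raise ValueError("target_tpr cannot be reached")


# Invented held-out scores; 1 = spoiler, 0 = non-spoiler.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.15, 0.10]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
cutoff, tpr, fpr = cutoff_for_recall(scores, labels, target_tpr=0.8)
print(cutoff, tpr, fpr)  # 0.6 0.8 0.2
```

Lowering the cutoff further would catch the last spoiler in this toy sample, but only by also flagging more non-spoiler posts.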
Want to Try It Out?
To see all this in practice, you can go to my web app, FanGuard.xyz, and type in a tag or phrase to search for on Tumblr. You’ll get back a page of search results, with links to individual blog posts, the date of the posting, a column saying whether or not the post was tagged as a spoiler, and the results of FanGuard, i.e. whether or not my algorithm thought the post was a spoiler. In the drop-down menu, you have access to a spoiler filter not just for Star Wars, but for seven other movies, TV shows, and games taken from lists of the most popular reblogged content on Tumblr in 2015.
There is also a quick pre-check before each spoiler filter is applied to a post; this step decides if a given post is Star Wars-related, Age of Ultron-related, etc., based on whether it contains relevant words. The "How careful should I be?" buttons give different levels of filtering, catching 60%, 80% (default), and 90% of spoilers, but with increasing false positive rates. Give it a try, but be aware that Tumblr can sometimes be NSFW!
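A keyword pre-check of this kind could be as simple as the sketch below. The keyword lists and the function name `is_about` are hypothetical; the post does not say which words FanGuard actually checks for.

```python
import re

# Hypothetical keyword lists, invented for illustration.
TOPIC_KEYWORDS = {
    "star wars": {"star wars", "jedi", "kylo", "tfa"},
    "age of ultron": {"ultron", "avengers"},
}


def is_about(topic, text, tags):
    """Return True if the post text or tags contain any keyword for topic."""
    haystack = " ".join([text] + list(tags)).lower()
    return any(
        re.search(r"\b" + re.escape(keyword) + r"\b", haystack)
        for keyword in TOPIC_KEYWORDS[topic]
    )


print(is_about("star wars", "Kylo did what?!", []))            # True
print(is_about("star wars", "Ultron was fun", ["avengers"]))   # False
```

Only posts that pass this check for the selected title would then be scored by that title's spoiler filter.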
What Makes a (Tumblr) Spoiler? And Next Steps…
So beyond Star Wars plot-specific words (see: "Kylo" and "Ren"), what does a spoiler look like on Tumblr? Looking across the eight separate Random Forest filters, several patterns emerge.
First, the most predictive feature of a spoiler by far is the total number of words in a post, with spoiler posts being on average over twice as long as non-spoiler posts. This makes sense, because when people spoil a movie or game, they’re often writing an in-depth description of its plot or characters (complete with convoluted conspiracy theories!).
Beyond length, several other features show up time and time again across filters. Here is a graph of the most important features appearing in at least three of the models (the variable averaged on the y-axis is a numerical measure of the classification power of a feature in the Random Forest):
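The aggregation behind such a graph can be sketched as follows, assuming each model contributes a mapping from feature name to importance (for a scikit-learn Random Forest these values would typically come from the `feature_importances_` attribute). The function name, the model names, and the numbers are all invented for illustration.

```python
from collections import defaultdict


def shared_importances(importances_by_model, min_models=3):
    """Average importances for features present in at least min_models models."""
    seen = defaultdict(list)
    for importances in importances_by_model.values():
        for feature, value in importances.items():
            seen[feature].append(value)
    averaged = {
        feature: sum(values) / len(values)
        for feature, values in seen.items()
        if len(values) >= min_models
    }
    # Most important features first.
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Invented importances for three hypothetical models.
example = {
    "star_wars": {"word_count": 0.5, "omg": 0.125, "kylo": 0.25},
    "age_of_ultron": {"word_count": 0.25, "omg": 0.125},
    "show_c": {"word_count": 0.75, "omg": 0.125},
}
ranking = shared_importances(example)
print(ranking)  # [('word_count', 0.5), ('omg', 0.125)]
```

Title-specific words such as "kylo" drop out because they matter in only one model, which leaves the vocabulary shared across titles.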
When they talk about spoilers, Tumblr users seem to be employing a common underlying vocabulary and grammar, regardless of the movie, TV show, or game. Much of that common language appears to be strong sentiment words, especially obscenities. One could therefore try to train a "Universal Spoiler Filter" to identify the "spoileriness" of a comment based on this general vocabulary, rather than having to learn the details of every individual movie or show. Further information could also be gained from the grammar structure of the posts. For other ideas along these lines, see this paper by a group of researchers at the University of Maryland and Johns Hopkins University, who used Machine Learning to detect spoilers on the website TVTropes.
Long story short, it turns out that Machine Learning can help you make a pretty good spoiler filter for Tumblr. While you do incur some false positives, many of those may actually be true positives, due to imperfect user labeling. In three weeks, I was able to get a full version of FanGuard up and running, but there are tons of more interesting things to be done with the data. This could entail playing around with other techniques in Machine Learning and natural language processing (such as LDA topic modeling). Or it could be trying to further determine the "universal language" of spoilers. This is an amazingly rich and interesting dataset, so hopefully either a new Insight Fellow or I myself will have the chance to explore it even more deeply in the future!