神刀安全网

Improved GIF Tagging

Tumblr receives a massive daily volume of gifs. We can only associate gifs with metadata from the post, rather than the gif itself, which presents a tricky technical question: how should this gif be indexed for future searches? The post’s tags could be used, but users often use a post’s tags as an under-your-breath-style postscript to the content, not actual post metadata. Gif reactions are ubiquitous on social media platforms, and users expect relevant images quickly supplied to their fingertips to keep the banter going. So Tumblr has to address the need for an accurate and fast heuristic for returning gifs based on a text query.

As part of Tumblr’s recent Hack Day, I devised a potential improvement to our current method of matching gifs and tags. I built a standalone service, nicknamed Taggy, consisting of a simple API and an even simpler user frontend. The goal was to create a classification pipeline from Tumblr’s already extensive gif library to users, prompting them for additional metadata about a gif, and then storing that response in a way that provides easy future search and analysis.

First, a quick rundown of the technologies I chose for this project.

Slim provided a fast and lightweight framework for writing the API, Polymer provided clean, modular front end elements that could be quickly ported outside of this project, and Neo4j is a graph database that offered the ability to quickly and flexibly store relationships between data nodes. Below is my quick sketch of how I envisioned a finalized Taggy service would work.

The main motive to use a graph database rather than a relational database was curiosity. The focus of this project would be about relationships between objects and not as much about the objects themselves. A graph database presented a perfect opportunity to quickly store and traverse these relationships while also providing the flexibility of storing relationship properties to differentiate and analyze. Neo4j was incredibly simple to set up and start using; their docs were thorough, and the design of Cypher Query Language was foreign, yet intuitive.

To start, I wrote a simple script to scrape Tumblr posts with gifs into Taggy’s graph and associate them to the parent post’s tags. With some sample data, Neo4j already had some interesting node views. Below you can see a query for all nodes in the database. Already you can see patterns of images with tags clustered around them. Most of these associations are from the post in which the image was found.

MATCH (n) RETURN n LIMIT 100 

Above we are simply selecting 100 database nodes, regardless of their labels or properties. The result is we get a pretty set of image and tag nodes.
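The scraping step that seeded the graph can be pictured with a small in-memory sketch. This is not the actual Taggy script: the post records and their field names (`id`, `gif_urls`, `tags`) are hypothetical, and plain Python sets and dicts stand in for Neo4j nodes and relationships.

```python
from collections import defaultdict

# Hypothetical post records; the field names are assumptions for illustration,
# not the schema of any real Tumblr API response.
SAMPLE_POSTS = [
    {"id": 1, "gif_urls": ["a.gif"], "tags": ["lol", "Cats"]},
    {"id": 2, "gif_urls": ["a.gif", "b.gif"], "tags": ["lol"]},
]


def build_graph(posts):
    """Collect image nodes, tag nodes, and image-tag edges from posts.

    Each edge maps an (image, tag) pair to the set of post ids in which
    that image appeared alongside that tag.
    """
    images, tags = set(), set()
    edges = defaultdict(set)
    for post in posts:
        for url in post["gif_urls"]:
            images.add(url)
            for raw_tag in post["tags"]:
                # Normalizing tag names is an assumption of this sketch.
                name = raw_tag.strip().lower()
                tags.add(name)
                edges[(url, name)].add(post["id"])
    return images, tags, dict(edges)


images, tags, edges = build_graph(SAMPLE_POSTS)
```

Because the same gif can show up in several posts, an image node naturally ends up with a cluster of tag edges around it, which is the pattern visible in the node view.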

Once I had photos and tags in the system, I set up a front end for exposing a user to a question prompt. The idea is simple: you see a photo and a tag underneath and you’re asked if they match.

I initially thought a simple Yes/No interface would suffice, but a brief discussion with members of the community team surfaced one of the biggest concerns: accidentally serving unflagged NSFW content to the user. I added a “panic” button that immediately hides the current image and prompts the user to confirm that they mean to mark the image as NSFW.

The final button I realized we would need is a Pass button for gifs of subject matter you don’t recognize, or tags you don’t understand. This would save us from spurious data generated by users forced to make a guess. Each button press would establish a new user-based relationship between an Image and a Tag with a filterable property indicating whether the user responded in a positive, negative, or neutral manner. Using Cypher, you can filter by the kinds of relationships between two nodes. After playing with Taggy as a couple of test users, I selected image nodes that had human-sorted relationships with tags and ordered the results to start with the images with the most of those relationships.
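As a rough illustration of how a button press could become a filterable relationship, here is a minimal Python sketch. The names `SortEdge`, `record_press`, and `matching` are hypothetical, and encoding Yes/No/Pass as `True`/`False`/`None` is an assumption; the queries in this post only show that a `response` property of `true` marks a positive answer. The NSFW button is left out because the post does not describe how that flag is stored.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed encoding of the three answer buttons (not confirmed by the post).
BUTTON_TO_RESPONSE = {"yes": True, "no": False, "pass": None}


@dataclass(frozen=True)
class SortEdge:
    """One user's answer about whether a tag fits an image."""
    user: str
    image: str
    tag: str
    response: Optional[bool]


def record_press(edges: List[SortEdge], user: str, image: str,
                 tag: str, button: str) -> SortEdge:
    """Append a new relationship for a button press and return it."""
    edge = SortEdge(user, image, tag, BUTTON_TO_RESPONSE[button])
    edges.append(edge)
    return edge


def matching(edges: List[SortEdge], response: Optional[bool]) -> List[SortEdge]:
    """Filter relationships by their response property."""
    return [e for e in edges if e.response is response]


edges: List[SortEdge] = []
record_press(edges, "user_a", "a.gif", "lol", "yes")
record_press(edges, "user_b", "a.gif", "lol", "pass")
record_press(edges, "user_b", "b.gif", "lol", "no")
positive = matching(edges, True)  # only user_a's answer
```

Keeping Pass as its own value, rather than dropping the press entirely, preserves the fact that a user saw the pair and declined to judge it.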

MATCH (n:Image) MATCH (n)-[r:SORT {response: true}]-(t:Tag) RETURN n, t, COUNT(r) AS sort_count ORDER BY sort_count DESC LIMIT 100

In the above query, we are first matching all nodes n that have an Image label (something given to a node at creation). Then we are refining that set by matching the Image nodes that have SORT relationships r with a response property marked as true (a relationship generated by a user who indicated the Image and Tag were a match). Finally, we are returning each matched Image and Tag together with a count of the relationships between them, ordered so that the pairs with the most user-defined relationships come first.

All that is a long way of saying we queried for Image/Tag associations that held the most user agreement, which we can infer would indicate the highest likely match. You can easily reverse this query if you were looking for the most likely image matches to a specific term.

MATCH (t:Tag {name: "lol"}) MATCH (t)-[r:SORT {response: true}]-(n:Image) RETURN n, t, COUNT(r) AS sort_count ORDER BY sort_count DESC LIMIT 100
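To make the aggregation in these two queries concrete, here is a small Python sketch of the same counting logic over an in-memory list. It only mirrors the grouping and ordering, not Neo4j itself; the sample triples and the function names `top_pairs` and `top_images_for_tag` are made up for illustration.

```python
from collections import Counter

# Hypothetical (image, tag, response) triples standing in for SORT relationships.
SAMPLE_SORTS = [
    ("a.gif", "lol", True),
    ("a.gif", "lol", True),
    ("b.gif", "lol", True),
    ("a.gif", "cats", False),
    ("b.gif", "cats", True),
]


def top_pairs(sorts, limit=100):
    """Count positive responses per (image, tag) pair, highest count first."""
    counts = Counter(
        (image, tag) for image, tag, response in sorts if response is True
    )
    return counts.most_common(limit)


def top_images_for_tag(sorts, tag_name, limit=100):
    """Count positive responses per image for one tag, highest count first."""
    counts = Counter(
        image
        for image, tag, response in sorts
        if tag == tag_name and response is True
    )
    return counts.most_common(limit)


best_pair = top_pairs(SAMPLE_SORTS)[0]
lol_images = top_images_for_tag(SAMPLE_SORTS, "lol")
```

In both directions the count is taken per Image/Tag pair, so a gif that many users agree is "lol" outranks one that only a single user tagged that way.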

That’s as far as I got in the Hack Day implementation; I spent the majority of my time reading about Cypher Query Language and how best to structure graph databases. I’m certain I didn’t get it exactly right with my first attempt, but I got a basic proof of concept working. As I continue playing with Taggy, I will keep refining how I use node labels and properties to serve the question at hand. There are still many aspects of this problem to account for, such as de-duplicating and merging classification data on separate nodes representing the same gif, filtering non-metadata tags from circulation, and truly testing the scalability of Cypher in a production environment. For now, I’m excited about what I learned and look forward to posting updates about our uses of graph databases at Tumblr.

@fousheezy
