Hundreds of millions of people use Quora to discover high-quality answers to questions important to them. The quality of our content and the civility of our community are two important factors that make Quora special. We want to maintain that quality even as billions of people start using Quora.
We have a huge volume of textual data via millions of questions, answers, and comments on Quora. We also have an enormous amount of metadata to complement our text corpus, including user upvotes & downvotes, user-topic interest & expertise relationships, question-topic relationships, the topic ontology, a social and influence graph of users, and a history of actions taken by users on Quora.
Such a rich dataset puts us in a unique position to use various Natural Language Processing (NLP) techniques to solve exciting problems critical to our success. In this post, we’ll give an overview of some of our most important NLP challenges in the following broad areas:
- Maintaining and improving the quality of our content as we grow
- Providing more structure to our content and making knowledge seeking easier
- Keeping our community safe and civil
We will go into more depth about these challenges, and introduce other types of NLP challenges, in future posts.
On Quora, you can find the best answer to any question. Some answers are naturally more helpful than others, so it’s important for us to understand answer quality in order to show readers the best answers first. To do so, we need to score any answer on Quora along many subjective dimensions, including writing style, readability, completeness, and trustworthiness. We’ve previously written in more detail about our approach to understanding answer quality here.
Automatic Grammar Correction
Because Quora is a global community of people sharing and growing the world’s knowledge, we want to enable people with various levels of English proficiency to write on Quora. Using NLP, we can automatically improve the grammar of a piece of text and make it easier to read, without changing its meaning.
People ask all sorts of questions on Quora. Some of these questions are easy to read, and others are not. Some are very specific, and others are more general. Some have a single, objective answer, and others spark a subjective discussion. It’s important for us to understand all these qualities (among many others!) to feed into our quality and relevance systems.
Duplicate Question Detection
It’s frustrating to see many different questions asking the same underlying query — readers have to look at many pages to find answers, and writers have to write the same answer many times. Instead, we want to have a single, canonical question on Quora for any query. To achieve that goal, we need to be able to detect duplicate questions by predicting if a new question already exists on Quora in some other form. With millions of questions on Quora, duplicate question detection is a challenging problem, and doing so in real-time as people type questions into the Ask Bar is even harder.
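To make the problem concrete, here is a minimal sketch of a duplicate check based on word overlap (Jaccard similarity). It is only an illustration: the function names, the sample questions, and the 0.6 threshold are assumptions made for this example, not a description of the system used on Quora, and a linear scan like this would not scale to millions of questions or to real-time use.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

def find_duplicates(new_question, existing_questions, threshold=0.6):
    """Return (score, question) pairs at or above the threshold, best first.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    new_tokens = tokenize(new_question)
    scored = [(jaccard(new_tokens, tokenize(q)), q) for q in existing_questions]
    matches = [pair for pair in scored if pair[0] >= threshold]
    return sorted(matches, key=lambda pair: pair[0], reverse=True)

# Hypothetical sample data for illustration only.
existing = [
    "How do I learn machine learning?",
    "What is the best way to learn Python?",
    "How can I learn machine learning?",
]
results = find_duplicates("How do I learn machine learning fast?", existing)
for score, question in results:
    print(f"{score:.2f}  {question}")
```

Word overlap misses paraphrases that share few words ("How old is the Earth?" versus "What is our planet's age?"), which is one reason the problem is hard in practice.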
Related Question Generation
After getting our questions answered, we often want to find answers to other related or follow-up questions. The same thing happens on Quora, too—people love discovering new content through related questions. Given a question, finding the most relevant related questions already on Quora is a challenging NLP problem, particularly because the line between a duplicate question and a related question is often thin.
Topic Biography Quality
Many of the people writing answers on Quora are world experts in their field. Understanding a writer’s expertise and authority on a topic is an essential feature of our quality systems, and topic biographies are an essential input. Using topic biographies, writers can share their qualifications for writing in a topic; for example, biographies in the “Machine Learning” topic include:
- “Ph.D. in Applied Statistics.”
- “Taken some undergrad courses.”
- “ML Engineer at Quora.”
- “Head of Montreal Institute for Learning Algorithms.”
Each of these biographies signifies a different level of expertise and authority, and predicting expertise based on topic biographies alone is a hard NLP challenge. Not only are topic biographies less than 100 characters, but our system also has to work on topics ranging from parenting to jazz to rocket science.
Quora’s topic ontology is massive, with topics ranging from broad topics like “Science” to narrow topics like “Tennis Courts in Mountain View”. We want to show people content in topics they want to learn more about as well as unanswered questions in topics they are an expert in. To do so, questions need to be labeled with appropriate topics. Our NLP challenge is to identify which topics in our ontology should be applied to each new question on Quora, which is particularly difficult because each question contains a small amount of text.
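As a rough illustration of topic labeling as a multi-label problem, the sketch below scores a question against a tiny, made-up ontology by counting keyword matches. The `TOPIC_KEYWORDS` table, the `label_topics` function, and the `min_hits` parameter are all hypothetical; a real ontology is far larger and would not be driven by hand-written keyword sets.

```python
import re

# Hypothetical miniature ontology: topic name -> indicative keywords.
TOPIC_KEYWORDS = {
    "Machine Learning": {"model", "training", "neural", "classifier", "learning"},
    "Tennis": {"tennis", "racket", "serve", "court"},
    "Parenting": {"child", "toddler", "parent", "kids"},
}

def label_topics(question, topic_keywords=TOPIC_KEYWORDS, min_hits=1):
    """Return topics whose keywords appear in the question, most hits first."""
    tokens = set(re.findall(r"[a-z]+", question.lower()))
    hits = {topic: len(tokens & kws) for topic, kws in topic_keywords.items()}
    # Sort by descending hit count, then alphabetically for a stable order.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [topic for topic, count in ranked if count >= min_hits]

print(label_topics("How do I improve my tennis serve?"))
print(label_topics("Which neural model should my kids start learning with?"))
```

Because a question is only a handful of words, a single ambiguous keyword can decide the label, which shows why short text makes this task difficult.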
Millions of people write long, thoughtful answers on Quora, and that’s phenomenal. But, sometimes readers just want to skim through the content of an answer without having to read all the details, particularly when reading Quora on a mobile device. One of our open NLP challenges is automatically creating summaries of answers, where answer summaries capture the spirit of the original, longer answer. We don’t yet have answer summaries, but we think they can greatly improve user experience.
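One simple baseline for this kind of task is extractive summarization: pick the sentences whose words are most frequent in the answer as a whole. The sketch below is a minimal version of that idea, assuming a small hand-picked stopword list and naive sentence splitting on punctuation; it is a hypothetical example, not a feature that exists on Quora.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
             "that", "for", "on", "with", "as", "are", "this", "be"}

def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # Average document-level frequency of the sentence's content words.
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)

# Hypothetical answer text for illustration only.
answer = ("Python is popular for data science. "
          "Python libraries make data analysis fast. "
          "I had pasta for lunch.")
print(summarize(answer, max_sentences=1))
```

An extractive baseline can only copy sentences, so it cannot rephrase or condense an argument, and capturing the spirit of a long answer generally needs more than this.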
Automatic Answer Wikis
While some readers enjoy reading several different answers on the same question page, others would rather quickly find an answer and move on to something else. Expanding on answer summaries, another open NLP challenge is creating a summary answer wiki using all of the information on a question page. For example, we might want to compile and summarize the various views expressed across all the answers of a question in a format that’s easy to skim. This problem is even harder than just answer summaries, but it’s one we’re excited to tackle.
Hate Speech/Harassment Detection
Quora has a polite and helpful community that doesn’t shy away from direct, but respectful, discourse. It’s essential that we keep it that way as we continue to scale. In order to maintain Quora’s civil and respectful tone, we can use NLP to quickly detect content that contains abusive and harassing language so it can be removed.
Quora is a great platform for writers to get distribution on their content, but that also means that it’s a great platform for spammers to drive traffic to their websites. Spam detection is thus an important problem that we solve using NLP: given a question or answer, we automatically classify whether or not it is “spam”. While we do want individuals and companies representing their products on Quora, there’s a thin, but important, line between spam and self-promotional content.
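Spam detection framed this way is a binary text classification task. The sketch below shows one textbook approach, a multinomial Naive Bayes classifier with add-one smoothing, trained on four invented examples. The class name, the training sentences, and the labels are assumptions for illustration; the post does not say which model is used in production.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercased alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        # Words never seen in training carry no evidence, so skip them.
        words = [w for w in tokenize(text) if w in self.vocab]
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(doc_count / total_docs)
            score += sum(math.log((counts[w] + 1) / denom) for w in words)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Invented toy training data for illustration only.
texts = [
    "Buy cheap followers now click here",
    "Click here for free traffic to your website",
    "Gradient descent minimizes a loss function",
    "A good answer explains its reasoning clearly",
]
labels = ["spam", "spam", "ham", "ham"]
model = NaiveBayes().fit(texts, labels)
print(model.predict("click here to buy cheap traffic"))
print(model.predict("this answer explains gradient descent"))
```

A bag-of-words model like this has no notion of intent, so it cannot by itself separate acceptable self-promotion from spam when the two share the same vocabulary.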
Question Edit Quality
On Quora, anyone in the community can improve the wording of a question. Sometimes, users change questions in a way that changes their meaning, or worse, vandalizes them. Predicting whether or not an edit preserves a question’s meaning or makes the question clearer is also a hard NLP challenge that we’re working on.
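A very crude first signal for this task is how much of the original wording survives the edit. The sketch below compares the word sequences before and after an edit using Python's standard `difflib.SequenceMatcher`. The function names and the 0.7 threshold are assumptions chosen for the example, not a description of how edits are actually evaluated.

```python
import difflib
import re

def edit_similarity(before, after):
    """Ratio in [0, 1] of matching words between two versions of a question."""
    a = re.findall(r"\w+", before.lower())
    b = re.findall(r"\w+", after.lower())
    return difflib.SequenceMatcher(None, a, b).ratio()

def looks_meaning_preserving(before, after, threshold=0.7):
    """Heuristic: treat an edit as safe when most of the wording survives.

    The threshold is an arbitrary illustrative value.
    """
    return edit_similarity(before, after) >= threshold

# Hypothetical edits for illustration only.
cleanup = ("What is best way to learn python",
           "What is the best way to learn Python?")
vandalism = ("What is the best way to learn Python?",
             "Buy cheap watches online")

print(looks_meaning_preserving(*cleanup))
print(looks_meaning_preserving(*vandalism))
```

Surface overlap catches wholesale vandalism but not small edits that flip the meaning, such as inserting "not" into a question, which is why this remains a hard problem.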
Quora has a huge database of high-quality textual data and rich metadata. We have a variety of NLP problems that are critical to our mission, and we’re in a unique position to solve them. We’ve already built amazing solutions for some of these problems, but our systems are constantly improving, and many problems remain to be solved.
We invite NLP researchers and engineers to apply to our NLP position and help us grow and share the world’s knowledge!