Most of the content on the Web is in the form of natural language text, such as news articles, blogs, tweets, and queries to search engines. To improve the ability of computers to perform tasks on these texts, including translation, summarization, recommendation, and search, researchers develop algorithms, models, and tools for Natural Language Processing (NLP) that computers execute automatically. Yet current state-of-the-art NLP models are still not fit to process the variety of text styles that appear across the Web. In an effort to expand the utility of NLP tools for Web content, in this blog post we discuss a novel approach to syntactic parsing for Web queries.
One important building block of automatic text processing for NLP is the syntactic parser: an automatic system for discovering the syntactic connections between words in texts. An example of the output of syntactic parsing is shown above for the sentence “the cat ate food:” cat is the subject of the verb ate, food is the direct object, and the serves as a determiner for the word cat. These relations form the syntactic parse tree for the given sentence. Syntactic parsers are usually trained to handle grammatical English sentences, such as those written in editorially-supervised news articles. Therefore, the performance of off-the-shelf syntactic parsers drops when they are applied to texts that are not “well written.”

Yet most of the text encountered on the Web is generated not by professional authors, but by ordinary users. One very common type of such text is the Web search query. In theory, syntactic parsers could be applied to Web search queries to help determine the best search results based on the relationships between the words in a query. They could also find entities in the query, help a user reformulate a query, and more. However, search queries often do not follow the grammatical conventions of standard English, and many resemble keyword-keyphrase bundles, e.g. “weather chicago,” “best hotel where to find new york,” and “why is the sky blue yahoo answers.”
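To make the parse of “the cat ate food” concrete, here is a minimal sketch of one common way a dependency parse can be represented in code: each token stores the index of its head and the relation connecting them. The representation and helper function are illustrative, not any particular parser's output format.

```python
# Sketch: a dependency parse as (token_index, head_index, relation) triples.
# head_index -1 marks the root of the sentence.
tokens = ["the", "cat", "ate", "food"]
parse = [
    (0, 1, "det"),    # "the" is a determiner of "cat"
    (1, 2, "nsubj"),  # "cat" is the subject of "ate"
    (2, -1, "root"),  # "ate" is the root of the sentence
    (3, 2, "dobj"),   # "food" is the direct object of "ate"
]

def children(parse, head):
    """Return the indices of tokens governed by `head`."""
    return [tok for tok, h, _ in parse if h == head]

root = next(tok for tok, h, _ in parse if h == -1)
print(tokens[root], [tokens[c] for c in children(parse, root)])  # → ate ['cat', 'food']
```

Walking from the root through `children` recovers the whole tree, which is what downstream applications (search ranking, entity finding) would traverse.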
When faced with the challenge of parsing a given Web query, two problems need to be addressed:
- What is the correct parse structure for this query?
- How can an automatic parser learn to output this structure?
In a research paper published in the proceedings of this week’s 15th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), we propose solutions for both of these problems. First, in answer to problem (1), we define a grammar structure that suits these keyphrase bundles, based on the notion of a syntactic segment: a word sequence that can be represented by a single parse tree but is not syntactically connected to the other parts of the query. In our proposed grammar structure, each query is split into syntactic segments, and each segment is represented by a (fully connected) parse tree. An example:
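As an illustration, the segment-based grammar structure might be represented as follows, using the “invent toy school project” query discussed below. The data layout is a hypothetical sketch, not the treebank's actual file format; the key property it captures is that no dependency arc crosses a segment boundary.

```python
# Hypothetical representation: a query is a list of syntactic segments,
# each carrying its own parse tree of (token_index, head_index, relation)
# triples, with head_index -1 marking the segment-internal root.
query = ["invent", "toy", "school", "project"]
segments = [
    {"tokens": ["invent", "toy"],
     "parse": [(0, -1, "root"), (1, 0, "dobj")]},  # "toy" is object of "invent"
    {"tokens": ["school", "project"],
     "parse": [(0, 1, "nn"), (1, -1, "root")]},    # "school" modifies "project"
]

# The segments exactly cover the query, in order, with no arcs between them.
flat = [tok for seg in segments for tok in seg["tokens"]]
assert flat == query
print(len(segments), "segments")  # → 2 segments
```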
In addition to defining this grammar structure, we also released a 5,000-query treebank: a dataset of Web queries landing on Yahoo Answers, which we mined and manually tagged for segmentation and parsing, for the benefit of the research community (via the Yahoo Webscope program).
After running a number of state-of-the-art parsers on this data, we verified that systems trained for grammatical English perform poorly. For example, in the query “invent toy school project,” which corresponds to the need “invent a toy for a school project,” an out-of-the-box parser thinks the word “toy” modifies the word “project,” when in fact it does not – they are parts of different segments!
Next, we turn to problem (2): how to train a new system that can handle our segmentation-enriched grammar structure. We propose three approaches, which we implemented and compared in our paper. First, we designed a two-step procedure: learning the segmentation of a query, and then applying an off-the-shelf parser to each segment found in the first phase. For segmentation, a standard approach is to use supervised learning, where a CRF system for detecting segments is trained on the queries in our treebank.
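The setup a CRF would be trained on can be sketched as sequence labeling: each query token is tagged “B” if it begins a new segment, “I” otherwise, and per-token features feed the tagger. The label scheme and the toy features below are illustrative; they are not the feature set used in the paper.

```python
# Sketch: casting query segmentation as B/I sequence labeling for a CRF.
def segments_to_labels(segments):
    """Turn a gold segmentation into per-token B/I labels."""
    labels = []
    for seg in segments:
        labels += ["B"] + ["I"] * (len(seg) - 1)
    return labels

def token_features(tokens, i):
    """Toy feature dict: the word itself plus its immediate neighbors."""
    return {
        "word": tokens[i],
        "prev": tokens[i - 1] if i > 0 else "<s>",
        "next": tokens[i + 1] if i < len(tokens) - 1 else "</s>",
    }

query = ["invent", "toy", "school", "project"]
gold_segments = [["invent", "toy"], ["school", "project"]]
labels = segments_to_labels(gold_segments)
print(labels)  # → ['B', 'I', 'B', 'I']
```

A trained CRF then predicts these B/I labels for unseen queries, and the “B” positions are the segment boundaries handed to the parser.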
We next proposed two methods based on the concept of distant supervision, where the training examples are not manually-tagged queries but are derived automatically. The advantage of distant supervision over supervised learning is that it removes the long, arduous, and sometimes error-prone human labeling otherwise required to train a machine learning algorithm. Specifically, we utilized millions of queries for which the searchers clicked on pages in Yahoo Answers. The question on the clicked Yahoo Answers page, which is typically a well-formed sentence, expresses the same request as the corresponding query and often contains many of its words. We call a pair consisting of a query and a corresponding clicked question an aligned pair.
Knowing this, we defined several rules for automatically detecting segment boundaries in queries based on their aligned questions. Examples of these rules include word reordering (i.e. words that switch positions between the query and the question) and connectives (such as forms of the verb be) that appear in the question but are missing from the query.
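The two cues just mentioned can be sketched roughly as follows. This is an illustrative approximation, not the paper's exact rule set: the prepositions in the connective list are added here purely so the demo is runnable, and the alignment is a naive word-position lookup.

```python
# Sketch of two distant-supervision cues for segment boundaries, given an
# aligned (query, question) pair:
#   (a) adjacent query words that appear in reversed order in the question;
#   (b) a connective in the question, missing from the query, standing
#       between the two words. (Prepositions added for illustration only.)
CONNECTIVES = {"is", "are", "was", "were", "be", "being", "been", "am",
               "in", "for", "of", "to"}

def boundaries_from_alignment(query, question):
    pos = {w: i for i, w in enumerate(question)}  # word -> position in question
    bounds = set()
    for i in range(len(query) - 1):
        a, b = query[i], query[i + 1]
        if a not in pos or b not in pos:
            continue
        if pos[a] > pos[b]:                        # (a) reordering cue
            bounds.add(i + 1)
        elif any(w in CONNECTIVES for w in question[pos[a] + 1:pos[b]]):
            bounds.add(i + 1)                      # (b) connective cue
    return sorted(bounds)

query = ["weather", "chicago"]
question = ["what", "is", "the", "weather", "in", "chicago"]
print(boundaries_from_alignment(query, question))  # → [1]
```

Here the “in” that the question supplies but the query omits triggers a boundary after “weather,” yielding the two segments “weather” and “chicago.”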
Given the automatically-segmented queries, we train a CRF system similar to the supervised approach above.
In our second distant supervision approach, we apply a query-to-question (Q2Q) algorithm, which converts an input query into a full question. This algorithm utilizes the aligned pairs of queries and clicked questions to learn templates that perform this conversion (see details in Dror et al. 2013, From query to question in one click: suggesting synthetic questions to searchers).
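A toy illustration of template-based query-to-question conversion in the spirit of Q2Q: the real templates are learned from millions of aligned query/question pairs, whereas the two below are hand-written stand-ins.

```python
# Toy stand-ins for learned Q2Q templates: a pattern on the query's first
# word, and a question template with a slot for the remaining words.
templates = [
    (("weather", "*"), "what is the weather in {0}"),
    (("how", "*"), "how {0}"),
]

def q2q(query):
    """Convert a query into a full question using the first matching template."""
    tokens = query.split()
    for (head, _), template in templates:
        if tokens and tokens[0] == head:
            return template.format(" ".join(tokens[1:]))
    return query  # no template matched: fall back to the query itself

print(q2q("weather chicago"))  # → what is the weather in chicago
```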
Given a new query, our parsing algorithm:
- converts the query into a complete question using the Q2Q algorithm
- parses the generated question using an off-the-shelf parser
- projects the parse tree of the question onto the query, detecting segments based on words along the question’s parse tree that are missing from the query
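The projection step (step 3) can be sketched as follows. Each query word inherits the head of its counterpart in the question; when reaching the nearest query-word ancestor requires climbing through question words that are missing from the query, the link is dropped, and the orphaned word becomes the root of a new segment. The question parse below is hand-written for illustration, and the one-occurrence-per-word alignment is a simplifying assumption.

```python
# Sketch of projecting a question's parse onto its query.
question = ["how", "to", "invent", "a", "toy", "for", "a", "school", "project"]
# hand-written head indices for illustration (-1 = root):
# how->invent, to->invent, a->toy, toy->invent, for->invent,
# a->project, school->project, project->for
heads = {0: 2, 1: 2, 2: -1, 3: 4, 4: 2, 5: 2, 6: 8, 7: 8, 8: 5}
query = ["invent", "toy", "school", "project"]

def project(question, heads, query):
    q_pos = {w: question.index(w) for w in query}  # assumes one occurrence each
    in_query = set(q_pos.values())
    proj = {}
    for w in query:
        h, direct = heads[q_pos[w]], True
        while h != -1 and h not in in_query:
            h, direct = heads[h], False  # climbing through a missing word
        # keep the arc only if a query word is the head, reached directly
        proj[w] = question[h] if (h != -1 and direct) else None
    return proj

proj = project(question, heads, query)
print(proj)  # None marks a segment root whose link was dropped
```

Here “project” hangs off the question word “for,” which is absent from the query, so its link to “invent” is severed and “school project” becomes its own segment – exactly the two-segment analysis discussed above.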
We compared the performance of the three approaches (as well as a no-segmentation baseline) on the query treebank. On the segmentation task alone (measured by F1 score), both CRF approaches perform similarly over the entire set of queries. When we look into interesting subsets, differences start to emerge: queries that our off-the-shelf parser deemed difficult to parse, by reporting a low confidence score, were segmented better by our distant-supervised system. This is an important distinction, because low-confidence queries can be detected in a production system, which can then choose which segmentation algorithm to use online. Two other subsets we looked at were queries with just one segment and those with multiple segments. The supervised system performed better on the former, but the distant-supervised system did extremely well on the latter, highlighting its improved ability to detect correct segmentation locations (rather than merely avoiding over-detection).
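For readers unfamiliar with the metric, segmentation F1 can be computed as below, scoring predicted segment spans against gold spans by exact match. This is a hedged sketch of the standard precision/recall/F1 computation; the paper's exact evaluation protocol may differ in detail.

```python
# Sketch: segmentation F1 over exact-match segment spans.
def seg_f1(gold, pred):
    """F1 between gold and predicted (start, end) segment spans."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)          # spans predicted exactly right
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# spans as (start, end) token offsets over "invent toy school project"
gold = [(0, 2), (2, 4)]
pred = [(0, 2), (2, 3), (3, 4)]    # over-segmented prediction
print(round(seg_f1(gold, pred), 3))  # → 0.4
```

The over-segmented prediction gets precision 1/3 and recall 1/2, so F1 punishes both missed and spurious boundaries at once.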
When we look at the end-to-end parsing task, not all of these findings carry over. For example, Q2Q-based projection really picks up, because it parses well-formed sentences that carry more context than the query words alone. On single-segment queries, it even passes the gray bar (seen in the chart below), which marks the performance of a parser that has access only to the query text but has full knowledge of the correct query segments.
In sum, we introduced new tasks to the NLP community, together with a dataset for benchmarking, and proposed several algorithms to address these tasks. We still have a long way to go before reaching really good performance on the most difficult types of queries. We invite the research community to obtain the dataset from Webscope and try out other approaches that might do better!