I’ve recently been tinkering with a mechanism for determining word senses based on their grammatical usage. This has me pretty excited, because, so far, it seems to be reasonably accurate (i.e. not terrible), and lightning-fast. I’m doing this by doing some heavy statistical NLP work, computing statistical correlations between word senses and syntax — specifically, link-grammar disjuncts.
The basic observation driving this work is that fairly often, one can identify the meaning of a word (or at least narrow it down) simply by observing how it is used in a sentence. To do this, one needs accurate, fine-grained syntactical information about a sentence: and Link Grammar provides an abundance of it. Link Grammar generates very fine-grained syntactic linkage information for every word in a sentence. Every dictionary word is associated with dozens, or even hundreds, of different linkage patterns, or ‘disjuncts’. These disjuncts indicate how a word in a sentence can be connected to other words to its left or right. One may imagine that these disjuncts provide highly detailed grammatical information about a word. For example, they not only distinguish between a noun and a verb; they not only distinguish between present, past and future tenses of a verb, they not only distinguish between a transitive and an intransitive verb, but they also distinguish between a wide variety of relationships that are so fine that most are not even given formal names by linguists. This fine-grained information is a rich source of information, and is ideal for correlating with word senses.
For word-sense tagging, it turns out that simply picking the Most Frequent Sense (MFS) already gives a rather strong indication of what the correct word-sense assignment should be — it gives the correct answer about half the time– See [Mi04] below. So the idea here is that supplementing the MFS with additional data– how the word was used syntactically — will improve accuracy even further. That is, the idea is to compute the “MFS for a given syntactic context”.
Before one can perform statistical correlations between syntax and sense, one must first assign a sense to each word. This is, of course, the really hard part to modern NLP. For now, I’m taking a fairly easy, straightforward, yet strong approach to this: I’m using an algorithm due to Rada Mihalcea[Mi05]. This algorithm is currently, as far as I know, the most accurate algorithm known for tagging words with word senses. Unfortunately, it is also fairly slow and CPU intensive, making practical deployment tricky. It performs this tagging by associating, with each word in a sentence, a list of its possible senses. Then, senses of nearby words are linked together with a similarity measure. This forms a network, a graph, over the sentence, with vertices being word-senses, and edges being weighted by the similarity between senses. Such a network is formally a Markov chain, and can be solved as such. There are many ways of solving a Markov chain; Mihalcea proposes, and I’ve implemented, the Google (Page-Brin) page-rank algorithm[PB]. The result is that each vertex (each word-sense) is assigned a probability; The highest-probability senses do indeed appear to be the linguistically correct senses most of the time; the correct sense will almost always appear in the top three probabilities.
Once a word sense is identified, it can be correlated with the Link Grammar disjunct in play for the particular sentence. This is done simply by processing a lot of sample text. A database then stores a frequency count for each (word, disjunct, word-sense) triple. After a reasonable amount of data is accumulated the unconditional probability p(w,d,s) can be calculated (w==word, d==disjunct, s==sense), and, from this, various marginal and conditional probabilities and entropies. To make use of this information in a new sentence, one first parses the sentence using the Link Grammar parser, thus obtaining (word,disjunct) pairs. It is then a straightforward (and fast) database lookup to obtain the conditional probability p(s|w,d), the probability of observing the sense s given the pair (w,d). The exciting result of this effort is that, quite often, the conditional probability p(s|w,d) identifies one sense more or less uniquely (i.e. there is one sense for which p(s|w,d) is about 1).
Although this result is quite exciting, its based on the inspection of a small handful of nouns and verbs. I believe that the result holds well in a broad setting, but I don’t have any quantitative measure for the extent of the setting. Clearly, there will be *some* words for which the sense will be obvious from the grammatical usage. But, on average, how many of these are there per sentence? In some cases, the sense won’t be unique, but there will be many senses that are ruled out. How often does this happen? Is it possible that the accuracy results are equal to, or even improve, on the Mihalcea accuracy results? (They may improve on them by averaging over and eliminating false-positives, eliminating them because of the various different semantic contexts a word might appear in). A quantitative measure of the recall and accuracy can, in principle, be done, as there is a database (the SemCor database) of text that has been hand-annotated with the correct senses. I’ve not yet given any serious thought to performing this quantitative analysis.
Still, I’m pretty excited. It seems to work pretty well; I like that. I’ve already roughed in some basic infrastructure into the Link Grammar parser so that it will return the sense tags for each parse, assuming you have the database installed. The tags returned are WordNet 3.0 sense keys — strings like “run%2:38:04::” which can be used to look up specific senses from WordNet.
To expose this function, database support has been added to the link-grammar parser. This has been added to the parser itself, as opposed to a layer built on top of it, because database support is needed for other reasons — specifically, for parse ranking (Gee, I haven’t talked about parse ranking, have I?). The database support is provided by sqllite[SQLLITE]. This was picked for two reasons: (1) its license is public domain, and is thus compatible with the link-grammar BSD license, and (2) it is an embedded database, requiring zero administration by the user. This second point is quite unlike traditional SQL databases, which typically require trained database administrator to configure and operate. One reason that zero administration is possible is because the database is used in a read-only fashion: the data it holds is static. Code integrating this database is in the link-grammar SVN repository now, and will be available in version 4.4.2.
Creating the dataset is a good bit tricker. Currently, the Mihalcea algorithm is implemented within OpenCog. Was this a good technology choice? I dunno, but it seemed like a reasonable experiment at the time. Parsed sentences are fed to OpenCog, where word senses are assigned, and then frequency counts are updated. I’ve been feeding it a diet consisting solely of parsed Wikipedia articles — not very healthy, but maybe OK for now. The Markov chain network used to solve for word senses is four sentences wide, as a window sliding across an article. That is, a given word sense is influenced by other words occurring in sentences as far as four sentences away. This should keep accuracy up, without bogging down in solving a Markov chain across an entire article. The current OpenCog implementation uses the Page-Brin PageRank algorithm; however, I’m thinking tht it might be faster simple to use the linpack subroutine library to solve for the eigenvectors directly. (The Page-Brin algorithm shows its power when the Markov chain has billions or trillions of nodes, e.g. as used by Google. By contrast, with a four-sentence sliding window, the Markov matrix connects at most thousands of senses, and thus should be rapidly solvable by ordinary linear equation techniques.)
So far, I’ve put a few CPU-months of data crunching into this. Its not much. It’s slow; it takes minutes per sentence — and this gives only one word-sense, disjunct pair. To build up a reasonable statistical dataset will require cpu-hours or more per word — and there’s not really all that many cpu-hours in a cpu-month. So its slow slogging. The database coverage remains quite thin, as most disjuncts have been observed only a handful of times. There are maybe about a thousand or so words for which an ‘adequate’ amount of statistics have been collected; I’d like it better if the database was deep enough to cover at least 20K words and 100K senses. Since the preliminary results look so promising, I’m corssing my fingers, and am slowly been tinkering with ways to improve performance.
I’ll let you know …
[Mi04] Rada Mihalcea, Paul Tarau, Elizabeth Figa, “PageRank on Semantic Networks, with Application to Word Sense Disambiguation” (2004) COLING ‘04: Proceedings of the 20th international conference on Computational Linguistics DOI
[Mi05] Rada Mihalcea, “Unsupervised Large-Vocabulary Word Sense Disambiguation with Graph-based Algorithms for Sequence Data Labeling”, (2005) HLT ‘05: Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing DOI