This is my capstone project at Galvanize Data Science Immersive - San Francisco (2016).
Homographs are words that share the same spelling but can have different meanings. A list of English homographs can be found in homograps.md. For example, as an old joke goes:
Q: Why do you think movie stars are so cool?
A: Because they have lots of fans.
The twist here is that the words "cool" and "fans" each have two different meanings. It is fairly simple for a human to decide which meaning of a homograph is intended given the context in which it appears, but training a computer to do the same is not a simple task. In Natural Language Processing, this challenge is called "word-sense disambiguation". Several algorithms have been developed for it, some simple and some very sophisticated. In this project, I explore the word-sense disambiguation challenge with different types of models.
The dataset I used for this project was obtained from the Senseval corpus, where the meaning of the homograph is labeled in each example.
For each homograph, the example sentences and their labels are split into two lists. The documents are then lemmatized and TF-IDF vectorized using scikit-learn. Once the text documents are converted to numeric features, they can be passed to a model. For this project, I tried two different approaches to classifying the meanings of the homographs: unsupervised learning (clustering) and supervised learning.
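The vectorization step described above can be sketched roughly as follows. The example sentences and sense labels are made up for illustration and are not taken from the Senseval data. Lemmatization is left out to keep the sketch free of extra dependencies; a lemmatizing function could be supplied through the vectorizer's `tokenizer` argument.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical labeled examples for the homograph "hard" (not Senseval data).
examples = [
    ("the exam was hard to finish", "difficult"),
    ("this is a hard problem to solve", "difficult"),
    ("the bench is made of hard wood", "solid"),
    ("the ground froze hard as a rock", "solid"),
]

# Split the (sentence, label) pairs into two parallel lists.
documents, labels = map(list, zip(*examples))

# Turn the sentences into a sparse TF-IDF matrix, one row per sentence.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(documents)

print(X.shape)  # (number of sentences, vocabulary size)
print(labels)
```

The two lists stay aligned by position, so row `i` of `X` corresponds to `labels[i]`.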
The unsupervised approach ignores the labeled meanings of the homographs in their examples and attempts to separate the meanings using clustering. The reasoning is that if some context words appear repeatedly across the examples, unsupervised clustering may be able to detect useful patterns that help the classification process. The labels were then used only to determine the accuracy, precision and recall of the models.
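As a minimal sketch of this idea, the snippet below clusters TF-IDF vectors and then scores the clusters against the labels. K-means is an assumption here, since the text does not name the clustering algorithms that were tried, and the sentences are invented for illustration. Only accuracy is computed, to keep the example short.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

# Hypothetical examples for the homograph "hard" (not Senseval data).
documents = [
    "the exam was hard to finish",
    "this is a hard problem to solve",
    "learning the violin is hard work",
    "the bench is made of hard wood",
    "the ground froze hard as a rock",
    "a hard shell protects the seed",
]
labels = ["difficult", "difficult", "difficult", "solid", "solid", "solid"]

X = TfidfVectorizer().fit_transform(documents)

# Cluster without looking at the labels, one cluster per expected sense.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X).tolist()

# Cluster ids are arbitrary, so name each cluster after its majority label.
cluster_to_label = {}
for cluster in set(cluster_ids):
    members = [lab for lab, c in zip(labels, cluster_ids) if c == cluster]
    cluster_to_label[cluster] = Counter(members).most_common(1)[0][0]

# The labels are used only here, to evaluate the clustering.
predicted = [cluster_to_label[c] for c in cluster_ids]
accuracy = accuracy_score(labels, predicted)
print(f"accuracy after mapping clusters to labels: {accuracy:.2f}")
```

Mapping each cluster to its majority label gives the clustering the benefit of the doubt, so the resulting score is an optimistic estimate.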
Since none of the unsupervised methods yielded good results, I decided to use the labels in the datasets to make supervised predictions.
| Word | Accuracy | Precision | Recall |
|---|---|---|---|
| "HARD" | 72% | 74% | 80% |
| "INTEREST" | 75% | 73% | 75% |
| "LINE" | 72% | 74% | 72% |
Compared with the baseline algorithm, which always predicts the most common meaning of a word (with accuracy between 51.4% and 57%), this model performs better.
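A most-common-meaning baseline of this kind can be sketched with scikit-learn's `DummyClassifier`. The label counts below are invented for illustration, so the resulting accuracy is not one of the figures reported above.

```python
from sklearn.dummy import DummyClassifier

# Hypothetical sense labels for one homograph: 6 of one sense, 4 of the other.
labels = ["difficult"] * 6 + ["solid"] * 4

# The "most_frequent" strategy ignores the features entirely,
# so a placeholder feature column is enough.
X_placeholder = [[0]] * len(labels)

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, labels)

# Accuracy equals the share of the most common label (6 / 10 here).
baseline_accuracy = baseline.score(X_placeholder, labels)
print(f"baseline accuracy: {baseline_accuracy:.2f}")
```

Any real model should beat this number to be worth using, since the baseline needs no information about the sentence at all.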
The Multinomial Naive Bayes model is simple, yet effective for the purpose of this project. Since the model can be applied whenever a labeled dataset is available, it can be extended to other languages, including languages that do not use the Latin alphabet, such as Chinese or Japanese. For the time being, this approach is quite bulky because it needs one model per word, but this can be improved in the future.
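The one-model-per-word setup can be sketched as a dictionary of scikit-learn pipelines. This is an illustration under assumptions, not the project's actual code: the training sentences, the sense labels, and the helper `predict_sense` are all hypothetical, and feeding TF-IDF features into the classifier is inferred from the vectorization step described earlier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data keyed by homograph (not Senseval data).
training_data = {
    "hard": (
        [
            "the exam was hard to finish",
            "this is a hard problem to solve",
            "the bench is made of hard wood",
            "the ground froze hard as a rock",
        ],
        ["difficult", "difficult", "solid", "solid"],
    ),
    "line": (
        [
            "we waited in a long line for tickets",
            "people stood in line at the store",
            "she could not remember her line in the play",
            "the actor forgot his opening line",
        ],
        ["queue", "queue", "text", "text"],
    ),
}

# Train one TF-IDF + Multinomial Naive Bayes pipeline per homograph.
models = {}
for word, (documents, labels) in training_data.items():
    model = make_pipeline(TfidfVectorizer(), MultinomialNB())
    model.fit(documents, labels)
    models[word] = model


def predict_sense(word, sentence):
    """Hypothetical helper: predict the sense of `word` in `sentence`."""
    return models[word].predict([sentence])[0]


print(predict_sense("hard", "that was a hard exam"))
print(predict_sense("line", "there was a line outside the store"))
```

Keeping the models in a dictionary makes the storage cost visible: each new homograph adds its own vocabulary and classifier.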