Tags: Natural Language Processing, Computational Linguistics, Graduate School
A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well-attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques into graduate research projects outside NLP itself.
Given the nature of the talk, I did not find sharing the slides, as I do for my other talks, to be that useful. Instead, I'm putting that content into this blog post.
Interest in NLP is peaking due to a number of factors. First, there is the increased maturity of some techniques in the field, showcased by high-profile successes such as the Watson system, Siri and Google Translate (just to name a few). Second, there is the decreasing price of computational power (in 2011, an i7 with 6 hyper-threaded cores at 3.2 GHz and 32 GB of RAM could be purchased for $1,300, retail). Finally, there is the increased availability of data (Wikipedia, Twitter, the Enron corpus, just to name a few).
Still, NLP sucks (oblig. xkcd). Why? My take is that people perceive language as easy because they learned it easily (even though they seem to forget how difficult it is to master a foreign language), and they somewhat consider computers to be smart (although that expectation is changing and becoming more and more grounded in fact as computers are everywhere). Therefore, if the computer fails at language, that means the developer (that'd be you in this case) is really bad at their work.
I like to believe you can tell the NLP practitioners around you because they are the people with small dark clouds on top of them, raining at all times. That cloud is called high error rate, and if you (or, more likely, your thesis supervisor) want to add some NLP to your research work, you had better be ready to take some water damage.
How bad is it? you might ask. Well, let's take a "solved" problem in NLP: Part-of-Speech (PoS) tagging. That is the problem of, given a sentence, dividing its words into nouns, verbs, articles, etc. You might think of this as a trivial exercise... just a simple dictionary look-up, right? Nope. Nothing is easy when it comes to language, and this task (which won't solve any real-world problem by itself) still needs context to be performed; think, for example, of "flies like an arrow".
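To see why a dictionary look-up is not enough, here is a toy "tagger" (entirely hypothetical, for illustration only) that assigns each word its single most common tag. On the classic pair "Time flies like an arrow" / "Fruit flies like a banana" it produces identical tags for "flies" and "like" in both sentences, even though their parts of speech differ:

```python
from typing import List, Tuple

# Toy lexicon mapping each word to its single most common tag
# (hypothetical values, for illustration only).
LEXICON = {
    "time": "NOUN", "flies": "VERB", "like": "VERB",
    "an": "DET", "a": "DET", "arrow": "NOUN",
    "fruit": "NOUN", "banana": "NOUN",
}

def lookup_tag(sentence: List[str]) -> List[Tuple[str, str]]:
    # No context: every occurrence of a word always gets the same tag.
    return [(w, LEXICON.get(w.lower(), "UNK")) for w in sentence]

# Here "flies" really is a verb and "like" a preposition:
print(lookup_tag(["Time", "flies", "like", "an", "arrow"]))
# Here "flies" is a noun and "like" a verb, yet the look-up
# tagger produces exactly the same tags as before:
print(lookup_tag(["Fruit", "flies", "like", "a", "banana"]))
```

Only context can tell those two sentences apart, which is precisely why real taggers model the surrounding words.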
A state-of-the-art PoS tagger would be the Stanford Log-linear Part-Of-Speech Tagger, as published in:
Kristina Toutanova, Dan Klein, Christopher Manning, and Yoram Singer. 2003. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of HLT-NAACL 2003, pp. 252-259.
It has a performance quoted at 97.24% accuracy. Yay! But wait, that's per token. With an average English sentence length of 20 tokens (words), that means roughly 56% sentence-level accuracy. And this is what NLP considers a "solved" problem (the authors even say part-of-speech tagging is now a fairly well-worn road). Moreover, that is taking the most ambiguous word in English ("to") and giving it a PoS tag of its own (/TO) because it is too difficult to disambiguate without deeper syntactic understanding (if you are not aware of this, you might think the usual NLP reported numbers are somewhat unfairly good compared to your own runs). These numbers are also usually reported on training/evaluation sets quite different from the texts you might have at hand. For example, instructions tend to start with a verb in the imperative form ("1. do this", "2. do that"). I have seen untrained PoS taggers fail rather miserably on such texts.
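The back-of-the-envelope math is worth seeing. Under the simplifying assumption that each token is tagged independently, per-token accuracy compounds over the length of the sentence:

```python
# Per-token accuracy compounds across a sentence: if each of the ~20
# tokens in an average English sentence is tagged independently at
# 97.24% accuracy, the chance of getting the whole sentence right is:
token_accuracy = 0.9724
avg_sentence_length = 20

sentence_accuracy = token_accuracy ** avg_sentence_length
print(f"{sentence_accuracy:.1%}")  # about 57%
```

The independence assumption is of course a simplification (tagging errors cluster), but the estimate lands close to the ~56% sentence-level figure above.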
What does that mean for you? Well, whatever application you are using NLP for, it has to tolerate a high error rate. Better yet, smart the NLP away. Can you annotate it off-line? Think about it: most NLP research is done by annotating parts off-line so the impact of improvements to a particular subsystem can be measured. If you build an end-to-end NLP system, the errors of each stage will cascade into the next, and you most probably won't be able to show much in terms of the impact of NLP on the rest of your work. Another alternative is to reduce "all the wonderful variations of natural language" to a small number of predetermined cases. For example, many years ago I took part in the GALE program, Go/No-Go evaluation, distillation task. This task involved cross-language speech recognition plus machine translation. The distillation task was Question-Answering-like, but QA is difficult! So instead of doing full-fledged QA, they used template queries:
LIST FACTS ABOUT [event]
FIND STATEMENTS MADE BY OR ATTRIBUTED TO [person] ON [topic(s)]
DESCRIBE THE ACTIONS OF [person] DURING [date] TO [date]
These templates are a far cry from a full QA system but enable enough of the task to be of use, rendering the problem (somewhat) doable (at least researchable).
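The engineering win of templates is that "understanding" the query reduces to pattern matching. A minimal sketch (the template names and regexes are my own invention, not from GALE) could look like this:

```python
import re

# Template-based "QA": instead of parsing arbitrary questions, match the
# input against a handful of fixed templates. Names are hypothetical.
TEMPLATES = [
    ("LIST_FACTS",  re.compile(r"LIST FACTS ABOUT (?P<event>.+)")),
    ("FIND_QUOTES", re.compile(r"FIND STATEMENTS MADE BY OR ATTRIBUTED TO "
                               r"(?P<person>.+) ON (?P<topic>.+)")),
    ("DESCRIBE",    re.compile(r"DESCRIBE THE ACTIONS OF (?P<person>.+) "
                               r"DURING (?P<start>.+) TO (?P<end>.+)")),
]

def parse_query(query: str):
    for name, pattern in TEMPLATES:
        m = pattern.fullmatch(query.strip())
        if m:
            return name, m.groupdict()
    return None  # not one of the supported templates

print(parse_query("LIST FACTS ABOUT the 2004 tsunami"))
# -> ('LIST_FACTS', {'event': 'the 2004 tsunami'})
```

Everything outside the slots is fixed, so all the remaining (still hard!) NLP effort goes into finding answers, not into interpreting questions.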
OK, you gave it a try, but it looks like you're stuck doing NLP for the next year and a half. You've got some textual data (Twitter posts, software requirements, medical discharge summaries, you name it). I can offer you three tips:
Choose a toolkit, any toolkit.
Develop some data curiosity; don't get sucked into plain evaluation numbers.
Don't be afraid of data slaving, that is, annotate your own data.
Learn a toolkit (or more) and develop intuitions about the type of errors it makes. What is a "toolkit"? A set of NLP tools that work together, enabling you to assemble end-to-end systems. While it might be tempting to get individual tools and put them together as your own toolkit, don't roll your own. You will end up spending most of your time debugging it rather than doing research work in your own field. If you really, really want to roll your own, at least roll it like the existing ones (for example, use stand-off markup rather than modifying the text), which in turn means studying the existing ones. From that perspective, beware of standalone annotators. Don't be tempted to think "I got this program from this Web site that does Part-of-Speech tagging. It uses an input format which will be difficult to obtain and an output format that cleverly modified my input text," and then believe that if it came from the Internet it must be good (right? right?). Code that mixes brackets from different sources is bad for your health. Don't do it!
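If "stand-off markup" is new to you, the idea is simply that annotations live next to the text as character offsets, never inside it. A minimal sketch (my own toy representation, not any particular toolkit's format):

```python
from dataclasses import dataclass

# Stand-off markup: annotations are stored as character offsets into the
# text, so the original text stays byte-for-byte intact and annotations
# from different tools can coexist without mixing brackets.
@dataclass
class Annotation:
    start: int   # character offset where the span begins
    end: int     # character offset just past the span
    label: str   # e.g., a part-of-speech tag

text = "Time flies."
annotations = [
    Annotation(0, 4, "NOUN"),   # "Time"
    Annotation(5, 10, "VERB"),  # "flies"
]

for a in annotations:
    print(text[a.start:a.end], a.label)
```

Contrast this with inline markup like "Time/NOUN flies/VERB .", which destroys the original offsets and breaks as soon as a second tool adds its own brackets.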
Which toolkit to use? That depends on the language you use, the amount of processing power you have available and a number of other things. Some obvious choices include:
UIMA (Java of the Apache flavor)
GATE (Java of the "I understand this code" flavor)
NLTK (Python of the "look mama, no framework" flavor)
LingPipe (easy to use)
Stanford NLP (best analytics in town overall)
When you get sucked into the NLP work, just looking at the final numbers will be very tempting and will get you published, but it won't really solve your problem. For example, when working in machine translation, standard metrics consider all errors the same, while verb errors are much more harmful to the understanding of a translation. However, once you have looked at the data you can no longer use it: set aside a dev set. And if you keep trying different ML techniques / annotators / approaches on the same data set, you're overfitting it either way; it is as if you were looking at it: set aside a dev set. Also look at intermediate models as much as you can: clusters, decision trees, part-of-speech tags, parse trees, support vectors (ha! that last one was a joke).
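Setting aside a dev set takes about five lines, so there is no excuse. A minimal sketch (the 80/10/10 proportions are a common convention, not a rule):

```python
import random

# Minimal train/dev/test split. Tune on dev; touch test only once,
# at the very end.
def split(examples, dev_frac=0.1, test_frac=0.1, seed=42):
    examples = list(examples)
    random.Random(seed).shuffle(examples)  # fixed seed => reproducible split
    n_dev = int(len(examples) * dev_frac)
    n_test = int(len(examples) * test_frac)
    dev = examples[:n_dev]
    test = examples[n_dev:n_dev + n_test]
    train = examples[n_dev + n_test:]
    return train, dev, test

train, dev, test = split(range(100))
print(len(train), len(dev), len(test))  # 80 10 10
```

The fixed seed matters as much as the split itself: it is what lets you (and your future co-authors) reproduce exactly which examples you looked at.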
Don't be afraid of annotating data yourself. It is tedious work, yes, but it goes fast and you learn a lot. Once you get good enough at it, write annotation protocols and test them on family members and random strangers. You might get consistent enough results to convince your supervisor to hire further annotators to finish the job.
To conclude, some less crucial tips:
Take some time to learn a tiny little bit of linguistics. For this, I recommend
"Linguistics - An Introduction to Language and Communication" by Adrian Akmajian, Richard A. Demers and Robert M. Harnish (MIT press).
Be methodical. Keep good track of metadata: what you did and when. You'll appreciate it when success arrives and the time comes to write a paper or reproduce the results.
Learn some of the technical lingo of NLP practitioners. From the papers you read, google some terms up. For example, words that appear only once in a corpus are called 'hapax' (short for hapax legomena).
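As it happens, finding the hapaxes in a corpus is a one-liner worth knowing:

```python
from collections import Counter

# Hapax legomena: words that occur exactly once in a corpus.
corpus = "the cat sat on the mat and the dog sat too".split()
counts = Counter(corpus)
hapaxes = sorted(w for w, c in counts.items() if c == 1)
print(hapaxes)  # ['and', 'cat', 'dog', 'mat', 'on', 'too']
```

In real corpora hapaxes typically make up a large fraction of the vocabulary (a consequence of Zipf's law), which is one reason sparse data haunts every NLP model.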
Peruse the ACL Anthology -- "A Digital Archive of Research Papers in Computational Linguistics" -- 36,000 papers!
Think about the data you're using before spending significant effort on it. If you use proprietary data, that means you can't share it, which in turn means less research impact (I used E! Online data for my dissertation and I had these issues).
Use the source, Luke. If you use Free Software and don't submit patches, one day you'll wake up and all your friends will be Apple fanboys programming $1 iPad apps in Objective-C.