Tags: UIMA, Python
I have been using a lot of Python lately at work for a customer. Programming in Python has many upsides, but when it comes to processing large amounts of text, I still choose Apache UIMA.
The original version of this post was at the (now defunct) Hack the Job Market blog from MatchFWD, an (also now defunct) local startup. I reproduce it here.
Python is great not only for its flexibility but also for its ability to interface with other languages.
In a new project, which we will be deploying in the upcoming weeks, we are searching job postings using (among other things) a large dictionary of job titles.
This is the task for the Apache UIMA Concept Mapper Annotator, which is part of the Apache UIMA Addons. The dictionary itself is an XML file that looks like this (see SVN for the full sample):
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="United States" DOCNO="276145">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="the United States of America"/>
  </token>
</synonym>
In our case it looks something like this:
<token canonical="Community Managers" french="Gestionnaire de communauté">
  <variant base="Writer & Community Manager" source="1"/>
  <variant base="Marketing and Community Manager" source="1"/>
  <variant base="Community Manager" source="2"/>
  <variant base="Blog Editor" source="1"/>
  <variant base="Community manager social media" source="2"/>
</token>
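For illustration, here is a minimal sketch of how such a dictionary file could be generated with Python's standard library; the input data structure and the output file name are made up for this example:

import xml.etree.ElementTree as ET

# hypothetical input: canonical title -> (french form, [(alias, source), ...])
titles = {
    "Community Managers": ("Gestionnaire de communauté",
                           [("Community Manager", "2"),
                            ("Blog Editor", "1")]),
}

root = ET.Element("synonym")
for canonical, (french, aliases) in titles.items():
    token = ET.SubElement(root, "token", canonical=canonical, french=french)
    for alias, source in aliases:
        ET.SubElement(token, "variant", base=alias, source=source)

# file name is a placeholder
ET.ElementTree(root).write("dictionary.xml", encoding="UTF-8",
                           xml_declaration=True)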
The power of Concept Mapper lies in the fact that it structures all the entries in RAM and matches against large dictionaries in linear time. Moreover, you can add any extra information in the XML and refer back to it from within UIMA by using custom types. In our case, we have job titles, aliases and extra information (such as language, source information, etc.).
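This is not ConceptMapper's actual implementation, but a toy sketch of the underlying idea: load the dictionary once into a token trie, then scan the text left to right, so matching time grows roughly linearly with the text length (for short phrases):

# toy sketch only, not ConceptMapper's real data structure
END = object()  # marks a complete dictionary entry in the trie

def build_trie(entries):
    trie = {}
    for phrase, payload in entries:
        node = trie
        for tok in phrase.lower().split():
            node = node.setdefault(tok, {})
        node[END] = payload
    return trie

def find_matches(trie, tokens):
    # try to start a match at every token position
    for i in range(len(tokens)):
        node, j = trie, i
        while j < len(tokens) and tokens[j].lower() in node:
            node = node[tokens[j].lower()]
            j += 1
            if END in node:
                yield i, j, node[END]  # token span and payload

entries = [("Blog Editor", {"canonical": "Community Managers", "source": "1"})]
trie = build_trie(entries)
print(list(find_matches(trie, "We need a Blog Editor today".split())))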
Setting up a Java pipeline to use the UIMA Concept Mapper Annotator requires fiddling with three XML descriptors (unless you are using uimaFIT for descriptorless UIMA, but I haven't tried it yet with Concept Mapper) and a type descriptor.
The descriptor files depend on the changes to the type descriptor, so let's start there. To access extra fields in the dictionary XML file, you need to change DictTerm.xml and define a UIMA feature for each piece of information:
<featureDescription>
  <name>DictCanon</name>
  <description>canonical form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
  <name>DictFrench</name>
  <description>French form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
<featureDescription>
  <name>DictSource</name>
  <description>Source for the alias</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
The three descriptor files are OffsetTokenizer.xml, ConceptMapperOffsetTokenizer.xml and OffsetTokenizerMatcher.xml. First, the TokenizerDescriptorPath parameter needs to point to the tokenizer descriptor:
<nameValuePair>
  <name>TokenizerDescriptorPath</name>
  <value>
    <string>/path/to/OffsetTokenizer.xml</string>
  </value>
</nameValuePair>
Concept Mapper needs to know how to map the attributes in the dictionary XML to the features of the dictionary term. This is accomplished by two parameters, AttributeList and FeatureList, both arrays of strings that must be of the same length (this is quite ugly; does somebody feel like submitting a patch for it? *grin*). A quick sanity check for this follows the snippets below:
<nameValuePair>
  <name>AttributeList</name>
  <value>
    <array>
      <string>canonical</string>
      <string>french</string>
      <string>source</string>
    </array>
  </value>
</nameValuePair>
<nameValuePair>
  <name>FeatureList</name>
  <value>
    <array>
      <string>DictCanon</string>
      <string>DictFrench</string>
      <string>DictSource</string>
    </array>
  </value>
</nameValuePair>
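Since nothing in the descriptor enforces that same-length constraint, here is a small sanity check one could run over the descriptor; the element names follow the UIMA descriptor format above, but the file path is a placeholder:

import xml.etree.ElementTree as ET

def local(tag):
    # strip the XML namespace, if any, from a tag name
    return tag.rsplit("}", 1)[-1]

def param_strings(root, name):
    # find the <nameValuePair> whose <name> matches and return its <string> values
    for pair in root.iter():
        if local(pair.tag) != "nameValuePair":
            continue
        pair_name = next((c.text for c in pair.iter() if local(c.tag) == "name"), None)
        if pair_name == name:
            return [e.text for e in pair.iter() if local(e.tag) == "string"]
    return []

# descriptor path is a placeholder
root = ET.parse("/path/to/ConceptMapperOffsetTokenizer.xml").getroot()
attrs = param_strings(root, "AttributeList")
feats = param_strings(root, "FeatureList")
assert len(attrs) == len(feats), "AttributeList and FeatureList lengths differ"
print(dict(zip(attrs, feats)))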
Finally, at the end of ConceptMapperOffsetTokenizer.xml you need to
point to your dictionary:
<fileResourceSpecifier>
  <fileUrl>file:/path/to/your/dictionary.xml</fileUrl>
</fileResourceSpecifier>
With the descriptors in place, a thin Java wrapper sets up the analysis engine and runs it over a piece of text:

import java.io.IOException;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.XMLInputSource;

public class Analyzer {
    private AnalysisEngine ae;
    private CAS aCAS;
    private TypeSystem ts;
    private Type termType;
    private Feature sourceFeat;

    public Analyzer() throws IOException, InvalidXMLException,
            ResourceInitializationException {
        XMLInputSource in = new XMLInputSource("/path/to/OffsetTokenizerMatcher.xml");
        ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        ae = UIMAFramework.produceAnalysisEngine(specifier);
        aCAS = ae.newCAS();
        ts = aCAS.getTypeSystem();
        termType = ts.getType("org.apache.uima.conceptMapper.DictTerm");
        sourceFeat = termType.getFeature("DictSource");
    }

    public String[] analyze(String s) throws AnalysisEngineProcessException {
        aCAS.reset();            // reuse the Common Analysis Structure object
        aCAS.setDocumentText(s); // set the text to analyze
        ae.process(aCAS);        // run Concept Mapper
        FSIterator<AnnotationFS> it = aCAS.getAnnotationIndex(termType).iterator(); // job titles?
        if (!it.hasNext()) {
            return new String[0]; // no job titles, nothing to see here
        } else {
            // return the first one, with offsets and source
            AnnotationFS ann = it.next();
            return new String[]{ String.valueOf(ann.getBegin()),
                                 String.valueOf(ann.getEnd()),
                                 ann.getFeatureValueAsString(sourceFeat) };
        }
    }
}
Now, calling this from Python is very straightforward. I am using an embedded JVM through JPype:
import jpype

if not jpype.isJVMStarted():
    _jvmArgs = ["-Djava.class.path=/path/to/uima/jars/uimaj-bootstrap.jar:"
                "/path/to/uima/jars/uimaj-core.jar:"
                "/path/to/uima/jars/uimaj-document-annotation.jar:"
                "/path/to/uima/jars/uima-ConceptMapper.jar:"
                "/path/to/your/code/bin"]
    jpype.startJVM("/path/to/your/jvm/libjvm.so", *_jvmArgs)

Analyzer = jpype.JClass("your.packages.Analyzer")
analyzer = Analyzer()
analysis = analyzer.analyze("Sample text talking about what are the tasks for a Blog Editor")
if len(analysis) > 0:
    print 'start', analysis[0]
    print 'end', analysis[1]
    print 'source', analysis[2]
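Since the analyzer returns a bare String[], a small helper can make the result a bit friendlier on the Python side; the function name is mine, not part of the pipeline:

# hypothetical convenience wrapper around Analyzer.analyze()
def as_match(analysis):
    # convert the positional String[] result into a dict;
    # returns None when no job title was found
    if len(analysis) == 0:
        return None
    return {"start": int(str(analysis[0])),
            "end": int(str(analysis[1])),
            "source": str(analysis[2])}

print(as_match(analyzer.analyze("We are looking for a Blog Editor")))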
Our internal tests show that processing 200+ megabytes of text against a dictionary with hundreds of thousands of entries takes a little more than 4 minutes.
While the integration works, it is still far from nice or "pythonic". I hope to gather enough insights from this to contribute to an NLTK ticket that has been outstanding since 2010.