Using Apache UIMA Concept Mapper Annotator with Python via JPype

Wed, 28 Mar 2012 17:44:15 -0400

Tags: UIMA, Python

I have been using a lot of Python lately in work for a customer. Programming in Python has many positives, but when it comes to processing large amounts of text, I still choose Apache UIMA.

The original version of this post was at the (now defunct) Hack the Job Market blog from MatchFWD, an (also now defunct) local startup. I reproduce it here.

Post originally at the Hack the Job Market blog

Python is great not only for its flexibility but also for its ability to interface with other languages.

In a new project we will be deploying in the upcoming weeks, we are searching job postings using (among other things) a large dictionary of job titles.

This is a perfect task for the Apache UIMA Concept Mapper Annotator, which is part of the Apache UIMA Addons. The dictionary itself is an XML file that looks like this (see SVN for the full sample):
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="United States" DOCNO="276145">
    <variant base="United States"/>
    <variant base="United States of America"/>
    <variant base="the United States of America"/>
  </token>
</synonym>
In our case it looks something like this:
<token canonical="Community Managers" french="Gestionnaire de communauté">
  <variant base="Writer &amp; Community Manager" source="1"/>
  <variant base="Marketing and Community Manager" source="1"/>
  <variant base="Community Manager" source="2"/>
  <variant base="Blog Editor" source="1"/>
  <variant base="Community manager social media" source="2"/>
</token>
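These token elements need to sit inside the same root element as the stock sample; here is a minimal sketch of a complete dictionary file, assuming the <synonym> root shown earlier:
<?xml version="1.0" encoding="UTF-8" ?>
<synonym>
  <token canonical="Community Managers" french="Gestionnaire de communauté">
    <variant base="Community Manager" source="2"/>
    <variant base="Blog Editor" source="1"/>
  </token>
  <!-- ...one token element per canonical job title... -->
</synonym>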
The power of Concept Mapper lies in the fact that it structures all the entries in RAM and matches against large dictionaries in linear time. Moreover, you can add any extra information to the dictionary XML and refer back to it from within UIMA by using custom types. In our case, we have job titles, aliases and extra information (such as language, source information, etc.).

Setting up a Java pipeline to use the UIMA Concept Mapper Annotator requires fiddling with three XML descriptors and a type descriptor (unless you are using uimaFIT, for descriptorless UIMA, but I haven't tried it with Concept Mapper yet). The descriptor files depend on the changes to the type descriptor, so let's start there. To access extra fields in the dictionary XML file, you need to change DictTerm.xml and define a UIMA feature for each piece of information:
<featureDescription>
  <name>DictCanon</name>
  <description>canonical form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>

<featureDescription>
  <name>DictFrench</name>
  <description>French form</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>

<featureDescription>
  <name>DictSource</name>
  <description>Source for the alias</description>
  <rangeTypeName>uima.cas.String</rangeTypeName>
</featureDescription>
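For context, these featureDescription elements go inside the features list of the DictTerm type in DictTerm.xml, roughly as sketched below. The description text is mine, and the supertype is my assumption based on DictTerm being a regular annotation with offsets; check your copy of the file:
<typeDescription>
  <name>org.apache.uima.conceptMapper.DictTerm</name>
  <description>A dictionary term, with our extra attributes</description>
  <supertypeName>uima.tcas.Annotation</supertypeName>
  <features>
    <!-- the stock DictTerm features, plus the three
         featureDescription elements defined above -->
  </features>
</typeDescription>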
The three descriptor files are:
  • OffsetTokenizerMatcher.xml - an aggregate Analysis Engine that in turn calls the tokenizer (OffsetTokenizer) and the dictionary matcher (ConceptMapper). In a default install you might need to change the paths inside this descriptor so that they point to the other two descriptors.
  • OffsetTokenizer.xml - the tokenizer. The default one worked well for us, but it is nice to see that Concept Mapper easily supports custom tokenizers.
  • ConceptMapperOffsetTokenizer.xml - this is the main descriptor, and it needs to be changed in three places. First, Concept Mapper uses a tokenizer to parse the strings in the dictionary at load time, so it needs the location of that tokenizer (both tokenizer configurations need to be in sync, otherwise matching won't work). This is the TokenizerDescriptorPath parameter; make sure it points to OffsetTokenizer.xml:
    <nameValuePair>
      <name>TokenizerDescriptorPath</name>
      <value>
        <string>
          /path/to/OffsetTokenizer.xml
        </string>
      </value>
    </nameValuePair>
    
    Second, Concept Mapper needs to know how to map the attributes in the dictionary XML to the features of the dictionary term. This is accomplished by two parameters, AttributeList and FeatureList, both arrays of strings that must be of the same length (this is quite ugly; anybody feel like submitting a patch for it? *grin*):
    <nameValuePair>
      <name>AttributeList</name>
      <value>
        <array>
          <string>canonical</string>
          <string>french</string>
          <string>source</string>
        </array>
      </value>
    </nameValuePair>
    <nameValuePair>
      <name>FeatureList</name>
      <value>
        <array>
          <string>DictCanon</string>
          <string>DictFrench</string>
          <string>DictSource</string>
        </array>
      </value>
    </nameValuePair>
    
    Finally, at the end of ConceptMapperOffsetTokenizer.xml you need to point to your dictionary (a sketch of the surrounding external resource section follows the snippet):
    <fileResourceSpecifier>
        <fileUrl>file:/path/to/your/dictionary.xml</fileUrl>
    </fileResourceSpecifier>
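    For orientation, that fileUrl sits inside the descriptor's resource manager configuration, roughly as follows. The resource name and binding key here are assumptions on my part, so keep whatever your copy of ConceptMapperOffsetTokenizer.xml already uses:
    <resourceManagerConfiguration>
      <externalResources>
        <externalResource>
          <!-- assumed resource name; keep the stock descriptor's value -->
          <name>DictionaryFileName</name>
          <fileResourceSpecifier>
            <fileUrl>file:/path/to/your/dictionary.xml</fileUrl>
          </fileResourceSpecifier>
        </externalResource>
      </externalResources>
      <externalResourceBindings>
        <externalResourceBinding>
          <!-- assumed binding key; keep the stock descriptor's value -->
          <key>DictionaryFile</key>
          <resourceName>DictionaryFileName</resourceName>
        </externalResourceBinding>
      </externalResourceBindings>
    </resourceManagerConfiguration>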
    
That's it for the descriptor files. UIMA comes with some tools to process collections of documents using annotators, but in my case I want to process text from Python, so I wrap the Analysis Engine in a Java method I can call from Python using JPype, something along these lines:
import java.io.IOException;

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.cas.CAS;
import org.apache.uima.cas.FSIterator;
import org.apache.uima.cas.Feature;
import org.apache.uima.cas.Type;
import org.apache.uima.cas.TypeSystem;
import org.apache.uima.cas.text.AnnotationFS;
import org.apache.uima.resource.ResourceInitializationException;
import org.apache.uima.resource.ResourceSpecifier;
import org.apache.uima.util.InvalidXMLException;
import org.apache.uima.util.XMLInputSource;

public class Analyzer {

    private AnalysisEngine ae;
    private CAS aCAS;

    private TypeSystem ts;
    private Type termType;
    private Feature sourceFeat;

    public Analyzer() throws IOException, InvalidXMLException,
            ResourceInitializationException {
        XMLInputSource in = new XMLInputSource("/path/to/OffsetTokenizerMatcher.xml");
        ResourceSpecifier specifier = UIMAFramework.getXMLParser().parseResourceSpecifier(in);
        ae = UIMAFramework.produceAnalysisEngine(specifier);
        aCAS = ae.newCAS();
        ts = aCAS.getTypeSystem();
        termType = ts.getType("org.apache.uima.conceptMapper.DictTerm");
        sourceFeat = termType.getFeature("DictSource");
    }

    public String[] analyze(String s) throws AnalysisEngineProcessException {
        aCAS.reset(); // reuse the Common Analysis Structure object
        aCAS.setDocumentText(s); // set the text to analyze

        ae.process(aCAS); // run Concept Mapper

        FSIterator<AnnotationFS> it = aCAS.getAnnotationIndex(termType).iterator(); // job titles?
        if (!it.hasNext()) {
            return new String[0]; // no job titles, nothing to see here
        } else {
            // return the first one, with offsets and source
            AnnotationFS ann = it.next();
            return new String[]{ String.valueOf(ann.getBegin()),
                                 String.valueOf(ann.getEnd()),
                                 ann.getFeatureValueAsString(sourceFeat) };
        }
    }
}
Now, calling this from Python is very straightforward. I am using an embedded JVM through JPype:
import jpype

if not jpype.isJVMStarted():
    _jvmArgs = ["-Djava.class.path=/path/to/uima/jars/uimaj-bootstrap.jar:/path/to/uima/jars/uimaj-core.jar:/path/to/uima/jars/uimaj-document-annotation.jar:/path/to/uima/jars/uima-ConceptMapper.jar:/path/to/your/code/bin"]
    jpype.startJVM("/path/to/your/jvm/libjvm.so", *_jvmArgs)

# load the wrapper class and instantiate it
Analyzer = jpype.JClass("your.packages.Analyzer")

analyzer = Analyzer()

analysis = analyzer.analyze("Sample text talking about what are the tasks for a Blog Editor")
if len(analysis) > 0:
    print 'start', analysis[0]
    print 'end', analysis[1]
    print 'source', analysis[2]
Our internal tests show that processing 200+ megabytes of text against a dictionary with hundreds of thousands of entries takes a little more than 4 minutes. While the integration works, it is still far from nice or "pythonic". I hope to gather enough insight from this to contribute to an NLTK ticket that has been outstanding since 2010.
