As I'm married to a current IBM employee, I'm disqualified from participating in the AI XPRIZE sponsored by IBM. So I'm putting my ideas in this blog post, in the hope they might inspire other people.
The XPRIZE follows the path of other multi-year challenges that have resulted in great accomplishments, such as commercial rockets. The AI challenge diverges from previous ones by being completely open-ended: any major AI breakthrough can win the USD 3M prize.
What I'd like to see is a team tackling improvements in scientific communication by leveraging recent advances in machine reading and taking them to the next level. I would like to see work on scientific metadata (possibly in the direction of the Semantic Web) that captures the main discoveries in a scientific paper. It should be feasible for a human to produce this metadata; the machine reading aspect is there just to bring enough value to the metadata during the transition to entice humans to self-annotate.
The case for this improvement lies in the number of researchers in many key fields who simply have no time to keep up to date with published results. A high-level summary, or the ability to query "has anybody applied method X to problem Y?", would be invaluable. Moreover, this type of setting allows for very constrained inference, simplifying scientific discovery of sometimes obvious, sometimes overlooked new findings.
I'm no stranger to this approach. My most-cited paper grew out of my contribution to a multidisciplinary project on automatic extraction and inference in the genomics domain (some form of automated inference was realized many years after I left the project).
This is further simplified by reporting standards in many scientific disciplines. Take, for example, the one from the American Psychological Association (thanks to Emily Sheepy for pointing me to that report). Such standards specify the types of contributions and the information expected for each, even down to the headers of each section.
I believe all these pieces together have a chance of, if not winning, at least doing well in the competition. And irrespective of the competition, this technology deserves to exist and to help accelerate human discovery.
Regarding business aspects, it would be nice if the metadata format were open and the commercialization centered on extracting metadata from existing publications and on authoring tools. Extracting metadata and doing inference for profit is somewhat contrary to the goal of accelerating research, but that's speaking as a scientist, not a business person.
Let me summarize the concept with an example:
Given an existing paper, for example "Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network" (Toutanova et al., 2003), produce metadata of this type:
Part-of-Speech tagging (as a link to an ontology)
Conditional Markov Model (or CMM with features x,y,z; all linked to an ontology)
97.24% accuracy over Penn Treebank WSJ (the metric and the corpus are also links)
These entries can be further populated by the scientists upon publication, maybe with the help of an authoring tool.
From this metadata, a system can answer "what is the best performance for POS tagging, and what technique does it use?" but also "POS tagging and role labeling are similar problems (a fictional fact): both use similar techniques and both rank their performance similarly; however, the best performance in POS tagging uses skip decision lists (also a fictional fact), a technique that has never been attempted on role labeling."
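To make the idea concrete, here is a minimal sketch of how such ontology-linked metadata and a "best performance for task X" query might look. The record structure, the `onto:` identifiers, and the helper function are all hypothetical illustrations, not a proposed standard:

```python
# Hypothetical sketch: paper metadata as ontology-linked records,
# queried for the best reported performance on a task.
from dataclasses import dataclass, field

@dataclass
class Result:
    task: str      # ontology link for the task, e.g. "onto:POS-tagging"
    method: str    # ontology link for the technique
    metric: str    # ontology link for the evaluation metric
    corpus: str    # ontology link for the dataset
    score: float   # reported value of the metric

@dataclass
class PaperMetadata:
    title: str
    results: list = field(default_factory=list)

# The example entry from the text, encoded in this sketch's format.
papers = [
    PaperMetadata(
        title="Feature-Rich Part-of-Speech Tagging with a "
              "Cyclic Dependency Network",
        results=[Result(task="onto:POS-tagging",
                        method="onto:CMM",
                        metric="onto:accuracy",
                        corpus="onto:PTB-WSJ",
                        score=97.24)],
    ),
]

def best_for_task(papers, task):
    """Return (title, method, score) of the best result for a task."""
    hits = [(p.title, r.method, r.score)
            for p in papers
            for r in p.results
            if r.task == task]
    return max(hits, key=lambda hit: hit[2]) if hits else None

print(best_for_task(papers, "onto:POS-tagging"))
```

Because every field is a link into a shared ontology, the same records could also feed the cross-task comparisons described above, e.g. finding techniques tried on one task but never on a similar one.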
I wish the participants the best of luck and look forward to seeing the great technology developed as a result of the challenge!