Because The Air Is Free -- Pablo Duboue's Blog


Turning Your Class Project Into a Published Academic Paper

Thu, 18 Feb 2016 08:09:39 -0500

Tags: academic, research


In recent years, teaching has been a great source of satisfaction to me. A common situation when teaching is for a student to want to turn a class project into a full-fledged research publication. I have decided to put my ideas on the topic in this blog post to help students in that situation. This is also relevant to me as this year marks the 20th anniversary of my first international publication, which started as a class project.

First and foremost, for the students reading this, congratulations on successfully completing an outstanding class project! Even if the path to publication is arduous and can be discouraging, the fact that you are entertaining the idea of publishing your work to a larger audience is a success in itself. Now, there are as many reasons to publish as there are class projects. Some are better than others; let's look at them in turn.

Quick Prototyping Use Cases with Widgets in Jupyter Notebook

Sat, 02 Jan 2016 23:07:32 -0500

Tags: floss, python, programming


One of the key tenets of open source development is to scratch your own itch, that is, to build something of use and value to its authors (compare that with commercial development --building something for a customer-- or research --new solutions to challenging problems--). However, for a project to survive and to attract that elusive second contributor, it is important to make the project useful to others. Which brings us to the problem of having some sort of user interface. Being a back-end person, I usually struggle to create them. Some of my projects remain in a state of "working back-end good enough for me". This is of course useful enough for me but a far cry from useful to anybody else. I have recently found some relatively new technology and workflow that might enable me to get out of this local equilibrium: widgets for Jupyter Notebook.

Building a Speech Aligner in Java with Sphinx4 and VoxForge models

Wed, 16 Dec 2015 01:38:27 -0500

Tags: floss, speech, java, art


When I started fiddling with speech technologies back in 2010, I was interested in doing real-time part-of-speech tagging of radio broadcasts. That exceeded my technical knowledge on the subject but I managed to learn enough Sphinx4 to assemble an end-to-end speech alignment system that takes an audio book and aligns it to its source material. Such a system is almost straight out of the Sphinx4 examples but it took a bit of time configuring it and finding the right models to make it work. I used the aligner to make a funky installation in the Foulab booth for the Ottawa Mini-Maker Faire 2010. Over the years I have received a few requests for the system and I released it as open source this month. This blog post describes a little bit more about it, how I used it and how other people can use it for their own projects.

24 Pull Requests: What, Why and How

Sun, 06 Dec 2015 22:04:25 -0500

Tags: Open Source, Challenge, Git, GitHub


Since 2012 I have been participating in an inspiring open source challenge called "24 pull requests". This post is an extended version of a lightning talk I gave at the Observe, Hack, Make hackercamp in The Netherlands in 2013.

Overfitting Machine Learning Experiments: When Cross-validation is No Silver Bullet

Mon, 30 Nov 2015 13:07:43 -0500

Tags: Machine Learning


(Update 2022: this topic is discussed at length in my book's first chapter, which can be previewed on Google Books.)

A few years ago, I attended a very good talk about identifying influencers in social media based on textual features. To evaluate the results, the researchers employed cross-validation, a very popular technique in machine learning where the train set is split into n parts (called folds). The machine learning system is then trained and evaluated n times, each time trained on all the training data minus one fold and then evaluated on the remaining fold. In that way it is possible to have evaluation results for an evaluation set of the same size as the train set without committing the "mortal sin" of evaluating on the training data. The technique is very useful and widely employed. However, that doesn't stop you from overfitting at the methodological level, meaning that if you repeat multiple experiments over the same data you will get enough insights into it to "overfit" it. This methodological problem is quite common, so I decided to write it down. It is also not very easy to spot due to the Warm Fuzzy Feeling (TM) that comes with using cross-validation. That is, many times we as practitioners feel that by using cross-validation we buy some magical insurance policy against overfitting.
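
The fold mechanics described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the names `k_fold_splits` and `cross_validate` are hypothetical, the data is kept in its original order (real setups usually shuffle first), and `train_fn` and `predict_fn` stand in for whatever learning system is being evaluated.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_splits(n_items: int, n_folds: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, eval_indices) pairs, one pair per fold."""
    if not 2 <= n_folds <= n_items:
        raise ValueError("need 2 <= n_folds <= n_items")
    indices = list(range(n_items))
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    base, extra = divmod(n_items, n_folds)
    start = 0
    for fold in range(n_folds):
        size = base + (1 if fold < extra else 0)
        held_out = indices[start:start + size]
        # Train on everything except the held-out fold.
        train = indices[:start] + indices[start + size:]
        yield train, held_out
        start += size


def cross_validate(
    data: Sequence,
    labels: Sequence,
    n_folds: int,
    train_fn: Callable,    # hypothetical: (train_data, train_labels) -> model
    predict_fn: Callable,  # hypothetical: (model, item) -> predicted label
) -> float:
    """Return accuracy over the whole set; each item is scored exactly once,
    by a model that never saw it during training."""
    correct = 0
    for train_idx, eval_idx in k_fold_splits(len(data), n_folds):
        model = train_fn([data[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in eval_idx:
            correct += predict_fn(model, data[i]) == labels[i]
    return correct / len(data)
```

Note what the sketch does not protect against: if `cross_validate` is called again and again on the same data while the features or parameters are tweaked between calls, the choices made by the experimenter end up tuned to that data, which is the methodological overfitting discussed here.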

NLP survival tips for non-NLP Graduate Students

Wed, 25 Nov 2015 07:30:31 -0500

Tags: Natural Language Processing, Computational Linguistics, Graduate School


A few years back my wife kindly hosted me at the McGill CS Graduate Student Seminar Series. It was a well attended, candid talk about how to succeed at incorporating natural language processing (NLP) techniques within graduate research projects outside NLP itself.

Given the nature of the talk, I did not find sharing the slides as I do for my other talks to be that useful. Instead I'm putting that content into this blog post.

Turn your class audio and PDF into YouTube videos using Free Software Tools

Fri, 20 Nov 2015 08:31:15 -0500

Tags: PDF, MEncoder, YouTube


Last year I taught a graduate level, semester length class in Machine Learning over Large Datasets at Facultad de Matemática, Astronomía y Física de la Universidad Nacional de Córdoba, in Argentina.

I have made the slides and audio recordings of the classes available on-line at the class' site (in Spanish, sorry) under a CC-BY-SA license. I have recently been dovetailing the audio and PDF into videos which I'm uploading to a playlist on YouTube. In this blog post I want to describe the tools I used to record and create the final videos.

Mining QA Pairs from IRC Logs Using Simple Heuristics and a Chat Disentangler

Mon, 22 Apr 2013 03:56:46 -0400

Tags: IRC, QA


If you haven't heard of TikiWiki, it is a Wiki/CMS that also follows the "Wiki way" in its development process: the development happens in SourceForge SVN and everybody is invited to commit code to the project. Even though that sounds brutal, it actually produces a very different development dynamic and a feature-rich product.

A number of key people in the Tiki world live around Montreal and I have met some of them and been intrigued by the project for a while. It turns out their annual meeting ("TikiFest") was in Montreal/Ottawa this year so I got to attend for a few days and work on an interesting project: Mining Question/Answer pairs from #tikiwiki history. While the topic of mining QA-pairs has received a lot of attention in NLP and related areas, this is a real attempt at making this type of technology available to regular users. (You can see a demo on my local server.)

The process involves (see the page on for plenty of details):

  • Downloading the logs
  • Normalizing the IRC logs across different IRC logging clients
  • Identifying users that are likely to have asked questions
  • Identifying the threads where said users participated
  • Assembling the final corpus
  • Indexing
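
To give a flavour of the normalization step in the list above, here is a minimal sketch that maps lines written by two different logging clients onto one common tuple. The two line layouts and the `normalize_line` name are assumptions chosen for illustration; they are not necessarily the formats found in the actual #tikiwiki logs.

```python
import re
from typing import Optional, Tuple

# Two illustrative line layouts (assumed, not taken from the real logs):
#   [12:34] <nick> message
#   12:34:56 nick: message
_PATTERNS = [
    re.compile(r"^\[(?P<time>\d{2}:\d{2})\] <(?P<nick>[^>]+)> (?P<text>.*)$"),
    re.compile(r"^(?P<time>\d{2}:\d{2}):\d{2} (?P<nick>[^\s:]+): (?P<text>.*)$"),
]


def normalize_line(line: str) -> Optional[Tuple[str, str, str]]:
    """Return (time, nick, text) for a chat message, or None for anything
    else (joins, parts, topic changes, unknown layouts)."""
    line = line.rstrip("\n")
    for pattern in _PATTERNS:
        match = pattern.match(line)
        if match:
            return (match.group("time"),
                    match.group("nick").strip(),
                    match.group("text"))
    return None
```

Supporting another logging client then amounts to appending one more pattern with the same three named groups.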

To avoid having to annotate training data for the question identification, I'm using the approximation of finding IRC nicks that have only said 2 to 10 things in the whole logging history. The expectation is that said users appear on #tikiwiki, ask a question, receive an answer and leave.
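
The 2-to-10 heuristic is simple enough to sketch directly. The snippet below assumes the log has already been reduced to `(nick, text)` pairs; that input shape and the `likely_question_askers` name are hypothetical, picked only for this illustration.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def likely_question_askers(
    messages: Iterable[Tuple[str, str]],  # assumed shape: (nick, text) pairs
    low: int = 2,
    high: int = 10,
) -> Set[str]:
    """Return the nicks that spoke between `low` and `high` times (inclusive)
    over the whole history: a cheap proxy for drive-by question askers."""
    counts = Counter(nick for nick, _text in messages)
    return {nick for nick, n in counts.items() if low <= n <= high}
```

A nick with a single line is left out (too little to form a question and a follow-up), and so are the regulars with long histories, who are more likely to be the ones answering.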

For the extraction step, I'm using a publicly available implementation of Disentangling chat (2010) by Elsner and Charniak.

If this approach works, I can think of packaging it for use on other QA-oriented IRC channels (like #debian). If this interests you, leave me a comment or contact me.
