Tags: floss, academic
I got my first paper published in 1996, at a conference in Antofagasta, Chile (the bus trip there was gruelling, which might be worth talking about in another post). It was in modelling and simulation, joint work with Nicolas Bruno (we both ended up at Columbia University for our PhDs, but that's for yet another blog post). From there I went on to do my undergraduate thesis on Spanish parsing using LFGs in Haskell. Later on I continued working on modelling and simulation before starting my PhD. During the PhD I went through three advisers, two of them in my first year, when I worked on word sense disambiguation before moving to natural language generation in the medical domain. My final years of the PhD were dedicated to natural language generation for intelligent analysis. At IBM Research, I moved into question answering, initially in the Human Resources domain, with a detour into expert search before settling on the Watson Jeopardy! project. Each change of topic and domain involved extensive background research to get myself up to speed. After I left IBM and started doing consulting work, it got even worse, so I won't bore you with the details. How do I keep track of all that information in my head?
In 2012, I came to terms with the fact that I had to spend less time reading research and more time tracking what I read. Every time we read something, it is for a purpose. Without keeping extra metadata, it starts to feel like not having read the papers at all. It is not that I have a particularly bad memory, but after a few hundred papers, the names of the authors and the titles start escaping me. Interestingly, I remember under which circumstances I found the paper or where (as in the place, the device or the printout) I read it. Therefore, I decided to use a tool to keep track of such metadata.
After searching through the various available tools, I decided to write my own, which I have been using for many, many years. I credit this tool with letting me manage the large number of sources I had to go through when writing my book. I gave it the silly name P.A.P.E.R. (Pablo's Artifacts and Papers Environment plus Repository). This month I open sourced the tool at a lightning talk at Vancouver's Learn Data Science meetup. This post describes the tool, which is seeking beta testers, users and contributors (your help, basically).
One of the distinguishing features of P.A.P.E.R. is that it keeps its state in a YAML text file and does not have a UI per se; instead, it relies on custom widgets for Jupyter notebooks. Using the tool is thus similar to a Smalltalk session.
As mentioned before, the idea is to keep metadata about what has been read: breadcrumbs to find the paper again at a later time. I am finding that on-line search engines have become extremely unreliable for known-item search. Basically, there is a very high wall of spammers hiding your item (I will happily include Medium blogs in that wall). The bias towards fresh information also makes the search for old stuff very difficult. Sure, there are more recent papers on the topic, but you want to find an old one because it describes an algorithm you want to implement or makes a point that supports your argument, and it is a paper you have already invested the effort of reading and understanding.
From that perspective, P.A.P.E.R. pushes a trade-off of spending a little less time reading and a little more time cataloguing what we read. However, it lets us record as much or as little metadata as we feel each paper deserves.
Conceptually, P.A.P.E.R. is BibTeX plus a mind map, with an optional digital asset manager (i.e., paper storage) and search engine. The data model is a hierarchical attribute-value pair DAG, completely contained in a single YAML file. You can define your own relations, and the YAML file can be opened with any text editor, searched, carried on your phone, etc. More importantly, it can be checked into an SCM (e.g., git) without trouble.
The data model for the 500 papers I have in the tool weighs 800 KB (that's about 20% of what I have read since 2012; AI is an insane field, but that's for another blog post). The beauty of a text file is that if the tool breaks, you still have the data, and there are no complex binary upgrade issues as there are when using DBs.
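To make the "typed attribute-value pairs plus relations" idea a bit more concrete, here is a minimal in-memory sketch. It is not the actual P.A.P.E.R. schema: the `Node` class, the `FIELD_TYPES` table, the field names and the `extends` relation are all made up for illustration, and the entries are placeholders.

```python
from dataclasses import dataclass, field

# Hypothetical field types, for illustration only (not the tool's real schema).
FIELD_TYPES = {"title": str, "year": int, "found_via": str}


@dataclass
class Node:
    """One entry: attribute-value pairs plus named relations to other entries."""
    key: str
    attrs: dict = field(default_factory=dict)      # attribute name -> value
    relations: dict = field(default_factory=dict)  # relation name -> list of keys


def type_errors(node):
    """Return the attribute names whose value does not match the declared type."""
    return [name for name, value in node.attrs.items()
            if name in FIELD_TYPES and not isinstance(value, FIELD_TYPES[name])]


def has_cycle(nodes):
    """Depth-first search over all relations; True if the graph is not a DAG."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {key: WHITE for key in nodes}

    def visit(key):
        colour[key] = GREY
        for targets in nodes[key].relations.values():
            for target in targets:
                if target not in nodes:
                    continue  # dangling reference, ignored in this sketch
                if colour[target] == GREY:
                    return True  # back edge: we returned to a node in progress
                if colour[target] == WHITE and visit(target):
                    return True
        colour[key] = BLACK
        return False

    return any(colour[key] == WHITE and visit(key) for key in nodes)


# Placeholder entries, not real papers.
nodes = {
    "paper-a": Node("paper-a", {"title": "Placeholder A", "year": 2012},
                    {"extends": ["paper-b"]}),
    "paper-b": Node("paper-b", {"title": "Placeholder B", "year": "2010"}),
}
print(type_errors(nodes["paper-b"]))  # the string "2010" fails the int check
print(has_cycle(nodes))
```

A structure like this maps naturally onto nested mappings and lists, which is what makes a plain YAML file a comfortable home for it.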
The digital asset manager is a paper repository where papers (or other artifacts) can be “absorbed” into the tool, into a folder-balanced, hash-based file layout. That solves the problem of having downloaded the same paper multiple times with different filenames (it happens!). It gives each paper a canonical name and place on your hard drive. (But you can use the DAG without the repo.)
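As a rough illustration of how a hash-based, folder-balanced store can deduplicate downloads, here is a small sketch. The choice of SHA-256, the two-character folder prefix and the function names `canonical_path` and `absorb` are assumptions made for this example; the tool's real layout may differ.

```python
import hashlib
import shutil
from pathlib import Path


def canonical_path(repo_root, source, fanout=2):
    """Compute where a file would live in the repo, based only on its content."""
    digest = hashlib.sha256(Path(source).read_bytes()).hexdigest()
    suffix = Path(source).suffix.lower()
    # The first `fanout` hex characters pick the folder, which spreads files
    # roughly evenly over 16**fanout folders.
    return Path(repo_root) / digest[:fanout] / (digest[fanout:] + suffix)


def absorb(repo_root, source):
    """Copy `source` into the repo unless identical content is already there."""
    target = canonical_path(repo_root, source)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(source, target)
    return target
```

Because the path depends only on the bytes (and the extension), absorbing `paper.pdf` and `paper (1).pdf` with identical content yields the same location, so only one copy is kept.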
Exports to a static website, and populates folders with symlinks for all the papers depending on topics.
BibTeX import and generation (you can add a "citing" relation between a paper you're writing and existing entries; it then produces a BibTeX file for you).
Reading Lists support: you can prioritize which papers to read, per topic.
Arbitrary relations between papers: you can link papers using your own relations. You can even add custom extra fields (which are themselves typed; the data model is strongly typed).
Command-line UI: it can export the BibTeX, generate the static website and more. This is an area being improved upon (pull requests are welcome).
Create new paper entries from plain text files with extra annotations. This allows me to keep a small text file with notes as I'm reading papers on the phone or the tablet and then upload the PDFs and annotations to the tool easily.
Custom Jupyter Notebook Widgets (see the lightning talk for an example or the sample notebooks).
Full text search engine, using Whoosh. It supports text search, Boolean search and query-by-example. Therefore, given the text of a paper, it helps you find related papers in your repository (we used it for Ying and Duboue (2019)).
The project is quite hackable: you can take some pieces and adapt them to your needs (hook it to external repositories, etc.). It is 2000+ lines of code in 10 Python files.
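To give a feel for the BibTeX generation feature listed above, here is a minimal sketch of turning a "citing" relation into a BibTeX file. The dictionary layout and the function names `to_bibtex` and `bibliography_for` are hypothetical, not the tool's API, and the entries are placeholders rather than real references.

```python
def to_bibtex(key, entry):
    """Render one entry as a BibTeX record; `entry` maps field names to values."""
    fields = [f"  {name} = {{{value}}}"
              for name, value in entry.items() if name != "type"]
    return "@%s{%s,\n%s\n}" % (entry.get("type", "misc"), key, ",\n".join(fields))


def bibliography_for(manuscript, citing, entries):
    """Concatenate the records of everything `manuscript` cites, sorted by key."""
    cited = sorted(citing.get(manuscript, []))
    return "\n\n".join(to_bibtex(key, entries[key]) for key in cited)


# Placeholder data, not real papers.
entries = {
    "doe2001": {"type": "article", "title": "A Placeholder Title", "year": 2001},
    "roe1999": {"type": "inproceedings", "title": "Another Placeholder", "year": 1999},
}
citing = {"my-draft": ["roe1999", "doe2001"]}

print(bibliography_for("my-draft", citing, entries))
```

This sketch does no escaping of special characters in field values, which a real exporter would need to handle.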
For the release, I added some documentation and test cases. I also packaged it for PyPI, currently on the test instance. I am particularly interested in people helping test it in a Mac OS X environment:
    python3 -m pip install --index-url https://test.pypi.org/simple/ --no-deps paperapp_DrDub
    pip install paperapp_DrDub[widgets]
    pip install paperapp_DrDub[fulltext]
This tool started at Les Laboratoires Foulab (Montreal’s hackerspace), and the idea of using Jupyter for quick prototyping was presented at Montréal Python #56. Many thanks to the awesome Montreal community for their help; hopefully some people there will find the tool useful.
The killer feature I'm currently coding is synchronizing a BibTeX file with a paper pile. All the pieces are there; hopefully I'll get to it soon and move the package to the general PyPI. That'll be for another blog post, though. Cheers!
The repo is on GitHub: https://github.com/DrDub/PAPER