Tags: academic, personal
I haven't been blogging much as I was busy writing The Art of Feature Engineering, a book that came out in the summer with Cambridge University Press. Here are some thoughts about the book writing process itself, condensed and expanded from its introduction, along with some thoughts on the early criticism it has encountered.
My interest in feature engineering started while working with David Gondek and the rest of the Jeopardy! team at IBM TJ Watson Research Center in the late 2000s. The process and ideas in the book draw heavily from that experience. The error analysis sessions chaired by David Ferrucci were two grueling days of looking at problem after problem and brainstorming ideas for how to address them. It was a very stressful time; hopefully this book will help you profit from that experience without having to endure it. Even though it has been years since we worked together, the book exists thanks to their efforts, which transcend the show itself.
After leaving IBM, during my years of consulting, I have seen countless professionals abandon promising paths for lack of feature engineering tools. I wrote the book for them.
There are as many reasons to write books as there are books written. In the case of this book, it was driven by a desire both to help practitioners and to structure information scattered across a variety of formats. I am happy with the result (otherwise I wouldn't have moved forward with publication, of course). The feedback I'm getting from online sites and colleagues seems to be coalescing so far around two issues:
Many times the book misses "when to use what". I share that concern, and I have echoed published material when available. After spending two years reviewing material and going through 300 sources, the answer is still "we don't know". It is possible to restrict the problems further (assume linear models, for example), and then sweeping recommendations can be made. But that is not the case with the data and problems found in industry. In my opinion, it is better to know you don't know than to build on top of shaky intuitions.
Not all techniques are explained in detail. That's correct: for many techniques, the intuition behind the technique is given and one or more references are provided. The idea is to help a practitioner find a technique through the way the book structures the material. At about 100 pages, the theory discussion in the book is concise and (hopefully) can enrich every practitioner's toolbox.
Also note that for key methods used in the case studies, an open source implementation, written specifically for the book, is provided. I adhere to the idea that source code is a means of communication. From my perspective, this is the most in-depth explanation I can provide. But of course, each person is different and this might not work for everybody.
The good news is that feature engineering is slowly getting the attention it deserves and multiple books have come out on the topic this year. Each reader can find something that caters to their specific needs.
Now, for some thanks!
The book benefited from more than 30 reviewers of the full book and of individual chapters. In alphabetical order, I would like to extend my deepest gratitude to Sampoorna Biswas, Eric Brochu, Rupert Brooks, Gavin Brown, Steven Butler, Pablo Gabriel Celayes, Claudio Conejero, Nelson Correa, Facundo Deza, Michel Galley, Lulu Huang, Max Kanter, Rahul Khopkar, Jessica Kuo, Alice Liang, Pierre Louarn, Ives Macedo, Aanchan Mohan, Kenneth Odoh, Heri Rakotomalala, Andriy Redko, David Rowley, Ivan Savov, Jason Smith, Alex Strackhan, Adrián Tichno, Chen Xi, Annie Ying and Wlodek Zadrozny.
My students in the 2014 Machine Learning on Large Datasets course at Universidad Nacional de Córdoba were also instrumental in the creation of this book, and the students of the 2018 graduate course in Feature Engineering tested an earlier version of the material. The fantastic Data Science community in Vancouver, particularly the one centred around the paper reading LearnDS meetup, also proved very helpful with comments and suggestions, and as a source of reviewers.
Cambridge University Press helped move this book from concept to finished product. Kaitlin Leach's help made all the difference when navigating the intricacies of book publishing, together with Amy He, who helped me stay on track. The anonymous reviewers provided by the publisher helped grow the manuscript in a direction that better fits with existing instructional material.
Finally, a book always takes a deep toll on a family. This book is unusually intertwined with family, as my wife, Annie Ying, is also a data scientist. This book started during a personal sabbatical while I was accompanying her in NY on a spousal visa. She proofread every chapter and helped me keep my sanity during the dark times of writing. This book would not exist without her. It would not even have been started. Annie, muchas gracias, de corazón.