My name is Romain Guion, and I head an engineering and data science department at a London-based startup called Vortexa. I previously worked as a data scientist at a Los Angeles-based entertainment startup called Pluto TV and at an AI-driven transportation startup called Padam. Before that I spent 4 years in the healthcare industry as a consultant, project leader and scientist. I have also worked briefly as an M&A analyst in the energy industry and as a rocket scientist.
I am also a graduate of the University of Cambridge, and I hold an MSc in Math and Physics from Ecole Centrale Paris. The maths covered advanced statistical learning, discrete-time stochastic processes, optimization, signal processing, financial mathematics, as well as a range of linear algebra, analysis and topology topics.
Although Data Science is mainly about critical thinking, people like to talk about tools, so here they are:
All my professional work is confidential. My LinkedIn captures my professional journey. Until now I had never taken the time to capture my side projects, so this is a work in progress, and suggestions are welcome!
(Note: the one part of my work that is public is my patents.)
Freelance Data Scientist for Padam.io. Padam brings AI-powered on-demand bus services to medium- and low-density areas. My work focused on the UK market. My mission started very broad: how can we improve our clients' profitability and Padam's service in general? I framed generic revenue and cost questions, then drilled down to product management questions. I was then given access to 3 years of an anonymised database, and mapped growth bottlenecks and potential levers. From there I defined KPIs for customer retention and operational efficiency, and worked on optimised processes to improve them. In practice this entailed building a machine learning pipeline to predict the number of bookings as a proxy for customer satisfaction, and interpreting its main contributors as KPIs to improve. I also looked at the geographical and temporal mismatch of supply and demand to redefine bus lines. The details are obviously confidential. Tools: IPython, PostgreSQL. 2018
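As an illustration of the idea only (the real pipeline is confidential), here is a minimal sketch: predict daily bookings from operational features, then read the model's feature importances as candidate KPIs. The file and feature names are hypothetical.

```python
# Hedged sketch: predict bookings, surface the main contributors as KPIs.
# Column names are hypothetical; features assumed already numeric.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("daily_service_stats.csv")  # hypothetical extract
features = ["buses_in_service", "avg_wait_min", "avg_detour_min",
            "rejected_requests", "day_of_week", "is_holiday"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["bookings"], test_size=0.2, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_train, y_train)
print("R^2 on held-out days:", model.score(X_test, y_test))

# Importances suggest which operational levers move bookings most.
for name, imp in sorted(zip(features, model.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.2f}")
```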
Current project: a multi-user movie recommendation system. I am collaborating with a Java developer on an app that makes movie nights between friends smoother. I am covering the machine learning side: a multi-user, multi-preference recommendation system. If the algorithm works well, there may be many applications. Our current approach is a cloud database and machine learning web services backing an Android app. The recommendation engine is a variation on single-user collaborative filtering and low-rank matrix factorisation. Tools: Python. 2018
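A toy sketch of the core idea: factorise a user x movie rating matrix into low-rank factors, then aggregate predicted scores across the friends in the room (here with a "least misery" rule, one of several possible aggregations). The data is random; the real engine differs.

```python
# Low-rank matrix factorisation by gradient descent on observed entries,
# then multi-user aggregation. Toy data only.
import numpy as np

rng = np.random.default_rng(0)
R = rng.integers(1, 6, size=(20, 50)).astype(float)  # toy ratings
mask = rng.random(R.shape) < 0.3                     # 30% of entries observed
k, lam, lr = 5, 0.1, 0.01
U = rng.normal(scale=0.1, size=(R.shape[0], k))      # user factors
V = rng.normal(scale=0.1, size=(R.shape[1], k))      # movie factors

for _ in range(2000):                  # minimise squared error + L2 penalty
    E = mask * (U @ V.T - R)
    U -= lr * (E @ V + lam * U)
    V -= lr * (E.T @ U + lam * V)

scores = U @ V.T
group = [0, 3, 7]                      # the friends planning movie night
group_score = scores[group].min(axis=0)   # least misery: nobody hates it
print("suggested movie:", int(group_score.argmax()))
```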
Space-time analysis of eddies captured by ADCP devices. Work done at the Whittle Laboratory at Cambridge around 2013. The aim was to help tidal turbines work better: many performed well in the lab but broke on site. I looked at the most popular device used in the marine environment to measure currents and eddies, the Acoustic Doppler Current Profiler (ADCP). I found a significant gap in the frequencies and sizes of eddies this tool can capture, and this blind spot sits right in the range that can break tidal turbines and is site-dependent. This led to a conference presentation at Oceans14 in St John's, Canada, and a few papers. Tools: pen & paper, and Matlab. 2013
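A back-of-envelope illustration of why such a blind spot exists (not the paper's actual analysis): an ADCP's beams diverge, so the measured velocity is averaged over a spread that grows with range, smoothing out eddies smaller than that spread. The beam angle below is a typical value, assumed for illustration.

```python
# Rough geometry of the ADCP averaging footprint vs range.
import numpy as np

beam_angle = np.deg2rad(20)          # typical ADCP beam angle (assumption)
ranges = np.array([5, 10, 20, 40])   # distance from the instrument, metres
beam_spread = 2 * ranges * np.tan(beam_angle)

for r, s in zip(ranges, beam_spread):
    print(f"at {r:4.0f} m range, eddies smaller than ~{s:.1f} m are blurred")
```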
Transportation network design optimization using custom genetic algorithms, applied to the Hong Kong railway. Work done at Ecole Centrale Paris in 2011. In this project we looked at transportation networks: given certain nodes and passenger journeys, how best to choose the lines running through them. One part of the team applied standard algorithms, while I developed my own evolutionary algorithms and implemented them in C++. I defined lines as entities that could mate, with evolution-inspired processes such as mutation and crossing over. To set the evolutionary pressure I came up with a purpose-made Lagrangian multiplier, which had a massive impact on optimization performance. With the same team we made a broad review of the artificial intelligence techniques of the time. We also briefly tried teaching a bot to move on a 2D grid and collect food using a very basic reinforcement learning technique, but the algorithm was barely better than random (oops!). Tools: pen & paper, and C++. 2011
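A minimal Python sketch of the evolutionary idea (the original was C++ and far richer): candidate lines are node sequences, fitness rewards covered passenger journeys, and a Lagrangian-style multiplier penalises line length. The demand matrix and parameters are toy values.

```python
# Toy genetic algorithm for line design: selection, crossover, mutation,
# with a Lagrangian-style penalty setting the evolutionary pressure.
import random

random.seed(0)
N = 10                                   # number of network nodes
demand = {(i, j): random.randint(0, 9)   # toy passenger journeys
          for i in range(N) for j in range(N) if i != j}
LAMBDA = 2.0                             # penalty weight on line length

def fitness(line):
    served = {(a, b) for a in line for b in line if a != b}
    revenue = sum(demand[p] for p in served)
    return revenue - LAMBDA * len(line)  # coverage minus penalised cost

def crossover(a, b):
    cut = random.randrange(1, min(len(a), len(b)))
    return a[:cut] + [n for n in b if n not in a[:cut]]

def mutate(line):
    # randomly grow or shrink a line, keeping nodes unique
    if random.random() < 0.3:
        if random.random() < 0.5:
            line = line + [random.randrange(N)]
        elif len(line) > 2:
            line = line[:-1]
    return [n for i, n in enumerate(line) if n not in line[:i]]

pop = [random.sample(range(N), random.randint(2, 5)) for _ in range(30)]
for _ in range(200):                     # evolve: select, mate, mutate
    pop.sort(key=fitness, reverse=True)
    parents = pop[:10]
    pop = parents + [mutate(crossover(*random.sample(parents, 2)))
                     for _ in range(20)]

best = max(pop, key=fitness)
print("best line:", best, "fitness:", fitness(best))
```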
Forecasting bike rides at the network level and at the station-pair level. This was a data challenge I had to complete in a very limited time. My first approach focused on time-based trends and leveraged Facebook's Prophet package. In a subsequent algorithm, I explored network effects and clusters of bike stations whose demands are correlated. Overall, the biggest lesson for me was to structure the problem and potential solutions in detail, and to realise that many solutions boil down to the bias-variance trade-off. To reduce noise (variance), aggregating data across space and time is effective, but only as long as the aggregated events are reasonably i.i.d.; departing too far from this creates bias. Tools: Python. 2018
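A minimal sketch of the first, time-trend approach: aggregate rides to a daily count over the whole network and let Prophet fit trend and seasonality. File and column names are hypothetical, and the import uses the current package name (it was fbprophet at the time).

```python
# Daily network-level forecast with Prophet. Hypothetical input file.
import pandas as pd
from prophet import Prophet

rides = pd.read_csv("rides.csv", parse_dates=["start_time"])
daily = (rides.set_index("start_time")
              .resample("D").size()
              .reset_index())
daily.columns = ["ds", "y"]            # Prophet's expected schema

m = Prophet(weekly_seasonality=True, yearly_seasonality=True)
m.fit(daily)
future = m.make_future_dataframe(periods=30)   # forecast the next 30 days
forecast = m.predict(future)
print(forecast[["ds", "yhat", "yhat_lower", "yhat_upper"]].tail())
```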
Freelance Data Scientist for Cetrix Cloud Services. The client wanted to leverage HubSpot and Salesforce to automatically identify fit and interest points. My role was to scope and plan how to harvest user intent for dynamic, adaptive marketing (e.g. personalised webpages or emails), given current and future data. Tools: Python. 2018
Leads prioritization for a start-up in Los Angeles. The company wanted to increase the conversion rate of its call center given limited resources. I was sent a list of 10,000 anonymized leads that had been called 1-3+ times; the data captured the time of outreach, call duration, the customer's main objection, and whether the lead had converted. The key for me was to shift from predicting customer conversion to predicting call conversion, and to think of calls as evolving in a sort of Markov chain whose states are the objections. It isn't a true Markov chain because (i) more than one step of history was required, and (ii) the initial state seemed to matter quite a bit, which I thought of as a customer persona. The result was a prioritization algorithm in Python based on (i) a Random Forest classifier predicting conversion probability, and (ii) a time-sensitivity algorithm managing the different impact of calling back later for each lead type at each stage of the sales process. Tools: Python. 2018
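A hedged sketch of the call-level model only (the time-sensitivity part is omitted): predict whether the next call converts from the persona and objection history, then rank leads by that probability. The file and feature names are hypothetical.

```python
# Call-conversion model: one row per call, categorical objection history.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

calls = pd.read_csv("calls.csv")       # hypothetical anonymised extract
X = pd.get_dummies(calls[["persona", "last_objection", "prev_objection",
                          "call_number", "hours_since_last_call"]],
                   columns=["persona", "last_objection", "prev_objection"])
y = calls["converted"]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))

# Rank open leads by the predicted conversion probability of the next call.
calls["p_convert"] = clf.predict_proba(X)[:, 1]
print(calls.sort_values("p_convert", ascending=False).head())
```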
Workshop at HIT: Kaggle Titanic dataset, and a Raspberry Pi with a capacitive-touch sensor and a camera for object recognition. Check out my Python Titanic Kaggle kernel (machine learning and data science). I prepared this quickly for a workshop I co-led at the Harare Institute of Technology (Zimbabwe) in 2017. We had 13 second- and fourth-year cybersecurity students and a couple of professors attending. We designed the workshop to broaden their horizons and help them independently explore a few topics; entrepreneurship is vital in Zimbabwe. We chose to introduce the students to data science with the Titanic dataset, so that they would become part of the Kaggle community and could keep learning by themselves. We thought hardware was also important, so my friend brought a few Raspberry Pis with two modules: a capacitive-touch sensor and a camera. The capacitive-touch sensor was relevant for detecting when someone touches a door handle or another conductive surface; for the workshop we ended up doing something more interactive and built a piano out of spoons, apples and bananas. The camera was for intrusion detection, using OpenCV and TensorFlow for image recognition. The rationale is that motion sensors create a lot of false-positive alerts, as they can be triggered by a moving branch, a dog, or someone authorised to be in the house. We thought this could become a business, but realised Google had been working with Nest for a year to do just that. Competing with Google on AI is hard. Tools: Python. 2017
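A minimal sketch of the motion-detection half of that intrusion-detection idea, assuming a camera at device index 0: frame differencing with OpenCV flags motion, and only flagged frames would then go to an image classifier to filter out dogs, branches, and so on.

```python
# Frame-differencing motion detector; flagged frames feed the classifier.
import cv2

cap = cv2.VideoCapture(0)              # assumes the Pi camera is device 0
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)     # pixel-wise change since last frame
    _, motion = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    if cv2.countNonZero(motion) > 5000:      # tunable motion threshold
        print("motion detected - send this frame to the classifier")
    prev = gray

cap.release()
```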
Math
A/B testing statistics - intuition and derivation for power analysis. Online resources on A/B testing tend to be black boxes with very high-level insights, which in my experience prevents experimenters from truly understanding what's going on. In this document, I give a brief introduction to what A/B testing is used for, and focus on providing a simple yet more detailed intuition for the frequentist statistical logic. 2018
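As a companion to the derivation, here is a worked example of the standard power calculation: the sample size per arm needed to detect a lift from a 10% to a 12% conversion rate at alpha = 0.05 with 80% power, using statsmodels' closed-form solver.

```python
# Sample size per variant for a two-proportion test.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

effect = proportion_effectsize(0.10, 0.12)   # Cohen's h for the two rates
n = NormalIndPower().solve_power(effect_size=effect, alpha=0.05, power=0.8,
                                 alternative="two-sided")
print(f"~{n:.0f} users per variant")
```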
A/B testing - bootstrap and Bayesian perspectives. This post explores how to represent and interpret data while it is being collected. We take a concrete marketing A/B test with different conversion rates and revenue per customer in each arm. After quickly going over bootstrap methods to estimate uncertainty, we consider Bayesian probabilities as a way to model prior knowledge, going into some detail on conjugate priors. 2018
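A condensed version of both perspectives on toy conversion data: a bootstrap confidence interval for the lift, and a Beta-Binomial conjugate posterior for each arm's rate.

```python
# Bootstrap CI vs Bayesian posterior on simulated A/B conversion data.
import numpy as np

rng = np.random.default_rng(0)
a = rng.binomial(1, 0.10, size=1000)    # toy control conversions
b = rng.binomial(1, 0.12, size=1000)    # toy variant conversions

# Bootstrap: resample each arm and look at the distribution of the lift.
lifts = [rng.choice(b, b.size).mean() - rng.choice(a, a.size).mean()
         for _ in range(5000)]
print("bootstrap 95% CI for lift:", np.percentile(lifts, [2.5, 97.5]))

# Bayesian: with a Beta(1, 1) prior, the posterior of each rate is
# Beta(1 + successes, 1 + failures); sample to compare the two arms.
post_a = rng.beta(1 + a.sum(), 1 + (1 - a).sum(), 5000)
post_b = rng.beta(1 + b.sum(), 1 + (1 - b).sum(), 5000)
print("P(variant beats control):", (post_b > post_a).mean())
```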
Python
Recognition of the word “activate” from audio clips, using a convolutional layer and GRUs (deep learning, NLP). Spectrogram above. Listen to an example with a chime after the word is detected! 2020
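A sketch of the model shape: a 1D convolution over the spectrogram, GRU layers, then a per-timestep sigmoid that fires just after the trigger word. The dimensions and layer sizes below are illustrative, following the classic trigger-word detection exercise.

```python
# Conv1D + GRU trigger-word detector over spectrogram frames (Keras).
from tensorflow.keras.layers import (Input, Conv1D, BatchNormalization,
                                     Activation, Dropout, GRU,
                                     TimeDistributed, Dense)
from tensorflow.keras.models import Model

Tx, n_freq = 5511, 101                  # spectrogram: time steps x frequencies
x_in = Input(shape=(Tx, n_freq))
x = Conv1D(196, kernel_size=15, strides=4)(x_in)  # downsample in time
x = BatchNormalization()(x)
x = Activation("relu")(x)
x = Dropout(0.8)(x)
x = GRU(128, return_sequences=True)(x)
x = Dropout(0.8)(x)
x = BatchNormalization()(x)
x = GRU(128, return_sequences=True)(x)
x = Dropout(0.8)(x)
x = BatchNormalization()(x)
out = TimeDistributed(Dense(1, activation="sigmoid"))(x)  # fires per step

model = Model(inputs=x_in, outputs=out)
model.summary()
```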
An LSTM neural network generating jazz improvisation, based on the 5th course of the Deep Learning Coursera specialisation. Listen to an example! 2020
A popular-science Medium blog post on how a couple of hours of machine learning can turn a repetitive business task into an automated, self-improving process. 2019
A very simple chatbot engine implementing (a simplified version of) the architecture of a paper on a Facebook dataset of stories, questions and answers. The paper presents a structure to (a) represent the stories as bags of words while taking into account word order within each sentence and the sequence of sentences, and (b) combine the stories and questions into a prediction through an LSTM RNN. 2019
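A compressed sketch of that simplified wiring, after the classic Keras bAbI memory-network example: embed the story and the question, match them with a dot product, and feed the combined sequence to an LSTM. Vocabulary size and sequence lengths are placeholders.

```python
# Simplified memory-network-style chatbot wiring (Keras functional API).
from tensorflow.keras.layers import (Input, Embedding, Dropout, dot, add,
                                     Activation, Permute, concatenate,
                                     LSTM, Dense)
from tensorflow.keras.models import Model

vocab, story_len, query_len = 50, 68, 4   # placeholder dimensions

story = Input((story_len,))
query = Input((query_len,))

m = Embedding(vocab, 64)(story)                 # story memory embedding
c = Embedding(vocab, query_len)(story)          # story output embedding
u = Embedding(vocab, 64)(query)                 # question embedding

match = Activation("softmax")(dot([m, u], axes=(2, 2)))  # attention weights
response = Permute((2, 1))(add([match, c]))     # weighted story representation
answer = concatenate([response, u])             # join story and question
answer = LSTM(32)(answer)
answer = Dense(vocab, activation="softmax")(Dropout(0.3)(answer))

model = Model([story, query], answer)
model.summary()
```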
Predicting Amazon review ratings / sentiment using TF-IDF and VADER (NLTK) as features, combined in a simple classifier. I also played with Word2Vec (spaCy), and found that representing a review as the sum of its word vectors was a bad idea (in the simple implementation I used): using the resulting vector as a feature seemed to add no predictive power to my classifier. The issue probably comes from the sum / average operator, which effectively increases entropy. An LSTM would probably be the classic tool, but I still like to test simple tools first (although with little hope in this case). Perhaps another type of non-linear word-vector combination, perhaps weighted by TF-IDF, would be a sensible step? 2019
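A sketch of the feature combination, on two toy reviews standing in for the dataset: TF-IDF vectors stacked with VADER's compound polarity score, fed to a simple linear classifier.

```python
# TF-IDF + VADER features combined in a logistic regression.
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reviews = ["Great product, works perfectly", "Broke after two days, awful"]
labels = [1, 0]                          # toy positive / negative ratings

tfidf = TfidfVectorizer(min_df=1).fit_transform(reviews)
sia = SentimentIntensityAnalyzer()       # needs nltk.download("vader_lexicon")
vader = csr_matrix([[sia.polarity_scores(r)["compound"]] for r in reviews])

X = hstack([tfidf, vader])               # stack both feature sets
clf = LogisticRegression().fit(X, labels)
```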
Improving matching rates on a two-sided marketplace. This was a 48-hour data challenge I had to cram into 24 hours. The aim was to deliver 2-3 actionable recommendations on how to improve matching in the marketplace. Here are the links to the presentation of my results and the Python notebook. 2018
Causal inference - Lalonde dataset. Causal inference is what people typically call “controlling for confounding factors”. It is used in observational studies in all fields, e.g. public health, econometrics and marketing. This example explores the famous Lalonde dataset, a particularly unbalanced study of the effect of a job training programme. In this notebook I use propensity matching to separate the treatment effect from the unbalanced population. Tools: Python, CausalInference package. 2018
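A sketch of the propensity-matching step with the causalinference package, on simulated data standing in for Lalonde (outcome Y, binary treatment D, covariates X, with treatment assignment deliberately confounded by X):

```python
# Propensity-score matching with the causalinference package, toy data.
import numpy as np
from causalinference import CausalModel

rng = np.random.RandomState(0)
X = rng.normal(size=(500, 3))                       # covariates
propensity = 1 / (1 + np.exp(-X[:, 0]))             # treatment depends on X
D = (rng.uniform(size=500) < propensity).astype(int)
Y = X.sum(axis=1) + 2.0 * D + rng.normal(size=500)  # true effect = 2

cm = CausalModel(Y, D, X)
cm.est_propensity_s()        # fit a propensity score model
cm.trim_s()                  # drop units with extreme propensity scores
cm.est_via_matching(bias_adj=True)
print(cm.estimates)          # matching estimate should recover ~2
```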
Ads dataset. This is an anonymised dataset of 19k rows about ad creatives, networks, number of days before release, number of views, and actions taken by viewers. I disentangle the impact of the creative artists from the broadcasting context: I start with simple averages, then use a fixed-effects model, and finally introduce an experimental random forest classifier coupled with a fixed-effects model. Tools: Python, Statsmodels, Sklearn. 2018
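A sketch of the fixed-effects step: dummy out the broadcasting context so artist coefficients are estimated net of where and when the ad ran. File and column names are hypothetical.

```python
# Fixed-effects regression: C(...) adds one dummy per artist and network.
import pandas as pd
import statsmodels.formula.api as smf

ads = pd.read_csv("ads.csv")     # hypothetical: artist, network, views, actions
ads["action_rate"] = ads["actions"] / ads["views"]

model = smf.ols("action_rate ~ C(artist) + C(network) + days_before_release",
                data=ads).fit()
print(model.summary())           # artist effects net of broadcasting context
```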
Planning a marketing campaign. A fictitious start-up is planning a street marketing campaign in London. Leveraging a public dataset, I identify the best place and time to run the campaign, the key assumptions made, and how to test them. 2018
TreefortBNB - a fun Priceonomics puzzle. The puzzle gives 10,000 lines of data on a business similar to Airbnb. The aim is to produce reliable statistics on the most expensive cities in the US on TreefortBnb.com, so that Priceonomics can write an article about it. The key insight for me was a paradigm shift from listing statistics (the raw data) to booking statistics (which best represent user experience). This is still a running puzzle, so I won't give more details. 2018
Most of the mini-projects that follow were fairly easy and came with extensive guidance. They were nevertheless quite insightful, as they helped bridge the gap between what I know in maths and algorithmics and real-world applications.
Java project
Java notes. Key notes I took while following some Java courses:
Scala. Notes from following Martin Odersky's EPFL course on functional programming in Scala, and from Spark. Some key learnings:
Data Science. While following online courses, I did quite a few mini-projects. They were very simple and high-level, but a good excuse to get up to date with the popular Python tools people use in data science today: Numpy, Pandas, Matplotlib, Seaborn, Plotly, SciKit Learn, the Python SQL interface, SciPy, TensorFlow, NLTK, SHAP, lifelines, etc. Here are a few for which I took some notes:
Matlab. When I followed Andrew Ng's Stanford Machine Learning course, I did quite a few mini-projects that led me to write classic machine learning algorithms from first principles (algorithmics and matrix computations).
PySpark