Passionate about data analysis and problem-solving, I help businesses make informed decisions by transforming raw data into meaningful insights.
From quality enhancement to data-driven strategy
In the automotive electronics industry, I conducted in-depth data analyses and developed clear visualizations that improved process efficiency and product quality. My work played a key role in refining anomaly detection and optimizing workflows.
Expanding my expertise
To strengthen my skills, I completed an intensive Data Analyst training program, where I deepened my knowledge in Business Intelligence, Machine Learning, and Text Mining, while sharpening my expertise in SQL, Python, and data visualization.
Looking for new challenges
I'm eager to apply my expertise in a dynamic environment where data-driven decision-making is key. I'm particularly interested in leveraging analytics to drive business performance and operational excellence.
Insightful dashboard showcasing key analytics extracted from raw call center data
- designed with a soft, minimalist layout and pastel colors -
Customer service report:
Overall performance report:
Team performance report:
Revenue data report at a glance:
✨ E T L ✨
The dataset is constructed around multiple Excel / CSV files:
After the data is extracted from the source files using the appropriate connectors, it is processed, filtered, properly formatted, and standardized. Structuring the tables into a star schema, with the fact table at the center surrounded by dimension tables, is crucial for efficient querying and maintenance.
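To make this concrete, here is a minimal pandas sketch of such an ETL step. The file and column names (call_center_records.csv, agent_name, duration_sec, satisfaction) are hypothetical placeholders, not the actual project sources:

```python
import pandas as pd

# Extract: hypothetical source file and columns, for illustration only.
calls = pd.read_csv("call_center_records.csv")

# Transform: standardize text, normalize dates, drop unusable rows.
calls["agent_name"] = calls["agent_name"].str.strip().str.title()
calls["call_date"] = pd.to_datetime(calls["call_date"], errors="coerce")
calls = calls.dropna(subset=["call_date"])

# Dimension table: one row per agent, with a surrogate key.
dim_agent = (
    calls[["agent_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis("agent_id")
    .reset_index()
)

# Fact table: one row per call, referencing the dimension by its key.
fact_calls = calls.merge(dim_agent, on="agent_name")[
    ["agent_id", "call_date", "duration_sec", "satisfaction"]
]
```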
Streamlit Application
An interactive and efficient way to showcase a data project ⋙ live insights and dynamic visualizations
Here is an example of a Streamlit application that I created for the oral presentation of my Data Analyst study project.
⬇️ 👁🗨 Please take a closer look! ⬇️
As part of a data science project, I developed a Streamlit application to present our end-to-end workflow, from data exploration to machine learning predictions. The dataset, sourced from Kaggle, contained categorical attributes such as platform, publisher, and release year.
Key features & technical highlights:
✅ Data exploration & visualization
‣ Presented the dataset with key insights.
‣ Conducted Exploratory Data Analysis (EDA) using interactive Plotly charts to uncover sales trends.
✅ Data enrichment via web scraping
‣ Scraped multiple websites to fill missing values and add new quantitative features.
‣ This enhancement significantly improved our machine learning model’s accuracy.
✅ Machine learning implementation
‣ Tested multiple ML models for sales prediction.
‣ Applied feature encoding and target variable transformation for better performance.
✅ Sentiment analysis
‣ Analyzed text data related to video games to extract valuable insights from player feedback.
This project demonstrates the expertise I acquired in data analysis, data wrangling, web scraping, feature engineering, and machine learning, while leveraging Streamlit for interactive reporting.
Sidebar menu, images, dataframes / table preview. Ability to filter data.
Rich text with markdown language.
Advanced chart rendering with Bokeh, Plotly, Matplotlib, and more!
Several container options: popover, dropdown...
Options with sliders, dynamic chart updating...
Code integration, work with columns...
Options with checkboxes, radio buttons...
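To give a concrete idea of how these pieces fit together, here is a minimal Streamlit sketch combining a sidebar filter, a table preview, and a dynamic Plotly chart. The dataset path and column names (vgsales.csv, Platform, Year, Global_Sales) are assumptions for illustration:

```python
import pandas as pd
import plotly.express as px
import streamlit as st

df = pd.read_csv("vgsales.csv")  # hypothetical path to the Kaggle dataset

st.title("Video Game Sales Explorer")

# Sidebar menu with widgets that filter the data.
platform = st.sidebar.selectbox("Platform", sorted(df["Platform"].dropna().unique()))
year_min, year_max = st.sidebar.slider("Release years", 1980, 2020, (1990, 2010))

filtered = df[(df["Platform"] == platform) & df["Year"].between(year_min, year_max)]

# Interactive dataframe preview.
st.dataframe(filtered.head(50))

# Plotly chart that re-renders whenever a widget changes.
fig = px.bar(
    filtered.groupby("Year", as_index=False)["Global_Sales"].sum(),
    x="Year",
    y="Global_Sales",
)
st.plotly_chart(fig)
```

Running `streamlit run app.py` serves the app locally; each widget interaction reruns the script from top to bottom, which is what makes the table and chart update dynamically.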
Looker Studio Dashboard
A dynamic and interactive example for visualizing key data insights
The source dataset contains data from a company offering a delivery and storage service.
⬇️ 👁🗨 Have a look! ⬇️
Lines of analysis:
Overall view of deliveries: number of deliveries by month, package damage by day of the week, some KPIs.
Overview of store performance: percentage of parcels scanned manually, average time spent by truck in each store, delivery count per warehouse, delivery count by store.
Advanced Web Scraping
Extract data efficiently from the web for powerful insights and analysis
During my Data Analyst study project, I enhanced the given dataset by filling in missing values and adding quantitative attributes, sourced through web scraping, to improve our machine learning models.
Recently, I revisited this work and made significant improvements, greatly enhancing efficiency, logging, and reliability.
⬇️ 👁🗨 Have a look! ⬇️
This script scrapes all the video games ranked on Metacritic.com (currently 13,589 games) and collects information such as:
Critic_positive_reviews: Count of 'positive' reviews received from highly respected critics
Critic_mixed_reviews: Count of 'mixed' reviews received from highly respected critics
Critic_negative_reviews: Count of 'negative' reviews received from highly respected critics
User_score: Average score given by end users
User_positive_reviews: Count of 'positive' reviews received from end users
User_mixed_reviews: Count of 'mixed' reviews received from end users
User_negative_reviews: Count of 'negative' reviews received from end users
Game_url: URL of the game's page
This web scraping script is well-optimized, focusing on efficiency, robustness, and scalability.
⋙ Asynchronous execution for performance:
☑ Uses asyncio and aiohttp for non-blocking HTTP requests, significantly reducing wait times.
☑ Retrieves platform scores in parallel with asyncio, improving efficiency.
⋙ Advanced concurrency management:
☑ Implements concurrent.futures.ThreadPoolExecutor to process multiple tasks simultaneously.
☑ Limits the number of concurrent threads (max_threads=5) to optimize system and server load.
⏫ Leveraging thread pools and asynchronous functions significantly enhanced efficiency, reducing scraping time by at least 5x.
⋙ Robust error handling:
☑ Implements a retry mechanism (MAX_RETRIES) with exponential backoff to handle network failures.
☑ Detects bans (e.g., captchas, HTTP 403) and logs errors without crashing the script.
⋙ Anti-detection mechanisms:
☑ Uses randomized User-Agent (fake_useragent) and dynamic referers to mimic real users.
☑ Includes optional random delays to avoid bot detection.
⋙ Optimized data storage and export:
☑ Implements lock mechanisms to ensure data integrity.
☑ Periodically saves data in batches to avoid memory overload.
☑ Merges and compresses CSV files automatically for efficient storage management.
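For illustration, here is a stripped-down sketch of the asynchronous fetching pattern described above, not the full production script. The URL, concurrency limits, and helper names are placeholders:

```python
import asyncio
import random

import aiohttp
from fake_useragent import UserAgent

MAX_RETRIES = 3
MAX_CONCURRENT = 5  # cap concurrent requests to limit server load

ua = UserAgent()
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    """Fetch one page with a random User-Agent, retries, and backoff."""
    for attempt in range(MAX_RETRIES):
        async with semaphore:
            try:
                headers = {"User-Agent": ua.random}
                async with session.get(
                    url, headers=headers, timeout=aiohttp.ClientTimeout(total=15)
                ) as resp:
                    if resp.status == 403:  # likely ban or captcha: log and skip
                        print(f"Blocked (403): {url}")
                        return None
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
    print(f"Giving up after {MAX_RETRIES} attempts: {url}")
    return None

async def main(urls: list[str]) -> list[str | None]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(main(["https://www.metacritic.com/browse/game/"]))
```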
Text / Data Mining
Uncover hidden patterns and valuable insights from unstructured data
During my Data Analyst study project, I collected user review comments via web scraping for each game/platform pair in the original dataset, focusing on games with at least 50 user comments.
I prioritized the oldest comments and limited the collection to a maximum of 500 comments per game, resulting in a 390MB CSV file.
Here is a sample of the gathered data:
Have a look at this example: Darkest Dungeon.
Each quote is associated with a score from 0 to 10. Metacritic then categorizes each score as a positive, mixed, or negative review as follows:
Scores in [0-4] are counted as negative.
Scores in [5-7] are counted as mixed.
Scores in [8-10] are counted as positive.
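Expressed as a small Python helper (a sketch of the same rule):

```python
def categorize(score: int) -> str:
    """Metacritic-style bucketing of a 0-10 user score."""
    if score <= 4:
        return "negative"
    if score <= 7:
        return "mixed"
    return "positive"
```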
📋 We then obtain this summary:
Our goal was to predict the category label (our target) associated with each user comment (the feature).
⋙ Here is the technical pipeline:
Keep only the English quotes, using the Python langdetect module.
Create a new 'Sentiment' column by applying the score bins above: Sentiment = -1 for negative, 0 for mixed, 1 for positive.
Clean the text: lowercase it and remove noise characters.
Tokenize into words, then filter out stop words, using functions from the Python nltk module.
Apply word lemmatization.
Apply the TF-IDF method.
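A condensed sketch of this pipeline, assuming a user_comments.csv file with Quote and Score columns (hypothetical names), could look like this:

```python
# Run nltk.download("punkt"), ("stopwords") and ("wordnet") once beforehand.
import pandas as pd
from langdetect import detect, LangDetectException
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False

df = pd.read_csv("user_comments.csv")                # hypothetical file
df = df[df["Quote"].astype(str).map(is_english)]     # keep English quotes only
df["Sentiment"] = df["Score"].map(lambda s: -1 if s <= 4 else (0 if s <= 7 else 1))

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    # Lowercase, drop non-alphabetic tokens and stop words, then lemmatize.
    tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

X = TfidfVectorizer(max_features=20000).fit_transform(df["Quote"].map(preprocess))
y = df["Sentiment"]
```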
Since the label distribution across the 132,083 quotes was imbalanced, I applied SMOTETomek, which combines SMOTE over-sampling with Tomek-links under-sampling, to balance the data.
Label count plot (raw data)
Label count plot (over-sampled data)
I then fitted GradientBoostingClassifier models from the scikit-learn Python library, with and without the re-sampling step, and compared the resulting predictions.
Prediction confusion matrix (raw data)
Prediction confusion matrix (over-sampled data)
We can clearly observe an improvement in predictions, particularly for negative and mixed sentiments, albeit at the cost of slightly reduced accuracy in positive sentiment predictions.
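For reference, this comparison can be sketched with imbalanced-learn and scikit-learn, reusing the X and y produced by the TF-IDF step above; the parameters are illustrative, not the tuned project values:

```python
from imblearn.combine import SMOTETomek
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: fit on the raw, imbalanced training data.
baseline = GradientBoostingClassifier().fit(X_train, y_train)

# Re-sampled: balance the training set only, never the test set.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
resampled = GradientBoostingClassifier().fit(X_res, y_res)

for name, model in [("raw", baseline), ("over-sampled", resampled)]:
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```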
We finally used those sentiment predictions in our video game sales model, with input features such as:
Publisher
Year of release
Platform
Genre
Developer
Rate
Platform Type
User comments
⬇️ 👁🗨 Streamlit demo of the complete project ⬇️
Machine Learning
Build predictive models and uncover data-driven insights with AI
The Data Analyst training program at Datascientest includes courses on both supervised and unsupervised Machine Learning.
The study project allowed us to apply our knowledge to a real-world case. My project focused on forecasting video game sales based on qualitative features, using a dataset from Kaggle.
The qualitative features for predictions are:
Platform
Year
Genre
Publisher
And the chosen target is Global_Sales.
⋙ After data exploration, cleaning, and first tests with classic models, we found that we got better results when the target was processed with the Box-Cox transform, which aims to make the data more Gaussian-like.
Quantile-quantile plot (raw data)
Quantile-quantile plot (Box-Cox transformed data)
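In code, the transformation boils down to a few scipy calls. The Global_Sales column comes from the dataset, assumed loaded in a dataframe df; the small shift constant is an assumption, since Box-Cox requires strictly positive values:

```python
from scipy.special import inv_boxcox
from scipy.stats import boxcox

y = df["Global_Sales"].to_numpy()

# Fit the transform; boxcox returns the transformed values and the
# fitted lambda needed to invert the transform later.
y_bc, lmbda = boxcox(y + 1e-3)  # shift because some sales are exactly 0

# After predicting in the transformed space, map back to sales units:
y_pred = inv_boxcox(y_bc, lmbda) - 1e-3
```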
⋙ We preferred encoding the qualitative features with a Binary Encoder rather than a standard OneHotEncoder, since binary encoding generates far fewer columns for high-cardinality features.
⋙ We determined that XGBoost's XGBRegressor gave good results on our data, but there was still room for improvement.
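Put together, the encoding and the model fit into a short scikit-learn pipeline, reusing df and the Box-Cox target y_bc from the sketch above; the hyperparameters are illustrative, not the tuned project values:

```python
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

features = ["Platform", "Year", "Genre", "Publisher"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], y_bc, test_size=0.2, random_state=42
)

# Binary-encode the categorical columns; Year passes through unchanged.
model = make_pipeline(
    ce.BinaryEncoder(cols=["Platform", "Genre", "Publisher"]),
    XGBRegressor(n_estimators=500, learning_rate=0.05),
)
model.fit(X_train, y_train)
print("R² on the Box-Cox-transformed target:", model.score(X_test, y_test))
```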
To close that gap, we integrated the scraped data into our dataset, incorporating quantitative features such as user scores, the number of positive, mixed, and negative reviews per game, as well as critic scores and their respective review counts.
This enrichment improved prediction accuracy by 18 percentage points, from 44% to 62%.
⬇️ Please have a look at our full project's Streamlit demo ⬇️