Passionate about data analysis and problem-solving, I help businesses make informed decisions by transforming raw data into meaningful insights.
From quality enhancement to data-driven strategy
In the automotive electronics industry, I conducted in-depth data analyses and developed clear visualizations that improved process efficiency and product quality. My work played a key role in refining anomaly detection and optimizing workflows.
Expanding my expertise
To strengthen my skills, I completed an intensive Data Analyst training program, where I deepened my knowledge in Business Intelligence, Machine Learning, and Text Mining, while sharpening my expertise in SQL, Python, and data visualization.
Looking for new challenges
I'm eager to apply my expertise in a dynamic environment where data-driven decision-making is key. I'm particularly interested in leveraging analytics to drive business performance and operational excellence.
Insightful dashboard showcasing key analytics extracted from raw call center data
- designed with a soft, minimalist layout and pastel colors -
Customer service report:
Overall performance report:
Team performance report:
Revenue data report at a glance:
✨ E T L ✨
The dataset is constructed around multiple Excel / CSV files:
After the data is extracted from the source files using the appropriate connectors, it is processed, filtered, properly formatted, and standardized. Structuring the tables into a star schema, with the fact table at the center surrounded by dimension tables, is crucial for efficient querying and maintenance.
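To make this concrete, here is a minimal pandas sketch of such an ETL step. The file and column names (call_center_records.csv, agent_name, duration_sec, satisfaction) are hypothetical placeholders, not the actual project sources:

```python
import pandas as pd

# Extract: hypothetical source file and columns, for illustration only.
calls = pd.read_csv("call_center_records.csv")

# Transform: standardize text, normalize dates, drop unusable rows.
calls["agent_name"] = calls["agent_name"].str.strip().str.title()
calls["call_date"] = pd.to_datetime(calls["call_date"], errors="coerce")
calls = calls.dropna(subset=["call_date"])

# Dimension table: one row per agent, with a surrogate key.
dim_agent = (
    calls[["agent_name"]]
    .drop_duplicates()
    .reset_index(drop=True)
    .rename_axis("agent_id")
    .reset_index()
)

# Fact table: one row per call, referencing the dimension by its key.
fact_calls = calls.merge(dim_agent, on="agent_name")[
    ["agent_id", "call_date", "duration_sec", "satisfaction"]
]
```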
Streamlit Application
An interactive and efficient way to showcase a data project ⋙ live insights and dynamic visualizations
Here is an example of a Streamlit application that I created for the oral presentation of my Data Analyst study project.
⬇️ 👁🗨 Please take a closer look! ⬇️
As part of a data science project, I developed a Streamlit application to present our end-to-end workflow, from data exploration to machine learning predictions. The dataset, sourced from Kaggle, contained categorical attributes such as platform, publisher, and release year.
Key features & technical highlights:
✅ Data exploration & visualization
‣ Presented the dataset with key insights.
‣ Conducted Exploratory Data Analysis (EDA) using interactive Plotly charts to uncover sales trends.
✅ Data enrichment via web scraping
‣ Scraped multiple websites to fill missing values and add new quantitative features.
‣ This enhancement significantly improved our machine learning model’s accuracy.
✅ Machine learning implementation
‣ Tested multiple ML models for sales prediction.
‣ Applied feature encoding and target variable transformation for better performance.
✅ Sentiment analysis
‣ Analyzed text data related to video games to extract valuable insights from player feedback.
This project demonstrates the expertise I acquired in data analysis, data wrangling, web scraping, feature engineering, and machine learning, while leveraging Streamlit for interactive reporting.
Sidebar menu, images, dataframes / table preview. Ability to filter data.
Rich text with markdown language.
Advanced chart rendering with Bokeh, Plotly, Matplotlib, and more!
Several container options: popover, dropdown...
Options with sliders, dynamic chart updating...
Code integration, work with columns...
Options with checkboxes, radio buttons...
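To give a concrete idea of how these pieces fit together, here is a minimal Streamlit sketch combining a sidebar filter, a table preview, and a dynamic Plotly chart. The dataset path and column names (vgsales.csv, Platform, Year, Global_Sales) are assumptions for illustration:

```python
import pandas as pd
import plotly.express as px
import streamlit as st

df = pd.read_csv("vgsales.csv")  # hypothetical path to the Kaggle dataset

st.title("Video Game Sales Explorer")

# Sidebar menu with widgets that filter the data.
platform = st.sidebar.selectbox("Platform", sorted(df["Platform"].dropna().unique()))
year_min, year_max = st.sidebar.slider("Release years", 1980, 2020, (1990, 2010))

filtered = df[(df["Platform"] == platform) & df["Year"].between(year_min, year_max)]

# Interactive dataframe preview.
st.dataframe(filtered.head(50))

# Plotly chart that re-renders whenever a widget changes.
fig = px.bar(
    filtered.groupby("Year", as_index=False)["Global_Sales"].sum(),
    x="Year",
    y="Global_Sales",
)
st.plotly_chart(fig)
```

Running `streamlit run app.py` serves the app locally; each widget interaction reruns the script from top to bottom, which is what makes the table and chart update dynamically.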
Looker Studio Dashboard
A dynamic and interactive example for visualizing key data insights
The source dataset contains data from a company offering a delivery and storage service.
⬇️ 👁🗨 Have a look! ⬇️
Lines of analysis:
Overall view of deliveries: number of deliveries by month, package damage by day of the week, some KPIs.
Overview of store performance: percentage of parcels scanned manually, average time spent by truck in each store, delivery count per warehouse, delivery count by store.
Advanced Web Scraping
Extract data efficiently from the web for powerful insights and analysis
During my Data Analyst study project, I enhanced the given dataset by filling in missing values and adding quantitative attributes, sourced through web scraping, to improve our machine learning models.
Recently, I revisited this work and made significant improvements, greatly enhancing efficiency, logging, and reliability.
⬇️ 👁🗨 Have a look! ⬇️
This script scrapes all the video games ranked on Metacritic.com (currently 13,589 games) and collects information such as:
Critic_positive_reviews: Count of 'positive' reviews received from highly respected critics
Critic_mixed_reviews: Count of 'mixed' reviews received from highly respected critics
Critic_negative_reviews: Count of 'negative' reviews received from highly respected critics
User_score: Average score given by end users
User_positive_reviews: Count of 'positive' reviews received from end users
User_mixed_reviews: Count of 'mixed' reviews received from end users
User_negative_reviews: Count of 'negative' reviews received from end users
Game_url: URL of the game's page
This web scraping script is well-optimized, focusing on efficiency, robustness, and scalability.
⋙ Asynchronous execution for performance:
☑ Uses asyncio and aiohttp for non-blocking HTTP requests, significantly reducing wait times.
☑ Retrieves platform scores in parallel with asyncio, improving efficiency.
⋙ Advanced concurrency management:
☑ Implements concurrent.futures.ThreadPoolExecutor to process multiple tasks simultaneously.
☑ Limits the number of concurrent threads (max_threads=5) to optimize system and server load.
⏫ Leveraging thread pools and asynchronous functions significantly enhanced efficiency, reducing scraping time by at least 5x.
⋙ Robust error handling:
☑ Implements a retry mechanism (MAX_RETRIES) with exponential backoff to handle network failures.
☑ Detects bans (e.g., captchas, HTTP 403) and logs errors without crashing the script.
⋙ Anti-detection mechanisms:
☑ Uses randomized User-Agent (fake_useragent) and dynamic referers to mimic real users.
☑ Includes optional random delays to avoid bot detection.
⋙ Optimized data storage and export:
☑ Implements lock mechanisms to ensure data integrity.
☑ Periodically saves data in batches to avoid memory overload.
☑ Merges and compresses CSV files automatically for efficient storage management.
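For illustration, here is a stripped-down sketch of the asynchronous fetching pattern described above, not the full production script. The URL, concurrency limits, and helper names are placeholders:

```python
import asyncio
import random

import aiohttp
from fake_useragent import UserAgent

MAX_RETRIES = 3
MAX_CONCURRENT = 5  # cap concurrent requests to limit server load

ua = UserAgent()
semaphore = asyncio.Semaphore(MAX_CONCURRENT)

async def fetch(session: aiohttp.ClientSession, url: str) -> str | None:
    """Fetch one page with a random User-Agent, retries, and backoff."""
    for attempt in range(MAX_RETRIES):
        async with semaphore:
            try:
                headers = {"User-Agent": ua.random}
                async with session.get(
                    url, headers=headers, timeout=aiohttp.ClientTimeout(total=15)
                ) as resp:
                    if resp.status == 403:  # likely ban or captcha: log and skip
                        print(f"Blocked (403): {url}")
                        return None
                    resp.raise_for_status()
                    return await resp.text()
            except (aiohttp.ClientError, asyncio.TimeoutError):
                # Exponential backoff with jitter before the next attempt.
                await asyncio.sleep(2 ** attempt + random.random())
    print(f"Giving up after {MAX_RETRIES} attempts: {url}")
    return None

async def main(urls: list[str]) -> list[str | None]:
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls))

if __name__ == "__main__":
    pages = asyncio.run(main(["https://www.metacritic.com/browse/game/"]))
```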
Text / Data Mining
Uncover hidden patterns and valuable insights from unstructured data
During my Data Analyst study project, I collected user review comments via web scraping for each game/platform pair in the original dataset, focusing on games with at least 50 user comments.
I prioritized the oldest comments and limited the collection to a maximum of 500 comments per game, resulting in a 390MB CSV file.
Here is a sample of the gathered data:
Have a look at this example: Darkest Dungeon.
Each quote is associated with a score from 0 to 10. Metacritic then categorizes each score as a positive, mixed, or negative review as follows:
Scores in [0-4] are counted as negative.
Scores in [5-7] are counted as mixed.
Scores in [8-10] are counted as positive.
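Expressed as a small Python helper (a sketch of the same rule):

```python
def categorize(score: int) -> str:
    """Metacritic-style bucketing of a 0-10 user score."""
    if score <= 4:
        return "negative"
    if score <= 7:
        return "mixed"
    return "positive"
```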
📋 We then obtain this summary:
Our goal was to predict the category label (our target) associated with each user comment (the feature).
⋙ Here is the technical pipeline:
Keep only the English quotes, using the Python langdetect module.
Create a new 'Sentiment' column by applying the score bins above: Sentiment = -1 for negative, 0 for mixed, 1 for positive.
Clean the text: lowercase it and remove noise characters.
Tokenize into words, then filter out stop words, using functions from the Python nltk module.
Apply word lemmatization.
Apply the TF-IDF method.
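A condensed sketch of this pipeline, assuming a user_comments.csv file with Quote and Score columns (hypothetical names), could look like this:

```python
# Run nltk.download("punkt"), ("stopwords") and ("wordnet") once beforehand.
import pandas as pd
from langdetect import detect, LangDetectException
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

def is_english(text: str) -> bool:
    try:
        return detect(text) == "en"
    except LangDetectException:
        return False

df = pd.read_csv("user_comments.csv")                # hypothetical file
df = df[df["Quote"].astype(str).map(is_english)]     # keep English quotes only
df["Sentiment"] = df["Score"].map(lambda s: -1 if s <= 4 else (0 if s <= 7 else 1))

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text: str) -> str:
    # Lowercase, drop non-alphabetic tokens and stop words, then lemmatize.
    tokens = [t for t in word_tokenize(text.lower()) if t.isalpha() and t not in stop_words]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)

X = TfidfVectorizer(max_features=20000).fit_transform(df["Quote"].map(preprocess))
y = df["Sentiment"]
```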
Since the label distribution across the 132,083 quotes was imbalanced, I applied SMOTETomek, which combines SMOTE over-sampling with Tomek-links under-sampling, to balance the data.
Label count plot (raw data)
Label count plot (over-sampled data)
I then fitted GradientBoostingClassifier models from the scikit-learn Python library, with and without the re-sampling step, and compared the resulting predictions.
Prediction confusion matrix (raw data)
Prediction confusion matrix (over-sampled data)
We can clearly observe an improvement in predictions, particularly for negative and mixed sentiments, albeit at the cost of slightly reduced accuracy in positive sentiment predictions.
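For reference, this comparison can be sketched with imbalanced-learn and scikit-learn, reusing the X and y produced by the TF-IDF step above; the parameters are illustrative, not the tuned project values:

```python
from imblearn.combine import SMOTETomek
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline: fit on the raw, imbalanced training data.
baseline = GradientBoostingClassifier().fit(X_train, y_train)

# Re-sampled: balance the training set only, never the test set.
X_res, y_res = SMOTETomek(random_state=42).fit_resample(X_train, y_train)
resampled = GradientBoostingClassifier().fit(X_res, y_res)

for name, model in [("raw", baseline), ("over-sampled", resampled)]:
    print(name)
    print(classification_report(y_test, model.predict(X_test)))
```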
We finally used those sentiment predictions in our video game sales model, with input features such as:
Publisher
Year of release
Platform
Genre
Developer
Rate
Platform Type
User comments
⬇️ 👁🗨 Streamlit demo of the complete project ⬇️
Machine Learning
Build predictive models and uncover data-driven insights with AI
The Data Analyst training program at Datascientest includes courses on both supervised and unsupervised Machine Learning.
The study project allowed us to apply our knowledge to a real-world case. My project focused on forecasting video game sales based on qualitative features, using a dataset from Kaggle.
The qualitative features for predictions are:
Platform
Year
Genre
Publisher
And the chosen target is Global_Sales.
⋙ After data exploration, cleaning, and first tests with classic models, we found that we got better results when the target was processed with the Box-Cox transform, which aims to make the data more Gaussian-like.
Quantile-quantile plot (raw data)
Quantile-quantile plot (Box-Cox transformed data)
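In code, the transformation boils down to a few scipy calls. The Global_Sales column comes from the dataset, assumed loaded in a dataframe df; the small shift constant is an assumption, since Box-Cox requires strictly positive values:

```python
from scipy.special import inv_boxcox
from scipy.stats import boxcox

y = df["Global_Sales"].to_numpy()

# Fit the transform; boxcox returns the transformed values and the
# fitted lambda needed to invert the transform later.
y_bc, lmbda = boxcox(y + 1e-3)  # shift because some sales are exactly 0

# After predicting in the transformed space, map back to sales units:
y_pred = inv_boxcox(y_bc, lmbda) - 1e-3
```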
⋙ We preferred encoding the qualitative features with a Binary Encoder rather than a standard OneHotEncoder, since binary encoding generates far fewer columns for high-cardinality features.
⋙ We determined that XGBoost's XGBRegressor gave good results on our data, but there was still room for improvement.
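Put together, the encoding and the model fit into a short scikit-learn pipeline, reusing df and the Box-Cox target y_bc from the sketch above; the hyperparameters are illustrative, not the tuned project values:

```python
import category_encoders as ce
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBRegressor

features = ["Platform", "Year", "Genre", "Publisher"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], y_bc, test_size=0.2, random_state=42
)

# Binary-encode the categorical columns; Year passes through unchanged.
model = make_pipeline(
    ce.BinaryEncoder(cols=["Platform", "Genre", "Publisher"]),
    XGBRegressor(n_estimators=500, learning_rate=0.05),
)
model.fit(X_train, y_train)
print("R² on the Box-Cox-transformed target:", model.score(X_test, y_test))
```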
To close that gap, we integrated the scraped data into our dataset, incorporating quantitative features such as user scores, the number of positive, mixed, and negative reviews per game, as well as critic scores and their respective review counts.
This enrichment improved prediction accuracy by 18 percentage points, from 44% to 62%.
⬇️ Please have a look at our full project's Streamlit demo ⬇️