Yanis NASSI

BELLABEAT CASE STUDY

SCENARIO

As a junior data analyst, I'm recently working for a company named Bellabeat. Bellabeat is an high-tech manufacturer of health-focused smart devices for women founded in 2013 by Urška Sršen & Sando Mur. Here are some examples of bellabeat products :

BUSINESS TASK

To better understand their customer needs and the smart devices market, Urška Sršen asked me to conduct an analysis. We want to identify the trends in smart device usage, see how these trends can apply to Bellabeat customers and help Bellabeat build a data driven marketing strategy. I'll have a week to conduct the analysis and share it with the Bella beat marketing analytics team. But I'll force myself to do it in 3 days without having to lower the quality of my study just to see how I would react to tight deadlines.

ASK

To define what kind of trend we want to identify in smart devices usage, we will try to answer a few questions :

Id love to ask more question to Urška on the data she provided me to clarify some points. In a real life situation I would reach out to my supervisors to be sure I'm heading the good way with my analysis.

GET THE DATA READY

In the case study, a Kaggle Dataset is provided containing data from 2016 on 30 random Fitbit Users. First, we will perform The ROCC test to see if we can trust the dataset :

RELIABILITY issue : The dataset provided is already huge, there are approximately 2 millions observations by table and there are 7 tables by period for 2 periods of time. For technical reasons it will be difficult to deal with a bigger dataset on my personal laptop . As it is a Case study we will perform the analysis on this sample size but the recommendations i’ll do couldn't really apply in a real life scenario.

Source is not really UNDERSTANDABLE: I'll try to make assumptions on the data.

Source is OUTDATED : As it is a Case study we will still perform the analysis on this outdated data but the recommendations i’ll do couldn't really apply in a real life scenario.

Fitness data about users seems to be really hard to find as there might be some confidentiality issues in sharing it publicly. I searched BigQuery public datasets and fitabase to find better data and couldn't find it for free. As this case study is provided by google (a trustworthy source) and that i can't just “give up”, i'll take this case study as a challenge to make sense out of this dataset by making assumptions on the data. It can also be seen as an opportunity to do an analysis on Data inconsistency and explain why this dataset is a really bad one which should never be used to make data driven decisions.

The Kaggle dataset is organized in 2 folders containing the same type of tables for 2 months in 2016 (march and april).

We have 3 types of data :

  1. The data that is continuously recorded by the device, there is data as long as the user is wearing the device. We have different types of metrics (Calories, Sleep, MET, Intensity, Steps)

    There are different files for the same metrics (minutes, hour and day) . I'll import just the metrics by minutes for Calories Sleep and Intensity.

    • Calories : Calories burnt by minute. (continuous entire number)
    • Intensity : Intensity of the activity made by the user by minute. (0,1,2,3)
    • Sleep : There is a record if the user is asleep (1,2 or 3 for the recovering quality of sleep)

    We have the same tables for April and March

  2. The user logs into the app to record activity himself, he asks for a report on this day or he just opens the app which automatically generates a record for the day. There is just one table by month called DailyActivity (by day).

  3. The user enters the data himself on the app. Also one table by month called WeightLogInfo. (We will discuss this point later on the analysis part)

I came up with this classification because we have way more data in the table "Continuous_record" than “DailyActivity”. WeightLogInfo is the one with the least data (maybe because it requires more effort from the user). A table called SleepDay is also available only for the month of april. We use the table to know how many people track their sleep on the total population.

DATA CHOICE EXPLANATION

PROCESS

We will use BigQuery and Tableau to perform this analysis.

As I was trying to import data to BigQuery I noticed there was some invalid format in the dates. I tried to automatise the cleaning on Google Sheets but couldn't manage to switch AM/PM hours to 24h based timestamp. I tried to clean it on RStudio and finally succeeded in setting the appropriate timestamp for my data. Here is the process :

 ## Charger les bibliothèques nécessaires
   library(dplyr)
   library(lubridate)
                
## Charger les données
   data <- read.csv("/Users/msayn47painter/Desktop/CASE_STUDY_Bellabeat
                     /Bellabeat_data/April_data
                     /Fitabase Data 4.12.16-5.12.16/weightLogInfo_merged.csv")
                
## Vérifier que la colonne est bien en texte
    data$Date <- as.character(data$Date)
                
## Convertir les dates en format Date-Time avec heure en 24h
    data$Date <- mdy_hms(data$Date)
                
## Forcer l'affichage des heures, même pour minuit (00:00:00)
    data$Date <- strftime(data$Date, format="%Y-%m-%d %H:%M:%S")
                
## Sauvegarder le fichier nettoyé en CSV
    write.csv(data, "/Users/msayn47painter/Desktop/CASE_STUDY_Bellabeat/Bellabeat_data/April_data
                     /CleanedData_april/WeightLogInfo_april.csv", row.names = FALSE, quote = FALSE)
                

I did the same operation for all tables.

Now we have to organize the data to better understand it and clean it. First I'll group these 3 tables (Sleep, Intensity and Calories) in one with the following columns

Id : fitbit user id
Activity minute :Minute in which record happens
Calories :Calories burnt at this time
Intensity :0 if the user is awake 1 2 3 he or she sleeps.

As it is more complete, We'll use the data by minutes tables and use it to create a table by hours: Continuous_record (minutes) and Continuous_hours

-- march

            CREATE TABLE `nimble-mode-447321-p1.bellabeat.Continuous_record_march` AS
            WITH 
                calories AS (
                    SELECT 
                        ActivityMinute AS date,  
                        Calories,
                        id  -- Ajout de la colonne id de Calories_march
                    FROM `nimble-mode-447321-p1.bellabeat.Calories_march`
                ),
                sleep AS (
                    SELECT 
                        date,  
                        Value 
                    FROM `nimble-mode-447321-p1.bellabeat.Sleep_march`
                ),
                intensities AS (
                    SELECT 
                        ActivityMinute AS date,  
                        Intensity 
                    FROM `nimble-mode-447321-p1.bellabeat.Intensities_march`
                )
            SELECT 
                c.date AS ActivityMinute, 
                c.id,  
                MAX(c.Calories) AS Calories,  
                COALESCE(MAX(s.Value), 0) AS Sleep_Value, 
                MAX(i.Intensity) AS Intensity  
            FROM calories c
            LEFT JOIN sleep s ON c.date = s.date  
            LEFT JOIN intensities i ON c.date = i.date  
            GROUP BY c.date, c.id
            
-- april 
            
            CREATE TABLE `nimble-mode-447321-p1.bellabeat.Continuous_record_april` AS
            WITH 
                calories AS (
                    SELECT 
                        ActivityMinute AS date,  
                        Calories,
                        id  
                    FROM `nimble-mode-447321-p1.bellabeat.Calories_april`
                ),
                sleep AS (
                    SELECT 
                        date,  
                        Value 
                    FROM `nimble-mode-447321-p1.bellabeat.Sleep_april`
                ),
                intensities AS (
                    SELECT 
                        ActivityMinute AS date,  
                        Intensity 
                    FROM `nimble-mode-447321-p1.bellabeat.Intensities_april`
                )
            SELECT 
                c.date AS ActivityMinute, 
                c.id,  
                MAX(c.Calories) AS Calories, 
                COALESCE(MAX(s.Value), 0) AS Sleep_Value,  
                MAX(i.Intensity) AS Intensity  
            FROM calories c
            LEFT JOIN sleep s ON c.date = s.date  
            LEFT JOIN intensities i ON c.date = i.date  
            GROUP BY c.date, c.id

We will also use the table DailyActivity (daily Data)

Id : Fitbit user id
Activity Date :Date in which record happens
TotalSteps :Total steps the day of the record
TotalDistance :Total distance traveled (Fitbit and logged activity)
TrackerDistance :Total distance traveled recorded by device
LoggedActivitiesDistance :Total distance entered by the user manually
VeryActiveDistance :Total distance being very Active (Running or sport)
ModeratelyActiveDst:Total distance being moderately active (light sport session, manual work, walking fast)
LightActiveDistance:Total distance being lightly active (Cooking, walking,groceries family meeting or light activities)
VeryActiveMinutes :Total minutes being Very active
FailyActiveMinutes :Total minutes being Fairly active
LightlyActiveMinutes:Total minutes being Lightly active
SedentaryMinutes :Total minutes being Sedentary
Calories :Total calories burnt during the day

And the table WeighLogInfo (Daily Data) :

Id : Fitbit user id
Date :Date of record
WeightKg :Total steps the day of the record
WeightPounds :Total distance traveled (Fitbit and logged activity)
Fat :fat index
BMI :Body Mass Index
IsManualReport :TRUE/FALSE

And Finally the table SleepDay :

Id : Fitbit user id
SleepDay :Day of record
TotalSleepRecords :Number of sleep record
TotalMinutesAsleep :Total minutes being Asleep
TotalMintesInBed :Total minutes being in Bed

DATA CLEANING

DATA MERGING :

First it is important to merge the data from both monthes to have all the data in the same table


-- Continuous record
INSERT INTO `nimble-mode-447321-p1.bellabeat.Continuous_record` 
(id, ActivityMinute, Calories, Sleep_Value, Intensity)
SELECT id, ActivityMinute, Calories, Sleep_Value, Intensity
FROM `nimble-mode-447321-p1.bellabeat.Continuous_record_april`
UNION ALL
SELECT id, ActivityMinute, Calories, Sleep_Value, Intensity
FROM `nimble-mode-447321-p1.bellabeat.Continuous_record_march`;
    
-- Dailyactivity 
CREATE TABLE `nimble-mode-447321-p1.bellabeat.Dailyactivity` AS
SELECT * FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity_april`
UNION ALL
SELECT * FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity_march`;
    
-- WeightLogInfo 
CREATE TABLE `nimble-mode-447321-p1.bellabeat.WeightLogInfo` AS
SELECT * FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo_april`
UNION ALL
SELECT * FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo_march`;

Checking FORMATS :


SELECT column_name, data_type
FROM `nimble-mode-447321-p1.bellabeat.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'Continuous_record';

SELECT column_name, data_type
FROM `nimble-mode-447321-p1.bellabeat.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'DailyActivity';

SELECT column_name, data_type
FROM `nimble-mode-447321-p1.bellabeat.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'WeightLogInfo';

Continuous_record :

Column Format
ID INT64
Activity Minute TIMESTAMP
Calories FLOAT64
Sleep_Value INT64
Intensity INT64

DailyActivity :

Column Format
ID INT64
Activity Date DATE
TotalSteps INT64
TotalDistance FLOAT64
LoggedActivitiesDistance FLOAT64
VeryActiveDistance FLOAT64
ModeratelyActiveDistance FLOAT64
LightActiveDistance FLOAT64
SedentaryActiveDistance FLOAT64
VeryActiveMinutes INT64
LoggedActivitiesDistance INT64
VeryActiveDistance INT64
FairlyActiveMinutes INT64
SedentaryMinutes INT64
SedentaryActiveDistance INT64
Calories INT64

WeightLogInfo :

Column Format
ID INT64
Date TIMESTAMP
WeightKg FLOAT64
WeightPounds FLOAT64
Fat STRING
BMI FLOAT64
IsManualReport BOOL
LogId INT64

All the tables have a good data format (except "fat" which should be a Int64 but as the colum is empty we'll ignore this format).

Checking for NULL values :


SELECT 
    COUNT(*) AS total_rows,
    COUNTIF(id IS NULL) AS missing_id,
    COUNTIF(ActivityMinute IS NULL) AS missing_activity_minute,
    COUNTIF(Calories IS NULL) AS missing_calories,
    COUNTIF(Sleep_Value IS NULL) AS missing_sleep_value,
    COUNTIF(Intensity IS NULL) AS missing_intensity
FROM `nimble-mode-447321-p1.bellabeat.Continuous_record`;
    
SELECT 
    COUNT(*) AS total_lignes,
    COUNTIF(Id IS NULL) AS id_nulls,
    COUNTIF(ActivityHour IS NULL) AS activityhour_nulls,
    COUNTIF(Calories IS NULL) AS calories_nulls,
    COUNTIF(Sleep_Value IS NULL) AS sleep_value_nulls,
    COUNTIF(Intensity IS NULL) AS intensity_nulls
FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity`;
    
    
SELECT 
    COUNT(*) AS total_lignes,
    COUNTIF(Id IS NULL) AS id_nulls,
    COUNTIF(Date IS NULL) AS date_nulls,
    COUNTIF(WeightKg IS NULL) AS weightkg_nulls,
    COUNTIF(WeightPounds IS NULL) AS weightpounds_nulls,
    COUNTIF(Fat IS NULL) AS fat_nulls,
    COUNTIF(BMI IS NULL) AS bmi_nulls,
    COUNTIF(IsManualReport IS NULL) AS ismanualreport_nulls,
    COUNTIF(LogId IS NULL) AS logid_nulls
FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo`;

No NULL Values

Checking for ABNORMAL VALUES:


SELECT 
    COUNT(*) AS total_lignes,
    COUNTIF(Calories < 0) AS calories_negatives,
    COUNTIF(Sleep_Value NOT IN (0,1,2,3)) AS sleep_value_invalides,
    COUNTIF(Intensity NOT IN (0,1,2,3)) AS intensity_invalides
FROM `nimble-mode-447321-p1.bellabeat.Continuous_record`;
    
SELECT 
    COUNT(*) AS total_lignes,
    COUNTIF(Total_Calories < 0) AS calories_negatives,
    COUNTIF(Sleep_Value NOT IN (0,1,2,3)) AS sleep_value_invalides,
    COUNTIF(Intensity NOT IN (0,1,2,3)) AS intensity_invalides
FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo`;
    
SELECT 
    COUNT(*) AS total_lignes,
    COUNTIF(TotalSteps < 0) AS total_steps_negatifs,
    COUNTIF(TrackerDistance < 0) AS tracker_distance_negatifs,
    COUNTIF(VeryActiveDistance < 0) AS very_active_distance_negatifs,
    COUNTIF(ModeratelyActiveDistance < 0) AS moderately_active_distance_negatifs,
    COUNTIF(LightActiveDistance < 0) AS light_active_distance_negatifs,
    COUNTIF(SedentaryActiveDistance < 0) AS sedentary_active_distance_negatifs,
    COUNTIF(VeryActiveMinutes < 0) AS very_active_minutes_negatifs,
    COUNTIF(FairlyActiveMinutes < 0) AS fairly_active_minutes_negatifs,
    COUNTIF(LightlyActiveMinutes < 0) AS lightly_active_minutes_negatifs,
    COUNTIF(SedentaryMinutes < 0) AS sedentary_minutes_negatifs,
    COUNTIF(Calories < 0) AS calories_negatives
FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity`
  

There are some missing values in WeightLogInfo for "Fat", We ll keep everything and discuss it later.

Checking for DUPLICATES :


SELECT Id, ActivityMinute, COUNT(*) AS nb_occurences
    FROM `nimble-mode-447321-p1.bellabeat.Continuous_record`
    GROUP BY Id, ActivityMinute
    HAVING COUNT(*) > 1;
    
SELECT Date, id, COUNT(*) AS nb_occurences
    FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo`
    GROUP BY Date, id
    HAVING COUNT(*) > 1;
    
SELECT ActivityDate, id, COUNT(*) AS nb_occurences
    FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity`
    GROUP BY ActivityDate, id
    HAVING COUNT(*) > 1;

Deleting DUPLICATES


REPLACE TABLE `nimble-mode-447321-p1.bellabeat.Continuous_record` AS
SELECT DISTINCT
    ActivityMinute, 
    id, 
    Calories, 
    Sleep_Value, 
    Intensity
FROM `nimble-mode-447321-p1.bellabeat.Continuous_record`
GROUP BY ActivityMinute, id, Calories, Sleep_Value, Intensity;
        
REPLACE TABLE `nimble-mode-447321-p1.bellabeat.DailyActivity` AS
SELECT DISTINCT 
    Id, 
    ActivityDate, 
    TotalSteps, 
    TrackerDistance, 
    LoggedActivitiesDistance, 
    VeryActiveDistance, 
    ModeratelyActiveDistance, 
    LightActiveDistance, 
    SedentaryActiveDistance, 
    VeryActiveMinutes, 
    FairlyActiveMinutes, 
    LightlyActiveMinutes, 
    SedentaryMinutes, 
    Calories
FROM `nimble-mode-447321-p1.bellabeat.DailyActivity`;
        
REPLACE TABLE `nimble-mode-447321-p1.bellabeat.WeightLogInfo` AS
SELECT DISTINCT 
    Id, 
    Date, 
    WeightKg, 
    WeightPounds, 
    Fat, 
    BMI, 
    IsManualReport, 
    LogId
FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo`;
        

Checking for DATA INCONSISTENCY :



        SELECT DISTINCT id
        FROM `nimble-mode-447321-p1.bellabeat.Continuous_record` 

        SELECT DISTINCT id
        FROM `nimble-mode-447321-p1.bellabeat.Continuous_hours`

        SELECT DISTINCT id
        FROM `nimble-mode-447321-p1.bellabeat.Dailyactivity`

        SELECT DISTINCT id
        FROM `nimble-mode-447321-p1.bellabeat.WeightLogInfo`

        SELECT DISTINCT id
        FROM `nimble-mode-447321-p1.bellabeat.SleepDay`
    

Users :

35 users in Continuous_record 35 users in DailyActivity 24 users in SleepDay 13 users in WeightLogInfo

Generating MORE DATA :

We can now generate more data from our cleaned Tables. First we will create a table by hour from the table "Continuous_record" :


-- Continuous record en heures

CREATE TABLE `nimble-mode-447321-p1.bellabeat.Continuous_hours` AS
WITH aggregated_data AS (
    SELECT
        EXTRACT(DATE FROM ActivityMinute) AS Date,  -- Extraire la date
        EXTRACT(HOUR FROM ActivityMinute) AS Hour,  -- Extraire l'heure
        id,
        ROUND(Calories) AS Calories,  -- Arrondir les calories avant d'agréger
        Sleep_Value,
        Intensity
    FROM `nimble-mode-447321-p1.bellabeat.Continuous_record`
)
SELECT
    Date,
    Hour,
    id,
    SUM(Calories) AS Total_Calories,  -- Somme des calories pour chaque heure
    ROUND(AVG(Sleep_Value)) AS Sleep_Value,  -- Moyenne arrondie pour Sleep_Value
    ROUND(AVG(Intensity)) AS Intensity  -- Moyenne arrondie pour Intensity
FROM aggregated_data
GROUP BY
    Date, Hour, id
ORDER BY
    Date, Hour, id;

We'll call this table "Continuous_hours"

Then, We're will create new columns for our tables.

We can add A column to track the day of the week in "DailyActivity" :


REPLACE TABLE `nimble-mode-447321-p1.bellabeat.DailyActivity` AS
SELECT 
    Id,
    ActivityDate,
    EXTRACT(DAYOFWEEK FROM ActivityDate) AS DayOfWeek,
    TotalSteps,
    TrackerDistance,
    LoggedActivities,
    VeryActiveDistance,
    ModeratelyActiveDistance,
    LightActiveDistance,
    SedentaryActiveDistance,
    VeryActiveMinutes,
    FairlyActiveMinutes,
    LightlyActiveMinutes,
    SedentaryMinutes,
    Calories
FROM 
`nimble-mode-447321-p1.bellabeat.DailyActivity`;

We can add a new column to know the DayoftheWeek in "Continuous_hours".


REPLACE TABLE `nimble-mode-447321-p1.bellabeat.Continuous_hours` AS
SELECT 
    Date,
    Hour,
    id,
    EXTRACT(DAYOFWEEK FROM Date) AS DayOfWeek,
    Total_Calories,
    Sleep_Value,
    Intensity
FROM 
    `nimble-mode-447321-p1.bellabeat.Continuous_hours`

We will do the same for each table.

To have hints about users Inactivity we can create a table from "Continuous_hours".

WITH
ordered_data AS (
    SELECT
    id,
    Date,
    Hour,
    LAG(TIMESTAMP(CONCAT(CAST(Date AS STRING), ' ', CAST(Hour AS STRING), ':00:00'))) 
        OVER (PARTITION BY id ORDER BY Date, Hour) AS previous_activity,
    TIMESTAMP_DIFF(
        TIMESTAMP(CONCAT(CAST(Date AS STRING), ' ', CAST(Hour AS STRING), ':00:00')), 
        LAG(TIMESTAMP(CONCAT(CAST(Date AS STRING), ' ', CAST(Hour AS STRING), ':00:00'))) 
        OVER (PARTITION BY id ORDER BY Date, Hour),
        HOUR
    ) AS time_diff
    FROM
    `nimble-mode-447321-p1.bellabeat.Continuous_hours`
),
inactivity_periods AS (
    SELECT
    id,
    previous_activity AS inactivity_start,
    TIMESTAMP(CONCAT(CAST(Date AS STRING), ' ', CAST(Hour AS STRING), ':00:00')) AS inactivity_end,
    TIMESTAMP_DIFF(
        TIMESTAMP(CONCAT(CAST(Date AS STRING), ' ', CAST(Hour AS STRING), ':00:00')), 
        previous_activity, HOUR
    ) AS inactivity_duration
    FROM
    ordered_data
    WHERE
    time_diff > 1  -- Si l'écart entre les horaires est supérieur à 1 heure
)
SELECT
id,
inactivity_start,
inactivity_end,
inactivity_duration
FROM
inactivity_periods
ORDER BY
id, inactivity_start;

We'll call this table "Inactivity_continuous" it shows the periods of time that has more than 1 hour between 2 records.

We have a "Sedentary" timespan in "DailyActivity". We ll add a table to track time spent Sedentary but not in bed. The table "SleepDay" is only available for April so we will do an average for time in bed by user and substract it from "SedentaryTime".

WITH AverageTimeInBed AS (
    SELECT 
        Id, 
        AVG(TotalTimeInBed) AS AvgTimeInBed
    FROM 
        nimble-mode-447321-p1.bellabeat.SleepDay
    GROUP BY 
        Id
)

SELECT 
    da.*, -- Sélectionne toutes les colonnes de DailyActivity
    COALESCE(atb.AvgTimeInBed, 0) AS AvgTimeInBed, 
    da.SedentaryMinutes - COALESCE(atb.AvgTimeInBed, 0) AS Sedentary_notinbed 
FROM 
    nimble-mode-447321-p1.bellabeat.DailyActivity da
LEFT JOIN 
    AverageTimeInBed atb
    ON da.Id = atb.Id;

Now we have to check for negtive values in this table.

SELECT *
FROM nimble-mode-447321-p1.bellabeat.DailySedentary_notinbed
WHERE Sedentary_notinbed < 0;

There are 61 records with negative values on a 1397 total records (4%). User 6117666160 combine 12 of these 61 records. Lets now check for time spent sedentary but not Asleep :

WITH AverageMinutesAsleep AS (
    SELECT 
        Id, 
        AVG(TotalMinutesAsleep) AS AvgMinutesAsleep
    FROM 
        nimble-mode-447321-p1.bellabeat.SleepDay
    GROUP BY 
        Id
)        
SELECT 
    da.*, -- Select all columns from DailyActivity
    COALESCE(ams.AvgMinutesAsleep, 0) AS AvgMinutesAsleep, 
    da.SedentaryMinutes - COALESCE(ams.AvgMinutesAsleep, 0) AS Sedentary_notAsleep
FROM 
    nimble-mode-447321-p1.bellabeat.DailyActivity da
LEFT JOIN 
    AverageMinutesAsleep ams
    ON da.Id = ams.Id; 

We still have 47 negative values in column "Sedentary_notAsleep" for this table. We ll try to give an explanation for that later.

One last table will be created to check for record "Hight Intensity Activity" in Continuous_hours without record in "DailyActivity" (to know what is the share of people who do sports without tracking it in the App

SELECT DISTINCT ch.Id, ch.Date
FROM nimble-mode-447321-p1.bellabeat.Continuous_hours ch
LEFT JOIN nimble-mode-447321-p1.bellabeat.DailyActivity da 
    ON ch.Id = da.Id AND ch.Date = da.ActivityDate
WHERE ch.Intensity = 3
  AND da.Id IS NULL;

ANALYSIS

Now we can enter the most exciting part of this study : the Analysis. To better understand our data, we’ll use Tableau to make visualizations. Data patterns and trends will be more accessible that way. Unfortunately I'm using Tableau public and there is no way I can directly link Tableau to Big Query. I have no choice but to download the tables from BigQuery through Google Cloud Storage as CSV files and upload it to Tableau. For security reasons i wont put the Query containing my google bucket name publicly on this website.

INACTIVITY ANALYSIS

This Dashboard as been created from the table "Continuous_record". Several similarities and differences emerge among the users :

Data Inconsistency : There should have been a recurrent inactivity period to charge the device for each user

SLEEP ANALYSIS

This analysis represented a challenge as there is multiples sources of data for sleeping periods. First, we have to figure out which table we want to use : "Continuous_hours" or "SleepDay" ?

Continuous hours is a more complete table with more data so we'll start to analyse this table.

There are big Inconsistencies in the table "Continuous_hours".

First, according to the data, users appear to have spent more time asleep than awake, which does not align with normal human habits. The average sleep time per day and per user is around 10 hours/day. Additionally, the sleep hours are inconsistent, showing a peak at 5:00 PM for users. I have checked the data for duplicates or other issues, but nothing unusual was found. However, the results remain inconsistent. Ive lready noticed that there was a majority of "1" which means "Asleep" in the data which made no sense, at first, there was only 1 for Awake, 2 for restless Sleep and 3 Awakening. I added the 0 for "awake" thinking awake and Asleep would balance the sleeping patterns in my data but it remains inconsistent.

So We'll analyse "SleepDay" table instead :

The data in the SleepDay table makes much more sense

CALORIES BURNT

The data from the "DailyActivity" and "Continuous_hours" tables show similar trends. Users who burn the most calories per hour (from "Continuous_hours") also tend to burn the most calories per day(from "DailyActivity"). The Weekdays count show differents trends but we will try to rationnally explain this difference.

Calories burnt from "Continuous_hours" :

Calories burnt from "DailyActivity" :

First, given the differences, it’s important to decide which variables (from “Continuous_hours” or “DailyActivity”) will be used to analyze calories burned by day of the week. In “DailyActivity,” the total number of calories burned is highest on Wednesdays. However, the average calories burned per day show that Sunday is the day users burn the most on average. If users log workouts through the app themselves for "DailyActivity", this could explain the difference. Wednesdays might be when users record the most activities, leading to higher average calories burned per hour in “Continuous_hours". We will focus on total calories burnt by day (from "DailyActivity") and average calories burnt by hour (from Continuous-hours) for this analysis.

Here is what data on calories tells us :

INTENSITY OF ACTIVITIES

This has been the trickiest part of the analysis considering that we have a lot of data that are not telling the same story in the differents tables, also, the results of this analysis needs to match the results of the "CALORIES BURNT" category to really make sense.

Intensities from "Continuous-hours".

Intensities from "Continuous-hours" : 2nd table.

Intensities from "DailyActivity".

Now I made a few Dashboards to show differences

Differences : Time spent in the different intensity levels :

Differences : Time spent Very Active by user

Differences : Time Spent Sedentary

Differencies : Weekday

It seems like in DailyActivity each time there is a record, the record is also made during night (when people are asleep) so I ll try to explain the differencies in intensity levels using "Sedentary_NotAsleep" instead of "Sedentary"

Intensities conclusions :

Inconsistencies in "Continuous_hours" :

We have no choice but to focus our analysis on the "DailyActivity" table, here is what data tells us :

Considering the inconsistencies in the "Continuous_hours" table, we won't use the table "high Intensity not recorded" to track high intensity activities not recorded in Daily-Activity. We can still use a query to check which users have a record in "DailyActivity" each time an high intensity level is recorded in "Continuous_hours" :

WITH AverageTimeInBed AS (
SELECT 
    Id, 
    AVG(TotalTimeInBed) AS AvgTimeInBed
FROM 
    nimble-mode-447321-p1.bellabeat.SleepDay
GROUP BY 
    Id
)
SELECT 
    da.*, -- Sélectionne toutes les colonnes de DailyActivity
    COALESCE(atb.AvgTimeInBed, 0) AS AvgTimeInBed, 
    da.SedentaryMinutes - COALESCE(atb.AvgTimeInBed, 0) AS Sedentary_notinbed 
FROM 
    nimble-mode-447321-p1.bellabeat.DailyActivity da
LEFT JOIN 
    AverageTimeInBed atb
    ON da.Id = atb.Id;

Here are the users that have a record in "DailyActivity" each time they have an high intensity activity recorded in "Continuous_hours" :

4020332650

4057192912

4388161847

2891001357

There is 11% of diligent users that record their activity each time they do sport.

STEP COUNT ANALYSIS

There is a step count only in the table "DailyActivity" which makes the data source choice easier for this variable.

Here is what data tells us about users walking habits :

WEIGHT ANALYSIS

Table used here is "WeightLogInfo" (Reminder that this table counts only 13 users):

Here is what data tells us about users weight :

Weightlog info activity

We'll also analyse number of times users recorded their weight and when.

COMPARISONS

Calories burnt & Weight :

"Very Active" & Weight :

Steps & Weight :

Sleep & Weight :

There is no correlation between Sleep and Weight or the correlation is not really significant.

Records analysis :

First, lets remember how many users there is in each table.

Now lets analyse recordings in these tables :

INTERPRETATIONS

INACTIVITY :

SLEEP :

CALORIES BURNT :

INTENSITY LEVELS :

STEP COUNT :

WEIGHT :

RECORDS :

RECOMMANDATIONS

It is important to remind that this dataset didn't passed our ROCCC test, the recommandations i will make here are based on this unreliable dataset. In my point of view this case study is not really about the outcome (making a real data based strategy for Bellabeat), but more a way to show our analytical thinking process. We will still make recommandation based on this dataset but it is risky to use them in a real life scenario. I tried to explore all the data that was available in this dataset trying to make sense out of it. I might have lost a bit of time analysing "Continuous_hours" but in my point of view i had to check for inconsistencies to make sure i wont use this data. This was also an interesting study to spot inconsistencies in data. I wasn't expecting a dataset from Google analytics to be so messy at the beginning. It have been a bit frustrating but i think that being able to spot inconsistency are even more important in data analysis than the analysis itself (because if analysis is made with unreliable data and that a strategy is based on this analysis, it can lead to huge loss for a company.) In a real life scenario i would have asked my supervisor for budget to acces more data, I would have pushed my analysis further asking "why" to each result and maybe use some more quantitative analysis tools.

INACTIVITY :

SLEEP :

CALORIES :

INTENSITY :

STEPS :

WEIGHT :