Cyclistic : Google Data Analytics Case Study
Business Task (Ask Phase) :
The Cyclistic, a bike share company has 2 sets of customers, Casual riders who opt for single -ride passes or full-day passes and Annual members who opt for annual membership. The finance analysts have identified that annual membership will be much more profitable than casual riders. The Business task here is to build a Marketing strategy to convert casual riders to annual membership, through which the company believes the revenue will increase better than capturing new customers. In order to achieve this first we need to identify how the casual riders are different from the annual members? What factor/ factors would influence casual riders to become annual members?
The key stakeholders here are the “Lily Moreno - Director of Marketing” & “Cyclistic executive team”, who will approve the recommendations.
Prepare Phase:
The data is provided by Motivate International Inc. with the license making the data available to the public in their server. The data is provided in the CSV format, one file per month for the previous 12 months. Customer personal information is secured without making it public..
First I checked the file size to make sure every file for the previous 12 months has the data in it. Ideally if the file has data it should be in MBs. if the file says 1KB or 2KB then data could be missing.
I notice that the data is current, until the previous month.
All the columns in the CSV file are relevant to the analysis.
Not every ride has a Start and End Station Name & ID.
All the 12 CSV files have the same column structure and I will use SQL to merge all the files into 1 table (Table Name: [dbo].[Cyclistic_Case_Study]). Added Month and Year columns to represent each CSV file.
Considering Ride_id as a unique ID, I noticed that only for Month Nov 2020 and Dec 2020 Ride_Id repeats. And also the difference between start and end date for Dec2020 is wrong.
I will make sure I will not lose the Original dataset before cleaning the data. I will place all the Original downloads in a secure folder with proper naming convention (YYYYMM_divvy_tripdata).
I feel the price per Ride and the Ride Description is missing in this data set to help with more realistic analysis.
Process Phase:
In order to process/ Clean the data I will be using SQL (Since the data is huge and SQL is a better option to work with a huge dataset), which I am hands On and much familiar with.
There are 13 columns imported to SQL from the CSV and added a few additional columns required for analysis.
Ride_id - This is the Unique Identifier representing each Ride.
Rideable_type - Type of bicycle used
Started_at - Date & Time of the ride start
Ended_at - Date & Time of the ride End
Start_station_name - Ride from where it was started
Start_station_id - An Id Representing each Start Station
end_station_name - Ride from where it was Ended
end_station_id - An Id Representing each End Station
Start_lat - Latitude of the Start Station
Start_lng - Longitude of Start Station
End_lat - Latitude of Start Station
End_lng - Longitude of End Station
Member_casual - Identify if the Customer is Casual or Member
Columns added for Analysis
Year - Year of Ride
Month - Month of Ride
Ride_Length_Minutes - Difference between Start Date and End Date in Minutes
Start_Day_Of_Week - Start Date Week Name
End_Day_Of_Week - End Date Week Name
Data Cleaning steps:
I will first exclude the records where the Start date is greater than the End Date. Start Date should always be less than the End Date.With this Duplicate Ride_ID is cleaned (209 RIde_Id’s).
SELECT * INTO [dbo].[Cyclistic_Case_Study_Clean] FROM [dbo].[Cyclistic_Case_Study] WHERE CAST([Started_at] AS DATETIME)<CAST([ended_at] AS DATETIME)
For all practical purposes I will exclude the rides <= 60 seconds Ride Length, assuming these could be a trial ride or shifting bicycle from one docking station to another docking station.
DELETE FROM [dbo].[Cyclistic_Case_Study_Clean] WHERE Ride_Length_seconds <=60
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Ride_Length_Minutes INT
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Ride_Length_Seconds INT
UPDATE A
SET Ride_Length_Minutes = DATEDIFF(minute,[Started_at],[ended_at])
FROM [dbo].[Cyclistic_Case_Study_Clean] A
UPDATE A
SET Ride_Length_Seconds = DATEDIFF(second,[Started_at],[ended_at])
FROM [dbo].[Cyclistic_Case_Study_Clean] A
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Start_Day_Of_Week VARCHAR(100)
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD End_Day_Of_Week VARCHAR(100)
UPDATE A
SET Start_Day_Of_Week = DATENAME(WEEKDAY,[Started_at]),
End_Day_Of_Week= DATENAME(WEEKDAY,[ended_at])
FROM [dbo].[Cyclistic_Case_Study_Clean] A
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Distance_Meters FLOAT
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Distance_Kms FLOAT
ALTER TABLE [dbo].[Cyclistic_Case_Study_Clean] ADD Distance_Miles FLOAT
I also observed multiple Id for Start_Station_Id & End_Station_Id. Since I am not using these ID’s for analysis purposes, for now I will not work on this. Station ID changes after Dec 2020. Looks like ID’s changed after Dec 2020.
SELECT [start_station_name],COUNT(DISTINCT [start_station_id]) FROM [dbo].[Cyclistic_Case_Study_Clean] GROUP BY [start_station_name] HAVING COUNT(DISTINCT [start_station_id])>1 --- 584
SELECT [end_station_name],COUNT(DISTINCT [end_station_id]) FROM [dbo].[Cyclistic_Case_Study_Clean] GROUP BY [end_station_name] HAVING COUNT(DISTINCT [end_station_id])>1 -- 581
Analysis Phase:
I used Tableau, SQL & Excel to do initial Analysis.
Over all there are 5,054,677 Rides in the past 12 Months and among which 2,324,949 rides are by Casual Customers and 2,729,728 are by Member Customers.
Overall the number of Rides Increase from the month of March 2021 to July 2021, which is same for Casual Customers, whereas for Members rides have increased for Aug 2021 and Sep 2021 too.
From the above tables and Charts following are the observations:
Over all, Members have a greater number of rides (8% more) than the Casual Riders.
Members have a shorter Avg duration of ride than Casual Riders.
For both Members and Casual riders Avg duration of ride is more during the weekends.
Casual Riders are more during the weekend (Saturday & Sunday) than the Members. This is observed during Apr, May, June, July, Aug, Sep of 2021.
During June, July, Aug of 2021, Casual Riders have more rides than Members.
Count of Ride for Members is almost even across all the weekdays, while it's uneven for Casual Riders, which proves that Members use the bicycle on a regular basis while Casual Riders use most during weekends for leisure/ hangouts.
Casual Riders and Members are more active from March as we can see an increase in numbers.
Share & Act Phase
I feel the price per Ride and the Ride Description is missing in this data set to help with more realistic analysis.
Comments
Post a Comment