Wangsheng's World
Citi Bike Project
July 2024 — Present
(This page is being actively updated. Some files are currently available on my GitHub page.)
Introduction
New York is a city heavily troubled by congestion, so many New Yorkers choose to commute or get around town by bike. Riding is a great way to experience a city, and visitors, who are more likely to be shocked by the awful ground transportation in this metropolis, also accept cycling as an awesome way to travel from one location to another in New York. As a major bike sharing provider, Citi Bike is a natural representative of the bike sharing environment in New York City. As a cycling lover, I would like to investigate bike sharing in New York, although the subject is interesting in its own right.
Data Preparation
Before we even start to prepare the data, we need to know what the population of interest is. To understand the landscape of bike sharing in New York City, we might consider everyone in New York City as the population. However, that population is difficult to study. Thankfully, we can look at the people who ride Citi bikes and extend the findings to the larger population with some logical reasoning. Hence, the population of interest for most of this project is the people who ride Citi bikes.
Thanks to Citi Bike, the bike sharing data is easy to access on its website, which made it much easier for me to start this project. All the raw data comes from Citi Bike's System Data. You can browse the data provided there and download trip history data for your own investigation.
The data you can download includes:
  • Ride ID
  • Rideable type - classic_bike or electric_bike
  • Started at - when the ride starts
  • Ended at - when the ride ends
  • Start station name
  • Start station ID
  • End station name
  • End station ID
  • Start latitude
  • Start longitude
  • End latitude
  • End longitude
  • Member or casual ride
Given the huge size of the datasets and the data types provided, it is not practical to use all of the data for the analysis. Statistically, it is sound to sample the data with a valid method and analyze the sample, which is much smaller. Such an analysis is expected to generalize to the whole population. Here, we are going to use Simple Random Sampling (SRS). With SRS, each individual ride is selected with equal probability. Because SRS is still subject to sampling error, we do not rule out other sampling methods as the project moves on.
The Citi Bike ride history data is divided by month, and each month's data may be split across multiple files. To simplify and automate data retrieval, sampling, and early-stage feature engineering, we are going to write some functions. The functions are intended for datasets containing the fields introduced earlier.
  • The first function retrieves monthly data and samples it. It takes the year, month, file path, and sample fraction as inputs and outputs a randomly sampled dataset.
  • The second function repeats the first function for a whole year or a list of years.
  • The third function does some basic feature engineering. It makes sure the data type of each column is ready to use and adds some features. For example, the ride duration is added by calculating the difference between "started at" and "ended at".
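The three functions described above could look roughly like the sketch below. It uses only the Python standard library. The file naming pattern, the snake_case column names (started_at, ended_at, start_lat and so on), and the function names are all assumptions for illustration, so adjust them to the files you actually download.

```python
import csv
import random
from datetime import datetime
from pathlib import Path


def sample_month(year, month, data_dir, fraction, seed=None):
    """Read all CSV files for one month and return a simple random sample."""
    # Hypothetical file naming; a month may be split across several files.
    pattern = f"{year}{month:02d}-citibike-tripdata*.csv"
    rows = []
    for path in sorted(Path(data_dir).glob(pattern)):
        with open(path, newline="", encoding="utf-8") as f:
            rows.extend(csv.DictReader(f))
    # random.sample draws without replacement, so within the month
    # every ride has the same chance of being selected.
    k = round(len(rows) * fraction)
    return random.Random(seed).sample(rows, k)


def sample_years(years, data_dir, fraction, seed=None):
    """Repeat sample_month for every month of one year or a list of years."""
    if isinstance(years, int):
        years = [years]
    sampled = []
    for year in years:
        for month in range(1, 13):
            sampled.extend(sample_month(year, month, data_dir, fraction, seed))
    return sampled


def add_features(rows):
    """Convert column types and add the ride duration in minutes."""
    out = []
    for row in rows:
        row = dict(row)  # copy so the input rows are left unchanged
        # Assumes both timestamps are present and in ISO-like format.
        row["started_at"] = datetime.fromisoformat(row["started_at"])
        row["ended_at"] = datetime.fromisoformat(row["ended_at"])
        for col in ("start_lat", "start_lng", "end_lat", "end_lng"):
            # Keep missing coordinates as None instead of failing on "".
            row[col] = float(row[col]) if row[col] else None
        delta = row["ended_at"] - row["started_at"]
        row["duration_min"] = delta.total_seconds() / 60
        out.append(row)
    return out
```

Passing a fixed seed makes the sample reproducible, which helps when the same sample needs to be rebuilt later.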
How we handle missing data depends on the context. Inspecting the missing values more carefully, we find that some rides are missing end data such as end station name, end latitude, and end longitude. It might be interesting to research this a little more: does it mean some sensors or parts are broken, or that the bike was stolen? For this project, considering the large size of the dataset, we drop all rows that have any missing values.
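As a small illustration of this step, the sketch below counts missing values per column and then drops incomplete rows. The two example rides and their values are made up, and the column names are assumed.

```python
from collections import Counter


def missing_counts(rows):
    """Count empty or None values per column."""
    counts = Counter()
    for row in rows:
        for col, value in row.items():
            if value is None or value == "":
                counts[col] += 1
    return counts


def drop_incomplete(rows):
    """Keep only rows in which every column has a value."""
    return [
        row for row in rows
        if all(v is not None and v != "" for v in row.values())
    ]


# Made-up rides: the second one has no end data.
rides = [
    {"ride_id": "a", "end_station_name": "W 21 St & 6 Ave", "end_lat": "40.74"},
    {"ride_id": "b", "end_station_name": "", "end_lat": ""},
]
print(missing_counts(rides))
print(len(drop_incomplete(rides)))
```

Checking the per-column counts first shows which fields are affected before anything is thrown away.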
Stage 1 - Analysis of July 2024
In this stage, we investigate a recent month (July 2024) to get a basic understanding of the current bike sharing environment in New York.
According to the data Citi Bike provides, there were over 4,700,000 rides in July 2024. We set the sample fraction to 0.01, which results in a random sample of about 47,000 rides. A dataset of this size is comfortable to wrangle on a personal computer. The dataset we use is here for your convenience. Since we sampled with a fraction of 0.01, every count in this part should be multiplied by 100 to estimate the full-month figure.
In Exploratory Data Analysis, plotting is a good way to understand the data. Tableau is a powerful tool that handles this task elegantly, and we are going to use it as our main plotting tool. Another strength of Tableau is that we can combine the plots we create into dynamic dashboards to obtain more insights and more interesting, meaningful findings.
We plot the ride count by day, split into member and casual riders.
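The scaling rule is simple arithmetic: a count seen in the sample is divided by the sample fraction. A tiny helper (the name estimate_total is just for illustration) makes that explicit:

```python
def estimate_total(sample_count, fraction):
    """Scale a count observed in the sample up to a full-data estimate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return round(sample_count / fraction)


# About 47,000 sampled rides at a fraction of 0.01.
print(estimate_total(47_000, 0.01))
```

The result is an estimate, not an exact count, because the sample is random.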
We find some interesting things:
  • Overall, the ride count fluctuates in a wave-like weekly cycle.
  • Overall, more people ride from Monday to Friday.
  • However, there are more casual riders on weekends than on working days.
  • Members' ride count fluctuates more strongly than that of casual riders.
A logical explanation could be that the majority of Citi Bike users use Citi bikes as transportation for their daily commute. Note that we are not saying all members use the bikes to commute. During weekends, slightly more casual riders take a Citi bike to travel around. This could be because of either more tourists or the reduced weekend operation of other public transportation, such as less frequent subway service.
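The charts here were made in Tableau, but the same daily numbers can be tabulated in a few lines of Python. The sketch below counts rides per day and rider type, then averages them over weekdays and weekends. The column names started_at and member_casual are assumed, and the three rides are made up.

```python
from collections import Counter, defaultdict
from datetime import datetime


def daily_counts(rows):
    """Count rides per (date, rider type)."""
    counts = Counter()
    for row in rows:
        day = datetime.fromisoformat(row["started_at"]).date()
        counts[(day, row["member_casual"])] += 1
    return counts


def mean_by_day_type(counts):
    """Average daily rides per rider type, split into weekday and weekend."""
    totals = defaultdict(list)
    for (day, rider), n in counts.items():
        # weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if day.weekday() >= 5 else "weekday"
        totals[(rider, kind)].append(n)
    # Days with no rides for a rider type are not included in the average.
    return {key: sum(v) / len(v) for key, v in totals.items()}


# Made-up rides: two on a Monday, one on a Saturday.
rides = [
    {"started_at": "2024-07-01 08:00:00", "member_casual": "member"},
    {"started_at": "2024-07-01 09:00:00", "member_casual": "member"},
    {"started_at": "2024-07-06 11:00:00", "member_casual": "casual"},
]
counts = daily_counts(rides)
print(mean_by_day_type(counts))
```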
What are the most popular Citi Bike stations in New York? This is an interesting question, and we can easily find the answer with this dataset.
We can clearly see that "Broadway & E 14 St" is the most popular station where people pick up a bike and "W 21 St & 6 Ave" is the most popular station where people drop off their bikes. Both stations are near the Flatiron District and Union Square, where both locals and tourists gather.
When we put these graphs together into a dashboard, we get more insight into the daily usage during the month, including the makeup of bike types across these rides.
Since the dashboard is dynamic, you may download the Tableau workbook for your own exploration.
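Ranking stations only takes a frequency count. Here is a minimal sketch with assumed column names and made-up rides. It is not the Tableau calculation itself, just the same idea in Python.

```python
from collections import Counter


def top_stations(rows, column, n=10):
    """Return the n most frequent station names in the given column."""
    counts = Counter(row[column] for row in rows if row[column])
    return counts.most_common(n)


# Made-up rides for illustration only.
rides = [
    {"start_station_name": "Broadway & E 14 St", "end_station_name": "W 21 St & 6 Ave"},
    {"start_station_name": "Broadway & E 14 St", "end_station_name": "W 21 St & 6 Ave"},
    {"start_station_name": "W 21 St & 6 Ave", "end_station_name": "Broadway & E 14 St"},
]
print(top_stations(rides, "start_station_name", n=1))
print(top_stations(rides, "end_station_name", n=1))
```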
Stage 2 - Analysis of the Year 2023
For the second stage, we analyze bike sharing on the scale of a year, using the data from 2023. As an extension of the first stage, we now have some ideas about which points are worth investigating. This is good. However, at the same time, we are expanding the scope of the analysis, and there are problems and special considerations we should pay attention to when doing so.
One especially important question before we delve into the analysis is what the sample size should be in this stage. In our earlier analysis, we sampled about 1% of the available Citi Bike ride history, which was about 47,000 rides (out of 4,700,000 for July 2024). Analyzing that dataset took an annoying amount of time. Now, for the whole year, we are considering whether to choose a smaller sample. As long as the sample is randomly selected, a smaller sample would save us a great amount of time during the analysis. Hence, if we know the result would not differ much, why not sample at a smaller size for our convenience?
The interesting things we discovered in Stage 1 are worth revisiting. Do the popular bike stations change over months or seasons? What are the most popular bike stations in the whole year (defined either as the stations that most frequently rank at the top by month, or as the most visited stations overall)?
From my inspection, regardless of which definition we use for a popular station, the most popular (most used) Citi Bike station was "W 21 St & 6 Ave" (it was a top station in January, November, and December 2023). I would like to note that for some months, multiple stations have the same ride count in this sampled dataset. In the whole dataset, however, they are more likely to have similar rather than identical counts.
(If you remember, "W 21 St & 6 Ave" was also a popular station in July 2024, as we discovered in Stage 1.)
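One way to get a feel for how much precision a smaller sample gives up is the usual normal-approximation margin of error for a proportion from a simple random sample. The sketch below is only a rough guide and not something computed in this project. It fits shares such as the fraction of member rides, and is less suitable for rare events like rides at a single station.

```python
import math


def margin_of_error(p, n, z=1.96):
    """Approximate 95% margin of error for a proportion p estimated from n rides."""
    return z * math.sqrt(p * (1 - p) / n)


# p = 0.5 is the worst case; compare a few candidate sample sizes.
for n in (47_000, 10_000, 1_000):
    print(n, round(margin_of_error(0.5, n), 4))
```

Because the margin shrinks with the square root of n, cutting the sample to a quarter only doubles the margin of error.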
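The two definitions can be compared with a short sketch like the one below. It ranks stations by the sum of their starting and ending rides, keeps ties within a month, and uses assumed column names with made-up rides.

```python
from collections import Counter, defaultdict


def monthly_top_stations(rows):
    """For each month, return the set of stations tied for the highest count."""
    per_month = defaultdict(Counter)
    for row in rows:
        month = row["started_at"][:7]  # "YYYY-MM" prefix of the timestamp
        per_month[month][row["start_station_name"]] += 1
        per_month[month][row["end_station_name"]] += 1
    tops = {}
    for month, counts in per_month.items():
        best = max(counts.values())
        tops[month] = {s for s, c in counts.items() if c == best}
    return tops


def most_frequent_monthly_top(tops):
    """Definition 1: the station that is a monthly top station most often."""
    wins = Counter(s for stations in tops.values() for s in stations)
    return wins.most_common(1)[0][0]


def overall_top(rows):
    """Definition 2: the station with the most starts plus ends overall."""
    counts = Counter()
    for row in rows:
        counts[row["start_station_name"]] += 1
        counts[row["end_station_name"]] += 1
    return counts.most_common(1)[0][0]


# Made-up rides; February ends in a tie between the two stations.
rides = [
    {"started_at": "2023-01-05 08:00:00", "start_station_name": "W 21 St & 6 Ave",
     "end_station_name": "Broadway & E 14 St"},
    {"started_at": "2023-01-06 08:00:00", "start_station_name": "W 21 St & 6 Ave",
     "end_station_name": "W 21 St & 6 Ave"},
    {"started_at": "2023-02-01 08:00:00", "start_station_name": "Broadway & E 14 St",
     "end_station_name": "W 21 St & 6 Ave"},
]
tops = monthly_top_stations(rides)
print(most_frequent_monthly_top(tops), "|", overall_top(rides))
```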
In addition, we are interested in the monthly trend of bike rides. What does it look like? What is the makeup of Citi bike users? How about the usage of different bike types?
Let's first focus on the ride trend and ignore that "ridiculous" sharp spike in February for a while. We can see that member riders ride more, but their average ride duration is shorter than that of casual riders. During the summer, people use classic bikes more, whereas more electric bikes were used during the winter. Classic bikes were more likely to be used for a much longer time: their average ride duration is higher than that of electric bikes, and the ride duration graph shows more sharp spikes for classic bikes.
Now, let's take a close look at the sharp "surge" of ride duration in week 7. Since the ride count is normal, this means some people rode for an especially long time. Inspecting the bike type, we may also come to believe that those rides were on classic bikes. This is interesting, because the data is telling us that some people rode a classic bike, which is not power-assisted, for a long time in winter. Would you consider that normal? Who would be willing to ride a bike when the temperature is low? Thus, it is quite natural to wonder whether these are incorrect data points showing up as outliers. It is difficult to confirm. But if they are correct, how interesting!
In case you want to inspect the dashboard yourself, the Tableau workbook is attached.
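To look into a surge like this outside of Tableau, we could group rides by week and bike type and then list the unusually long rides. The sketch below uses ISO week numbers, which may not match Tableau's week numbering exactly, and a one-day threshold that is an arbitrary choice for illustration. Column names are assumed and the rides are made up.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def duration_min(row):
    """Ride duration in minutes."""
    start = datetime.fromisoformat(row["started_at"])
    end = datetime.fromisoformat(row["ended_at"])
    return (end - start).total_seconds() / 60


def weekly_stats(rows):
    """Ride count and mean duration per (ISO week, bike type)."""
    groups = defaultdict(list)
    for row in rows:
        week = datetime.fromisoformat(row["started_at"]).isocalendar()[1]
        groups[(week, row["rideable_type"])].append(duration_min(row))
    return {key: (len(v), mean(v)) for key, v in groups.items()}


def long_rides(rows, threshold_min=24 * 60):
    """Rides longer than the threshold; candidates for a manual check."""
    return [row for row in rows if duration_min(row) > threshold_min]


# Made-up rides in mid-February 2023; one classic bike ride lasts two days.
rides = [
    {"started_at": "2023-02-14 08:00:00", "ended_at": "2023-02-14 08:20:00",
     "rideable_type": "electric_bike"},
    {"started_at": "2023-02-14 09:00:00", "ended_at": "2023-02-16 09:00:00",
     "rideable_type": "classic_bike"},
    {"started_at": "2023-02-15 09:00:00", "ended_at": "2023-02-15 09:10:00",
     "rideable_type": "classic_bike"},
]
stats = weekly_stats(rides)
print(stats)
print(len(long_rides(rides)))
```

A single very long ride is enough to pull the weekly mean up sharply while the ride count stays normal, which is the pattern described above.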
As an extension of the investigation of popular Citi Bike stations, it might be interesting to dive a little deeper and look for popular routes. Popular routes may be defined differently in different scenarios, and we want to uncover them under varying standards. For now, let's define a popular route in a simple way: a popular route is a bike route that starts at a popular station and ends at a popular station. In other words, it is a route between two stations (which can be the same station). As a reminder, a popular station here is a station with a top-ranked starting ride count or ending ride count, depending on whether it is seen as a start station or an end station. In our earlier exploration, we defined popular stations by ranking the sum of their starting and ending rides. We then import the dataset into Tableau for a dynamic view of the data.
In the dashboard, you can set the range of months to narrow your view for a more detailed look. The Tableau workbook is attached below.
At this point, you must be wondering about popular routes under a broader definition. What are the most popular routes that start and end at any station? Are they the same as the popular routes above? Let's investigate it now.
We find that the ride counts for the routes here are far higher than those earlier. Why? This is because of the definition of a popular station. In some sense, a popular station is popular because people like to start or end their rides there. People can ride to or from anywhere, so a station is more likely to become popular if its rides are more "decentralized." Interestingly, this perspective can serve as a criterion for classifying Citi Bike stations.
In the graph, the edges represent rides (direction ignored) and the nodes are stations. Larger nodes represent popular stations, which have more rides overall. Thicker edges represent more rides. Note that some edges and nodes are omitted for clarity. Hopefully the graph helps you understand the difference we discussed earlier.
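The simple definition above can be written down as a filter followed by a count. In the sketch below, the cutoff top_n for what counts as "top ranked" is a hypothetical parameter, the column names are assumed, and the station names A, B, and C are placeholders.

```python
from collections import Counter


def popular_routes(rows, top_n=10):
    """Count routes that start at a top start station and end at a top end station."""
    starts = Counter(r["start_station_name"] for r in rows)
    ends = Counter(r["end_station_name"] for r in rows)
    top_starts = {s for s, _ in starts.most_common(top_n)}
    top_ends = {s for s, _ in ends.most_common(top_n)}
    routes = Counter(
        (r["start_station_name"], r["end_station_name"])
        for r in rows
        if r["start_station_name"] in top_starts
        and r["end_station_name"] in top_ends
    )
    return routes.most_common()


# Placeholder rides: A->B three times, B->A twice, C->A once.
pairs = [("A", "B")] * 3 + [("B", "A")] * 2 + [("C", "A")]
rides = [{"start_station_name": s, "end_station_name": e} for s, e in pairs]
print(popular_routes(rides, top_n=2))
```

With top_n=2, the ride from C is left out because C is not among the top start stations, even though it ends at a popular station.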
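The data behind a graph like this can be prepared with a short sketch: sort each station pair so direction is ignored, count rides per pair as edge weights, and count rides per station as node sizes. The min_edge_weight filter is a hypothetical way to hide weak edges for clarity. The column names are assumed and the station names are placeholders.

```python
from collections import Counter


def build_graph(rows, min_edge_weight=1):
    """Return (node sizes, edge weights) for an undirected station graph."""
    nodes = Counter()
    edges = Counter()
    for r in rows:
        a, b = r["start_station_name"], r["end_station_name"]
        # A round trip counts twice for its station: as a start and as an end.
        nodes[a] += 1
        nodes[b] += 1
        # Sorting the pair makes A->B and B->A the same edge.
        edges[tuple(sorted((a, b)))] += 1
    kept = {e: w for e, w in edges.items() if w >= min_edge_weight}
    return nodes, kept


# Placeholder rides: A->B three times, B->A twice, C->A once.
pairs = [("A", "B")] * 3 + [("B", "A")] * 2 + [("C", "A")]
rides = [{"start_station_name": s, "end_station_name": e} for s, e in pairs]
nodes, edges = build_graph(rides, min_edge_weight=2)
print(nodes)
print(edges)
```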
Stage 3
In the third stage, we want to analyze data from the past 3-5 years to summarize the bike sharing trend in recent years. A forecast of future development should also be presented in this stage.
Ideal Expectation
The bike sharing environment could change at any time. However, with accessible real-time data, the results we have already obtained could be adjusted automatically if we have an automated analysis process. Thus, in addition to the analysis itself, developing an automated analysis process for data of the same format is also an objective of the project.