Using Data to Decide the Location of a Business (The Battle of the Neighborhoods Project)

Chibuzo Ugonabo
7 min read · Dec 4, 2020

This blog is a walkthrough of my Coursera IBM Data Science certification capstone project.

Disclaimer:

This project was done in the middle of a pandemic, so some of the data may be a little outdated and it may not be the best time to start a business. That being said, here is the report.

Table of contents

Introduction:

  • Business Problem
  • Data
  • Methodology
  • Analysis
  • Results and Discussion
  • Conclusion

Introduction

In this project we will explore data from New York City. NYC needs no introduction: it is one of the most populous cities in the United States, with a population of 8.39 million in 2020. It is a hub of diverse cultures, bringing together people from every part of the globe. NYC is a major industrial center and the financial capital of the world. There are 5 boroughs in the city, and across such a large geographical area there is huge competition between companies. Figuring out the ideal spot to open a new business and maximize profit is therefore a big challenge. In this project, I’ll be focusing on opening a Chinese restaurant, as Chinese is one of the most popular cuisines in America according to the analysis done by Chef's Pencil using Google search data.

Most popular cuisine by state, according to Chef's Pencil

Business Problem:

Where is the best place to open a Chinese restaurant in New York City in order to maximize customers and profit? The aim is to identify the most appropriate location for a new Chinese restaurant, taking into account the competitors and the inhabitants of the city's different neighborhoods.

Target Audience of this project and some demographic facts

This project is particularly useful to developers and investors looking to open or invest in a Chinese restaurant in New York City. Overall, New York is a great place to open a restaurant with an ethnic cuisine, as it is the most diverse city in the world (800 languages are spoken there). With its diverse culture comes diversity in food. There are many restaurants in New York City, belonging to different categories such as Chinese, Indian, French, etc. Why did I decide to focus on Chinese cuisine in this project? Chinese cuisine grew in popularity because it offers a cheaper and readily available alternative to a home-cooked meal, and arguably a healthier alternative to fast food.

Data:

We used data from https://geo.nyu.edu/catalog/nyu_2451_34572 for a breakdown of all of the neighborhoods in NYC. The dataset includes the 5 boroughs, 306 neighborhoods, and their latitude and longitude coordinates. We also used Foursquare API calls to get information about the restaurants in each neighborhood. These calls returned the venue names, ratings, and the neighborhood of each restaurant.
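As a rough sketch of how the neighborhood file can be flattened into rows, the snippet below assumes a GeoJSON-style layout: a top-level `features` list whose entries carry `properties.borough`, `properties.name`, and `geometry.coordinates` in [longitude, latitude] order. That layout, the helper name `load_neighborhoods`, and the two sample records are assumptions for illustration only; they are not rows from the real dataset.

```python
import json

# Tiny stand-in for the downloaded file. The field layout is an assumption
# and both records are made-up placeholders, not real neighborhoods.
SAMPLE = """
{"features": [
  {"properties": {"borough": "Bronx", "name": "Example A"},
   "geometry": {"coordinates": [-73.90, 40.85]}},
  {"properties": {"borough": "Queens", "name": "Example B"},
   "geometry": {"coordinates": [-73.80, 40.72]}}
]}
"""

def load_neighborhoods(raw_json):
    """Flatten GeoJSON-style features into one dict per neighborhood."""
    rows = []
    for feature in json.loads(raw_json)["features"]:
        # GeoJSON stores coordinates as [longitude, latitude].
        lon, lat = feature["geometry"]["coordinates"]
        props = feature["properties"]
        rows.append({
            "borough": props["borough"],
            "neighborhood": props["name"],
            "latitude": lat,
            "longitude": lon,
        })
    return rows

neighborhoods = load_neighborhoods(SAMPLE)
for row in neighborhoods:
    print(row)
```

A list of dicts like this can be handed to a data frame library or used directly in the later steps.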

Methodology

The main goal of this project is to find the best location to open a new restaurant in NYC, based on the competition in different localities and their population.

To do this I used the two data sources mentioned above. Together they contain locality information for NYC, the age groups of its residents, and population figures.

To solve this problem I am going to use the “K-Means Clustering Algorithm”. K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:

  • The centroids of the K clusters, which can be used to label new data.
  • Labels for the training data (each data point is assigned to a single cluster).

I will also use different maps in order to give the target audience a clearer picture.
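To make the K-means description concrete, here is a minimal pure-Python sketch of the iterative assign-and-update loop (often called Lloyd's algorithm). It is not the code used in the project, where a library implementation would be the usual choice. The function names `nearest` and `kmeans` and the six toy points are made up for illustration.

```python
import random

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: (point[0] - centroids[c][0]) ** 2
                      + (point[1] - centroids[c][1]) ** 2,
    )

def kmeans(points, k, max_iter=100, seed=0):
    """Plain K-means on 2-D points.

    Returns (centroids, labels), where labels[i] is the cluster of points[i].
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k of the data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest centroid.
        new_labels = [nearest(p, centroids) for p in points]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return centroids, labels

# Made-up toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)

# The centroids can then label a point that was not in the training data.
print(nearest((9.0, 9.0), centroids))
```

The two return values correspond to the two results listed above: the centroids, and one label per training point.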

Steps I took for the analysis:

  • Collected the required data: the location and type (category) of every restaurant within our latitude and longitude range. We also have the locality of each type of restaurant.
  • Explored the “restaurant density” across different areas of NYC. We will use K-means to identify a few promising areas close to the center with a low number of restaurants, along with their types.
  • Explored the most promising areas and, within those, created clusters of locations that meet some basic requirements established in discussion with stakeholders. We will take into consideration locations with fewer restaurants within a radius of 500 meters. We will present a map of all such locations and also create clusters of those locations (using K-means clustering) to explore the neighborhoods.
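The 500-meter criterion in the last step can be sketched as a great-circle distance check. The snippet below uses the haversine formula with a mean Earth radius of 6,371 km. The helper names `haversine_m` and `count_within` and the coordinates are hypothetical, chosen only to show the calculation.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))

def count_within(center, venues, radius_m=500):
    """Count venues whose distance from center is at most radius_m."""
    lat, lon = center
    return sum(
        1 for v_lat, v_lon in venues
        if haversine_m(lat, lon, v_lat, v_lon) <= radius_m
    )

# Made-up coordinates: one candidate location and three nearby venues.
candidate = (40.700, -74.000)
restaurants = [
    (40.700, -74.000),  # same spot as the candidate
    (40.703, -74.000),  # roughly 330 m to the north
    (40.710, -74.000),  # roughly 1.1 km to the north
]
nearby = count_within(candidate, restaurants)
print(nearby)
```

Candidate locations with a low count are the ones with less competition within walking distance.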

Analysis:

Data Identification, capturing and cleaning

Search for and identify the relevant data source and capture it. Here we use Wikipedia to get data about New York City. We then remove all the redundant values (data cleaning). Then we combine neighborhoods similar to the Bronx. Now the data is cleaned and ready to use.

Combining different data sources and sorting neighborhoods based on latitude and longitude

Now we combine the neighborhood datasets with the postal addresses alongside the dataset with latitude and longitude, and save them into separate data frames. The resulting data frame contains details about postal codes, boroughs, neighborhoods, and latitude and longitude. Finally, we visualize it using a folium map.

Exploring NYC’s neighborhoods

First, we explored all the neighborhoods in New York City using the latitude and longitude data, calling the Foursquare API to get the restaurant venues available in NYC. Then we explored the unique categories in the neighborhoods by filtering the venue details for all possible “Chinese Restaurants”. Next, we listed each neighborhood along with its most common venues. Finally, we identified the top 10 venues for each neighborhood.
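The "most common venues per neighborhood" step can be sketched with a simple frequency count. The snippet assumes the API results have already been reduced to (neighborhood, category) pairs. The function name `top_categories`, the neighborhood names, and the records are all hypothetical, and the category label "Chinese Restaurant" is assumed to match the one quoted above.

```python
from collections import Counter, defaultdict

def top_categories(venues, n=10):
    """Map each neighborhood to its n most frequent venue categories."""
    counts = defaultdict(Counter)
    for neighborhood, category in venues:
        counts[neighborhood][category] += 1
    return {
        hood: [category for category, _ in counter.most_common(n)]
        for hood, counter in counts.items()
    }

# Made-up (neighborhood, category) pairs standing in for API results.
venues = [
    ("Hood A", "Chinese Restaurant"),
    ("Hood A", "Chinese Restaurant"),
    ("Hood A", "Pizza Place"),
    ("Hood B", "Bakery"),
    ("Hood B", "Chinese Restaurant"),
    ("Hood B", "Bakery"),
]

top = top_categories(venues, n=2)
print(top)

# Number of Chinese restaurants already present in each neighborhood.
chinese_counts = Counter(
    hood for hood, category in venues if category == "Chinese Restaurant"
)
print(chinese_counts)
```

With real data, `n=10` gives the top 10 venues per neighborhood, and the second count gives the existing competition.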

Clustering

Assuming 5 clusters, we use the K-means algorithm to group NYC neighborhoods into 5 clusters with similar sets of venues. We then explore each cluster and determine the venue categories that distinguish it. Finally, we identify the clusters and boroughs/neighborhoods with the largest number of restaurants, and the types of those restaurants.

Results and Discussion

First things first, we want to see the breakdown of neighborhoods in each borough. Looking at the graph below, we can see that Queens has the most neighborhoods, followed by Brooklyn and then Manhattan. We looked into this because we wanted to compare the number of neighborhoods per borough with the number of Chinese restaurants already in each neighborhood. This way we can minimize competition and get a better idea of where we should actually open a restaurant.

We can see visually that Queens has the most Chinese restaurants, followed by Manhattan, then Brooklyn.

  • Queens has the highest number of neighborhoods.
  • Queens also has the highest number of Chinese restaurants.
  • Chinatown in Manhattan has the highest number of Chinese restaurants.

Based on the results of our analysis, I would say that Manhattan and Brooklyn are the best locations for Chinese cuisine in NYC. For the best shot at success, I would open a Chinese restaurant in Brooklyn. Brooklyn has multiple neighborhoods with average ratings exceeding 8.0 on a scale of 1.0 to 10.0, and it has fewer Chinese restaurants than Manhattan, which means less competition. In addition, we should keep in mind that real estate prices in Brooklyn are much lower than in Manhattan. In particular, I would recommend opening a Chinese restaurant in either Cobble Hill or North Side, because these two neighborhoods have the highest ratings for Chinese restaurants.

Limitations and suggestions for future research

All of the above analysis depends on the accuracy of Foursquare data. In addition, during this project we used a free Sandbox Tier account of the Foursquare API, which limits the number of API calls and the results returned. To get better results, future research and more comprehensive analysis could use a paid account to bypass these limitations, as well as incorporate data from other external databases.

Conclusions

In this project we have gone through the process of identifying the business problem, specifying the data required, extracting and preparing the data, performing data analysis, and lastly providing recommendations to the investors/developers. During the project, we applied different data science methods and tools to answer our main question: “Where in New York City should the investor open a Chinese restaurant?” The findings of this project will help the relevant investor better understand the advantages and disadvantages of different New York neighborhoods/boroughs in terms of opening a Chinese restaurant.
