Berlin or Munich?

Photo by Stefan Widua on Unsplash

Introduction

It is hard to find two cities within the same country that contrast more dramatically than Berlin and Munich.

Munich and Berlin are two of the best cities in Germany, and both have a lot to offer. Berlin is not only Germany’s capital and largest city; it is also the cultural hub of the nation. One of the most fascinating cities in Europe, Berlin is vibrant and edgy and is Germany’s center for fashion, art, and culture. Its nightlife is famously impressive, and the city is known as the techno capital of the world. Berlin is also a city of culinary delight, offering a wide variety of food ranging from traditional German dishes to American burgers, all at affordable prices.

Munich is the wealthy capital of Bavaria and the gateway to the Alps. It is said to be one of the most beautiful and charming cities in all of Germany and is filled with museums and beautiful architecture. It is most famous for being the center of Oktoberfest festivities, which attract over 6 million visitors every year.

Business Problem

A person looking to migrate to or visit Germany would like to evaluate the neighborhoods of Berlin and Munich in terms of what the cities have to offer: the standard of living in terms of housing and other amenities, the social aspects, and so on. Such information is intended to help the person decide which city to choose, be it for a visit or relocation. This study will help stakeholders make informed decisions and address their concerns, including the different kinds of cuisines, provision stores, and other offerings.

Data Description

In terms of data, it is of paramount importance to obtain geographical location data for both Berlin and Munich. Postal codes in each city serve as a starting point. Using postal codes, one can find the neighborhoods, boroughs, venues, and their most popular venue categories.

Berlin

To derive our solution, we scrape our data from http://www.places-in-germany.com/14356-places-within-a-radius-of-15km-around-berlin.html

This www.places-in-germany.com page lists places within a 15 km radius of Berlin, including:

1. Borough: name of the neighborhood

2. Zipcode: postal code for Berlin

The data is scraped with Beautiful Soup and loaded directly into a pandas DataFrame. It needs some cleaning so that only the boroughs of the city are kept.

Munich

The postal codes and district names of all districts in Munich are required to solve the task. The data published at https://www.muenchen.de/int/en/living/postal-codes.html is used to fetch the necessary information. The data is fetched using the pandas library and the built-in pd.read_html() function, which scrapes the tables available on the website and stores them in DataFrames.

1. District: district name

2. Postal Code: postal codes for the districts of Munich

Geodata

The Python geopy library is used to obtain the latitude and longitude values. It requires only the name of a neighborhood (a postal code is also accepted) and returns the latitude and longitude of the given address.
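For illustration, a minimal geocoding sketch using geopy’s Nominatim geocoder could look like the following; the user_agent string and the example address are assumptions and not taken from the original notebook.

from geopy.geocoders import Nominatim

# instantiate the geocoder; the user_agent string is an arbitrary placeholder
geolocator = Nominatim(user_agent="berlin-munich-explorer")

# geocode a neighborhood name; a postal code such as "80331, Munich, Germany" also works
location = geolocator.geocode("Mitte, Berlin, Germany")
if location is not None:
    print(location.latitude, location.longitude)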

Venue data

As a next step, the top 100 available venues are fetched for each postal code. For this, we use the Foursquare API to explore the data of the two cities in terms of their neighborhoods. The data also includes information about the places around each neighborhood, such as restaurants, hotels, coffee shops, parks, theaters, art galleries, museums, and many more. We selected one borough from each city to analyze its neighborhoods.

We will use clustering, a machine learning technique, to segment the neighborhoods into groups of similar objects on the basis of each neighborhood’s data. These objects are prioritized by foot traffic (activity) in their respective neighborhoods. This helps locate tourist areas and hubs, and we can then judge the similarity or dissimilarity between the two cities on that basis.

Methodology

We build our model using Python libraries. The required packages are the following:

import pandas as pd
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
import numpy as np
import json
from geopy.geocoders import Nominatim      # geocoding addresses to latitude/longitude
from pandas.io.json import json_normalize  # flattening JSON responses into tables
from bs4 import BeautifulSoup               # web scraping
import requests                             # HTTP requests
import geopy as geo
import geopandas as gpd
import folium                               # map rendering
from sklearn.cluster import KMeans          # clustering algorithm

import matplotlib.cm as cm                  # color maps for the cluster markers
import matplotlib.colors as colors

pd.options.mode.chained_assignment = None
print('Imports done!')

Data Collection

To collect data for Berlin, we scrape http://www.places-in-germany.com/14356-places-within-a-radius-of-15km-around-berlin.html using the following code:

url='http://www.places-in-germany.com/14356-places-within-a-radius-of-15km-around-berlin.html'
req=requests.get(url)
soup=BeautifulSoup(req.text,"html.parser")
table = soup.find_all('table')
df=pd.read_html(str(table), header=0)[0]

The data looks like this:

Overview of Postal codes in the neighborhoods of Berlin.

The Munich data has been collected from the page https://www.muenchen.de/int/en/living/postal-codes.html. The data is fetched using the pandas library and the built-in pd.read_html() function, which scrapes the tables available on the website and stores them in DataFrames. The following code has been used for this:

url = 'https://www.muenchen.de/int/en/living/postal-codes.html'
munich_data_list = pd.read_html(url)
munich_data = munich_data_list[0]
munich_data

The Munich data looks like this:

Overview of Postal codes and District names in Munich.

Data Processing

We split all the places according to boroughs for Berlin

df.rename({'Postal code / Place': 'Borough'}, axis=1, inplace=True)
berlin = df.Borough.str.split(expand=True)
borough = []
for name, values in berlin.iterrows():
    # print(name, values[0], values[1], values[2])
    if values[2] is None:
        borough.append(values[1])
    else:
        borough.append(values[1] + ' ' + values[2])
berlin['Borough'] = borough

We split all the places according to their postal codes for Munich

items = []
for idx, codes in enumerate(munich_data['Postal Code']):
    code_list = codes.split(',')
    district = munich_data['District'][idx]
    for element in code_list:
        element = element.replace(' ', '')
        items.append({'District': district, 'Postal Code': element})

Feature Selection

For both the datasets, only the borough, neighborhood, postal codes, and geolocations (latitude and longitude) are needed.

berlin=berlin[['Zipcode', 'Borough']]
berlin.shape
berlin.drop_duplicates(inplace=True)
berlin.shape
berlin.dropna(axis=0, inplace=True)
berlin.shape
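
For Munich, a comparable step can be sketched by turning the items list built earlier into a DataFrame and attaching coordinates with geopy. This is only a sketch: the variable name munich, the column names, and the address format are assumptions, and the original notebook may do this differently.

munich = pd.DataFrame(items)          # columns: 'District', 'Postal Code'
munich.drop_duplicates(inplace=True)
munich.dropna(axis=0, inplace=True)

# attach latitude/longitude per postal code (assumed approach)
geolocator = Nominatim(user_agent="berlin-munich-explorer")
latitudes, longitudes = [], []
for code in munich['Postal Code']:
    location = geolocator.geocode('{}, Munich, Germany'.format(code))
    latitudes.append(location.latitude if location else None)
    longitudes.append(location.longitude if location else None)
munich['Latitude'] = latitudes
munich['Longitude'] = longitudes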

Visualizing the Neighborhoods of Berlin and Munich

Using the Folium package, it is possible to visualize maps of Berlin and Munich with their neighborhoods.
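
A minimal sketch of how such a map can be drawn with Folium, assuming the berlin DataFrame has been enriched with 'Latitude' and 'Longitude' columns as described in the Geodata section (the column names and map settings are assumptions):

# center the map on Berlin's city center
map_berlin = folium.Map(location=[52.52, 13.405], zoom_start=11)

# add a circle marker for each neighborhood
for lat, lng, borough in zip(berlin['Latitude'], berlin['Longitude'], berlin['Borough']):
    folium.CircleMarker(
        [lat, lng],
        radius=5,
        popup=borough,
        color='blue',
        fill=True,
        fill_opacity=0.7).add_to(map_berlin)

map_berlin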

Map of Berlin showing the neighborhoods
Neighbourhood Map of Munich

With the neighborhoods visualized, the Foursquare API makes it easy to collect information about each neighborhood, including its name, geo-coordinates, and the venues and venue categories within a 500 m radius.

def getNearbyVenues(names, latitudes, longitudes, radius=500):

    venues_list = []
    for name, lat, lng in zip(names, latitudes, longitudes):
        print(name)

        # create the API request URL
        url = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
            CLIENT_ID,
            CLIENT_SECRET,
            VERSION,
            lat,
            lng,
            radius,
            LIMIT)

        # make the GET request
        results = requests.get(url).json()["response"]['groups'][0]['items']

        # keep only relevant information for each nearby venue
        venues_list.append([(
            name,
            lat,
            lng,
            v['venue']['name'],
            v['venue']['location']['lat'],
            v['venue']['location']['lng'],
            v['venue']['categories'][0]['name']) for v in results])

    nearby_venues = pd.DataFrame([item for venue_list in venues_list for item in venue_list])
    nearby_venues.columns = ['Neighborhood',
                             'Neighborhood Latitude',
                             'Neighborhood Longitude',
                             'Venue',
                             'Venue Latitude',
                             'Venue Longitude',
                             'Venue Category']

    return nearby_venues
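
A possible invocation is sketched below; the Foursquare credentials are placeholders you must replace with your own, and the coordinate column names on the berlin DataFrame are assumptions.

# Foursquare credentials and query parameters (placeholders, not real values)
CLIENT_ID = 'your-client-id'
CLIENT_SECRET = 'your-client-secret'
VERSION = '20200101'
LIMIT = 100

venues_berlin = getNearbyVenues(names=berlin['Borough'],
                                latitudes=berlin['Latitude'],
                                longitudes=berlin['Longitude'])
venues_berlin.head()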
Data Collected using Foursquare API for Berlin

One-Hot Encoding

One-hot encoding is a process by which categorical variables are converted into a form that can be fed to machine learning algorithms, helping them do a better job at prediction. Since we want to find out which venue categories are present in each neighborhood and then calculate the most common venues to base our similarity on, we use one-hot encoding to handle the categorical venue-category data.

berlin_venue_cat = pd.get_dummies(venues_berlin[['Venue Category']], prefix="", prefix_sep="")
berlin_venue_cat
berlin_venue_cat['Neighborhood'] = venues_berlin['Neighborhood']

# moving neighborhood column to the first column
fixed_columns = [berlin_venue_cat.columns[-1]] + list(berlin_venue_cat.columns[:-1])
berlin_venue_cat = berlin_venue_cat[fixed_columns]

berlin_venue_cat.head()
berlin_grouped = berlin_venue_cat.groupby('Neighborhood').mean().reset_index()
berlin_grouped.head()
Data with the mean values for each of the Neighborhoods after one-hot encoding

Top Venues in the Neighborhoods

In this step, we will rank and label top venue categories in the neighborhoods of Berlin and Munich.

Let’s make a function to get the topmost common venue categories

def return_most_common_venues(row, num_top_venues):
    row_categories = row.iloc[1:]
    row_categories_sorted = row_categories.sort_values(ascending=False)

    return row_categories_sorted.index.values[0:num_top_venues]

There are far too many venue categories, so we take the top 15 to cluster the neighborhoods.

Next, we create and correctly label the columns for the top venues.

num_top_venues = 15
indicators = ['st', 'nd', 'rd']

# create columns according to number of top venues
columns = ['Neighborhood']
for ind in np.arange(num_top_venues):
    try:
        columns.append('{}{} Most Common Venue'.format(ind+1, indicators[ind]))
    except:
        columns.append('{}th Most Common Venue'.format(ind+1))

# create a new dataframe
neighborhoods_venues_sorted = pd.DataFrame(columns=columns)
neighborhoods_venues_sorted['Neighborhood'] = berlin_grouped['Neighborhood']

for ind in np.arange(berlin_grouped.shape[0]):
    neighborhoods_venues_sorted.iloc[ind, 1:] = return_most_common_venues(berlin_grouped.iloc[ind, :], num_top_venues)

neighborhoods_venues_sorted.head()
Top 15 venues in Berlin

Model Building: K-Means

We use the K-means clustering machine learning algorithm to group similar neighborhoods together, setting the number of clusters to 5.

k_num_clusters = 5

berlin_grouped_clustering = berlin_grouped.drop('Neighborhood', axis=1)

# run k-means clustering
kmeans = KMeans(n_clusters=k_num_clusters, random_state=0).fit(berlin_grouped_clustering)
kmeans
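
One way to attach the resulting cluster labels back to the neighborhood data and obtain the Berlin_merged DataFrame used below is sketched here; the exact merge in the original notebook may differ, and the join column names are assumptions.

# add the cluster label of each neighborhood to the top-venue table
neighborhoods_venues_sorted.insert(1, 'Cluster Labels', kmeans.labels_)

# merge with the geocoded borough data (join columns assumed)
Berlin_merged = berlin.join(neighborhoods_venues_sorted.set_index('Neighborhood'), on='Borough')
Berlin_merged.head()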

Visualizing the clustered Neighborhoods

The data has been collected, processed, and compiled, missing data has been removed, and the model has been built. Using the Folium package, maps of the clustered neighborhoods are created.

We drop all the NaN values to prevent data skew

Berlin_merged_nonan = Berlin_merged.dropna(subset=['Cluster Labels'])
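
A minimal sketch of the clustered-neighborhood map, assuming Berlin_merged_nonan contains 'Latitude', 'Longitude', 'Borough', and 'Cluster Labels' columns (the column names are assumptions):

# build one color per cluster
colors_array = cm.rainbow(np.linspace(0, 1, k_num_clusters))
rainbow = [colors.rgb2hex(c) for c in colors_array]

map_clusters = folium.Map(location=[52.52, 13.405], zoom_start=11)

# add a marker per neighborhood, colored by its cluster label
for lat, lon, poi, cluster in zip(Berlin_merged_nonan['Latitude'],
                                  Berlin_merged_nonan['Longitude'],
                                  Berlin_merged_nonan['Borough'],
                                  Berlin_merged_nonan['Cluster Labels']):
    label = folium.Popup('{} - Cluster {}'.format(poi, int(cluster)), parse_html=True)
    folium.CircleMarker(
        [lat, lon],
        radius=5,
        popup=label,
        color=rainbow[int(cluster)],
        fill=True,
        fill_color=rainbow[int(cluster)],
        fill_opacity=0.7).add_to(map_clusters)

map_clusters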
Map of Clustered Neighbourhoods of Berlin
Map of Clustered Neighbourhoods of Munich

Examining Clusters

We could examine our clusters by expanding on our code using the Cluster Labels column:

Cluster 1

cluster0 = Berlin_merged_nonan.loc[Berlin_merged_nonan['Cluster Labels'] == 0, Berlin_merged_nonan.columns[[1] + list(range(5, Berlin_merged_nonan.shape[1]))]]
cluster0['1st Most Common Venue'].value_counts()

Cluster 2

Berlin_merged_nonan.loc[Berlin_merged_nonan['Cluster Labels'] == 1,
                        Berlin_merged_nonan.columns[[1] + list(range(5, Berlin_merged_nonan.shape[1]))]]

Result and Discussion

The neighborhoods of Berlin are very multicultural, with a variety of supermarkets, restaurants, bars, coffee shops, ice cream shops, drug stores, and clothing stores. There is a wide range of shopping options in terms of fish markets, garden centers, gaming cafés, dessert shops, bookstores, and sporting goods shops, and the main modes of transport seem to be trams and buses. Cluster number zero indicates that the most common venues are supermarkets and bakeries. It also has the greatest number of tram stations, metro stations, and bus stops. This cluster also has cafés, soccer fields, climbing gyms, and German restaurants, which caters well to families with kids.

The neighborhoods of Munich offer a wide variety of cuisines and eateries, including Italian, currywurst joints, Asian, and Chinese, along with a lot of sporting goods shops. The blue cluster, the most common cluster in Munich, seems to contain many similar districts across the city. In cluster number one, drugstores and supermarkets are the most common venues, making it the ‘Drugstore + Supermarket’ cluster. Cluster number two, with lots of hotels and plazas, also has a lot of coffee shops, making it my favorite.

Conclusions

The purpose of this project was to explore the cities of Berlin and Munich and see how attractive they are to potential migrants and tourists. The neighborhoods of Berlin and Munich have broadly similar venues. While the neighborhoods of both cities offer a wide variety of experiences, each of them unique, the dissimilarity lies in the different venues and facilities, with only minor variations. The cultural diversity is quite evident, and it gives one a sense of inclusion. Overall, it is up to the stakeholders’ preferences as to what each wants to experience and what suits their tastes.

Link to detailed code

Thanks for reading!

References

1. Munich Neighbourhood clustering using the K-means Algorithm

2. IBM Capstone Project — The Battle of Neighbourhoods in Berlin: Restaurants

3. A Tale of Two Cities: Clustering Neighbourhoods of London and Paris using Machine Learning

4. Foursquare API
