Rohan Dawar

Building a Web Scraper for citypopulation.de



Introduction

In this project I will be scraping the website citypopulation.de with BeautifulSoup to create a dataframe and CSV file of populations for sub-national entities. What is scraping? Web scraping is the process of extracting content and data from a website through its HTML code. What is https://www.citypopulation.de/ ? This website provides up-to-date data on population and area for all countries of the world, including territories and subdivisions.

What is Beautiful Soup? Beautiful Soup (also known as BS4) is a Python library for pulling data out of HTML pages. In this project, I will be using BS4 to get the country pages within a continent, as well as to parse the population data for each country's subdivisions. What are sub-national entities? Sub-national entities are administrative or census divisions within a country, such as provinces, states, territories and municipalities. citypopulation.de tries to keep up-to-date population data for all national and sub-national entities on Earth.



Outline & Narrative

 

Download webpage using requests

To begin, we'll use the requests library to download the webpage and create some simple functions to help parse URL strings:
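
A minimal sketch of these helpers might look like this (BASE_URL, get_soup, url_suffix and is_html_suffix are illustrative names, not necessarily the originals):

    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.citypopulation.de"

    def get_soup(url):
        """Download a page and return it as a parsed BeautifulSoup object."""
        response = requests.get(url)
        response.raise_for_status()  # fail loudly on a bad HTTP status
        return BeautifulSoup(response.text, "html.parser")

    def url_suffix(url):
        """Return the last path segment of a URL, e.g. 'canada.html'."""
        return url.rstrip("/").rsplit("/", 1)[-1]

    def is_html_suffix(url):
        """True if the URL ends in '.html' rather than a directory path."""
        return url.endswith(".html")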

Next, we will create a Country class so we can easily access each country's attributes: its URL, its name, whether the URL has an .html suffix, and the continent it belongs to:
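
A simple version of this class (the attribute names are my own guesses at the original):

    class Country:
        """Bundles the attributes we need to locate a country's page."""

        def __init__(self, url, name, html_suffix, continent):
            self.url = url                  # full URL of the country page
            self.name = name                # display name, e.g. 'Canada'
            self.html_suffix = html_suffix  # does the URL end in '.html'?
            self.continent = continent      # continent the country belongs to

        def __repr__(self):
            return f"Country({self.name!r}, {self.continent!r})"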

Next, a helper function ContinentDict that takes a list of continents and returns a dictionary whose keys are continents and whose values are lists of the countries belonging to each, parsed from that continent's HTML page on citypopulation.de:
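
A sketch of this function, assuming continent pages live under BASE_URL/en/<continent>/ and that country links can be filtered out of the page's anchors (the exact filter depends on the site's current markup):

    def ContinentDict(continents):
        """Map each continent name to (name, url) pairs of its countries."""
        result = {}
        for continent in continents:
            soup = get_soup(f"{BASE_URL}/en/{continent.lower()}/")
            countries = []
            for link in soup.find_all("a", href=True):
                href = link["href"]
                # Illustrative filter: keep links that look like country pages.
                if href.startswith("/en/") and link.text.strip():
                    countries.append((link.text.strip(), BASE_URL + href))
            result[continent] = countries
        return result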

Next, a simple function to create Country objects from the dictionary returned by ContinentDict:
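
Flattening that dictionary into Country objects is then straightforward (MakeCountries is an illustrative name):

    def MakeCountries(continent_dict):
        """Turn the ContinentDict output into a flat list of Country objects."""
        countries = []
        for continent, entries in continent_dict.items():
            for name, url in entries:
                countries.append(Country(url, name, is_html_suffix(url), continent))
        return countries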


 

Test Parsing

Using Beautiful Soup objects to inform our function building:
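
For instance, we can download a single country page and inspect the soup interactively (the Canada cities URL is just an example):

    soup = get_soup("https://www.citypopulation.de/en/canada/cities/")
    print(soup.title.text)     # the page title
    print(soup.find("table"))  # the first data table on the page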

To find the date, we can use the find function to search for the class 'rpop prio1':
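
On these pages the first match is typically the population column header, whose text includes the date of the figures:

    date_cell = soup.find(class_="rpop prio1")
    if date_cell is not None:
        print(date_cell.text)  # e.g. a census or estimate date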


 

Pandas DataFrame

Writing functions to handle Beautiful Soup objects and return a dataframe:

We can start with the 'deepest' function that adds a city (or any subdivision) to the passed dataframe:
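
A sketch, using the column names Name, Population, Country, Continent and Date (my choice of schema, not necessarily the original):

    import pandas as pd

    def add_city(df, name, population, country, continent, date):
        """Append one subdivision row to the dataframe and return the result."""
        row = pd.DataFrame([{
            "Name": name,
            "Population": population,
            "Country": country,
            "Continent": continent,
            "Date": date,
        }])
        return pd.concat([df, row], ignore_index=True)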

Our next 'layer' up is a function that adds a table (i.e. a set of subdivisions) to the passed dataframe:
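
A sketch of this layer; the cell class 'rname' is an assumption about the site's markup, while 'rpop prio1' comes from the Test Parsing step above:

    def add_table(df, table, country, continent, date):
        """Add every subdivision row of one HTML table to the dataframe."""
        for row in table.find_all("tr"):
            name_cell = row.find("td", class_="rname")
            pop_cell = row.find("td", class_="rpop prio1")
            if name_cell is None or pop_cell is None:
                continue  # skip header rows and rows without population cells
            try:
                population = int(pop_cell.text.strip().replace(",", ""))
            except ValueError:
                continue  # skip non-numeric cells, e.g. missing figures
            df = add_city(df, name_cell.text.strip(), population,
                          country, continent, date)
        return df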

Next we can write a function that takes the Country object and finds all the subdivisions to be added to the passed dataframe:
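
Building on the pieces above:

    def add_country(df, country):
        """Parse one country's page and append all of its subdivision tables."""
        soup = get_soup(country.url)
        header = soup.find(class_="rpop prio1")  # column header holds the date
        date = header.text.strip() if header else None
        # add_table skips rows that don't match, so stray tables are harmless
        for table in soup.find_all("table"):
            df = add_table(df, table, country.name, country.continent, date)
        return df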

And finally, our 'top' layer is a function that takes the list of Country objects and the dataframe, and feeds each country to the functions above:
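
With build_dataframe as an illustrative name:

    def build_dataframe(countries):
        """Run add_country over every Country, returning the combined dataframe."""
        df = pd.DataFrame(columns=["Name", "Population", "Country",
                                   "Continent", "Date"])
        for country in countries:
            df = add_country(df, country)
        return df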

Testing our functions, we can see in the output that our dataframe is built:
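
A quick test run over a handful of countries might look like:

    continent_dict = ContinentDict(["Europe"])
    countries = MakeCountries(continent_dict)
    df = build_dataframe(countries[:3])  # a few countries keeps the test quick
    print(df.head())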

Now we can reset the index to get our final dataframe:
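
That is a single pandas call:

    df = df.reset_index(drop=True)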


 

Example Analysis

Preliminary analyses on our dataframe using pandas:
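
For instance, we could look at the largest subdivisions overall and the totals per country (using the column names assumed above):

    # Ten most populous subdivisions in the dataframe
    print(df.sort_values("Population", ascending=False).head(10))

    # Total recorded subdivision population per country
    print(df.groupby("Country")["Population"].sum()
            .sort_values(ascending=False))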

As we can see, the dataframe accurately indexed the data from the HTML page:


 

Conclusion

Exporting, Summary, Future Work & References:
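
Exporting the final dataframe to a CSV file is a one-liner (the filename is illustrative):

    df.to_csv("subnational_populations.csv", index=False)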

