
Introduction
In this project, I will scrape the website citypopulation.de with Beautiful Soup to create a dataframe and a CSV file of populations for sub-national entities. What is scraping? Web scraping is the process of extracting content and data from a website through its HTML code. What is https://www.citypopulation.de/? This website provides up-to-date data on population and area for all countries of the world, including territories and subdivisions.

What is Beautiful Soup? Beautiful Soup (AKA BS4) is a Python library for pulling data out of HTML and XML documents. In this project, I will use BS4 to find the country pages within a continent and to parse the population data for each country's subdivisions. What are sub-national entities? Sub-national entities are any administrative or census divisions within a country, such as provinces, states, territories, and municipalities. citypopulation.de tries to keep up-to-date population data for all national and sub-national entities on Earth.
Outline & Narrative
Download webpage using requests
To begin, we'll use the requests library to download the webpage and create some simple functions to help parse URL strings:
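As a rough sketch of this step (not necessarily the original code), the download and the URL helpers could look like the following. The function names fetch_html, absolute_url and page_name are hypothetical, and the example paths in the usage lines are only illustrative.

```python
from urllib.parse import urljoin, urlparse

import requests

BASE_URL = "https://www.citypopulation.de/"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP error codes."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text


def absolute_url(href: str, base: str = BASE_URL) -> str:
    """Resolve a (possibly relative) link against the site root."""
    return urljoin(base, href)


def page_name(url: str) -> str:
    """Return the last path segment without any file extension."""
    last_segment = urlparse(url).path.rstrip("/").rsplit("/", 1)[-1]
    return last_segment.rsplit(".", 1)[0]


# Illustrative usage; no network request is made here.
print(absolute_url("en/canada/"))
print(page_name("https://www.citypopulation.de/Canada.html"))
```

Keeping the download in its own function means the parsing helpers can be tried out on saved or hand-written HTML without touching the network.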
Next, we will create a class for countries so that their attributes are easy to access. Each country stores its URL, its name, whether its URL has an html suffix, and the continent it belongs to:
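One possible shape for such a class is a small dataclass. This is a sketch under assumptions: the attribute names and the is_html property are hypothetical, and the original class may store the html-suffix flag differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Country:
    """One country page on the site."""

    url: str
    name: str
    continent: str

    @property
    def is_html(self) -> bool:
        """True when the page URL ends in an .html suffix."""
        return self.url.lower().endswith(".html")


# Illustrative values only.
canada = Country(
    url="https://www.citypopulation.de/Canada.html",
    name="Canada",
    continent="America",
)
print(canada.name, canada.continent, canada.is_html)
```

Deriving is_html from the URL, rather than storing it separately, keeps the two from drifting apart.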
Next, a helper function, ContinentDict, takes a list of continents and returns a dictionary whose keys are the continents and whose values are lists of the countries belonging to each continent, parsed from that continent's HTML page on citypopulation.de:
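A minimal sketch of a helper in this spirit is shown below. It assumes that every link on a continent page points to a country, which is unlikely to hold on the real site, so the selection rule would need tightening after inspecting the actual markup. The fetch parameter and the sample HTML are hypothetical stand-ins for a real download.

```python
from typing import Callable, Dict, List

from bs4 import BeautifulSoup


def continent_dict(
    continents: List[str], fetch: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Map each continent to the hrefs of the links found on its page."""
    result: Dict[str, List[str]] = {}
    for continent in continents:
        soup = BeautifulSoup(fetch(continent), "html.parser")
        # Assumption: every <a href=...> on the page is a country link.
        result[continent] = [a["href"] for a in soup.find_all("a", href=True)]
    return result


# Hypothetical stand-in for a downloaded continent page.
SAMPLE_PAGES = {
    "Oceania": '<ul><li><a href="en/australia/">Australia</a></li>'
               '<li><a href="en/fiji/">Fiji</a></li></ul>',
}

links = continent_dict(["Oceania"], fetch=SAMPLE_PAGES.__getitem__)
print(links)
```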
Next, a simple function creates country objects from the dictionary returned by ContinentDict:
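The conversion might look roughly like this. The sketch uses a NamedTuple as a lightweight stand-in for the country class, and it derives the name from the last URL segment, which is an assumption rather than the original approach.

```python
from typing import Dict, List, NamedTuple
from urllib.parse import urljoin


class Country(NamedTuple):
    """Lightweight stand-in for the country class."""

    url: str
    name: str
    continent: str


def make_countries(
    links_by_continent: Dict[str, List[str]],
    base: str = "https://www.citypopulation.de/",
) -> List[Country]:
    """Flatten a continent -> hrefs dictionary into a list of Country objects."""
    countries = []
    for continent, hrefs in links_by_continent.items():
        for href in hrefs:
            url = urljoin(base, href)
            # Assumption: the last path segment (minus extension) is the name.
            name = url.rstrip("/").rsplit("/", 1)[-1].rsplit(".", 1)[0]
            countries.append(Country(url, name, continent))
    return countries


countries = make_countries({"Oceania": ["en/fiji/"]})
print(countries)
```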
Test Parsing
Inspecting Beautiful Soup objects to inform how we build our functions:
To find the date, we can use the find method to search for 'rpop prio1':
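To illustrate the lookup, the sketch below treats 'rpop prio1' as the class attribute of a table header cell. That is an assumption: the sample markup and its dates are made up for the example, and the real page should be inspected to confirm where the date lives.

```python
import re
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical header row; the real page's markup may differ.
SAMPLE_HTML = """
<table>
  <thead>
    <tr>
      <th class="rname">Name</th>
      <th class="rpop prio2">Population Census 2011-05-10</th>
      <th class="rpop prio1">Population Census 2021-05-11</th>
    </tr>
  </thead>
</table>
"""


def latest_date(html: str) -> Optional[str]:
    """Return the first YYYY-MM-DD date in the 'rpop prio1' element, if any."""
    soup = BeautifulSoup(html, "html.parser")
    # A multi-word class_ value matches the exact class attribute string.
    header = soup.find(class_="rpop prio1")
    if header is None:
        return None
    match = re.search(r"\d{4}-\d{2}-\d{2}", header.get_text(" ", strip=True))
    return match.group(0) if match else None


print(latest_date(SAMPLE_HTML))
```

Returning None instead of raising makes it easy to skip pages where the element or the date is missing.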
Pandas Dataframe
Writing functions to handle Beautiful Soup objects and return a dataframe:
We can start with the 'deepest' function that adds a city (or any subdivision) to the passed dataframe:
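A sketch of such a row-level function follows. It assumes the name sits in the first cell and the most recent population in the last cell of each row. The column names, the add_city signature, and the sample row are all hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["Country", "Name", "Population"]


def add_city(df: pd.DataFrame, row, country: str) -> pd.DataFrame:
    """Append one <tr> (a city or other subdivision) to df and return it."""
    cells = row.find_all("td")
    name = cells[0].get_text(strip=True)
    # Strip thousands separators before converting, e.g. "1,234" -> 1234.
    population = int(cells[-1].get_text(strip=True).replace(",", ""))
    # Assumes df has a default 0..n-1 index, so len(df) is the next label.
    df.loc[len(df)] = [country, name, population]
    return df


# Hypothetical row with placeholder values.
row = BeautifulSoup(
    "<table><tr><td>Region A</td><td>1,234</td></tr></table>", "html.parser"
).find("tr")
df = add_city(pd.DataFrame(columns=COLUMNS), row, country="Exampleland")
print(df)
```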
Our next 'layer' up adds a table (i.e., a set of subdivisions) to the passed dataframe:
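At the table level, one option is to collect all rows first and concatenate once, which is usually cheaper than growing the dataframe row by row. This is a sketch of that alternative, not the original function. The column names and sample table are hypothetical.

```python
import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["Country", "Name", "Population"]


def add_table(df: pd.DataFrame, table, country: str) -> pd.DataFrame:
    """Append every data row of one <table> to df and return the result."""
    records = []
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:  # skip rows without data cells, e.g. <th> headers
            continue
        population = int(cells[-1].get_text(strip=True).replace(",", ""))
        records.append([country, cells[0].get_text(strip=True), population])
    if not records:
        return df
    new_rows = pd.DataFrame(records, columns=COLUMNS)
    # Avoid concatenating with an empty frame, which can blur column dtypes.
    return new_rows if df.empty else pd.concat([df, new_rows], ignore_index=True)


# Hypothetical table with placeholder values.
SAMPLE_TABLE = """
<table>
  <tr><th>Name</th><th>Population</th></tr>
  <tr><td>Region A</td><td>1,234</td></tr>
  <tr><td>Region B</td><td>2,000</td></tr>
</table>
"""
table = BeautifulSoup(SAMPLE_TABLE, "html.parser").find("table")
df = add_table(pd.DataFrame(columns=COLUMNS), table, country="Exampleland")
print(df)
```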
Next we can write a function that takes the Country object and finds all the subdivisions to be added to the passed dataframe:
And finally, our 'top' layer is a function that takes the list of country objects and the passed dataframe and hands them to the functions above:
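The country and top layers can be sketched together as below. To keep the example runnable offline, countries are plain (name, url) pairs and the fetch callable is injected. Both are simplifications, and the pages and URLs are made up.

```python
from typing import Callable, Iterable, Tuple

import pandas as pd
from bs4 import BeautifulSoup

COLUMNS = ["Country", "Name", "Population"]


def add_country(df: pd.DataFrame, name: str, html: str) -> pd.DataFrame:
    """Append the data rows of every table on one country page to df."""
    soup = BeautifulSoup(html, "html.parser")
    records = []
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) >= 2:  # skip rows without data cells
            population = int(cells[-1].get_text(strip=True).replace(",", ""))
            records.append([name, cells[0].get_text(strip=True), population])
    if not records:
        return df
    new_rows = pd.DataFrame(records, columns=COLUMNS)
    return new_rows if df.empty else pd.concat([df, new_rows], ignore_index=True)


def add_countries(
    df: pd.DataFrame,
    countries: Iterable[Tuple[str, str]],
    fetch: Callable[[str], str],
) -> pd.DataFrame:
    """Fetch each (name, url) pair and fold its page into df."""
    for name, url in countries:
        df = add_country(df, name, fetch(url))
    return df


# Hypothetical pages keyed by URL, standing in for real downloads.
PAGES = {
    "https://example.com/a": "<table><tr><td>Region A</td><td>1,234</td></tr></table>",
    "https://example.com/b": "<table><tr><td>Region B</td><td>2,000</td></tr></table>",
}
countries = [
    ("Country A", "https://example.com/a"),
    ("Country B", "https://example.com/b"),
]
df = add_countries(pd.DataFrame(columns=COLUMNS), countries, fetch=PAGES.__getitem__)
print(df)
```

In a real run, fetch would be a function that downloads the page, ideally with a short pause between requests to be polite to the site.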
Testing our functions, we can see in the output that our dataframe is built:
Now we can reset the index to get our final dataframe:
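For illustration, here is how resetting the index behaves on a small frame built from placeholder values. It assumes the combined frame ended up with repeated index labels, which is one common reason to reset.

```python
import pandas as pd

# Placeholder frames standing in for per-country results.
part_a = pd.DataFrame({"Name": ["Region A", "Region B"], "Population": [1234, 2000]})
part_b = pd.DataFrame({"Name": ["Region C"], "Population": [500]})

combined = pd.concat([part_a, part_b])    # index labels are 0, 1, 0
final = combined.reset_index(drop=True)   # index labels are 0, 1, 2
print(final)
```

Passing drop=True discards the old labels instead of keeping them as an extra column.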
Example Analysis
Preliminary analyses on our dataframe using pandas:
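The kind of quick checks meant here might look like the following. The frame holds placeholder values rather than scraped data, and the column names are hypothetical.

```python
import pandas as pd

# Placeholder values, not real populations.
df = pd.DataFrame({
    "Country": ["Country A", "Country A", "Country B"],
    "Name": ["Region A", "Region B", "Region C"],
    "Population": [1234, 2000, 500],
})

# Summary statistics for the population column.
print(df["Population"].describe())

# Total population per country.
totals = df.groupby("Country")["Population"].sum()
print(totals)

# The two most populous subdivisions overall.
largest = df.nlargest(2, "Population")
print(largest)
```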
As we can see, this dataframe accurately captures the data from the HTML page:

Conclusion
Exporting, Summary, Future Work & References:
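For the export step, a minimal sketch with pandas is shown below. The file name populations.csv and the placeholder frame are assumptions made for the example.

```python
import pandas as pd


def export_csv(df: pd.DataFrame, path: str) -> None:
    """Write df to a CSV file without the index column."""
    df.to_csv(path, index=False, encoding="utf-8")


# Placeholder values, not real populations.
df = pd.DataFrame({"Name": ["Region A"], "Population": [1234]})
export_csv(df, "populations.csv")

# Read the file back to confirm that it round-trips.
restored = pd.read_csv("populations.csv")
print(restored)
```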