Working with GIS Data in Python: Part I — Geometry as Data
- Mic

- Jun 12
- 6 min read
If one thinks of geographical data, then a city is simply a polygon, your favourite café is a point, and the way to get their is a sequence of lines.
In GIS data (Geographic Information System), those really are the types. And once you start thinking in those terms, a surprising amount of analysis becomes straightforward, provided you have the right tools.
This is the first part in a series on GIS. We are starting with the fundamentals: vector data, the two libraries that handle it, and a handful of real-world examples using publicly available datasets you can run yourself. In later parts we will cover projections, interactive maps, and finally raster data, like satellite imagery, elevation models, and continuous grids of values.
This is no way a benchmark or a complete introduction of GIS, it is simply what I found out about GIS when trying to get into it the first time.

The Two Worlds of GIS
Before any code, one distinction worth having early is that GIS data comes in two fundamental forms.
Vector data is geometry, whether this is a point, a line, or a polygon. This will also depend on context, for example a city could be a point in one resolution and a polygon in another. Similarly, a river is often a sequence of lines, but could be a polygon. Vector data is discrete, meaning that each feature is a shape, and each shape has attributes attached to it.
Raster data on the other hand is a grid. A matrix of values covering space. Examples for this can be satellite imagery, an elevation model, or a rainfall map. Raster data is continuous, every cell in the grid has a value, even if that value is “no data”.
This post is about vectors. The distinction matters because the tooling is completely separate and mixing them up will produce confusing installation errors and confusing results in roughly equal measure.
GeoPandas — Geometry as a Column
If you have worked with pandas, GeoPandas will feel like home. That is the design.
GeoPandas extends a regular DataFrame with a geometry column. That column holds geometry objects — points, lines, polygons. Everything else behaves exactly like a pandas DataFrame: filtering, grouping, merging, iteration all work the same way.
The quickest way to start is with the datasets GeoPandas ships with. No download required.
This will show the most useful basic information about the first three countries in the list, in addition to its geometry, without the geometry you get
NAME SOVEREIGNT CONTINENT POP_EST GDP_MD ISO_A3
0 Fiji Fiji Oceania 889953.0 5496 FJI
1 Tanzania United... Africa 58005463.0 63177 TZA
2 W. Sahara Western... Africa 603253.0 907 ESH and the last column gives
NAME geometry
0 Fiji MULTIPOLYGON (((180 -16.06713, 180 -16.55522, ...
1 Tanzania POLYGON ((33.90371 -0.95, 34.07262 -1.05982, 3...
2 W. Sahara POLYGON ((-8.66559 27.65643, -8.66512 27.58948...We can already see a difference here. As an island nation, Fiji is composed of multiple polygons, while the other two are a single polygon. If we pick this geometry entry and display it in full we get
MULTIPOLYGON (
((180 -16.067132663642447, 180 -16.555216566639196, 179.36414266196414 -16.801354076946883, 178.72505936299711 -17.01204167436804, 178.59683859511713 -16.639150000000004, 179.0966093629971 -16.433984277547403, 179.4135093629971 -16.379054277547404, 180 -16.067132663642447)),
((178.12557 -17.50481, 178.3736 -17.33992, 178.71806 -17.62846, 178.55271 -18.15059, 177.93266000000003 -18.28799, 177.38146 -18.16432, 177.28504 -17.72465, 177.67087 -17.381140000000002, 178.12557 -17.50481)),
((-179.79332010904864 -16.020882256741224, -179.9173693847653 -16.501783135649397, -180 -16.555216566639196, -180 -16.067132663642447, -179.79332010904864 -16.020882256741224))
)We see that they different polygons consist of a sequence of coordinates, of course in each case the first and the last coincide to give a closed polygon, and we have three of these in total.
Plotting is, is straight forward and done as usual with matplotlib support (plus of course throwing in a matplotlib.pyplot import).
With the dataset available at the time of this post we then get

Choropleth maps, where colour encodes a value, are done very easily as well. We just need to clean up the data a bit.
This gives us the following map with the current data.

The missing_kwds argument handles countries with no data gracefully. Worth knowing before you wonder why half a continent is missing. We also left out Antarctica.
Dissolve — Merging Features
One of the most useful operations is combining features by a shared attribute. Here, collapsing countries into continents.
This gives us a coloured map of continents with colour describing the total population

The dissolve method behaves like groupby, including that the aggfunc controls how numeric columns are combined when geometries are merged. Here we sum populations. You could take the mean, the max, or write a custom function, in the same way as you would for a normal groupby. Note that in the image you can clearly see where the dataset makes its continents split. In this case Russia and French Guiana count for Europe, even though they are mostly in Asia respectively in South America.
Spatial Joins — The Core Operation
The question that makes GIS worth learning: which features from dataset A fall inside features from dataset B?
GeoPandas ships a second built-in: major world cities. Together with the country polygons, we can answer “which country is each city in?” — a question that is annoying to answer with a table join but trivial with a spatial join.
This gives us the following head at the end.
NAME_left NAME_right CONTINENT
0 Vatican City Italy Europe
1 San Marino Italy Europe
2 Vaduz Austria Europe
3 Lobamba eSwatini Africa
4 Luxembourg Luxembourg EuropeWe can then use this to count cities per country, at least those that are in the dataset.
This generates two images, the first overlays the countries with the locations of the cities.

While the second colours the countries by the count of cities in them from the dataset.

This is the spatial join in action. The cities dataset had no country information, we specifically cut it down to only the name and the location. The polygons dataset had no city count, but the join connected them using location alone.
Shapely — What GeoPandas Uses Under the Hood
GeoPandas does not do geometry itself, it delegates this to Shapely.
Shapely handles the computational geometry: constructing shapes, testing relationships, performing set operations. While you would use GeoPandas most of the time while working with GIS, understanding Shapely helps when you need to construct a geometry from scratch or inspect what a geometry operation actually produces. In a simple example this would be usual set operations.
This will produce the images of usual images of two overlapping polygons one is used to.

These operations are what power spatial analysis. When you ask “which part of this flood zone overlaps with residential parcels”, that is an intersection. When you ask “give me everything within 1km of any hospital”, that is a union of buffers. In these cases Shapely handles the geometry andGeoPandas applies it to a whole column of rows.
The relationship is worth understanding once and then forgetting about. In practice, you write GeoPandas most of the time. But Shapely is there when you need to drop down a level, for example if you need to implement something not yet directly provided by GeoPandas.
The File Formats
One last thing before we wrap up this part: GIS data comes in formats you will not recognise from regular data work. GeoPandas can read most of them with read_file().
While we have been using option 4 in our examples above, this was really option 1 in disguise. The downloaded zip files are a Shapefile. For an example that used GeoJSON in disguise we have the following.
This gives us the printout
name IS... geometry
0 Indonesia IDN MULTIPOLYGON (((117.70361 4.16342, 117.70361 4...
1 Malaysia MYS MULTIPOLYGON (((117.70361 4.16342, 117.69711 4...
2 Chile CHL MULTIPOLYGON (((-69.51009 -17.50659, -69.50611...and as a plot

For example 3, the GeoPackage format in disguise, we use the following.
This we can plot again, the geometries are all of the type LINESTRING, sequences of connected lines, which gives us the following.

That format we used, in out original examples above, is very useful for sources like Natural Earth. They publish free vector datasets at multiple resolutions, this includes countries, coastlines, rivers, populated places. All available as zipped shapefiles. In all three cases GeoPandas will unzip and read them in one call.
What Is Next
In the next part we focus on projections and interactive maps.
This includes things like 0.005 degrees is not 500 metres, at least not in most places. That one will cause problems before you expect it to.
We will also look at OSMnx, a library that turns OpenStreetMap into a graph you can run shortest-path queries on. And Folium, which produces interactive web maps with hover tooltips and clickable markers from the same GeoDataFrames we have been working with here.



Comments