Working with GIS Data in Python: Part III — When Your Data Is a Grid

Mic
Jun 19
10 min read

Updated: 4 days ago

Sometimes your data is not a shape, instead it is a set of attributes or values that describe something in a grid region. This is what is simply called a raster.

A grid is of course mathematically nothing else but a matrix of values, where every cell covers a patch of the Earth’s surface and the spatial arrangement of those values is the entire point.

There is a lot of data that is presented in this way, the most obvious are things like elevation models, as well as climate data, rainfall, temperature, air quality, and similar type of environmental data. But if you make the grid very large, then also data like satellite images work in this way. The Python GIS ecosystem handles this, but the tooling is completely separate from GeoPandas and Shapely. You need two new libraries and a slightly different model when thinking about the data.

In the first and second part of this series we covered vector data: geometry as a column, projections, spatial joins, OSMnx, and Folium. This post is about raster data, namely what it is, how to read it, how to analyse it, and how to produce images that are worth looking at.

The Raster Model

A raster is a grid of values, so in mathematical terms a matrix with each cell being number. Each cell covers a geographic area, the size will depend on the resolution of the raster. A 30-metre resolution elevation model has one height value for every 30×30 metre patch of ground. A 1-kilometre resolution climate dataset has one temperature value per square kilometre.

That is the whole data structure. The complexity comes from scale: a 30-metre DEM (Digital Elevation Model) covering a single country can have millions upon millions of cells. Global climate rasters add a time dimension on top, this could go over 12 months or 30 years, making this a third dimension in your data. The data model is fairly simple, but the files can become large quickly. This is fairly common when looking at data that describes natural processes or information. There are no real discrete position increments, so approximations can have vast amount of data.

Three things distinguish a raster from a regular NumPy array:

A coordinate reference system: What geographic space does this grid cover?
A transform: How do cell coordinates (row, column) map to real-world coordinates (x, y)?
Nodata values: cells with no valid data, masked rather than set to zero.

Understanding those three things is most of what you need to understand raster GIS.

Free Raster Data Worth Knowing About

Unlike vector data, raster files are large and require registration or direct download. The most accessible public sources:

Copernicus DEM (GLO-30) — 30-metre global elevation model, freely available via AWS S3 without authentication. Excellent quality.

SRTM — NASA’s Shuttle Radar Topography Mission. 30-metre resolution, global coverage. Available via the elevation Python package or direct USGS download.

WorldClim — Monthly climate normals (temperature, precipitation) at multiple resolutions. Free download at worldclim.org.

Sentinel-2 — ESA optical satellite imagery, 10-metre resolution, global, updated frequently. Available via the Copernicus Data Space or via sentinelhub.

NOAA ETOPO — Global elevation and bathymetry (ocean depth). Useful for creating nice-looking terrain maps.

For the examples below, we will use the elevation package (which wraps SRTM download), and generate synthetic but visually realistic rasters where real downloads would interrupt the narrative.

Rasterio — The Foundation

Rasterio wraps GDAL, the underlying C library that handles almost every geospatial format, with a Python interface that reads raster bands as NumPy arrays. This comes with a bit of preparation, since we need to install GDAL, for example via on macOS

brew install gdal

Then we need the elevation package to download our data.

This creates a GeoTIFF file, that we can then open and process with Rasterio. But there is a warning in order, we had to overwrite the maximum number of downloaded tiles, usually set to 9. Alternatively we could go to a smaller resolution, the one we used is using SRTM1 which is a 30 metre resolution, we could go down to SRTM3 which is 90 metre. For the code that generates our image to verify the data we use the following.

This produces for us the following image.

The Sydney harbor does not really look like a harbor...

Comparing this with a map generated using the Contextily package.

We can see a similar shape of the 'water' areas at elevation zero and below but there are clear differences. This is a good point to warn that the data source is important. The data SRTM1 we used, is meant for land elevation. It does not really include water elevation levels, hence when we come close to elevation zero, we are at the boundary of our data and it will be less and less precise.

Hillshade — Making Elevation Look Like a Map

Elevation data displayed raw looks flat. The technique cartographers use to convey terrain is hillshading: simulating the way sunlight falls across a surface.

This takes our previous example, except that we now mask the water with a single color and puts in perspective with a shaded version

This makes elevation look a bit more natural and not as flat.

The right panel looks like a physical terrain model, even though the data is identical. The difference is entirely in how you render it.

The key here is vert_exag=2, it doubles the perceived height difference, which helps for areas with modest relief. For the Himalayas, you would set it to 0.5.

Clipping a Raster to a Country Boundary

The most common first step with a raster: cutting it down to the area you actually care about.

This will cut out the area of the city of Sydney from the previous examples, leaving us with the following.

Clipping out the city limits also puts the opera at the sea.

The profile pattern is worth understanding. A raster file carries metadata, this includes CRS, data type, compression, number of bands, NoData value, transform, alongside the pixel values. Rasterio bundles all of this into a dictionary it calls a profile.

When you write a new raster, you copy and update the profile rather than specifying everything from scratch. It is the right abstraction.

In the example above, you can see how Rasterio is only cutting out a part from the data in the GeoTIFF file. The shapes for the cut-out is obtained from OsMNX, put into the correct CRS, then the extracted from the 'geometry' part. It is then transformed into the correct geometry of the CRS of the raster data and applied.

The rest of the code is then, more or less, the same as the previous example.

Band Maths — NDVI from Sentinel-2

Before we used simple numerical data, in our case elevation data. Satellite imagery typically comes as multiple bands: separate arrays for different wavelengths of light. The most useful computation you can do with two of those bands is NDVI — the Normalized Difference Vegetation Index.

NDVI uses the near-infrared (NIR) and red bands to measure how much living vegetation is present in each pixel. The formula is as follows:

NDVI = (NIR − Red) / (NIR + Red)

Values near 1.0 indicate dense vegetation. Values near 0 indicate bare soil or built-up areas. Negative values typically indicate water.

A small note about the above example, it assumes that you downloaded a data from the Copernicus Browser, namely the raw Sentinel-2 bands. We need the layers B02, B03, B04, and B08, which give blue, green, red, and finally near-infrared. These are either downloaded as a single tif file or assembled into one with Rasterio. The result is a file 'sentinel2_tile.tif' used in the example above. It then produces the following image.

Comparison between a true colour view of Sydney and the NDVI view.

The red parts are beaches or sea. The green parts are land and the colour goes from lightly vegetated areas in light green towards dark green areas that show denser forested areas. In the images we also added back the marker for the opera house and you can clearly make out the Royal Botanical Garden in dark green just to the east of it.

rioxarray — When the Raster Has Dimensions

A single GeoTIFF is one thing, but many raster datasets are not just “one image”. They have extra dimensions: time, band, height, pressure level, scenario, model run, and so on. At that point, plain Rasterio can feel a bit like trying to read a spreadsheet by counting row numbers with a ruler.

This is where xarray becomes useful. It gives us labelled multi-dimensional arrays. Instead of remembering that month 6 is summer-ish, or that axis 0 is time, we can work with names: time, lat, lon, season.

For this example, we will use xarray’s small built-in tutorial climate dataset rather than a large external WorldClim download. It is not a full climate-normal product, but it has the structure we need: air temperature over space and time.

If you are interested in a more exhaustive example, you can download the followng data set from WorldClim 2.1, just be warned that the server acts up from time to time.

Historical climate data
Version: WorldClim 2.1
Variable: tavg
Resolution: 10 minutes
File: wc2.1_10m_tavg.zip

For our small example we use the following.

The result is not just a NumPy array. It knows what its dimensions mean.

<xarray.DataArray 'air' (time: ..., lat: ..., lon: ...)>
Coordinates:
  * time     datetime64[ns] ...
  * lat      float32 ...
  * lon      float32 ...

That is the important bit, we have a temperature variable, but we also have labelled coordinates. The array knows which values belong to which times, latitudes, and longitudes.

The tutorial dataset stores temperature in Kelvin, so first we convert it to Celsius.

The data is more frequent than monthly, so let us average it into monthly means. This is another place where labelled arrays are pleasant: resample(time="MS") means “group by month start along the time coordinate”.

This condensed our previously 2920 time slots into just 24. Now we can calculate seasonal and overall means. We are not counting array positions. We are grouping by the labelled time coordinate.

This is the central point of the example, instead of saying “take indices 5, 6, and 7”, we are saying “give me the season called JJA”. This makes the code much closer to the question we are asking.

Now we add the plot.

This gives us the following very rudimentary image.

Obviously summer is warmer and winter is colder, and the mean temperature shows the general pattern.

The plot is not really the point, our example uses a very simplistic dataset for convenience. The point is how little index bookkeeping there was. Once the data has labelled dimensions, we can ask labelled questions like:

monthly_temperature.groupby("time.season").mean("time")

and:

seasonal_temperature.sel(season="JJA")

That is xarray doing the boring administrative work we would otherwise do ourselves, probably incorrectly, and probably somewhere at an inconvenient time when we have to.

Clipping with rioxarray

So far this was mostly xarray. The rioxarray part appears when we want raster-aware behaviour: coordinate reference systems, spatial dimensions, and clipping.

The tutorial dataset covers North America, we will clip the mean temperature map to the United States.

There is one small nuisance first. The dataset stores longitude from 0 to 360, while most vector boundary datasets use longitudes from -180 to 180. So we convert the longitude coordinate.

Now load a country boundary, similar to how we did it in the previous posts.

Before clipping, we need to tell rioxarray which dimensions are spatial and what CRS the raster uses.

Now the clip itself is just one line.

After adding the usual plotting routine, plotting both the temperature as well as the boundary we end up with the following

Due to our simplistic dataset it is of course not very pretty, but it does show what we want.

This is the difference between “an array with pixels” and “a raster with labelled dimensions”. With Rasterio, we usually think in terms of arrays, transforms, and metadata. With xarray and rioxarray, we can keep the labelled structure attached to the data for longer.

That means we can write things like:

seasonal_temperature.sel(season="DJF")

and:

mean_temperature_geo.rio.clip(usa.geometry, usa.crs, drop=True)

The first line is climate logic, while the second line is spatial logic. Neither one needs us to remember which integer axis meant what. For once, the script itself can be the one remembering the boring details.

GDAL — The Engine Room

GDAL is not a Python library you will reach for often. It is the C library that almost everything above uses under the hood.

Rasterio wraps GDAL, GeoPandas reads files through Fiona or Pyogrio, which use GDAL, and PyProj uses PROJ, a sibling project. For all of them GDAL is the common foundation.

You might reach for it directly to convert between obscure formats, to run bulk format conversions on the command line, or when a file type is not yet wrapped by a higher-level library. Examples like the following.

The GDAL command-line tools (gdalinfo, gdal_translate, gdalwarp) are often more convenient than the Python API for one-off operations. For example gdalinfo my_file.tif tells you everything about a raster file in one shot.

The useful thing to know: if you are having installation problems with Rasterio or GeoPandas, GDAL version mismatches are the most common cause. In those cases it might be best to avoid a simple pip install and reach for something like conda.

Where the Two Worlds Meet

While vector and raster data are separate, they often have to come together for analysis.

The most common bridging operation: sampling a raster at vector point locations. Questions like

What is the elevation at each city?
What is the temperature at each weather station?
What is the NDVI value inside each farm polygon?

For example

One call per point, zero loops. Rasterio handles the coordinate-to-pixel conversion internally.

The Full Stack

In all three posts together we have seen a number of different libraries and also how they can be used together.

The vector stack: GeoPandas, Shapely, PyProj, Folium, OSMnx. The raster stack: Rasterio, rioxarray, xarray, GDAL. The bridge here: rasterio.mask for clipping rasters to vector boundaries; rasterio.sample for sampling rasters at point locations. This all comes together in this example.

This example is a small geospatial relay race.

Elevation downloads the SRTM height data.
Rasterio reads and clips the DEM so it can become the shaded terrain background.
OSMnx handles the OpenStreetMap side: it finds Sydney Central Station, gets the City of Sydney boundary, and downloads the street network inside the 10 km radius.
GeoPandas then does the vector analysis work by building a 1 km grid, cutting the street lines by grid cell, and calculating street density per grid cell.
Matplotlib pulls the pieces together into one static plot, with elevation underneath, density-coloured streets on top, and the city boundary and station marker added for context.

Vector data is discrete. Raster data is continuous (not in the mathematical sense). Most real-world GIS problems involve both and we have seen here how the Python ecosystem handles the connection cleanly.