Creating a dataset of satellite images for StyleGAN training

Frederik Ueberschär
8 min read · Feb 18, 2022


Update:
The dataset described in this article is now published and downloadable on Kaggle as a small version with 4041 images, and an extended version with 51k images.

Generative Adversarial Networks (or GANs for short) have been a hot topic in the machine learning community for a while now, and are mostly known for their ability to replicate realistic-looking imagery and transition between output states of the network, resulting in mesmerising morph videos. There are a variety of pre-trained models floating around, being used to generate faces, landscapes, paintings, etc. Just take a look at thisxdoesnotexist.com to get an impression of the scope of different models.

In this two-part article series we will take a closer look at the process of creating a dataset to train a GAN on — more specifically my own workflow or pipeline to collect thousands of satellite images.

Part one of the series looks at the big picture, the why and the basic how of the individual components; part two goes over workflow specifics, including code examples.

Introduction

At the beginning of 2021 I finished a project called Landshapes, in which I trained a GAN on satellite imagery and used it to drive an art installation I conceptualised for my Master’s thesis. You can read the whole thesis here, if you like — or just look at some pretty pictures, that’s also fine.

I am a designer by trade and before starting the project, I had absolutely no clue about anything related to machine learning or GANs. I like to think that I know a tiny fraction more than I did when I started, but that could also be pure imagination.

Instead of starting with a theoretical deep-dive, I approached the project from an experimental point of view, tinkering with pre-made software and building quick prototypes to broaden my knowledge. Trying and failing early helped me immensely in learning and understanding the matter without first having to get a degree in computer science.

Training a first GAN

I started out by training a GAN using RunwayML and the AID dataset, an aerial landscape dataset I found online. It was a fast way to get started and produce my first functioning GAN, but the results were not quite what I had in mind yet.

Looking at the GAN output, it seems rather muddy and muted in color, with random shapes instead of the crisp landscapes I was hoping for. Some of this might be due to limited training time, but more likely it comes down to the quality of the training data: putting the real data and the generated images next to each other and seeing how visually similar they are lends this assumption some validity.

The good thing was that I now had a better idea of what data I was actually searching for, and knew that the AID dataset would not quite cut it.

Defining data qualities

It always helps to know what you are searching for before getting too deep into the details. So, in preparation for searching and collecting: what kind of data did I actually need?

In my eyes, the data should be:

  1. Satellite imagery (obviously), limited to the RGB color channels
  2. Diverse in scenery (to hopefully get a similarly diverse output after the training process)
  3. 1024x1024 pixels in resolution (based on reference resolutions commonly used with StyleGAN)
  4. Visually appealing (a big learning from the initial test training with the AID dataset)
  5. Ideally more than 2000 images (nothing to back up that number, except that it seems to be around the minimum required for StyleGAN training)

Searching for open data sources

Have you ever wondered where services like Apple or Google Maps get their beautiful satellite images from?

Admittedly, I hadn’t until the need for my own satellite data occurred. The good thing is that Google Maps is kind enough to tell us where they get their data from. In this case it is data from Landsat, Copernicus & TerraMetrics — nice search terms to start an investigation.

In tiny font at the bottom, you can see the data sources Google Maps is using for this specific earth segment

While TerraMetrics offers commercial (and, for a student project, costly) access to satellite imagery, Landsat and Copernicus are satellite missions run by NASA and ESA respectively, and both allow free access to their data through platforms like EarthExplorer.

Searching for an individual image on EarthExplorer is easy enough, but downloading them by hand would be rather cumbersome, and this is not what we are here for, is it?

Ideally, I would optimise the process and avoid downloading the big 200MB source file with all available image channels that the satellite provides (some quick math reveals that this would result in 2000 * 0.2GB = 400GB of data — yikes), and instead only get a nice, data-saving 1024x1024px RGB JPG out of it.

Building a data collection pipeline

Now we come to the hands-on part of the article! After some experimenting and tinkering, I came up with a functioning pipeline that allowed me to collect diverse satellite images in a mostly automated way. In this part, we will go over the built pipeline in detail, utilizing:

  1. SENTINEL-2
  2. QGIS
  3. Google Earth Engine
  4. Google Colab

What are these four tools, you might ask? Let’s quickly go over them and look at their use in the process.

SENTINEL-2

To quote ESA’s mission page, “SENTINEL-2 is a European wide-swath, high-resolution, multi-spectral imaging mission. The full mission specification of the twin satellites flying in the same orbit but phased at 180°, is designed to give a high revisit frequency of 5 days at the Equator.”

SENTINEL-2 has circled the world for multiple years now and will provide us with the satellite images we are looking for.

Google Earth Engine

As seen earlier, we could already access individual satellite images with platforms like EarthExplorer. Google Earth Engine is another platform that combines dozens of datasets (including NASA’s Landsat and SENTINEL-2 from ESA’s Copernicus Program) and allows researchers to combine, analyse and export data with code. In our case, Google Earth Engine will be a gateway to the precious SENTINEL-2 data in an accessible format.
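
To give a rough impression of what that gateway looks like in practice, here is a minimal sketch using the Earth Engine Python API. The coordinates, date range and cloud-cover threshold are illustrative assumptions, not the exact values from my pipeline.

```python
import ee

# Authenticate once per session, then initialise the Earth Engine client
ee.Authenticate()
ee.Initialize()

# A point of interest (longitude, latitude); purely illustrative values
point = ee.Geometry.Point([13.4, 52.5])

# Pick the least cloudy SENTINEL-2 scene covering that point
image = (
    ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(point)
    .filterDate('2020-01-01', '2020-12-31')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
    .sort('CLOUDY_PIXEL_PERCENTAGE')
    .first()
)
```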

QGIS

QGIS is a free and open-source geographic information system that will help us determine and save the location coordinates we later export through Earth Engine. This step could probably also be done with code in Earth Engine alone, but I found it a bit easier to split the tasks up like this.

Google Colab

The final written code will have to run somewhere for some time to request and save thousands of images. You could probably do this on your own machine, but it is actually easier to outsource the computation to something like Google Colab.

Google Colab allows us to write and execute Python code on Google servers, so that we do not block our own computer when sourcing the dataset.
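
In practice, the only Colab-specific code we need is mounting Google Drive so the downloaded images survive the session; the folder name below is just an example.

```python
from google.colab import drive

# Mount Google Drive so downloaded images persist beyond the Colab session
drive.mount('/content/drive')

# Example output folder for the dataset (name is arbitrary)
OUTPUT_DIR = '/content/drive/MyDrive/satellite-dataset'
```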

Tying all components together in one workflow

With all the needed software components established, let’s see how they all play together nicely to create a full dataset.

Step 1: Generate locations
Using QGIS and a shape layer of Earth’s coastline, thousands of random coordinates on land or directly on the coastline are generated, and their longitudes and latitudes are exported as a table.
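
For anyone who wants to script this export rather than click through the dialog, a small sketch for the QGIS Python console could look like this; the layer name and output path are assumptions.

```python
import csv
from qgis.core import QgsProject

# Assumed name of the random-points layer generated in QGIS
layer = QgsProject.instance().mapLayersByName('random_points')[0]

# Write one longitude/latitude pair per generated point
with open('/path/to/locations.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['longitude', 'latitude'])
    for feature in layer.getFeatures():
        point = feature.geometry().asPoint()
        writer.writerow([point.x(), point.y()])
```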

Step 2: Search SENTINEL-2 dataset
A script running on Colab iterates over the table, requests the satellite image at each location from the SENTINEL-2 dataset and calculates a square export region lying within the image bounds.
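
Conceptually, the square region can be derived from each point by buffering it and taking the bounding box, roughly as sketched below. The Drive path and the buffer radius are assumptions; the actual region size depends on the ground resolution you are after.

```python
import ee  # assumed to be authenticated and initialised as in the sketch above
import pandas as pd

# Read the coordinate table exported from QGIS (path is an assumption,
# after uploading the CSV to the mounted Drive folder)
locations = pd.read_csv('/content/drive/MyDrive/locations.csv')

def square_region(lon, lat, half_size_m=5120):
    """Return a square Earth Engine geometry centred on (lon, lat)."""
    point = ee.Geometry.Point([lon, lat])
    # buffer() creates a circle of the given radius in metres,
    # bounds() turns it into the enclosing square
    return point.buffer(half_size_m).bounds()
```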

Step 3: Generate the export link
Using the calculated export region, a request to Google Earth Engine returns a link with the desired region as a JPEG export.
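
With Earth Engine this boils down to a single getThumbURL call, reusing `image` and `square_region` from the sketches above. The band selection (B4, B3, B2 for RGB) and the visual stretch are typical SENTINEL-2 display values, used here as assumptions rather than my exact settings.

```python
region = square_region(13.4, 52.5)  # illustrative coordinates

# Request a 1024px JPEG rendering of that region behind a temporary URL
url = image.getThumbURL({
    'region': region,
    'dimensions': 1024,
    'format': 'jpg',
    'bands': ['B4', 'B3', 'B2'],  # red, green, blue
    'min': 0,
    'max': 3000,                  # rough reflectance stretch for display
})
```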

Step 4: Save the image and repeat
Another script running on Colab requests the image behind the link and saves it in my Google Drive before repeating the whole process for the next location.
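
The download itself is plain Python; a minimal version, with an assumed file-naming scheme, might look like this:

```python
import requests

def save_image(url, out_path):
    """Download the rendered JPEG behind an Earth Engine thumbnail link."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(out_path, 'wb') as f:
        f.write(response.content)

# e.g. save_image(url, f'{OUTPUT_DIR}/image_00001.jpg')
```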

And that’s it for the export part: letting the script run for a few hours in the background, and suddenly my Google Drive folder was filled with beautiful satellite images from all around the world!

Cleaning up the data

The last step before we can proceed to train our GAN is to clean up the data a bit. Unfortunately, Earth Engine does not come with an auto-exposure button (or at least I did not find one), so some images were a bit too dark, some too bright, and others lacking saturation and contrast.

Remembering my first attempt at training a GAN on somewhat muted images, I wanted to give these images a bit of “pop” before attempting the training.

To do so, I imported all images into Adobe Lightroom and gave them a swift auto-correction. Well… kind of swift. Some images had noticeable artefacts, or still did not look quite how I had imagined them. So I took one evening and some cups of coffee and went to work, iterating over all images one by one, quickly adjusting some basic settings or sorting out unusable images.
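
If you would rather stay in code than open Lightroom, a rough programmatic substitute (not what I did, but a common alternative) is a simple auto-contrast pass with Pillow:

```python
from pathlib import Path
from PIL import Image, ImageOps

# Stretch each image's histogram, clipping 1% of the darkest and brightest
# pixels; a crude stand-in for Lightroom's auto-correction
for path in Path('satellite-dataset').glob('*.jpg'):
    img = Image.open(path)
    ImageOps.autocontrast(img, cutoff=1).save(path)
```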

The final dataset consists of 4082 images: 1995 coastline images, and 2087 landmass images.

Generating novel images with a GAN

After training a StyleGAN2-ADA model on the dataset, I could finally create novel, AI-generated images, or videos interpolating from one image to the next.

All those landscape images do not exist
Example of an interpolation video from one landscape to the other

And that’s it, folks! ✌️

Go and read part 2 of the article series, where I will go over and explain all the necessary code that I wrote or collected to make the pipeline work!

