A closer look at the code for creating a dataset of satellite images for StyleGAN training

Frederik Ueberschär
7 min read · May 11, 2022

Update:
The dataset described in this article is now published and downloadable on Kaggle as a small version with 4041 images, and an extended version with 51k images.

Generative Adversarial Networks (or GANs for short) have been a hot topic in the machine learning community for a while now, and are mostly known for their ability to replicate realistic-looking imagery and transition between output states of the network, resulting in mesmerising morph videos. There are a variety of pre-trained models floating around, being used to generate faces, landscapes, paintings, etc. Just take a look at thisxdoesnotexist.com to get an impression of the scope of different models.

In this two-part article series we will take a closer look at the process of creating a dataset to train a GAN on — more specifically my own workflow or pipeline to collect thousands of satellite images. In part 1 of the article we took a look at the big picture, the why and basic how of certain aspects — go read it here, if you haven’t already.

In this part 2 we will go over some workflow specifics including code examples. Let’s dive in!

Exporting a single SENTINEL-2 satellite image with Google Earth Engine

Screenshot of the Earth Engine Interface showing code snippets and an exported satellite image of an island

Basic Code

  1. Go to https://earthengine.google.com and sign up for a new account (you will need to connect it to your existing or newly created Google account)
  2. Either access and duplicate this code example, or create a new script and paste in the JavaScript code below. I tried to comment everything thoroughly to explain what is going on in each line.
  3. Click the link that appears in the console on the right side to access and download the exported thumbnail image
Satellite image of a green island surrounded by water
The exported image will look something like this!
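The script linked above is written in JavaScript for the Earth Engine Code Editor, so I won't reproduce it in full here. To give a rough idea of what it does, here is a minimal sketch of the same core logic using the Earth Engine Python API — the coordinates, date range and cloud threshold are placeholder assumptions, not the exact values from my script:

```python
import ee

ee.Authenticate()  # only required once per environment
ee.Initialize()

# Placeholder location -- in the real workflow this comes from a random point.
point = ee.Geometry.Point([7.5, 54.0])

# Pick the least cloudy SENTINEL-2 image that covers the point.
image = (
    ee.ImageCollection('COPERNICUS/S2_SR')
    .filterBounds(point)
    .filterDate('2021-01-01', '2021-12-31')
    .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
    .sort('CLOUDY_PIXEL_PERCENTAGE')
    .first()
)

# Build a downloadable RGB thumbnail URL and print it.
url = image.getThumbURL({
    'bands': ['B4', 'B3', 'B2'],
    'min': 0,
    'max': 3000,
    'dimensions': 2048,
    'region': image.geometry(),
    'format': 'jpg',
})
print(url)
```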

Highlighting some code aspects and decisions

Google has API references for each function and even some guides to help you get started, but let’s dig into some of the why’s of the code above. If you have further questions, feel free to ask them in the comments.

Why do we need to get the image footprint?

Our random location could be at the center or at the very edge of a satellite image. We need the footprint of the image to calculate its center, and from there create a square bounding box that covers most of the satellite image.
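Building on the single-image sketch above, that step boils down to just a few lines in the Python API (the starting radius is an assumption):

```python
# Footprint of the selected image and its centre.
footprint = image.geometry()
center = footprint.centroid(1)  # 1 m error margin

# First candidate for a square export area: buffer the centre, take the bounds.
radius_m = 10_000               # assumed starting size in metres
export_area = center.buffer(radius_m).bounds(1)
```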

What happens if the export area is bigger than the satellite image?

The algorithm has nothing to export outside the image's footprint, so areas without image information will simply be filled with black.

Why not combine multiple images into one?

You can combine multiple satellite images into a mosaic, as shown in this guide. However, that can lead to some visible artefacts where the lighting of different images does not quite match, and I did not want to have these artefacts in my images.

Exporting a single image leads (in my opinion) to cleaner results, and also saves an extra computational step.

Why calculate the centroid and extra bounding box?

The code above:

  1. Finds the center of the satellite image
  2. Creates a new square bounding box around the center and increases its size gradually
  3. Selects the biggest possible export area that fits inside the satellite footprint

The code is not particularly pretty as it brute forces finding the maximum export area — but hey, it works. If you find a more elegant way, let me know!
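For illustration, a simplified version of that brute-force search could look like this (step size and starting radius are assumptions, and it reuses footprint and center from the sketch above):

```python
# Grow the square export area step by step and keep the last size that
# still fits completely inside the image footprint.
radius_m = 10_000      # starting radius in metres (assumption)
step_m = 5_000         # growth per iteration (assumption)
export_area = None

while True:
    candidate = center.buffer(radius_m).bounds(1)
    # contains() is evaluated server-side; getInfo() fetches the boolean result.
    if not footprint.contains(candidate, 1).getInfo():
        break
    export_area = candidate
    radius_m += step_m
```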

Why do we need to set export parameters?

With this code snippet we tell the API how we want to display (and then export) our image.

SENTINEL-2 images contain way more information than we need for our simple purpose here. Take a look at the SENTINEL-2 collection documentation here to see an overview of all the different bands.

We only need bands B4/B3/B2 (the channels for red/green/blue), and not the full resolution either, but a down-scaled version of e.g. 2048x2048 pixels.

The min and max values control how the raw band values are mapped to the 0-255 range of our exported RGB image. The ideal values differ from image to image, and I unfortunately have not found a good way to set them automatically yet, but you can play around with the values to see which fit best.
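In code, those display choices are just a small parameter dictionary passed to the thumbnail export. A sketch, swapping the simple footprint region from the first snippet for the maximised export area (min/max of 0/3000 is only a starting point, not a universal value):

```python
# Visualisation / export parameters for the RGB thumbnail.
vis_params = {
    'bands': ['B4', 'B3', 'B2'],   # red, green, blue channels
    'min': 0,                      # band value mapped to 0 in the output
    'max': 3000,                   # band value mapped to 255 in the output
    'dimensions': 2048,            # longest side of the export in pixels
    'region': export_area,         # the square area from the previous step
    'format': 'jpg',
}
url = image.getThumbURL(vis_params)
```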

Play around with the parameters to see what visualisation works best for you

Exporting multiple SENTINEL-2 satellite images with QGIS, Google Earth Engine and Google Colab

We successfully exported a single SENTINEL-2 image — now it's time to do this thousands of times to build a dataset. Let's automate a bit, shall we?

Generating random locations using QGIS

It would probably be possible to do this step in Earth Engine itself, but I found it easier to do it in QGIS instead.

Step 1:
Download, install and open the latest QGIS version for your system.

Step 2:
Creating a new project, we are greeted with this screen.

Step 3:
Next we are going to generate some random locations. I wanted all of my satellite images to show at least a sliver of land, so I had to make sure that all the random locations are placed on land or directly on the coastline.

To do so, download the coastline files from NaturalEarthData and the Countries WGS84 Feature Dataset. Drag both downloaded folders into QGIS, and the updated project should look something like this:

Step 4:
Next, let’s generate some random points! QGIS has functions built in that allow you to place random points either inside a polygon (our land file) or directly on a line (our coastlines file).

To generate random points on land, select the layer of the Countries WGS84 dataset and go to Vector > Research Tools > Random Points in Layer Bounds.

You can of course set your own parameters, but I chose 100 locations with a minimum distance of 1 degree between them for now.

Run the script, and we have some random points on land!

Step 5:
To generate random points on the coastline, select the layer ne_110m_coastline and go to Vector > Research Tools > Random Points on Lines. Running the command will result in some random points on the coastline outlines (same settings as above: 100 points, 1 degree minimum distance).

As you may notice, some coastlines are more densely populated and some are rather sparse. The reason is that the placement algorithm treats each outline as its own entity and tries to place 100 points on each. No matter if it’s the long coast of Antarctica or a single island of Hawaii: each is supposed to get 100 points (as long as they are at least 1 degree apart).

Step 6:
To export the locations, right click on the layer Random points > Export > Save Features As… and select Comma Separated Value [CSV] as the target format. Define the export path, and set GEOMETRY in Layer Options to AS_YX. You can check if everything worked out by opening the CSV file and pasting some coordinates into Google Maps.

Exporting SENTINEL-2 images with Google Colab

Now, with a script to export a single image and a list of random locations in place, we can move on to the next step and combine the two.

The code logic will be relatively simple (a sketch follows after this list):

  1. Take the first set of coordinates in the CSV file
  2. Request the satellite image for the location
  3. Download and save the image in Google Drive
  4. Repeat
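The full version lives in the Colab notebook linked below; a stripped-down sketch of that loop might look like this. The output folder, file naming and the assumption that the CSV's first two columns are latitude and longitude come from my setup, so adjust them to yours:

```python
import csv

import ee
import requests

ee.Initialize()

OUT_DIR = '/content/drive/MyDrive/sentinel_dataset'  # assumed Drive folder


def export_location(lat, lon, index):
    """Fetch one SENTINEL-2 thumbnail for a coordinate and save it to disk."""
    point = ee.Geometry.Point([lon, lat])
    image = (
        ee.ImageCollection('COPERNICUS/S2_SR')
        .filterBounds(point)
        .filterDate('2021-01-01', '2021-12-31')
        .filter(ee.Filter.lt('CLOUDY_PIXEL_PERCENTAGE', 10))
        .sort('CLOUDY_PIXEL_PERCENTAGE')
        .first()
    )
    # Note: a robust version should also handle locations without any matching image.
    url = image.getThumbURL({
        'bands': ['B4', 'B3', 'B2'],
        'min': 0, 'max': 3000,
        'dimensions': 2048,
        'region': image.geometry(),
        'format': 'jpg',
    })
    response = requests.get(url)
    with open(f'{OUT_DIR}/image_{index:05d}.jpg', 'wb') as f:
        f.write(response.content)


# random_points.csv exported from QGIS with GEOMETRY set to AS_YX,
# assuming the first two columns are latitude (Y) and longitude (X).
with open('random_points.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for i, row in enumerate(reader):
        lat, lon = float(row[0]), float(row[1])
        export_location(lat, lon, i)
```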

Let’s export some images:

  1. Go to the Colab Notebook I wrote and either run it directly or duplicate it to customise it for your needs. If you are unfamiliar with Colab, it might be good to read this starting guide first.
  2. Upload the CSV with exported coordinates to the runtime (the default code looks for a file called random_points.csv).
  3. Now just execute each cell top-to-bottom, following the instructions for Earth Engine authorisation and, optionally, for Google Drive.
  4. Execute Cell [4] and watch satellite images appear in your Drive or the runtime storage!

And that’s it for this article! I hope you enjoyed the read, and after following along, you should be able to understand how to create your own dataset of satellite images, or use parts of it to transfer the method to another dataset.

Until next time ✌️

