Pre-Processing

There are many ways to combine pre-processing steps, and the number of steps needed depends on the application. The SIPT library offers a ready-made pre-processing pipeline that suits most use cases. It takes a folder of raster data as input and writes the results to another folder. The pipeline runs on all files inside the source folder. Each step shows a progress counter and logs its total duration in seconds.
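The progress counter and duration logging described above could look roughly like the following sketch. It is not SIPT's implementation; `run_steps` and the step names are hypothetical and only illustrate running named steps in order while timing each one.

```python
import time
from typing import Callable, List, Tuple


def run_steps(steps: List[Tuple[str, Callable[[], None]]]) -> List[Tuple[str, float]]:
    """Run named steps in order and return (name, seconds) for each one.

    Hypothetical helper for illustration; not part of SIPT.
    """
    durations = []
    for index, (name, step) in enumerate(steps, start=1):
        start = time.perf_counter()
        step()
        elapsed = time.perf_counter() - start
        # Progress counter plus the duration of the finished step in seconds.
        print(f"[{index}/{len(steps)}] {name} finished in {elapsed:.2f} s")
        durations.append((name, elapsed))
    return durations


# Usage with placeholder steps that only record that they ran.
executed = []
run_steps([
    ("reproject", lambda: executed.append("reproject")),
    ("crop", lambda: executed.append("crop")),
    ("merge", lambda: executed.append("merge")),
])
```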

The processing order is the following:

  1. Clears and recreates the destination directory.
  2. Optionally applies a mask to raster files using the given mask type and value.
  3. Reprojects all raster files to the destination CRS (dst_crs, EPSG:4326 by default).
  4. Crops the raster data to match the total bounds of the provided shapefile.
  5. Merges cropped files into mosaics.
  6. Resamples raster data to match a reference file or, if none is given, the first output of the previous step.
  7. Applies an additional cropping operation to fit the exact shape boundaries.
  8. Optionally filters the results by minimum and maximum masked area.
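Steps 4 and 7 differ in what they crop to: first the total bounds (one rectangle around everything in the shapefile), later the exact shape. The sketch below shows how total bounds can be derived from polygon vertices using only the standard library. The functions `total_bounds` and `in_bounds` are hypothetical illustrations, not SIPT functions, and how SIPT computes the bounds internally is not documented here.

```python
from typing import Iterable, Sequence, Tuple

Bounds = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Point = Sequence[float]                     # (x, y)


def total_bounds(rings: Iterable[Sequence[Point]]) -> Bounds:
    """Return one bounding box that encloses every vertex of every ring."""
    points = [point for ring in rings for point in ring]
    if not points:
        raise ValueError("no vertices given")
    xs = [point[0] for point in points]
    ys = [point[1] for point in points]
    return (min(xs), min(ys), max(xs), max(ys))


def in_bounds(point: Point, bounds: Bounds) -> bool:
    """Check whether a point lies inside the bounding box (edges included)."""
    min_x, min_y, max_x, max_y = bounds
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y


# Two separate triangles share a single rectangle of total bounds.
triangle_a = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0)]
triangle_b = [(5.0, 1.0), (6.0, 1.0), (6.0, 4.0)]
box = total_bounds([triangle_a, triangle_b])
print(box)
```

A point such as (3.0, 3.0) lies inside this rectangle but inside neither triangle, which is why a second crop to the exact shape removes additional pixels.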

Input Data

The directory structure produced by the copy step of this library's data retrieval is already a valid input structure. Each file must have a valid timestamp in its path. The timestamp string can be in any of these formats (ddd is the day of the year):

  • yyyymmddThhmmss
  • yyyydddThhmmss
  • yyyydddhhmmss
  • yyyymmdd
  • yyyyddd
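To check whether a path carries a usable timestamp, the formats above can be matched with the standard library. This is a minimal sketch under stated assumptions, not SIPT's own parser: `find_timestamp` is a hypothetical helper, and it assumes the timestamp is delimited by non-digit characters in the path. The formats are tried from longest to shortest so that a full date-time is not mistaken for a plain date.

```python
import re
from datetime import datetime
from typing import Optional

# (regex, strptime format) pairs, ordered from most to least specific.
_FORMATS = [
    (r"\d{8}T\d{6}", "%Y%m%dT%H%M%S"),  # yyyymmddThhmmss
    (r"\d{7}T\d{6}", "%Y%jT%H%M%S"),    # yyyydddThhmmss
    (r"\d{13}", "%Y%j%H%M%S"),          # yyyydddhhmmss
    (r"\d{8}", "%Y%m%d"),               # yyyymmdd
    (r"\d{7}", "%Y%j"),                 # yyyyddd
]


def find_timestamp(path: str) -> Optional[datetime]:
    """Return the first valid timestamp found in the path, or None."""
    for pattern, fmt in _FORMATS:
        # The lookarounds ensure the digits are not part of a longer number.
        for match in re.finditer(rf"(?<!\d){pattern}(?!\d)", path):
            try:
                return datetime.strptime(match.group(), fmt)
            except ValueError:
                continue  # right number of digits, but not a valid date
    return None


print(find_timestamp("src/scene_20210214T123000_B04.tif"))
print(find_timestamp("src/product.A2021045.tif"))
```

Both example paths refer to 14 February 2021, once as a calendar date and once as day 45 of the year.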

Parameters

Name Type Description
src str Input directory path with images and optionally masks.
dst str Output directory path. Contains the results in a flat structure.
shapefile str Path to a valid shapefile. Supported formats: .shp, .geojson, .kml, .gpkg
file_patterns [str] Pattern(s) used to group files during the merge step, in addition to the timestamp (default: []). The pattern can include wildcards:

* matches any sequence of characters (including an empty sequence).
? matches any single character.
[seq] matches any character in seq.
[!seq] matches any character not in seq.
mask_file_pattern str The filename pattern used to identify mask files, excluding the file ending. Default None. The pattern can include wildcards:

* matches any sequence of characters (including an empty sequence).
? matches any single character.
[seq] matches any character in seq.
[!seq] matches any character not in seq.
mask_type MaskType The type of the mask, either BITMASK or THRESHOLD. Must be provided together with mask_file_pattern. Default None.
mask_value float Depending on mask_type, either a bitmask or a threshold value. Default 0.
min_masked_area int Filters out images whose nodata area inside the shape defined by the shapefile is smaller than this minimum. Default 0.
max_masked_area int Filters out images whose nodata area inside the shape defined by the shapefile is larger than this maximum. Default 100.
dst_crs str Destination coordinate reference system for the reprojection step. Defaults to EPSG:4326.
reference str A reference image which should be used for resampling. Default None.
crop_shape boolean Crop to the exact shape, not only the boundaries of the given shapefile. Default true.
num_processes int Number of processes to use for multiprocessing. A good reference is the number of physical CPU cores. Default 1.
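The wildcard rules listed for file_patterns and mask_file_pattern are the same as those of Python's `fnmatch` module, so patterns can be tried out before running the pipeline. The sketch below also illustrates one common reading of a bitmask value. All three functions are hypothetical helpers, not SIPT functions. Two assumptions are made: patterns are compared against the file name only (without the file ending for masks), and a pixel counts as masked if any bit selected by mask_value is set.

```python
from fnmatch import fnmatchcase
from pathlib import PurePath
from typing import Dict, Iterable, List


def is_mask_file(path: str, mask_file_pattern: str) -> bool:
    """Match the pattern against the file name without its file ending."""
    return fnmatchcase(PurePath(path).stem, mask_file_pattern)


def group_by_pattern(paths: Iterable[str], file_patterns: List[str]) -> Dict[str, List[str]]:
    """Assign each file to the first pattern that its name matches."""
    groups: Dict[str, List[str]] = {pattern: [] for pattern in file_patterns}
    for path in paths:
        name = PurePath(path).name
        for pattern in file_patterns:
            if fnmatchcase(name, pattern):
                groups[pattern].append(path)
                break
    return groups


def is_masked(mask_pixel: int, mask_value: int) -> bool:
    """Assumed BITMASK semantics: masked if any selected bit is set."""
    return (mask_pixel & mask_value) != 0


files = [
    "src/LC08_20210214_B4.TIF",
    "src/LC08_20210214_B5.TIF",
    "src/LC08_20210214_QA_PIXEL.TIF",
]
masks = [f for f in files if is_mask_file(f, "*QA_PIXEL*")]
bands = group_by_pattern([f for f in files if f not in masks], ["*_B4.*", "*_B5.*"])
print(masks)
print(bands)
```

With mask_value 0b00001110, bits 1, 2 and 3 are selected, so a mask pixel of 0b00001000 would be masked while 0b01000001 would not.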

Example

Python
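from sipt.pipeline import preprocess
from sipt.processing.mask import MaskType

# Mask files match *QA_PIXEL*; the bitmask selects bits 1, 2 and 3.
# "./image.tif" is the reference image used for resampling.
preprocess("./src", "./processed", "./shape.geojson",
           mask_file_pattern="*QA_PIXEL*", mask_type=MaskType.BITMASK,
           mask_value=0b00001110, reference="./image.tif", num_processes=8)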