Pre-Processing
There are many different ways to combine pre-processing steps, and the number of steps needed depends on the application. The SIPT library offers a ready-made pre-processing pipeline that is suitable for most use cases. It takes a folder of raster data as input and writes the results to another folder. The pipeline runs on all files inside the source folder. Each step shows a progress counter and logs its total duration in seconds.
The processing order is the following:
- Clears and recreates the destination directory.
- Optionally applies a mask to raster files using the given mask type and value.
- Reprojects all raster files to the destination CRS (EPSG:4326 by default).
- Crops the raster data to match the total bounds of the provided shapefile.
- Merges cropped files into mosaics.
- Resamples the raster data to a reference file or, if none is given, to the first output of the previous step.
- Applies an additional cropping operation to fit the exact shape boundaries.
- Optionally filters the results by minimum and maximum masked area.
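The ordering above can be illustrated with a minimal sketch of a step runner that passes the output of one step to the next and logs each step's duration in seconds. This is not the SIPT implementation: `run_pipeline`, `tag`, `merge` and `STEPS` are hypothetical names, the steps are placeholders that only rename paths, and the optional steps and directory handling are left out.

```python
import logging
import time
from typing import Callable, List, Sequence, Tuple

logger = logging.getLogger("pipeline_sketch")

# A step takes the file list from the previous step and returns a new list.
Step = Callable[[List[str]], List[str]]


def run_pipeline(files: Sequence[str], steps: Sequence[Tuple[str, Step]]) -> List[str]:
    """Run the steps in order and log each step's duration in seconds."""
    current = list(files)
    for name, step in steps:
        start = time.perf_counter()
        current = step(current)
        elapsed = time.perf_counter() - start
        logger.info("%s finished in %.2f s (%d files)", name, elapsed, len(current))
    return current


def tag(suffix: str) -> Step:
    # Hypothetical placeholder: appends a suffix instead of touching raster data.
    return lambda paths: [f"{p}.{suffix}" for p in paths]


def merge(paths: List[str]) -> List[str]:
    # Stand-in for the mosaic step: many inputs collapse into one output.
    return ["mosaic"] if paths else []


STEPS = [
    ("reproject", tag("reprojected")),
    ("crop to bounds", tag("cropped")),
    ("merge", merge),
    ("resample", tag("resampled")),
    ("crop to shape", tag("shaped")),
]
```

Calling `run_pipeline(["a.tif", "b.tif"], STEPS)` in this sketch shows why the order matters: the two inputs are merged into one mosaic before resampling and the final crop are applied.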
Input Data
The data structure produced by the copy step of this library's data retrieval is already a valid input structure. Each file must have a valid timestamp in its path. The timestamp string can be in any of these formats:
- yyyymmddThhmmss
- yyyydddThhmmss
- yyyydddhhmmss
- yyyymmdd
- yyyyddd
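To check whether a path carries a usable timestamp, the five formats listed above can be mapped to `datetime.strptime` format strings. The following is a minimal sketch, not the parser used by SIPT: the function name `parse_timestamp`, the regular expressions, and the rule of trying the most specific format first are assumptions made for illustration.

```python
import re
from datetime import datetime
from typing import Optional

# (regex, strptime format) pairs, most specific first so that a full
# date-time is not mistaken for a bare date. %j is the day of the year.
_FORMATS = [
    (r"(?<!\d)\d{8}T\d{6}(?!\d)", "%Y%m%dT%H%M%S"),  # yyyymmddThhmmss
    (r"(?<!\d)\d{7}T\d{6}(?!\d)", "%Y%jT%H%M%S"),    # yyyydddThhmmss
    (r"(?<!\d)\d{13}(?!\d)", "%Y%j%H%M%S"),          # yyyydddhhmmss
    (r"(?<!\d)\d{8}(?!\d)", "%Y%m%d"),               # yyyymmdd
    (r"(?<!\d)\d{7}(?!\d)", "%Y%j"),                 # yyyyddd
]


def parse_timestamp(path: str) -> Optional[datetime]:
    """Return the first valid timestamp found in the path, or None."""
    for pattern, fmt in _FORMATS:
        for match in re.finditer(pattern, path):
            try:
                return datetime.strptime(match.group(0), fmt)
            except ValueError:
                # Right number of digits but not a real date; keep looking.
                continue
    return None
```

For example, `parse_timestamp("data/S2_20210214T103021_B04.tif")` and `parse_timestamp("data/MOD_2021045T103021.tif")` both yield 14 February 2021, 10:30:21, since day 045 of 2021 is 14 February.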
Parameters
| Name | Type | Description |
|---|---|---|
| src | str | Input directory path with images and optionally masks. |
| dst | str | Output directory path. Will contain the results as a flat structure. |
| shapefile | str | Path to a valid shapefile. Supported formats: .shp, .geojson, .kml, .gpkg |
| file_patterns | [str] | Pattern(s) used to group files, in addition to the timestamp, during the merge step (default: []). The pattern can include wildcards: `*` matches any sequence of characters (including an empty sequence); `?` matches any single character; `[seq]` matches any character in seq; `[!seq]` matches any character not in seq. |
| mask_file_pattern | str | The filename pattern used to identify mask files, excluding the file extension. The pattern can include wildcards: `*` matches any sequence of characters (including an empty sequence); `?` matches any single character; `[seq]` matches any character in seq; `[!seq]` matches any character not in seq. Default None. |
| mask_type | MaskType | The type of the mask, either BITMASK or THRESHOLD. Must be provided together with mask_file_pattern. Default None. |
| mask_value | float | Depending on the mask_type, either a bitmask or a threshold value. Default 0. |
| min_masked_area | int | Filter out images with less than the required minimum nodata area in the shape defined by the shapefile. Default 0. |
| max_masked_area | int | Filter out images with more than the allowed maximum nodata area in the shape defined by the shapefile. Default 100. |
| dst_crs | str | Destination coordinate reference system for the reprojection step. Defaults to EPSG:4326. |
| reference | str | A reference image to use for resampling. Default None. |
| crop_shape | boolean | Crop to the exact shape, not only to the bounds of the given shapefile. Default true. |
| num_processes | int | Number of processes to use for multiprocessing. A good reference is the number of physical CPU cores. Default 1. |
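The wildcard rules described for file_patterns and mask_file_pattern are the same as those of Python's standard `fnmatch` module, so patterns can be tried out with it before running the pipeline. The sketch below assumes that patterns are matched against the file name rather than the full path and that a file belongs to the first pattern it matches; both are assumptions, and `group_by_pattern` is a hypothetical helper, not part of SIPT.

```python
from collections import defaultdict
from fnmatch import fnmatchcase
from pathlib import PurePosixPath
from typing import Dict, List, Sequence


def group_by_pattern(paths: Sequence[str], file_patterns: Sequence[str]) -> Dict[str, List[str]]:
    """Group paths by the first wildcard pattern their file name matches."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        name = PurePosixPath(path).name
        for pattern in file_patterns:
            if fnmatchcase(name, pattern):
                groups[pattern].append(path)
                break  # files matching no pattern are left ungrouped
    return dict(groups)


paths = [
    "in/T32_20210214_B04.tif",
    "in/T33_20210214_B04.tif",
    "in/T32_20210214_B08.tif",
    "in/T32_20210214_SCL.tif",
]
groups = group_by_pattern(paths, ["*_B04.tif", "*_B0[!4].tif"])
```

Here the two B04 tiles end up in one group, the B08 tile matches `*_B0[!4].tif`, and the SCL file matches neither pattern.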