Removing sensitive information from panoramic imagery

Vivian Dsouza
20 min read · Apr 16, 2021

Rohan Katkam (5278120) r.katkam@student.tudelft.nl
Vivian Dsouza (5386101) v.k.p.dsouza@student.tudelft.nl

In this blog post, we reproduce a project by the City of Amsterdam to remove sensitive information from panoramic imagery. The municipality collects street-level images of the city every year to monitor the state of the city's infrastructure and to support further planning and assessment.

However, sensitive information is also collected along with the city streets, since the images are captured by a vehicle driving around with a camera. This includes people moving around and vehicle registration plates. To avoid violating citizens' right to move around freely and their right to be forgotten, the authors use object detection models to blur areas containing sensitive information.

The original images are 8000 x 4000 pixel panoramas of the city streets. However, they are annotated at a resolution of 2000 x 1000. The authors claim this is sufficient: an object that is not blurred at this downsampled resolution would not be recognisable in the original resolution either. Due to the confidential nature of the project and the non-disclosure agreement we signed, we cannot share data from the original source in this blog. Nevertheless, we make our best effort to describe in depth what we did, using alternative images and thorough descriptions in text. Instead of the original data, we use similar images from Google Street View, as shown below:

Panoramic image of a street in Amsterdam similar to the dataset. Source: Google Street View

The main aim of the project is to blur the people and license plates appearing in the panoramic images taken in the city of Amsterdam. This entails accurate detection and classification of these objects. As about a million images are taken each year, it is not feasible to manually go through all of them to blur them, so an object detection algorithm is required. To achieve this goal, the authors compare multiple approaches: RAW, pre-trained YOLOv5, Faster R-CNN and YOLOv5. Among these, YOLOv5 makes the boldest claims for precision and recall, and the authors base their future research work on this result.

From the paper:

One can see that YOLOv5 consistently outperforms all other models on all scores, the only exception is that Faster R-CNN is scoring higher for the recall on detecting persons. Although Faster R-CNN theoretically should perform better than YOLO, we argue that YOLO performs better due to the many data augmentations that we can do in their code base.

Hence, we focus on the YOLOv5 model. First, we verify this result by reproducing it ourselves, then diagnose misclassifications and comment on possible further improvements. Next, we experiment with data augmentation methods provided by YOLOv5, such as shear, perspective transformation and mixup. The original configuration does not use these features, and we test the model's sensitivity to them. Finally, out of curiosity, we make predictions on similar images from Google Street View and on panoramic images we took ourselves on a street in Delft. These help us gain insight into how robust the model is and how well it generalises to new data.

Dataset and other information provided

For the given problem, 3743 panoramic images from around the city of Amsterdam are available. They were taken in different situations (varying crowds, time of day) and environments (mostly streets, but also canals, highways, crossroads and others) to ensure the model is robust. The dataset is annotated with minimal bounding boxes enclosing the objects that need to be blurred. The file directory is arranged as per the YOLO annotation format (an example label file is sketched below) and is already split into train (60%), test (20%) and validation (20%). The Average Precision at IoU = 0.5 claimed by the authors is 0.746 for the person class and 0.853 for the license plate class. We will explain these metrics later.
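For reference, in the YOLO annotation format every image has a companion .txt label file with one line per object: the class index followed by the box centre coordinates, width and height, all normalised to the image size. The values below are made up for illustration and are not taken from the confidential dataset; class 0 is person and class 1 is license plate:

0 0.512 0.601 0.014 0.053
1 0.233 0.548 0.011 0.008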

YOLOv5

YOLOv5 is the latest, as yet unofficial, member of the YOLO family of state-of-the-art object detection models. Its architecture is very similar to that of YOLOv4: it consists of a Backbone, a Neck and a Head, which together form a one-stage detector. It is natively implemented in PyTorch and is available on the GitHub page of the company Ultralytics. It claims better training speeds than its predecessor YOLOv4; however, these claims are not explicitly compared on their website/GitHub, nor are they officially benchmarked by the authors. Owing to its architecture, it runs well both on still images and on real-time video frames.

YOLOv5 network

An object detector extracts and aggregates features over a series of stages. Being a one-stage object detector, YOLOv5 makes predictions faster than two-stage detector models such as Faster R-CNN.

As mentioned earlier, YOLOv4's architecture is carried over into the YOLOv5 model. YOLOv4 itself came about by adding several new features for regularisation and activation, and it has a variety of loss functions under its belt. As shown in the figure, the network is segmented into the following sections:

  • Input: the input images (or input patches). Data augmentation, including geometric transformations, is applied here. This potentially improves the training process and makes the model more robust to the distortions found in real-world images.
  • Backbone: a neural network pretrained on ImageNet that forms the backbone for feature extraction, such as VGG16, ResNet or CSPDarknet (which was used in YOLOv4). Backbones consist of many stages of convolution and pooling and use activation functions like Sigmoid and ReLU. YOLOv5 also uses a CSP (Cross Stage Partial) network. In a nutshell, these networks form a densely connected structure (with multiple direct connections) that aims to overcome vanishing gradients, reduce computation and encourage richer feature extraction.
  • Neck: the stage consisting of various upsampling and downsampling steps that compute and collect feature maps. The path aggregation block in YOLOv4 is a modified, concatenation-based version of the Path Aggregation Network (PANet), and it is implemented similarly in YOLOv5. The neck in YOLOv5 also contains a Spatial Pyramid Pooling (SPP) block, a pooling method that increases the receptive field of the backbone and produces fixed-size features for training the detector.
  • Head: the final detection stage, which predicts the classes, their confidence scores, and the bounding box coordinates.

YOLOv5 itself comes in a range of model sizes, namely small, medium, large and xlarge. They differ in the number of learnable parameters and, consequently, in their run times. The paper being reproduced went with the small and medium models, possibly because of their manageable performance for mobile-oriented applications.

YOLOv5's GitHub repository provides a simple, modular interface for training and evaluating models, making it easy to apply these deep learning models to one's own data.

Reproduction

Follow us on a journey where we explain the process of reproducing the results of the paper. We begin with a brief note on the experimental setup. Google Colab was our go-to platform for running our PyTorch code. The datasets are uploaded to Google Drive for easy access, and training runs on Google's GPUs.

  • Firstly, import all the necessary Python and PyTorch libraries. We clone the latest configuration files and scripts from the YOLOv5 GitHub repository, after which we install its required dependencies. All these files are necessary for training and evaluation. Weights & Biases (wandb) is also used to visualise and track the process.
!git clone https://github.com/ultralytics/yolov5  # clone the YOLOv5 repository
!pip install -qr yolov5/requirements.txt  # install its dependencies
  • Next, we define the data configuration file, which is a yaml file (a format that stores data in a serialised, human-readable way). This yaml file points to the directories of the training and validation images and lists the class labels: 0 for person, to identify people in the images, and 1 for license, to identify license plates on vehicles. A minimal example is sketched below.
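The paths below are placeholders (the real directory layout is part of the confidential dataset); only the class mapping matches what we trained with:

# data.yaml — illustrative paths, the real directories are confidential
train: ../dataset/images/train
val: ../dataset/images/val
nc: 2                        # number of classes
names: ['person', 'license']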
  • Further, we describe the network architecture in a yaml file, as shown below. It includes parameters such as the number of classes, width_multiple (a factor that reduces the number of channels in the layers and, in turn, the number of parameters) and depth_multiple (a scaling factor for the number of bottlenecks in a module). YOLOv5 will also evolve new anchors based on the dataset if the existing anchors do not fit the objects well; this anchor check is a default part of the YOLOv5 training workflow. The backbone stage starts with a 'Focus' layer, which increases the number of channels (depth) while decreasing the spatial dimensions of the input, potentially reducing the computational cost of later convolutions. The 'head' section in this file covers both the Neck and the final detection Head. Rendering the model with a neural network visualiser like Netron produces a gigantic structure: https://drive.google.com/file/d/1GafuF0hF_QS9clAwVySTHlsiGeOwH7Ds/view?usp=sharing
# parameters
nc: 2  # number of classes
depth_multiple: 0.33  # model depth multiple
width_multiple: 0.50  # layer channel multiple
# adam: true

# anchors
anchors:
  - [10,13, 16,30, 33,23]  # P3/8
  - [30,61, 62,45, 59,119]  # P4/16
  - [116,90, 156,198, 373,326]  # P5/32

# YOLOv5 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Focus, [64, 3]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 9, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 1, SPP, [1024, [5, 9, 13]]],
   [-1, 3, C3, [1024, False]],  # 9
  ]

# YOLOv5 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
  • Next up, it is time to set the hyperparameters. We used the values from the configuration files given by the authors where available, and the defaults from YOLOv5's hyperparameter yaml otherwise. These defaults were evolved with a genetic algorithm on large datasets like COCO; such an evolution takes many hours of GPU runtime, which is why we stuck with these values. We also studied the effect of some of these parameters on model performance (more on this in later sections).
  • Define the output directories to store the plots and the log files. We now begin to train the model!
    - Keeping to the values the authors trained with, we set the number of epochs to 50 and the batch size to 8. The optimiser is Adam.
    - The image size is set to a high resolution of 2048 so that even small objects are detectable, and checkpoints are saved every 4 epochs. As we are trying to replicate the published results, training is carried out with no pretrained weights.
    - After passing all the configuration files discussed earlier, we finally run the training script:
!python3 train.py --adam --img 2048 --batch 8 --epochs $epoch --save_period 4 --data $data_yaml --cfg $cfg_yaml --weights '' --name $out_name
  • The training script logs all the information needed to plot the metrics. It also saves a sample of augmented training batches and some of the examples that were misclassified.
  • We evaluated the metrics and also inspected the predictions visually on a few test images:
!python detect.py --weights $weight_dir --conf $conf --source $img_source --project $save_img_dir
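To connect the detections back to the original goal of blurring, the sketch below shows one way the trained weights could be used to blur the predicted regions. This is our own illustrative code, not the municipality's pipeline; the weight path and the confidence threshold are placeholders.

import cv2
import torch

# Load the trained weights through the YOLOv5 torch.hub interface (path is a placeholder)
model = torch.hub.load('ultralytics/yolov5', 'custom', path='runs/train/exp/weights/best.pt')
model.conf = 0.25  # confidence threshold; lowering it trades precision for recall

img = cv2.imread('panorama.jpg')                       # BGR image from disk
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)             # the model expects RGB
results = model(rgb, size=2048)                        # inference at 2048 px

# results.xyxy[0] holds one row per detection: x1, y1, x2, y2, confidence, class
for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
    x1, y1, x2, y2 = map(int, (x1, y1, x2, y2))
    img[y1:y2, x1:x2] = cv2.GaussianBlur(img[y1:y2, x1:x2], (51, 51), 0)  # blur person/plate

cv2.imwrite('panorama_blurred.jpg', img)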

Difficulties faced in reproduction

It is not surprising that the details pertaining to reproducibility in the paper are rather vague, as the authors do not aim to introduce a novel method. Instead, they intend to demonstrate how object detection models can identify the sensitive information that needs to be blurred.

The paper merely mentions YOLOv5, with little to no information about the model variant used or the hyperparameter configuration. YOLOv5 requires configuration files (.yaml) to run, which were not provided to us and are not described in the paper, so we had to write them ourselves. The paper also does not mention which version of YOLOv5 was used to obtain the results. From the dataset shared with us it is apparent that the authors used the small and medium variants, with both producing similar results. However, there are 9 versions available, with trainable parameters ranging from 7.3 million to 141.8 million. This detail is not trivial: it may leave a reproducer unsure which version to select, and choosing a larger one than necessary wastes computation. We used both the small and medium models, with details as below:

Difference between the small and medium version of YOLOv5

Details about the data augmentation methods used to achieve the reported results are also not clearly stated. The authors say they use data augmentation methods like mosaic and mix-up but do not state what values these were set to, nor which other augmentation methods were used and whether they were left at their defaults or modified.

While training the medium version we could not use the batch size of 8 and had to reduce it to 4 in order to fit the model into GPU memory.

The dataset given to us was already structured to be directly compatible with YOLOv5, using the YOLO annotation format, but this was not clearly conveyed. As the file structure was unintuitive, it took us a while to figure out how to use it.

Finally, in order to aid reproducibility the authors may consider sharing the code, but we understand this may undermine the confidentiality of the information.

Metrics used

In addition to the standard classification error, precision and recall, YOLOv5 also uses the metrics explained below:

IoU: evaluates how close a predicted bounding box is to the ground truth. It is defined as the intersection area of the two boxes divided by their union area.
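As a concrete illustration (our own helper, not part of the YOLOv5 code base), the IoU of two boxes given in [x1, y1, x2, y2] format can be computed as:

def iou(box_a, box_b):
    # coordinates of the intersection rectangle
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

iou([0, 0, 10, 10], [5, 5, 15, 15])  # 25 / 175 ≈ 0.14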

AP: The average precision averaged over multiple Intersection over Union (IoU) values

mAP@0.5 or AP50: the mean average precision over all categories at an IoU threshold of 0.5.

Box: the bounding box regression loss, which in YOLOv5 replaced the deprecated GIoU plot. GIoU is a generalised IoU that also accounts for how close the predicted and true bounding boxes are when they do not overlap.

Objectness: the confidence that an object is present in the bounding box. It is predicted using logistic regression and should be 1 if the bounding box prior overlaps a ground truth object more than any other bounding box prior does. YOLOv5 uses a threshold of 0.5.

Results of reproduction

These results are obtained on a resolution of 2048 x 1024 pixels.

Summary of results obtained

It can be seen that the reproduction effort is successful: the output values are nearly the same as those the paper claims. The confusion matrix of the reproduced results is shown below.

The confusion matrix shows that the results are fairly good: 73% of people and 88% of license plates would be correctly identified and blurred. The 27% false negative rate for the person class is a bit concerning though. We address this in the next section and suggest using a lower confidence threshold to improve recall.

The PR curve is shown below:

It is evident that at higher recall the precision drops rapidly, especially for the person class. This means that if we decrease the confidence threshold, many false positive person detections will appear. Even at the cost of having non-sensitive information unnecessarily blurred, this is still better than having sensitive information revealed.

Small version

Original Result — YOLOv5s Small version
Reproduced result — YOLOv5s Small version

It is seen that for the small version all the training and validation curves approximately match those of the authors in the dataset shared with us. The small version achieves an AP50 value of about 80%, using 7.3 million parameters.

Medium version

Original Result — YOLOv5m Medium version
Reproduced result — YOLOv5m Medium version

Again, for the medium version the results are reproduced successfully. This time the model achieves an AP50 value of 82% using 21.4 million parameters: about 3 times that of the small version. We argue that the drastic increase in computation is not worth such a marginal increase in accuracy. For all further experimentation in this blog we use the small version.

A comparison of the results per class is also presented below. These too match those of the original paper sufficiently well.

Summary of Results per class

Identification of objects misclassified

In order to understand how to improve the accuracy of the model, we identified objects that were frequently misclassified, to help guide future improvements. Upon investigating the predictions, we found that certain types of objects often generated false positives; we discuss them below.

With these new insights we offer some suggestions for improvement.

It was observed that many of the misclassified objects were not on or just above the ground plane, where the objects of interest are expected to be. For instance, a window in a high-rise building was misclassified as a license plate, and the top of a light pole was classified as a person. Occasionally a license plate was even predicted (with low confidence) on the roof of the photographer's own car. To address this, we suggest focusing on the areas where the objects are expected, i.e. pixels with y-axis values between 400 and 700 at the annotated resolution. All the license plates and persons occur in this band only, so region proposals should focus there. It may also be worth cropping out the sky and the ground to reduce the computation effort, as sketched below.
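A minimal sketch of this idea follows. The band limits are our empirical observation at the 2000 x 1000 annotation scale, not values from the paper, and the helper function is purely illustrative:

import cv2

def crop_to_band(img, y_top=400, y_bottom=700, ref_height=1000):
    # Rescale the band if the image height differs from the 1000 px reference,
    # then crop the panorama to the horizontal strip containing persons and plates.
    scale = img.shape[0] / ref_height
    top, bottom = int(y_top * scale), int(y_bottom * scale)
    return img[top:bottom, :], top  # return the offset to map boxes back later

img = cv2.imread('panorama.jpg')
band, y_offset = crop_to_band(img)
# run the detector on `band`, then add y_offset to the predicted y-coordinates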

For the license class, it was observed that anything containing text or intricate yellow shapes, such as leaves, is easily misclassified. To avoid this, we suggest training with more such confusing examples to make the model more robust. While we did not come across such cases ourselves, this also raises concerns over whether foreign license plates would be detected and blurred accurately.

However, we must note that the false negatives are perhaps more important, as they allow sensitive information to pass through. We suggest tuning for higher recall, for instance by lowering the confidence threshold at detection time. While a model with high recall will predict more incorrect labels, the main goal of removing sensitive information will be addressed better.
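For example, recall can be increased simply by lowering the --conf threshold passed to the detection script shown earlier; the value of 0.1 below is illustrative, not a tuned recommendation:

!python detect.py --weights $weight_dir --conf 0.1 --source $img_source --project $save_img_dir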

Hyperparameter sensitivity check

YOLOv5 offers a list of around 25 hyperparameters, which can be seen along with their default values in the list below. The three we attempt to tune later are shear, perspective and mixup.

lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3) 
lrf: 0.2 # final OneCycleLR learning rate (lr0 * lrf)
momentum: 0.937 # SGD momentum/Adam beta1
weight_decay: 0.0005 #optimizer weight decay 5e-4
warmup_epochs: 3.0 # warmup epochs (fractions ok)
warmup_momentum: 0.8 # warmup initial momentum
warmup_bias_lr: 0.1 # warmup initial bias lr
box: 0.05 # box loss gain
cls: 0.5 # cls loss gain
cls_pw: 1.0 # cls BCELoss positive_weight
obj: 1.0 # obj loss gain (scale with pixels)
obj_pw: 1.0 # obj BCELoss positive_weight
iou_t: 0.20 # IoU training threshold
anchor_t: 4.0 # anchor-multiple threshold #
anchors: 3 # anchors per output layer (0 to ignore)
fl_gamma: 0.0 # focal loss gamma (efficientDet default gamma=1.5)
hsv_h: 0.015 # image HSV-Hue augmentation (fraction)
hsv_s: 0.7 # image HSV-Saturation augmentation (fraction)
hsv_v: 0.4 # image HSV-Value augmentation (fraction)
degrees: 0.0 # image rotation (+/- deg)
translate: 0.1 # image translation (+/- fraction)
scale: 0.5 # image scale (+/- gain)
shear: 0.0 # image shear (+/- deg)
perspective: 0.0 # image perspective (+/- fraction), range 0-0.001
flipud: 0.0 # image flip up-down (probability)
fliplr: 0.5 # image flip left-right (probability)
mosaic: 1.0 # image mosaic (probability)
mixup: 0.0 # image mixup (probability)

The default values shown are already optimised for object detection on the COCO dataset. These hyperparameters are configured via the configuration (.yaml) files in the /data directory. YOLOv5 also provides a mechanism to evolve these parameters using a genetic algorithm, but this is resource- and time-consuming. Instead, we evaluate the effect of changing one hyperparameter at a time, keeping the rest constant. The paper mentions augmentation but refrains from commenting on the values used or their effect:

For YOLO, we also use recently published augmentations such as Mix-up and Mosaic.

In this section, we also assess the sensitivity to perspective and shear, since this kind of geometric augmentation can help the object detection task.

Perspective

The perspective hyperparameter applies a perspective transformation to the training images. The image stitching and the projection of the spherical imagery to 2D may distort the images, and we were under the impression that the perspective hyperparameter might help counteract these effects and improve accuracy.

An exaggerated example of a perspective transformation is shown below for representation purposes.

Perspective transformation- Exaggerated effect for demonstration. Original Image Source: Google Street View

We trained the model with values of 0.0005 and 0.001, as sketched below.
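One way to do this (the file name and command below are our own, illustrative choices) is to copy the default hyperparameter file, change only the perspective entry, and pass it to the training script with the --hyp flag:

# hyp.perspective.yaml — identical to the defaults listed above except for:
perspective: 0.0005  # the second run used 0.001

!python3 train.py --adam --img 2048 --batch 8 --epochs 50 --hyp hyp.perspective.yaml --data $data_yaml --cfg $cfg_yaml --weights '' --name perspective_0005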

Results for Perspective = 0.0005
Results for Perspective = 0.001

It is observed that the results for both values are worse than the original validation curves. As the perspective value is increased from 0.0005 to 0.001, the instability in training also increases, causing the curves to become noisy. We conclude that the perspective transformation distorts the image, making it harder for the classifier to identify decision boundaries. Hence, we do not advise adding perspective transformations.

Shear

The shear hyperparameter applies a shear transformation to the training images, as shown in the representative example below. Again, we were motivated to tune this hyperparameter as we believed it might counteract distortions from image stitching and improve accuracy.

Shear- Exaggerated effect for demonstration. Original Image Source: Google Street View

We applied shears of 10 and 20 degrees and trained the model again, with the results below:

Results for shear = 10 degrees
Results for shear = 20 degrees

A shear of 10 degrees performed better than a shear of 20 degrees; the latter shows more instability during training and noisier validation curves. Both values fail to match the precision of the original model. We conclude that the shear tilts the objects in the image and makes the classifier's job harder. Hence, we do not advise increasing this hyperparameter.

Mixup

Mixup trains on convex combinations of pairs of examples and their labels: two images are blended with a mixing coefficient, and the labels are adjusted with the same coefficient. This makes the model more robust to adversarial examples. Using mixup improved generalisation on CIFAR-10, ImageNet-2012 and other datasets, and in the YOLOv4 model this data augmentation method was effective in improving detection. This motivated us to evaluate its effect for this project too. An example of a mixup training image is shown below, with alpha = 0.5, and a small sketch of the operation follows.
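The sketch below illustrates the mixup idea on a pair of equally sized images; it is a simplified version of the concept rather than YOLOv5's exact implementation, whose details (such as the Beta distribution parameters) differ:

import numpy as np

def mixup(img1, img2, boxes1, boxes2, alpha=0.5):
    # Blend two images with a ratio lam drawn from a Beta(alpha, alpha) distribution.
    lam = np.random.beta(alpha, alpha)
    mixed = lam * img1.astype(np.float32) + (1 - lam) * img2.astype(np.float32)
    # Keep the bounding boxes of both source images for the blended result.
    return mixed.astype(np.uint8), np.concatenate([boxes1, boxes2], axis=0)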

Mixup: Combination of 2 training images. Original Image Source: Google Street View
Results for Mixup = 0.5
Result for Mixup = 1.0

Using mixup produces results similar to the original ones. There is a peculiar, unexplained drop in the mAP@0.5 curve after about 23 epochs. Although close, the precision values are actually slightly lower than the original results. Mixup has been shown to be effective in increasing the accuracy of classification models, and perhaps the results would converge to better values after longer training. Unfortunately, we did not have the computing resources to train for extended periods of time. Further investigation and reproduction of these results is required, so for now we suggest keeping the original value.

Making predictions on new data

A robust object detection model should generalise to new data drawn from a similar distribution. To evaluate how well the reproduced model generalises to images taken in different settings, we made predictions on new images. We obtained images of Amsterdam's streets from Google Street View and captured some of our own panoramic images in Delft. As these images are not from the original dataset, we can include them in this blog post.

We only have a few such images, which is insufficient to produce a reliable validation score. Hence, we do not annotate the images to calculate scores, but inspect the results visually instead.

Predicting panoramic images taken from Google Street View

Despite the images having already been blurred by Google, the model identifies objects of both the person and license class with fair accuracy, as seen below. There are a few false negatives (e.g. the cyclist on the bridge), but these are usually unrecognisable to us anyway. The images have a resolution of 2048 x 1024 pixels, the same resolution the reproduced model was trained at.

It also appears that these images use a different stitching and projection to generate the 2D image. This means straight lines may not remain straight in the projection, which could cause issues for the detection model.

Predicting panoramic images taken by us in Delft

Out of further curiosity, and in order to include images in this blog, we took some panoramic images ourselves on the streets of Delft. These images have a much larger resolution of 9696 x 3072 pixels, allowing even people significantly further away to be detected accurately. The model identifies all the license plates that would give away sensitive information. It also correctly identifies the people in the images, both near and far away, as can be seen below. Again, these images were taken in a different setting and location, and with a different resolution and stitching algorithm, than the dataset the model was trained on.
It is not surprising that the accuracy is good, as CNNs are shift invariant and modern object detection models are well tuned to draw decision boundaries between such objects. Nevertheless, it is reassuring to see the results.

Conclusion

In conclusion, we were able to successfully reproduce the authors' best claims. We learnt how state-of-the-art object detection frameworks like YOLOv5 are making it increasingly easy to train advanced models and make accurate predictions. These models work significantly better than those of the recent past and are still evolving rapidly, and we saw first-hand the issues developers face when releasing a new version of such a framework. This paper was unlike others and posed a different kind of challenge: we learnt how deep learning is deployed in industry to solve problems related to justice and people's rights, not merely to technology.

In order to aid future work, we attempted to tune the hyperparameters to the data and measured the robustness of the model on new images from other sources. Due to the confidential nature of the project we were unable to include pictures of the cases where the model could be improved, but we hope to have described them well anyway.

Rohan did most of the work related to setting up the initial YOLO configuration files, and the rest of the training was a combined effort, with each of us training one version or parameter. Vivian worked more on identifying misclassified images. Tuning the hyperparameters, collecting new images and writing this blog was a joint effort. We hope you enjoyed reading it.

References

[1] YOLOv5 Github: https://github.com/ultralytics/yolov5

[2] Mixup: https://arxiv.org/abs/1710.09412

[3] Evaluation metrics definition: https://cocodataset.org/#detection-eval

[4] Improving classification accuracy: https://core.ac.uk/download/pdf/208323664.pdf

[5] YOLOv3 paper which explains the metrics: https://arxiv.org/abs/1804.02767

[6] YOLOv4 paper which explains some of the data augmentations: https://arxiv.org/abs/2004.10934

[7] Training speed claims: https://blog.roboflow.com/yolov4-versus-yolov5/

[8] Cross Stage Partial Network (CSP): https://arxiv.org/pdf/1911.11929.pdf

[9] YOLOv5 improvements : https://blog.roboflow.com/yolov5-improvements-and-evaluation/, https://arxiv.org/pdf/1608.06993.pdf

[10] Path aggregator network issue on Github: https://github.com/ultralytics/yolov5/issues/1410

[11] Path aggregation network for Instance Segmentation paper: https://arxiv.org/pdf/1803.01534.pdf

[12] depth_multiple and scaling factor: https://github.com/ultralytics/yolov5/issues/2367

[13] Anchor evolution: https://github.com/ultralytics/yolov5/issues/1901

[14] Backbone stage and computation costs: https://github.com/ultralytics/yolov5/issues/804

[15] Netron website: https://netron.app/

[16] Hyperparameter evolution: https://github.com/ultralytics/yolov5/issues/607
