Replacing a watermark on a single image can be an easy task. Replacing it on a set of 50,000 images can be a tricky job.
This case is really simple. The company changed its logo. All the images on their website were watermarked with the old logo, and they wanted to re-watermark all of them. They had not stored the original photos on upload, only the watermarked versions.
So this is where I came in and tried to sort things out.
Task list to replace the watermark:
Step 1: Find the logo on the images
- Images are not all watermarked in the same way. Some have the logo at 50% transparency in the center and some have 6 small logos distributed over the image.
- Not all images have the same size, but I can clearly distinguish a pattern:
850 x XXX, 600 x XXX, XXX x 850, XXX x 600
About 2% of the images cannot be grouped by this size pattern.
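As a rough illustration of the size pattern above, here is a minimal sketch in Python (not the PHP used for the actual job). It assumes that "850 x XXX" means a width of 850 with any height, and so on; the function name and the group labels are hypothetical.

```python
from typing import Optional

# Fixed edge lengths taken from the size pattern in the text.
TARGET_EDGES = (850, 600)


def size_group(width: int, height: int) -> Optional[str]:
    """Return a label such as '850xN' or 'Nx600', or None if nothing matches.

    Width is checked before height, so an 850 x 600 image lands in '850xN'.
    That tie-break is an assumption, not something stated in the text.
    """
    for edge in TARGET_EDGES:
        if width == edge:
            return f"{edge}xN"
    for edge in TARGET_EDGES:
        if height == edge:
            return f"Nx{edge}"
    return None  # part of the roughly 2% that do not fit the pattern
```

Images for which this returns None would be the ones that need separate handling.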
So here is what I did. I used some custom PHP scripts on a local XAMPP server and some simple desktop software.
- I dug into the matter a little and found that BatchInpaint can remove objects such as watermarks from an image. The problem was that my images had different sizes and different logo positions, so I had to work out how to separate them by size and logo position.
- Crop all images and store them in different folders grouped by size for further processing.
- Find (detect) the watermark position pattern using the PHP GD library (or at least this was my intention).
At this point I came to a dead end. Separating the images seemed impossible using the PHP GD library or any of the other OCR libraries out there.
Although I found and tested some functions and classes, they all gave a negative result, since the old logo is 50% transparent over the image, so comparing pixel colors is not a good idea. Not to mention that it is a tremendously time-consuming process for 50,000 images (about 2GB). I have read good things about OpenCV, but I was not able to set up the software on my Windows PC, and learning it seemed impossible in the time I had. So if you have time, take a close look at this software. I will come back to this a little later in this tutorial.
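To see why comparing pixel colors fails, here is a small sketch in Python. It assumes the watermark was applied with ordinary alpha blending at 50%, which I cannot confirm for these images, and the logo color and background colors are made up for the example.

```python
def blend(logo, background, alpha=0.5):
    """Blend one RGB logo pixel over one RGB background pixel.

    Ordinary alpha blending: result = alpha * logo + (1 - alpha) * background.
    """
    return tuple(
        round(alpha * l + (1 - alpha) * b) for l, b in zip(logo, background)
    )


logo_pixel = (200, 40, 40)  # hypothetical logo color

# The same logo pixel over a light and a dark part of a photo.
on_light = blend(logo_pixel, (240, 240, 240))
on_dark = blend(logo_pixel, (20, 20, 20))

print(on_light, on_dark)
```

The same logo pixel ends up with a different color depending on what is underneath it, so searching for the exact logo colors finds nothing.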
Remember that all the images were spread over 10,000 folders and some had the same name, like main_1.png, main2_.png and so on.
I started by building a simple correspondence database with just 2 fields. One holds the ID and the other the image path: “folder/image_name.png”
Second, I created a small script that moved all the images into a single folder and renamed them with the ID key, like 1.png, 2.png, 3.png and so on. At this point I could at least see all the images at once and quickly identify which ones needed to be moved.
Given the time I had to deliver the job, I stopped investigating at this point and started thinking like my kid does: just separate all the images manually. So my idea was to empty the Recycle Bin, open one image and step through all of them with the arrow keys. Whenever I found an image without the centered logo, I deleted it with the Delete key and continued with the arrow keys. This was the quickest (neither the most elegant nor the smartest) solution I found.
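The correspondence table and the move-and-rename script could look roughly like this. It is a sketch in Python with SQLite, not the PHP script I actually ran; the table name `images`, the function name and the choice of SQLite are assumptions made for the example, and it only handles .png files.

```python
import shutil
import sqlite3
from pathlib import Path


def flatten_images(source_root: Path, target_dir: Path, db_path: Path) -> int:
    """Move every PNG under source_root into target_dir as <id>.png.

    The original relative path is stored in a two-field table (id, path)
    so each file can be put back in its folder later.
    target_dir is assumed to live outside source_root.
    Returns the number of images moved.
    """
    target_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, path TEXT NOT NULL)"
    )
    moved = 0
    # sorted() builds the full list first, so moving files
    # does not interfere with walking the folders.
    for image in sorted(source_root.rglob("*.png")):
        relative = image.relative_to(source_root).as_posix()
        cursor = conn.execute(
            "INSERT INTO images (path) VALUES (?)", (relative,)
        )
        new_name = f"{cursor.lastrowid}.png"
        shutil.move(str(image), str(target_dir / new_name))
        moved += 1
    conn.commit()
    conn.close()
    return moved
```

Because the new file name is the row ID, two files called main_1.png in different folders no longer collide, and the path column says where each one came from.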
It took me around 4-5 hours, including the breakfast 🙂
This was awesome, you should try it sometime with your friends... nope, not really. It was the most boring task I have ever completed and lived to tell the story.
At this point I had all the images with the centered logo in one folder and all the others in the Recycle Bin. I restored those and put them in a different folder on my drive (there were about 10,000 of them). I thought that adding a third column to my table, with a string on every row specifying the logo position, was a good idea. So I did it.
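Filling that third column can be automated from the two folders. The following is a sketch in Python with SQLite, assuming a table named `images` with an integer `id` and files named <id>.png; the column name `logo_position` and the labels passed in are hypothetical.

```python
import sqlite3
from pathlib import Path


def tag_logo_positions(conn: sqlite3.Connection, folders: dict) -> int:
    """Write a logo-position label on every row whose file is in a folder.

    folders maps a label (for example "center") to the directory that
    holds the <id>.png files of that group. Returns the rows updated.
    """
    # Add the third column only if it is not there yet.
    columns = [row[1] for row in conn.execute("PRAGMA table_info(images)")]
    if "logo_position" not in columns:
        conn.execute("ALTER TABLE images ADD COLUMN logo_position TEXT")

    tagged = 0
    for label, folder in folders.items():
        for image in Path(folder).glob("*.png"):
            if not image.stem.isdigit():
                continue  # skip files that are not named after an ID
            cursor = conn.execute(
                "UPDATE images SET logo_position = ? WHERE id = ?",
                (label, int(image.stem)),
            )
            tagged += cursor.rowcount
    conn.commit()
    return tagged
```

It would be called with something like `{"center": Path("centered"), "tiled": Path("others")}`, where the folder names are again only placeholders.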
Step 2: Start resizing images
I used a small library called ImageManipulator (which you can download here) to resize the images into these resolutions, and I got this dimension pattern:
850 x XXX, 600 x XXX, XXX x 850, XXX x 600
I searched for images with a ratio between 0.6 and 1.6 and got a really good result: about 45,000 images fell into these 4 groups. For the other 5,000 images I had to be more permissive with the ratio, so I tried a search between 0.4 and 2.2 and got another 4 groups with about 4,000 images.
The third time I ran the script again with a min ratio of 0.2 and a max ratio of 3.5 and got all the remaining images into 4 other folders, except 2 that I will process separately due to their large aspect ratios of 4 and 5.2 respectively.
All good for now. I will continue this tomorrow in Part II of this case study.
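The three passes can be summed up in a few lines. This is a sketch in Python rather than the PHP script I ran, and it assumes the ratio is width divided by height and that the limits are inclusive; the function name is hypothetical.

```python
from typing import Optional

# (min ratio, max ratio) for the first, second and third pass.
RATIO_PASSES = [(0.6, 1.6), (0.4, 2.2), (0.2, 3.5)]


def ratio_pass(width: int, height: int) -> Optional[int]:
    """Return 1, 2 or 3 for the first pass that accepts the image.

    Returns None when the aspect ratio is outside every range, which is
    the case for the two leftover images with ratios 4 and 5.2.
    """
    ratio = width / height
    for number, (low, high) in enumerate(RATIO_PASSES, start=1):
        if low <= ratio <= high:
            return number
    return None
```

Running the narrow range first keeps the most common shapes together, and each later pass only has to deal with what the previous one left behind.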