Detecting the missing files when comparing two directories can be tricky. Here is my scenario:
I am trying to pass some images through a small piece of software for batch processing, and I always get a timeout because of the large number of images. There are around 80,000 images to process and the software gets stuck after (let's say) 10,000 images, so I have to start all over again with no idea which files are missing.
So my solution is to take out the images that have been processed from the folder and feed the program with the images that have not been processed.
To find them, I have to compare the 2 directories for duplicated files. In other words, “detect the missing files”.
The comparison itself is simple. You just iterate through the first directory and check whether a file with the same name exists in the second. If it exists, it has been processed, so I move it to a third folder (I prefer this, just in case) or simply delete it. This way the processed files are taken out and the files that have not yet been processed are left untouched.
Detect missing files while comparing 2 folders – the PHP function
So to get the job done, I came up with this function:
    /**
     * @param array   $dir:    this is an array with your custom paths
     * @param string  $ext:    File extension
     * @param boolean $move:   If false, it will delete the file.
     * @param boolean $output: If false, no message will be output to screen.
     * @return string
     */
    function compare_two_directories($dir, $ext=".jpg", $move=true, $output=true){
        $files = glob( $dir[1]."/*".$ext );
        $count = 0;

        if($output) echo "<pre>I found these duplicate files:<br />";

        foreach ($files as $file) {
            $file_name = basename($file);

            // check if file exists in the second directory
            if(file_exists($dir[2]."/".$file_name)){
                if($output) echo "$file_name";

                if($move) {
                    rename($file, $dir[3]."/".$file_name); // move the image to folder 3.
                    if($output) echo " <span style='color:green'>moved</span> to ".basename($dir[3])."<br />";
                } else {
                    unlink($file); // just delete the image
                    if($output) echo " <span style='color:red'>deleted</span><br />";
                }
                $count++;
            }
        }

        if($output) echo "</pre>";

        return "Done processing and found <span style='color:green'>$count</span> duplicated <span style='color:red; font-weight:bold;'>$ext</span> files ";
    }
Use the function to compare files inside directories
You can call it like this. I also like to include the time that was used, just for statistics, but you can omit it.
    // this is the path to your script file and your directories are relative to it.
    define("MY_PATH", dirname(__FILE__));
    // define("MY_PATH", "/var/www/my/custom/path/to/my/directories");

    // set your custom paths
    $dir[1] = MY_PATH."/dir1";
    $dir[2] = MY_PATH."/dir2";
    $dir[3] = $dir[1]."_processed"; // please note the folder "dir1_processed" must exist if you want to move files to it

    // call the function and get the job done
    echo compare_two_directories($dir);

    // this next line is an example for .png images: DELETE files and send output messages to screen
    // echo compare_two_directories($dir, ".png", false, true);
But if you only need to list the duplicated files, just comment out the rename call like this and the files will only be displayed on screen. Keep in mind that the message will still say “moved”, although nothing is moved.
    // rename($file, $dir[3]."/".$file_name);
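Since commenting out the rename call still prints the “moved” message, a separate read-only helper may be cleaner. The following is only a minimal sketch and not part of the original script: the function name `list_duplicate_files` is made up for illustration, and it assumes the same `$dir` array layout used above.

```php
<?php
/**
 * Hypothetical read-only helper (not part of the original script):
 * returns the names of files from $dir[1] that also exist in $dir[2],
 * without moving or deleting anything.
 *
 * @param array  $dir Paths, indexed like in compare_two_directories()
 * @param string $ext File extension, including the dot
 * @return array List of matching file names
 */
function list_duplicate_files($dir, $ext = ".jpg")
{
    $duplicates = array();

    $files = glob($dir[1] . "/*" . $ext);
    if ($files === false) {
        return $duplicates; // glob() reported an error
    }

    foreach ($files as $file) {
        $file_name = basename($file);
        // same check as in the main function, but nothing is touched
        if (file_exists($dir[2] . "/" . $file_name)) {
            $duplicates[] = $file_name;
        }
    }

    return $duplicates;
}
```

Calling `print_r(list_duplicate_files($dir));` would then show the matches, and both folders stay exactly as they were.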
Get more stats when comparing directories for duplicated files
I also like to know the processing time, just for statistics. To do that, you can wrap the code above like this:
    $time = microtime(true); // current Unix timestamp in seconds (float, microsecond precision)

    // the code here

    echo "<br />Processing took <span style='color:blue'>".round( (microtime(true) - $time), 2).'</span> seconds';
Note that all directories must be on the same level as the script file if you want the script to work out of the box. If that is not your case, feel free to edit it so it fits your file structure or your server configuration.
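Because the target folder has to exist before files can be moved into it, a small check before calling the function can save a failed run. This is just a sketch under assumptions: `prepare_directories` is a hypothetical helper name, and it expects the `$dir` array with indexes 1 to 3 as set up earlier.

```php
<?php
/**
 * Hypothetical helper (not part of the original script): checks that the
 * two folders to compare exist and creates the target folder ($dir[3])
 * if it is missing.
 *
 * @param array $dir Paths, indexed 1 to 3 as in the examples above
 * @return bool True when all three folders are ready to use
 */
function prepare_directories($dir)
{
    // Both folders to compare must already be there.
    if (!is_dir($dir[1]) || !is_dir($dir[2])) {
        return false;
    }

    // Create the "processed" folder on demand (third argument = recursive).
    if (!is_dir($dir[3]) && !mkdir($dir[3], 0755, true)) {
        return false;
    }

    return true;
}
```

You could then guard the call with something like `if (prepare_directories($dir)) echo compare_two_directories($dir);`.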
See memory usage
Speaking of server configuration, you will normally need a lot of memory if you have to compare lots of files. I normally do this kind of job on a local machine using XAMPP, but you can also do it on your normal server. You can take a peek at your memory usage with this little function:
    function get_memory() {
        $size = memory_get_peak_usage(true); // peak memory allocated by PHP, in bytes
        $unit = array('b','kb','mb','gb','tb','pb');
        $i = (int) floor(log($size, 1024)); // index of the matching unit
        return round($size / pow(1024, $i), 2).' '.$unit[$i];
    }

    echo "<br />".get_memory()." of memory were used while processing";
Just place this at the end of your file to see the memory usage when comparing the 2 folders.
Get “compare directories for missing files” script
You can download a full working copy of this script from the GitHub repository and compare your directories for missing files. Here is a screenshot of it working. I agree with you that it needs some more style 😉
There is also a second option: store the file names of both directories in 2 distinct arrays and then just compare the two arrays. If there is a match, move the file to a third folder or delete it. I have not tested the two alternatives against each other, but I think the first one is faster than the second because it doesn’t need to iterate over the second directory. I could be wrong, though, since it has to do lots of single-file checks.
Please let me know if you try this second approach and which one worked best for you.
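To make the array-based alternative more concrete, here is a rough sketch of how it could look. It is untested against large folders and makes no claim about speed. The function name `compare_with_arrays` and the keys of the returned array are invented for this example.

```php
<?php
/**
 * Hypothetical sketch of the array-based approach: read both folders once,
 * then compare the two lists of file names in memory.
 *
 * @param string $first  Folder with the original files
 * @param string $second Folder with the processed files
 * @param string $ext    File extension, including the dot
 * @return array 'processed' => names found in both folders,
 *               'missing'   => names found only in the first folder
 */
function compare_with_arrays($first, $second, $ext = ".jpg")
{
    // glob() returns false on error, so fall back to an empty list.
    $a = array_map('basename', glob($first . "/*" . $ext) ?: array());
    $b = array_map('basename', glob($second . "/*" . $ext) ?: array());

    return array(
        // array_values() re-indexes the results from 0
        'processed' => array_values(array_intersect($a, $b)),
        'missing'   => array_values(array_diff($a, $b)),
    );
}
```

The sketch only reports names. Moving or deleting the files in `processed` would still need a loop with `rename()` or `unlink()`, as in the first approach.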