Tuesday, September 4, 2007

Finding duplicate files on your system?!

Like many others, I have a lot of media files (mp3, jpg, avi, etc.) lying around on my system, and I wondered how to get a list of all the duplicate files on my computer. Writing a Ruby script that identifies duplicate files by the MD5 hash of their contents was not a difficult task. Here is the script.


#!/usr/bin/ruby

## This file finds all the duplicate files from a directory given
## at the command line.
## Released under the GPLv2
## Copyright (C) tuxdna(at)gmail(dot)com

require 'digest/md5'

## print usage and bail out if no directory was given
if ARGV.empty?
  puts "Usage: #{$0} <directory>"
  exit 1
end

directory = ARGV[0]
print "Name of the directory given is: ", directory, "\n"

## do not proceed if the argument is not a directory
exit 1 unless File.directory?(directory)

puts "Getting the list recursively, for all the files and sub-directories."
filelist = Dir[directory+"/**/*"]

puts "Now scanning the files: "
puts "Determining file size and Filtering the directories:"

## group files by size; only files of equal size can be duplicates
sizehash = Hash.new { |h, k| h[k] = [] }
filelist.each do |filename|
  if File.file?(filename)
    sizehash[File.size(filename)].push(filename)
  end
end

## prune sizes that have only one file; those cannot have duplicates
sizehash.delete_if { |size, files| files.length <= 1 }

## among files of equal size, group by the MD5 hash of their contents
duplicates_md5 = Hash.new { |h, k| h[k] = [] }
sizehash.each do |size, files|
  files.each do |filename|
    ## hexdigest returns a string, so it can be used directly as a key
    md5sum = Digest::MD5.hexdigest(File.read(filename))
    duplicates_md5[md5sum].push(filename)
  end
end

## prune hashes that have only one file; the rest are true duplicates
duplicates_md5.delete_if { |md5sum, files| files.length <= 1 }

## print each group of duplicate files
duplicates_md5.each do |md5sum, files|
  puts "Following files match: "
  files.each { |f| puts f }
  puts
end
exit 0
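
To try it, save the script (say, as finddups.rb; the name is just an example) and run it with a single directory argument, for example: ruby finddups.rb ~/music. It prints each group of identical files separated by a blank line.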

4 comments:

AG said...

Hey, hi! In fact I got the same topic for my '3-day project': identifying duplicates on Unix and Windows systems based on content matching, lol.

I wish I had seen this blog earlier... but we did it more simply :P

The code on your blog is a bit hi-fi... I guess you could have easily done the job with simpler commands like 'cmp' or 'diff'. The real problem comes when it starts comparing system files (the /bin, /dev types); doing so corrupts the I/O interface files and hence the keyboard goes off. Any suggestions to fix that bug??

Well, you can extend your work: after finding the duplicates, you can convert the duplicates into soft links, thereby saving a lot of space ;) ... or simply delete them with 'rm -i' (a rough sketch of the soft-link idea follows below).

Cheers
Ankit
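
A rough sketch of that soft-link idea in Ruby, assuming the duplicates_md5 hash built by the script above (keeping the first file of each group is just one possible policy):

require 'fileutils'

## for each group of identical files, keep the first one and replace the
## remaining copies with symbolic links pointing at the kept file
duplicates_md5.each do |md5sum, files|
  keep = File.expand_path(files.first)
  files[1..-1].each do |dup|
    FileUtils.rm(dup)          ## remove the duplicate copy
    FileUtils.ln_s(keep, dup)  ## create a soft link in its place
  end
end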

AG said...

And guess who was our Project Manager... take a guess
..
..
..
..
..
Lord KP ;)

Kazim Zaidi said...

Hey Ankit,
There are clear advantages to the md5sum code over using cmp and diff for the job.

First, generating an MD5 hash and comparing the sums is faster than pairwise comparison of the files' entire contents.

Second, it's not necessary to compare every file with every other file. Ruby's strong grouping capabilities on Hashes make the job simpler (and the code cleaner); a small sketch of the grouping idea follows below.

Cheers!
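
A minimal sketch of that grouping idea, assuming a plain scan of the current directory (the path here is only an example):

require 'digest/md5'

## group the files by the MD5 hash of their contents, so each file is
## read once and no pairwise comparison is ever needed
files = Dir["./**/*"].select { |f| File.file?(f) }
groups = files.group_by { |f| Digest::MD5.hexdigest(File.read(f)) }

## any group with more than one member is a set of duplicates
groups.each do |md5sum, names|
  puts names.join("\n") + "\n" if names.length > 1
end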

Unknown said...

In this situation I used DuplicateFilesDeleter effectively. This software frees up a lot of space by deleting files that exist in multiple locations.