A while back, someone asked how to clean up disk space and remove duplicate files from a system. I’m going to answer that question, but first I have to explain a little bit about how operating systems behave with disk drives.

The operating system and the disk

[Figure: how a disk drive is set up]

Each file on disk has at least two parts. The first part is the metadata, which Unix-like systems keep largely in a structure called an inode. This part holds information about the file:

  • What’s the name of the file?
  • Who created the file?
  • When was the file created?
  • When was the file last changed?
  • When was the file last accessed? (Someone could have looked at it without changing it.)
  • How big is the file?
  • Where is the file located on the disk?
  • Who is allowed to read the file? Who is allowed to write to the file? Who is allowed to run the file, if it is a program?
  • What type of file is it? (Is it an image, audio, text, a PowerPoint, …)
  • Did the original file come with a checksum or hash?

The second part of the file contains the data. If the file is really small, it might only require one block of data. Bigger files require more blocks of data.

All of that metadata is stored in a catalog, or a directory, at the beginning of the disk. The actual content of your files is spread over the disk.
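To make the metadata idea concrete, here is a minimal Python sketch that asks the operating system for some of the fields listed above. The function name describe_file and the sample file notes.txt are made up for illustration. Note that the creation time is left out on purpose, because what the system reports for it differs between Windows and Unix-like systems.

```python
import os
import stat
import tempfile
from datetime import datetime


def describe_file(path):
    """Return a dict with some of the metadata the OS keeps for a file."""
    info = os.stat(path)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified": datetime.fromtimestamp(info.st_mtime),
        "accessed": datetime.fromtimestamp(info.st_atime),
        # Human-readable permission string, in the style of "ls -l".
        "permissions": stat.filemode(info.st_mode),
        "is_regular_file": stat.S_ISREG(info.st_mode),
    }


# Demo on a throwaway file so nothing on your disk is touched.
with tempfile.TemporaryDirectory() as folder:
    sample = os.path.join(folder, "notes.txt")
    with open(sample, "w") as handle:
        handle.write("hello")
    for field, value in describe_file(sample).items():
        print(f"{field}: {value}")
```

None of these calls read the file's data blocks; everything comes from the metadata.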

What happens when you read a text file?

When you instruct your computer to open a file, the operating system looks at the disk catalog and finds the file name. (Depending on your caching setup, the operating system might need to read the catalog from the disk.) It then checks to see if you have permission to open the file. If you do, the system looks for the pointer to where the file is located, travels across the disk to that location, reads the data, and sends it back to you.

What happens when you delete a file?

When you delete a file, all the operating system does is delete the metadata. The actual data, the red blocks in the diagram above, is still on the disk, at least until it is overwritten. However, the system no longer keeps track of where those data blocks are, so it considers the file deleted and reuses the blocks as time goes on. If you really want to erase your files, the data itself has to be overwritten; a quick format, for instance, only rebuilds the catalog and leaves the old blocks in place. This is also why undelete sometimes works: the metadata is gone, but the content is not. Lifehacker has an article on free software that helps you recover lost files.
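Because a plain delete only drops the metadata, one way to make old content harder to recover is to overwrite it before deleting. The sketch below is only an illustration of that idea, with made-up function names (overwrite_with_zeros, erase_file). It assumes a traditional hard disk and file system that write the zeros over the same blocks; on SSDs and on copy-on-write or journaling file systems, old copies of the data may survive, so do not treat this as a secure-erase tool.

```python
import os
import tempfile


def overwrite_with_zeros(path, chunk_size=1024 * 1024):
    """Replace every byte of the file with a zero byte, keeping its size."""
    remaining = os.path.getsize(path)
    with open(path, "r+b") as handle:
        while remaining > 0:
            step = min(chunk_size, remaining)
            handle.write(b"\0" * step)
            remaining -= step
        handle.flush()
        os.fsync(handle.fileno())  # ask the OS to push the zeros to the disk


def erase_file(path):
    """Overwrite the content first, then remove the directory entry."""
    overwrite_with_zeros(path)
    os.remove(path)  # on its own, this step only drops the metadata


# Demo on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    victim = os.path.join(folder, "old-letter.txt")
    with open(victim, "wb") as handle:
        handle.write(b"something private")
    erase_file(victim)
    print("still listed?", os.path.exists(victim))
```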

Get rid of duplicate files

So now that you understand a bit about metadata, I can explain how the different tools work to find duplicate files.

The only 100% sure way to tell if files are the same is to compare the content of the files, and that would take a long time. So, programs make use of the metadata instead. What do you think? If two files have a similar file name, the exact same size, and the exact same creation timestamp, what are the chances that the files are the same? They’re pretty good. If you add a checksum or hash and the modified date to those checks, you can be even more sure the files are the same. If you’re really in doubt, you can compare the data blocks.
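The cheap-checks-first idea can be sketched in a few lines of Python. This is not how any particular duplicate finder is implemented; it is a minimal illustration that groups files by size, then hashes only the files whose size matches another file. The function names and the choice of SHA-256 are assumptions made for the example.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_hash(path, chunk_size=65536):
    """Hash a file's content in chunks so large files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder):
    """Return lists of paths whose size and SHA-256 hash both match."""
    by_size = defaultdict(list)
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.path.isfile(path) and not os.path.islink(path):
                by_size[os.path.getsize(path)].append(path)

    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) < 2:
            continue  # a file with a unique size cannot have a duplicate
        for path in paths:
            by_hash[(size, file_hash(path))].append(path)

    return [sorted(group) for group in by_hash.values() if len(group) > 1]


# Demo: two identical files and one that only shares their size.
with tempfile.TemporaryDirectory() as folder:
    for name, data in [("a.txt", b"cat"), ("b.txt", b"cat"), ("c.txt", b"dog")]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(data)
    for group in find_duplicates(folder):
        print([os.path.basename(path) for path in group])
```

If a matching hash still leaves you in doubt, Python's filecmp.cmp(first, second, shallow=False) compares two files byte by byte.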

The Internet boasts several free programs that look for duplicate files. I’m going to show you Duplicate Cleaner Free 3.0, but remember that all of these programs basically work the same.

Step 1: Download and install Duplicate Cleaner Free 3.0 or higher.

Step 2: Launch the program.

Step 3: On the Search Criteria tab, set your search criteria. Refer to the figure below.

[Figure: the Search Criteria tab]

Red Circle: Decide whether you’re going to search for audio file duplicates or not. For this example, I’ve selected regular mode. Next, select same content or ignore content. Selecting same content will cause the program to compare data blocks. Ignore content will let the program rely on the creation date, file size, and other information.

Blue Circle: What type of files do you want to search? Everything on the disk or just your pictures? Set up your filters here. Also, you can tell the system to ignore tiny files or large files and files with certain dates.

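Green Circle: You can probably just use the defaults for this section. Zero size files are files that have metadata in the catalog but don’t use data blocks. While they aren’t using your disk space, they’re still using catalog space. System files and folders are the ones that belong to the operating system, such as Windows itself. NTFS is the NT File System, the file system Windows normally uses. Hard links are extra names in the catalog that point to the same metadata and the same data blocks, so a hard-linked file looks like a duplicate without taking up extra space.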
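Zero size files and hard links can both be spotted from metadata alone. The following Python sketch shows one way to do it, under the assumption that the file system reports a link count and a file identifier (NTFS and the usual Linux and macOS file systems do). The function names are made up for the example.

```python
import os
import tempfile
from collections import defaultdict


def zero_size_files(folder):
    """Return files that have a catalog entry but no content."""
    empty = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            if os.lstat(path).st_size == 0:
                empty.append(path)
    return sorted(empty)


def hard_link_groups(folder):
    """Group paths that are different names for the same file on disk."""
    by_identity = defaultdict(list)
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            info = os.lstat(path)
            if info.st_nlink > 1:  # more than one name points at this file
                by_identity[(info.st_dev, info.st_ino)].append(path)
    return [sorted(group) for group in by_identity.values() if len(group) > 1]


# Demo: one real file, a hard link to it, and an empty file.
with tempfile.TemporaryDirectory() as folder:
    original = os.path.join(folder, "photo.jpg")
    with open(original, "wb") as handle:
        handle.write(b"pretend image data")
    os.link(original, os.path.join(folder, "photo-link.jpg"))
    open(os.path.join(folder, "empty.txt"), "w").close()
    print("zero size:", zero_size_files(folder))
    print("hard links:", hard_link_groups(folder))
```

Deleting one name from a hard-link group frees no disk space, because the other name still points at the same data blocks.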

[Figure: the Scan Location tab]

Step 4: Set the Scan Location in the second tab.

If you don’t want to search your entire disk, this is where you can narrow the search. In my example, I’m only looking for duplicate photos, so I limit my search to the “Pictures” directory.
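Narrowing a scan to one folder and one kind of file amounts to walking that folder and applying a filter. Here is a minimal Python sketch of the idea. The extension list, the min_size parameter, and the function name candidate_files are assumptions made for the example, not settings taken from Duplicate Cleaner.

```python
import os
import tempfile

# Assumed list of picture extensions; adjust it to the files you care about.
PICTURE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif"}


def candidate_files(folder, extensions=PICTURE_EXTENSIONS, min_size=1):
    """Return files under folder with a wanted extension and minimum size."""
    matches = []
    for root, _dirs, names in os.walk(folder):
        for name in names:
            path = os.path.join(root, name)
            extension = os.path.splitext(name)[1].lower()
            if extension in extensions and os.path.getsize(path) >= min_size:
                matches.append(path)
    return sorted(matches)


# Demo: only the non-empty picture should be selected.
with tempfile.TemporaryDirectory() as folder:
    for name, data in [("beach.JPG", b"pixels"), ("notes.txt", b"text"),
                       ("blank.png", b"")]:
        with open(os.path.join(folder, name), "wb") as handle:
            handle.write(data)
    print([os.path.basename(path) for path in candidate_files(folder)])
```

The smaller the set of candidates, the less hashing or content comparison a scan has to do.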

[Figure: the Scan Now button]

Step 5: Hit the Scan Now button.

Step 6: Once the program is done scanning, click on the duplicate files tab and look through your duplicate files list.

Files that are duplicates of each other share the same shading.

[Figure: selecting files to remove]

Select the files you want to delete. Remember to leave at least one file in each group unchecked!

Click the file removal button.

Decide if you want to delete the files straight away, or if you want to move them all to one location to deal with later.

Then, close the dialog box and you are done.
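The "move them to one location and deal with them later" option is the safer of the two, and the same idea is easy to sketch in Python. The function below is a hypothetical illustration, not part of Duplicate Cleaner: it takes groups of duplicate paths, keeps the first file of each group where it is, and moves the rest into a holding folder.

```python
import os
import shutil
import tempfile


def set_aside_duplicates(groups, holding_folder):
    """Move all but the first file of each group into holding_folder."""
    os.makedirs(holding_folder, exist_ok=True)
    moved = []
    for group in groups:
        for path in sorted(group)[1:]:  # the first file in each group stays put
            target = os.path.join(holding_folder, os.path.basename(path))
            base, extension = os.path.splitext(target)
            counter = 1
            while os.path.exists(target):  # avoid overwriting a moved file
                target = f"{base}_{counter}{extension}"
                counter += 1
            shutil.move(path, target)
            moved.append(target)
    return moved


# Demo: three copies of one photo in different folders.
with tempfile.TemporaryDirectory() as folder:
    copies = []
    for sub in ("2019", "backup", "phone"):
        os.makedirs(os.path.join(folder, sub))
        path = os.path.join(folder, sub, "photo.jpg")
        with open(path, "wb") as handle:
            handle.write(b"pretend image data")
        copies.append(path)
    moved = set_aside_duplicates([copies], os.path.join(folder, "holding"))
    print("kept:", os.path.relpath(sorted(copies)[0], folder))
    print("moved:", [os.path.basename(path) for path in moved])
```

Nothing is deleted here, so a mistake can be undone by moving a file back.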
