dupedit

description


Dupedit is a program for finding duplicate files. Give it files and folders, and dupedit will find all sets of exactly identical files. It cares only about the content of files (not filenames). Here is what makes dupedit special:

  1. Dupedit distinguishes physical copies (duplicates) from «false copies». In dupedit terminology, false copies are multiple names for the same file; deleting a false copy may lead to data loss. For each physical copy, dupedit groups its false copies. Internally, each group of false copies is treated as one file, which avoids the file self-comparison hazard.
  2. Dupedit compares any number of files at once without hashing. The algorithm may be described as «compare as you go». Each file is read only once, no matter how many duplicates you have or how big they are.
  3. As the name implies, a future goal is to become the ultimate tool for editing duplicates, where editing means managing.

no hashing


This is the strength of dupedit.

  1. Dupedit is never wrong about which files are identical. It compares the actual content, so a hash collision cannot produce a false match.
  2. Dupedit is very fast at elimination. Instead of reading whole files to hash them, it compares as it goes and skips the rest of any file that no longer looks like an exact copy of another. As in other deduplication programs, only files of equal size are worth comparing.
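
To make «compare as you go» concrete, here is a minimal sketch of the idea, not dupedit's actual code. It assumes the caller has already selected files of equal size; the function name `compare_as_you_go`, the block size, and the file limit are made up for illustration. All candidates are read block by block in lockstep, each candidate group is split whenever the blocks differ, and a file left without a candidate twin is not read any further.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define BLOCK 4096      /* bytes read per file per round (arbitrary) */
#define MAX_FILES 16    /* arbitrary limit to keep the sketch simple */

/* Partition n equal-size files by content. On return, group[i] is the lowest
 * index of a file identical to file i (group[i] == i if there is no earlier
 * twin). Hypothetical helper for illustration, not dupedit's source. */
void compare_as_you_go(FILE *files[], int n, int group[])
{
    static unsigned char buf[MAX_FILES][BLOCK];
    size_t len[MAX_FILES];
    int active[MAX_FILES], old[MAX_FILES];

    assert(n > 0 && n <= MAX_FILES);
    for (int i = 0; i < n; i++)
        group[i] = 0;               /* equal size: one candidate group at first */

    for (;;) {
        int progress = 0;

        /* Read the next block only from files that still have a candidate twin. */
        for (int i = 0; i < n; i++) {
            active[i] = 0;
            for (int j = 0; j < n; j++)
                if (j != i && group[j] == group[i])
                    active[i] = 1;
            if (active[i]) {
                len[i] = fread(buf[i], 1, BLOCK, files[i]);
                if (len[i] > 0)
                    progress = 1;
            }
        }
        if (!progress)
            break;                  /* all remaining candidates hit end of file */

        /* Split every candidate group by the content of the block just read. */
        memcpy(old, group, (size_t)n * sizeof old[0]);
        for (int i = 0; i < n; i++) {
            if (!active[i])
                continue;
            group[i] = i;
            for (int j = 0; j < i; j++) {
                if (active[j] && old[j] == old[i] && len[j] == len[i]
                    && memcmp(buf[j], buf[i], len[i]) == 0) {
                    group[i] = group[j];
                    break;
                }
            }
        }
    }
}
```

Files that end up sharing a group number are byte-for-byte identical. The pairwise block comparison inside a group is the simplest thing that works for a sketch; a real implementation would want something smarter for large groups.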

hard facts


developer: Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)
programming language: C
operating systems: tested on Linux; old builds exist for Windows. In order to port current versions to Windows, I need to find the Windows equivalents of readdir(), stat(), pthreads, and other POSIX functionality.
license: time-limited LGPL3
languages (localized at compile time): en.ascii, nb-NO.UTF-8
browse the source code: dupedit
download current stable version: dupedit-5.4.tar.bz2
forum: cli-apps.org

user interface


commandline noninteractive


All versions support this mode of operation. Only versions from 5.0 onwards are at all user-friendly; before that, I used to call the program «fcmp».

[me@pc:~/dupedit/testdir] touch empty0 empty1
[me@pc:~/dupedit/testdir] echo -n "hello" > dup0
[me@pc:~/dupedit/testdir] echo -n "world" > unik
[me@pc:~/dupedit/testdir] cp dup0 dup1; cp dup0 dup2
[me@pc:~/dupedit/testdir] dupedit .

-- #0 -- 0 B --
0.0     empty1
0.1     empty0

-- #1 -- 5 B --
1.0     dup0
1.1     dup1
1.2     dup2

In total: 5 identical files. Deduplication can save 10 bytes.
[me@pc:~/dupedit/testdir] #Let's create false copies
[me@pc:~/dupedit/testdir] ln dup0 hardlink0
[me@pc:~/dupedit/testdir] ln -s dup1 symlink1
[me@pc:~/dupedit/testdir] sudo mount --bind . ../mnt
[me@pc:~/dupedit/testdir] dupedit dup? *link? ../testdir/dup0 ../mnt/dup0

-- #0 -- 5 B --
0.0.0   ../mnt/dup0
0.0.1   ../testdir/dup0
0.0.2   hardlink0
0.0.3   dup0
0.1.0   symlink1
0.1.1   dup1
0.2     dup2

In total: 7 identical files. Deduplication can save 10 bytes.

Each file line starts with 2 or 3 dot-separated hexadecimal numbers, followed by a tab and the file's path. Together, the numbers uniquely identify the file within the 2 levels of grouping. Other lines (not beginning with a hexadecimal number) are comments. Files with a common first number are identical: they contain the same data. Of those, files with a common second number are physically the same file: if you changed one, all would change. The third number is there only to enumerate the members of that second subgroup, and is omitted when it is unnecessary.
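
As an illustration of how a script might read this format, here is a small parser sketch for one output line. It follows only the description above (2 or 3 dot-separated hexadecimal numbers, a tab, then the path); the names `struct entry`, `parse_hex` and `parse_line` are invented for this example and are not part of dupedit.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* One parsed file line. Field names are made up for this sketch. */
struct entry {
    unsigned long content;  /* 1st number: same value means same data */
    unsigned long file;     /* 2nd number: same value means same physical file */
    unsigned long name;     /* 3rd number, meaningful only if has_name != 0 */
    int has_name;
    const char *path;       /* points into the input line, just after the tab */
};

/* Parse a run of hex digits; return a pointer past it, or NULL if none. */
static const char *parse_hex(const char *s, unsigned long *out)
{
    unsigned long v = 0;

    if (!isxdigit((unsigned char)*s))
        return NULL;
    for (; isxdigit((unsigned char)*s); s++) {
        int c = tolower((unsigned char)*s);
        v = v * 16 + (unsigned long)(isdigit(c) ? c - '0' : c - 'a' + 10);
    }
    *out = v;
    return s;
}

/* Return 1 and fill *e if line is a file line, 0 if it is a comment. */
int parse_line(const char *line, struct entry *e)
{
    unsigned long num[3] = {0, 0, 0};
    int n = 0;
    const char *s = line;

    assert(line != NULL && e != NULL);
    for (;;) {
        s = parse_hex(s, &num[n]);
        if (s == NULL)
            return 0;
        n++;
        if (*s != '.')
            break;
        if (n == 3)
            return 0;       /* more than 3 numbers: not the described format */
        s++;
    }
    if (n < 2 || *s != '\t')
        return 0;           /* a comment that merely starts with a hex digit */
    e->content = num[0];
    e->file = num[1];
    e->name = num[2];
    e->has_name = (n == 3);
    e->path = s + 1;
    return 1;
}
```

The path is returned as a pointer into the input line, so a caller reading with `fgets()` would still need to strip the trailing newline.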

commandline --interactive


Version 6 is scheduled for release this summer. Instead of just outputting and forgetting results, this totally different beast is made with interactivity in mind. When presenting each set of duplicates, a primitive shell lets you do whatever you want with them. It works by the same principles, but you won't find many source lines in common with version 5. That is thanks to a never-ending supply of new ideas while studying real-time programming at university, some experiments, and the boring re-implementation that followed (the file reading subsystem was rewritten 4 times). Although it is not finished yet, I managed to compile it on Tuesday. After I fixed 2 elementary mistakes, it worked by a miracle, kind of. Some subsystems just segfault; others are not written yet. I have 4 exams in the following 4 weeks, so be patient.

Qt GUI


This will have to wait for version 6. Below are screenshots of my 2 attempts at creating a graphical user interface for version 5.4. As you can see, they are localized in Norwegian. Only the first became usable. Sadly, I was too lazy to fix the last bug, so I never released it. I'm a n00b at GUI design.
[Screenshot: dupedit with Qt graphical user interface (initial attempt)]
[Screenshot: dupedit with Qt graphical user interface (with tabs)]

editor


In the future. The idea is that dupedit saves its output to a file and opens it with your editor of choice, so you can delete the lines corresponding to files you want deleted (maybe also giving some fancy commands if you want to symlink or hardlink files instead). When you save, dupedit finds and acts on your changes. For the impatient, modifying something like vidir to operate on output from dupedit should be straightforward. The format of dupedit's output was created with this in mind.

automatic user-defined policies and actions


This is already possible. Let's say you pipe dupedit's output into a script of yours. Dupedit itself changes nothing, but your script may, for example, automatically delete all except one file in each group of duplicates. To select the surviving file, you may simply define an ordered list of preferred directories. To top it off, let the surviving file inherit the longest basename and the oldest creation/modification times from the group of duplicates. The real point is: you have endless possibilities to define policies and actions in any programming language.
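
The «ordered list of preferred directories» policy could look roughly like the sketch below. This is only one possible policy and is not shipped with dupedit: the function `choose_survivor` and its parameters are hypothetical, directory names are assumed to be given without a trailing slash, and the caller is assumed to pass one path per physical copy, since deleting false copies of the survivor would delete the survivor too.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Pick which duplicate to keep. dirs[] is ordered from most to least
 * preferred; the first path found under the best-ranked directory wins.
 * Returns an index into paths[], or 0 if no path is in a preferred
 * directory. Hypothetical policy function for illustration only. */
size_t choose_survivor(const char *const paths[], size_t npaths,
                       const char *const dirs[], size_t ndirs)
{
    assert(npaths > 0);
    for (size_t d = 0; d < ndirs; d++) {
        size_t len = strlen(dirs[d]);

        for (size_t p = 0; p < npaths; p++) {
            /* Match "dir/" as a whole path prefix, so "photos" does not
             * accidentally match "photos2/...". */
            if (strncmp(paths[p], dirs[d], len) == 0 && paths[p][len] == '/')
                return p;
        }
    }
    return 0;
}
```

A script built on this would collect the paths of one duplicate group from dupedit's output, call the function, and act on every path except the returned one.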

false copies — the «file self-comparison hazard»


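There are several ways multiple paths can reference the same file, as the example session above shows: hard links, symbolic links, bind mounts, and different spellings of the same path (dup0 and ../testdir/dup0).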

Comparing a file against itself is a serious hazard: it can only result in falsely reporting the file as a copy of itself, which users easily believe when the file is presented under its multiple filenames. A file on a UNIX filesystem has only one inode number, and a filesystem has only one device number (even in multiple simultaneous mounts of the filesystem*). Dupedit considers files that share both inode and device numbers to be «false copies» and treats them as one file.

*beware of network filesystems


Network filesystems do not inherit the device ID of the source filesystem. More annoyingly, (at least on sshfs) inode numbers do not reflect hardlink relationships. Therefore, you cannot trust dupedit's ability to distinguish false from physical copies if some of them are on a network filesystem, which means you must consciously do it yourself.**

**you're doing it wrong anyway


Running dupedit or any non-distributed deduplication algorithm on a network mount is a total waste of network bandwidth compared to running it locally where the files are. To compare files between computers, have each computer hash its files locally and send only the checksums over the net. The nearest thing to such a distributed deduplication program that I know of would be rsync, which is not a deduplication program.

Wrong: Jamming your network, slowing down dupedit, exposing yourself to the «file self-comparison hazard».

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] dupedit mnt

Wrong: In case you thought checksumming (like fdupes does) magically saves any bandwidth, think again. It is hazardous too; read the source or ask Adriano Lopez.

[me@pc:~] mkdir mnt && sshfs user@example.com:/home/user mnt
[me@pc:~] fdupes mnt

Correct: Run dupedit on the machine where the files are.

[me@pc:~] ssh user@example.com
[user@server:/home/user] dupedit .

Only if you are hash-collision paranoid and the checksums match, might you have an excuse for running a deduplication program on a network mount.

licensing


As usual, source code is the place to express feelings. Code snippet:

/* dupedit 6.0 pre-alfa
 *
 * Copyright 2009 Andreas Nordal (ɯoɔ˙lıɐɯƃ@lɐpɹou˙sɐǝɹpuɐ)
 *
 * This release of dupedit becomes public domain when the year 2030
 * begins. Until then, redistribution and modification must be done
 * according to the GNU Lesser General Public License as published
 * by the Free Software Foundation, either version 3, or (at your
 * option) any later version. By doing so, you may postpone the
 * transition from LGPL to public domain, but not advance it.
 *
 * This program is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU Lesser General Public License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with this program.  If not, see <http://www.gnu.org/licenses/>
 */

/* Freedom is more than having a fixed set of options. (Richard Stallman)
 * My conclusion:
 * The 4 freedoms of GPL are insufficient; freedom implies no license.
 * Solution:
 * Make the license last as long as you need or care, just not forever.
 */