Copyfind

Copyfind is an open-source, Windows-based program that compares documents and reports similarities in their words and phrases. It is free and available to anyone. It is licensed under the GNU General Public License, which basically means that you can do whatever you like with it except try to sell it to someone else.

Download Copyfind 4.1.5 Executable
Download Copyfind64.4.1.5 64-Bit Executable

Unlike most modern software packages, Copyfind is a single executable file. You don’t install it, you just run it. Simply click on the link to download the executable file. If you’re running a 64-bit version of Windows, you can select the 64-bit executable, which runs about 10-20% faster than the 32-bit version. Place that file in a convenient location and execute it from the command line.

I haven’t yet written instructions for using Copyfind. Instead, I have posted an example and a commented script to feed to it at:

Sample Script

The script uses my collection of Shakespeare Sonnets, which you can obtain as a zip file at:

Sonnets

To try out Copyfind, the script, and the Sonnets, open the zip file of Sonnets and copy them into a new folder (Copyfind can’t read the zip file; it needs the Sonnets unpacked). Then put Copyfind.4.1.5.exe (or Copyfind64.4.1.5.exe) and script.txt in the same folder. Edit the script.txt file so that the folders are all correct (at present, they include “Louis Bloomfield” in the path names, which obviously won’t work for you).
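If you would rather script the unpacking step than do it by hand, here is a minimal Python sketch. The archive name Sonnets.zip and the folder name sonnets_work are placeholders assumed for illustration; substitute whatever the downloaded file is actually called and wherever you want the Sonnets to go.

```python
import zipfile
from pathlib import Path


def unpack_archive(zip_path, dest_folder):
    """Extract every file in zip_path into dest_folder and return the file paths."""
    dest = Path(dest_folder)
    dest.mkdir(parents=True, exist_ok=True)  # create the new folder if it is missing
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
    # List what was unpacked so you can confirm the documents are really there.
    return sorted(p for p in dest.rglob("*") if p.is_file())


if __name__ == "__main__":
    # Hypothetical names: adjust them to match your download and your folder.
    archive_path = Path("Sonnets.zip")
    if archive_path.exists():
        for path in unpack_archive(archive_path, "sonnets_work"):
            print(path)
    else:
        print(f"{archive_path} not found; download the Sonnets zip file first.")
```

The returned list is only a convenience for checking that the files landed where the script.txt paths expect them.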

When you have the script.txt file spruced up, open a Command Prompt window or Windows PowerShell and “cd” to the folder containing Copyfind and the script. Then execute:

Copyfind.4.1.5.exe < script.txt

or

Copyfind64.4.1.5.exe < script.txt

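Copyfind should run and should read from script.txt. It ought to compare the Sonnets in two different ways and generate a report. Note that the “<” redirection works in the Command Prompt but not in PowerShell, which rejects it. From PowerShell, one workaround is to hand the whole line to the Command Prompt: cmd /c "Copyfind.4.1.5.exe < script.txt"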
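The same "feed a script on standard input" step can also be done from a program, which sidesteps any differences between shells. The following is a minimal Python sketch, not part of Copyfind itself; the function name run_with_script is made up for illustration, and it assumes only what is described above, namely that Copyfind reads its commands from standard input.

```python
import subprocess


def run_with_script(command, script_path):
    """Run command with the contents of script_path connected to its standard input."""
    with open(script_path, "rb") as script:
        # stdin=script plays the role of "< script.txt" on the command line.
        return subprocess.run(command, stdin=script, capture_output=True)


# Hypothetical usage, run from the folder that holds Copyfind and script.txt:
# result = run_with_script(["Copyfind.4.1.5.exe"], "script.txt")
# print(result.returncode)
# print(result.stdout.decode(errors="replace"))
```

The command is passed as a list so that a path containing spaces does not need extra quoting.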

Once you’ve got it working, you can start tinkering with different scripts. You can load documents individually or in folders into groups 1, 2, 3, … and then compare those groups against one another or internally. When you’re done with a collection of documents, run the “Done” command and you can begin again fresh. Each time you start fresh, you can specify a different reporting folder. You should be able to automate comparisons to run for hours or days without your intervention. You can either use one giant script file and feed it to Copyfind by hand, or you can write a program that calls Copyfind many times and feeds it a different script each time.
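As a rough illustration of the second approach, a program that calls Copyfind many times with a different script each time, here is a minimal Python sketch. The function name run_batch, the "scripts" folder, and the "logs" folder are all assumptions made for the example. It also assumes nothing about what Copyfind prints or which exit codes it returns; it simply records both for later review.

```python
import subprocess
from pathlib import Path


def run_batch(command, script_paths, log_folder):
    """Run command once per script, feeding each script on standard input.

    Returns a list of (script name, exit code) pairs. The output of each run
    is saved to its own log file so an unattended batch can be reviewed later.
    """
    logs = Path(log_folder)
    logs.mkdir(parents=True, exist_ok=True)
    results = []
    for script_path in map(Path, script_paths):
        log_path = logs / (script_path.stem + ".log")
        with open(script_path, "rb") as script, open(log_path, "wb") as log:
            finished = subprocess.run(
                command, stdin=script, stdout=log, stderr=subprocess.STDOUT
            )
        results.append((script_path.name, finished.returncode))
    return results


# Hypothetical usage: every *.txt file in a "scripts" folder is a Copyfind script.
# scripts = sorted(Path("scripts").glob("*.txt"))
# for name, code in run_batch(["Copyfind64.4.1.5.exe"], scripts, "logs"):
#     print(name, code)
```

Running the scripts one after another keeps the example simple; each script can still name its own reporting folder, as described above.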

Download Copyfind 4.1.5 Source

As open source software, you’re welcome to tinker with Copyfind to add features or make it behave differently.

8 thoughts on “Copyfind”

  1. Thanks for this wonderful programme! Have just begun my first ‘big run’ of documents (50000) for my newspaper reprint project. It’s a few days in now and says it’s up to 750000000 of -650000000. It occurs to me that this ‘flipping’ over into the negative is because the int variable you are using is signed. Just wondering (I can’t tell from your code): will this negatively affect the results? Overwriting entries or some such thing? Thanks! -MHB

    1. What an interesting question! I don’t think that the counting process has any connection with the comparison process, so it should continue without trouble. I’ve got my fingers crossed. In any case, I should edit the code to allow for 64-bit signed integers and check for any other situations in which enormous values could cause trouble.

      Lou

    2. I’ve taken a look at the code and it looks like it’s purely a reporting problem. The comparison count and the total comparison value are both stored in C++ variables of type “int”, and you have overflowed them. In the next update, they’ll both be C++ type “long long” (64-bit integers), so it will be a real challenge to overflow them.

      Lou

  2. Hi,

    I’m trying to compare an essay to a number of excerpts from different sources. It seems that once a phrase has been matched, it won’t match again. In one of the sources there is the phrase “mixed-sex schooling”; Copyfind has underlined one example of this phrase in the essay, but not the second instance, where the exact same phrase was used again. Can you please help? This is a small problem in an otherwise invaluable programme.

    1. At present, the software marks each matching phrase and doesn’t consider that same copy of the phrase again. I mean to develop the software to allow it to find internal redundancies, but I just haven’t had time. I’m under so much pressure to publish or perish that I have had to let this work sit. Modern higher education is a strange thing, but I can’t fix it and just have to try to survive.

      Lou

      1. Hi Lou, thanks for the reply – I feel your pressure 🙂 No worries, it’s a small problem that I can easily work around. Thanks once again for your excellent software – there is nothing else like it!
