Purpose:
WCopyfind compares text or word processor documents with one another to determine if they share words in phrases.
WCopyfind reads .DOCX, .TXT, and .HTML files natively, and it does a pretty good job of reading .PDF files, as long as they contain text content rather than pure image content. WCopyfind can extract text from .DOC files, but without much sophistication or finesse.
Overview:
- Select the documents by browsing for them or dragging them from Windows Explorer into WCopyfind.
- Adjust the comparison parameters.
- Select or create a folder in which all the report files will be placed.
- Run the comparison process.
- Examine the results.
Step-by-step Instructions:
Step 1: Start WCopyfind.
- Download or otherwise locate WCopyfind.4.1.5.exe (or WCopyfind64.4.1.5.exe for 64-bit computers) and click on its icon.
Step 2: Choose Documents to Compare
Two possible paths:
- Right-click on the old or new document list.
- Select “Browse for Documents” in the popup menu.
- Search for and select the documents you want to include in the comparison.
- Press “Open” to add these documents to the document list.
- Repeat 1 through 4 as needed [Note: Steps 1 & 2 can be completed by simply double-clicking on the list.]
Or:
- Start Windows Explorer.
- In Explorer, select one or more documents that you wish to compare. See Notes for discussion of Web-resident documents.
- Drag-and-drop those documents into WCopyfind’s document windows.
- Repeat 2 & 3 as needed.
[Additional Features: You can save or load a particular list of documents via the popup menu (right-click on the list). You can delete selected documents or all of the documents in a list via the popup menu, or selected documents by pressing “delete”. Automatic sorting of the documents in a list as they are added can be turned off from the popup menu.]
Step 3: Adjust Comparison Rule Parameters
- Shortest Phrase to Match (Range: 1 to infinite). This number is the minimum phrase length, in words, that WCopyfind will consider to be a match. For example, when this parameter is set to 6, WCopyfind will ignore matching phrases that are 5 words long or shorter. I recommend leaving this parameter at 6 (words).
- Fewest Matches to Report (Range: 1 to infinite). This number is the fewest matching words in a pair of documents that will cause WCopyfind to report a document match in its “Compare Documents” window and generate a pair of underlined comparison documents in the Report Files Folder. There is no recommended value for this parameter.
- Most Imperfections to Allow (Range: 0 to 9). This number is the maximum number of non-matches that WCopyfind will allow between perfectly matching portions of a phrase. For example, if this value is set to 2, then WCopyfind will bridge its way across up to two non-matching words to connect pieces of perfectly matching prose. A value of 0 will limit WCopyfind to finding only perfect matches, while a value of 1 to 9 will allow WCopyfind to find imperfectly matching phrases (matches that contain flaws). Increasing this value slows the program down. I recommend a value of 0 (if speed or absolute matching is your main requirement) or 2 (if you want to find matches despite minor editing).
- Minimum % of Matching Words (Range: 0 to 100). This number is the minimum percentage of perfectly matching words that a phrase can contain and still be considered a match. Setting this value at 100 limits WCopyfind to finding only perfect matches. I recommend a value of 100 (if speed or absolute matching is your main requirement) or 80 (if you want to find matches despite minor editing).
- Ignore All Punctuation (Checked: Yes or No). When checked, this parameter causes WCopyfind to ignore all punctuation characters when it is performing its comparisons. While punctuation will continue to appear in the reports that WCopyfind generates, it will not affect the phrase matching. The number of matches will normally increase when punctuation is ignored. I recommend against checking this box unless you really want to ignore all punctuation.
- Ignore Outer Punctuation (Checked: Yes or No). When checked, this parameter causes WCopyfind to ignore any punctuation characters that appear to the left or right of a word when it is performing its comparisons. For example, the quoted sentence: “The box, which I found, is broken.” will be treated as though it were simply: The box which I found is broken (with no final period). While this “outer punctuation” will continue to appear in the reports that WCopyfind generates, it will not affect the phrase matching. The number of matches will normally increase when outer punctuation is ignored. I recommend against checking this box if you want absolute matching, but recommend checking it if you want to find matches despite minor editing.
- Ignore Numbers (Checked: Yes or No). When checked, this parameter causes WCopyfind to ignore any number characters when it is performing its comparisons. For example, the words 8-fold and 10-fold will match if this parameter is checked. While numbers will continue to appear in the reports that WCopyfind generates, they will not affect the phrase matching. The number of matches will normally increase when numbers are ignored. I recommend against checking this box if you want absolute matching, but recommend checking it if you want to find matches despite minor editing.
- Ignore Letter Case (Checked: Yes or No). When checked, this parameter causes WCopyfind to ignore capitalization of letters when it is performing its comparisons. For example, the words Whenever and whenever will match if this parameter is checked. While capital letters will continue to appear in the reports that WCopyfind generates, they will not affect the phrase matching. The number of matches will normally increase when capitalization is ignored. I recommend against checking this box if you want absolute matching, but recommend checking it if you want to find matches despite minor editing.
- Skip Non-Words (Checked: Yes or No). When checked, this parameter causes WCopyfind to completely skip words that contain any characters other than letters, except for internal hyphens and apostrophes. The non-words will neither be used in matching nor appear in the reports that WCopyfind generates. If you check this box, I suggest also checking “Ignore Outer Punctuation,” so that words that begin or end with punctuation aren’t skipped over (including plural possessives). I recommend against checking this box if you want absolute matching, but recommend checking it if the documents you are comparing contain many non-textual items, such as filenames, URLs, and other word-processor junk.
- Skip Words Longer than _____ Characters (Checked: Yes or No, with Range: 0 to 255). When checked, this parameter causes WCopyfind to completely skip words that are longer than the number of characters you select. The too-long words will neither be used in matching nor appear in the reports that WCopyfind generates. I recommend checking this box and setting the number of characters at 20, unless your documents really do contain words longer than that. This choice will allow WCopyfind to skip over many non-textual items, such as filenames, URLs, image data, and other word-processor junk.
- Basic Characters Only (in DOC Files) (Checked: Yes or No). When checked, this parameter causes WCopyfind to limit the character set it recognizes when reading a .DOC file (old-style Microsoft Word format). WCopyfind then considers characters outside that basic collection to be non-printing characters and does not include them in the matching process. I recommend selecting this option if you are comparing .DOC files that have relatively few non-English characters in them.
- Language. Selecting the most appropriate language helps WCopyfind determine which characters are letters, punctuation, or capitals.
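To make the ignore and skip parameters concrete, here is a minimal Python sketch of how a single word might be reduced to a comparison key. It is an illustration only, not WCopyfind's actual code: the `Rules` class, the `normalize` function, the punctuation set, and the order in which the rules are applied are all assumptions made for this example.

```python
import string
from dataclasses import dataclass
from typing import Optional

# Assumed punctuation set: ASCII punctuation plus curly quotes.
PUNCT = string.punctuation + "“”‘’"

@dataclass
class Rules:
    """Hypothetical container for the Step 3 checkboxes."""
    ignore_all_punctuation: bool = False
    ignore_outer_punctuation: bool = False
    ignore_numbers: bool = False
    ignore_letter_case: bool = False
    skip_non_words: bool = False
    skip_longer_than: Optional[int] = None  # None means the box is unchecked

def is_word(key: str) -> bool:
    """Letters only, except for internal hyphens and apostrophes."""
    if not key:
        return False
    for pos, ch in enumerate(key):
        if ch.isalpha():
            continue
        if ch in "-'’" and 0 < pos < len(key) - 1:
            continue
        return False
    return True

def normalize(word: str, rules: Rules) -> Optional[str]:
    """Return the key used for matching, or None if the word is skipped."""
    if rules.skip_longer_than is not None and len(word) > rules.skip_longer_than:
        return None
    key = word
    if rules.ignore_all_punctuation:
        key = "".join(ch for ch in key if ch not in PUNCT)
    elif rules.ignore_outer_punctuation:
        key = key.strip(PUNCT)
    if rules.ignore_numbers:
        key = "".join(ch for ch in key if not ch.isdigit())
    if rules.ignore_letter_case:
        key = key.lower()
    if rules.skip_non_words and not is_word(key):
        return None
    return key

# Usage: the examples given in the parameter descriptions above.
outer = Rules(ignore_outer_punctuation=True)
sentence = "“The box, which I found, is broken.”".split()
print([normalize(w, outer) for w in sentence])
```

In this sketch the non-word test runs after outer punctuation has been stripped, which mirrors the advice above to check “Ignore Outer Punctuation” together with “Skip Non-Words.”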
Step 4: Choose Reporting Folder and Style
- Browse to locate or create the reporting folder. It must exist before you run the comparison.
- Check “Brief Report” box if you want the comparison files to contain only the matching phrases (see Note).
Step 5: Run Comparison and Examine Results
- Click “Run” — matches will be reported in the Comparison Window. (See Note for explanation of comparison line.)
- A small window will open while the comparison process is running, allowing you to abort the process before it finishes.
- When the process finishes, a browser window will open, allowing you to examine the pairs of matching files. You can click on the files individually for ease of printing, or you can click on the “side-by-side” option to display the pair of files together in adjacent panels of a new browser window.
- When you view the files side-by-side, all the matching phrases are actively linked between the two files. If you click on a matching phrase in the left file panel, the corresponding phrase in the right file will move to the top of the right panel, and vice versa.
- You can also double-click on a comparison line in WCopyfind’s internal report window to examine the two comparisons (which have been saved as .html files) in your internet browser.
[Additional Features: You can save the report list to a file via the popup menu (right-click on the report list). You can delete selected lines or all of the lines in the list via the popup menu, or selected lines by pressing “delete”. You can also reopen the browser window described in part C.]
Notes about File Formats
- WCopyfind knows how to unpack and analyze .docx files. It uses zlib to decompress each open document format file and then reads and decodes the document.xml file it contains. It handles the full Unicode character set and should work for most languages that put white-space or punctuation between words.
- WCopyfind knows how to read .html (and .htm) files. It recognizes UTF-8 characters, when they are present, and thus should be able to handle many languages.
- WCopyfind knows how to open and analyze .pdf files, although it doesn’t handle complicated characters and it doesn’t always divide words correctly. However, if you place the pdftotext executable (available online at XpdfReader in their command line tools download zip file) in the same folder as the WCopyfind.4.1.5 executable, WCopyfind will make use of pdftotext whenever it opens a .pdf file. Pdftotext is much more sophisticated and it reads .pdf files amazingly well. Since pdftotext is an open source program covered by the GNU General Public License, I include 32-bit and 64-bit executable versions of it in this zip file: pdftotext.zip. Important Note: .pdf files sometimes contain images of text rather than actual text. You can view those images in various .pdf viewers, but there is no actual text for WCopyfind to read. If you want to work with such imaged text, you’ll need to use OCR (optical character recognition) on the .pdf document and then save it as text. Adobe Acrobat can do this sort of thing.
- WCopyfind can sift through .doc files, looking for text, but it will also find internal file information such as image names and formatting instructions. There are just too many different formats of .doc file and learning how to analyze them accurately is too difficult. WCopyfind will simply do its best to read what it can.
- WCopyfind can read .txt files well because they are very straightforward. If the .txt file begins with the BOM (Byte Order Mark), WCopyfind will assume the .txt file is using the UTF-8 character set and will handle many languages. If the BOM is absent, it will assume the standard 8-bit Windows character set.
- WCopyfind can read other file formats, but without sophistication. It will simply assume the standard 8-bit Windows character set and try to find text in the file.
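The .docx and .txt notes above can be sketched in a few lines of Python. This is not WCopyfind's own reader, only an illustration of the same ideas. The function names are hypothetical, and "the standard 8-bit Windows character set" is assumed here to mean code page 1252. In a .docx archive, the main document part is stored as word/document.xml.

```python
import codecs
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def read_docx_text(data: bytes) -> str:
    """Unzip a .docx (zipfile uses zlib for deflate) and collect its text runs."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each <w:t> element holds a run of visible text.
        paragraphs.append("".join(t.text or "" for t in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)

def read_txt_text(data: bytes) -> str:
    """UTF-8 if the file starts with a BOM, otherwise assume Windows-1252."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    # Assumption: cp1252; undefined bytes are replaced rather than raising.
    return data.decode("cp1252", errors="replace")
```

A caller would pass the raw bytes of a file, for example `read_txt_text(open(path, "rb").read())`, where `path` is whatever file is being compared.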
Additional Notes:
- The comparison window and “matches.txt” both display two groupings of numbers:
  A. The first is the “Perfect Match” grouping. It shows (1) the number of perfectly matching words in phrases of at least “Shortest Phrase to Match” words, (2) the percentage of words in the Left document that are in those phrases, and (3) the percentage of words in the Right document that are in those phrases.
  B. The second is the “Overall Match” grouping. It shows (1) the number of perfectly and imperfectly matching words in phrases of at least “Shortest Phrase to Match” words in the Left document, (2) the percentage of words in the Left document that are in those phrases, (3) the number of perfectly and imperfectly matching words in phrases of at least “Shortest Phrase to Match” words in the Right document, and (4) the percentage of words in the Right document that are in those phrases.
- The “Make Vocab” button generates a long list of all of the words that appear in all of the document files. It produces an output file, which you choose after pressing the “Make Vocab” button, listing all the words and the numbers of times they appear in all of the documents. The final list is roughly in descending order of usage frequency, though it is not truly sorted. Generating this vocabulary may take a long time, particularly as the list of words encountered gets longer during the generation process. The ignore and skip parameters are active when making a vocabulary and should be selected carefully. I recommend checking “Ignore Outer Punctuation,” “Ignore Letter Case,” “Skip Non-Words,” and “Skip Words Longer than 20 Characters.”
- In the reports, perfect matches are indicated by red-underlined words, while bridging (non-matching) words are indicated by green-italicized-underlined words.
- The matching phrases are links. If you click on a matching phrase, you will be taken to the equivalent phrase in the other document of the pair.
- WCopyfind can “surf the web” by following “internet shortcuts.” If you want it to load a document from the web during the comparison process, simply create an “internet shortcut” to that web-document and drag the “internet shortcut” into one of the two input windows of WCopyfind.
  For example, if you want to include http://www.nytimes.com/ in the documents to be searched, first create an “internet shortcut” to that site somewhere on your desktop or in a folder. You can create it by using an internet browser to open http://www.nytimes.com/ and then dragging the link icon onto your desktop or into a folder.
  You can also create “internet shortcuts” using a search engine such as Google. To do this, perform a search using that search engine and then drag and drop each interesting link into a folder on your desktop. The folder will then contain “internet shortcuts” to those interesting pages.
  Once you have the “internet shortcuts,” you can drag them into WCopyfind’s document windows. You’ll see the “internet shortcut” name, followed by the extension “.url”. When you run the comparison, WCopyfind will load the document over the web and compare it as though it were a local file. You can save a folder full of “internet shortcuts” and drag them into WCopyfind whenever you want to compare local documents against them. Be aware that broken links will terminate the loading process (I should fix this problem). Also, WCopyfind reloads web pages when preparing reports that contain them, so if a page changes between the load for comparison and the load for reporting (an unlikely event, except with news services, etc.), the report may be scrambled.
- If you select “Brief Report,” WCopyfind will abbreviate its report files so that they contain only the matching phrases, in the order in which they appear in the document. Each of these phrases will be followed by a line break. In effect, the “Brief Report” option simply suppresses the inclusion of non-matching text in the reports and also inserts a line break (actually a new paragraph) at the end of each matching phrase.
- WCopyfind may report an error number if something goes wrong during processing. Those error numbers are:
1 CANNOT OPEN FILE
2 CANNOT ALLOCATE WORKING HASH ARRAY
3 CANNOT ALLOCATE HASH ARRAY
4 CANNOT ALLOCATE SORTED HASH ARRAY
5 CANNOT ALLOCATE SORTED NUMBER ARRAY
6 CANNOT OPEN LOG FILE
7 CANNOT OPEN COMPARISON REPORT TXT FILE
8 CANNOT OPEN COMPARISON REPORT HTML FILE
9 CANNOT ALLOCATE LEFT MATCH MARKERS
10 CANNOT ALLOCATE RIGHT MATCH MARKERS
11 CANNOT ALLOCATE LEFTA MATCH MARKERS
12 CANNOT ALLOCATE RIGHTA MATCH MARKERS
13 CANNOT ALLOCATE LEFTT MATCH MARKERS
14 CANNOT ALLOCATE RIGHTT MATCH MARKERS
15 CANNOT OPEN HTML FILE
16 CANNOT OPEN DOCUMENT FILE
17 CANNOT OPEN SIDE BY SIDE HTML FILE
101 CANNOT ACCESS URL
102 NO FILE OPEN
103 CANNOT FIND FILE
104 CANNOT FIND FILE EXTENSION
105 BAD DOCX FILE
106 BAD PDF FILE
107 CANNOT FIND URL LINK
108 CANNOT OPEN INPUT FILE
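Returning to the note on internet shortcuts above: a Windows “.url” shortcut is a small INI-style text file with an [InternetShortcut] section and a URL entry. The Python sketch below shows one way to pull the address out of such a file. How WCopyfind itself reads these files is not described here, so treat this as an illustration only; the function name `url_from_shortcut` is made up for the example.

```python
import configparser

def url_from_shortcut(text: str) -> str:
    """Extract the target address from the text of a .url shortcut file."""
    # interpolation=None keeps '%' characters in URLs from being treated as
    # placeholders; strict=False tolerates repeated keys in real shortcuts.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(text)
    return parser.get("InternetShortcut", "URL")

# Usage with the example site mentioned in the note above.
shortcut_text = "[InternetShortcut]\nURL=http://www.nytimes.com/\n"
print(url_from_shortcut(shortcut_text))
```

A shortcut that lacks the section or the URL entry makes `parser.get` raise an exception, which loosely corresponds to the situation behind error 107 (CANNOT FIND URL LINK).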
Great little program. However, what happens to my apostrophes? In the comparison docs, the apostrophes are given strange characters such as: .
I am using the program to compare two documents (one of which I wrote) to identify parts that I took from another document (which a co author wrote). However, I want to use the output to easily replace my original. But this program replaces apostrophes for some reason. I have tried saving as plain text, but still the problem persists.
Thanks, David. It’s clearly a bug in my code and I’ll try to find it and fix it. Could you tell me what type of document format you’re using and whether those apostrophes are normal apostrophes (straight) or “smart quote” apostrophes (curly).
Lou
Hi Lou, thanks. I am using Word 2002. Even apostrophes as in Lou’s program became a numbered box, and when cut and paste back became a square symbol. Otherwise I use smart quotes. However with plain text I assume these become straight. BTW I am using the latest Firefox. Don’t know if that makes a difference.
As an Associate Professor I come across plagiarised material often. Wcopyfind really does the job. Thanks a Ton.
Thanks! I’m glad you find the software useful. — Lou
Hi – Can anyone explain the relationship between the “Minimum % of Matching Words” and the “Most Imperfections to Allow” parameters? I understand the latter is about bridging words but changing the former seems to affect the results. What is the formula the program uses for “Minimum % of Matching Words”? For example, if I have the following string “This a string that is too far to be matched” and I set imperfections to 2 and I compare to the string “This a string that is to be matched” (eliminating the words “too far”) what setting for “Minimum % of Matching Words” will cause this to be flagged? I guess I just don’t know what a “% of Matching Words” is…
As WCopyfind is sniffing around, comparing two possibly matching phrases, it starts by identifying the part of those phrases that match perfectly. It then tries to extend those phrases, forward and backward, until one of two limits is reached. Each time it has to step over or around a mismatched word, it tallies a flaw. Once the number of flaws exceeds your “Most Imperfections to Allow” choice, it stops extending the phrases. Moreover, it will stop building phrases once the percentage of matching words in the phrases drops below your set limit: “Minimum % of Matching Words”.
When WCopyfind is extending two long, almost perfectly matching phrases, the % of matching words is very high and that limit is unlikely to stop it from extending the phrases. In all likelihood, it will extend those long phrases until it encounters too many flaws. This behavior allows WCopyfind to keep extending phrases that are probably related to one another.
When WCopyfind is extending two short, almost perfectly matching phrases, the % of matching words is relatively low and that limit is likely to stop it from extending the phrases. This behavior keeps WCopyfind from trying to extend short matching phrases that are probably not related.
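The interplay of the two limits can be sketched in Python. This is a deliberately simplified illustration and not WCopyfind's actual algorithm: it only extends forward, it compares the two documents in lockstep (so it does not handle inserted or deleted words), and it assumes the flaw count is reset after each matching word and that the percentage is matching words divided by all words in the phrase so far. The function name and parameters are hypothetical.

```python
from typing import List

def extend_phrase(left: List[str], right: List[str], i: int, j: int,
                  seed: int, max_flaws: int, min_percent: float) -> int:
    """Return the phrase length (in words) after extending a perfect seed.

    left[i:i+seed] and right[j:j+seed] are assumed to match perfectly.
    """
    matched = seed    # perfectly matching words so far
    length = seed     # words consumed so far, including bridged flaws
    best = seed       # longest length that ends on a matching word
    run_flaws = 0     # consecutive non-matching words since the last match
    while i + length < len(left) and j + length < len(right):
        if left[i + length] == right[j + length]:
            matched += 1
            run_flaws = 0
        else:
            run_flaws += 1
            if run_flaws > max_flaws:
                break  # too many imperfections in a row
        length += 1
        if 100 * matched < min_percent * length:
            break      # percentage of matching words fell below the limit
        if run_flaws == 0:
            best = length
    return best

left = "the quick brown fox jumps over the lazy dog".split()
right = "the quick brown cat jumps over the lazy dog".split()
print(extend_phrase(left, right, 0, 0, seed=3, max_flaws=2, min_percent=70))
print(extend_phrase(left, right, 0, 0, seed=3, max_flaws=2, min_percent=80))
```

Under these assumptions, bridging one flaw right after a 3-word seed drops the phrase to 3 matching words out of 4 (75%), so an 80% limit stops the extension at once while a 70% limit lets it continue. That is the short-phrase behavior described above.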
Hi – Thanks for the very quick info. Sorry to be dense, but I’m not sure I understand what the numerator and denominator are in the “% of Matching Words” calculus? I created files A.txt and B.txt containing the text in my original post. Setting the imperfections allowed to be 2, I tried settings of 1% thru 99% for Minimum % of Matching Words but this never allows the two word gap to be bridged?
I am excited about using the program, but having some difficulty. I have a .doc file and a pdf file on my C drive for comparing. I have browsed and put one in the Old documents and the other in the New documents section and designated a folder for the report. I know there is at least 1 exact sentence shared between the two, yet the report comes back as no shared content. If I put the same document in both the new and old section for comparison, I do get a 100% duplication reported. Any suggestions?
Thank you
Does the PDF file contain the sentence as text or as an image? Many software packages generate PDF files that are images and that can’t be read as text by ordinary software. Adobe Acrobat can convert those image-based documents into text-based ones, but WCopyfind cannot.
Lou
I am drawing a blank on how to create a report folder. Any help?
Hello! First, let me thank you for this wonderful program that works VERY well! I have two questions. The matches are showing in red as hyperlinks. Is there any way to have the matches show as just highlighted material? Finally, is there any way to keep the source formatting in the results? Thank you so much!
Hi I’m trying to use your program, but it’s not loading the side-by-side comparison results. I’m using chrome as browser and comparing .docx files in “New Documents”. Any idea what I might be doing wrong?
WCopyfind may be having trouble launching the browser itself. You can do it manually by looking for the matches.html file in your reporting folder. Double-clicking on it should display the summary page. If you want to see the individual side-by-side pages, look for the SBS—.html files and double-click on them. If they won’t load, then there may be something wrong in the html I’m generating. I haven’t tinkered with that aspect of the programs for years and the formatting may be dated.
Lou
Hi Sir, it is a wonderful application. I have one question: can’t it also read images as strings? If yes, any idea how to do that? I am willing to add that functionality to it but need your guidance 🙂
My programs don’t know how to handle images; instead, they try to jump over non-text content. In trying to compare images, I’m not sure what the concept would be — same image, same data structure for that image, or same textual content in the image. You’re welcome to tinker with the code in any case.
Lou
Hi to all,
I thought that sharing a common example would be handy to understand some of the workings of the problem. I am interested in the % match statistic. In particular, I made 2 documents with the following phrases, set up to avoid repetitions:
Document 1:
This is a phrase in common. Segment in first doc only.
Document 2:
This is a phrase in common. Second part uncommon. This part was added.
I set shortest phrase to 3. If I understand correctly, the number of words in matching phrases would be 6 (first sentence). Since total number of words are 11 and 13, respectively, percentages should be 6/11=55% and 6/13=46%
However, the report is: 68 (91% L, 94% R)
What do I make of that?
Many thanks for any help. I think it will be useful for future users.
Hi Marcel,
I tried comparing your two documents and obtained the results you expected:
6 (46% L, 54% R) and 6 (46%) L; 6 (54%) R
That suggests that WCopyfind was having trouble reading your documents and somehow read them as one character per word. That would account for the 68 matches it found because there are about 68 characters in each of your files.
The input section of WCopyfind is relatively primitive and can’t handle exotic file formats. It’s a hobby activity for me and I only tinker with it occasionally. As my day job becomes more and more absurd, however, I may return to doing stuff that I enjoy … like improving this software.
Lou
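The arithmetic in this exchange can be checked with a short Python sketch. It counts whitespace-separated words in Marcel's two example documents and divides the 6 matching words by each total. The helper name `match_percentages` is made up for this illustration, and the word counting is a plain `split()`, not WCopyfind's own tokenizer.

```python
from typing import Tuple

def match_percentages(matched: int, words_a: int, words_b: int) -> Tuple[float, float]:
    """Percentage of each document's words that lie in matching phrases."""
    return 100 * matched / words_a, 100 * matched / words_b

doc1 = "This is a phrase in common. Segment in first doc only."
doc2 = "This is a phrase in common. Second part uncommon. This part was added."

n1, n2 = len(doc1.split()), len(doc2.split())   # 11 and 13 words
p1, p2 = match_percentages(6, n1, n2)
print(f"{n1} words: {p1:.1f}%   {n2} words: {p2:.1f}%")
```

The two values come out at roughly 54.5% and 46.2%. The reported 54% and 46% would be consistent with the percentages being truncated rather than rounded, and with the two documents having been loaded in the opposite Left/Right order, though neither point is confirmed in the thread.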
It would be wonderful to have more than one result from one sentence. I mean, if the first doc has “hello man how are you” and the second doc has “hello how are you” and “hello you”, it would be very useful to find both sentences of the second doc from the sentence of the first doc. The program only gives one result.
Would that be possible?
Thanks a lot.
That’s an interesting idea. I can probably implement it. I’ve put it on my to-do list.
Lou
That would be great, Lou, really. My objective is not about plagiarism. It is about studying large amounts of data.
Don’t you sometimes think, while you are studying huge documents, “oh, I saw something similar on another page, but where??”
I am sure that by knowing all the repeated ideas (different ideas, each of which is repeated several times), we’ll be able to connect all the information much better.
If you like this idea, it would be awesome if you came up with more ideas for improvement in this direction (for example, automatically finding all similar ideas within a single document; that would be just perfect).
P.S.: sorry for my English, I’m not a native speaker.
Alex
Any news about that?
Regards 😉