WCopyfind compares text or word processor documents with one another to determine if they share words in phrases.
WCopyfind reads .DOCX, .TXT, and .HTML files natively, and it does a pretty good job of reading .PDF files as long as they contain text content rather than pure image content. WCopyfind can extract text from .DOC files, but without much sophistication or finesse.
- Select the documents by browsing for them or dragging them from Windows Explorer into WCopyfind.
- Adjust the comparison parameters.
- Select or create a folder in which all the report files will be placed.
- Run the comparison process.
- Examine the results.
Step 1: Start WCopyfind.
- Download or otherwise locate WCopyfind.4.1.1.exe (or WCopyfind184.108.40.206.exe for 64-bit computers) and click on its icon.
Step 2: Choose Documents to Compare
Two possible paths:
Path 1 (browse for documents):
1. Right-click on the old or new document list.
2. Select “Browse for Documents” in the popup menu.
3. Search for and select the documents you want to include in the comparison.
4. Press “Open” to add these documents to the document list.
5. Repeat 1 through 4 as needed. [Note: Steps 1 & 2 can be completed by simply double-clicking on the list.]
Path 2 (drag and drop):
1. Start Windows Explorer.
2. In Explorer, select one or more documents that you wish to compare. See Notes for discussion of Web-resident documents.
3. Drag-and-drop those documents into WCopyfind’s document windows.
4. Repeat 2 & 3 as needed.
[Additional Features: You can save or load a particular list of documents via the popup menu (right-click on the list). You can delete selected documents or all of the documents in a list via the popup menu, or selected documents by pressing "delete". Automatic sorting of the documents in a list as they are added can be turned off from the popup menu.]
Step 3: Adjust Comparison Rule Parameters
- Shortest Phrase to Match (Range: 1 to infinite): This number is the minimum phrase length, in words, that WCopyfind will consider to be a match. For example, when this parameter is set to 6, WCopyfind will ignore matching phrases that are 5 words long or shorter. I recommend leaving this parameter at 6 (words).
- Fewest Matches to Report (Range: 1 to infinite): This number is the fewest matching words in a pair of documents that will cause WCopyfind to report a document match in its “Compare Documents” window and generate a pair of underlined comparison documents in the Report Files Folder. There is no recommended value for this parameter.
- Most Imperfections to Allow (Range: 0 to 9): This number is the maximum number of non-matches that WCopyfind will allow between perfectly matching portions of a phrase. For example, if this value is set to 2, then WCopyfind will bridge its way across two non-matching words to connect pieces of perfectly matching prose. A value of 0 will limit WCopyfind to finding only perfect matches, while a value of 1 to 9 will allow WCopyfind to find imperfectly matching phrases (matches that contain flaws). Increasing this value slows the program down. I recommend a value of 0 (if speed or absolute matching is your main requirement) or 2 (if you want to find matches despite minor editing).
- Minimum % of Matching Words (Range: 0 to 100): This number is the minimum percentage of perfect matches that a phrase can contain and be considered a match. Setting this value at 100 limits WCopyfind to finding only perfect matches. I recommend a value of 100 (if speed or absolute matching is your main requirement) or 80 (if you want to find matches despite minor editing).
- Ignore All Punctuation (Checked: Yes or No): When checked, this parameter causes WCopyfind to ignore all punctuation characters when it is performing its comparisons. While punctuation will continue to appear in the reports that WCopyfind generates, it will not affect the phrase matching. The matching will normally increase when punctuation is ignored. I recommend against checking this box unless you really want to ignore all punctuation.
- Ignore Outer Punctuation (Checked: Yes or No): When checked, this parameter causes WCopyfind to ignore any punctuation characters that appear to the left or right of a word when it is performing its comparisons. For example, the quoted sentence “The box, which I found, is broken.” will be treated as though it were simply: The box which I found is broken (with no final period). While this “outer punctuation” will continue to appear in the reports that WCopyfind generates, it will not affect the phrase matching. The matching will normally increase when outer punctuation is ignored. I recommend against checking this box if you want absolute matching, and recommend checking it if you want to find matches despite minor editing.
- Ignore Numbers (Checked: Yes or No): When checked, this parameter causes WCopyfind to ignore any number characters when it is performing its comparisons. For example, the words 8-fold and 10-fold will match if this parameter is checked. While numbers will continue to appear in the reports that WCopyfind generates, they will not affect the phrase matching. The matching will normally increase when numbers are ignored. I recommend against checking this box if you want absolute matching, and recommend checking it if you want to find matches despite minor editing.
- Ignore Letter Case (Checked: Yes or No): When checked, this parameter causes WCopyfind to ignore capitalization of letters when it is performing its comparisons. For example, the words Whenever and whenever will match if this parameter is checked. While capital letters will continue to appear in the reports that WCopyfind generates, they will not affect the phrase matching. The matching will normally increase when capitalization is ignored. I recommend against checking this box if you want absolute matching, and recommend checking it if you want to find matches despite minor editing.
- Skip Non-Words (Checked: Yes or No): When checked, this parameter causes WCopyfind to completely skip words that contain any characters other than letters, except for internal hyphens and apostrophes. The non-words will neither be used in matching nor appear in the reports that WCopyfind generates. If you check this box, I suggest also checking “Ignore Outer Punctuation,” so that words that begin or end with punctuation aren’t skipped over (including plural possessives). I recommend against checking this box if you want absolute matching, and recommend checking it if the documents you are comparing contain many non-textual items, including filenames, URLs, and other word-processor junk.
- Skip Words Longer than _____ Characters (Checked: Yes or No, with Range: 0 to 255): When checked, this parameter causes WCopyfind to completely skip words that are longer than the number of characters you select. The overlong words will neither be used in matching nor appear in the reports that WCopyfind generates. I recommend checking this box and setting the number of characters at 20, unless your documents really do contain words longer than that. This choice will allow WCopyfind to skip over many non-textual items, including filenames, URLs, image data, and other word-processor junk.
- Basic Characters Only (in DOC Files) (Checked: Yes or No): When checked, this parameter causes WCopyfind to limit the character set it recognizes when reading a .DOC file (old-style Microsoft Word format). WCopyfind then considers characters outside that basic collection to be non-printing characters and does not include them in the matching process. I recommend selecting this option if you are comparing .DOC files that have relatively few non-English characters in them.
- Language: Selecting the most appropriate language helps WCopyfind determine which characters are letters, punctuation, or capitalizations.
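To make the "ignore" and "skip" options above more concrete, here is a minimal Python sketch of how a word might be normalized before matching. It is an illustration only, not WCopyfind's actual code: the names `Options`, `normalize`, and `words_for_matching` are hypothetical, the punctuation set is an assumption, and so is the order in which the options are applied.

```python
import string
from dataclasses import dataclass
from typing import List, Optional

# Assumed punctuation set: ASCII punctuation plus curly quotes.
PUNCT = string.punctuation + "“”‘’"


@dataclass
class Options:
    """Hypothetical mirror of the check boxes described above."""
    ignore_all_punctuation: bool = False
    ignore_outer_punctuation: bool = False
    ignore_numbers: bool = False
    ignore_case: bool = False
    skip_non_words: bool = False
    max_word_length: Optional[int] = None  # None means the box is unchecked


def is_plain_word(word: str) -> bool:
    """Letters only, except for internal hyphens and apostrophes."""
    if not word or not (word[0].isalpha() and word[-1].isalpha()):
        return False
    return all(ch.isalpha() or ch in "-'’" for ch in word)


def normalize(word: str, opts: Options) -> Optional[str]:
    """Return the form of `word` used for matching, or None if it is skipped."""
    if opts.max_word_length is not None and len(word) > opts.max_word_length:
        return None
    if opts.ignore_all_punctuation:
        word = "".join(ch for ch in word if ch not in PUNCT)
    elif opts.ignore_outer_punctuation:
        word = word.strip(PUNCT)  # strip only the left and right edges
    if opts.ignore_numbers:
        word = "".join(ch for ch in word if not ch.isdigit())
    if opts.ignore_case:
        word = word.lower()
    if opts.skip_non_words and not is_plain_word(word):
        return None
    return word or None


def words_for_matching(text: str, opts: Options) -> List[str]:
    """Split on white space and keep only the words that survive normalization."""
    normalized = (normalize(w, opts) for w in text.split())
    return [w for w in normalized if w is not None]
```

For example, `words_for_matching("“The box, which I found, is broken.”", Options(ignore_outer_punctuation=True))` yields the seven bare words of the sentence, which is the behavior described for “Ignore Outer Punctuation.” Applying the outer-punctuation step before the non-word test also reflects the advice to check both boxes together.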
Step 4: Choose Reporting Folder and Style
- Browse to locate or create the reporting folder. It must exist before you run the comparison.
- Check “Brief Report” box if you want the comparison files to contain only the matching phrases (see Note).
Step 5: Run Comparison and Examine Results
- Click “Run” — matches will be reported in the Comparison Window. (See Note for explanation of comparison line.)
- A small window will open while the comparison process is running, allowing you to abort the process before it finishes.
- When the process finishes, a browser window will open, allowing you to examine the pairs of matching files. You can click on the files individually for ease of printing, or you can click on the “side-by-side” option to display the pair of files together in adjacent panels of a new browser window.
- When you view the files side-by-side, all the matching phrases are actively linked between the two files. If you click on a matching phrase in the left file panel, the corresponding phrase in the right file will move to the top of the right panel, and vice versa.
- You can also double-click on a comparison line in WCopyfind’s internal report window to examine the two comparisons (which have been saved as .html files) in your internet browser.
[Additional Features: You can save the report list to a file via the popup menu (right-click on the report list). You can delete selected lines or all of the lines in the list via the popup menu, or selected lines by pressing "delete". You can also reopen the browser window that appears when the comparison finishes.]
Notes about File Formats
- WCopyfind knows how to unpack and analyze .docx files. It uses zlib to decompress each open document format file and then reads and decodes the document.xml file it contains. It handles the full Unicode character set and should work for most languages that put white space or punctuation between words.
- WCopyfind knows how to read .html (and .htm) files. It recognizes UTF-8 characters, when they are present, and thus should be able to handle many languages.
- WCopyfind knows how to open and analyze .pdf files, although it doesn’t handle complicated characters and it doesn’t always divide words correctly. However, if you place Foolabs’ pdftotext executable in the same folder as the WCopyfind.3.0 executable, WCopyfind will make use of pdftotext whenever it opens a .pdf file. Pdftotext is much more sophisticated and it reads .pdf files amazingly well. Since pdftotext is an open source program covered by the GNU General Public License, I include a copy of it here: pdftotext.exe. I hope that the people at Foolabs don’t mind. Important Note: .pdf files sometimes contain images of text rather than actual text. You can view those images in various .pdf viewers, but there is no actual text for WCopyfind to read. If you want to work with such imaged text, you’ll need to use OCR (optical character recognition) on the .pdf document and then save it as text. Adobe Acrobat can do this sort of thing.
- WCopyfind can sift through .doc files, looking for text, but it will also find internal file information such as image names and formatting instructions. There are just too many different formats of .doc file and learning how to analyze them accurately is too difficult. WCopyfind will simply do its best to read what it can.
- WCopyfind can read .txt files well because they are very straightforward. If the .txt file begins with the BOM (Byte Order Mark), WCopyfind will assume the .txt file is using the UTF-8 character set and will handle many languages. If the BOM is absent, it will assume the standard 8-bit Windows character set.
- WCopyfind can read other file formats, but without sophistication. It will simply assume the standard 8-bit Windows character set and try to find text in the file.
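The BOM rule for .txt files described above can be sketched in a few lines of Python. This is only an illustration of the stated behavior, not WCopyfind's own code: `decode_text` is a hypothetical helper, and treating "the standard 8-bit Windows character set" as code page 1252 is an assumption.

```python
import codecs


def decode_text(raw: bytes) -> str:
    """Decode the bytes of a .txt file using the BOM rule described above."""
    if raw.startswith(codecs.BOM_UTF8):
        # BOM present: drop it and treat the rest as UTF-8.
        return raw[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")
    # BOM absent: assume the 8-bit Windows character set (cp1252 is an assumption).
    return raw.decode("cp1252", errors="replace")
```

A file saved as "UTF-8 with BOM" goes through the first branch. The same text saved as UTF-8 without a BOM would be read one byte per character, so accented letters would come out wrong under this rule.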
- The comparison window and “matches.txt” both display two groupings of numbers:
  A. The first is the “Perfect Match” grouping. It shows (1) the number of perfectly matching words in phrases of at least “Shortest Phrase to Match” words, (2) the percentage of words in the Left document that are in those phrases, and (3) the percentage of words in the Right document that are in those phrases.
  B. The second is the “Overall Match” grouping. It shows (1) the number of perfectly and imperfectly matching words in phrases of at least “Shortest Phrase to Match” words in the Left document, (2) the percentage of words in the Left document that are in those phrases, (3) the number of perfectly and imperfectly matching words in phrases of at least “Shortest Phrase to Match” words in the Right document, and (4) the percentage of words in the Right document that are in those phrases.
- The “Make Vocab” button generates a long list of all of the words that appear in all of the document files. It produces an output file, which you choose after pressing the “Make Vocab” button, listing all the words and the number of times each appears across all of the documents. The final list is roughly in descending order of usage frequency, though it is not truly sorted. Generating this vocabulary may take a long time, particularly as the list of words encountered gets longer during the generation process. The ignore and skip parameters are active when making a vocabulary and should be selected carefully. I recommend checking “Ignore Outer Punctuation,” “Ignore Letter Case,” “Skip Non-Words,” and “Skip Words Longer than 20 Characters.”
- In the reports, perfectly matching words are shown red and underlined, while bridging (non-matching) words inside a matched phrase are shown green, italicized, and underlined.
- The matching phrases are links. If you click on a matching phrase, you will be taken to the equivalent phrase in the other document of the pair.
- WCopyfind can “surf the web” by following “internet shortcuts.” If you want it to load a document from the web during the comparison process, simply create an “internet shortcut” to that web-document and drag the “internet shortcut” into one of the two input windows of WCopyfind.
For example, if you want to include http://www.nytimes.com/ in the documents to be searched, first create an “internet shortcut” to that site somewhere on your desktop or in a folder. You can create it by using an internet browser to open http://www.nytimes.com/ and then dragging the link icon onto your desktop or into a folder.
You can also create “internet shortcuts” using a search engine such as Google. To do this, perform a search using that search engine and then drag and drop each interesting link into a folder on your desktop. The folder will then contain “internet shortcuts” to those interesting pages.
Once you have the “internet shortcuts,” you can drag them into WCopyfind’s document windows. You’ll see the “internet shortcut” name, followed by the extension “.url”. When you run the comparison, WCopyfind will load the document over the web and compare it as though it were a local file. You can save a folder full of “internet shortcuts” and drag them into WCopyfind whenever you want to compare local documents against them. Be aware that broken links will terminate the loading process (I should fix this problem). Also, WCopyfind reloads web pages when preparing reports that contain them, so if a page changes between the load for comparison and the load for reporting (an unlikely event, except with news services, etc.), the report may be scrambled.
- If you select “Brief Report,” WCopyfind will abbreviate its report files so that they contain only the matching phrases, in the order in which they appear in the document. Each of these phrases will be followed by a line break. In effect, the “Brief Report” option simply suppresses the inclusion of non-matching text in the reports and also inserts a line break (actually a new paragraph) at the end of each matching phrase.
- WCopyfind may report an error number if something goes wrong during processing. Those error numbers are:
1 CANNOT OPEN FILE
2 CANNOT ALLOCATE WORKING HASH ARRAY
3 CANNOT ALLOCATE HASH ARRAY
4 CANNOT ALLOCATE SORTED HASH ARRAY
5 CANNOT ALLOCATE SORTED NUMBER ARRAY
6 CANNOT OPEN LOG FILE
7 CANNOT OPEN COMPARISON REPORT TXT FILE
8 CANNOT OPEN COMPARISON REPORT HTML FILE
9 CANNOT ALLOCATE LEFT MATCH MARKERS
10 CANNOT ALLOCATE RIGHT MATCH MARKERS
11 CANNOT ALLOCATE LEFTA MATCH MARKERS
12 CANNOT ALLOCATE RIGHTA MATCH MARKERS
13 CANNOT ALLOCATE LEFTT MATCH MARKERS
15 CANNOT OPEN HTML FILE
16 CANNOT OPEN DOCUMENT FILE
17 CANNOT OPEN SIDE BY SIDE HTML FILE
101 CANNOT ACCESS URL
102 NO FILE OPEN
103 CANNOT FIND FILE
104 CANNOT FIND FILE EXTENSION
105 BAD DOCX FILE
106 BAD PDF FILE
107 CANNOT FIND URL LINK
108 CANNOT OPEN INPUT FILE
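
If you log WCopyfind's error numbers from a script, the table above can be kept as a small lookup. The Python sketch below copies the codes and messages exactly as listed. The helper `describe_error` and its fallback text are hypothetical, and numbers that do not appear in the list (such as 14) are reported as unknown rather than guessed.

```python
# Error numbers and messages exactly as listed above.
ERROR_MESSAGES = {
    1: "CANNOT OPEN FILE",
    2: "CANNOT ALLOCATE WORKING HASH ARRAY",
    3: "CANNOT ALLOCATE HASH ARRAY",
    4: "CANNOT ALLOCATE SORTED HASH ARRAY",
    5: "CANNOT ALLOCATE SORTED NUMBER ARRAY",
    6: "CANNOT OPEN LOG FILE",
    7: "CANNOT OPEN COMPARISON REPORT TXT FILE",
    8: "CANNOT OPEN COMPARISON REPORT HTML FILE",
    9: "CANNOT ALLOCATE LEFT MATCH MARKERS",
    10: "CANNOT ALLOCATE RIGHT MATCH MARKERS",
    11: "CANNOT ALLOCATE LEFTA MATCH MARKERS",
    12: "CANNOT ALLOCATE RIGHTA MATCH MARKERS",
    13: "CANNOT ALLOCATE LEFTT MATCH MARKERS",
    15: "CANNOT OPEN HTML FILE",
    16: "CANNOT OPEN DOCUMENT FILE",
    17: "CANNOT OPEN SIDE BY SIDE HTML FILE",
    101: "CANNOT ACCESS URL",
    102: "NO FILE OPEN",
    103: "CANNOT FIND FILE",
    104: "CANNOT FIND FILE EXTENSION",
    105: "BAD DOCX FILE",
    106: "BAD PDF FILE",
    107: "CANNOT FIND URL LINK",
    108: "CANNOT OPEN INPUT FILE",
}


def describe_error(code: int) -> str:
    """Return the listed message for `code`, or a generic fallback if unlisted."""
    return ERROR_MESSAGES.get(code, f"UNKNOWN ERROR {code}")
```

For instance, `describe_error(105)` returns "BAD DOCX FILE".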