OR Section: "Check Type" Word-to-Word vs Re-Write

Started by Alexei B., September 17, 2015, 09:32:09 PM


Alexei B.

The Plagiarism Detector team is proud to announce an awesome new feature added to Plagiarism Detector: presets for different kinds of documents are now available "out of the box"!

Some theory behind this:
Some time ago we started observing cases in which our usual Plagiarism detection algorithms produced unsatisfactory results for certain documents. Additional research highlighted features these cases have in common: specific characteristics of texts in what we came to call the Arts subjects and the Exact Sciences subjects. While the language of these documents is the same, each group has some very interesting characteristics that require a different approach from the Plagiarism detection algorithms.

So, starting with version 885, you can select the preferred check algorithm in the Step-by-Step Wizard! Please select "detect Text Rewrite (maximum detection)" if your documents are in the Arts subjects or similar, and "detect Word-to-Word (maximum exactness)" if your documents are in the Exact Sciences fields.

We truly believe that this new feature (one we have never seen anywhere else) will help our customers in the never-ending struggle against copy-paste!


Alexei B.

#1
The Arts.

Texts in these subjects are very flexible by nature and allow a lot of modification without actually changing the meaning; any analysis of a piece of literature is a good example of this. To detect Plagiarism in the best possible way, the software has to detect obfuscated, "re-written" cases of Plagiarism, where sentences are modified (manually or automatically) to keep the meaning but avoid detection. We have multiple such modified documents, provided by our customers at different times, which show how some students try to evade Plagiarism detection. For example, the sentence "It was a need for him to have the computer fixed" should ideally be detected as similar to "he must have had the PC fixed". Please note: these examples are hypothetical and very simplified; the actual algorithm is much more complex, and this pair may or may not be detected, depending on the context.
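
To make the idea concrete, here is a deliberately simplified sketch, not the actual Plagiarism Detector algorithm: a strict, contiguous word-to-word comparison sees almost no overlap between this pair, while a crude order-insensitive comparison (standing in for a rewrite-aware check) still finds the shared vocabulary. The tiny synonym list is purely hypothetical.

Code (Python):
def tokens(text, synonyms=None):
    # Lowercase word list; map words through a (hypothetical) synonym table.
    synonyms = synonyms or {}
    return [synonyms.get(w, w) for w in text.lower().replace(".", "").split()]

def word_ngrams(ws, n=3):
    # Contiguous word n-grams: the strict, "word-to-word" style of comparison.
    return {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}

def jaccard(a, b):
    # Overlap between two sets as a fraction of their union.
    return len(a & b) / len(a | b) if a | b else 0.0

SYNONYMS = {"pc": "computer"}   # purely illustrative, not a real synonym database

original = "It was a need for him to have the computer fixed"
rewrite  = "He must have had the PC fixed"

a, b = tokens(original, SYNONYMS), tokens(rewrite, SYNONYMS)

# Strict contiguous 3-grams barely overlap after the rewrite...
print("word-to-word overlap:", round(jaccard(word_ngrams(a), word_ngrams(b)), 2))
# ...while an order-insensitive comparison still sees the shared vocabulary.
print("rewrite-style overlap:", round(jaccard(set(a), set(b)), 2))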

Such an approach to Plagiarism detection is seen as better not only by us but also by many competitors, due to several advantages:
-   Obfuscated Plagiarism detection
-   More Plagiarism detected (users often compare software by the detection percentage for the same document)

That is why it has usually been the default setting in our software.
However, this approach was found to have a significant drawback:
-   False-positive results for certain documents (see below)

Alexei B.

#2
The Exact Sciences.


Texts in these fields of knowledge show certain features of their own, making the above-mentioned obfuscated Plagiarism detection algorithms unacceptable in many cases. Texts in Physics, Maths, etc. are usually much less flexible and make heavy use of domain-specific constructions and expressions that are similar across many texts from the same domain of knowledge. One of the best examples was a certain medical prescription that was flagged as Plagiarized upon checking. However, a manual check did not confirm it. It turned out that most (if not all) prescriptions use the same text structure as well as the same words and expressions; it is only the components that change.

Let us take this example from Wikipedia:
"Take of pentobarbitone sodium, three grammes
of sulphate of morphia, two grammes
of hydrate of chloral, fifteen grammes
of table sugar, enough to make fifty grammes."
And now let's shuffle the ingredients randomly:
"Take of hydrate of chloral, three grammes
of pentobarbitone sodium, two grammes
of sulphate of morphia, fifty grammes
of table sugar, enough to make fifteen grammes."

Now remember the example from the Arts section that was supposed to be detected as Plagiarism. It is rather evident that, because the same words are used, these two texts will be considered the "same" text that was obfuscated by changing the word order (one of the approaches to obfuscation).

Of course, this is an error, one that we call a false positive. Errors of this kind are common to all Plagiarism detection algorithms that aim to detect obfuscated Plagiarism.

With this in mind, we modified the algorithm specifically for such texts, to detect only what we call "word-to-word" Plagiarism. This algorithm will correctly treat the two "prescription" variants as different texts, but it will also treat the "Arts" example pair as different texts (a small sketch after the list below illustrates the contrast).

So this "word-to-word" Plagiarism Detection algorithm has the following features:
-   Detects only parts of texts that match word for word
-   Prevents false-positive results
-   Usually reports less Plagiarism than the regular algorithm
-   Is poor at finding even slightly obfuscated Plagiarism
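
For the prescription example above, here is a minimal sketch of this contrast. Again, this is only a toy illustration using a bag-of-words comparison versus contiguous word n-grams, not the shipped algorithms: the order-insensitive comparison sees the two prescriptions as identical (the false positive), while the word-to-word comparison keeps them clearly apart.

Code (Python):
def words(text):
    # Lowercase word list, punctuation stripped.
    return [w.strip(",.") for w in text.lower().split()]

def word_ngrams(ws, n=4):
    # Contiguous word n-grams: the "word-to-word" comparison.
    return {tuple(ws[i:i + n]) for i in range(len(ws) - n + 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

prescription = ("Take of pentobarbitone sodium, three grammes "
                "of sulphate of morphia, two grammes "
                "of hydrate of chloral, fifteen grammes "
                "of table sugar, enough to make fifty grammes.")

shuffled = ("Take of hydrate of chloral, three grammes "
            "of pentobarbitone sodium, two grammes "
            "of sulphate of morphia, fifty grammes "
            "of table sugar, enough to make fifteen grammes.")

a, b = words(prescription), words(shuffled)

# Ignoring word order, the two prescriptions look identical -> false positive.
print("bag-of-words overlap:", round(jaccard(set(a), set(b)), 2))          # 1.0
# Requiring contiguous runs of identical words keeps them clearly apart.
print("word-to-word overlap:", round(jaccard(word_ngrams(a), word_ngrams(b)), 2))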

In recent years we have had several versions of Plagiarism Detector using this algorithm, and they were provided to customers who required this kind of check. However, maintaining two very different versions did not seem right to us, so our R&D team spent a lot of time incorporating both algorithms into a single piece of software!