The Exact Sciences.
Texts in these fields of knowledge have features of their own that make the obfuscated Plagiarism detection algorithms described above unacceptable in many cases. Texts in Physics, Maths, etc. are usually much less flexible and rely heavily on domain-specific constructions and expressions that recur across many texts from the same domain. One of the best examples was a certain medical prescription that was flagged as Plagiarized upon checking. However, a manual check did not confirm it. It turned out that most (if not all) prescriptions use the same text structure as well as the same words and expressions. Only the components change.
Let us take this example from Wikipedia:
“Take of pentobarbitone sodium, three grammes
of sulphate of morphia, two grammes
of hydrate of chloral, fifteen grammes
of table sugar, enough to make fifty grammes.”
And now let’s shuffle the ingredients and quantities randomly:
“Take of hydrate of chloral, three grammes
of pentobarbitone sodium, two grammes
of sulphate of morphia, fifty grammes
of table sugar, enough to make fifteen grammes.”
Now recall the example from the Arts section that was meant to be detected as Plagiarism. Because both prescriptions use exactly the same words, it is rather evident that they will be considered the “same” text, obfuscated by changing the word order (one of the approaches to obfuscation).
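To make the problem concrete, here is a minimal sketch in Python. It does not reproduce the actual detection algorithm, which is not described here; it assumes a simple order-insensitive comparison (word counts compared as multisets) as a stand-in for any detector that tolerates word reordering. The function names are hypothetical.

```python
import re
from collections import Counter

def words(text):
    # Lowercase the text and keep only alphabetic words.
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words_similarity(a, b):
    # Assumed stand-in for an order-insensitive detector:
    # multiset overlap of word counts, ignoring word order entirely.
    count_a, count_b = Counter(words(a)), Counter(words(b))
    shared = sum((count_a & count_b).values())
    total = sum((count_a | count_b).values())
    return shared / total if total else 0.0

original = (
    "Take of pentobarbitone sodium, three grammes "
    "of sulphate of morphia, two grammes "
    "of hydrate of chloral, fifteen grammes "
    "of table sugar, enough to make fifty grammes."
)
shuffled = (
    "Take of hydrate of chloral, three grammes "
    "of pentobarbitone sodium, two grammes "
    "of sulphate of morphia, fifty grammes "
    "of table sugar, enough to make fifteen grammes."
)

print(bag_of_words_similarity(original, shuffled))
```

Both prescriptions contain exactly the same words the same number of times, so a comparison of this kind cannot tell them apart, even though they describe different mixtures.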
Of course, this is an error, one that we call a false positive. Errors of this kind are common to all Plagiarism detection algorithms that aim to detect obfuscated Plagiarism.
With this in mind, we modified the algorithm specifically for such texts so that it detects only what we call “word-to-word” Plagiarism. This algorithm will correctly treat the two “prescriptions” as different texts, but it will also treat the texts in the “Arts” example as different.
So this “word-to-word” Plagiarism Detection algorithm has the following features:
- Detects only parts of texts that match word for word
- Prevents false-positive results
- Usually shows less Plagiarism than a regular algorithm
- Performs poorly on even slightly obfuscated Plagiarism
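
The features above can be illustrated with a minimal sketch in Python. It is not the actual “word-to-word” algorithm, whose details are not given here; it assumes that word-for-word matching can be approximated by comparing runs of four consecutive words (word n-grams). The function names and the choice of n = 4 are assumptions made for illustration.

```python
import re

def word_ngrams(text, n=4):
    # Split into lowercase words and collect every run of n consecutive words.
    tokens = re.findall(r"[a-z]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def word_for_word_similarity(a, b, n=4):
    # Assumed measure: share of n-word runs common to both texts
    # (Jaccard index over the two n-gram sets).
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0

original = (
    "Take of pentobarbitone sodium, three grammes "
    "of sulphate of morphia, two grammes "
    "of hydrate of chloral, fifteen grammes "
    "of table sugar, enough to make fifty grammes."
)
shuffled = (
    "Take of hydrate of chloral, three grammes "
    "of pentobarbitone sodium, two grammes "
    "of sulphate of morphia, fifty grammes "
    "of table sugar, enough to make fifteen grammes."
)

print(word_for_word_similarity(original, original))
print(word_for_word_similarity(original, shuffled))
```

An identical copy scores 1.0, while the shuffled prescription scores far lower because only a few four-word runs (such as “of table sugar enough”) survive the reordering. The same property explains the weakness in the last bullet: any rewording that breaks up the word runs lowers the score, whether it is an honest difference or deliberate obfuscation.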
In recent years we have released several versions of Plagiarism Detector that use this algorithm and provided them to customers who required this kind of check. However, maintaining two very different versions is not what we consider right, so our R&D team spent a lot of time incorporating both algorithms into a single piece of software!