Plagiarism Detector Community
General Category => "Silver Bullets" => Topic started by: Mike Sanders on September 15, 2013, 04:47:16 PM
-
Plagiarism Detector has several methods of extracting plain text from each supported file type.
(The List of supported file formats (http://www.plagiarism-detector.com/smf_bb/index.php/topic,29.0.html))
Each file type has its own Text Extraction Engines.
A Text Extraction Engine (TEE) is a sub-program that extracts text from a specific file type.
By default, every time you start Document Manager, Plagiarism Detector automatically selects the best TEE for each supported file type.
Still, there are 3 cases in which you may need to change the TEE to get better text extraction:
- Incomplete Text Extraction (some parts of the document are missing).
- Dirty Text Extraction (HTML code is present in the extracted text).
- Wrong encoding detection for complex file formats.
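If you want a quick way to check an extracted text for the "dirty" case, a rough sketch like the one below can help. It is not part of Plagiarism Detector; the function name, the pattern and the threshold are assumptions made only for illustration, and the check is a simple heuristic that just counts things that look like HTML tags.

```python
import re

# Heuristic pattern for HTML-like tags such as <p>, </div>, <br/>, <a href="...">.
# Assumption: a tag starts with "<", an optional "/", and then a letter.
TAG_PATTERN = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(?:\s[^<>]*)?/?>")


def looks_dirty(text, max_tags=0):
    """Return True if the text contains more than max_tags HTML-like tags."""
    return len(TAG_PATTERN.findall(text)) > max_tags


if __name__ == "__main__":
    sample = "<p>Some extracted paragraph</p>"
    print(looks_dirty(sample))
```

If the check reports leftover tags for a document, that is a hint to try another TEE for that file type.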
The TEE configuration made in Document Manager is global: it is used in all cases.
The TEE configuration made in Advanced Report Viewer (ARV) is local: it is used only within the current ARV session and is then reset to the global setting.
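The global/local behaviour can be pictured with a small sketch. This is only an illustration of the idea, not how Plagiarism Detector is implemented; the class, its methods and the engine names "engine_a" and "engine_b" are all hypothetical.

```python
class TeeConfig:
    """Toy model: global TEE settings plus temporary per-session overrides."""

    def __init__(self, global_settings):
        # Global settings, as made in Document Manager (hypothetical model).
        self.global_settings = dict(global_settings)
        # Local overrides, as made inside one ARV session (hypothetical model).
        self.session_overrides = {}

    def set_global(self, file_type, tee):
        self.global_settings[file_type] = tee

    def set_local(self, file_type, tee):
        self.session_overrides[file_type] = tee

    def current(self, file_type):
        # A local override wins while the session lasts.
        return self.session_overrides.get(file_type, self.global_settings[file_type])

    def end_session(self):
        # Closing the session drops all overrides, so the global setting applies again.
        self.session_overrides.clear()


if __name__ == "__main__":
    config = TeeConfig({"docx": "engine_a"})
    config.set_local("docx", "engine_b")
    print(config.current("docx"))
    config.end_session()
    print(config.current("docx"))
```

In this model a change made through set_local never touches the global settings, which matches the description above: once the ARV session ends, the global choice is back in effect.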
To change the TEE for a specific file type, click to the right of it and a combo box appears:
(The example below shows how to change the TEE for DocX files)
(https://plagiarism-detector.com/smf_bb/_external_images/doc_selector_change_tee.png)