Text Extraction Engines Configuration

Started by Mike Sanders, September 15, 2013, 01:47:16 PM

Mike Sanders

Plagiarism Detector has several methods of extracting plain text from each supported file type.
(The list of supported file formats)

Each file type has its own Text Extraction Engines.

A Text Extraction Engine (TEE) is a sub-program that extracts text from a specified file type.

By default, every time you start Document Manager, Plagiarism Detector automatically selects the optimal TEE for each supported file type.

Still, there are three cases in which you may need to change the TEE to get better text extraction:

  • Incomplete Text Extraction (some parts of the document are missing).
  • Dirty Text Extraction (when HTML code is present in the document).
  • Wrong encoding detection for complex file formats.
A TEE configuration made with Document Manager is global: it is used in all cases.
A TEE configuration made with Advanced Report Viewer (ARV) is local: it is used only within the ARV session and is then reset to the global setting.
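
To make the global versus local distinction concrete, here is a minimal Python sketch of the idea. It is only an illustration and not how Plagiarism Detector is actually implemented: the class ArvSession, the dictionary GLOBAL_TEE and the engine names are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical global TEE settings: file type -> engine name.
# The engine names are placeholders, not the product's real engine list.
GLOBAL_TEE: Dict[str, str] = {"docx": "engine_a", "pdf": "engine_b"}


class ArvSession:
    """Illustrative model of a viewer session with local TEE overrides."""

    def __init__(self, global_settings: Dict[str, str]) -> None:
        self._global = global_settings
        self._local: Dict[str, str] = {}

    def set_tee(self, file_type: str, engine: str) -> None:
        # A change made inside the session is stored apart from the global settings.
        self._local[file_type] = engine

    def tee_for(self, file_type: str) -> Optional[str]:
        # The local override wins; otherwise fall back to the global setting.
        return self._local.get(file_type, self._global.get(file_type))

    def close(self) -> None:
        # Ending the session discards the local overrides,
        # so lookups return the global settings again.
        self._local.clear()


session = ArvSession(GLOBAL_TEE)
session.set_tee("docx", "engine_c")       # local change, this session only
during_session = session.tee_for("docx")  # the override is in effect
session.close()
after_session = session.tee_for("docx")   # back to the global setting
```

In this sketch the global dictionary is never modified by the session, which mirrors the behaviour described above: a change made in ARV affects only that session.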

To change the TEE for a specific file type, click to the right of it and a combo box appears:
(The example below shows how to change the TEE for DocX files)

Plagiarism Detector is a Swiss Army knife.