Author Topic: Text Extraction Engines Configuration  (Read 31138 times)

Mike Sanders

  • Plagiarism Detector support
  • Administrator
Text Extraction Engines Configuration
« on: September 15, 2013, 04:47:16 PM »
Plagiarism Detector has several methods of extracting plain text from each supported file type.
(See the list of supported file formats.)

Each file type has its own set of Text Extraction Engines.

A Text Extraction Engine (TEE) is a sub-program that extracts text from a specific file type.

By default, every time you start Document Manager, Plagiarism Detector automatically selects the optimal TEE for each supported file type.

Still, there are 3 cases in which you may need to change the TEE to get better text extraction:
  • Incomplete text extraction (some parts of the document are missing).
  • Dirty text extraction (HTML code is present in the extracted text).
  • Wrong encoding detection for complex file formats.
TEE configuration made in Document Manager is global: it is used in all cases.
TEE configuration made in Advanced Report Viewer (ARV) is local: it is used only within the ARV session and is then reset to the global setting.
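
For anyone who finds the global vs. local distinction easier to follow in code, here is a minimal sketch of the idea in Python. It is only an illustration of the scoping behaviour described above, not how Plagiarism Detector is actually implemented. The class TeeConfig, its methods, and the engine names ("engine_a", "engine_b", ...) are all made up for the example.

```python
from contextlib import contextmanager

# Hypothetical defaults: the real engine names are not given in this post.
DEFAULT_TEE = {"docx": "engine_a", "pdf": "engine_b"}


class TeeConfig:
    """Illustrative model of global vs. session-local TEE settings."""

    def __init__(self, defaults):
        self._global = dict(defaults)  # like settings made in Document Manager
        self._local = {}               # like settings made in an ARV session

    def set_global(self, file_type, engine):
        # A global change applies everywhere and persists.
        self._global[file_type] = engine

    def set_local(self, file_type, engine):
        # A local change only shadows the global value for now.
        self._local[file_type] = engine

    def engine_for(self, file_type):
        # A local override wins; otherwise fall back to the global setting.
        return self._local.get(file_type, self._global[file_type])

    @contextmanager
    def arv_session(self):
        # When the session ends, local overrides are dropped,
        # so lookups return the global settings again.
        try:
            yield self
        finally:
            self._local.clear()


config = TeeConfig(DEFAULT_TEE)
config.set_global("docx", "engine_c")

with config.arv_session() as session:
    session.set_local("docx", "engine_d")
    inside = config.engine_for("docx")   # local override is active here

after = config.engine_for("docx")        # back to the global setting
```

In this sketch, inside holds the session-only choice and after holds the global one, which mirrors how a change made in ARV does not survive past the ARV session.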

To change the TEE for a specific file type, click to the right of the file type; a combo box appears.
(The example below shows how to change the TEE for DocX files.)

« Last Edit: November 18, 2017, 08:19:50 PM by Mike Sanders »