API Reference

tesstrain.arguments

Argument handling utilities.

class tesstrain.arguments.TrainingArguments

Container for holding the training arguments.

tesstrain.arguments.get_argument_parser() ArgumentParser

Get the ArgumentParser for the CLI.

Returns:

The corresponding argument parser.
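
As a rough illustration of how such a parser can feed a container object, the sketch below uses only the standard library. `DemoArguments`, `build_demo_parser`, and the `--lang` and `--fonts` options are hypothetical stand-ins, not the options the real CLI defines; whether tesstrain populates `TrainingArguments` this way is an assumption.

```python
import argparse


class DemoArguments:
    """Hypothetical stand-in for a container such as TrainingArguments."""


def build_demo_parser() -> argparse.ArgumentParser:
    # Hypothetical options for illustration only; the real CLI defines its own.
    parser = argparse.ArgumentParser(description="Demo training CLI")
    parser.add_argument("--lang", default="eng")
    parser.add_argument("--fonts", nargs="+", default=None)
    return parser


# argparse can write the parsed values onto an existing object via `namespace=`.
ctx = build_demo_parser().parse_args(
    ["--lang", "deu", "--fonts", "Arial", "Times New Roman"],
    namespace=DemoArguments(),
)
```

Note that argparse only applies a default when the attribute is not already present on the namespace object, which is why the stand-in class defines no attributes of its own.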

tesstrain.arguments.verify_parameters_and_handle_defaults(ctx: TrainingArguments) TrainingArguments

Verify the given parameters and handle defaults if value is unset.

Parameters:

ctx – The parameters to handle.

Returns:

The parameters.
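
The verify-then-default pattern described here might look roughly like the following. `DemoContext`, its two fields, and `verify_and_fill` are hypothetical and far smaller than the real configuration; only the English default is taken from the `language_code` description in `tesstrain.wrapper.run`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DemoContext:
    """Hypothetical configuration; the field names are illustrative only."""

    langdata_dir: Optional[str] = None
    lang_code: Optional[str] = None


def verify_and_fill(ctx: DemoContext) -> DemoContext:
    # Required value: fail early with a clear message if it is missing.
    if not ctx.langdata_dir:
        raise ValueError("langdata_dir is required")
    # Optional value: fall back to a default only when it is unset.
    if ctx.lang_code is None:
        ctx.lang_code = "eng"
    return ctx


filled = verify_and_fill(DemoContext(langdata_dir="langdata"))
kept = verify_and_fill(DemoContext(langdata_dir="langdata", lang_code="deu"))
```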

tesstrain.generate

Utility for generating the various files.

For a detailed description of the phases, see https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html.

tesstrain.generate.check_file_readable(*filenames: str | Path) bool

Check if all the given files exist, or exit otherwise.

Used to check required input files and produced output files in each phase.

Parameters:

filenames – The filenames to check.

Returns:

Whether all files exist.
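
The check-or-exit behaviour can be sketched with the standard library alone. `check_files_exist` below is a stand-in written for this illustration, not the library's implementation; the error message and the file names are made up.

```python
import sys
import tempfile
from pathlib import Path
from typing import Union


def check_files_exist(*filenames: Union[str, Path]) -> bool:
    """Return True if every file exists; exit with status 1 otherwise."""
    for filename in filenames:
        if not Path(filename).is_file():
            print(f"Expected file {filename} does not exist", file=sys.stderr)
            sys.exit(1)
    return True


with tempfile.TemporaryDirectory() as tmp:
    present = Path(tmp) / "eng.training_text"
    present.write_text("sample\n", encoding="utf-8")
    all_present = check_files_exist(present)
    try:
        check_files_exist(present, Path(tmp) / "missing.box")
        exited_with = None
    except SystemExit as exc:
        exited_with = exc.code
```

Because a missing file ends the process, the boolean return value is only ever observed as true; callers do not need to branch on it.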

tesstrain.generate.cleanup(ctx: TrainingArguments) None

Move the log file to the output directory and remove the training directory.

Parameters:

ctx – The run configuration.
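
A minimal sketch of the move-then-remove sequence, assuming the log file lives inside the training directory (that placement, the file names, and the `cleanup_sketch` signature are assumptions made for this example):

```python
import shutil
import tempfile
from pathlib import Path


def cleanup_sketch(log_file: Path, training_dir: Path, output_dir: Path) -> Path:
    """Move the log into output_dir, then delete training_dir."""
    output_dir.mkdir(parents=True, exist_ok=True)
    # Move first: if the log sits inside training_dir, removing the
    # directory beforehand would delete the log as well.
    moved = Path(shutil.move(str(log_file), str(output_dir / log_file.name)))
    shutil.rmtree(training_dir)
    return moved


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    training_dir = root / "training"
    training_dir.mkdir()
    log_file = training_dir / "tesstrain.log"
    log_file.write_text("done\n", encoding="utf-8")
    moved_log = cleanup_sketch(log_file, training_dir, root / "output")
    log_survived = moved_log.is_file()
    training_dir_removed = not training_dir.exists()
```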

tesstrain.generate.err_exit(msg: str) NoReturn

Exit the application with exit code 1.

Parameters:

msg – The message to log before the exit.
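
In plain Python, a log-then-exit helper of this shape could be written as below. The logger name and the log level are assumptions; only the exit code of 1 comes from the description above.

```python
import logging
import sys
from typing import NoReturn

log = logging.getLogger("demo")


def err_exit_sketch(msg: str) -> NoReturn:
    # Log first so the reason is recorded, then terminate with status 1.
    log.critical(msg)
    sys.exit(1)


# sys.exit raises SystemExit, so the exit code can be observed in a test.
try:
    err_exit_sketch("required file missing")
except SystemExit as exc:
    exit_code = exc.code
```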

tesstrain.generate.generate_font_image(ctx: TrainingArguments, font: str, exposure: int, char_spacing: float) str

Helper function for phase_I_generate_image.

Generates the image for a single language/font combination in a way that can be run in parallel.

Parameters:
  • ctx – The run configuration.

  • font – The name of the font to use.

  • exposure – The exposure value to use.

  • char_spacing – The character spacing to use.

Returns:

A corresponding identifier.

tesstrain.generate.initialize_fontconfig(ctx: TrainingArguments) None

Initialize the font configuration with a unique font cache directory.

Parameters:

ctx – The run configuration.

tesstrain.generate.make_fontname(font: str) str

Convert the font name to one without special characters.

Parameters:

font – The name to convert.

Returns:

The converted name.
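
One plausible normalisation is sketched below. The exact rule (spaces become underscores, commas are dropped) is an assumption for illustration and may differ from what the library does.

```python
def make_fontname_sketch(font: str) -> str:
    # Assumed rule: spaces become underscores and commas are dropped.
    return font.replace(" ", "_").replace(",", "")


converted = make_fontname_sketch("Times New Roman, Bold")
```

A name cleaned this way is safe to embed in file names, which is how the `fontname` parameter of `make_outbase` appears to be used.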

tesstrain.generate.make_lstmdata(ctx: TrainingArguments) None

Construct LSTM training data.

tesstrain.generate.make_outbase(ctx: TrainingArguments, fontname: str, exposure: int) Path

Generate the base output path.

Parameters:
  • ctx – The run configuration.

  • fontname – The name of the font to train.

  • exposure – The current exposure value.

Returns:

The generated path.
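
The sketch below shows one way such a base path could be assembled. The `<lang>.<fontname>.exp<exposure>` layout and the explicit `training_dir` and `lang_code` arguments are assumptions; in the real function these values would come from `ctx`.

```python
from pathlib import Path


def make_outbase_sketch(
    training_dir: Path, lang_code: str, fontname: str, exposure: int
) -> Path:
    # Assumed layout: <training_dir>/<lang>.<fontname>.exp<exposure>
    return training_dir / f"{lang_code}.{fontname}.exp{exposure}"


outbase = make_outbase_sketch(Path("train"), "eng", "Arial", 0)
# Append extensions by string concatenation: Path.with_suffix would replace
# the trailing ".exp0" instead of adding to it.
tif_path = Path(f"{outbase}.tif")
```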

tesstrain.generate.phase_E_extract_features(ctx: TrainingArguments, box_config: list[str], ext: str) None

Phase E: (E)xtract .tr feature files from .tif/.box files.

Parameters:
  • ctx – The run configuration.

  • box_config – The box configuration values.

  • ext – The file extension of the generated feature files.

tesstrain.generate.phase_I_generate_image(ctx: TrainingArguments, par_factor: int | None = None) None

Phase I: Generate (I)mages from training text for each font.

Parameters:
  • ctx – The run configuration.

  • par_factor – Maximum number of workers.
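
To illustrate how a worker limit like `par_factor` can bound parallel per-font work, here is a sketch built on `concurrent.futures`. `render_one` is a placeholder for the real rendering step, and the use of a thread pool is an assumption rather than a statement about the library's internals.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import List, Optional


def render_one(font: str, exposure: int) -> str:
    # Placeholder for the per-font work; returns an identifier for the job.
    return f"{font}-exp{exposure}"


def render_all(
    fonts: List[str], exposures: List[int], par_factor: Optional[int] = None
) -> List[str]:
    jobs = list(itertools.product(fonts, exposures))
    # max_workers=None lets the executor choose its own default worker count.
    with ThreadPoolExecutor(max_workers=par_factor) as executor:
        futures = [executor.submit(render_one, font, exp) for font, exp in jobs]
        # Collect in submission order; .result() re-raises any worker error.
        return [future.result() for future in futures]


identifiers = render_all(["Arial", "Verdana"], [0, 1], par_factor=2)
```

Threads suit this kind of workload when each job mostly waits on an external program, since the interpreter lock is released while a subprocess runs.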

tesstrain.generate.phase_UP_generate_unicharset(ctx: TrainingArguments) None

Phase UP: Generate (U)nicharset and (P)roperties file.

Parameters:

ctx – The run configuration.

tesstrain.generate.run_command(cmd: str, *args: str | Path, env: dict[str, Any] | None = None) None

Helper function to run a command and append its output to a log. Aborts early if the program file is not found.

Parameters:
  • cmd – Binary to use.

  • args – Arguments to pass.

  • env – Environment variables to use.
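
A comparable helper can be sketched with `shutil.which` and `subprocess.run`. Exiting on a non-zero return code, the logger name, and the log levels are assumptions; the description above only states that output is logged and that a missing program aborts early.

```python
import logging
import shutil
import subprocess
import sys
from pathlib import Path
from typing import Dict, Optional, Union

log = logging.getLogger("demo")


def run_command_sketch(
    cmd: str, *args: Union[str, Path], env: Optional[Dict[str, str]] = None
) -> None:
    # Abort early when the program cannot be found on PATH.
    if shutil.which(cmd) is None:
        log.critical("%s not found", cmd)
        sys.exit(1)
    proc = subprocess.run(
        [cmd, *map(str, args)],
        env=env,  # None means the child inherits the current environment.
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # Merge stderr into stdout for one log stream.
        text=True,
    )
    log.debug(proc.stdout)
    # Assumption: a failing command is treated as fatal.
    if proc.returncode != 0:
        log.critical("%s exited with status %d", cmd, proc.returncode)
        sys.exit(1)


try:
    run_command_sketch("no-such-binary-for-this-demo")
    missing_exit = None
except SystemExit as exc:
    missing_exit = exc.code
```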

tesstrain.language_specific

Set some language-specific variables.

tesstrain.language_specific.set_lang_specific_parameters(ctx: TrainingArguments, lang: str) TrainingArguments

Set language-specific values for several global variables, including:

  • text_corpus: Holds the text corpus file for the language. Used in phase F.

  • fonts: Holds a sequence of applicable fonts for the language. Used in phases F and I. Only set if not already set.

  • training_data_arguments: Character-code-specific filtering to distinguish between scripts (e.g. CJK) used by filter_forbidden_characters in phase F.

  • wordlist2dawg_arguments: Specify fixed-length DAWG generation for non-space-delimited languages.

Parameters:
  • ctx – The run configuration to update.

  • lang – The language code.

Returns:

The updated run configuration.
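
The "only set if not already set" rule for fonts can be sketched as follows. `DemoLangContext`, `DEMO_FONTS`, and the font names in it are hypothetical and do not reflect the per-language values the library ships.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

# Hypothetical per-language defaults; not the values tesstrain uses.
DEMO_FONTS: Dict[str, List[str]] = {
    "eng": ["Demo Sans"],
    "jpn": ["Demo Gothic"],
}


@dataclass
class DemoLangContext:
    fonts: Optional[List[str]] = None


def set_demo_fonts(ctx: DemoLangContext, lang: str) -> DemoLangContext:
    # Respect fonts chosen by the caller; fill in defaults only when unset.
    if not ctx.fonts:
        ctx.fonts = list(DEMO_FONTS.get(lang, []))
    return ctx


defaulted = set_demo_fonts(DemoLangContext(), "eng")
preserved = set_demo_fonts(DemoLangContext(fonts=["Custom Serif"]), "eng")
```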

tesstrain.wrapper

Actual execution logic.

tesstrain.wrapper.run(fonts: List[str], langdata_directory: str, maximum_pages: int, fonts_directory: str | None = None, temporary_directory: str | None = None, language_code: str | None = None, output_directory: str | None = None, overwrite: bool = False, save_box_tiff: bool = False, linedata_only: bool = False, training_text: str | None = None, wordlist_file: str | None = None, extract_font_properties: bool = True, distort_image: bool = False, tessdata_directory: str | None = None, exposures: List[int] | None = None, point_size: int = 12, vertical_fonts: List[str] | None = None) int

Run with the given parameters.

Parameters:
  • fonts – A list of font names to train on. These need to be recognizable by Pango using fontconfig. An easy way to list the canonical name of all fonts available on your system is to run text2image with --list_available_fonts and the appropriate --fonts_dir path.

  • fonts_directory – Path to font files.

  • temporary_directory – Path to temporary training directory.

  • language_code – ISO 639 language code. Defaults to English.

  • langdata_directory – Path to tesseract/training/langdata directory.

  • maximum_pages – The maximum number of pages to generate.

  • output_directory – Location of generated traineddata file.

  • overwrite – Safe to overwrite files in output directory.

  • save_box_tiff – Save box/tiff pairs along with lstmf files.

  • linedata_only – Only generate training data for lstmtraining.

  • training_text – File with the text to render and use for training. If unspecified, we will look for it in the langdata directory.

  • wordlist_file – File with the word list for the language ordered by decreasing frequency. If unspecified, we will look for it in the langdata directory.

  • extract_font_properties – Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in a .fontinfo file.

  • distort_image – Degrade the rendered image with noise, blur, and inversion.

  • tessdata_directory – Specify location of existing traineddata files, required during feature extraction. If set, it should be the path to the tesseract/tessdata directory. If unspecified, the TESSDATA_PREFIX specified in the current environment will be used.

  • exposures – A list of exposure levels to use (e.g. [-1, 0, 1]). If unspecified, language-specific ones will be used.

  • point_size – Size of printed text.

  • vertical_fonts – A list of vertical font names to train on.

Returns:

The exit code. Always equals 0 at the moment.
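
A call might be assembled along these lines. The keyword names come from the signature above; every value (font name, paths, page count) is a placeholder chosen for this example, and `train` is a hypothetical wrapper.

```python
from typing import Any, Dict

# Placeholder values; every path and font name below is hypothetical.
run_kwargs: Dict[str, Any] = {
    # The three parameters without defaults in the signature:
    "fonts": ["Arial"],
    "langdata_directory": "/path/to/langdata",
    "maximum_pages": 10,
    # A few optional parameters:
    "language_code": "eng",
    "output_directory": "/path/to/output",
    "linedata_only": True,
    "exposures": [-1, 0, 1],
}


def train() -> int:
    # Imported lazily so this module still loads where tesstrain is absent.
    from tesstrain.wrapper import run

    return run(**run_kwargs)
```

Calling `train()` requires tesstrain to be installed, and presumably the Tesseract training tools as well, so the sketch only defines the function without invoking it.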

tesstrain.wrapper.run_from_context(ctx: TrainingArguments) None

Run with the given configuration.

Parameters:

ctx – The configuration to run with.