API Reference

tesstrain.arguments

Argument handling utilities.

class tesstrain.arguments.TrainingArguments

Container for holding the training arguments.

tesstrain.arguments.get_argument_parser() → ArgumentParser

Get the ArgumentParser for the CLI.

Returns: The corresponding argument parser.

tesstrain.arguments.verify_parameters_and_handle_defaults(ctx: TrainingArguments) → TrainingArguments

Verify the given parameters and fill in defaults for any value that is unset.

Parameters: ctx – The parameters to handle.
Returns: The verified parameters, with defaults filled in.
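
The parse-then-verify flow described above can be sketched with the standard library alone. This is a minimal illustration, not the tesstrain implementation: the class SketchArguments, the functions build_parser and fill_defaults, and the options --lang and --maxpages are all hypothetical stand-ins, and the fallback to "eng" is an assumption made for the example.

```python
import argparse


class SketchArguments(argparse.Namespace):
    """Hypothetical stand-in for a container that holds training arguments."""


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical options, chosen only to illustrate the pattern.
    parser = argparse.ArgumentParser(description="sketch of a training CLI")
    parser.add_argument("--lang", dest="lang_code", default=None)
    parser.add_argument("--maxpages", dest="max_pages", type=int, default=0)
    return parser


def fill_defaults(ctx: SketchArguments) -> SketchArguments:
    # Fill in a default when the value is unset, then validate the rest.
    if ctx.lang_code is None:
        ctx.lang_code = "eng"  # assumed default for this sketch
    if ctx.max_pages < 0:
        raise ValueError("max_pages must not be negative")
    return ctx


# Parse straight into the container, then verify it.
ctx = build_parser().parse_args(["--maxpages", "3"], namespace=SketchArguments())
ctx = fill_defaults(ctx)
print(ctx.lang_code, ctx.max_pages)  # eng 3
```

Passing namespace= to parse_args is what lets a single container object travel from the CLI through verification and into the later phases.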

tesstrain.generate

Utility for generating the various files.

For a detailed description of the phases, see https://tesseract-ocr.github.io/tessdoc/Training-Tesseract.html.

tesstrain.generate.check_file_readable(*filenames: str | Path) → bool

Check that all the given files exist; exit the application otherwise.

Used to check required input files and produced output files in each phase.

Parameters: filenames – The filenames to check.
Returns: Whether all files exist.
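
A check of this kind could look roughly like the sketch below. The name check_files_readable_sketch is hypothetical, and the choice to test readability by opening each file is an assumption; the real function may check differently.

```python
import sys
from pathlib import Path
from typing import Union


def check_files_readable_sketch(*filenames: Union[str, Path]) -> bool:
    """Return True if every file can be opened for reading; exit otherwise."""
    for filename in filenames:
        try:
            # Opening the file verifies both existence and read permission.
            with Path(filename).open("rb"):
                pass
        except OSError as exc:
            # sys.exit with a string prints it to stderr and exits with code 1.
            sys.exit(f"Required file is not readable: {filename} ({exc})")
    return True
```

Because the function exits on failure, a caller only ever sees True; the return value mainly makes the call convenient inside conditions.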

tesstrain.generate.cleanup(ctx: TrainingArguments) → None

Move the log file to the output directory and remove the training directory.

Parameters: ctx – The run configuration.

tesstrain.generate.err_exit(msg: str) → NoReturn

Exit the application with exit code 1.

Parameters: msg – The message to log before the exit.

tesstrain.generate.generate_font_image(ctx: TrainingArguments, font: str, exposure: int, char_spacing: float) → str

Helper function for phase_I_generate_image.

Generates the image for a single language/font combination in a way that can be run in parallel.

Parameters:

ctx – The run configuration.
font – The name of the font to use.
exposure – The exposure value to use.
char_spacing – The character spacing to use.

Returns:

A corresponding identifier.

tesstrain.generate.initialize_fontconfig(ctx: TrainingArguments) → None

Initialize the font configuration with a unique font cache directory.

Parameters: ctx – The run configuration.

tesstrain.generate.make_fontname(font: str) → str

Convert the font name to one without special characters.

Parameters: font – The name to convert.
Returns: The converted name.
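
As an illustration of this kind of conversion, the sketch below turns whitespace into underscores and drops remaining punctuation. Which characters the real function treats as special is not stated here, so the rules and the name make_fontname_sketch are assumptions.

```python
import re


def make_fontname_sketch(font: str) -> str:
    """Return a filename-friendly version of a font name."""
    # Assumption: runs of whitespace become a single underscore.
    collapsed = re.sub(r"\s+", "_", font.strip())
    # Assumption: anything other than letters, digits, "_" and "-" is dropped.
    return re.sub(r"[^\w-]", "", collapsed)


print(make_fontname_sketch("Times New Roman, Bold"))  # Times_New_Roman_Bold
```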

tesstrain.generate.make_lstmdata(ctx: TrainingArguments) → None

Construct LSTM training data.

tesstrain.generate.make_outbase(ctx: TrainingArguments, fontname: str, exposure: int) → Path

Generate the base output path.

Parameters:

ctx – The run configuration.
fontname – The name of the font to train.
exposure – The current exposure value.

Returns:

The generated path.
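
Tesseract training files conventionally follow the pattern [lang].[fontname].exp[num], so a base path could be assembled as in this sketch. The function name and its explicit training_dir and lang_code parameters are hypothetical; the real function reads its inputs from ctx, and its exact layout is not documented here.

```python
from pathlib import Path


def make_outbase_sketch(training_dir: Path, lang_code: str, fontname: str, exposure: int) -> Path:
    """Build a base path of the form <dir>/<lang>.<font>.exp<n>."""
    return Path(training_dir) / f"{lang_code}.{fontname}.exp{exposure}"


base = make_outbase_sketch(Path("train"), "eng", "Arial_Bold", 0)
print(base.name)  # eng.Arial_Bold.exp0

# Append extensions to the name; Path.with_suffix would replace ".exp0".
tif_path = base.with_name(base.name + ".tif")
print(tif_path.name)  # eng.Arial_Bold.exp0.tif
```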

tesstrain.generate.phase_E_extract_features(ctx: TrainingArguments, box_config: list[str], ext: str) → None

Phase E: (E)xtract .tr feature files from .tif/.box files.

Parameters:

ctx – The run configuration.
box_config – The box configuration values.

tesstrain.generate.phase_I_generate_image(ctx: TrainingArguments, par_factor: int | None = None) → None

Phase I: Generate (I)mages from training text for each font.

Parameters:

ctx – The run configuration.
par_factor – Maximum number of workers.
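
Running one job per font/exposure combination with a capped number of workers might be sketched as follows. The use of a thread pool is an assumption, and render_sketch and generate_all_sketch are hypothetical placeholders for the real rendering step and phase function.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product
from typing import List, Optional


def render_sketch(font: str, exposure: int) -> str:
    # Placeholder for the real rendering work; returns an identifier only.
    return f"{font}-exp{exposure}"


def generate_all_sketch(
    fonts: List[str], exposures: List[int], par_factor: Optional[int] = None
) -> List[str]:
    """Render every font/exposure pair using at most par_factor workers."""
    jobs = list(product(fonts, exposures))
    # max_workers=None lets the executor choose a default pool size.
    with ThreadPoolExecutor(max_workers=par_factor) as pool:
        futures = [pool.submit(render_sketch, font, exp) for font, exp in jobs]
        # Collect results in submission order; result() re-raises worker errors.
        return [future.result() for future in futures]


print(generate_all_sketch(["A", "B"], [0, 1], par_factor=2))
```

Threads suit this workload when each job mostly waits on an external process, as a rendering binary would be.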

tesstrain.generate.phase_UP_generate_unicharset(ctx: TrainingArguments) → None

Phase UP: Generate (U)nicharset and (P)roperties file.

Parameters: ctx – The run configuration.

tesstrain.generate.run_command(cmd: str, *args: str | Path, env: dict[str, Any] | None = None) → None

Helper function to run a command and append its output to a log. Aborts early if the program file is not found.

Parameters:

cmd – Binary to use.
args – Arguments to pass.
env – Environment variables to use.
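
A helper with this behavior can be sketched with shutil.which and subprocess.run. The name run_command_sketch and the logger name are hypothetical, and exiting on a non-zero return code is an assumption that the description above does not state.

```python
import logging
import shutil
import subprocess
import sys
from typing import Dict, Optional

log = logging.getLogger("run_command_sketch")


def run_command_sketch(cmd: str, *args: object, env: Optional[Dict[str, str]] = None) -> None:
    """Run cmd with args, logging its combined output."""
    # Abort early if the program cannot be located.
    if shutil.which(cmd) is None:
        sys.exit(f"{cmd} not found on PATH")
    proc = subprocess.run(
        [cmd, *map(str, args)],  # str() lets callers pass Path objects
        env=env,  # None means: inherit the current environment
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr into the captured output
        text=True,
    )
    log.info("%s", proc.stdout)
    # Assumption: a failing command should stop the run.
    if proc.returncode != 0:
        sys.exit(f"{cmd} failed with exit code {proc.returncode}")
```

For example, run_command_sketch(sys.executable, "-c", "print('ok')") runs the current interpreter and logs its output.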

tesstrain.language_specific

Set language-specific variables.

tesstrain.language_specific.set_lang_specific_parameters(ctx: TrainingArguments, lang: str) → TrainingArguments

Set language-specific values for several global variables, including:

text_corpus: Holds the text corpus file for the language. Used in phase F.
fonts: Holds a sequence of applicable fonts for the language. Used in phases F and I. Only set if not already set.
training_data_arguments: Character-code-specific filtering to distinguish between scripts (e.g. CJK), used by filter_forbidden_characters in phase F.
wordlist2dawg_arguments: Specify fixed-length DAWG generation for non-space-delimited languages.

Parameters:

ctx – The run configuration to update.
lang – The language code.

Returns:

The updated run configuration.
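
The "only set if not already set" rule for fonts can be illustrated as below. The table SKETCH_DEFAULTS, its values, and the function name are hypothetical; the real per-language tables live inside tesstrain.language_specific and are not reproduced here.

```python
from types import SimpleNamespace

# Hypothetical defaults, invented for this sketch.
SKETCH_DEFAULTS = {
    "eng": {"text_corpus": "eng.corpus.txt", "fonts": ["Example Serif", "Example Sans"]},
}


def set_lang_params_sketch(ctx: SimpleNamespace, lang: str) -> SimpleNamespace:
    """Apply per-language values without overriding caller-supplied fonts."""
    defaults = SKETCH_DEFAULTS.get(lang, {})
    ctx.text_corpus = defaults.get("text_corpus")
    # Fonts given by the caller take precedence over language defaults.
    if not getattr(ctx, "fonts", None):
        ctx.fonts = list(defaults.get("fonts", []))
    return ctx


preset = set_lang_params_sketch(SimpleNamespace(fonts=["My Font"]), "eng")
unset = set_lang_params_sketch(SimpleNamespace(fonts=None), "eng")
print(preset.fonts)  # ['My Font']
print(unset.fonts)   # ['Example Serif', 'Example Sans']
```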

tesstrain.wrapper

Actual execution logic.

tesstrain.wrapper.run(fonts: List[str], langdata_directory: str, maximum_pages: int, fonts_directory: str | None = None, temporary_directory: str | None = None, language_code: str | None = None, output_directory: str | None = None, overwrite: bool = False, save_box_tiff: bool = False, linedata_only: bool = False, training_text: str | None = None, wordlist_file: str | None = None, extract_font_properties: bool = True, distort_image: bool = False, tessdata_directory: str | None = None, exposures: List[int] | None = None, point_size: int = 12, vertical_fonts: List[str] | None = None) → int

Run with the given parameters.

Parameters:

fonts – A list of font names to train on. These need to be recognizable by Pango using fontconfig. An easy way to list the canonical name of all fonts available on your system is to run text2image with --list_available_fonts and the appropriate --fonts_dir path.
fonts_directory – Path to font files.
temporary_directory – Path to temporary training directory.
language_code – ISO 639 language code. Defaults to English.
langdata_directory – Path to tesseract/training/langdata directory.
maximum_pages – The maximum number of pages to generate.
output_directory – Location of generated traineddata file.
overwrite – Safe to overwrite files in output directory.
save_box_tiff – Save box/tiff pairs along with lstmf files.
linedata_only – Only generate training data for lstmtraining.
training_text – File with the text to render and use for training. If unspecified, we will look for it in the langdata directory.
wordlist_file – File with the word list for the language ordered by decreasing frequency. If unspecified, we will look for it in the langdata directory.
extract_font_properties – Assumes that the input file contains a list of ngrams. Renders each ngram, extracts spacing properties and records them in a .fontinfo file.
distort_image – Degrade rendered image with noise, blur, invert.
tessdata_directory – Specify location of existing traineddata files, required during feature extraction. If set, it should be the path to the tesseract/tessdata directory. If unspecified, the TESSDATA_PREFIX specified in the current environment will be used.
exposures – A list of exposure levels to use (e.g. [-1, 0, 1]). If unspecified, language-specific ones will be used.
point_size – Size of printed text.
vertical_fonts – A list of vertical font names to train on.

Returns:

The exit code. Currently always 0.
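
A call using the signature above might look like the following sketch. The font name and all paths are placeholders, and the helper names build_run_kwargs and train are hypothetical. The import is deferred into train() so that the block can be read and loaded without tesstrain installed; actually calling train() assumes the package and the Tesseract training tools it invokes are available.

```python
from typing import Any, Dict


def build_run_kwargs() -> Dict[str, Any]:
    """Collect keyword arguments for tesstrain.wrapper.run (placeholder values)."""
    return {
        "fonts": ["Example Font"],  # must be resolvable by Pango/fontconfig
        "langdata_directory": "path/to/langdata",
        "maximum_pages": 10,
        "language_code": "eng",
        "output_directory": "path/to/output",
        "linedata_only": True,  # only produce data for lstmtraining
        "exposures": [-1, 0, 1],
    }


def train() -> int:
    # Imported lazily; requires the tesstrain package to be installed.
    from tesstrain.wrapper import run

    return run(**build_run_kwargs())
```

Passing everything by keyword keeps the call readable given the number of optional parameters.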

tesstrain.wrapper.run_from_context(ctx: TrainingArguments) → None

Run with the given configuration.

Parameters: ctx – The configuration to run with.