textdirectory package

Submodules

textdirectory.cli module

Console script for textdirectory.

textdirectory.crudespellchecker module

Spellchecker module.

class textdirectory.crudespellchecker.CrudeSpellChecker(caching=True, language_model='crudesc_lm_en')[source]

Bases: object

A very simple and crude spellchecker based on Peter Norvig’s design.

Simple language models:

  • crudesc_lm_en.gz.lm – English, based on COCA (sample), OANC (written), and the BNC

  • crudesc_lm_ame.lm – American English, based on COCA (sample) and OANC (written)

  • crudesc_lm_amehistorical.lm – American English, based on COHA (sample)

candidates(word)[source]
Parameters:

word (str) – a word

Returns:

a list of candidates

correct_string(string, return_corrections=False)[source]
Parameters:
  • string (str) – the string to correct.

  • return_corrections (bool) – include the corrections in the result

Returns:

the corrected string

correction(word)[source]
Parameters:

word (str) – a word

Returns:

most probable spelling correction for word

edit_distance_1(word)[source]
Parameters:

word (str) – a word

Returns:

all edits one edit away from the word

edit_distance_2(word)[source]
Parameters:

word (str) – a word

Returns:

all edits two edits away from the word

known(words)[source]
Parameters:

words (list) – a list of words

Returns:

a subset of words in the dictionary of frequencies

p_word(word)[source]
Parameters:

word (str) – a word

Returns:

the probability of word according to the language model
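
Example (a minimal usage sketch based on the signatures above; the misspelled sample words are invented):

   from textdirectory.crudespellchecker import CrudeSpellChecker

   sc = CrudeSpellChecker(language_model='crudesc_lm_en')
   print(sc.correction('speling'))            # most probable single correction
   print(sc.candidates('speling'))            # all candidate corrections
   print(sc.correct_string('thes is a tst'))  # correct every word in a string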

textdirectory.crudespellchecker.generate_crudespellchecker_lm(corpus_directory, model_name, strip_xml=False)[source]
Parameters:
  • corpus_directory (str) – path to the folder containing the files.

  • model_name (str) – the name of the model

  • strip_xml (bool) – strip XML tags with bs4
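
Example (a sketch of building a custom model; 'my_corpus/' and 'my_model' are hypothetical names):

   from textdirectory.crudespellchecker import generate_crudespellchecker_lm

   # Build a language model from all files in a folder, stripping
   # XML tags with bs4 first.
   generate_crudespellchecker_lm('my_corpus/', 'my_model', strip_xml=True)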

textdirectory.helpers module

Helpers module.

textdirectory.helpers.chunk_text(string, chunk_size=50000)[source]
Parameters:
  • string (str) – a string

  • chunk_size (int) – the max characters of one chunk

Returns:

a list of chunks

textdirectory.helpers.count_non_alphanum(string)[source]
Parameters:

string (str) – a string

Returns:

the number of non-alphanumeric characters in the string
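
Example (a short sketch of the two helpers above, with an invented input string):

   from textdirectory.helpers import chunk_text, count_non_alphanum

   text = 'Hello, world! 123'
   chunks = chunk_text(text, chunk_size=5)  # list of chunks of at most 5 characters
   non_alnum = count_non_alphanum(text)     # number of non-alphanumeric characters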

textdirectory.helpers.estimate_spacy_max_length(override=False, tokenizer_only=False)[source]

Returns a somewhat sensible suggestion for max_length.

textdirectory.helpers.get_available_filters(get_human_name=False)[source]
Parameters:

get_human_name (bool) – if True, also return the ‘human name’

Returns:

a list of functions; if get_human_name is True, a list of tuples

textdirectory.helpers.get_available_transformations(get_human_name=False)[source]
Parameters:

get_human_name (bool) – if True, also return the ‘human name’

Returns:

a list of functions; if get_human_name is True, a list of tuples
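
Both discovery helpers share the same shape. Example (the output shown in the comments is illustrative):

   from textdirectory.helpers import get_available_filters, get_available_transformations

   print(get_available_filters())                              # the available filters
   print(get_available_transformations(get_human_name=True))   # list of (function, human name) tuples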

textdirectory.helpers.get_human_from_docstring(doc)[source]
Parameters:

doc (str) – the docstring to parse

Returns:

a dictionary of name_* keys/values from the docstring.

textdirectory.helpers.tabulate_flat_list_of_dicts(list_of_dicts, max_length=25)[source]
Parameters:
  • list_of_dicts (list) – a list of dictionaries; each dictionary is a row

  • max_length (int) – the maximum length of a cell

Returns:

a table
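
Example (an invented two-row input):

   from textdirectory.helpers import tabulate_flat_list_of_dicts

   rows = [{'file': 'a.txt', 'tokens': 120},
           {'file': 'b.txt', 'tokens': 87}]
   print(tabulate_flat_list_of_dicts(rows, max_length=25))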

textdirectory.textdirectory module

Main module.

class textdirectory.textdirectory.TextDirectory(directory, encoding='utf8', autoload=False)[source]

Bases: object

aggregate_to_file(filename='aggregated.txt')[source]
Parameters:

filename (str) – the path/filename to write to

aggregate_to_memory()[source]
Returns:

a string containing the aggregated text files

Type:

str
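
Example (a minimal aggregation sketch; 'data/' is a hypothetical folder of text files, and the comment on autoload is an assumption based on the parameter name):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/', autoload=True)  # autoload presumably runs load_files()
   text = td.aggregate_to_memory()         # one string containing all files
   td.aggregate_to_file('aggregated.txt')  # or write the aggregation to disk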

clear_transformation()[source]

Destage all transformations and clear memory.

destage_transformation(transformation)[source]
Parameters:

transformation (list) – the transformation that should be de-staged and its parameters

filter()[source]

A wrapper for filters.

filter_by_chars_outliers(sigmas=2)[source]
Parameters:

sigmas (int) – the number of standard deviations that qualifies a file as an outlier.

Human_name:

Character outliers

filter_by_contains(contains)[source]
Parameters:

contains (str) – A string that needs to be present in the file

Human_name:

Contains string

filter_by_filename_contains(contains)[source]
Parameters:

contains (str) – A string that needs to be present in the filename

Human_name:

Filename contains string

filter_by_filename_not_contains(not_contains)[source]
Parameters:

not_contains (str) – A string that must not be present in the filename

Human_name:

Filename does not contain string

filter_by_filenames(filenames)[source]
Parameters:

filenames (list) – A list of filenames to include

filter_by_max_chars(max_chars=100)[source]
Parameters:

max_chars (int) – the maximum number of characters a file can have

Human_name:

Maximum characters

filter_by_max_filesize(max_kb=100)[source]
Parameters:

max_kb (int) – the maximum size in kB a file is allowed to have.

Human_name:

Maximum filesize

filter_by_max_tokens(max_tokens=100)[source]
Parameters:

max_tokens (int) – the maximum number of tokens a file can have

Human_name:

Maximum tokens

filter_by_min_chars(min_chars=100)[source]
Parameters:

min_chars (int) – the minimum number of characters a file can have

Human_name:

Minimum characters

filter_by_min_filesize(min_kb=10)[source]
Parameters:

min_kb (int) – the minimum size in kB a file must have.

Human_name:

Minimum filesize

filter_by_min_tokens(min_tokens=1)[source]
Parameters:

min_tokens (int) – the minimum number of tokens a file can have

Human_name:

Minimum tokens

filter_by_not_contains(not_contains)[source]
Parameters:

not_contains (str) – A string that is not allowed to be present in the file

Human_name:

Does not contain string

filter_by_random_sampling(n, replace=False)[source]
Parameters:
  • n (int) – the number of documents in the sample

  • replace (bool) – should values be replaced (i.e., sample with replacement)

Human_name:

Random sampling

filter_by_similar_documents(reference_file, threshold=0.8)[source]
Parameters:
  • reference_file (str) – Path to the reference file

  • threshold (float) – A value between 0.0 and 1.0 indicating the maximum difference between the file and the reference.

Human_name:

Similar documents

get_aggregation()[source]

A generator that provides the current aggregation.

get_file_length(path)[source]
Parameters:

path (str) – path to a text file

Returns:

the file’s length in characters

get_file_tokens(path)[source]
Parameters:

path (str) – path to a text file

Returns:

the file’s length in tokens

get_text(file_id)[source]
Parameters:

file_id – the file_id in files

Returns:

the (transformed) text of the given file

load_aggregation_state(state=0)[source]
Parameters:

state (int) – the state to go back to

load_files(recursive=True, sort=True, filetype='txt', fast=False, skip_checkpoint=False)[source]
Parameters:
  • recursive (bool) – recursive search

  • sort (bool) – sort the files by name

  • filetype (str) – filetype to look for (e.g. txt)

  • fast (bool) – load files faster without getting metadata

  • skip_checkpoint (bool) – skip the initial aggregation state checkpoint

print_aggregation()[source]

Print the aggregated files as a table.

print_pipeline()[source]

Print the current pipeline.

print_saved_states()[source]

Print all saved states.

run_filters(filters)[source]
Parameters:

filters (list) – A list of tuples with filters and their arguments.
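
Example (the exact tuple layout, the filter name followed by its arguments, is an assumption based on the description above; 'data/' is a hypothetical folder):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/', autoload=True)
   td.run_filters([('filter_by_min_tokens', 10),
                   ('filter_by_contains', 'London')])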

run_transformations(text)[source]
Parameters:

text (str) – the text to run staged transformations on

Returns:

the transformed text

save_aggregation_state()[source]

Saves the current self.aggregation state.

set_aggregation(aggregation)[source]

Set the aggregation.

stage_transformation(transformation)[source]
Parameters:

transformation (list) – the transformation that should be staged and its parameters

transform_to_files(output_directory)[source]

Runs all transformations and stores the transformed texts in individual files.

Parameters:

output_directory (str) – the directory to write the files to

transform_to_memory()[source]

Runs all transformations and stores the transformed texts in memory.
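
Example (an end-to-end sketch of the class; 'data/' and 'corpus.txt' are hypothetical, and passing the transformation name as a single-element list follows the stage_transformation description above):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/')
   td.load_files(recursive=True, filetype='txt')

   # Filters narrow the aggregation down to matching files.
   td.filter_by_min_tokens(10)
   td.filter_by_filename_contains('novel')

   # Staged transformations run when the aggregation is produced.
   td.stage_transformation(['transformation_lowercase'])
   td.stage_transformation(['transformation_remove_nl'])

   td.print_pipeline()                 # inspect the staged pipeline
   td.aggregate_to_file('corpus.txt')  # run everything and write the result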

textdirectory.transformations module

Transformation module.

textdirectory.transformations.transformation_crude_spellchecker(text, language_model='crudesc_lm_en', *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • language_model (str) – the language model to use

Returns:

the transformed text

textdirectory.transformations.transformation_eebop4_to_plaintext(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_expand_english_contractions(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_ftfy(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_lemmatize(text, spacy_model='en_core_web_sm')[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

Returns:

the transformed text

Human_name:

Lemmatizer

textdirectory.transformations.transformation_lowercase(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_postag(text, spacy_model='en_core_web_sm', *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

Returns:

the transformed text

Human_name:

Add pos-tags

textdirectory.transformations.transformation_remove_htmltags(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_nl(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_non_alphanumerical(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_non_ascii(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • stopwords_source (str) – [internal, file] where the stopwords are loaded from

  • stopwords (str) – filename of a list containing stopwords

  • spacy_model (str) – the spaCy model we want to use

  • custom_stopwords (str) – a comma-separated list of additional stopwords to consider

Returns:

the transformed text

textdirectory.transformations.transformation_remove_weird_tokens(text, spacy_model='en_core_web_sm', remove_double_space=False, *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

  • remove_double_space (bool) – remove duplicated spaces

Returns:

the transformed text

textdirectory.transformations.transformation_replace_digits(text, replacement_character='%')[source]
Parameters:
  • text (str) – the text to run the transformation on

  • replacement_character (str) – the character to replace digits with

Returns:

the transformed text

textdirectory.transformations.transformation_to_leetspeak(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_uppercase(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_usas_en_semtag(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text
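
Every transformation is a plain function from text to text, so it can also be called outside a TextDirectory pipeline. Example:

   from textdirectory.transformations import (
       transformation_lowercase,
       transformation_remove_nl,
   )

   text = transformation_remove_nl('Hello\nWorld')  # strip newline characters
   text = transformation_lowercase(text)            # lowercase the text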

Module contents

Top-level package for textdirectory.