textdirectory package

Submodules

textdirectory.cli module

Console script for textdirectory.

textdirectory.crudespellchecker module

Spellchecker module.

class textdirectory.crudespellchecker.CrudeSpellChecker(caching=True, language_model='crudesc_lm_en')[source]

Bases: object

A very simple and crude spellchecker based on Peter Norvig’s design.

Simple language models:

  • crudesc_lm_en.gz.lm – English, based on COCA (sample), OANC (written), and the BNC

  • crudesc_lm_ame.lm – American English, based on COCA (sample) and OANC (written)

  • crudesc_lm_amehistorical.lm – American English, based on COHA (sample)

candidates(word)[source]
Parameters:

word (str) – a word

Returns:

a list of candidates

correct_string(string, return_corrections=False)[source]
Parameters:
  • string (str) – the string to correct.

  • return_corrections (bool) – include the corrections in the result

Returns:

the corrected string

correction(word)[source]
Parameters:

word (str) – a word

Returns:

most probable spelling correction for word

edit_distance_1(word)[source]
Parameters:

word (str) – a word

Returns:

all edits one edit away from the word

edit_distance_2(word)[source]
Parameters:

word (str) – a word

Returns:

all edits two edits away from the word

known(words)[source]
Parameters:

words (list) – a list of words

Returns:

a subset of words in the dictionary of frequencies

p_word(word)[source]
Parameters:

word (str) – a word

Returns:

the probability of word according to the language model
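
Example (a minimal usage sketch based on the signatures above; the misspelled sample words are invented):

   from textdirectory.crudespellchecker import CrudeSpellChecker

   sc = CrudeSpellChecker(language_model='crudesc_lm_en')
   print(sc.correction('speling'))            # most probable single correction
   print(sc.candidates('speling'))            # all candidate corrections
   print(sc.correct_string('thes is a tst'))  # correct every word in a string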

textdirectory.crudespellchecker.generate_crudespellchecker_lm(corpus_directory, model_name, strip_xml=False)[source]
Parameters:
  • corpus_directory (str) – path to the folder containing the files.

  • model_name (str) – the name of the model

  • strip_xml (bool) – strip XML tags with bs4
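
Example (a sketch of building a custom model; 'my_corpus/' and 'my_model' are hypothetical names):

   from textdirectory.crudespellchecker import generate_crudespellchecker_lm

   # Build a language model from all files in a folder, stripping
   # XML tags with bs4 first.
   generate_crudespellchecker_lm('my_corpus/', 'my_model', strip_xml=True)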

textdirectory.helpers module

Helpers module.

textdirectory.helpers.chunk_text(string, chunk_size=50000)[source]
Parameters:
  • string (str) – a string

  • chunk_size (int) – the max characters of one chunk

Returns:

a list of chunks

textdirectory.helpers.count_non_alphanum(string)[source]
Parameters:

string (str) – a string

Returns:

the number of non-alphanumeric characters in the string
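
Example (a short sketch of the two helpers above, with an invented input string):

   from textdirectory.helpers import chunk_text, count_non_alphanum

   text = 'Hello, world! 123'
   chunks = chunk_text(text, chunk_size=5)  # list of chunks of at most 5 characters
   non_alnum = count_non_alphanum(text)     # number of non-alphanumeric characters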

textdirectory.helpers.estimate_spacy_max_length(override=False, tokenizer_only=False)[source]

Returns a somewhat sensible suggestion for max_length.

textdirectory.helpers.get_available_filters(get_human_name=False)[source]
Parameters:

get_human_name (bool) – if True, also return the ‘human name’

Returns:

a list of functions; if get_human_name is True, a list of tuples

textdirectory.helpers.get_available_transformations(get_human_name=False)[source]
Parameters:

get_human_name (bool) – if True, also return the ‘human name’

Returns:

a list of functions; if get_human_name is True, a list of tuples
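
Both discovery helpers share the same shape. Example (the output shown in the comments is illustrative):

   from textdirectory.helpers import get_available_filters, get_available_transformations

   print(get_available_filters())                              # the available filters
   print(get_available_transformations(get_human_name=True))   # list of (function, human name) tuples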

textdirectory.helpers.get_human_from_docstring(doc)[source]
Parameters:

doc (str) – the docstring to parse

Returns:

a dictionary of name_* keys/values from the docstring.

textdirectory.helpers.tabulate_flat_list_of_dicts(list_of_dicts, max_length=25)[source]
Parameters:
  • list_of_dicts (list) – a list of dictionaries; each dictionary is a row

  • max_length (int) – the maximum length of a cell

Returns:

a table
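
Example (an invented two-row input):

   from textdirectory.helpers import tabulate_flat_list_of_dicts

   rows = [{'file': 'a.txt', 'tokens': 120},
           {'file': 'b.txt', 'tokens': 87}]
   print(tabulate_flat_list_of_dicts(rows, max_length=25))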

textdirectory.textdirectory module

Main module.

class textdirectory.textdirectory.TextDirectory(directory, encoding='utf8', autoload=False)[source]

Bases: object

aggregate_to_file(filename='aggregated.txt')[source]
Parameters:

filename (str) – the path/filename to write to

aggregate_to_memory()[source]
Returns:

a string containing the aggregated text files

Type:

str
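
Example (a minimal aggregation sketch; 'data/' is a hypothetical folder of text files, and the comment on autoload is an assumption based on the parameter name):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/', autoload=True)  # autoload presumably runs load_files()
   text = td.aggregate_to_memory()         # one string containing all files
   td.aggregate_to_file('aggregated.txt')  # or write the aggregation to disk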

clear_transformation()[source]

Destage all transformations and clear memory.

destage_transformation(transformation)[source]
Parameters:

transformation (list) – the transformation that should be de-staged and its parameters

filter()[source]

A wrapper for filters.

filter_by_chars_outliers(sigmas=2)[source]
Parameters:

sigmas (int) – the number of standard deviations that qualifies a file as an outlier.

Human_name:

Character outliers

filter_by_contains(contains)[source]
Parameters:

contains (str) – A string that needs to be present in the file

Human_name:

Contains string

filter_by_filename_contains(contains)[source]
Parameters:

contains (str) – A string that needs to be present in the filename

Human_name:

Filename contains string

filter_by_filename_not_contains(not_contains)[source]
Parameters:

not_contains (str) – A string that must not be present in the filename

Human_name:

Filename does not contain string

filter_by_filenames(filenames)[source]
Parameters:

filenames (list) – A list of filenames to include

filter_by_max_chars(max_chars=100)[source]
Parameters:

max_chars (int) – the maximum number of characters a file can have

Human_name:

Maximum characters

filter_by_max_filesize(max_kb=100)[source]
Parameters:

max_kb (int) – the maximum size in kB a file is allowed to have.

Human_name:

Maximum filesize

filter_by_max_tokens(max_tokens=100)[source]
Parameters:

max_tokens (int) – the maximum number of tokens a file can have

Human_name:

Maximum tokens

filter_by_min_chars(min_chars=100)[source]
Parameters:

min_chars (int) – the minimum number of characters a file can have

Human_name:

Minimum characters

filter_by_min_filesize(min_kb=10)[source]
Parameters:

min_kb (int) – the minimum size in kB a file must have.

Human_name:

Minimum filesize

filter_by_min_tokens(min_tokens=1)[source]
Parameters:

min_tokens (int) – the minimum number of tokens a file can have

Human_name:

Minimum tokens

filter_by_not_contains(not_contains)[source]
Parameters:

not_contains (str) – A string that is not allowed to be present in the file

Human_name:

Does not contain string

filter_by_random_sampling(n, replace=False)[source]
Parameters:
  • n (int) – the number of documents in the sample

  • replace (bool) – should values be replaced (i.e., sample with replacement)

Human_name:

Random sampling

filter_by_similar_documents(reference_file, threshold=0.8)[source]
Parameters:
  • reference_file (str) – Path to the reference file

  • threshold (float) – A value between 0.0 and 1.0 indicating the maximum difference between the file and the reference.

Human_name:

Similar documents

get_aggregation()[source]

A generator that provides the current aggregation.

get_file_length(path)[source]
Parameters:

path (str) – path to a text file

Returns:

the file’s length in characters

get_file_tokens(path)[source]
Parameters:

path (str) – path to a text file

Returns:

the file’s length in tokens

get_text(file_id)[source]
Parameters:

file_id – the file_id in files

Returns:

the (transformed) text of the given file

load_aggregation_state(state=0)[source]
Parameters:

state (int) – the state to go back to

load_files(recursive=True, sort=True, filetype='txt', fast=False, skip_checkpoint=False)[source]
Parameters:
  • recursive (bool) – recursive search

  • sort (bool) – sort the files by name

  • filetype (str) – filetype to look for (e.g. txt)

  • fast (bool) – load files faster without getting metadata

  • skip_checkpoint (bool) – skip the initial aggregation state checkpoint

print_aggregation()[source]

Print the aggregated files as a table.

print_pipeline()[source]

Print the current pipeline.

print_saved_states()[source]

Print all saved states.

run_filters(filters)[source]
Parameters:

filters (list) – A list of tuples with filters and their arguments.
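
Example (the exact tuple layout, the filter name followed by its arguments, is an assumption based on the description above; 'data/' is a hypothetical folder):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/', autoload=True)
   td.run_filters([('filter_by_min_tokens', 10),
                   ('filter_by_contains', 'London')])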

run_transformations(text)[source]
Parameters:

text (str) – the text to run staged transformations on

Returns:

the transformed text

save_aggregation_state()[source]

Saves the current self.aggregation state.

set_aggregation(aggregation)[source]

Set the aggregation.

stage_transformation(transformation)[source]
Parameters:

transformation (list) – the transformation that should be staged and its parameters

transform_to_files(output_directory)[source]

Runs all transformations and stores the transformed texts in individual files.

Parameters:

output_directory (str) – the directory to write the files to

transform_to_memory()[source]

Runs all transformations and stores the transformed texts in memory.
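
Example (an end-to-end sketch of the class; 'data/' and 'corpus.txt' are hypothetical, and passing the transformation name as a single-element list follows the stage_transformation description above):

   from textdirectory.textdirectory import TextDirectory

   td = TextDirectory(directory='data/')
   td.load_files(recursive=True, filetype='txt')

   # Filters narrow the aggregation down to matching files.
   td.filter_by_min_tokens(10)
   td.filter_by_filename_contains('novel')

   # Staged transformations run when the aggregation is produced.
   td.stage_transformation(['transformation_lowercase'])
   td.stage_transformation(['transformation_remove_nl'])

   td.print_pipeline()                 # inspect the staged pipeline
   td.aggregate_to_file('corpus.txt')  # run everything and write the result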

textdirectory.transformations module

Transformation module.

textdirectory.transformations.transformation_crude_spellchecker(text, language_model='crudesc_lm_en', *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • language_model (str) – the language model to use

Returns:

the transformed text

textdirectory.transformations.transformation_eebop4_to_plaintext(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_expand_english_contractions(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_ftfy(text)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_lemmatize(text, spacy_model='en_core_web_sm')[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

Returns:

the transformed text

Human_name:

Lemmatizer

textdirectory.transformations.transformation_lowercase(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_postag(text, spacy_model='en_core_web_sm', *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

Returns:

the transformed text

Human_name:

Add pos-tags

textdirectory.transformations.transformation_remove_htmltags(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_nl(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_non_alphanumerical(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_non_ascii(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • stopwords_source (str) – [internal, file] where the stopwords are loaded from

  • stopwords (str) – filename of a list containing stopwords

  • spacy_model (str) – the spaCy model we want to use

  • custom_stopwords (str) – a comma-separated list of additional stopwords to consider

Returns:

the transformed text

textdirectory.transformations.transformation_remove_weird_tokens(text, spacy_model='en_core_web_sm', remove_double_space=False, *args)[source]
Parameters:
  • text (str) – the text to run the transformation on

  • spacy_model (str) – the spaCy model we want to use

  • remove_double_space (bool) – remove duplicated spaces

Returns:

the transformed text

textdirectory.transformations.transformation_replace_digits(text, replacement_character='%')[source]
Parameters:
  • text (str) – the text to run the transformation on

  • replacement_character (str) – the character to replace digits with

Returns:

the transformed text

textdirectory.transformations.transformation_to_leetspeak(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_uppercase(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text

textdirectory.transformations.transformation_usas_en_semtag(text, *args)[source]
Parameters:

text (str) – the text to run the transformation on

Returns:

the transformed text
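
Every transformation is a plain function from text to text, so it can also be called outside a TextDirectory pipeline. Example:

   from textdirectory.transformations import (
       transformation_lowercase,
       transformation_remove_nl,
   )

   text = transformation_remove_nl('Hello\nWorld')  # strip newline characters
   text = transformation_lowercase(text)            # lowercase the text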

Module contents

Top-level package for textdirectory.