textdirectory package¶
Submodules¶
textdirectory.cli module¶
Console script for textdirectory.
textdirectory.crudespellchecker module¶
Spellchecker module.
- class textdirectory.crudespellchecker.CrudeSpellChecker(caching=True, language_model='crudesc_lm_en')[source]¶
Bases:
object
A very simple and crude spellchecker based on Peter Norvig’s design.
Available language models:
- crudesc_lm_en.gz.lm – English, based on COCA (sample), OANC (written), and the BNC
- crudesc_lm_ame.lm – American English, based on COCA (sample) and OANC (written)
- crudesc_lm_amehistorical.lm – American English, based on COHA (sample)
- correct_string(string, return_corrections=False)[source]¶
- Parameters:
string (str) – the string to correct.
return_corrections (bool) – include the corrections in the result
- Returns:
the corrected string
- correction(word)[source]¶
- Parameters:
word (str) – a word
- Returns:
most probable spelling correction for word
- edit_distance_1(word)[source]¶
- Parameters:
word (str) – a word
- Returns:
all edits one edit away from the word
- edit_distance_2(word)[source]¶
- Parameters:
word (str) – a word
- Returns:
all edits two edits away from the word
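The edit-distance helpers follow Peter Norvig’s classic spell-corrector design: generate every string reachable by one delete, transpose, replace, or insert. A minimal pure-Python sketch of that idea (illustrative only, not the library’s exact code):

```python
import string

def edit_distance_1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)
```

`edit_distance_2` is then simply every `edit_distance_1` candidate of every `edit_distance_1` result.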
textdirectory.helpers module¶
Helpers module.
- textdirectory.helpers.chunk_text(string, chunk_size=50000)[source]¶
- Parameters:
string (str) – a string
chunk_size (int) – the max characters of one chunk
- Returns:
a list of chunks
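Character-based chunking can be sketched in one line; note this naive version may split mid-word, and the library’s implementation may handle boundaries differently:

```python
def chunk_text(string, chunk_size=50000):
    """Split a string into chunks of at most chunk_size characters."""
    return [string[i:i + chunk_size] for i in range(0, len(string), chunk_size)]
```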
- textdirectory.helpers.count_non_alphanum(string)[source]¶
- Parameters:
string (str) – a string
- Returns:
the number of non-alphanumeric characters in the string
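As an illustrative sketch (assuming the same `str.isalnum`-style definition of alphanumeric), the count can be computed as:

```python
def count_non_alphanum(string):
    """Count characters that are neither letters nor digits."""
    return sum(1 for c in string if not c.isalnum())
```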
- textdirectory.helpers.estimate_spacy_max_length(override=False, tokenizer_only=False)[source]¶
Returns a somewhat sensible suggestion for max_length.
- textdirectory.helpers.get_available_filters(get_human_name=False)[source]¶
- Parameters:
get_human_name (bool) – if True, also return the ‘human name’
- Returns:
a list of functions; if get_human_name a list of tuples
- textdirectory.helpers.get_available_transformations(get_human_name=False)[source]¶
- Parameters:
get_human_name (bool) – if True, also return the ‘human name’
- Returns:
a list of functions; if get_human_name a list of tuples
textdirectory.textdirectory module¶
Main module.
- class textdirectory.textdirectory.TextDirectory(directory, encoding='utf8', autoload=False)[source]¶
Bases:
object
- aggregate_to_file(filename='aggregated.txt')[source]¶
- Parameters:
filename (str) – the path/filename to write to
- destage_transformation(transformation)[source]¶
- Parameters:
transformation (list) – the transformation that should be de-staged and its parameters
- filter_by_chars_outliers(sigmas=2)[source]¶
- Parameters:
sigmas (int) – The number of stds that qualifies an outlier.
- Human_name:
Character outliers
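The outlier filter keeps files whose character counts fall within the given number of standard deviations of the mean. A minimal sketch of that rule on a list of lengths (the library operates on its loaded files, not a bare list):

```python
from statistics import mean, stdev

def filter_chars_outliers(lengths, sigmas=2):
    """Keep only lengths within `sigmas` standard deviations of the mean."""
    mu, sd = mean(lengths), stdev(lengths)
    return [n for n in lengths if abs(n - mu) <= sigmas * sd]
```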
- filter_by_contains(contains)[source]¶
- Parameters:
contains (str) – A string that needs to be present in the file
- Human_name:
Contains string
- filter_by_filename_contains(contains)[source]¶
- Parameters:
contains (str) – A string that needs to be present in the filename
- Human_name:
Filename contains string
- filter_by_filename_not_contains(not_contains)[source]¶
- Parameters:
not_contains (str) – A string that must not be present in the filename
- Human_name:
Filename does not contain string
- filter_by_filenames(filenames)[source]¶
- Parameters:
filenames (list) – A list of filenames to include
- filter_by_max_chars(max_chars=100)[source]¶
- Parameters:
max_chars (int) – the maximum number of characters a file can have
- Human_name:
Maximum characters
- filter_by_max_filesize(max_kb=100)[source]¶
- Parameters:
max_kb (int) – The maximum size in kB a file is allowed to have.
- Human_name:
Maximum filesize
- filter_by_max_tokens(max_tokens=100)[source]¶
- Parameters:
max_tokens (int) – the maximum number of tokens a file can have
- Human_name:
Maximum tokens
- filter_by_min_chars(min_chars=100)[source]¶
- Parameters:
min_chars (int) – the minimum number of characters a file can have
- Human_name:
Minimum characters
- filter_by_min_filesize(min_kb=10)[source]¶
- Parameters:
min_kb (int) – The minimum size in kB a file must have.
- Human_name:
Minimum Filesize
- filter_by_min_tokens(min_tokens=1)[source]¶
- Parameters:
min_tokens (int) – the minimum number of tokens a file can have
- Human_name:
Minimum tokens
- filter_by_not_contains(not_contains)[source]¶
- Parameters:
not_contains (str) – A string that is not allowed to be present in the file
- Human_name:
Does not contain string
- filter_by_random_sampling(n, replace=False)[source]¶
- Parameters:
n (int) – the number of documents in the sample
replace (bool) – whether to sample with replacement
- Human_name:
Random sampling
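Sampling with and without replacement maps directly onto the standard library’s `random.choices` and `random.sample`; an illustrative sketch:

```python
import random

def sample_files(files, n, replace=False):
    """Draw a random sample of n files, with or without replacement."""
    if replace:
        return random.choices(files, k=n)  # the same file may appear more than once
    return random.sample(files, n)         # each file appears at most once
```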
- filter_by_similar_documents(reference_file, threshold=0.8)[source]¶
- Parameters:
reference_file (str) – Path to the reference file
threshold (float) – A value between 0.0 and 1.0 indicating the maximum difference between the file and the reference.
- Human_name:
Similar documents
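A threshold-based similarity check can be sketched with `difflib.SequenceMatcher`; note this is one plausible similarity measure, not necessarily the one the library uses internally:

```python
from difflib import SequenceMatcher

def is_similar(text, reference, threshold=0.8):
    """True if the similarity ratio between text and reference meets the threshold."""
    return SequenceMatcher(None, text, reference).ratio() >= threshold
```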
- get_file_length(path)[source]¶
- Parameters:
path (str) – path to a text file
- Returns:
the file’s length in characters
- get_file_tokens(path)[source]¶
- Parameters:
path (str) – path to a text file
- Returns:
the file’s length in tokens
- get_text(file_id)[source]¶
- Parameters:
file_id – the file_id in files
- Returns:
the (transformed) text of the given file
- load_files(recursive=True, sort=True, filetype='txt', fast=False, skip_checkpoint=False)[source]¶
- Parameters:
recursive (bool) – recursive search
sort (bool) – sort the files by name
filetype (str) – filetype to look for (e.g. txt)
fast (bool) – load files faster without getting metadata
- run_filters(filters)[source]¶
- Parameters:
filters (list) – A list of tuples with filters and their arguments.
- run_transformations(text)[source]¶
- Parameters:
text (str) – the text to run staged transformations on
- Returns:
the transformed text
- stage_transformation(transformation)[source]¶
- Parameters:
transformation (list) – the transformation that should be staged and its parameters
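The stage/run pattern above (queue transformations, then apply them in order to each text) can be sketched in pure Python. This is illustrative only; TextDirectory stages transformations by name with their parameters, not as raw callables:

```python
def run_transformations(text, staged):
    """Apply staged transformations (a callable plus its extra arguments) in order."""
    for func, *args in staged:
        text = func(text, *args)
    return text

# Stage two simple transformations: lowercase, then flatten newlines.
staged = [
    (str.lower,),
    (str.replace, "\n", " "),
]
# run_transformations("Hello\nWorld", staged) -> "hello world"
```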
textdirectory.transformations module¶
Transformation module.
- textdirectory.transformations.transformation_crude_spellchecker(text, language_model='crudesc_lm_en', *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
language_model (str) – the language model to use
- Returns:
the transformed text
- textdirectory.transformations.transformation_eebop4_to_plaintext(text)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_expand_english_contractions(text)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
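Contraction expansion is typically a lookup-table replacement. A minimal sketch with a deliberately abbreviated, hypothetical table (the library’s own table is more complete):

```python
import re

# Hypothetical, abbreviated contraction table for illustration only.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is"}

def expand_contractions(text):
    """Replace known English contractions with their expanded forms."""
    pattern = re.compile("|".join(re.escape(c) for c in CONTRACTIONS), re.IGNORECASE)
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)
```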
- textdirectory.transformations.transformation_ftfy(text)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_lemmatize(text, spacy_model='en_core_web_sm')[source]¶
- Parameters:
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
- Returns:
the transformed text
- Human_name:
Lemmatizer
- textdirectory.transformations.transformation_lowercase(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_postag(text, spacy_model='en_core_web_sm', *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
- Returns:
the transformed text
- Human_name:
Add pos-tags
- textdirectory.transformations.transformation_remove_htmltags(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
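Tag removal is commonly done with a naive regular expression that strips anything between angle brackets; an illustrative sketch (the library’s implementation may be more robust, e.g. for malformed markup):

```python
import re

def remove_htmltags(text):
    """Strip anything that looks like an HTML/XML tag (naive regex approach)."""
    return re.sub(r"<[^>]+>", "", text)
```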
- textdirectory.transformations.transformation_remove_nl(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_remove_non_alphanumerical(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_remove_non_ascii(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
- textdirectory.transformations.transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
stopwords_source (str) – [internal, file] where the stopwords are loaded from
stopwords (str) – filename of a list containing stopwords
spacy_model (str) – the spaCy model we want to use
custom_stopwords (str) – a comma-separated list of additional stopwords to consider
- Returns:
the transformed text
- textdirectory.transformations.transformation_remove_weird_tokens(text, spacy_model='en_core_web_sm', remove_double_space=False, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
remove_double_space (bool) – remove duplicated spaces
- Returns:
the transformed text
- textdirectory.transformations.transformation_replace_digits(text, replacement_character='%')[source]¶
- Parameters:
text (str) – the text to run the transformation on
replacement_character (str) – the character digits are replaced with
- Returns:
the transformed text
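Digit replacement maps naturally onto a single regex substitution; an illustrative sketch:

```python
import re

def replace_digits(text, replacement_character="%"):
    """Replace every digit with the replacement character."""
    return re.sub(r"\d", replacement_character, text)
```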
- textdirectory.transformations.transformation_to_leetspeak(text, *args)[source]¶
- Parameters:
text (str) – the text to run the transformation on
- Returns:
the transformed text
Module contents¶
Top-level package for textdirectory.