textdirectory package¶
Submodules¶
textdirectory.cli module¶
Console script for textdirectory.
textdirectory.crudespellchecker module¶
-
class textdirectory.crudespellchecker.CrudeSpellChecker(caching=True, language_model='crudesc_lm_en')[source]¶
Bases: object
A very simple and crude spellchecker based on Peter Norvig’s design.
Simple language models:
crudesc_lm_en.gz.lm – English, based on COCA (sample), OANC (written), and BNC
crudesc_lm_ame.lm – American English, based on COCA (sample) and OANC (written)
crudesc_lm_amehistorical.lm – American English, based on COHA (sample)
-
correct_string(string, return_corrections=False)[source]¶ - Parameters
string (str) – the string to correct.
return_corrections (bool) – include the corrections in the result
- Returns
the corrected string
-
correction(word)[source]¶ - Parameters
word (str) – a word
- Returns
most probable spelling correction for word
-
edit_distance_1(word)[source]¶ - Parameters
word (str) – a word
- Returns
all edits one edit away from the word
-
edit_distance_2(word)[source]¶ - Parameters
word (str) – a word
- Returns
all edits two edits away from the word
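The two edit-distance helpers above follow Peter Norvig's published spelling-corrector design. The sketch below is a standalone reimplementation of that design, not the library's own source; the names `edits1` and `edits2` and the lowercase ASCII alphabet are assumptions made for illustration.

```python
def edits1(word, letters="abcdefghijklmnopqrstuvwxyz"):
    """Return the set of strings one edit away from `word`."""
    # Every way of cutting the word into a left and a right part.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    transposes = [left + right[1] + right[0] + right[2:]
                  for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:] for left, right in splits if right for c in letters]
    inserts = [left + c + right for left, right in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    """Yield strings reachable by applying two single edits to `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))
```

In this design, a correction is chosen by intersecting these candidate sets with the words known to the language model and picking the most probable one.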
-
textdirectory.helpers module¶
Helpers module.
-
textdirectory.helpers.chunk_text(string, chunk_size=50000)[source]¶ - Parameters
string (str) – a string
chunk_size (int) – the max characters of one chunk
- Returns
a list of chunks
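As a rough illustration of character-based chunking, the following sketch slices a string into pieces of at most `chunk_size` characters. It is an assumption about the behaviour, not the library's implementation, which may split at different boundaries; `chunk_text_sketch` is a hypothetical name.

```python
def chunk_text_sketch(string, chunk_size=50000):
    """Split `string` into consecutive chunks of at most `chunk_size` characters."""
    return [string[i:i + chunk_size] for i in range(0, len(string), chunk_size)]


chunks = chunk_text_sketch("abcdefg", chunk_size=3)
```

Joining the chunks in order reproduces the original string, which is convenient when each chunk is processed separately and the results are concatenated.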
-
textdirectory.helpers.count_non_alphanum(string)[source]¶ - Parameters
string (str) – a string
- Returns
the number of non-alphanumeric characters in the string
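A minimal sketch of such a count, assuming that "non-alphanumeric" means any character for which Python's `str.isalnum()` is false (so whitespace and punctuation are counted). Whether the library treats whitespace this way is not stated here; `count_non_alphanum_sketch` is a hypothetical name.

```python
def count_non_alphanum_sketch(string):
    """Count characters that are neither letters nor digits."""
    return sum(1 for ch in string if not ch.isalnum())


n = count_non_alphanum_sketch("a b!")
```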
-
textdirectory.helpers.estimate_spacy_max_length(override=False, tokenizer_only=False)[source]¶ Returns a somewhat sensible suggestion for max_length.
-
textdirectory.helpers.get_available_filters(get_human_name=False)[source]¶ - Parameters
get_human_name (bool) – if True, also return the ‘human name’
- Returns
a list of functions; if get_human_name a list of tuples
-
textdirectory.helpers.get_available_transformations(get_human_name=False)[source]¶ - Parameters
get_human_name (bool) – if True, also return the ‘human name’
- Returns
a list of functions; if get_human_name a list of tuples
textdirectory.textdirectory module¶
Main module.
-
class textdirectory.textdirectory.TextDirectory(directory, encoding='utf8', autoload=False)[source]¶
Bases: object
-
aggregate_to_file(filename='aggregated.txt')[source]¶ - Parameters
filename (str) – the path/filename to write to
-
destage_transformation(transformation)[source]¶ - Parameters
transformation (list) – the transformation that should be de-staged and its parameters
-
filter_by_chars_outliers(sigmas=2)[source]¶ - Parameters
sigmas (int) – The number of stds that qualifies an outlier.
- Human_name
Character outliers
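The idea of a sigma-based outlier filter can be sketched as follows. This is an illustration under stated assumptions, not the library's code: it uses the population standard deviation and a plain dict mapping file names to character counts, and `drop_length_outliers` is a hypothetical name.

```python
from statistics import mean, pstdev


def drop_length_outliers(lengths, sigmas=2):
    """Keep entries whose length lies within `sigmas` standard deviations of the mean."""
    values = list(lengths.values())
    mu = mean(values)
    sd = pstdev(values)  # population standard deviation (an assumption)
    return {name: n for name, n in lengths.items() if abs(n - mu) <= sigmas * sd}


lengths = {"a.txt": 100, "b.txt": 110, "c.txt": 90, "d.txt": 105, "e.txt": 1000}
kept = drop_length_outliers(lengths, sigmas=1)
```

With only five files, a single extreme value inflates the standard deviation enough to stay inside two sigmas, so the example uses `sigmas=1` to remove `e.txt`.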
-
filter_by_contains(contains)[source]¶ - Parameters
contains (str) – A string that needs to be present in the file
- Human_name
Contains string
-
filter_by_filename_contains(contains)[source]¶ - Parameters
contains (str) – A string that needs to be present in the filename
- Human_name
Filename contains string
-
filter_by_max_chars(max_chars=100)[source]¶ - Parameters
max_chars (int) – the maximum number of characters a file can have
- Human_name
Maximum characters
-
filter_by_max_filesize(max_kb=100)[source]¶ - Parameters
max_kb (int) – The maximum number of kB a file is allowed to have.
- Human_name
Maximum filesize
-
filter_by_max_tokens(max_tokens=100)[source]¶ - Parameters
max_tokens (int) – the maximum number of tokens a file can have
- Human_name
Maximum tokens
-
filter_by_min_chars(min_chars=100)[source]¶ - Parameters
min_chars (int) – the minimum number of characters a file can have
- Human_name
Minimum characters
-
filter_by_min_filesize(min_kb=10)[source]¶ - Parameters
min_kb (int) – The minimum number of kB a file is allowed to have.
- Human_name
Minimum Filesize
-
filter_by_min_tokens(min_tokens=1)[source]¶ - Parameters
min_tokens (int) – the minimum number of tokens a file can have
- Human_name
Minimum tokens
-
filter_by_not_contains(not_contains)[source]¶ - Parameters
not_contains (str) – A string that is not allowed to be present in the file
- Human_name
Does not contain string
-
filter_by_random_sampling(n, replace=False)[source]¶ - Parameters
n (int) – the number of documents in the sample
replace (bool) – whether to sample with replacement
- Human_name
Random sampling
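The difference between sampling with and without replacement can be shown with the standard library. This sketch is an assumption about the semantics of `n` and `replace`, not the library's implementation; `sample_files` and its `seed` parameter are hypothetical.

```python
import random


def sample_files(files, n, replace=False, seed=None):
    """Draw `n` files at random, optionally with replacement."""
    rng = random.Random(seed)
    if replace:
        # The same file may be drawn more than once; n may exceed len(files).
        return rng.choices(files, k=n)
    # Each file is drawn at most once; n must not exceed len(files).
    return rng.sample(files, n)


files = ["a.txt", "b.txt", "c.txt", "d.txt"]
without = sample_files(files, 3, seed=1)
with_repl = sample_files(files, 6, replace=True, seed=1)
```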
-
filter_by_similar_documents(reference_file, threshold=0.8)[source]¶ - Parameters
reference_file (str) – Path to the reference file
threshold (float) – A value between 0.0 and 1.0 indicating the max. difference between the file and the reference.
- Human_name
Similar documents
-
get_file_length(path)[source]¶ - Parameters
path – path to a textfile
- Returns
the file’s length in characters
-
get_file_tokens(path)[source]¶ - Parameters
path – path to a textfile
- Returns
the file’s length in tokens
-
get_text(file_id)[source]¶ - Parameters
file_id – the file_id in files
- Returns
the (transformed) text of the given file
-
load_aggregation_state(state=0)[source]¶ - Parameters
state (int) – how many filter operations to go back
-
load_files(recursive=True, sort=True, filetype='txt')[source]¶ - Parameters
recursive (bool) – recursive search
sort (bool) – sort the files by name
filetype (str) – filetype to look for (e.g. txt)
-
run_filters(filters)[source]¶ - Parameters
filters (list) – A list of tuples with filters and their arguments.
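One way to picture a list of filter tuples is name-based dispatch. The sketch below uses a hypothetical stand-in class, `ToyDirectory`, and assumes each tuple has the shape `(filter_name, arg1, ...)`; the real tuple layout expected by `run_filters` is not specified here.

```python
class ToyDirectory:
    """Hypothetical stand-in for illustration; not the real TextDirectory."""

    def __init__(self, texts):
        self.texts = dict(texts)  # file name -> text

    def filter_by_min_chars(self, min_chars=100):
        self.texts = {k: v for k, v in self.texts.items() if len(v) >= min_chars}

    def filter_by_contains(self, contains):
        self.texts = {k: v for k, v in self.texts.items() if contains in v}

    def run_filters(self, filters):
        # Apply each filter in order; later filters see the reduced set.
        for name, *args in filters:
            getattr(self, name)(*args)


td = ToyDirectory({"a.txt": "hello world", "b.txt": "hi", "c.txt": "goodbye world"})
td.run_filters([("filter_by_min_chars", 5), ("filter_by_contains", "hello")])
```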
-
run_transformations(text)[source]¶ - Parameters
text (str) – the text to run staged transformations on
- Returns
the transformed text
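Running staged transformations amounts to applying a sequence of text-to-text functions in order. The following sketch assumes a staged transformation is a `(function, args)` pair; that layout and all names in the block are hypothetical and only illustrate the idea.

```python
def to_lower(text, *args):
    return text.lower()


def remove_newlines(text, *args):
    return text.replace("\n", " ")


def run_staged(text, staged):
    """Apply each staged transformation to the output of the previous one."""
    for func, args in staged:
        text = func(text, *args)
    return text


staged = [(to_lower, []), (remove_newlines, [])]
result = run_staged("Hello\nWorld", staged)
```

Because each step receives the previous step's output, the staging order matters for transformations that interact, such as lowercasing before stopword removal.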
-
textdirectory.transformations module¶
Transformation module.
-
textdirectory.transformations.transformation_crude_spellchecker(text, language_model='crudesc_lm_en', *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
-
textdirectory.transformations.transformation_expand_english_contractions(text)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
-
textdirectory.transformations.transformation_lemmatize(text, spacy_model='en_core_web_sm')[source]¶ - Parameters
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
- Returns
the transformed text
- Human_name
Lemmatizer
-
textdirectory.transformations.transformation_lowercase(text, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
-
textdirectory.transformations.transformation_postag(text, spacy_model='en_core_web_sm', *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
- Returns
the transformed text
- Human_name
Add pos-tags
-
textdirectory.transformations.transformation_remove_nl(text, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
-
textdirectory.transformations.transformation_remove_non_alphanumerical(text, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
-
textdirectory.transformations.transformation_remove_non_ascii(text, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
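A common way to strip non-ASCII characters in Python is an encode/decode round trip. This sketch shows that technique under the assumption that such characters are dropped rather than transliterated; it is not taken from the library's source, and `remove_non_ascii_sketch` is a hypothetical name.

```python
def remove_non_ascii_sketch(text, *args):
    """Drop every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")


cleaned = remove_non_ascii_sketch("café naïve")
```

Note that dropping characters changes words (here `café` becomes `caf`), which can affect later token-based filters.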
-
textdirectory.transformations.transformation_remove_stopwords(text, stopwords_source='internal', stopwords='en', spacy_model='en_core_web_sm', custom_stopwords=None, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
stopwords_source (str) – where the stopwords are loaded from; one of [internal, file]
stopwords (str) – filename of a list containing stopwords
spacy_model (str) – the spaCy model we want to use
custom_stopwords (str) – a list of additional stopwords to consider
- Returns
the transformed text
-
textdirectory.transformations.transformation_remove_weird_tokens(text, spacy_model='en_core_web_sm', remove_double_space=False, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
spacy_model (str) – the spaCy model we want to use
remove_double_space (bool) – remove duplicated spaces
- Returns
the transformed text
-
textdirectory.transformations.transformation_to_leetspeak(text, *args)[source]¶ - Parameters
text (str) – the text to run the transformation on
- Returns
the transformed text
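A character-substitution transformation such as leetspeak can be sketched with `str.translate`. The mapping below is purely hypothetical; the substitutions the library actually uses are not documented here.

```python
# Hypothetical substitution table, chosen only for illustration.
LEET = str.maketrans({"a": "4", "e": "3", "i": "1", "o": "0", "s": "5", "t": "7"})


def to_leetspeak_sketch(text, *args):
    """Replace selected lowercase letters with look-alike digits."""
    return text.translate(LEET)


leet = to_leetspeak_sketch("leet speak")
```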
Module contents¶
Top-level package for textdirectory.