History
0.1.0 (2018-04-26)
Initial release
First release on PyPI.
0.1.1 (2018-04-27)
added filter_by_chars_outliers
added transformation_remove_nl
0.1.2 (2018-04-29)
added transformation_postag
added transformation_usas_en_semtag
added transformation_uppercase
added filter_by_filename_contains
added parameter support for transformations
0.1.3 (2018-04-30)
filter_by_random_sampling now has a “replacement” option
changed from tabulate to an embedded function
added transformation_remove_non_ascii
added transformation_remove_non_alphanumerical
added filter_by_similar_documents
0.1.4 (2018-04-02)
fixed an object mutation problem in the tabulate function
0.2.0 (2018-05-13)
added transform_to_memory() function
added transformation_to_leetspeak() function
added transformation_crude_spellchecker
added filter_by_max_filesize
added filter_by_min_filesize
fixed a bug where load_files() would fail if there were no files
0.2.1 (2019-06-13)
added transformation_remove_stopwords
added transformation_remove_htmltags
fixed some minor bugs
0.2.2 (2019-06-13)
changed the data packaging
0.3.0 (2020-01-19)
added transformation_remove_weird_tokens
added transformation_lemmatizer
fixed some minor bugs
added a function to revert applied filters
added a function that prints the current pipeline
added a function that clears all transformations
added helper functions to list available filters and transformations
fixed a bug in which tabulate_flat_list_of_dicts would fail if the dictionary was empty
self.aggregation does not hold a copy of the files anymore but references to self.files
transformations relying on spaCy now estimate max_length based on available memory
TextDirectory objects are now iterable
0.3.1 (2020-01-20)
added long_description_content_type to setup.py
0.3.2 (2021-01-10)
added transformation_expand_english_contractions
fixed some minor bugs
added __str__ method to TextDirectory
added filename to __str__ output
added get_text method
0.3.3 (2022-09-25)
added filter_by_filenames
added filter_by_filename_not_contains
added transform_to_files
added transformation_eebop4_to_plaintext
added transformation_replace_digits
added transformation_ftfy
added fast and skip_checkpoint options to load_files
added __repr__ method to TextDirectory
added examples
upgraded to spaCy 3
improved the test suite
fixed some minor bugs