Download Trained Models¶
Allows models to be downloaded from the command line with
python -m adeft.download
Use
python -m adeft.download --update
to update existing models if models have changed on S3
-
adeft.download.download.
download_models
(models=None)[source]¶ Download models from S3
Models are downloaded and placed into a models directory in the users home directory. Each model contains a serialized AdeftClassifier, a dictionary mapping shortforms to dictionaries mapping longform texts to groundings, and a list of canonical names for each grounding. Within the models directory, models are stored in subdirectories named after the shortform they disambiguate with escape characters used to handle characters that cannot be used in filenames and to distinguish upper and lower case for compatibility with case insensitive file systems.
Parameters: models (Optional[iterable of str]) – List of models to be downloaded. Allows user to select specific models to download. If this option is set, update will be treated as True regardless of how it was set. These should be considered as mutually exclusive parameters.
-
adeft.download.download.
download_test_resources
()[source]¶ Download files necessary to run tests
Downloads a test disambiguator and a set of example training data and places them in the test_resources folder of the .adeft directory. This function will error if the necessary directories do not exist. If they do not already exist they will be created when running python -m adeft.download
-
adeft.download.download.
get_available_models
(path='/home/docs/.local/share/adeft/0.11.2/models')[source]¶ Returns set of all models currently in models folder
-
adeft.download.download.
get_s3_models
()[source]¶ Returns set of all models currently available on s3
-
adeft.download.download.
setup_models_folder
()[source]¶ Create models folder if it does not exist and download models
Disambiguate Shortforms¶
Implements classes to disambiguate shortforms given text context.
-
class
adeft.disambiguate.
AdeftDisambiguator
(classifier, grounding_dict, names)[source]¶ Bases:
object
Disambiguates a particular shortform in a list of texts
Parameters: - classifier (py:class:adeft.modeling.classify.AdeftClassifier) – Machine learning model for disambiguating shortforms based upon context
- grounding_dict (dict) – Dictionary mapping shortforms to grounding_map dictionaries mapping longforms to groundings
- names (dict) – Dictionary mapping groundings to canonical names
-
shortforms
¶ Shortforms to disambiguate
Type: list of str
-
recognizers
¶ A list of recognizers, one for each shortform, to disambiguate by searching for a defining pattern.
Type: list of py:class:adeft.recognize.AdeftRecognizer
-
pos_labels
¶ List of labels of interest. Only these are considered when calculating the micro averaged f1 score for a classifier.
Type: list of str
-
disambiguate
(texts)[source]¶ Return disambiguations for a list of texts
First checks for defining patterns (DP) within a text. If there is an unambiguous match to a longform with a defining pattern, considers this to be the correct disambiguation with confidence 1.0. If no defining pattern is found, uses a logistic regression model to predict the correct disambiguation. If there were multiple longforms with different groundings found with a defining pattern, disambiguates to the grounding among these with highest predicted probability. If no defining pattern was found, disambiguates to the grounding with highest predicted probability.
Parameters: texts (str or list of str) – fulltext or list of fulltexts in which to disambiguate shortform Returns: result – Disambiguations for text. For each text the corresponding disambiguation is a tuple of three elements. A grounding, a canonical name associated with the grounding, and a dictionary containing predicted probabilities for each possible grounding Return type: tuple or list of tuple
-
info
()[source]¶ Get information about disambiguator and its performance.
Displays disambiguations model is able to produce. Shows class balance of disambiguation labels in the models training data and crossvalidated F1 score, precision, and recall on training data. Classification metrics for multi-label data are calculated by taking the micro-average over the positive labels. This means the metrics are calculated globally by counting the total true positives, false negatives, and false positives. Positive labels are starred in in the displayed output. F1, Precision, and Recall are also shown for for each label separately. Classification metrics may not be available depending upon how the model was trained.
Returns: A string representing the information about the disambigutor. Return type: str
-
modify_groundings
(new_groundings=None, new_names=None)[source]¶ Update groundings and standardized names
Modify groundings and standard names for the disambiguator without retraining. Cannot map two existing groundings to a single new grounding, as this leads to a nontrivial change in the model rather than just a relabeling.
Parameters: - new_groundings (Optional[dict]) – Dictionary mapping a subset of previous groundings to updated groundings. If None, no groundings are modified. Default: None
- new_names (Optional[dict]) – Dictionary mapping a subset of previous groundings to updated names. If None, no names are modified. Default: None
-
update_pos_labels
(pos_labels)[source]¶ Update which labels are considered pos_labels
Micro-averaged precision, recall, and f1 scores are also updated.
Warning: If this method is called on a disambiguator trained with a a version prior to 0.10.0, global precision, recall, and f1 will be set to NaN. Older disambiguators must be retrained to update positive labels and recompute model statistics.
Parameters: pos_labels (list) – list of strs. Should be a subset of the labels produced by the underlying classifier. Check the labels attribute of the AdeftDisambiguator to see which labels are produced.
-
adeft.disambiguate.
load_disambiguator
(shortform, path='/home/docs/.local/share/adeft/0.11.2/models')[source]¶ Returns adeft disambiguator loaded from models directory
Searches folder specified by path for a disambiguation model that can disambiguate the given shortform and returns this model
Parameters: Returns: A disambiguator that was loaded from a file. Returns None if there are no disambiguation models in the supplied folder that can disambiguate the given shortform
Return type: py:class:adeft.disambiguate.AdeftDisambiguator
-
adeft.disambiguate.
load_disambiguator_directly
(path)[source]¶ Returns disambiguator located at path
Parameters: path (str) – Path to a disambiguation model. Must be a path to a directory <model_name> containing the files <model_name>_model.gz, <model_name>_grounding_dict.json, <model_name>_names.json Returns: A disambiguation model loaded from folder specified by path Return type: py:class:adeft.disambiguate.AdeftDisambiguator
Discover Longforms¶
Recognize Defining Patterns¶
Implements the disambiguation of shortforms based on recognizing an explicit defining pattern in text.
-
class
adeft.recognize.
AdeftRecognizer
(shortform, grounding_map, window=100)[source]¶ Bases:
adeft.recognize.BaseRecognizer
Class for recognizing longforms by searching for defining patterns (DP)
Searches text for the pattern “<longform> (<shortform>)” for a collection of grounded longforms supplied by the user.
Parameters: - shortform (str) – shortform to be recognized
- grounding_map (dict[str, str]) – Dictionary mapping longform texts to their groundings
- window (Optional[int]) – Specifies range of characters before a defining pattern (DP) to consider when finding longforms. Should be set to the same value that was used in the AdeftMiner that was used to find longforms. Default: 100
-
_trie
¶ Trie used to search for longforms. Edges correspond to stemmed tokens from longforms. They appear in reverse order to the bottom of the trie with terminal nodes containing the associated longform in their data.
Type: adeft.recognize._TrieNode
-
class
adeft.recognize.
BaseRecognizer
(shortform, window=100)[source]¶ Bases:
object
Base class for recognizers
Recognizers are built to identify longform expansions for a shortform by searching for defining patterns (DPs).
Parameters: -
recognize
(text)[source]¶ Find longforms in text by searching for defining patterns (DPs)
Parameters: text (str) – Sentence where we seek to disambiguate shortform Returns: expansions – Set of longforms corresponding to shortform in sentence if a defining pattern is matched. Returns None if no defining patterns are found Return type: set of str
-
strip_defining_patterns
(text)[source]¶ Return text with defining patterns stripped
This is useful for training machine learning models where training labels are generated by finding defining patterns (DP)s. Models must be trained to disambiguate texts that do not contain a defining pattern.
The output on the first sentence of the previous paragraph is “This is useful for training machine learning models where training labels are generated by finding DPs.”
Parameters: text (str) – Text to remove defining patterns from Returns: stripped_text – Text with defining patterns replaced with shortform Return type: str
-
-
class
adeft.recognize.
OneShotRecognizer
(shortform, window=100, **params)[source]¶ Bases:
adeft.recognize.BaseRecognizer
Identify longform expansions using subsequence matching
Uses a string matching algorithm to determine longform boundaries for a defining pattern for only a single text.
-
window
¶ Specifies range of characters before a defining pattern (DP) to consider when finding longforms. Should be set to the same value that was used in the AdeftMiner that was used to find longforms. Default: 100
Type: Optional[int]
-
**params
Parameters for :py:class`adeft.score.AdeftLongformScorer`
-
Learn Models¶
Implements classes needed to label text corpora, and learn a logistic regression model to disambiguate across a set of longforms for a given shortform.
Label Corpora¶
-
class
adeft.modeling.label.
AdeftLabeler
(grounding_dict)[source]¶ Bases:
object
Class for labeling corpora
Parameters: grounding_dict (dict of dict of str) – Dictionary mapping shortforms to grounding_map dictionaries mapping longforms to groundings -
recognizers
¶ List of recognizers for each shortform to be considered. Each recognizer identifies longforms for a shortform by finding defining matches to a defining pattern (DP)
Type: list of py:class`adeft.recognize.AdeftRecognizer`
-
build_from_texts
(text_tuples)[source]¶ Build labeled corpus from a list of texts
Labels texts based on defining patterns (DPs)
Parameters: text_tuples (list of tuple) – List of two element tuples whose first elements are texts from which we seek to build a corpus and whose second elements are identifiers associated with the texts. Each text should have a unique identifier associated to it. Returns: corpus – Contains a tuple for each text in the input list which contains a defining pattern. Multiple tuples correspond to texts with multiple defining patterns for longforms with different groundings. The first element of each tuple contains a training text with all defining patterns replaced with only the shortform. The second element contains a grounding label for the desired shortform within the training text that was identified through a defining pattern. The third element contains the identifier for the given training text. Return type: list
-
Classify¶
-
class
adeft.modeling.classify.
AdeftClassifier
(shortforms, pos_labels, random_state=None)[source]¶ Bases:
object
Trains classifiers to disambiguate shortforms based on context
Fits logistic regression models with tfidf vectorized ngram features. Uses sklearns LogisticRegression and TfidfVectorizer classes. Models can be serialized and loaded for later use.
Parameters: - shortforms (str or list of str) – Shortform to disambiguate or list of shortforms to build models for multiple synomous shortforms.
- pos_labels (list of str) – Labels for positive classes. These correspond to the longforms of interest in an application. For adeft pretrained models these are typically genes and other relevant biological terms.
- random_state (Optional[int]) – Optional specification of seed used when calculating crossvalidation folds and fitting the logistic regression model. Default: None
-
estimator
¶ An sklearn pipeline that transforms text data with a TfidfVectorizer and fits a logistic regression.
Type: py:class:sklearn.pipeline.Pipeline
-
stats
¶ Statistics describing model performance. Only available after model is fit with crossvalidation
Type: dict
-
stop
¶ List of stopwords to exclude when performing tfidf vectorization. These consist of the set of stopwords in adeft.nlp.english_stopwords along with the shortform(s) for which the model is being built
Type: list of str
-
params
¶ Dictionary mapping parameters to their values. If fit with cv, this contains the parameters with best micro averaged f1 score over crossvalidation runs.
Type: dict
-
best_score
¶ Best micro averaged f1 score for positive labels over crossvalidation runs. This information can also be found in the stats dict and is not included when models are serialized. Only available if model is fit with the cv method.
Type: float
-
grid_search
¶ sklearn gridsearch object if model was fit with cv. This is not included when model is serialized.
Type: py:class:sklearn.model_selection.GridSearchCV
-
confusion_info
¶ Contains the confusion matrix for each pair of labels per crossvalidation split. Only available if the model has been fit with crossvalidation. Nested dictionary, confusion_info[label1][label2][i] gives the number of test examples where the true label is label1 and the classifier has made prediction label2 in split i.
Type: dict
-
other_metadata
¶ Data set here by the user will be included when the model is serialized and remain available when the classifier is loaded again.
Type: dict
-
training_set_digest
¶ Digest of training set calculated using md5 hash. Can be used at a glance to determine if two models used the same training set.
Type: str
-
_std
¶ Array of standard deviations of feature values over training set. This is used to calculate feature importance
Type: py:class:numpy.ndarray
-
cv
(texts, y, param_grid, n_jobs=1, cv=5)[source]¶ Performs grid search to select and fit a disambiguation model
Parameters: - texts (iterable of str) – Training texts
- y (iterable of str) – True labels for the training texts
- param_grid (Optional[dict]) – Grid search parameters. Can contain all parameters from the train method.
- n_jobs (Optional[int]) – Number of jobs to use when performing grid_search Default: 1
- cv (Optional[int]) – Number of folds to use in crossvalidation. Default: 5
Example
>>> params = {'C': [1.0, 10.0, 100.0], ... 'max_features': [3000, 6000, 9000], ... 'ngram_range': [(1, 1), (1, 2), (1, 3)]} >>> classifier = LongformClassifier('IR', ['insulin receptor']) >>> classifier.train(texts, labels, param_grid=params, n_jobs=4)
-
dump_model
(filepath)[source]¶ Serialize model to gzipped json
Parameters: filepath (str) – Path to output file
-
feature_importances
()[source]¶ Return feature importance scores for each label
The feature importance scores are given by multiplying the coefficients of the logistic regression model by the standard deviations of the tf-idf scores for the associated features over all texts. Note that there is a coefficient associated to each label feature pair.
One can interpret the feature importance score as the change in the linear predictor for a given label associated to a one standard deviation change in a feature’s value. The predicted probability being given by the composition of the logit link function and the linear predictor.
Returns: Dictionary with class labels as keys. The associated values are lists of two element tuples each with first element an ngram feature and second element a feature importance score Return type: dict
-
get_model_info
()[source]¶ Return a JSON object representing a model for portability.
Returns: A JSON object representing the attributes of the classifier needed to make it portable/serializable and enabling its reload. Return type: dict
-
train
(texts, y, C=1.0, ngram_range=(1, 2), max_features=1000, class_weight=None)[source]¶ Fits a disambiguation model
Parameters: - texts (iterable of str) – Training texts
- y (iterable of str) – True labels for training texts
- C (Optional[float]) – L1 regularization parameter logistic regression model. Follows convention of support vector machines with smaller values corresponding to stronger regularization. Default: 1.0
- ngram_range (Optional[tuple of int]) – Range of ngram features to use. Must be a tuple of ints of the form (a, b) with a <= b. When ngram_range is (1, 2), unigrams and bigrams will be used as features. Default: (1, 2)
- max_features (int) – Maximum number of tfidf-vectorized ngrams to use as features in model. Selects top_features by term frequency Default: 1000
- class_weight (Optional[dict or 'balanced']) –
Weights associated with classes in the form {class_label: weight}. If not given, all classes are supposed to have weight one.
The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y)).
Note that these weights will be multiplied with sample_weight (passed through the fit method) if sample_weight is specified.
-
adeft.modeling.classify.
load_model
(filepath)[source]¶ Load previously serialized model
Parameters: filepath (str) – path to model file Returns: longform_model – The classifier that was loaded from the given path. Return type: py:class:adeft.classify.AdeftClassifier
-
adeft.modeling.classify.
load_model_info
(model_info)[source]¶ Return a longform model from a model info JSON object.
Parameters: model_info (dict) – The JSON object containing the attributes of a model. Returns: longform_model – The classifier that was loaded from the given JSON object. Return type: py:class:adeft.classify.AdeftClassifier
NLP¶
Implements a set of natural language processing tools used to pre-process text used for finding candidate longforms, recognizing defining patterns, and learning classification models.
stopwords_min contains a small collection of stopwords for use in the
alignment based AdeftLongformScorer
. english_stopwords contains
a larger collection of stopwords for use in classification and anomaly
detection models.
Util¶
Utility functions used by Adeft internally.
-
adeft.util.
get_candidate
(fragment)[source]¶ Return tokens in candidate fragment up until last excluded word
Parameters:
-
adeft.util.
get_candidate_fragments
(text, shortform, window=100)[source]¶ Return candidate longform fragments from text
Gets fragments of text preceding defining patterns (DPs) to search for candidate longforms. Each fragment contains either a specified range of characters before a DP, or characters up until either the start of the sentence or the end of a previous DP.
Parameters: - text (str) – Text to search for defining patterns (DP)
- shortform (str) – Shortform to disambiguate
- window (Optional[int]) – Specifies range of characters before a defining pattern (DP) to consider when finding longforms. If set to 30, candidate longforms would be taken from the string “ters before a defining pattern”. Default: 100