Metrics

accuracy

Class for Metric Accuracy

class mindnlp.engine.metrics.accuracy.Accuracy(name='Accuracy')[source]

Bases: Metric

Calculates accuracy. The function is shown as follows:

\[\text{ACC} =\frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}\]

where ACC is accuracy, TP is the number of true positive cases, TN is the number of true negative cases, FP is the number of false positive cases, and FN is the number of false negative cases.
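The formula can be checked by hand with a short NumPy sketch. It assumes the predicted class of each row is the index of its highest score; the helper name `accuracy_from_scores` is hypothetical and this is not MindNLP's implementation.

```python
import numpy as np

def accuracy_from_scores(preds, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    predicted = preds.argmax(axis=1)  # assumed decision rule: arg-max per row
    return float((predicted == labels).mean())

preds = [[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]
labels = [1, 0, 1]
acc = accuracy_from_scores(preds, labels)
print(acc)
```

Two of the three rows are classified correctly here, so the sketch gives 2/3.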

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import nn, Tensor
>>> from mindnlp.engine.metrics import Accuracy
>>> preds = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]), mindspore.float32)
>>> labels = Tensor(np.array([1, 0, 1]), mindspore.int32)
>>> metric = Accuracy()
>>> metric.update(preds, labels)
>>> acc = metric.eval()
>>> print(acc)
0.6666666666666666

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the accuracy.

Returns

acc (float) - The computed result.

Raises

RuntimeError – If the number of samples is 0.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, numpy.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, numpy.ndarray]): Ground truth values. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.
ValueError – If the number of classes in the current predictions does not match that of the previous predictions.

bleu

Class for Metric BleuScore

class mindnlp.engine.metrics.bleu.BleuScore(n_size=4, weights=None, name='BleuScore')[source]

Bases: Metric

Calculates the BLEU score. BLEU (bilingual evaluation understudy) is a metric for evaluating the quality of text translated by machine. It uses a modified form of precision to compare a candidate translation against multiple reference translations. The function is shown as follows:

\[ \begin{align}\begin{aligned}\begin{split}BP & = \begin{cases} 1, & \text{if }c>r \\ e^{1-r/c}, & \text{if }c\leq r \end{cases}\end{split}\\BLEU & = BP\exp(\sum_{n=1}^N w_{n} \log{p_{n}})\end{aligned}\end{align} \]

where c is the length of the candidate sentence, r is the length of the reference sentence, \(p_{n}\) is the modified n-gram precision, and \(w_{n}\) is the weight assigned to it.
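The brevity penalty and the weighted log-precision sum can be sketched in plain Python for a single candidate. The sketch assumes uniform weights by default, clips each candidate n-gram count by its maximum count in any one reference, and takes r as the reference length closest to c; `bleu_sketch` and `ngrams` are hypothetical helpers, not MindNLP functions.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_sketch(cand, refs, n_size=4, weights=None):
    weights = weights or [1.0 / n_size] * n_size
    log_sum = 0.0
    for n, w in zip(range(1, n_size + 1), weights):
        cand_counts = Counter(ngrams(cand, n))
        # Highest count of each n-gram in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, cnt in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], cnt)
        clipped = sum(min(cnt, max_ref[gram]) for gram, cnt in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0  # assumed convention: no smoothing
        log_sum += w * math.log(clipped / total)
    c = len(cand)
    # Assumption: r is the reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(log_sum)

cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
refs = [["The", "cat", "is", "on", "the", "mat"],
        ["There", "is", "a", "cat", "on", "the", "mat"]]
score = bleu_sketch(cand, refs)
print(score)
```

For this candidate the clipped precisions are 5/7, 4/6, 2/5 and 1/4 and the brevity penalty is 1, so the score is their geometric mean, roughly 0.4671.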

Parameters

n_size (int) – N-gram size. Must be between 1 and 4. Default: 4.
weights (Union[list, None]) – Weights of the precision for each n-gram order. Default: None.
name (str) – Name of the metric.

Raises

ValueError – If n_size is not between 1 and 4.
ValueError – If the length of weights is not equal to n_size.

Example

>>> from mindnlp.engine.metrics import BleuScore
>>> cand = [["The", "cat", "The", "cat", "on", "the", "mat"]]
>>> ref_list = [[["The", "cat", "is", "on", "the", "mat"],
                ["There", "is", "a", "cat", "on", "the", "mat"]]]
>>> metric = BleuScore()
>>> metric.update(cand, ref_list)
>>> bleu_score = metric.eval()
>>> print(bleu_score)
0.46713797772820015

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the BLEU score.

Returns

bleu_score (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input cand and ref_list.

cand (list): A list of tokenized candidate sentences.
ref_list (list): A list of lists of tokenized ground truth sentences.

Raises

ValueError – If the number of inputs is not 2.
ValueError – If the lengths of cand and ref_list are not equal.

confusion_matrix

Class for Metric ConfusionMatrix

class mindnlp.engine.metrics.confusion_matrix.ConfusionMatrix(class_num=2, name='ConfusionMatrix')[source]

Bases: Metric

Calculates the confusion matrix. A confusion matrix is commonly used to evaluate the performance of classification models, for both binary and multi-class classification.
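As a rough illustration, a confusion matrix can be built by counting (true class, predicted class) pairs. The sketch assumes rows index the true class and columns the predicted class, which is a common convention but not confirmed for this library; `confusion_matrix_sketch` is a hypothetical name.

```python
import numpy as np

def confusion_matrix_sketch(preds, labels, class_num=2):
    """Count how often each true class is predicted as each class."""
    mat = np.zeros((class_num, class_num))
    for predicted, true in zip(preds, labels):
        mat[true, predicted] += 1  # assumed layout: row = true, column = predicted
    return mat

conf_mat = confusion_matrix_sketch([1, 0, 1, 0], [1, 0, 0, 1])
print(conf_mat)
```

With these inputs each of the four cells is hit exactly once, so the layout assumption does not affect the result.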

Parameters

class_num (int) – Number of classes in the dataset. Default: 2.
name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import ConfusionMatrix
>>> preds = Tensor(np.array([1, 0, 1, 0]))
>>> labels = Tensor(np.array([1, 0, 0, 1]))
>>> metric = ConfusionMatrix()
>>> metric.update(preds, labels)
>>> conf_mat = metric.eval()
>>> print(conf_mat)
[[1. 1.]
 [1. 1.]]

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the Confusion Matrix.

Returns

conf_mat (np.ndarray) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted value. preds is a list of floating numbers and the shape of preds is \((N, C)\) or \((N,)\).
labels (Union[Tensor, list, np.ndarray]): Ground truth. The shape of labels is \((N,)\).

Raises

ValueError – If the number of inputs is not 2.
ValueError – If preds and labels do not have valid dimensions.

distinct

Class for Metric Distinct

class mindnlp.engine.metrics.distinct.Distinct(n_size=2, name='Distinct')[source]

Bases: Metric

Calculates Distinct-N, a metric that measures the diversity of a sentence based on its distinct n-grams. The score is the number of distinct n-grams divided by the total number of n-grams, so a larger share of distinct n-grams indicates more diverse text.
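A minimal sketch of the computation, assuming the score is the count of unique n-grams over the count of all n-grams in one tokenized sentence. The function name `distinct_n` is hypothetical.

```python
def distinct_n(tokens, n_size=2):
    """Ratio of unique n-grams to all n-grams in a token list."""
    grams = [tuple(tokens[i:i + n_size]) for i in range(len(tokens) - n_size + 1)]
    if not grams:
        return 0.0  # assumed convention for inputs shorter than n_size
    return len(set(grams)) / len(grams)

tokens = ["The", "cat", "The", "cat", "on", "the", "mat"]
distinct_score = distinct_n(tokens)
print(distinct_score)
```

The sentence has six bigrams, of which ("The", "cat") occurs twice, so five are distinct and the score is 5/6.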

Parameters

n_size (int) – N-gram size. Default: 2.
name (str) – Name of the metric.

Example

>>> from mindnlp.engine.metrics import Distinct
>>> cand_list = ["The", "cat", "The", "cat", "on", "the", "mat"]
>>> metric = Distinct()
>>> metric.update(cand_list)
>>> distinct_score = metric.eval()
>>> print(distinct_score)
0.8333333333333334

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the Distinct-N.

Returns

distinct_score (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input cand_list.

cand_list (list): A tokenized candidate sentence, given as a list of tokens.

Raises

ValueError – If the number of inputs is not 1.

em_score

Class for Metric EmScore

class mindnlp.engine.metrics.em_score.EmScore(name='EmScore')[source]

Bases: Metric

Calculates the exact match (EM) score. This metric measures the percentage of predictions that match any one of the ground truth answers exactly.
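The idea can be sketched for a single prediction as follows. The normalization step (lowercasing and collapsing whitespace) is an assumption made for illustration; the library may normalize differently, and `exact_match` is a hypothetical helper.

```python
def exact_match(pred, references):
    """Return 1.0 if pred equals any reference after light normalization, else 0.0."""
    def normalize(text):
        # Assumed normalization: lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())
    return float(any(normalize(pred) == normalize(ref) for ref in references))

em = exact_match("this is the best span",
                 ["this is a good span", "something irrelevant"])
print(em)
```

Averaging this value over many predictions would give the percentage described above.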

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import EmScore
>>> preds = "this is the best span"
>>> examples = ["this is a good span", "something irrelevant"]
>>> metric = EmScore()
>>> metric.update(preds, examples)
>>> em_score = metric.eval()
>>> print(em_score)
0.0

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the EM score.

Returns: exact_match (float) – The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and examples.

preds (Union[str, list]): Predicted value.
examples (list): Ground truth.

Raises

ValueError – If the number of inputs is not 2.
RuntimeError – If preds and examples have different lengths.

f1

Class for Metric F1Score

class mindnlp.engine.metrics.f1.F1Score(name='F1Score')[source]

Bases: Metric

Calculates the F1 score. Fbeta score is a weighted mean of precision and recall, and F1 score is a special case of Fbeta when beta is 1. The function is shown as follows:

\[F_1=\frac{2\cdot TP}{2\cdot TP + FN + FP}\]

where TP is the number of true positive cases, FN is the number of false negative cases, and FP is the number of false positive cases.
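The formula can be applied per class with a short NumPy sketch. It assumes arg-max predictions and treats each class in turn as the positive class; `per_class_f1` is a hypothetical name and the zero-denominator handling is an assumption.

```python
import numpy as np

def per_class_f1(preds, labels):
    """F1 for each class, treating that class as 'positive'."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    predicted = preds.argmax(axis=1)
    scores = []
    for c in range(preds.shape[1]):
        tp = np.sum((predicted == c) & (labels == c))
        fp = np.sum((predicted == c) & (labels != c))
        fn = np.sum((predicted != c) & (labels == c))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # assumed: 0 when undefined
    return np.array(scores)

f1_scores = per_class_f1([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]], [1, 0, 1])
print(f1_scores)
```

Here class 0 has TP=1, FP=1, FN=0 and class 1 has TP=1, FP=0, FN=1, so both scores are 2/3.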

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import F1Score
>>> preds = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
>>> labels = Tensor(np.array([1, 0, 1]))
>>> metric = F1Score()
>>> metric.update(preds, labels)
>>> f1_s = metric.eval()
>>> print(f1_s)
[0.6666666666666666 0.6666666666666666]

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the F1 score.

Returns

f1_s (numpy.ndarray) - The computed result.

Raises

RuntimeError – If the number of samples is 0.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, np.ndarray]): Ground truth. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.
ValueError – If the number of classes in the current predictions does not match that of the previous predictions.
ValueError – If preds doesn’t have the same classes number as labels.

matthews

Class for Metric MatthewsCorrelation

class mindnlp.engine.metrics.matthews.MatthewsCorrelation(name='MatthewsCorrelation')[source]

Bases: Metric

Calculates the Matthews correlation coefficient (MCC). MCC is in essence a correlation coefficient between the observed and predicted binary classifications; it returns a value between −1 and +1. A coefficient of +1 represents a perfect prediction, 0 no better than random prediction and −1 indicates total disagreement between prediction and observation. The function is shown as follows:

\[MCC=\frac{TP \times TN-FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}\]

where TP is the number of true positive cases, TN is the number of true negative cases, FN is the number of false negative cases, and FP is the number of false positive cases.
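A plain-Python sketch of the formula for the binary case, assuming class 1 is the positive class and predictions are taken as the arg-max of each row. Returning 0.0 when the denominator is zero is an assumed convention, and `mcc_sketch` is a hypothetical name.

```python
import math

def mcc_sketch(preds, labels):
    """Matthews correlation coefficient for binary arg-max predictions."""
    predicted = [max(range(len(row)), key=row.__getitem__) for row in preds]
    pairs = list(zip(predicted, labels))
    tp = sum(p == 1 and t == 1 for p, t in pairs)
    tn = sum(p == 0 and t == 0 for p, t in pairs)
    fp = sum(p == 1 and t == 0 for p, t in pairs)
    fn = sum(p == 0 and t == 1 for p, t in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

preds = [[0.8, 0.2], [-0.5, 0.5], [0.1, 0.4], [0.6, 0.3], [0.6, 0.3]]
labels = [0, 1, 0, 1, 0]
mcc = mcc_sketch(preds, labels)
print(mcc)
```

These inputs give TP=1, TN=2, FP=1 and FN=1, so the result is (2 − 1) / √36 = 1/6.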

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import MatthewsCorrelation
>>> preds = [[0.8, 0.2], [-0.5, 0.5], [0.1, 0.4], [0.6, 0.3], [0.6, 0.3]]
>>> labels = [0, 1, 0, 1, 0]
>>> metric = MatthewsCorrelation()
>>> metric.update(preds, labels)
>>> m_c_c = metric.eval()
>>> print(m_c_c)
0.16666666666666666

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the MCC.

Returns

m_c_c (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, numpy.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, numpy.ndarray]): Ground truth values. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.

pearson

Class for Metric PearsonCorrelation

class mindnlp.engine.metrics.pearson.PearsonCorrelation(name='PearsonCorrelation')[source]

Bases: Metric

Calculates the Pearson correlation coefficient (PCC). PCC is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus, it is essentially a normalized measurement of the covariance, such that the result always has a value between −1 and 1.
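The covariance-over-standard-deviations ratio can be sketched with NumPy as below. The helper `pearson_sketch` is hypothetical; the normalizing factors of the covariance and the standard deviations cancel, so plain sums of centered values are enough.

```python
import numpy as np

def pearson_sketch(x, y):
    """Pearson correlation of two equally long sequences."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    xc = x - x.mean()  # centered values
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

x = [[0.1], [1.0], [2.4], [0.9]]
y = [[0.0], [1.0], [2.9], [1.0]]
pcc = pearson_sketch(x, y)
print(pcc)
```

The result agrees with the off-diagonal entry of `np.corrcoef` on the flattened inputs.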

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import PearsonCorrelation
>>> preds = Tensor(np.array([[0.1], [1.0], [2.4], [0.9]]), mindspore.float32)
>>> labels = Tensor(np.array([[0.0], [1.0], [2.9], [1.0]]), mindspore.float32)
>>> metric = PearsonCorrelation()
>>> metric.update(preds, labels)
>>> p_c_c = metric.eval()
>>> print(p_c_c)
0.9985229081857804

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the PCC.

Returns

p_c_c (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted value. preds is a list of floating numbers and the shape of preds is \((N, 1)\).
labels (Union[Tensor, list, np.ndarray]): Ground truth. labels is a list of floating numbers and the shape of labels is \((N, 1)\).

Raises

ValueError – If the number of inputs is not 2.
RuntimeError – If preds and labels have different lengths.

perplexity

Class for Metric Perplexity

class mindnlp.engine.metrics.perplexity.Perplexity(ignore_label=None, name='Perplexity')[source]

Bases: Metric

Calculates the perplexity. Perplexity is a measure of how well a probability model predicts a sample. A low perplexity indicates the model is good at predicting the sample. The function is shown as follows:

\[PP(W)=P(w_{1}w_{2}...w_{N})^{-\frac{1}{N}}=\sqrt[N]{\frac{1}{P(w_{1}w_{2}...w_{N})}}\]

where \(w_{1}w_{2}...w_{N}\) is the sequence of \(N\) words being evaluated.
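In practice the N-th root of a product is usually computed in log space. The sketch below assumes each row of preds holds class probabilities and that the probability of the true label in each row is an independent factor of the product; `perplexity_sketch` is a hypothetical name.

```python
import numpy as np

def perplexity_sketch(preds, labels):
    """exp of the mean negative log-probability assigned to the true labels."""
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels)
    true_probs = preds[np.arange(len(labels)), labels]  # probability of each true label
    return float(np.exp(-np.log(true_probs).mean()))

ppl = perplexity_sketch([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]], [1, 0, 1])
print(ppl)
```

The selected probabilities are 0.5, 0.3 and 0.6, so the result equals (0.5 · 0.3 · 0.6)^(−1/3), roughly 2.2314.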

Parameters

ignore_label (Union[int, None]) – Index of an invalid label to be ignored when counting. If set to None, it means there’s no invalid label. Default: None.
name (str) – Name of the metric.

Examples

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import Perplexity
>>> preds = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]))
>>> labels = Tensor(np.array([1, 0, 1]))
>>> metric = Perplexity()
>>> metric.update(preds, labels)
>>> ppl = metric.eval()
>>> print(ppl)
2.231443166940565

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the perplexity.

Returns

ppl (float) - The computed result.

Raises

RuntimeError – If the sample size is 0.

get_metric_name()[source]: Return the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, np.ndarray]): Ground truth. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.
RuntimeError – If preds and labels have different lengths.
RuntimeError – If pred and label have different shapes.

precision

Class for Metric Precision

class mindnlp.engine.metrics.precision.Precision(name='Precision')[source]

Bases: Metric

Calculates precision. Precision (also known as positive predictive value) is the proportion of predicted positive samples that are actually positive. It can only be used to evaluate the precision score of binary tasks. The function is shown as follows:

\[\text{Precision} =\frac{\text{TP}} {\text{TP} + \text{FP}}\]

where TP is the number of true positive cases and FP is the number of false positive cases.
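A per-class version of the formula can be sketched with NumPy, assuming arg-max predictions and treating each class in turn as the positive class. The name `per_class_precision` and the value 0.0 for classes that are never predicted are assumptions.

```python
import numpy as np

def per_class_precision(preds, labels):
    """TP / (TP + FP) for each class, using arg-max predictions."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    predicted = preds.argmax(axis=1)
    scores = []
    for c in range(preds.shape[1]):
        tp = np.sum((predicted == c) & (labels == c))
        fp = np.sum((predicted == c) & (labels != c))
        scores.append(tp / (tp + fp) if (tp + fp) else 0.0)
    return np.array(scores)

prec_scores = per_class_precision([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]], [1, 0, 1])
print(prec_scores)
```

Class 0 is predicted twice and is correct once, while class 1 is predicted once and is correct, giving 0.5 and 1.0.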

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import Precision
>>> preds = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]), mindspore.float32)
>>> labels = Tensor(np.array([1, 0, 1]), mindspore.int32)
>>> metric = Precision()
>>> metric.update(preds, labels)
>>> prec = metric.eval()
>>> print(prec)
[0.5 1. ]

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the precision.

Returns

prec (numpy.ndarray) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables. If the index of the maximum of the predicted value matches the label, the predicted result is correct.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, numpy.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, numpy.ndarray]): Ground truth values. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.
ValueError – If preds doesn’t have the same classes number as labels.

recall

Class for Metric Recall

class mindnlp.engine.metrics.recall.Recall(name='Recall')[source]

Bases: Metric

Calculates the recall. Recall is also referred to as the true positive rate or sensitivity. The function is shown as follows:

\[\text{Recall} =\frac{\text{TP}} {\text{TP} + \text{FN}}\]

where TP is the number of true positive cases and FN is the number of false negative cases.
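Per-class recall can also be read off a table of (true class, predicted class) counts: the diagonal holds TP and each row sum holds TP + FN. The sketch below assumes arg-max predictions and integer class labels; `per_class_recall` is a hypothetical name.

```python
import numpy as np

def per_class_recall(preds, labels):
    """Diagonal of the count table divided by its row sums."""
    preds = np.asarray(preds)
    labels = np.asarray(labels)
    predicted = preds.argmax(axis=1)
    num_classes = preds.shape[1]
    counts = np.zeros((num_classes, num_classes))
    np.add.at(counts, (labels, predicted), 1)  # row = true class, column = predicted
    support = counts.sum(axis=1)               # TP + FN per class
    return np.divide(np.diag(counts), support,
                     out=np.zeros(num_classes), where=support > 0)

rec_scores = per_class_recall([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]], [1, 0, 1])
print(rec_scores)
```

The single class-0 sample is recovered, and one of the two class-1 samples is recovered, giving 1.0 and 0.5.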

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import Recall
>>> preds = Tensor(np.array([[0.2, 0.5], [0.3, 0.1], [0.9, 0.6]]), mindspore.float32)
>>> labels = Tensor(np.array([1, 0, 1]), mindspore.int32)
>>> metric = Recall()
>>> metric.update(preds, labels)
>>> rec = metric.eval()
>>> print(rec)
[1. 0.5]

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the recall.

Returns

rec (numpy.ndarray) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted values. preds is usually (but not strictly) an array of floating-point numbers in the range \([0, 1]\) with shape \((N, C)\), where \(N\) is the number of cases and \(C\) is the number of categories.
labels (Union[Tensor, list, np.ndarray]): Ground truth. labels must either be one-hot encoded with shape \((N, C)\), or have shape \((N,)\) so that they can be converted to one-hot format.

Raises

ValueError – If the number of inputs is not 2.
ValueError – If preds doesn’t have the same classes number as labels.

rouge

Classes for Metrics RougeN and RougeL

class mindnlp.engine.metrics.rouge.RougeL(beta=1.2, name='RougeL')[source]

Bases: Metric

Calculates the ROUGE-L score. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation models. ROUGE-L is calculated based on Longest Common Subsequence (LCS). The function is shown as follows:

\[ \begin{align}\begin{aligned}R_{l c s}=\frac{L C S(X, Y)}{m}\\P_{l c s}=\frac{L C S(X, Y)}{n}\\F_{l c s}=\frac{\left(1+\beta^{2}\right) R_{l c s} P_{l c s}}{R_{l c s}+\beta^{2} P_{l c s}}\end{aligned}\end{align} \]

where X is the reference sentence and Y is the candidate sentence, m and n are the lengths of X and Y respectively, and LCS(X, Y) is the length of their longest common subsequence.
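The score can be sketched with a standard dynamic-programming LCS. The sketch assumes recall is normalized by the reference length and precision by the candidate length, and that the best score over all references is kept; with those assumptions it reproduces the value in the example below. `lcs_length` and `rouge_l_sketch` are hypothetical helpers.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]

def rouge_l_sketch(cand, refs, beta=1.2):
    best = 0.0
    for ref in refs:
        lcs = lcs_length(cand, ref)
        if lcs == 0:
            continue
        recall = lcs / len(ref)      # assumed: normalized by reference length
        precision = lcs / len(cand)  # assumed: normalized by candidate length
        f_score = ((1 + beta ** 2) * recall * precision
                   / (recall + beta ** 2 * precision))
        best = max(best, f_score)    # assumed: keep the best-matching reference
    return best

cand = ["The", "cat", "The", "cat", "on", "the", "mat"]
refs = [["The", "cat", "is", "on", "the", "mat"],
        ["There", "is", "a", "cat", "on", "the", "mat"]]
rouge_l = rouge_l_sketch(cand, refs)
print(rouge_l)
```

The first reference shares a subsequence of length 5 with the candidate, giving recall 5/6 and precision 5/7 and a score of about 0.7801.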

Parameters

beta (float) – A hyperparameter that sets the weight of recall. Default: 1.2.
name (str) – Name of the metric.

Example

>>> from mindnlp.engine.metrics import RougeL
>>> cand_list = ["The","cat","The","cat","on","the","mat"]
>>> ref_list = [["The","cat","is","on","the","mat"],
                ["There","is","a","cat","on","the","mat"]]
>>> metric = RougeL()
>>> metric.update(cand_list, ref_list)
>>> rougel_score = metric.eval()
>>> print(rougel_score)
0.7800511508951408

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the Rouge-L score.

Returns

rougel_score (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input cand_list and ref_list.

cand_list (list): A tokenized candidate sentence, given as a list of tokens.
ref_list (list): A list of tokenized ground truth sentences.

Raises

ValueError – If the number of inputs is not 2.

class mindnlp.engine.metrics.rouge.RougeN(n_size=1, name='RougeN')[source]

Bases: Metric

Calculates the ROUGE-N. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics used for evaluating automatic summarization and machine translation models. ROUGE-N refers to the overlap of n-grams between candidates and reference summaries.
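A recall-oriented sketch of the n-gram overlap: matched reference n-grams divided by all reference n-grams, with matches clipped by the candidate counts. Summing over references is an assumption, and `rouge_n_sketch` is a hypothetical name.

```python
from collections import Counter

def rouge_n_sketch(cand, refs, n_size=1):
    """Share of reference n-grams that also occur in the candidate."""
    def ngram_counts(tokens):
        return Counter(tuple(tokens[i:i + n_size])
                       for i in range(len(tokens) - n_size + 1))
    cand_counts = ngram_counts(cand)
    overlap = total = 0
    for ref in refs:
        ref_counts = ngram_counts(ref)
        overlap += sum(min(cnt, cand_counts[gram]) for gram, cnt in ref_counts.items())
        total += sum(ref_counts.values())
    if total == 0:
        raise RuntimeError("reference contains no n-grams")
    return overlap / total

cand = ["the", "cat", "was", "found", "under", "the", "bed"]
refs = [["the", "cat", "was", "under", "the", "bed"]]
rouge_2 = rouge_n_sketch(cand, refs, n_size=2)
print(rouge_2)
```

Four of the five reference bigrams appear in the candidate ("was under" does not), so the score is 0.8.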

Parameters

n_size (int) – N-gram size. Default: 1.
name (str) – Name of the metric.

Example

>>> from mindnlp.engine.metrics import RougeN
>>> cand_list = ["the", "cat", "was", "found", "under", "the", "bed"]
>>> ref_list = [["the", "cat", "was", "under", "the", "bed"]]
>>> metric = RougeN(2)
>>> metric.update(cand_list, ref_list)
>>> rougen_score = metric.eval()
>>> print(rougen_score)
0.8

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the Rouge-N score.

Returns

rougen_score (float) - The computed result.

Raises

RuntimeError – If the reference size is 0.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input cand_list and ref_list.

cand_list (list): A tokenized candidate sentence, given as a list of tokens.
ref_list (list): A list of tokenized ground truth sentences.

Raises

ValueError – If the number of inputs is not 2.

spearman

Class for Metric Spearman

class mindnlp.engine.metrics.spearman.SpearmanCorrelation(name='SpearmanCorrelation')[source]

Bases: Metric

Calculates Spearman’s rank correlation coefficient (SRCC), a nonparametric measure of rank correlation (statistical dependence between the rankings of two variables). It assesses how well the relationship between two variables can be described using a monotonic function. If there are no repeated data values, a perfect Spearman correlation of +1 or −1 occurs when each of the variables is a perfect monotone function of the other.
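For data without repeated values, the coefficient can be sketched from rank differences as 1 − 6·Σd² / (n(n² − 1)). The sketch below does not handle ties (which would need averaged ranks) and makes no claim about how the library treats them; `spearman_sketch` is a hypothetical name.

```python
import numpy as np

def spearman_sketch(x, y):
    """Spearman correlation for sequences without tied values."""
    x = np.asarray(x, dtype=float).ravel()
    y = np.asarray(y, dtype=float).ravel()
    # Rank of each element: position it would take in sorted order.
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = rank_x - rank_y
    n = len(x)
    return float(1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1)))

x = [0.1, 1.0, 2.4, 0.9]
y = [0.0, 1.1, 2.9, 1.0]  # same ordering as x, no ties
rho = spearman_sketch(x, y)
print(rho)
```

Because y increases exactly when x does, all rank differences are zero and the coefficient is 1.0; reversing one sequence gives −1.0.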

Parameters: name (str) – Name of the metric.

Example

>>> import numpy as np
>>> import mindspore
>>> from mindspore import Tensor
>>> from mindnlp.engine.metrics import SpearmanCorrelation
>>> preds = Tensor(np.array([[0.1], [1.0], [2.4], [0.9]]), mindspore.float32)
>>> labels = Tensor(np.array([[0.0], [1.0], [2.9], [1.0]]), mindspore.float32)
>>> metric = SpearmanCorrelation()
>>> metric.update(preds, labels)
>>> s_r_c_c = metric.eval()
>>> print(s_r_c_c)
1.0

clear()[source]: Clears the internal evaluation results.

eval()[source]

Computes and returns the SRCC.

Returns

s_r_c_c (float) - The computed result.

get_metric_name()[source]: Returns the name of the metric.

update(*inputs)[source]

Updates local variables.

Parameters

inputs –

Input preds and labels.

preds (Union[Tensor, list, np.ndarray]): Predicted value. preds is a list of floating numbers and the shape of preds is \((N, 1)\).
labels (Union[Tensor, list, np.ndarray]): Ground truth. labels is a list of floating numbers and the shape of labels is \((N, 1)\).

Raises

ValueError – If the number of inputs is not 2.
RuntimeError – If preds and labels have different lengths.
