Text Classification
agnews
AG_NEWS load function
- mindnlp.dataset.text_classification.agnews.AG_NEWS(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None, shuffle=False)
Load the AG_NEWS dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import AG_NEWS
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = AG_NEWS(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '3'),
 Tensor(shape=[], dtype=String, value= "Wall St. Bears Claw Back Into the Black (Reuters) Reuters - Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")]
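The return convention described under Returns can be illustrated without downloading anything. The following is a minimal sketch in plain Python, not MindNLP's implementation: `select_splits` and the `fake` dictionary are hypothetical stand-ins, and returning the list in the requested order is an assumption.

```python
from typing import Dict, Tuple, Union


def select_splits(available: Dict[str, object],
                  split: Union[str, Tuple[str, ...]] = ("train", "test")):
    """Hypothetical helper mirroring the documented return convention."""
    if isinstance(split, str):
        # A single split name yields the dataset itself, not a list.
        return available[split]
    # A tuple of names yields a list, assumed to follow the requested order.
    return [available[name] for name in split]


# Strings stand in for real dataset objects.
fake = {"train": "train-data", "test": "test-data"}
both = select_splits(fake, ("train", "test"))
only_train = select_splits(fake, "train")
```

With a tuple, the result can be unpacked into one variable per split; with a single string, no unpacking is needed.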
- mindnlp.dataset.text_classification.agnews.AG_NEWS_Process(dataset, vocab=None, tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, bucket_boundaries=None, batch_size=512, max_len=500, column='text', drop_remainder=False)
Process the AG_NEWS dataset
- Parameters
dataset (GeneratorDataset) – AG_NEWS dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. Default: None.
batch_size (int) – The number of rows each batch is created with. Default: 512.
max_len (int) – The maximum length of a sentence. Default: 500.
column (str) – Name of the AG_NEWS dataset column to transform.
drop_remainder (bool) – Whether to discard the last batch, rather than pass it to the next operation, when it contains fewer than batch_size rows. Default: False.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import AG_NEWS, AG_NEWS_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AG_NEWS()
>>> column = "text"
>>> tokenizer = BasicTokenizer()
>>> # Pass column and tokenizer by keyword: the second positional parameter is vocab.
>>> agnews_dataset, vocab = AG_NEWS_Process(train_dataset, column=column, tokenizer=tokenizer)
>>> agnews_dataset = agnews_dataset.create_tuple_iterator()
>>> print(next(agnews_dataset))
{'label': Tensor(shape=[], dtype=String, value= '3'),
 'text': Tensor(shape=[35], dtype=Int32, value= [ 462, 503, 2, 2102, 47615, 1228, 1766, 3, 1388, 17, 34, 18, 34, 5, 4076, 5, 10244, 4, 462, 434, 19, 13, 14141, 21, 3547, 8, 8356, 5, 38127, 4, 55, 4770, 2987, 390, 2])}
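To make bucket_boundaries more concrete, here is a small plain-Python sketch of how sequence lengths could be assigned to buckets. It assumes half-open intervals (n boundaries give n + 1 buckets, and a length equal to a boundary falls into the next bucket), an assumption modeled on MindSpore's bucket_batch_by_length; whether AG_NEWS_Process behaves identically is not confirmed here, and `bucket_index` is a hypothetical helper.

```python
from bisect import bisect_right


def bucket_index(length, bucket_boundaries):
    """Hypothetical helper: return the bucket a sequence length falls into."""
    # The documentation requires strictly increasing boundaries.
    pairs = zip(bucket_boundaries, bucket_boundaries[1:])
    if any(lower >= upper for lower, upper in pairs):
        raise ValueError("bucket_boundaries must be strictly increasing")
    # Assumed half-open buckets: [0, b0), [b0, b1), ..., [b_last, inf).
    return bisect_right(bucket_boundaries, length)


lengths = [35, 399, 400, 499, 500, 750]
assignments = [bucket_index(n, [400, 500]) for n in lengths]
```

Grouping rows of similar length this way keeps the amount of padding inside each batch small.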
amazonreviewfull
AmazonReviewFull load function
- mindnlp.dataset.text_classification.amazonreviewfull.AmazonReviewFull(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the AmazonReviewFull dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = AmazonReviewFull(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.amazonreviewfull.AmazonReviewFull_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the AmazonReviewFull dataset
- Parameters
dataset (GeneratorDataset) – AmazonReviewFull dataset.
column (str) – Name of the AmazonReviewFull dataset column to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import AmazonReviewFull, AmazonReviewFull_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AmazonReviewFull()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> amazonreviewfull_dataset, vocab = AmazonReviewFull_Process(train_dataset, column, tokenizer)
>>> amazonreviewfull_dataset = amazonreviewfull_dataset.create_tuple_iterator()
>>> print(next(amazonreviewfull_dataset))
[Tensor(shape=[], dtype=Int64, value= '3'),
 Tensor(shape=[27], dtype=Int32, value= [ 53, 37, 912165, 6822, 11, 6, 31, 2589, 13, 5, 8221, 509, 114, 5478, 16, 126088, 2, 16, 82, 141, 5, 30284, 2633, 50, 8, 9, 15])]
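The Process functions return integer tensors because each token is replaced by its index in a vocabulary. The sketch below shows that idea in plain Python. It is not MindNLP's Vocab class: `build_vocab`, `lookup`, the whitespace tokenization, the frequency-based ordering, and the special tokens are all assumptions chosen for illustration.

```python
from collections import Counter


def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Hypothetical helper: map each token to an integer index."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Most frequent tokens first; ties are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(specials) + ordered)}


def lookup(tokens, vocab, unk="<unk>"):
    """Replace tokens by indices, falling back to the unknown token."""
    return [vocab.get(tok, vocab[unk]) for tok in tokens]


corpus = ["great book great price", "poor book"]
# Lowercased whitespace splitting stands in for a real tokenizer.
token_lists = [line.lower().split() for line in corpus]
vocab = build_vocab(token_lists)
ids = lookup("great new book".split(), vocab)
```

Here "new" never appears in the corpus, so it maps to the index of the unknown token.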
amazonreviewpolarity
AmazonReviewPolarity load function
- mindnlp.dataset.text_classification.amazonreviewpolarity.AmazonReviewPolarity(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the AmazonReviewPolarity dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import AmazonReviewPolarity
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = AmazonReviewPolarity(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.amazonreviewpolarity.AmazonReviewPolarity_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the AmazonReviewPolarity dataset
- Parameters
dataset (GeneratorDataset) – AmazonReviewPolarity dataset.
column (str) – Name of the AmazonReviewPolarity dataset column to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import AmazonReviewPolarity, AmazonReviewPolarity_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AmazonReviewPolarity()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> amazonreviewpolarity_dataset, vocab = AmazonReviewPolarity_Process(train_dataset, column, tokenizer)
>>> amazonreviewpolarity_dataset = amazonreviewpolarity_dataset.create_tuple_iterator()
>>> print(next(amazonreviewpolarity_dataset))
[Tensor(shape=[], dtype=Int64, value= 2),
 Tensor(shape=[90], dtype=Int32, value= [277246, 89, 14, 1, 680, 16, 7506, 32, 203, 543, 18, 460, 12, 33, 6923, 1, 146277, 13, 67, 489, 38, 81, 3, 48, 2004, 9, 89, 5, 152, 78, 795, 22921, 0, 170, 137, 12, 3, 28, 567, 1, 170, 32075, 4790, 27, 50, 7, 36, 7, 1, 660, 3, 28, 158, 567, 9, 54, 1, 112, 137, 12, 33, 7683, 277, 41, 6067, 69373, 4, 471, 6, 20149, 991, 21, 10745, 3408, 4, 5257, 24128, 0, 33, 48, 5944, 241, 78, 3043, 5, 392, 12, 5075, 1118, 5075])]
cola
CoLA load function
- mindnlp.dataset.text_classification.cola.CoLA(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the CoLA dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import CoLA
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev', 'test')
>>> dataset_train, dataset_dev, dataset_test = CoLA(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= 'gj04'),
 Tensor(shape=[], dtype=String, value= '1'),
 Tensor(shape=[], dtype=String, value= "Our friends won't buy this analysis, let alone the next one we propose.")]
- mindnlp.dataset.text_classification.cola.CoLA_Process(dataset, column='sentence', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the CoLA dataset
- Parameters
dataset (GeneratorDataset) – CoLA dataset.
column (str) – Name of the CoLA dataset column to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import CoLA, CoLA_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_dev, dataset_test = CoLA()
>>> column = "sentence"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = CoLA_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
[Tensor(shape=[], dtype=String, value= 'gj04'),
 Tensor(shape=[], dtype=String, value= '1'),
 Tensor(shape=[17], dtype=Int32, value= [ 854, 290, 196, 10, 28, 182, 57, 738, 9, 816, 1372, 1, 768, 99, 71, 5316, 0])]
dbpedia
DBpedia load function
- mindnlp.dataset.text_classification.dbpedia.DBpedia(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the DBpedia dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import DBpedia
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = DBpedia(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.dbpedia.DBpedia_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the DBpedia dataset
- Parameters
dataset (GeneratorDataset) – DBpedia dataset.
column (str) – Name of the DBpedia dataset column to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import DBpedia, DBpedia_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = DBpedia()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = DBpedia_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[51], dtype=Int32, value= [ 407, 0, 347, 0, 7760, 774, 7760, 3, 16106, 407, 347, 7760, 950, 10, 5, 99, 88888, 485, 69, 2, 16106, 3996, 3092, 156, 42, 73, 20, 1217, 0, 61, 504, 83, 3, 149, 8463, 10, 156, 2614, 9, 1604, 13, 3267, 1986, 4858, 0, 1730, 485, 1831, 2, 594, 0])]
imdb
IMDB load function
- mindnlp.dataset.text_classification.imdb.IMDB(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True, proxies=None)
Load the IMDB dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = IMDB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.imdb.IMDB_Process(dataset, tokenizer, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)
Process the IMDB dataset
- Parameters
dataset (GeneratorDataset) – IMDB dataset.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – Size of each batch. Default: 64.
max_len (int) – Maximum length of a sentence. Default: 500.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Default: None.
drop_remainder (bool) – If True, drop the last batch of each bucket when it is not a full batch. Default: False.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If input_column is not a string.
Examples
>>> imdb_train, imdb_test = load('imdb', shuffle=True)
>>> embedding, vocab = Glove.from_pretrained('6B', 100, special_tokens=["<unk>", "<pad>"], dropout=drop)
>>> tokenizer = BasicTokenizer(True)
>>> imdb_train = process('imdb', imdb_train, tokenizer=tokenizer, vocab=vocab,
...                      bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)
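The interaction between batch_size and drop_remainder can be shown with a few lines of plain Python. This is only a sketch of the general batching idea under the parameter descriptions above; `batch` is a hypothetical helper, not the routine IMDB_Process uses, and it ignores bucketing.

```python
def batch(rows, batch_size, drop_remainder=False):
    """Hypothetical helper: split rows into consecutive batches."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # Optionally discard a trailing batch that is smaller than batch_size.
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


rows = list(range(10))
kept = batch(rows, 4)                          # last batch holds 2 rows
dropped = batch(rows, 4, drop_remainder=True)  # partial batch is discarded
```

Dropping the remainder guarantees that every batch has the same shape, at the cost of skipping a few rows.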
mnli
MNLI load function
- mindnlp.dataset.text_classification.mnli.MNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev_matched', 'dev_mismatched'), proxies=None)
Load the MNLI dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ("train", "dev_matched", "dev_mismatched").
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import MNLI
>>> root = "~/.mindnlp"
>>> split = ("train", "dev_matched", "dev_mismatched")
>>> dataset_train, dataset_dev_matched, dataset_dev_mismatched = MNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.mnli.MNLI_Process(dataset, column: Union[Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the MNLI dataset
- Parameters
dataset (GeneratorDataset) – MNLI dataset.
column (Tuple[str]|str) – Name or names of the MNLI dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import MNLI, MNLI_Process
>>> dataset_train, dataset_dev_matched, dataset_dev_mismatched = MNLI()
>>> dataset_train, vocab = MNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[12], dtype=Int32, value= [44002, 3578, 10420, 40, 117, 1363, 9631, 14, 790, 5, 10026, 0]),
 Tensor(shape=[10], dtype=Int32, value= [ 9387, 5, 10026, 20, 63, 133, 3578, 10420, 113, 0])]
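The sentence-pair Process functions accept either one column name or a tuple of names and raise TypeError otherwise. A check of that kind might look like the sketch below; `normalize_columns` is a hypothetical helper written for illustration, and the real validation inside MNLI_Process may differ.

```python
def normalize_columns(column):
    """Hypothetical helper: accept a column name or a tuple of names."""
    if isinstance(column, str):
        # A single name is wrapped so that callers can always iterate.
        return [column]
    if isinstance(column, tuple) and all(isinstance(name, str) for name in column):
        return list(column)
    raise TypeError(
        f"column must be a str or a tuple of str, got {type(column).__name__}"
    )


single = normalize_columns("sentence1")
pair = normalize_columns(("sentence1", "sentence2"))
```

Normalizing to a list first lets the same loop tokenize one column or several.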
mrpc
MRPC load function
- mindnlp.dataset.text_classification.mrpc.MRPC(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the MRPC dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import MRPC
>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = MRPC(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.mrpc.MRPC_Process(dataset, column: Union[Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the MRPC dataset
- Parameters
dataset (GeneratorDataset) – MRPC dataset.
column (Tuple[str]|str) – Name or names of the MRPC dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import MRPC, MRPC_Process
>>> dataset_train, dataset_test = MRPC()
>>> dataset_train, vocab = MRPC_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[19], dtype=Int32, value= [1555, 527, 36, 1838, 1, 1547, 33, 226, 8, 2, 1156, 8, 1, 4, 4932, 9179, 36, 362, 0]),
 Tensor(shape=[20], dtype=Int32, value= [5820, 3, 151, 27, 119, 8, 2, 1156, 8, 1, 1555, 527, 36, 1838, 4, 4932, 9179, 36, 362, 0])]
qnli
QNLI load function
- mindnlp.dataset.text_classification.qnli.QNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the QNLI dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import QNLI
>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = QNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.qnli.QNLI_Process(dataset, column: Union[Tuple[str], str] = ('question', 'sentence'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the QNLI dataset
- Parameters
dataset (GeneratorDataset) – QNLI dataset.
column (Tuple[str]|str) – Name or names of the QNLI dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import QNLI, QNLI_Process
>>> dataset_train, dataset_dev, dataset_test = QNLI()
>>> dataset_train, vocab = QNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[8], dtype=Int32, value= [ 49, 25, 0, 382, 2323, 574, 380, 4]),
 Tensor(shape=[45], dtype=Int32, value= [ 3377, 0, 65, 1913, 180, 36, 5, 53, 2, 0, 1913, 19, 662, 1, 2323, 26903, 1857, 8, 8531, 5, 63, 9937, 1420, 7, 45, 1325, 3042, 2323, 58, 77, 44, 76653, 70, 46, 3140, 5, 63, 1164, 793, 272, 6, 0, 389, 486, 3])]
qqp
QQP load function
- mindnlp.dataset.text_classification.qqp.QQP(root: str = '/home/docs/.mindnlp', proxies=None)
Load the QQP dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import QQP
>>> root = "~/.mindnlp"
>>> dataset_train = QQP(root)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 0),
 Tensor(shape=[], dtype=String, value= 'What is the step by step guide to invest in share market in india?'),
 Tensor(shape=[], dtype=String, value= 'What is the step by step guide to invest in share market?')]
- mindnlp.dataset.text_classification.qqp.QQP_Process(dataset, column: Union[Tuple[str], str] = ('question1', 'question2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the QQP dataset
- Parameters
dataset (GeneratorDataset) – QQP dataset.
column (Tuple[str]|str) – Name or names of the QQP dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import QQP, QQP_Process
>>> dataset_train = QQP()
>>> dataset_train, vocab = QQP_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 0),
 Tensor(shape=[15], dtype=Int32, value= [ 2, 4, 1, 1280, 68, 1280, 3038, 6, 601, 8, 805, 407, 8, 633, 0]),
 Tensor(shape=[13], dtype=Int32, value= [ 2, 4, 1, 1280, 68, 1280, 3038, 6, 601, 8, 805, 407, 0])]
rte
RTE load function
- mindnlp.dataset.text_classification.rte.RTE(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the RTE dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import RTE
>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = RTE(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[], dtype=String, value= 'No Weapons of Mass Destruction Found in Iraq Yet.'),
 Tensor(shape=[], dtype=String, value= 'Weapons of Mass Destruction Found in Iraq.')]
- mindnlp.dataset.text_classification.rte.RTE_Process(dataset, column: Union[Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the RTE dataset
- Parameters
dataset (GeneratorDataset) – RTE dataset.
column (Tuple[str]|str) – Name or names of the RTE dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import RTE, RTE_Process
>>> dataset_train, dataset_dev, dataset_test = RTE()
>>> dataset_train, vocab = RTE_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[10], dtype=Int32, value= [ 882, 3696, 3, 3599, 7046, 7175, 4, 79, 4518, 0]),
 Tensor(shape=[8], dtype=Int32, value= [3696, 3, 3599, 7046, 7175, 4, 79, 0])]
sogounews
SogouNews load function
- mindnlp.dataset.text_classification.sogounews.SogouNews(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the SogouNews dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = SogouNews(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
sst2
SST2 load function
- mindnlp.dataset.text_classification.sst2.SST2(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the SST2 dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import SST2
>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = SST2(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '0'),
 Tensor(shape=[], dtype=String, value= 'hide new secretions from the parental units ')]
- mindnlp.dataset.text_classification.sst2.SST2_Process(dataset, column='text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the SST2 dataset
- Parameters
dataset (GeneratorDataset) – SST2 dataset.
column (str) – Name of the SST2 dataset column to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import SST2, SST2_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_dev, test_dataset = SST2()
>>> column = "text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = SST2_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
{'label': Tensor(shape=[], dtype=String, value= '0'),
 'text': Tensor(shape=[7], dtype=Int32, value= [ 4699, 92, 12483, 36, 0, 7598, 9597])}
stsb
STSB load function
- mindnlp.dataset.text_classification.stsb.STSB(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the STSB dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import STSB
>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = STSB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[], dtype=Float64, value= 5),
 Tensor(shape=[], dtype=String, value= 'A plane is taking off.'),
 Tensor(shape=[], dtype=String, value= 'An air plane is taking off.')]
- mindnlp.dataset.text_classification.stsb.STSB_Process(dataset, column: Union[Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Process the STSB dataset
- Parameters
dataset (GeneratorDataset) – STSB dataset.
column (Tuple[str]|str) – Name or names of the STSB dataset column(s) to transform.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - Dataset after the transforms.
Vocab (Vocab) - Vocabulary created from the dataset.
- Raises
TypeError – If column is not a string or Tuple[str].
Examples
>>> from mindnlp.dataset import STSB, STSB_Process
>>> dataset_train, dataset_dev, dataset_test = STSB()
>>> dataset_train, vocab = STSB_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1),
 Tensor(shape=[], dtype=Float64, value= 5),
 Tensor(shape=[6], dtype=Int32, value= [ 5, 263, 6, 448, 135, 0]),
 Tensor(shape=[7], dtype=Int32, value= [329, 242, 263, 6, 448, 135, 0])]
wnli
WNLI load function
- mindnlp.dataset.text_classification.wnli.WNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the WNLI dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict identifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> from mindnlp.dataset import WNLI
>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = WNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '1'),
 Tensor(shape=[], dtype=String, value= 'I stuck a pin through a carrot. When I pulled the pin out, it had a hole.'),
 Tensor(shape=[], dtype=String, value= 'The carrot had a hole.')]
- mindnlp.dataset.text_classification.wnli.WNLI_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]
Process the WNLI dataset with the given tokenizer and vocabulary.
- Parameters
dataset (GeneratorDataset) – WNLI dataset.
column (Tuple[str]|str) – The column or columns of the WNLI dataset to be transformed.
tokenizer (TextTensorOperation) – The tokenizer used to tokenize the text dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - dataset after transforms
Vocab (Vocab) - vocab created from dataset
- Raises
TypeError – If column is not a string or Tuple[str]
Examples
>>> from mindnlp.dataset import WNLI, WNLI_Process
>>> dataset_train, dataset_dev, dataset_test = WNLI()
>>> dataset_train, vocab = WNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=String, value= '1'), Tensor(shape=[20], dtype=Int32, value= [ 23, 1102, 6, 341, 109, 6, 607, 0, 105, 23, 468, 1, 341, 33, 2, 9, 14, 6, 182, 0]), Tensor(shape=[6], dtype=Int32, value= [ 7, 607, 14, 6, 182, 0])]
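The documented TypeError for `column` can be pictured with a small validation helper. The sketch below is an assumption about how such a check might look, not the library's implementation; `normalize_columns` is a hypothetical name.

```python
from typing import Tuple, Union

# Hypothetical helper illustrating the documented rule: `column` must be
# a str or a tuple of str, otherwise a TypeError is raised.
def normalize_columns(
    column: Union[Tuple[str, ...], str] = ("sentence1", "sentence2"),
) -> Tuple[str, ...]:
    if isinstance(column, str):
        # A single column name is wrapped so callers can always iterate.
        return (column,)
    if isinstance(column, tuple) and all(isinstance(c, str) for c in column):
        return column
    raise TypeError(
        f"column must be a str or Tuple[str], got {type(column).__name__}"
    )

single = normalize_columns("sentence1")
both = normalize_columns(("sentence1", "sentence2"))
```

Normalizing to a tuple up front lets the rest of a processing function loop over columns without caring whether one name or several were passed.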
yahooanswers
YahooAnswers load function
- mindnlp.dataset.text_classification.yahooanswers.YahooAnswers(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]
Load the YahooAnswers dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: '~/.mindnlp'
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict specifying proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YahooAnswers(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
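The `proxies` argument is a mapping from URL scheme to proxy address. How mindnlp consumes it is not stated here, so the sketch below only shows the same dict shape being handed to the Python standard library's `urllib.request.ProxyHandler`; treat it as an illustration of the format, not as the loader's download code.

```python
import urllib.request

# Same shape as the documented `proxies` argument: scheme -> proxy URL.
proxies = {"https": "https://127.0.0.1:7890"}

# ProxyHandler accepts a scheme-to-URL mapping; build_opener wires it in.
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# opener.open(url) would route https requests through the proxy.
# No request is sent in this sketch.
```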
- mindnlp.dataset.text_classification.yahooanswers.YahooAnswers_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]
Process the YahooAnswers dataset with the given tokenizer and vocabulary.
- Parameters
dataset (GeneratorDataset) – YahooAnswers dataset.
column (str) – The column of the YahooAnswers dataset to be transformed.
tokenizer (TextTensorOperation) – The tokenizer used to tokenize the text dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - dataset after transforms.
Vocab (Vocab) - vocab created from dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import YahooAnswers, YahooAnswers_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YahooAnswers()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YahooAnswers_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
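The process functions return a transformed dataset together with a vocabulary. As a rough picture of what "tokenize, build a vocab, map tokens to indices" means, here is a plain-Python sketch. Everything in it is hypothetical: `basic_tokenize` is a whitespace splitter rather than mindnlp's BasicTokenizer, and the index order and `<unk>` token are choices made for this illustration.

```python
from collections import Counter
from typing import Dict, List

def basic_tokenize(text: str) -> List[str]:
    # Simplified tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def build_vocab(rows: List[str], unk: str = "<unk>") -> Dict[str, int]:
    counts = Counter(tok for row in rows for tok in basic_tokenize(row))
    vocab = {unk: 0}
    # Most frequent tokens first; ties broken alphabetically for determinism.
    for token, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab[token] = len(vocab)
    return vocab

def encode(text: str, vocab: Dict[str, int], unk: str = "<unk>") -> List[int]:
    # Tokens missing from the vocabulary fall back to the unknown index.
    return [vocab.get(tok, vocab[unk]) for tok in basic_tokenize(text)]

# Toy rows standing in for a text column such as "title_text".
rows = ["why is the sky blue", "why is grass green"]
vocab = build_vocab(rows)
ids = [encode(row, vocab) for row in rows]
```

Each row becomes a variable-length list of integers, which mirrors the Int32 tensors of differing shapes printed by the processed datasets in this section.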
yelpreviewfull
YelpReviewFull load function
- mindnlp.dataset.text_classification.yelpreviewfull.YelpReviewFull(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]
Load the YelpReviewFull dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: '~/.mindnlp'
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict specifying proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YelpReviewFull(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.yelpreviewfull.YelpReviewFull_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]
Process the YelpReviewFull dataset with the given tokenizer and vocabulary.
- Parameters
dataset (GeneratorDataset) – YelpReviewFull dataset.
column (str) – The column of the YelpReviewFull dataset to be transformed.
tokenizer (TextTensorOperation) – The tokenizer used to tokenize the text dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - dataset after transforms.
Vocab (Vocab) - vocab created from dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import YelpReviewFull, YelpReviewFull_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YelpReviewFull()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YelpReviewFull_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
{'label': Tensor(shape=[], dtype=Int64, value= 5), 'title_text': Tensor(shape=[117], dtype=Int32, value= [ 6338, 0, 258139, 1500, 265, 139, 295, 12, 15, 6, 1344, 17531, 0, 101, 8, 28, 106, 3, 702, 7, 842, 7, 364, 199, 11063, 277, 101, 8, 28, 152, 25, 57, 15, 1076, 225, 4021, 277, 101, 8, 28, 12202, 19, 6, 308, 20, 1638, 3077, 43, 287710, 38, 76, 23, 1802, 27, 1151, 7, 44, 14, 53, 1617, 15, 852, 185, 1865, 3, 21, 248, 3990, 277, 3, 21, 67, 52, 16374, 7, 169, 19483, 364, 390, 7, 169, 279, 138, 0, 75, 2, 79, 81, 103, 21, 248, 63, 139, 8, 99, 570, 51, 387, 7, 143, 10, 155, 1532, 139, 27, 64, 279, 2, 18, 139, 8, 99, 75, 9730, 6, 6598, 0])}
yelpreviewpolarity
YelpReviewPolarity load function
- mindnlp.dataset.text_classification.yelpreviewpolarity.YelpReviewPolarity(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]
Load the YelpReviewPolarity dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: '~/.mindnlp'
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict specifying proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
- Returns
datasets_list (list) - A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
Examples
>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YelpReviewPolarity(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.text_classification.yelpreviewpolarity.YelpReviewPolarity_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]
Process the YelpReviewPolarity dataset with the given tokenizer and vocabulary.
- Parameters
dataset (GeneratorDataset) – YelpReviewPolarity dataset.
column (str) – The column of the YelpReviewPolarity dataset to be transformed.
tokenizer (TextTensorOperation) – The tokenizer used to tokenize the text dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
- Returns
dataset (MapDataset) - dataset after transforms.
Vocab (Vocab) - vocab created from dataset.
- Raises
TypeError – If column is not a string.
Examples
>>> from mindnlp.dataset import YelpReviewPolarity, YelpReviewPolarity_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YelpReviewPolarity()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YelpReviewPolarity_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
- class mindnlp.dataset.text_classification.yelpreviewpolarity.Yelpreviewpolarity(path)[source]
Bases: object
YelpReviewPolarity dataset source
TextClassification dataset init