Text Classification

agnews

AG_NEWS load function

mindnlp.dataset.text_classification.agnews.AG_NEWS(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None, shuffle=False)[source]

Load the AG_NEWS dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

  • shuffle (bool) – Whether to shuffle the dataset. Default: False.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
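
The return convention described above (a single dataset for one split, a list otherwise) can be sketched in plain Python. This is only an illustration: `select_splits` and `_FAKE_SPLITS` are hypothetical names that stand in for the real loader and its dataset objects.

```python
from typing import Dict, List, Tuple, Union

# Hypothetical stand-in for loaded splits; a real loader returns dataset objects.
_FAKE_SPLITS: Dict[str, List[str]] = {
    "train": ["train sample"],
    "test": ["test sample"],
}


def select_splits(split: Union[str, Tuple[str, ...]] = ("train", "test")):
    """Return one dataset for a single split, otherwise a list of datasets."""
    names = (split,) if isinstance(split, str) else tuple(split)
    datasets = [_FAKE_SPLITS[name] for name in names]
    # A single requested split is unwrapped instead of returned as a one-element list.
    return datasets[0] if len(datasets) == 1 else datasets


train_only = select_splits("train")
train_set, test_set = select_splits(("train", "test"))
```

Because of this convention, unpacking such as `a, b = ...` only makes sense when more than one split is requested.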

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train,dataset_test = AG_NEWS(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '3'), Tensor(shape=[], dtype=String,
value= "Wall St. Bears Claw Back Into the Black (Reuters) Reuters -
Short-sellers, Wall Street's dwindling\\band of ultra-cynics, are seeing green again.")]
mindnlp.dataset.text_classification.agnews.AG_NEWS_Process(dataset, vocab=None, tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, bucket_boundaries=None, batch_size=512, max_len=500, column='text', drop_remainder=False)[source]

Process the AG_NEWS dataset.

Parameters
  • dataset (GeneratorDataset) – AG_News dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. Default: None.

  • batch_size (int) – The number of rows each batch is created with. Default: 512.

  • max_len (int) – The max length of the sentence. Default: 500.

  • column (str) – The column of the AG_NEWS dataset to be transformed. Default: 'text'.

  • drop_remainder (bool) – When the last batch of data contains a data entry smaller than batch_size, whether to discard the batch and not pass it to the next operation. Default: False.
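
The role of bucket_boundaries can be illustrated with a small, library-free sketch. It assumes the common half-open convention in which n boundaries create n + 1 buckets and a length equal to a boundary falls into the next bucket; `bucket_by_length` is a hypothetical helper, not part of mindnlp.

```python
from bisect import bisect_right
from typing import Dict, List


def bucket_by_length(
    sequences: List[List[int]], bucket_boundaries: List[int]
) -> Dict[int, List[List[int]]]:
    """Group sequences into buckets delimited by strictly increasing boundaries."""
    if any(a >= b for a, b in zip(bucket_boundaries, bucket_boundaries[1:])):
        raise ValueError("bucket_boundaries must be strictly increasing")
    # n boundaries give n + 1 buckets; the last one collects the longest sequences.
    buckets: Dict[int, List[List[int]]] = {
        i: [] for i in range(len(bucket_boundaries) + 1)
    }
    for seq in sequences:
        # Bucket i holds lengths in [boundaries[i - 1], boundaries[i]).
        buckets[bisect_right(bucket_boundaries, len(seq))].append(seq)
    return buckets


buckets = bucket_by_length([[1, 2], [1, 2, 3], [1, 2, 3, 4], [1] * 6], [3, 5])
```

Grouping sequences of similar length before batching keeps the amount of padding per batch small.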

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.
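
The returned vocabulary maps tokens to integer indices. A minimal sketch of that idea follows, assuming frequency-ordered indices and the special tokens `<pad>` and `<unk>`; `build_vocab` and `encode` are hypothetical helpers, and the real Vocab class may order and store tokens differently.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence


def build_vocab(
    token_lists: Iterable[List[str]], specials: Sequence[str] = ("<pad>", "<unk>")
) -> Dict[str, int]:
    """Assign an index to every token, special tokens first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    vocab = {tok: i for i, tok in enumerate(specials)}
    # More frequent tokens get smaller indices; ties are broken alphabetically.
    for tok, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int]) -> List[int]:
    """Replace each token by its index, falling back to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]


vocab = build_vocab([["wall", "st", "bears"], ["wall", "street"]])
ids = encode(["wall", "bears", "claw"], vocab)
```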

Examples

>>> from mindnlp.dataset import AG_NEWS, AG_NEWS_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AG_NEWS()
>>> column = "text"
>>> tokenizer = BasicTokenizer()
>>> agnews_dataset, vocab = AG_NEWS_Process(train_dataset, column=column, tokenizer=tokenizer)
>>> agnews_dataset = agnews_dataset.create_tuple_iterator()
>>> print(next(agnews_dataset))
{'label': Tensor(shape=[], dtype=String, value= '3'), 'text': Tensor(shape=[35],
dtype=Int32, value= [  462,   503,     2,  2102, 47615,  1228,  1766,     3,  1388,
17,    34,    18,    34,     5,  4076,     5, 10244,     4,   462,   434,    19,    13,
14141,    21,  3547,     8,  8356,     5, 38127,     4,    55,  4770,  2987,   390,     2])}
class mindnlp.dataset.text_classification.agnews.Agnews(path)[source]

Bases: object

AG_NEWS dataset source

amazonreviewfull

AmazonReviewFull load function

mindnlp.dataset.text_classification.amazonreviewfull.AmazonReviewFull(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the AmazonReviewFull dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
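
The proxies dict maps a URL scheme to a proxy address. How the loader consumes it is not stated here, so the sketch below only shows the same mapping shape being used with Python's standard library; it sends no request.

```python
import urllib.request

# Same shape as the `proxies` argument: URL scheme -> proxy address.
proxies = {"https": "https://127.0.0.1:7890"}

# ProxyHandler routes requests for each listed scheme through the given proxy.
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# Nothing is downloaded here; opener.open(url) would go through the proxy
# for HTTPS URLs.
print(handler.proxies)
```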

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train,dataset_test = AmazonReviewFull(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.amazonreviewfull.AmazonReviewFull_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the AmazonReviewFull dataset.

Parameters
  • dataset (GeneratorDataset) – AmazonReviewFull dataset.

  • column (str) – The column of the AmazonReviewFull dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import AmazonReviewFull, AmazonReviewFull_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AmazonReviewFull()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> amazonreviewfull_dataset, vocab = AmazonReviewFull_Process(train_dataset, column, tokenizer)
>>> amazonreviewfull_dataset = amazonreviewfull_dataset.create_tuple_iterator()
>>> print(next(amazonreviewfull_dataset))
[Tensor(shape=[], dtype=Int64, value= '3'), Tensor(shape=[27], dtype=Int32, value=
[    53,     37, 912165,   6822,     11,      6,     31,   2589,     13,      5,
   8221,    509,    114,   5478,     16, 126088,      2,     16,     82,    141,
      5,  30284,   2633,     50,      8,      9,     15])]
class mindnlp.dataset.text_classification.amazonreviewfull.Amazonreviewfull(path)[source]

Bases: object

AmazonReviewFull dataset source

amazonreviewpolarity

AmazonReviewPolarity load function

mindnlp.dataset.text_classification.amazonreviewpolarity.AmazonReviewPolarity(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the AmazonReviewPolarity dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train,dataset_test = AmazonReviewPolarity(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.amazonreviewpolarity.AmazonReviewPolarity_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the AmazonReviewPolarity dataset.

Parameters
  • dataset (GeneratorDataset) – AmazonReviewPolarity dataset.

  • column (str) – The column of the AmazonReviewPolarity dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import AmazonReviewPolarity, AmazonReviewPolarity_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = AmazonReviewPolarity()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> amazonreviewpolarity_dataset, vocab = AmazonReviewPolarity_Process(train_dataset, column, tokenizer)
>>> amazonreviewpolarity_dataset = amazonreviewpolarity_dataset.create_tuple_iterator()
>>> print(next(amazonreviewpolarity_dataset))
[Tensor(shape=[], dtype=Int64, value= 2), Tensor(shape=[90], dtype=Int32, value= [277246,     89,
14,      1,    680,     16,   7506,     32,    203,    543,     18,    460,     12,     33,   6923,
1, 146277,     13,     67,    489,     38,     81,      3,     48,   2004,      9,     89,      5,
152,     78,    795,  22921,      0,    170,    137,     12,      3,     28,    567,      1,    170,
32075,   4790,     27,     50,      7,     36,      7,      1,    660,      3,     28,    158,    567,
9,     54,      1,    112,    137,     12,     33,   7683,    277,     41,   6067,  69373,      4,
471,      6,  20149,    991,     21,  10745,   3408,      4,   5257,  24128,      0,     33,     48,
5944,    241,     78,   3043,      5,    392,     12,   5075,   1118,   5075])]
class mindnlp.dataset.text_classification.amazonreviewpolarity.Amazonreviewpolarity(path)[source]

Bases: object

AmazonReviewPolarity dataset source

cola

CoLA load function

mindnlp.dataset.text_classification.cola.CoLA(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the CoLA dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'dev', 'test')
>>> dataset_train,dataset_dev,dataset_test = CoLA(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= 'gj04'), Tensor(shape=[], dtype=String,
value= '1'), Tensor(shape=[], dtype=String, value= "Our friends won't buy
this analysis, let alone the next one we propose.")]
mindnlp.dataset.text_classification.cola.CoLA_Process(dataset, column='sentence', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the CoLA dataset.

Parameters
  • dataset (GeneratorDataset) – CoLA dataset.

  • column (str) – The column of the CoLA dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import CoLA, CoLA_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_dev, dataset_test = CoLA()
>>> column = "sentence"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = CoLA_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
[Tensor(shape=[], dtype=String, value= 'gj04'), Tensor(shape=[], dtype=String, value= '1'),
Tensor(shape=[17], dtype=Int32, value= [ 854,  290,  196,   10,   28,  182,   57,  738,    9,
816, 1372,    1,  768,   99,   71, 5316,    0])]
class mindnlp.dataset.text_classification.cola.Cola(path)[source]

Bases: object

CoLA dataset source

dbpedia

DBpedia load function

mindnlp.dataset.text_classification.dbpedia.DBpedia(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the DBpedia dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train,dataset_test = DBpedia(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.dbpedia.DBpedia_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the DBpedia dataset.

Parameters
  • dataset (GeneratorDataset) – DBpedia dataset.

  • column (str) – The column of the DBpedia dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import DBpedia, DBpedia_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, test_dataset = DBpedia()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = DBpedia_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[51], dtype=Int32, value= [  407,
 0,   347,     0,  7760,   774,  7760,     3, 16106,   407,   347,  7760,   950,
    10,     5,    99, 88888,   485,    69,     2, 16106,  3996,  3092,   156,
42,    73,    20,  1217,     0,    61,   504,    83,     3,   149,  8463,    10,   156,
  2614,     9,  1604,    13,  3267,  1986,  4858,     0,  1730,   485,  1831,
2,   594,     0])]
class mindnlp.dataset.text_classification.dbpedia.Dbpedia(path)[source]

Bases: object

DBpedia dataset source

imdb

IMDB load function

mindnlp.dataset.text_classification.imdb.IMDB(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True, proxies=None)[source]

Load the IMDB dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train,dataset_test = IMDB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.imdb.IMDB_Process(dataset, tokenizer, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)[source]

Process the IMDB dataset.

Parameters
  • dataset (GeneratorDataset) – IMDB dataset.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

  • batch_size (int) – size of the batch.

  • max_len (int) – max length of the sentence.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.

  • drop_remainder (bool) – If True, will drop the last batch for each bucket if it is not a full batch
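
The effect of max_len and drop_remainder can be sketched without any library. The helpers `truncate` and `batch` below are hypothetical, and bucketing is left out so that only truncation and the handling of the last, incomplete batch are shown.

```python
from typing import List


def truncate(ids: List[int], max_len: int) -> List[int]:
    """Keep at most max_len token ids."""
    return ids[:max_len]


def batch(
    rows: List[List[int]], batch_size: int, drop_remainder: bool = False
) -> List[List[List[int]]]:
    """Split rows into consecutive batches of batch_size rows."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    # The last batch may be short; drop it only when requested.
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()
    return batches


rows = [truncate(list(range(n)), max_len=4) for n in (2, 6, 3, 5, 1)]
kept = batch(rows, batch_size=2, drop_remainder=False)
dropped = batch(rows, batch_size=2, drop_remainder=True)
```

With five rows and a batch size of two, `kept` ends with a one-row batch that `dropped` omits.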

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If input_column is not a string.

Examples

>>> imdb_train, imdb_test = load('imdb', shuffle=True)
>>> embedding, vocab = Glove.from_pretrained('6B', 100, special_tokens=["<unk>", "<pad>"], dropout=drop)
>>> tokenizer = BasicTokenizer(True)
>>> imdb_train = process('imdb', imdb_train, tokenizer=tokenizer, vocab=vocab,
...                      bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)
class mindnlp.dataset.text_classification.imdb.Imdb(path, mode)[source]

Bases: object

IMDB dataset source

label_map = {'neg': 0, 'pos': 1}
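
A label_map like the one above turns string labels into integer class ids. The sketch below applies it to a couple of made-up samples; the sample texts and the variable names are illustrative only.

```python
label_map = {"neg": 0, "pos": 1}

# Hypothetical (label name, review text) pairs standing in for raw samples.
raw_samples = [
    ("pos", "A delightful film."),
    ("neg", "Two hours I will never get back."),
]

# Replace each label name by its integer id.
encoded = [(label_map[name], text) for name, text in raw_samples]

# The inverse mapping recovers the label name from an id.
id_to_label = {idx: name for name, idx in label_map.items()}
```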

mnli

MNLI load function

mindnlp.dataset.text_classification.mnli.MNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev_matched', 'dev_mismatched'), proxies=None)[source]

Load the MNLI dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ("train", "dev_matched", "dev_mismatched").

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev_matched", "dev_mismatched")
>>> dataset_train, dataset_dev_matched, dataset_dev_mismatched = MNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.mnli.MNLI_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the MNLI dataset.

Parameters
  • dataset (GeneratorDataset) – MNLI dataset.

  • column (Tuple[str]|str) – The column or columns of the MNLI dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]
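
The accepted shapes of column (a single name or a tuple of names) and the TypeError above can be sketched as a small validation step. `normalize_columns` is a hypothetical helper and not the library's actual check.

```python
from typing import Tuple, Union


def normalize_columns(column: Union[str, Tuple[str, ...]]) -> Tuple[str, ...]:
    """Return the column names as a tuple, rejecting any other input type."""
    if isinstance(column, str):
        return (column,)
    if isinstance(column, tuple) and all(isinstance(c, str) for c in column):
        return column
    raise TypeError(
        f"column must be a str or a tuple of str, got {type(column).__name__}"
    )


single = normalize_columns("sentence1")
pair = normalize_columns(("sentence1", "sentence2"))
```

Normalizing to a tuple lets the rest of the processing loop over columns without a separate branch for the single-column case.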

Examples

>>> from mindnlp.dataset import MNLI, MNLI_Process
>>> dataset_train, dataset_dev_matched, dataset_dev_mismatched = MNLI()
>>> dataset_train, vocab = MNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[12], dtype=Int32, value=
[44002,  3578, 10420,    40,   117,  1363,  9631,    14,   790,     5, 10026,
0]), Tensor(shape=[10], dtype=Int32, value= [ 9387,     5, 10026,    20,    63,
133,  3578, 10420,   113,     0])]
class mindnlp.dataset.text_classification.mnli.Mnli(path)[source]

Bases: object

MNLI dataset source

label_map = {'contradiction': 2, 'entailment': 0, 'neutral': 1}
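
Inverting this label_map is useful when turning model outputs back into label names. The sketch below assumes a list of three per-class scores indexed by class id; `decode` and the example scores are hypothetical.

```python
from typing import List

label_map = {"contradiction": 2, "entailment": 0, "neutral": 1}
id_to_label = {idx: name for name, idx in label_map.items()}


def decode(scores: List[float]) -> str:
    """Return the label name of the highest-scoring class id."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return id_to_label[best]


prediction = decode([0.1, 0.2, 0.7])
```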

mrpc

MRPC load function

mindnlp.dataset.text_classification.mrpc.MRPC(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the MRPC dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train,dataset_test = MRPC(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.mrpc.MRPC_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the MRPC dataset.

Parameters
  • dataset (GeneratorDataset) – MRPC dataset.

  • column (Tuple[str]|str) – The column or columns of the MRPC dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import MRPC, MRPC_Process
>>> dataset_train, dataset_test = MRPC()
>>> dataset_train, vocab = MRPC_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[19],
dtype=Int32, value= [1555,  527,   36, 1838,    1, 1547,   33,  226,    8,    2, 1156,
8,    1,    4, 4932, 9179,   36,  362,    0]), Tensor(shape=[20], dtype=Int32,
value= [5820,    3,  151,   27,  119,    8,    2, 1156,    8,    1, 1555,  527,   36, 1838,
4, 4932, 9179,   36,  362,    0])]
class mindnlp.dataset.text_classification.mrpc.Mrpc(path)[source]

Bases: object

MRPC dataset source

qnli

QNLI load function

mindnlp.dataset.text_classification.qnli.QNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the QNLI dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train,dataset_dev,dataset_test = QNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.qnli.QNLI_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('question', 'sentence'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the QNLI dataset.

Parameters
  • dataset (GeneratorDataset) – QNLI dataset.

  • column (Tuple[str]|str) – The column or columns of the QNLI dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import QNLI, QNLI_Process
>>> dataset_train, dataset_dev, dataset_test = QNLI()
>>> dataset_train, vocab = QNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[8],
dtype=Int32, value= [  49,   25,    0,  382, 2323,  574,  380,    4]),
Tensor(shape=[45], dtype=Int32, value= [ 3377,     0,    65,  1913,   180,    36,
5,    53,     2,     0,  1913,    19,   662,     1,  2323, 26903,  1857,     8,  8531,
5,    63,  9937,  1420,     7,    45,  1325,  3042,  2323,    58,    77,    44,
76653,    70,    46,  3140,     5,    63,  1164,   793,   272,     6,     0,   389,   486,     3])]
class mindnlp.dataset.text_classification.qnli.Qnli(path)[source]

Bases: object

QNLI dataset source

label_map = {'entailment': 0, 'not_entailment': 1}

qqp

QQP load function

mindnlp.dataset.text_classification.qqp.QQP(root: str = '/home/docs/.mindnlp', proxies=None)[source]

Load the QQP dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> dataset_train = QQP(root)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 0), Tensor(shape=[],
dtype=String, value= 'What is the step by step guide to invest
in share market in india?'), Tensor(shape=[], dtype=String, value=
'What is the step by step guide to invest in share market?')]
mindnlp.dataset.text_classification.qqp.QQP_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('question1', 'question2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the QQP dataset.

Parameters
  • dataset (GeneratorDataset) – QQP dataset.

  • column (Tuple[str]|str) – The column or columns of the QQP dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import QQP, QQP_Process
>>> dataset_train = QQP()
>>> dataset_train, vocab = QQP_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 0), Tensor(shape=[15], dtype=Int32, value=
[   2,    4,    1, 1280,   68, 1280, 3038,    6,  601,    8,  805,  407,    8,
633,    0]), Tensor(shape=[13], dtype=Int32, value= [   2,    4,    1, 1280,   68,
1280, 3038,    6,  601,    8,  805,  407,    0])]
class mindnlp.dataset.text_classification.qqp.Qqp(path)[source]

Bases: object

QQP dataset source

rte

RTE load function

mindnlp.dataset.text_classification.rte.RTE(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the RTE dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train,dataset_dev,dataset_test = RTE(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[],
dtype=String, value= 'No Weapons of Mass Destruction Found in Iraq Yet.'),
Tensor(shape=[], dtype=String, value= 'Weapons of Mass Destruction Found in Iraq.')]
mindnlp.dataset.text_classification.rte.RTE_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the RTE dataset.

Parameters
  • dataset (GeneratorDataset) – RTE dataset

  • column (Tuple[str]|str) – The column or columns of the RTE dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import RTE, RTE_Process
>>> dataset_train, dataset_dev, dataset_test = RTE()
>>> dataset_train, vocab = RTE_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[10],
dtype=Int32, value= [ 882, 3696,    3, 3599, 7046, 7175,    4,   79, 4518,    0]),
Tensor(shape=[8], dtype=Int32, value= [3696,    3, 3599, 7046, 7175,    4,   79,    0])]
class mindnlp.dataset.text_classification.rte.Rte(path)[source]

Bases: object

RTE dataset source

label_map = {'entailment': 0, 'not_entailment': 1}

sogounews

SogouNews load function

mindnlp.dataset.text_classification.sogounews.SogouNews(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the SogouNews dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train,dataset_test = SogouNews(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
class mindnlp.dataset.text_classification.sogounews.Sogounews(path)[source]

Bases: object

SogouNews dataset source

sst2

SST2 load function

mindnlp.dataset.text_classification.sst2.SST2(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the SST2 dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train,dataset_dev,dataset_test = SST2(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '0'), Tensor(shape=[], dtype=String,
value= 'hide new secretions from the parental units ')]
mindnlp.dataset.text_classification.sst2.SST2_Process(dataset, column='text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the SST2 dataset.

Parameters
  • dataset (GeneratorDataset) – SST2 dataset.

  • column (str) – The column of the SST2 dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import SST2, SST2_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_dev, test_dataset = SST2()
>>> column = "text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = SST2_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
{'label': Tensor(shape=[], dtype=String, value= '0'), 'text': Tensor(shape=[7],
dtype=Int32, value= [ 4699,    92, 12483,    36,     0,  7598,  9597])}
class mindnlp.dataset.text_classification.sst2.Sst2(path)[source]

Bases: object

SST2 dataset source

stsb

STSB load function

mindnlp.dataset.text_classification.stsb.STSB(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the STSB dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict of proxies, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = STSB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[], dtype=Float64,
value= 5), Tensor(shape=[], dtype=String, value= 'A plane is taking off.'),
Tensor(shape=[], dtype=String, value= 'An air plane is taking off.')]
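
The return rule described for the split argument (a single split yields one dataset, several splits yield a list) can be sketched in plain Python. This is only an illustration under stated assumptions: `select_splits` and the `available` mapping are hypothetical names, the `ValueError` for unknown split names is an assumption, and none of this is taken from the MindNLP source.

```python
def select_splits(available, split=("train", "dev", "test")):
    """Return one dataset for a single split name, or a list for several.

    `available` is a hypothetical mapping from split name to a loaded dataset.
    """
    names = [split] if isinstance(split, str) else list(split)
    unknown = [name for name in names if name not in available]
    if unknown:
        # Assumption: reject names that are not known splits.
        raise ValueError(f"unknown split(s): {unknown}")
    datasets = [available[name] for name in names]
    # Mirror the documented rule: a single requested split is returned unwrapped.
    return datasets[0] if len(datasets) == 1 else datasets


available = {"train": ["train-row"], "dev": ["dev-row"], "test": ["test-row"]}
only_train = select_splits(available, "train")
train_and_dev = select_splits(available, ("train", "dev"))
```

With this convention, callers that request one split can use the result directly, while callers that request several must unpack the list in the order given.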
mindnlp.dataset.text_classification.stsb.STSB_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the STSB dataset: tokenize the selected column or columns and map their tokens to vocabulary indices.

Parameters
  • dataset (GeneratorDataset) – STSB dataset.

  • column (Tuple[str]|str) – the column or columns of the STSB dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import STSB, STSB_Process
>>> dataset_train, dataset_dev, dataset_test = STSB()
>>> dataset_train, vocab = STSB_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=Int64, value= 1), Tensor(shape=[], dtype=Float64,
value= 5), Tensor(shape=[6], dtype=Int32, value= [  5, 263,   6, 448, 135,   0]),
Tensor(shape=[7], dtype=Int32, value= [329, 242, 263,   6, 448, 135,   0])]
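
The column argument accepts either one column name or a tuple of names, and a TypeError is documented for anything else. A minimal sketch of such a check follows; `normalize_columns` is a hypothetical helper written for this illustration, and the actual validation inside MindNLP may differ.

```python
def normalize_columns(column=("sentence1", "sentence2")):
    """Return the column names as a list, accepting a str or a tuple of str."""
    if isinstance(column, str):
        return [column]
    if isinstance(column, tuple) and all(isinstance(c, str) for c in column):
        return list(column)
    # Anything else (a number, a list, a tuple with non-str items) is rejected.
    raise TypeError(
        f"column must be a str or a tuple of str, got {type(column).__name__}"
    )


single = normalize_columns("sentence1")
both = normalize_columns()

try:
    normalize_columns(3)
    rejected = False
except TypeError:
    rejected = True
```

Normalizing to a list up front lets the rest of a processing function loop over the columns without branching on the argument type.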
class mindnlp.dataset.text_classification.stsb.Stsb(path)[source]

Bases: object

STSB dataset source

wnli

WNLI load function

mindnlp.dataset.text_classification.wnli.WNLI(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)[source]

Load the WNLI dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp

  • split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘dev’, ‘test’).

  • proxies (dict) – A dict specifying proxies, for example: {“https”: “https://127.0.0.1:7890”}.

Returns

  • datasets_list (list) - A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "dev", "test")
>>> dataset_train, dataset_dev, dataset_test = WNLI(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= '1'), Tensor(shape=[], dtype=String,
value= 'I stuck a pin through a carrot. When I pulled the pin out, it had a hole.'),
Tensor(shape=[], dtype=String, value= 'The carrot had a hole.')]
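
The proxies argument is a dict that maps a URL scheme to a proxy address. How MindNLP forwards this dict to its downloader is not stated here, so the following is only an illustration of the same mapping shape using Python's standard library; it constructs an opener but performs no network request.

```python
import urllib.request

# Same shape as the documented example: scheme -> proxy URL.
proxies = {"https": "https://127.0.0.1:7890"}

# ProxyHandler accepts a scheme-to-proxy mapping; build_opener wires it in.
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# Calling opener.open(url) would route https requests through the proxy.
# It is deliberately not called here.
```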
mindnlp.dataset.text_classification.wnli.WNLI_Process(dataset, column: ~typing.Union[~typing.Tuple[str], str] = ('sentence1', 'sentence2'), tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the WNLI dataset: tokenize the selected column or columns and map their tokens to vocabulary indices.

Parameters
  • dataset (GeneratorDataset) – WNLI dataset.

  • column (Tuple[str]|str) – the column or columns of the WNLI dataset to be transformed

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index

Returns

  • dataset (MapDataset) - dataset after transforms

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string or Tuple[str]

Examples

>>> from mindnlp.dataset import WNLI, WNLI_Process
>>> dataset_train, dataset_dev, dataset_test = WNLI()
>>> dataset_train, vocab = WNLI_Process(dataset_train)
>>> dataset_train = dataset_train.create_tuple_iterator()
>>> print(next(dataset_train))
[Tensor(shape=[], dtype=String, value= '1'), Tensor(shape=[20],
dtype=Int32, value= [  23, 1102,    6,  341,  109,    6,  607,    0,  105,   23,  468,
1,  341,   33,    2,    9,   14,    6,  182,    0]), Tensor(shape=[6], dtype=Int32,
value= [  7, 607,  14,   6, 182,   0])]
class mindnlp.dataset.text_classification.wnli.Wnli(path)[source]

Bases: object

WNLI dataset source

yahooanswers

YahooAnswers load function

mindnlp.dataset.text_classification.yahooanswers.YahooAnswers(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the YahooAnswers dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ‘~/.mindnlp’

  • split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘test’).

  • proxies (dict) – A dict specifying proxies, for example: {“https”: “https://127.0.0.1:7890”}.

Returns

  • datasets_list (list) - A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YahooAnswers(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.yahooanswers.YahooAnswers_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the YahooAnswers dataset: tokenize the selected column and map its tokens to vocabulary indices.

Parameters
  • dataset (GeneratorDataset) – YahooAnswers dataset.

  • column (str) – the column of the YahooAnswers dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import YahooAnswers, YahooAnswers_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YahooAnswers()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YahooAnswers_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
class mindnlp.dataset.text_classification.yahooanswers.Yahooanswers(path)[source]

Bases: object

YahooAnswers dataset source

yelpreviewfull

YelpReviewFull load function

mindnlp.dataset.text_classification.yelpreviewfull.YelpReviewFull(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the YelpReviewFull dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ‘~/.mindnlp’

  • split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘test’).

  • proxies (dict) – A dict specifying proxies, for example: {“https”: “https://127.0.0.1:7890”}.

Returns

  • datasets_list (list) - A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YelpReviewFull(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.yelpreviewfull.YelpReviewFull_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the YelpReviewFull dataset: tokenize the selected column and map its tokens to vocabulary indices.

Parameters
  • dataset (GeneratorDataset) – YelpReviewFull dataset.

  • column (str) – the column of the YelpReviewFull dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import YelpReviewFull, YelpReviewFull_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YelpReviewFull()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YelpReviewFull_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
{'label': Tensor(shape=[], dtype=Int64, value= 5), 'title_text': Tensor(shape=[117],
dtype=Int32, value= [  6338,      0, 258139,   1500,    265,    139,    295,     12,     15,
6,   1344,  17531,      0,    101,      8,     28,    106,      3,    702,     7,    842,      7,
364,    199,  11063,    277,    101,      8,     28,    152,     25,     57,     15,   1076,
225,   4021,    277,    101,      8,     28,  12202,     19,      6,    308,     20,   1638,   3077,
43, 287710,     38,     76,     23,   1802,     27,   1151,      7,     44,     14,     53,   1617,
15,    852,    185,   1865,      3,    21,    248,   3990,    277,      3,     21,     67,     52,
16374,      7,    169,  19483,    364,    390,      7,    169,    279,  138,      0,     75,      2,
79,     81,    103,     21,    248,     63,    139,      8,     99,    570,     51,    387,      7,
143,     10,    155,   1532,    139,     27,     64,    279,      2,     18,    139,      8,     99,
75,   9730,      6,   6598,      0])}
class mindnlp.dataset.text_classification.yelpreviewfull.Yelpreviewfull(path)[source]

Bases: object

YelpReviewFull dataset source

yelpreviewpolarity

YelpReviewPolarity load function

mindnlp.dataset.text_classification.yelpreviewpolarity.YelpReviewPolarity(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)[source]

Load the YelpReviewPolarity dataset

Parameters
  • root (str) – Directory where the datasets are saved. Default: ‘~/.mindnlp’

  • split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘test’).

  • proxies (dict) – A dict specifying proxies, for example: {“https”: “https://127.0.0.1:7890”}.

Returns

  • datasets_list (list) - A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Examples

>>> root = "~/.mindnlp"
>>> split = ("train", "test")
>>> dataset_train, dataset_test = YelpReviewPolarity(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
mindnlp.dataset.text_classification.yelpreviewpolarity.YelpReviewPolarity_Process(dataset, column='title_text', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Process the YelpReviewPolarity dataset: tokenize the selected column and map its tokens to vocabulary indices.

Parameters
  • dataset (GeneratorDataset) – YelpReviewPolarity dataset.

  • column (str) – the column of the YelpReviewPolarity dataset to be transformed.

  • tokenizer (TextTensorOperation) – tokenizer you choose to tokenize the text dataset.

  • vocab (Vocab) – vocabulary object, used to store the mapping of token and index.

Returns

  • dataset (MapDataset) - dataset after transforms.

  • Vocab (Vocab) - vocab created from dataset

Raises

TypeError – If column is not a string.

Examples

>>> from mindnlp.dataset import YelpReviewPolarity, YelpReviewPolarity_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset, dataset_test = YelpReviewPolarity()
>>> column = "title_text"
>>> tokenizer = BasicTokenizer()
>>> train_dataset, vocab = YelpReviewPolarity_Process(train_dataset, column, tokenizer)
>>> train_dataset = train_dataset.create_tuple_iterator()
>>> print(next(train_dataset))
class mindnlp.dataset.text_classification.yelpreviewpolarity.Yelpreviewpolarity(path)[source]

Bases: object

YelpReviewPolarity dataset source

TextClassification dataset init