Machine Translation

iwslt2016

IWSLT2016 load function

mindnlp.dataset.machine_translation.iwslt2016.IWSLT2016(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014', proxies=None)[source]

Load the IWSLT2016 dataset

The available datasets include the following:

Language pairs:

	“en”	“fr”	“de”	“cs”	“ar”
“en”		x	x	x	x
“fr”	x
“de”	x
“cs”	x
“ar”	x

valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]

Parameters

root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing the source and target languages. Default: (‘de’, ‘en’).
valid_set (str) – Name of the validation set. Default: “tst2013”.
test_set (str) – Name of the test set. Default: “tst2014”.
proxies (dict) – Proxy settings, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.

Returns

datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.
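
The return rule can be pictured with a small stand-in. The sketch below is not MindNLP code: `select_splits` is a hypothetical helper, and plain strings stand in for dataset objects. It assumes that a single requested split comes back bare, while several splits come back as a list in the order requested.

```python
from typing import Dict, List, Tuple, Union


def select_splits(loaded: Dict[str, object],
                  split: Union[str, Tuple[str, ...]]) -> Union[object, List[object]]:
    """Hypothetical helper mirroring the documented return rule."""
    # Accept either one split name or a tuple of names.
    names = (split,) if isinstance(split, str) else tuple(split)
    datasets_list = [loaded[name] for name in names]
    # A single requested split is returned bare, not wrapped in a list.
    if len(datasets_list) == 1:
        return datasets_list[0]
    return datasets_list


# Strings stand in for real dataset objects.
loaded = {"train": "TRAIN", "valid": "VALID", "test": "TEST"}
single = select_splits(loaded, "train")
several = select_splits(loaded, ("train", "test"))
```

This is why the examples in this section unpack three values when all three splits are requested, but bind a single name when only one split is passed.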

Raises

TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is not in the range of supported datasets.
RuntimeError – If valid_set is not in the range of supported datasets.
RuntimeError – If test_set is not in the range of supported datasets.
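
The argument checks listed above can be sketched in plain Python. This is an illustration only, not the library's implementation: `check_iwslt2016_args` is a hypothetical name, and the supported values are copied from the tables in this section.

```python
from typing import Tuple

# Values copied from the language-pair table and the valid/test set list.
_OTHERS = ("fr", "de", "cs", "ar")
SUPPORTED_PAIRS = {("en", x) for x in _OTHERS} | {(x, "en") for x in _OTHERS}
SUPPORTED_SETS = ("dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014")


def check_iwslt2016_args(language_pair: Tuple[str, ...] = ("de", "en"),
                         valid_set: str = "tst2013",
                         test_set: str = "tst2014") -> None:
    """Hypothetical validator raising the documented exception types."""
    if not isinstance(language_pair, tuple) or not all(
            isinstance(code, str) for code in language_pair):
        raise TypeError("language_pair must be a Tuple[str]")
    if len(language_pair) != 2:
        raise RuntimeError("language_pair must contain exactly 2 languages")
    if language_pair not in SUPPORTED_PAIRS:
        raise RuntimeError(f"unsupported language pair: {language_pair}")
    if valid_set not in SUPPORTED_SETS:
        raise RuntimeError(f"unsupported valid_set: {valid_set}")
    if test_set not in SUPPORTED_SETS:
        raise RuntimeError(f"unsupported test_set: {test_set}")


check_iwslt2016_args(("en", "fr"), "dev2010", "tst2010")  # passes silently
```

Note that every supported pair includes “en”, so a pair such as (‘fr’, ‘de’) would be rejected under this reading of the table.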

Examples

>>> from mindnlp.dataset.machine_translation.iwslt2016 import IWSLT2016
>>> root = "~/.mindnlp"
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> valid_set = "tst2013"
>>> test_set = "tst2014"
>>> dataset_train, dataset_valid, dataset_test = IWSLT2016(root, split,
...     language_pair, valid_set, test_set)
>>> train_iter = dataset_train.create_dict_iterator()
>>> print(next(train_iter))
{'text': Tensor(shape=[], dtype=String, value= \
    'David Gallo: Das ist Bill Lange. Ich bin Dave Gallo.'),
'translation': Tensor(shape=[], dtype=String, value= \
    "David Gallo: This is Bill Lange. I'm Dave Gallo.")}

iwslt2017

IWSLT2017 load function

mindnlp.dataset.machine_translation.iwslt2017.IWSLT2017(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair=('de', 'en'), proxies=None)[source]

Load the IWSLT2017 dataset

The available datasets include the following:

Language pairs:

	“en”	“nl”	“de”	“it”	“ro”
“en”		x	x	x	x
“nl”	x		x	x	x
“de”	x	x		x	x
“it”	x	x	x		x
“ro”	x	x	x	x
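
Read as a set, the table above contains every ordered pair of two distinct languages. A minimal sketch of that reading follows; the names `LANGUAGES` and `SUPPORTED_PAIRS` are illustrative and are not part of MindNLP.

```python
from itertools import permutations

# Language codes copied from the table header.
LANGUAGES = ("en", "nl", "de", "it", "ro")

# Every ordered (source, target) pair of two different languages.
SUPPORTED_PAIRS = set(permutations(LANGUAGES, 2))

print(len(SUPPORTED_PAIRS))           # 20
print(("de", "en") in SUPPORTED_PAIRS)  # True
print(("en", "en") in SUPPORTED_PAIRS)  # False
```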

Parameters

root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing the source and target languages. Default: (‘de’, ‘en’).
proxies (dict) – Proxy settings, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.

Returns

datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Raises

TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is not in the range of supported datasets.

Examples

>>> from mindnlp.dataset import IWSLT2017
>>> root = "~/.mindnlp"
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> dataset_train, dataset_valid, dataset_test = IWSLT2017(root, split, language_pair)
>>> train_iter = dataset_train.create_dict_iterator()
>>> print(next(train_iter))
{'text': Tensor(shape=[], dtype=String, value= 'Vielen Dank, Chris.'),
'translation': Tensor(shape=[], dtype=String, value= 'Thank you so much, Chris.')}

mindnlp.dataset.machine_translation.iwslt2017.IWSLT2017_Process(dataset, column='translation', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)[source]

Transform a specific language column of the IWSLT2017 dataset into tensors.

Parameters

dataset (GeneratorDataset, ZipDataset) – The IWSLT2017 dataset.
column (str) – The name of the language column in IWSLT2017. Default: ‘translation’.
tokenizer (TextTensorOperation) – The tokenizer to use. Default: BasicTokenizer.
vocab (Vocab) – The vocab to use. Default: None. If None, a new vocab is created.

Returns

MapDataset – The dataset after processing.
Vocab – The new vocab created from the dataset if ‘vocab’ is None.

Raises

TypeError – If the language column name is not a string.
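
To make the described behaviour concrete, here is a library-free sketch of the same idea: tokenize one column, build a vocabulary when none is supplied, and replace tokens with indices. Everything in it is an assumption for illustration: `process_column` is a hypothetical name, rows are plain dicts, `str.split` stands in for the tokenizer, and a dict stands in for `Vocab`. Unlike the documented function, this sketch always returns the vocabulary.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]


def process_column(rows: List[Dict[str, str]],
                   column: str = "translation",
                   tokenizer: Callable[[str], List[str]] = str.split,
                   vocab: Optional[Dict[str, int]] = None
                   ) -> Tuple[List[Row], Dict[str, int]]:
    """Hypothetical stand-in: turn one text column into lists of token ids."""
    if not isinstance(column, str):
        raise TypeError("column must be a string")
    tokenized = [tokenizer(row[column]) for row in rows]
    if vocab is None:
        # Build a new vocabulary; ids follow first-seen order.
        vocab = {}
        for tokens in tokenized:
            for token in tokens:
                vocab.setdefault(token, len(vocab))
    # Only the chosen column changes; other columns are copied through.
    # Tokens missing from a supplied vocab raise KeyError in this sketch.
    processed = [{**row, column: [vocab[token] for token in tokens]}
                 for row, tokens in zip(rows, tokenized)]
    return processed, vocab


rows = [{"text": "Vielen Dank, Chris.",
         "translation": "Thank you so much, Chris."}]
processed, vocab = process_column(rows)
print(processed[0]["translation"])  # [0, 1, 2, 3, 4]
```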

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.dataset import IWSLT2017, IWSLT2017_Process
>>> test_dataset = IWSLT2017(
...     root='./dataset',
...     split="test",
...     language_pair=("de", "en")
... )
>>> test_dataset, vocab = IWSLT2017_Process(test_dataset, "translation",
...     text.BasicTokenizer())
>>> for i in test_dataset.create_tuple_iterator():
...     print(i)
...     break

multi30k

Multi30k load function

mindnlp.dataset.machine_translation.multi30k.Multi30k(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'), proxies=None)[source]

Load the Multi30k dataset

Parameters

root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing the source and target languages. Default: (‘de’, ‘en’).
proxies (dict) – Proxy settings, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.

Returns

datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.

Raises

TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is neither (‘de’, ‘en’) nor (‘en’, ‘de’).

Examples

>>> import os
>>> from mindnlp.dataset.machine_translation.multi30k import Multi30k
>>> root = os.path.join(os.path.expanduser('~'), ".mindnlp")
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> dataset_train, dataset_valid, dataset_test = Multi30k(root, split, language_pair)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value=\
    'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.'),
Tensor(shape=[], dtype=String, value= 'A man in an orange hat starring at something.')]

mindnlp.dataset.machine_translation.multi30k.Multi30k_Process(dataset, vocab, batch_size=64, max_len=500, drop_remainder=False)[source]

Process the Multi30k dataset.

Parameters

dataset (GeneratorDataset) – The Multi30k dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – The number of rows in each batch. Default: 64.
max_len (int) – The maximum sentence length. Default: 500.
drop_remainder (bool) – Whether to discard the last batch, rather than pass it to the next operation, when it contains fewer than batch_size rows. Default: False.

Returns

dataset (MapDataset) – The dataset after the transforms are applied.

Raises

TypeError – If input_column is not a string.
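
The interaction of batch_size, max_len and drop_remainder can be sketched without MindSpore. The helper below is hypothetical and makes several assumptions that this page does not confirm: sentences longer than max_len are truncated, each batch is padded to its longest sentence, and the padding id is 0 (the index ‘<pad>’ would get when it is the first special token).

```python
from typing import List


def batch_ids(sequences: List[List[int]],
              batch_size: int = 64,
              max_len: int = 500,
              drop_remainder: bool = False,
              pad_id: int = 0) -> List[List[List[int]]]:
    """Hypothetical sketch: truncate, pad and batch lists of token ids."""
    # Assumption: sentences longer than max_len are cut, not dropped.
    truncated = [seq[:max_len] for seq in sequences]
    batches = []
    for start in range(0, len(truncated), batch_size):
        chunk = truncated[start:start + batch_size]
        # Only the last chunk can be short; optionally discard it.
        if len(chunk) < batch_size and drop_remainder:
            break
        width = max(len(seq) for seq in chunk)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in chunk])
    return batches


sequences = [[5, 6, 7], [8], [9, 10]]
kept = batch_ids(sequences, batch_size=2, max_len=2)
dropped = batch_ids(sequences, batch_size=2, max_len=2, drop_remainder=True)
print(kept)     # [[[5, 6], [8, 0]], [[9, 10]]]
print(dropped)  # [[[5, 6], [8, 0]]]
```

With three sentences and a batch size of 2, the final batch holds a single row, so it is kept by default and discarded when drop_remainder is True.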

Examples

>>> import os
>>> from mindspore.dataset import text
>>> from mindnlp.dataset.machine_translation.multi30k import Multi30k, Multi30k_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset = Multi30k(
...     root=os.path.join(os.path.expanduser('~'), ".mindnlp"),
...     split="train",
...     language_pair=("de", "en")
... )
>>> tokenizer = BasicTokenizer(True)
>>> train_dataset = train_dataset.map([tokenizer], 'en')
>>> train_dataset = train_dataset.map([tokenizer], 'de')
>>> en_vocab = text.Vocab.from_dataset(train_dataset, 'en',
...     special_tokens=['<pad>', '<unk>'], special_first=True)
>>> de_vocab = text.Vocab.from_dataset(train_dataset, 'de',
...     special_tokens=['<pad>', '<unk>'], special_first=True)
>>> vocab = {'en': en_vocab, 'de': de_vocab}
>>> train_dataset = Multi30k_Process(train_dataset, vocab=vocab)

MachineTranslation dataset init