Machine Translation
iwslt2016
IWSLT2016 load function
- mindnlp.dataset.machine_translation.iwslt2016.IWSLT2016(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair=('de', 'en'), valid_set='tst2013', test_set='tst2014', proxies=None)
Load the IWSLT2016 dataset
The available datasets include the following:
Language pairs (x = supported, - = not supported):
        “en”  “fr”  “de”  “cs”  “ar”
“en”     -     x     x     x     x
“fr”     x     -     -     -     -
“de”     x     -     -     -     -
“cs”     x     -     -     -     -
“ar”     x     -     -     -     -
valid/test sets: [“dev2010”, “tst2010”, “tst2011”, “tst2012”, “tst2013”, “tst2014”]
- Parameters
root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing src and tgt language. Default: (‘de’, ‘en’).
valid_set (str) – Name of the validation set. Default: “tst2013”.
test_set (str) – Name of the test set. Default: “tst2014”.
proxies (dict) – Proxy settings to use for the download, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.
- Returns
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.
- Raises
TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is not in the range of supported datasets.
RuntimeError – If valid_set is not in the range of supported datasets.
RuntimeError – If test_set is not in the range of supported datasets.
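The checks in the Raises list above can be pictured with a small pure-Python sketch. It is only an illustration and not mindnlp's actual implementation: the helper name `check_iwslt2016_args`, the two constants, and the error messages are hypothetical, and the support matrix is an assumption read from the language-pair listing above (English paired with each of the other four languages).

```python
# Illustrative sketch only; not taken from the mindnlp source.
# Assumed support matrix: "en" pairs with every other language, in both directions.
SUPPORTED_PAIRS = {
    "en": {"fr", "de", "cs", "ar"},
    "fr": {"en"},
    "de": {"en"},
    "cs": {"en"},
    "ar": {"en"},
}
SUPPORTED_SETS = ("dev2010", "tst2010", "tst2011", "tst2012", "tst2013", "tst2014")


def check_iwslt2016_args(root, split, language_pair, valid_set, test_set):
    """Hypothetical helper that mirrors the documented Raises conditions."""
    if not isinstance(root, str):
        raise TypeError("root must be a string")
    split_is_str_tuple = isinstance(split, tuple) and all(
        isinstance(name, str) for name in split
    )
    if not isinstance(split, str) and not split_is_str_tuple:
        raise TypeError("split must be a string or a tuple of strings")
    if not isinstance(language_pair, tuple) or not all(
        isinstance(lang, str) for lang in language_pair
    ):
        raise TypeError("language_pair must be a tuple of strings")
    if len(language_pair) != 2:
        raise RuntimeError("language_pair must contain exactly two languages")
    src, tgt = language_pair
    if tgt not in SUPPORTED_PAIRS.get(src, set()):
        raise RuntimeError(f"unsupported language pair: {language_pair}")
    if valid_set not in SUPPORTED_SETS:
        raise RuntimeError(f"unsupported valid_set: {valid_set}")
    if test_set not in SUPPORTED_SETS:
        raise RuntimeError(f"unsupported test_set: {test_set}")
```

Under these assumptions, `check_iwslt2016_args("~/.mindnlp", "train", ("de", "en"), "tst2013", "tst2014")` passes silently, while `("fr", "de")` as the language pair raises a RuntimeError because neither side is English.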
Examples
>>> from mindnlp.dataset.machine_translation.iwslt2016 import IWSLT2016
>>> root = "~/.mindnlp"
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> valid_set = "tst2013"
>>> test_set = "tst2014"
>>> dataset_train, dataset_valid, dataset_test = IWSLT2016(root, split,
...     language_pair, valid_set, test_set)
>>> train_iter = dataset_train.create_dict_iterator()
>>> print(next(train_iter))
{'text': Tensor(shape=[], dtype=String, value=
'David Gallo: Das ist Bill Lange. Ich bin Dave Gallo.'),
'translation': Tensor(shape=[], dtype=String, value=
"David Gallo: This is Bill Lange. I'm Dave Gallo.")}
iwslt2017
IWSLT2017 load function
- mindnlp.dataset.machine_translation.iwslt2017.IWSLT2017(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair=('de', 'en'), proxies=None)
Load the IWSLT2017 dataset
The available datasets include the following:
Language pairs (x = supported, - = not supported):
        “en”  “nl”  “de”  “it”  “ro”
“en”     -     x     x     x     x
“nl”     x     -     x     x     x
“de”     x     x     -     x     x
“it”     x     x     x     -     x
“ro”     x     x     x     x     -
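Read as a grid, the listing above marks four entries for each of the five languages, which suggests that every ordered pair of two distinct languages is available. A minimal sketch of that reading follows; the names `LANGUAGES`, `SUPPORTED_PAIRS`, and `is_supported_pair` are hypothetical and do not come from mindnlp.

```python
from itertools import permutations

# The five languages named in the listing above.
LANGUAGES = ("en", "nl", "de", "it", "ro")

# Assumption: every ordered (src, tgt) pair of two distinct languages is supported.
SUPPORTED_PAIRS = set(permutations(LANGUAGES, 2))


def is_supported_pair(language_pair):
    """Hypothetical check for a (src, tgt) tuple against the assumed grid."""
    return tuple(language_pair) in SUPPORTED_PAIRS


print(len(SUPPORTED_PAIRS))             # 20, i.e. 5 languages x 4 partners
print(is_supported_pair(("de", "en")))  # True
print(is_supported_pair(("en", "en")))  # False
```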
- Parameters
root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing src and tgt language. Default: (‘de’, ‘en’).
proxies (dict) – Proxy settings to use for the download, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.
- Returns
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.
- Raises
TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is not in the range of supported datasets.
Examples
>>> from mindnlp.dataset.machine_translation.iwslt2017 import IWSLT2017
>>> root = "~/.mindnlp"
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> dataset_train, dataset_valid, dataset_test = IWSLT2017(root, split, language_pair)
>>> train_iter = dataset_train.create_dict_iterator()
>>> print(next(train_iter))
{'text': Tensor(shape=[], dtype=String, value=
'Vielen Dank, Chris.'),
'translation': Tensor(shape=[], dtype=String, value=
'Thank you so much, Chris.')}
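The Returns entry for the load function describes two shapes of result: a list when several splits are requested, and a bare dataset when only one is. The sketch below illustrates that rule with placeholder strings standing in for datasets. The helper `select_splits` is hypothetical, and treating a one-element tuple like a single string is an assumption, not something the source confirms.

```python
def select_splits(loaded, split):
    """Hypothetical helper: return one dataset, or a list when several are requested."""
    # Normalise a single split name to a one-element tuple.
    names = (split,) if isinstance(split, str) else tuple(split)
    datasets_list = [loaded[name] for name in names]
    # A single requested split is returned directly rather than wrapped in a list.
    return datasets_list[0] if len(datasets_list) == 1 else datasets_list


# Placeholder objects standing in for the loaded datasets.
loaded = {"train": "train-data", "valid": "valid-data", "test": "test-data"}

print(select_splits(loaded, "test"))              # test-data
print(select_splits(loaded, ("train", "valid")))  # ['train-data', 'valid-data']
```

This is why unpacking into three names, as in the example above, only works when all three splits are requested.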
- mindnlp.dataset.machine_translation.iwslt2017.IWSLT2017_Process(dataset, column='translation', tokenizer=<mindnlp.dataset.transforms.tokenizers.BasicTokenizer object>, vocab=None)
Transforms a specific language column of the IWSLT2017 dataset into tensors.
- Parameters
dataset (GeneratorDataset, ZipDataset) – The IWSLT2017 dataset.
column (str) – The name of the language column in IWSLT2017 to transform. Default: ‘translation’.
tokenizer (TextTensorOperation) – The tokenizer to use. Default: a BasicTokenizer.
vocab (Vocab) – The vocab to use. Default: None. If None, a new vocab is created from the dataset.
- Returns
MapDataset, the dataset after processing.
Vocab, the new vocab created from the dataset if ‘vocab’ is None.
- Raises
TypeError – If language is not a string.
Examples
>>> from mindspore.dataset import text
>>> from mindnlp.dataset import IWSLT2017, IWSLT2017_Process
>>> test_dataset = IWSLT2017(
...     root='./dataset',
...     split="test",
...     language_pair=("de", "en")
... )
>>> test_dataset, vocab = IWSLT2017_Process(test_dataset, "translation",
...     text.BasicTokenizer())
>>> for i in test_dataset.create_tuple_iterator():
...     print(i)
...     break
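To make the tokenize-then-look-up idea behind this function concrete, here is a pure-Python sketch that works on plain dictionaries instead of a MindSpore dataset. It is not the library's implementation. The helpers `build_vocab` and `process_column`, the whitespace tokenizer, the `<pad>`/`<unk>` special tokens, and the frequency-based ordering are all assumptions made for illustration. For simplicity the sketch returns the vocab in every case, whereas the documentation above only states that a new vocab is returned when `vocab` is None.

```python
from collections import Counter


def build_vocab(token_lists, special_tokens=("<pad>", "<unk>")):
    """Hypothetical vocab: special tokens first, then tokens by descending frequency."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: idx for idx, tok in enumerate(list(special_tokens) + ordered)}


def process_column(rows, column="translation", tokenize=str.split, vocab=None):
    """Hypothetical stand-in: tokenize one column and replace tokens with ids."""
    if not isinstance(column, str):
        raise TypeError("column must be a string")
    tokenized = [tokenize(row[column]) for row in rows]
    if vocab is None:
        # No vocab supplied, so build one from the column being processed.
        vocab = build_vocab(tokenized)
    unk_id = vocab["<unk>"]  # assumes the vocab defines an "<unk>" entry
    processed = [
        {**row, column: [vocab.get(tok, unk_id) for tok in toks]}
        for row, toks in zip(rows, tokenized)
    ]
    return processed, vocab


rows = [{"text": "Vielen Dank , Chris .", "translation": "Thank you so much , Chris ."}]
processed, vocab = process_column(rows)
print(processed[0]["translation"])
```

Only the selected column is converted; the other column is left as text, which matches the per-column design of the documented signature.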
multi30k
Multi30k load function
- mindnlp.dataset.machine_translation.multi30k.Multi30k(root: str = '/home/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'valid', 'test'), language_pair: Tuple[str] = ('de', 'en'), proxies=None)
Load the Multi30k dataset
- Parameters
root (str) – Directory where the datasets are saved. Default: “~/.mindnlp”
split (str|Tuple[str]) – Split or splits to be returned. Default: (‘train’, ‘valid’, ‘test’).
language_pair (Tuple[str]) – Tuple containing src and tgt language. Default: (‘de’, ‘en’).
proxies (dict) – Proxy settings to use for the download, for example: {“https”: “https://127.0.0.1:7890”}. Default: None.
- Returns
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as ‘train’, that dataset is returned instead of a list of datasets.
- Raises
TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
TypeError – If language_pair is not a Tuple[str].
RuntimeError – If the length of language_pair is not 2.
RuntimeError – If language_pair is neither (‘de’, ‘en’) nor (‘en’, ‘de’).
Examples
>>> import os
>>> from mindnlp.dataset.machine_translation.multi30k import Multi30k
>>> root = os.path.join(os.path.expanduser('~'), ".mindnlp")
>>> split = ('train', 'valid', 'test')
>>> language_pair = ('de', 'en')
>>> dataset_train, dataset_valid, dataset_test = Multi30k(root, split, language_pair)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value=
'Ein Mann mit einem orangefarbenen Hut, der etwas anstarrt.'),
Tensor(shape=[], dtype=String, value=
'A man in an orange hat starring at something.')]
- mindnlp.dataset.machine_translation.multi30k.Multi30k_Process(dataset, vocab, batch_size=64, max_len=500, drop_remainder=False)
Process the Multi30k dataset.
- Parameters
dataset (GeneratorDataset) – Multi30k dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – The number of rows in each batch. Default: 64.
max_len (int) – The maximum length of a sentence. Default: 500.
drop_remainder (bool) – Whether to discard the last batch, instead of passing it to the next operation, when it contains fewer than batch_size rows. Default: False.
- Returns
dataset (MapDataset) – The dataset after the transforms are applied.
- Raises
TypeError – If input_column is not a string.
Examples
>>> import os
>>> from mindspore.dataset import text
>>> from mindnlp.dataset.machine_translation.multi30k import Multi30k, Multi30k_Process
>>> from mindnlp.dataset.transforms.tokenizers import BasicTokenizer
>>> train_dataset = Multi30k(
...     root=os.path.join(os.path.expanduser('~'), ".mindnlp"),
...     split="train",
...     language_pair=("de", "en")
... )
>>> tokenizer = BasicTokenizer(True)
>>> train_dataset = train_dataset.map([tokenizer], 'en')
>>> train_dataset = train_dataset.map([tokenizer], 'de')
>>> en_vocab = text.Vocab.from_dataset(train_dataset, 'en',
...     special_tokens=['<pad>', '<unk>'], special_first=True)
>>> de_vocab = text.Vocab.from_dataset(train_dataset, 'de',
...     special_tokens=['<pad>', '<unk>'], special_first=True)
>>> vocab = {'en': en_vocab, 'de': de_vocab}
>>> train_dataset = Multi30k_Process(train_dataset, vocab=vocab)
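The `batch_size`, `max_len`, and `drop_remainder` parameters interact in a way that is easier to see on plain lists of token ids. The sketch below is an illustration under stated assumptions and not the code behind Multi30k_Process: the helper `batch_ids` is hypothetical, and both the choice of padding id 0 and padding each batch to its own longest sequence are assumptions.

```python
def batch_ids(sequences, batch_size=64, max_len=500, drop_remainder=False, pad_id=0):
    """Hypothetical stand-in for truncating, padding, and batching id sequences."""
    # Truncate every sequence to at most max_len tokens.
    truncated = [seq[:max_len] for seq in sequences]
    batches = []
    for start in range(0, len(truncated), batch_size):
        batch = truncated[start:start + batch_size]
        if drop_remainder and len(batch) < batch_size:
            break  # discard the incomplete final batch
        # Assumption: pad to the longest sequence within the batch.
        width = max(len(seq) for seq in batch)
        batches.append([seq + [pad_id] * (width - len(seq)) for seq in batch])
    return batches


sequences = [[1, 2, 3], [4], [5, 6], [7, 8, 9, 10], [11]]
kept = batch_ids(sequences, batch_size=2, max_len=3)
dropped = batch_ids(sequences, batch_size=2, max_len=3, drop_remainder=True)
print(len(kept), len(dropped))  # 3 2
```

With five sequences and a batch size of 2, the final batch holds a single row, so it survives with the default `drop_remainder=False` and is discarded when the flag is set.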
MachineTranslation dataset init