Using MidiTok with PyTorch

MidiTok features PyTorch Dataset objects to load music data during training, usually coupled with a PyTorch DataLoader. A Dataset stores information about a dataset: the paths of the files to load, or the data itself held in memory (recommended for small datasets only). When indexed, the Dataset outputs dictionaries whose values correspond to the inputs and labels.

Loading data

MidiTok provides two dataset classes: miditok.pytorch_data.DatasetMIDI and miditok.pytorch_data.DatasetJSON.

miditok.pytorch_data.DatasetMIDI loads MIDI files and can either tokenize them on the fly when the dataset is indexed, or pre-tokenize them when the dataset is created and keep the token ids in memory. For most use cases, this Dataset should fulfill your needs and is the recommended option.
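The trade-off between the two modes can be illustrated with a minimal, framework-free sketch. The class below is a hypothetical stand-in for illustration, not MidiTok's implementation: pre-tokenizing pays the full tokenization cost up front and holds everything in memory, while on-the-fly tokenization defers the cost to each access.

```python
class TinyDataset:
    """Toy dataset illustrating on-the-fly vs. pre-tokenized loading."""

    def __init__(self, files, tokenize, pre_tokenize=False):
        self.files = files
        self.tokenize = tokenize  # callable: file -> list of token ids
        # Eager mode: tokenize everything once, keep all ids in memory.
        self.cache = [tokenize(f) for f in files] if pre_tokenize else None

    def __getitem__(self, idx):
        if self.cache is not None:
            return self.cache[idx]  # instant lookup from memory
        return self.tokenize(self.files[idx])  # tokenize on the fly

# Stand-in for a real tokenizer, used only to make the sketch runnable
fake_tokenize = lambda f: [len(f)] * 3
lazy = TinyDataset(["a.mid", "bb.mid"], fake_tokenize)
eager = TinyDataset(["a.mid", "bb.mid"], fake_tokenize, pre_tokenize=True)
```

With DatasetMIDI, this choice is made through the pre_tokenize argument documented below.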

miditok.pytorch_data.DatasetJSON loads JSON files containing token ids, so the dataset must be tokenized beforehand. This class is only compatible with JSON files saved as a single token stream (tokenizer.one_token_stream). To use it with all the tracks of a multi-stream tokenizer, you will need to save each track's token sequence as a separate JSON file.
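Saving each track as its own file can be done with a few lines of plain Python. The helper and the `{"ids": ...}` file layout below are assumptions for illustration (the actual JSON structure depends on how your tokenizer saves tokens), but they show the idea of one file per track:

```python
import json
from pathlib import Path

def split_tracks_to_files(track_id_lists, out_dir, stem):
    """Save each track's token ids as a separate JSON file.

    track_id_lists: one list of token ids per track (hypothetical layout).
    Returns the paths of the written files.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, track_ids in enumerate(track_id_lists):
        path = out_dir / f"{stem}_track{i}.json"
        path.write_text(json.dumps({"ids": track_ids}))
        paths.append(path)
    return paths
```

Each resulting file then holds a single token stream, as DatasetJSON expects.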

Preparing data

When training a model, you will likely want to limit the token sequence length in order not to run out of memory. The dataset classes handle this case and can trim the token sequences. However, it is not uncommon for a single MIDI file to be tokenized into a sequence of several thousand tokens, depending on its duration and number of notes. In that case, using only the first portion of the token sequence would considerably reduce the amount of data available to train and test a model.

To handle this, MidiTok provides the miditok.pytorch_data.split_files_for_training() method, which dynamically splits MIDI files into chunks that should each tokenize to approximately the number of tokens you want. If most of your MIDIs cannot fit into single usable token sequences, we recommend splitting your dataset with this method.

Data loading example

MidiTok also provides an "all-in-one" data collator, miditok.pytorch_data.DataCollator, to be used with a PyTorch DataLoader in order to pad batches and create attention masks. Here is a complete example showing how to use this module to train any model.

from miditok import REMI, TokenizerConfig
from miditok.pytorch_data import DatasetMIDI, DataCollator, split_files_for_training
from torch.utils.data import DataLoader
from pathlib import Path

# Creating a multitrack tokenizer configuration, read the doc to explore other parameters
config = TokenizerConfig(num_velocities=16, use_chords=True, use_programs=True)
tokenizer = REMI(config)

# Train the tokenizer with Byte Pair Encoding (BPE)
midi_paths = list(Path("path", "to", "midis").glob("**/*.mid"))
tokenizer.train(vocab_size=30000, files_paths=midi_paths)
tokenizer.save_params(Path("path", "to", "save", "tokenizer.json"))
# And pushing it to the Hugging Face hub (you can download it back with .from_pretrained)
tokenizer.push_to_hub("username/model-name", private=True, token="your_hf_token")

# Split MIDIs into smaller chunks for training
dataset_chunks_dir = Path("path", "to", "midi_chunks")
split_files_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=dataset_chunks_dir,
    max_seq_len=1024,
)

# Create a Dataset, a DataLoader and a collator to train a model
dataset = DatasetMIDI(
    files_paths=list(dataset_chunks_dir.glob("**/*.mid")),
    tokenizer=tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(tokenizer.pad_token_id, copy_inputs_as_labels=True)
dataloader = DataLoader(dataset, batch_size=64, collate_fn=collator)

# Iterate over the dataloader to train a model
for batch in dataloader:
    print("Train your model on this batch...")

Note: This module is imported only if torch is installed in your Python environment.

Dataset classes and data collators to be used with PyTorch when training a model.

class miditok.pytorch_data.DataCollator(pad_token_id: int, pad_on_left: bool = False, copy_inputs_as_labels: bool = False, shift_labels: bool = False, labels_pad_idx: int = -100, inputs_kwarg_name: str = 'input_ids', labels_kwarg_name: str = 'labels', decoder_inputs_kwarg_name: str = 'decoder_input_ids')

All-in-one data collator for PyTorch DataLoader.

It can apply padding (on the right or left side of the sequences) and prepend or append BOS and EOS tokens. It will also add an "attention_mask" entry to the batch, following the padding applied.

Parameters:
  • pad_token_id – padding token id.

  • pad_on_left – if True, pads the sequences on the left. This can be required by libraries expecting left padding, for example when generating with Hugging Face Transformers. (default: False)

  • copy_inputs_as_labels – will add a labels entry (labels_kwarg_name) to the batch (or replace the existing one), which is a copy of the input entry: decoder_inputs_kwarg_name if present in the batch, else inputs_kwarg_name. (default: False)

  • shift_labels – will shift inputs and labels for autoregressive training/teacher forcing. (default: False)

  • labels_pad_idx – padding id for labels. (default: -100)

  • inputs_kwarg_name – name of dict / kwarg key for inputs. (default: "input_ids")

  • labels_kwarg_name – name of dict / kwarg key for labels. (default: "labels")

  • decoder_inputs_kwarg_name – name of dict / kwarg key for decoder inputs. This key is intended for encoder-decoder (seq2seq) models: it holds the decoder inputs, while inputs_kwarg_name holds the encoder inputs. (default: "decoder_input_ids")
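As a rough mental model of what the collator produces, here is a simplified re-implementation in plain Python. This is a sketch of the documented behavior, not MidiTok's actual code (real batches are LongTensors, and BOS/EOS handling is omitted):

```python
def collate(seqs, pad_token_id, pad_on_left=False,
            copy_inputs_as_labels=False, labels_pad_idx=-100):
    """Sketch of DataCollator: pad to the batch's max length, build an
    attention mask (1 = real token, 0 = padding), optionally copy inputs
    as labels with labels_pad_idx on the padded positions."""
    max_len = max(len(s) for s in seqs)
    input_ids, attention_mask, labels = [], [], []
    for s in seqs:
        n_pad = max_len - len(s)
        if pad_on_left:
            input_ids.append([pad_token_id] * n_pad + s)
            attention_mask.append([0] * n_pad + [1] * len(s))
            labels.append([labels_pad_idx] * n_pad + s)
        else:
            input_ids.append(s + [pad_token_id] * n_pad)
            attention_mask.append([1] * len(s) + [0] * n_pad)
            labels.append(s + [labels_pad_idx] * n_pad)
    batch = {"input_ids": input_ids, "attention_mask": attention_mask}
    if copy_inputs_as_labels:
        batch["labels"] = labels
    return batch
```

Note how labels use labels_pad_idx (-100 by default) on padded positions, which PyTorch's cross-entropy loss ignores.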

class miditok.pytorch_data.DatasetJSON(files_paths: Sequence[Path], max_seq_len: int, bos_token_id: int | None = None, eos_token_id: int | None = None)

Basic Dataset loading JSON files of tokenized music files.

When indexed (dataset[idx]), a DatasetJSON will load the files_paths[idx] JSON file and return the token ids, which can be used to train generative models.

This class is only compatible with tokens saved as a single stream of tokens ( tokenizer.one_token_stream ). If you plan to use it with token files containing multiple token streams, you should first split each track token sequence with the miditok.pytorch_data.split_dataset_to_subsequences() method.

If your dataset contains token sequences of widely varying lengths, you might want to first split it into subsequences with the miditok.pytorch_data.split_files_for_training() method before loading it, to avoid losing data.

Parameters:
  • files_paths – list of paths to files to load.

  • max_seq_len – maximum sequence length (in number of tokens).

  • bos_token_id – BOS token id. (default: None)

  • eos_token_id – EOS token id. (default: None)

class miditok.pytorch_data.DatasetMIDI(files_paths: Sequence[Path], tokenizer: MusicTokenizer, max_seq_len: int, bos_token_id: int | None = None, eos_token_id: int | None = None, pre_tokenize: bool = False, ac_tracks_random_ratio_range: tuple[float, float] | None = None, ac_bars_random_ratio_range: tuple[float, float] | None = None, func_to_get_labels: Callable[[Score, TokSequence | list[TokSequence], Path], int | list[int] | LongTensor] | None = None, sample_key_name: str = 'input_ids', labels_key_name: str = 'labels')

A Dataset loading and tokenizing music files (MIDI, abc) during training.

This class can either tokenize music files on the fly when it is iterated, or pre-tokenize all the files at its initialization and store the tokens in memory.

Important note: you should probably use this class in concert with the miditok.pytorch_data.split_files_for_training() method in order to train your model on chunks of music files whose token sequence lengths are close to the max_seq_len value. When using this class with file chunks, the BOS and EOS tokens will only be added to the first and last chunks respectively. This avoids training the model with EOS tokens that would incorrectly signal the end of a data sample, and with misplaced BOS tokens that would break the causal continuity between consecutive chunks.
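The BOS/EOS placement described above can be sketched as a small helper. This is a hypothetical function written for illustration of the documented behavior, not part of MidiTok's API:

```python
def add_special_tokens_to_chunks(chunks, bos_id, eos_id):
    """Prepend BOS only to the first chunk and append EOS only to the
    last, so intermediate chunks keep the causal continuity of the
    original file.

    chunks: non-empty list of token-id lists, one per file chunk.
    """
    out = [list(c) for c in chunks]
    if bos_id is not None:
        out[0] = [bos_id] + out[0]
    if eos_id is not None:
        out[-1] = out[-1] + [eos_id]
    return out
```

Chunks in the middle of a file receive neither token, since they are continuations of the same piece.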

Additionally, you can use the func_to_get_labels argument to provide a function returning labels (one label per file).

Handling of corrupted files: some MIDI files may be corrupted, for example by containing unexpected values. If the DatasetMIDI pre-tokenizes, it will simply skip these files. Otherwise, the DatasetMIDI will return dictionaries with None values when iterated.
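When tokenizing on the fly, you may want to drop these None samples before collating. One way to do this, sketched here as a hypothetical wrapper (not part of MidiTok's API), is to filter the batch inside the DataLoader's collate_fn before handing it to the collator:

```python
def collate_skipping_none(samples, collator):
    """Drop samples containing None values (corrupted files),
    then delegate to the real collator."""
    valid = [s for s in samples if all(v is not None for v in s.values())]
    return collator(valid)
```

For example, with the collator from the example above you could pass `collate_fn=lambda batch: collate_skipping_none(batch, collator)` to the DataLoader.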

Parameters:
  • files_paths – paths to the music files to load.

  • tokenizer – tokenizer.

  • max_seq_len – maximum sequence length (in number of tokens).

  • bos_token_id – BOS token id. (default: None)

  • eos_token_id – EOS token id. (default: None)

  • pre_tokenize – whether to pre-tokenize the data files when creating the Dataset object. If enabled, the Dataset will tokenize all the files at its initialization and store the tokens in memory. (default: False)

  • ac_tracks_random_ratio_range – range of ratios (between 0 and 1 included) of tracks to compute attribute controls on. If None is given, no track attribute control will be used. (default: None)

  • ac_bars_random_ratio_range – range of ratios (between 0 and 1 included) of bars to compute attribute controls on. If None is given, no bar attribute control will be used. (default: None)

  • func_to_get_labels – a function to retrieve the label of a file. Following the signature above, it must take three positional arguments: the Score of the loaded file, the miditok.TokSequence (or list of them) returned when tokenizing it, and the path to the file. It must return an integer corresponding to the label id (not the absolute value; e.g. if you are classifying 10 musicians, return an id from 0 to 9 corresponding to the musician). (default: None)

  • sample_key_name – name of the dictionary key containing the sample data when iterating the dataset. (default: "input_ids")

  • labels_key_name – name of the dictionary key containing the labels data when iterating the dataset. (default: "labels")
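As an example of func_to_get_labels, here is a hypothetical labeling function for a classification task where each file sits in a directory named after its composer. The COMPOSERS mapping and the directory layout are assumptions for illustration:

```python
from pathlib import Path

# Hypothetical mapping: directory name -> class id (one folder per musician)
COMPOSERS = {"bach": 0, "chopin": 1, "debussy": 2}

def get_label(score, tokseq, file_path: Path) -> int:
    """Return the class id of a file from its parent directory name.

    The first two arguments (the parsed score and the token sequence)
    are ignored here; only the file path is used.
    """
    return COMPOSERS[file_path.parent.name]
```

You would then pass it when creating the dataset, e.g. `DatasetMIDI(..., func_to_get_labels=get_label)`, and each sample dictionary would contain the label under labels_key_name.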