=================
Code examples
=================

Create a tokenizer
------------------------

A basic example showing how to create a tokenizer, with a selection of custom parameters.

.. code-block:: python

    from miditok import REMI, TokenizerConfig  # here we choose to use REMI

    # Our parameters
    TOKENIZER_PARAMS = {
        "pitch_range": (21, 109),
        "beat_res": {(0, 4): 8, (4, 12): 4},
        "num_velocities": 32,
        "special_tokens": ["PAD", "BOS", "EOS", "MASK"],
        "use_chords": True,
        "use_rests": False,
        "use_tempos": True,
        "use_time_signatures": False,
        "use_programs": False,
        "num_tempos": 32,  # number of tempo bins
        "tempo_range": (40, 250),  # (min, max)
    }
    config = TokenizerConfig(**TOKENIZER_PARAMS)

    # Creates the tokenizer
    tokenizer = REMI(config)

MIDI - Tokens conversion
-------------------------------

Here we convert a MIDI file to tokens, then decode the tokens back into a MIDI file.

.. code-block:: python

    from pathlib import Path

    # Tokenize a MIDI file
    tokens = tokenizer(Path("to", "your_midi.mid"))  # automatically detects Score objects, paths, tokens

    # Convert the tokens back to MIDI and save it
    generated_midi = tokenizer(tokens)  # MidiTok can handle PyTorch/Numpy/Tensorflow tensors
    generated_midi.dump_midi(Path("to", "decoded_midi.mid"))

Train a tokenizer with BPE
-----------------------------

Here we train the tokenizer with :ref:`Byte Pair Encoding (BPE)`. BPE reduces the length of the token sequences, which improves model efficiency, while also improving the quality of the results and the model's performance.

.. code-block:: python

    from miditok import REMI
    from pathlib import Path

    # Creates the tokenizer and lists the file paths
    tokenizer = REMI()  # using default parameters (constants.py)
    midi_paths = list(Path("path", "to", "dataset").glob("**/*.mid"))

    # Builds the vocabulary with BPE
    tokenizer.train(vocab_size=30000, files_paths=midi_paths)
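Once trained, the tokenizer can be saved and reloaded so that the exact same vocabulary is reused for tokenization, training and decoding. The short sketch below relies on MidiTok's ``save_params`` method and the ``params`` constructor argument (check your MidiTok version in case these were renamed); the file path is a placeholder.

.. code-block:: python

    from miditok import REMI
    from pathlib import Path

    # Save the tokenizer's configuration and vocabulary to a JSON file (placeholder path)
    tokenizer.save_params(Path("path", "to", "save", "tokenizer.json"))

    # Later, recreate an identical tokenizer from the saved parameters
    tokenizer = REMI(params=Path("path", "to", "save", "tokenizer.json"))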
Prepare a dataset before training
-------------------------------------------

MidiTok provides utility methods to split music files into smaller chunks containing approximately a target number of tokens, allowing you to use most of your data to train and evaluate models. It also provides data augmentation methods to increase the amount of training data. The ``tokenizer`` used below is the one created and trained in the previous examples.

.. code-block:: python

    from pathlib import Path
    from random import shuffle

    from miditok.data_augmentation import augment_dataset
    from miditok.utils import split_files_for_training

    # Split the dataset into train/valid/test subsets, with 15% of the files for each of the two latter
    midi_paths = list(Path("path", "to", "dataset").glob("**/*.mid"))
    total_num_files = len(midi_paths)
    num_files_valid = round(total_num_files * 0.15)
    num_files_test = round(total_num_files * 0.15)
    shuffle(midi_paths)
    midi_paths_valid = midi_paths[:num_files_valid]
    midi_paths_test = midi_paths[num_files_valid:num_files_valid + num_files_test]
    midi_paths_train = midi_paths[num_files_valid + num_files_test:]

    # Chunk MIDIs and perform data augmentation on each subset independently
    for files_paths, subset_name in (
        (midi_paths_train, "train"), (midi_paths_valid, "valid"), (midi_paths_test, "test")
    ):

        # Split the MIDIs into chunks of approximately 1024 tokens
        subset_chunks_dir = Path(f"dataset_{subset_name}")
        split_files_for_training(
            files_paths=files_paths,
            tokenizer=tokenizer,
            save_dir=subset_chunks_dir,
            max_seq_len=1024,
            num_overlap_bars=2,
        )

        # Perform data augmentation
        augment_dataset(
            subset_chunks_dir,
            pitch_offsets=[-12, 12],
            velocity_offsets=[-4, 4],
            duration_offsets=[-0.5, 0.5],
        )

Create a Dataset and collator for training
-------------------------------------------

Here we create a ``DatasetMIDI`` and a ``DataCollator`` to be used with a PyTorch ``DataLoader`` to train a model.

.. code-block:: python

    from miditok import REMI
    from miditok.pytorch_data import DatasetMIDI, DataCollator
    from pathlib import Path
    from torch.utils.data import DataLoader

    tokenizer = REMI()  # using default parameters (constants.py)
    midi_paths = list(Path("path", "to", "dataset").glob("**/*.mid"))
    dataset = DatasetMIDI(
        files_paths=midi_paths,
        tokenizer=tokenizer,
        max_seq_len=1024,
        bos_token_id=tokenizer["BOS_None"],
        eos_token_id=tokenizer["EOS_None"],
    )
    collator = DataCollator(tokenizer.pad_token_id)
    data_loader = DataLoader(dataset=dataset, collate_fn=collator)

    # Using the data loader in the training loop
    for batch in data_loader:
        print("Train your model on this batch...")

Tokenize a dataset
------------------------

Here we tokenize a whole dataset into JSON files storing the token ids. We also perform data augmentation on the pitch, velocity and duration dimensions.

.. code-block:: python

    from miditok import REMI
    from miditok.data_augmentation import augment_midi_dataset
    from pathlib import Path

    # Creates the tokenizer and lists the file paths
    tokenizer = REMI()  # using default parameters (constants.py)
    data_path = Path("path", "to", "dataset")

    # A validation method to discard MIDIs we do not want
    # It can also be used for custom pre-processing, for instance if you want to merge
    # some tracks before tokenizing a MIDI file
    def midi_valid(midi) -> bool:
        if any(ts.numerator != 4 for ts in midi.time_signature_changes):
            return False  # time signature different from 4/*, 4 beats per bar
        return True

    # Performs data augmentation on one pitch octave (up and down), velocities and
    # durations
    midi_aug_path = Path("to", "new", "location", "augmented")
    augment_midi_dataset(
        data_path,
        pitch_offsets=[-12, 12],
        velocity_offsets=[-4, 5],
        duration_offsets=[-0.5, 1],
        out_path=midi_aug_path,
    )
    tokenizer.tokenize_dataset(
        data_path,
        Path("path", "to", "tokens"),
        midi_valid,
    )
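To quickly verify the result, one of the saved token files can be decoded back into a MIDI. The sketch below is only an illustration under assumptions: it assumes the JSON files written by ``tokenize_dataset`` store the token ids under an ``"ids"`` key, and that the tokenizer can decode a plain list of ids as it does for the other input types shown above; the file names are placeholders.

.. code-block:: python

    import json
    from pathlib import Path

    # Load one of the saved token files (placeholder path; the "ids" key is an
    # assumption that may differ between MidiTok versions)
    with Path("path", "to", "tokens", "some_file.json").open() as json_file:
        ids = json.load(json_file)["ids"]

    # Decode the ids back into a MIDI and save it for inspection
    decoded_midi = tokenizer(ids)
    decoded_midi.dump_midi(Path("path", "to", "tokens", "decoded_check.mid"))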