Bases of MidiTok

This page introduces the bases of MidiTok: how a tokenizer works and what its basic elements are.

MidiTok’s workflow

MidiTok uses a common workflow for all its tokenizers, which runs as follows:

  1. Music file preprocessing: time is downsampled to match the tokenizer’s time resolution, tracks with the same program are merged, notes with pitches outside the tokenizer’s pitch range are removed, note velocities and tempos are downsampled, and finally notes, tempos and time signatures are deduplicated;

  2. Parsing of global events: tempo and time signature tokens are created;

  3. Parsing of the track events: notes, chords, controls (pedals…) and tokens specific to each track are parsed to create their associated tokens;

  4. Creating time tokens: the tokens representing the time are created in order to bind the previously created global and track tokens.

The resulting tokens are provided by the tokenizer as one or a list of miditok.TokSequence objects, depending on the tokenizer’s IO format (see Tokens & TokSequence input / output format).

The first three steps are common to all tokenizers, while the fourth is handled independently by each tokenizer. The first step formats the music file so that its content fits the tokenizer’s vocabulary before being parsed.
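To make the preprocessing step more concrete, here is a minimal sketch in plain Python. It is not MidiTok's implementation: the Note type, the preprocess_notes function, the half-open pitch range and the velocity binning rule are all assumptions made for illustration.

```python
from typing import NamedTuple


class Note(NamedTuple):
    """Hypothetical note container (not a MidiTok or symusic class)."""

    time: int  # onset, in ticks
    pitch: int  # MIDI pitch
    velocity: int  # MIDI velocity


def preprocess_notes(
    notes: list[Note],
    ticks_per_step: int,
    pitch_range: tuple[int, int],
    num_velocities: int,
) -> list[Note]:
    """Toy version of step 1: filter, downsample, sort and deduplicate notes."""
    low, high = pitch_range  # assumed half-open: low <= pitch < high
    velocity_bin = 128 // num_velocities
    processed = set()
    for note in notes:
        # Drop notes whose pitch falls outside the supported range.
        if not low <= note.pitch < high:
            continue
        # Snap the onset to the nearest step of the time grid.
        time = round(note.time / ticks_per_step) * ticks_per_step
        # Map the velocity to the lower bound of its bin (never below 1).
        velocity = max(1, (note.velocity // velocity_bin) * velocity_bin)
        # The set removes notes that became identical after downsampling.
        processed.add(Note(time, note.pitch, velocity))
    return sorted(processed)


notes = [Note(0, 60, 100), Note(3, 60, 101), Note(118, 64, 90), Note(240, 10, 80)]
cleaned = preprocess_notes(
    notes, ticks_per_step=120, pitch_range=(21, 109), num_velocities=32
)
print(cleaned)
```

In this example the first two notes collapse into one after downsampling, the third is snapped to tick 120, and the last one is dropped because its pitch is out of range.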

Vocabulary

As introduced in Tokens and vocabulary, the vocabulary acts as a lookup table between the tokens (string) and their ids (integers). It can be accessed with tokenizer.vocab to get the string to id mapping.

For tokenizers with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of dictionaries, and the tokenizer.is_multi_voc property will be True.

With a trained tokenizer: tokenizer.vocab holds all the basic tokens describing the note and time attributes of music. By analogy with text, this vocabulary can be seen as the alphabet of unique characters. After Training a tokenizer, a new vocabulary is built with newly created tokens from pairs of basic tokens. This vocabulary can be accessed with tokenizer.vocab_model, and maps tokens as bytes (string) to their associated ids (int). This is the vocabulary of the 🤗tokenizers model.
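The lookup-table idea can be sketched with plain dictionaries. The token names and ids below are made up for illustration and do not come from a real tokenizer; only the shapes (one dictionary, or a list of dictionaries with embedding pooling) follow the description above.

```python
# Single vocabulary: one dictionary mapping token strings to ids.
vocab = {"PAD_None": 0, "Pitch_60": 1, "Pitch_62": 2, "Velocity_100": 3}
# The inverse mapping recovers the token string from an id.
vocab_inv = {id_: token for token, id_ in vocab.items()}

ids = [vocab[token] for token in ["Pitch_60", "Velocity_100"]]
tokens = [vocab_inv[id_] for id_ in ids]

# Embedding pooling: one dictionary per class of sub-token.
multi_vocab = [
    {"PAD_None": 0, "Pitch_60": 1, "Pitch_62": 2},
    {"PAD_None": 0, "Velocity_100": 1},
]
# Each token step is then a list of ids, one per vocabulary.
step_ids = [multi_vocab[0]["Pitch_62"], multi_vocab[1]["Velocity_100"]]
is_multi_voc = isinstance(multi_vocab, list)

print(ids, tokens, step_ids, is_multi_voc)
```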

TokSequence

The methods of MidiTok use miditok.TokSequence objects as inputs and outputs. A miditok.TokSequence holds tokens as strings, integers, miditok.Event and bytes (used internally to encode the token ids with trained tokenizers). TokSequences are subscriptable, can be sliced and concatenated, and implement the __len__ magic method.

You can use the miditok.MusicTokenizer.complete_sequence() method to automatically fill the non-initialized attributes of a miditok.TokSequence.

class miditok.TokSequence(tokens: list[str | list[str]] = <factory>, ids: list[int | list[int]] = <factory>, bytes: str = <factory>, events: list[~miditok.classes.Event | list[~miditok.classes.Event]] = <factory>, are_ids_encoded: bool = False, _ticks_bars: list[int] = <factory>, _ticks_beats: list[int] = <factory>, _ids_decoded: list[int | list[int]] = <factory>)

Sequence of tokens.

A TokSequence can represent tokens in several forms:

  • tokens (list of str): tokens as a sequence of strings;

  • ids (list of int): token ids, these are the ones to be fed to models;

  • events (list of Event): Event objects that can carry time or other information useful for debugging;

  • bytes (str): ids are converted into unique bytes, all joined together in a single string. This is used internally by MidiTok for the tokenizer’s model (BPE, Unigram, WordPiece).

Bytes are used internally by MidiTok for Byte Pair Encoding. The are_ids_encoded attribute tells if ids is encoded.

miditok.MusicTokenizer.complete_sequence() can be used to complete the non-initialized attributes.

split_per_bars() list[TokSequence]

Split the sequence into subsequences corresponding to each bar.

The method can only be called from sequences properly tokenized, otherwise it will throw an error.

Returns:

list of subsequences for each bar.

split_per_beats() list[TokSequence]

Split the sequence into subsequences corresponding to each beat.

The method can only be called from sequences properly tokenized, otherwise it will throw an error.

Returns:

list of subsequences for each beat.

The MusicTokenizer class

MidiTok features several MIDI tokenizations, all inheriting from the miditok.MusicTokenizer class. You can customize your tokenizer by creating it with a custom miditok.TokenizerConfig.

class miditok.MusicTokenizer(*args, **kwargs)

Base music tokenizer class, acting as a common framework.

This is the base class of all tokenizers, containing the common methods and attributes. It serves as a framework, and implements most of the global tokenization workflow. Child classes should only implement specific methods, for their specific behaviors, leaving most of the logic here.

Parameters:
  • tokenizer_config – the tokenizer’s configuration, as a miditok.TokenizerConfig object.

  • params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)

add_attribute_control(attribute_control: AttributeControl) None

Add a miditok.attribute_control.AttributeControl to the tokenizer.

The tokens of the attribute controls will also be added to the vocabulary.

Parameters:

attribute_control – miditok.attribute_control.AttributeControl to add to the tokenizer.

add_to_vocab(token: str | Event, special_token: bool = False, vocab_idx: int | None = None, byte_: str | None = None) None

Add an event to the vocabulary. Its id will be the length of the vocab.

Parameters:
  • token – token to add, as a formatted string of the form “Type_Value”, e.g. Pitch_80, or an Event.

  • special_token – whether the token is special. (default: False)

  • vocab_idx – idx of the vocabulary (in case of embedding pooling). (default: None)

  • byte_ – unique byte associated to the token. The associated byte of a token is used to encode-decode ids with the tokenizer’s model (BPE, Unigram, WordPiece). If None is given, it will default to chr(id_ + CHR_ID_START). (default: None)
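The id and byte assignment rules described for add_to_vocab can be illustrated with a small standalone function. This is a sketch under assumptions, not the library's code: the function below works on plain dictionaries, and the value given to CHR_ID_START is assumed for the example.

```python
from typing import Optional

CHR_ID_START = 33  # assumed offset, used here to skip control characters


def add_to_vocab(
    vocab: dict[str, int],
    id_to_byte: dict[int, str],
    token: str,
    byte_: Optional[str] = None,
) -> int:
    """Register a token: its id is the current length of the vocabulary."""
    id_ = len(vocab)
    vocab[token] = id_
    # Without an explicit byte, derive one from the id.
    id_to_byte[id_] = chr(id_ + CHR_ID_START) if byte_ is None else byte_
    return id_


vocab: dict[str, int] = {}
id_to_byte: dict[int, str] = {}
pad_id = add_to_vocab(vocab, id_to_byte, "PAD_None")
pitch_id = add_to_vocab(vocab, id_to_byte, "Pitch_80")
print(vocab, id_to_byte)
```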

complete_sequence(seq: TokSequence, complete_bytes: bool = False) None

Complete (inplace) a miditok.TokSequence.

The input sequence can have some of its attributes (ids, tokens) not initialized (i.e. None). This method will initialize them from the present ones. The events attribute will not be filled as it is only intended for debug purpose. The bytes attribute will be created if complete_bytes is provided as True and if the tokenizer has been trained.

Parameters:
  • seq – input miditok.TokSequence, must have at least one attribute defined.

  • complete_bytes – will complete the bytes form of each token. This is only applicable if the tokenizer has been trained.
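As an illustration of what completing a sequence means, the sketch below fills whichever of tokens or ids is missing from the other one. ToySequence and this complete_sequence function are hypothetical stand-ins written for the example; they ignore events, bytes and multi-vocabulary tokenizers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToySequence:
    """Hypothetical, simplified stand-in for a token sequence."""

    tokens: Optional[list[str]] = None
    ids: Optional[list[int]] = None


def complete_sequence(seq: ToySequence, vocab: dict[str, int]) -> None:
    """Fill (in place) whichever of tokens / ids is missing from the other."""
    if seq.tokens is None and seq.ids is None:
        raise ValueError("The sequence must have at least one attribute defined.")
    if seq.ids is None:
        seq.ids = [vocab[token] for token in seq.tokens]
    elif seq.tokens is None:
        vocab_inv = {id_: token for token, id_ in vocab.items()}
        seq.tokens = [vocab_inv[id_] for id_ in seq.ids]


vocab = {"PAD_None": 0, "Pitch_60": 1, "Velocity_100": 2}
seq = ToySequence(ids=[1, 2])
complete_sequence(seq, vocab)
print(seq.tokens)
```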

decode(tokens: TokSequence | list[TokSequence] | list[int | list[int]] | ndarray, programs: list[tuple[int, bool]] | None = None, output_path: str | Path | None = None)

Detokenize one or several sequences of tokens into a symusic.Score.

You can give the tokens sequences either as miditok.TokSequence objects, lists of integers, numpy arrays or PyTorch/Jax/Tensorflow tensors. The Score’s time division will be the same as the tokenizer’s: tokenizer.time_division.

Parameters:
  • tokens – tokens to convert. Can be either a list of miditok.TokSequence, a Tensor (PyTorch and Tensorflow are supported), a numpy array or a Python list of ints. The first dimension represents tracks, unless the tokenizer handles tracks altogether as a single token sequence (tokenizer.one_token_stream == True).

  • programs – programs of the tracks. If none is given, will default to piano, program 0. (default: None)

  • output_path – path to save the file. (default: None)

Returns:

the symusic.Score object.

decode_token_ids(seq: TokSequence | list[TokSequence]) None

Decode the ids of a miditok.TokSequence with BPE, Unigram or WordPiece.

This method only modifies the .ids attribute of the input sequence(s) and does not complete it. This method can be used recursively on lists of miditok.TokSequence.

Parameters:

seq – token sequence to decompose.

encode(score: Score | Path, encode_ids: bool = True, no_preprocess_score: bool = False, attribute_controls_indexes: Mapping[int, Mapping[int, Sequence[int] | bool]] | None = None) TokSequence | list[TokSequence]

Tokenize a music file (MIDI/abc), given as a symusic.Score or a file path.

You can provide a Path to the file to tokenize, or a symusic.Score object. This method returns a (list of) miditok.TokSequence.

If you are implementing your own tokenization by subclassing this class, override the protected _score_to_tokens method.

Parameters:
  • score – the symusic.Score object to convert.

  • encode_ids – the backbone model (BPE, Unigram, WordPiece) will encode the tokens and compress the sequence. Can only be used if the tokenizer has been trained. (default: True)

  • no_preprocess_score – whether to preprocess the symusic.Score. If this argument is provided as True, make sure that the corresponding music file / symusic.Score has already been preprocessed by the tokenizer (miditok.MusicTokenizer.preprocess_score()) or that its content is aligned with the tokenizer’s vocabulary, otherwise the tokenization is likely to crash. This argument is useful in cases where you need to use the preprocessed symusic.Score along with the tokens to not have to preprocess it twice as this method preprocesses it inplace. (default: False)

  • attribute_controls_indexes – indices of the attribute controls to compute and associated tracks and bars. This argument has to be provided as a dictionary mapping track indices to dictionaries mapping attribute control indices (indexing tokenizer.attribute_controls) to a sequence of bar indexes if the AC is “bar-level” or anything if it is “track-level”. Its structure is as: {track_idx: {ac_idx: Any (track ac) | [bar_idx, ...] (bar ac)}} This argument is meant to be used when training a model in order to make it learn to generate tokens accordingly to the attribute controls. For maximum safety, it should be used with no_preprocess_score with an already preprocessed symusic.Score in order to make sure that the provided tracks indexes will remain correct as the preprocessing might delete or merge tracks depending on the tokenizer’s configuration.

Returns:

a miditok.TokSequence if tokenizer.one_token_stream is True, else a list of miditok.TokSequence objects.
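The nested structure expected for attribute_controls_indexes can be easier to read as a literal. The sketch below only builds and walks such a dictionary; the track indices, attribute control indices and the BAR_LEVEL_ACS set are hypothetical values chosen for the example.

```python
from collections.abc import Mapping, Sequence
from typing import Union

ACIndexes = Mapping[int, Mapping[int, Union[Sequence[int], bool]]]

# Hypothetical setup: attribute control 2 is bar-level, the others track-level.
BAR_LEVEL_ACS = {2}

# {track_idx: {ac_idx: Any (track ac) | [bar_idx, ...] (bar ac)}}
attribute_controls_indexes: ACIndexes = {
    0: {0: True, 2: [0, 1, 4]},
    3: {2: [8]},
}


def describe(ac_indexes: ACIndexes) -> list[str]:
    """Return one human-readable line per (track, attribute control) pair."""
    lines = []
    for track_idx, acs in ac_indexes.items():
        for ac_idx, value in acs.items():
            if ac_idx in BAR_LEVEL_ACS:
                lines.append(f"track {track_idx}: AC {ac_idx} on bars {list(value)}")
            else:
                # For track-level controls the value itself is ignored.
                lines.append(f"track {track_idx}: AC {ac_idx} on the whole track")
    return lines


description = describe(attribute_controls_indexes)
print("\n".join(description))
```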

encode_token_ids(seq: TokSequence | list[TokSequence]) None

Encode a miditok.TokSequence with BPE, Unigram or WordPiece.

The method works inplace and only alters the sequence’s .ids. The method also works with lists of miditok.TokSequence. If a list is given, the model will encode all sequences in one batch to speed up the operation.

Parameters:

seq – miditok.TokSequence to encode ids.

property io_format: tuple[str, ...]

Return the i/o format of the tokenizer.

The characters for each dimension returned are:

  • I: track or instrument;

  • T: token, or time step;

  • C: class of token, when using embedding pooling.

Returns:

i/o format of the tokenizer, as a tuple of strings, one per dimension.

property is_multi_voc: bool

Indicate if the tokenizer uses embedding pooling / has multiple vocabularies.

Returns:

True if the tokenizer uses embedding pooling, else False.

property is_trained: bool

Indicate if the tokenizer is trained (True).

Returns:

a boolean, equal to True if the tokenizer is trained, False otherwise.

property len: int | list[int]

Return the length of the vocabulary.

If the tokenizer uses embedding pooling/has multiple vocabularies, it will return the list of their lengths. Use the miditok.MusicTokenizer.__len__() magic method (len(tokenizer)) to get the sum of the lengths.

Returns:

length of the vocabulary.
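The difference between the len property and the __len__ magic method can be shown with a toy class. ToyTokenizer is a hypothetical stand-in that only stores a vocabulary; it does not model trained tokenizers.

```python
from typing import Union

Vocab = Union[dict[str, int], list[dict[str, int]]]


class ToyTokenizer:
    """Hypothetical stand-in exposing only vocabulary lengths."""

    def __init__(self, vocab: Vocab) -> None:
        self.vocab = vocab

    @property
    def is_multi_voc(self) -> bool:
        return isinstance(self.vocab, list)

    @property
    def len(self) -> Union[int, list[int]]:
        # One length per vocabulary when there are several.
        if self.is_multi_voc:
            return [len(voc) for voc in self.vocab]
        return len(self.vocab)

    def __len__(self) -> int:
        # Always a single integer: the sum of all vocabulary lengths.
        return sum(self.len) if self.is_multi_voc else self.len


single = ToyTokenizer({"PAD_None": 0, "Pitch_60": 1, "Pitch_62": 2})
pooled = ToyTokenizer(
    [{"PAD_None": 0, "Pitch_60": 1, "Pitch_62": 2}, {"PAD_None": 0, "Velocity_100": 1}]
)
print(single.len, len(single), pooled.len, len(pooled))
```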

load_tokens(path: str | Path, raw: bool = False) TokSequence | list[TokSequence] | dict

Load tokens saved as JSON files.

Parameters:
  • path – path of the file to load.

  • raw – if given True, will return the raw content of the json file. (default: False)

Returns:

the tokens, with the associated information saved with.

property pad_token_id: int

Return the id of the padding token (PAD_None). It is usually 0.

Returns:

id of the padding token (PAD_None).

preprocess_score(score: ScoreFactory())

Pre-process a symusic.Score object to resample its time and events values.

This method is called before parsing a Score’s contents for tokenization. Its notes attributes (times, pitches, velocities) will be downsampled and sorted, duplicated notes removed, as well as tempos. Empty tracks (with no note) will be removed from the symusic.Score object. Notes with pitches outside self.config.pitch_range will be deleted. Tracks with programs not supported by the tokenizer will be deleted.

This method is not inplace and does not alter the provided score object.

Parameters:

score – symusic.Score object to preprocess.

Returns:

the preprocessed score.

save(out_path: str | Path, additional_attributes: dict | None = None, filename: str | None = 'tokenizer.json') None

Save tokenizer in a Json file.

This can be useful to keep track of how a dataset has been tokenized.

Parameters:
  • out_path – output path to save the file. This can be either a path to a file (with a name and extension), or a path to a directory in which case the filename argument will be used.

  • additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

  • filename – name of the file to save, to be used in case out_path leads to a directory. (default: "tokenizer.json")
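The interplay between out_path and filename can be sketched as a small path-resolution helper. The rule used here to decide that out_path is a directory (it exists as one, or it has no file extension) is an assumption made for the example, and resolve_out_path is a hypothetical name.

```python
from pathlib import Path
from typing import Optional, Union


def resolve_out_path(
    out_path: Union[str, Path], filename: Optional[str] = "tokenizer.json"
) -> Path:
    """Return the file path that a save call would end up writing to."""
    out_path = Path(out_path)
    # Assumed rule: an existing directory, or a path with no extension,
    # is treated as a directory and completed with the file name.
    if out_path.is_dir() or out_path.suffix == "":
        out_path = out_path / (filename or "tokenizer.json")
    return out_path


print(resolve_out_path("runs/exp1"))
print(resolve_out_path("runs/exp1/remi.json"))
print(resolve_out_path("runs/exp1", filename="remi_bpe.json"))
```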

save_params(*args, **kwargs) None

DEPRECATED: save a tokenizer as a JSON file (calling tokenizer.save).

Parameters:
  • args – positional arguments.

  • kwargs – keyword arguments.

save_pretrained(save_directory: str | Path, *, repo_id: str | None = None, push_to_hub: bool = False, **push_to_hub_kwargs) str | None

Save the tokenizer in a local directory.

Overridden from huggingface_hub.ModelHubMixin. Since v0.21 this method automatically saves self.config after calling self._save_pretrained, which is unnecessary in our case.

Parameters:
  • save_directory – Path to directory in which the model weights and configuration will be saved.

  • push_to_hub – Whether to push your model to the Huggingface Hub after saving it.

  • repo_id – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.

  • push_to_hub_kwargs – Additional key word arguments passed along to the [~ModelHubMixin.push_to_hub] method.

save_tokens(tokens: TokSequence | list[int] | ndarray, path: str | Path, programs: list[tuple[int, bool]] | None = None, **kwargs) None

Save tokens as a JSON file.

In order to reduce disk space usage, only the ids are saved. Use kwargs to save any additional information within the JSON file.

Parameters:
  • tokens – tokens, as list, numpy array, torch or tensorflow Tensor.

  • path – path of the file to save.

  • programs – (optional), programs of the associated tokens, should be given as tuples (int, bool) for (program, is_drum).

  • kwargs – any additional information to save within the JSON file.
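A JSON round trip in the spirit of save_tokens and load_tokens might look like the sketch below. The exact layout of the files written by MidiTok is not specified here, so the key names ("ids", "programs") and both functions are assumptions for illustration.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Optional


def save_tokens(
    ids: list[int],
    path: Path,
    programs: Optional[list[tuple[int, bool]]] = None,
    **kwargs: Any,
) -> None:
    """Write the ids, plus any extra information, to a JSON file."""
    data: dict[str, Any] = {"ids": ids, **kwargs}
    if programs is not None:
        data["programs"] = programs
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as file:
        json.dump(data, file)


def load_tokens(path: Path) -> dict[str, Any]:
    """Read back the raw content of a tokens JSON file."""
    with path.open() as file:
        return json.load(file)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "tokens" / "example.json"
    save_tokens([1, 2, 3], path, programs=[(0, False)], source="example.mid")
    loaded = load_tokens(path)
print(loaded)
```

Note that JSON has no tuple type, so the (program, is_drum) pairs come back as lists.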

score_has_time_signatures_not_in_vocab(score: ScoreFactory()) bool

Check if a symusic.Score contains unsupported time signatures.

Parameters:

score – symusic.Score object.

Returns:

boolean indicating whether the score contains time signatures that are not in the tokenizer’s vocabulary.

property special_tokens: list[str]

Return the special tokens in the vocabulary.

Returns:

special tokens of the tokenizer

property special_tokens_ids: list[int]

Return the ids of the special tokens in the vocabulary.

Returns:

ids of the special tokens of the tokenizer

to_dict() dict

Return the serializable dictionary form of the tokenizer.

token_id_type(id_: int, vocab_id: int | None = None) str

Return the type of the given token id.

Parameters:
  • id_ – token id to get the type.

  • vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)

Returns:

the type of the token, as a string

token_ids_of_type(token_type: str, vocab_id: int | None = None) list[int]

Return the list of token ids of the given type.

Parameters:
  • token_type – token type to get the associated token ids.

  • vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)

Returns:

list of token ids.

tokenize_dataset(files_paths: str | Path | Sequence[str | Path], out_dir: str | Path, overwrite_mode: bool = True, validation_fn: Callable[[ScoreFactory()], bool] | None = None, save_programs: bool | None = None, verbose: bool = True) None

Tokenize a dataset or list of music files and save them in Json files.

The resulting json files will have an ids entry containing the token ids. The format of the ids will correspond to the format of the tokenizer (tokenizer.io_format). Note that the file tree of the source files, up to the deepest common root directory if files_paths is given as a list of paths, will be reproduced in out_dir. The config of the tokenizer will be saved as a file named tokenizer_config_file_name (default: tokenizer.json) in the out_dir directory.

Parameters:
  • files_paths – paths of the music files (MIDI, abc). It can also be a path to a directory, in which case this method will recursively find the MIDI and abc files within (.mid, .midi and .abc extensions, case insensitive).

  • out_dir – output directory to save the converted files.

  • overwrite_mode – if True, will overwrite files if they already exist when trying to save the new ones created by the method. This is enabled by default, as it is good practice to use dedicated directories for each tokenized dataset. If False, if a file already exist, the new one will be saved in the same directory, with the same name with a number appended at the end. Both token files and tokenizer config are concerned. (default: True)

  • validation_fn – a function checking that a music file is valid according to your conditions (e.g. time signature, minimum/maximum length, instruments…). (default: None)

  • save_programs – will save the programs of the tracks of the files as an entry in the Json file. This option is probably unnecessary when using a multitrack tokenizer (config.use_programs), as the program information is present within the tokens, and that the tracks having the same programs are likely to have been merged. (default: False if config.use_programs, else True)

  • verbose – will emit warnings about errors when loading files, or if the content of some files is incorrect or needs your attention. (default: True)

tokens_errors(tokens: TokSequence | list[TokSequence] | list[int | list[int]] | ndarray) float | list[float]

Return the ratio of errors of prediction in a sequence of tokens.

Check if a sequence of tokens is made of good token types successions and returns the error ratio (lower is better).

Parameters:

tokens – sequence of tokens to check.

Returns:

the error ratio (lower is better).
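The idea of checking token type successions can be sketched as follows. The ALLOWED_NEXT rules are hypothetical and much simpler than those of a real tokenizer, and dividing the number of errors by the sequence length is an assumption about how the ratio is normalized.

```python
# Hypothetical rules: the token types allowed after each token type.
ALLOWED_NEXT = {
    "Bar": {"Position", "Bar"},
    "Position": {"Pitch"},
    "Pitch": {"Velocity"},
    "Velocity": {"Duration"},
    "Duration": {"Pitch", "Position", "Bar"},
}


def tokens_errors(tokens: list[str]) -> float:
    """Return the ratio of invalid type successions in a token sequence."""
    if len(tokens) < 2:
        return 0.0
    # Tokens are formatted as "Type_Value": keep the type only.
    types = [token.split("_")[0] for token in tokens]
    errors = sum(
        1
        for previous, current in zip(types, types[1:])
        if current not in ALLOWED_NEXT.get(previous, set())
    )
    return errors / len(tokens)


tokens = [
    "Bar_None", "Position_0", "Pitch_60", "Velocity_100", "Duration_1.0.8",
    "Pitch_64", "Duration_1.0.8", "Position_16",
]
ratio = tokens_errors(tokens)
print(ratio)
```

Here the only invalid succession is the Pitch token directly followed by a Duration token, with no Velocity in between.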

train(vocab_size: int, model: Literal['BPE', 'Unigram', 'WordPiece'] | Model | None = None, iterator: Iterable | None = None, files_paths: Sequence[Path] | None = None, **kwargs) None

Train the tokenizer to build its vocabulary with BPE, Unigram or WordPiece.

The data used for training can either be given through the iterator argument as an iterable object yielding strings, or by files_paths as a list of paths to music files that will be tokenized. You can read the Hugging Face 🤗tokenizers documentation, and 🤗tokenizers course for more details about the iterator and input type.

If splitting the token sequences per bar or beat, a “Metaspace” pre-tokenizer and decoder will be used. Each chunk of tokens will be prepended with a special “▁” (U+2581) character to mark its beginning, as would be a word.

A few considerations to note:

  1. The WordPiece model has a max_input_chars_per_word attribute, which controls the maximum number of “base tokens” a sequence of ids can contain before the model discards it and replaces it with a predefined “unknown” token (unk_token model attribute). This means that, depending on the base sequence lengths of your files, the tokenizer will likely discard them. This can be addressed by either:

     1) splitting the token sequence per bars or beats before encoding ids (highly recommended), yielding smaller subsequences whose lengths will likely be lower than the model’s max_input_chars_per_word attribute;

     2) setting the model’s max_input_chars_per_word attribute to a value higher than most of the sequences of ids encoded by the WordPiece model. A high max_input_chars_per_word value will however drastically increase the encoding and decoding times, reducing its interest.

     The default values set by MidiTok are 400 when splitting ids in bar subsequences and 100 when splitting ids in beat subsequences. The max_input_chars_per_word and unk_token model attributes can be set by referencing them in the keyword arguments of this method (kwargs).

  2. The Hugging Face Unigram model training is not 100% deterministic. As such, if you are using Unigram, you should train your tokenizer only once before using it to save tokenized files or train a model. Otherwise, some token ids might be swapped, resulting in incoherent encodings-decodings.

The training progress bar will not appear with non-proper terminals. (cf GitHub issue )

Parameters:
  • vocab_size – size of the vocabulary to learn / build.

  • model – backbone model to use to train the tokenizer. MidiTok relies on the Hugging Face tokenizers library, and supports the BPE, Unigram and WordPiece models. This argument can be either a string indicating the model to use, an already initialized model, or None if you want to retrain a tokenizer already trained. (default: None, default to BPE if the tokenizer is not already trained, keeps the same model otherwise)

  • iterator – an iterable object yielding the training data, as lists of string. It can be a list or a Generator. This iterator will be passed to the model for training. It must implement the __len__ method. If None is given, you must use the files_paths argument. (default: None)

  • files_paths – paths of the music files to load and use. (default: None)

  • kwargs – any additional argument to pass to the trainer or model. See the tokenizers docs for more details.
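The WordPiece length consideration can be illustrated without any training library. The sketch below splits a token sequence into bar chunks and reports the chunks that would exceed a given max_input_chars_per_word. Splitting on tokens starting with "Bar_" and both helper functions are assumptions made for the example, not the way MidiTok splits sequences internally.

```python
def split_per_bars(tokens: list[str]) -> list[list[str]]:
    """Split a token sequence into chunks, each starting at a Bar token."""
    chunks: list[list[str]] = []
    for token in tokens:
        # Open a new chunk on each bar, or for tokens preceding the first bar.
        if token.startswith("Bar_") or not chunks:
            chunks.append([])
        chunks[-1].append(token)
    return chunks


def chunks_over_limit(
    chunks: list[list[str]], max_input_chars_per_word: int
) -> list[int]:
    """Return the indexes of the chunks longer than the WordPiece limit."""
    return [
        idx for idx, chunk in enumerate(chunks)
        if len(chunk) > max_input_chars_per_word
    ]


tokens = [
    "Bar_None", "Pitch_60", "Pitch_62",
    "Bar_None", "Pitch_64",
    "Bar_None", "Pitch_65", "Pitch_67", "Pitch_69", "Pitch_71",
]
chunks = split_per_bars(tokens)
too_long = chunks_over_limit(chunks, max_input_chars_per_word=4)
print([len(chunk) for chunk in chunks], too_long)
```

With a limit of 4, the unsplit sequence of 10 tokens would be replaced entirely by the unknown token, whereas after splitting only the last bar is affected.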

property vocab: dict[str, int] | list[dict[str, int]]

Get the base vocabulary, as a dictionary mapping tokens (str) to their ids.

The different (hidden / protected) vocabulary attributes of the class are:

  • ._vocab_base: Dict[str: int] token -> id - Registers all known base tokens;

  • .__vocab_base_inv: Dict[int: str] id -> token - Inverse of ._vocab_base, to go the other way;

  • ._vocab_base_id_to_byte: Dict[int: str] id -> byte - Links ids to their associated unique bytes;

  • ._vocab_base_byte_to_token: Dict[str: str] byte -> token - Similar to the above, but for tokens;

  • ._vocab_learned_bytes_to_tokens: Dict[str: List[str]] byte(s) -> token(s) - Used to decode BPE/Unigram/WordPiece token ids;

  • ._model.get_vocab(): Dict[str: int] byte -> id - BPE/Unigram/WordPiece model vocabulary, based on unique bytes.

Before training the tokenizer, bytes are obtained by running chr(id). After training, if we did start from an empty vocabulary, some base tokens might be removed from ._vocab_base if they were never found in the training samples. The base vocabulary being changed, chr(id) would then bind to incorrect bytes (on which byte successions would not have been learned). This is why the original id/token/byte associations are registered in ._vocab_base_id_to_byte and ._vocab_base_byte_to_token.

Returns:

the base vocabulary.
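The reason for registering the original byte associations can be reproduced with a few dictionaries. This is a toy reconstruction: the token names, the removal of one token and the CHR_ID_START offset are assumptions for the example.

```python
CHR_ID_START = 33  # assumed offset applied before calling chr()

base_vocab = {"PAD_None": 0, "Pitch_60": 1, "Pitch_61": 2, "Pitch_62": 3}
# Associations registered before training.
id_to_byte = {id_: chr(id_ + CHR_ID_START) for id_ in base_vocab.values()}
byte_to_token = {id_to_byte[id_]: token for token, id_ in base_vocab.items()}

# Suppose "Pitch_61" was never seen during training and gets removed,
# so the remaining tokens are assigned new, contiguous ids.
kept_tokens = [token for token in base_vocab if token != "Pitch_61"]
new_vocab = {token: id_ for id_, token in enumerate(kept_tokens)}

# Recomputing the byte from the new id now points to the wrong token.
naive_byte = chr(new_vocab["Pitch_62"] + CHR_ID_START)
# The registered mapping still gives the byte the model was trained on.
original_byte = next(
    byte for byte, token in byte_to_token.items() if token == "Pitch_62"
)

print(repr(naive_byte), byte_to_token[naive_byte], repr(original_byte))
```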

property vocab_model: None | dict[str, int]

Return the vocabulary learnt with BPE.

In case the tokenizer has not been trained with BPE, it returns None.

Returns:

the BPE model’s vocabulary.

property vocab_size: int

Return the size of the vocabulary, by calling len(tokenizer).

Returns:

size of the vocabulary.

Tokens & TokSequence input / output format

Depending on the tokenizer in use, the format of the tokens returned by the miditok.MusicTokenizer.encode() method may vary, as well as the expected format for the miditok.MusicTokenizer.decode() method. The format is given by the miditok.MusicTokenizer.io_format property. For any tokenizer, the format is the same for both methods.

The format is deduced from the miditok.MusicTokenizer.is_multi_voc and one_token_stream tokenizer attributes. one_token_stream determines whether the tokenizer outputs a unique miditok.TokSequence covering all the tracks of a music file or one miditok.TokSequence per track. It is equal to tokenizer.config.one_token_stream_for_programs, except for miditok.MMM for which it is enabled while one_token_stream_for_programs is False. miditok.MusicTokenizer.is_multi_voc being True means that each “token” within a miditok.TokSequence is actually a list of C “sub-tokens”, C being the number of sub-token classes.

This results in four situations, where I (instrument) is the number of tracks, T (token) is the number of tokens and C (class) the number of subtokens per token step:

  • is_multi_voc and one_token_stream are both False: [I,(T)];

  • is_multi_voc is False and one_token_stream is True: (T);

  • is_multi_voc is True and one_token_stream is False: [I,(T,C)];

  • is_multi_voc and one_token_stream are both True: (T,C).

Note that if there is no I dimension in the format, the output of miditok.MusicTokenizer.encode() is a miditok.TokSequence object, otherwise it is a list of miditok.TokSequence objects (one per token stream / track).
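The four situations listed above follow from two independent choices, which can be written as a short function. This is a sketch of the deduction only; io_format here is a free function defined for the example, not the tokenizer property itself.

```python
def io_format(is_multi_voc: bool, one_token_stream: bool) -> tuple[str, ...]:
    """Deduce the i/o format from the two tokenizer attributes."""
    dimensions = []
    if not one_token_stream:
        dimensions.append("I")  # one sequence per track / instrument
    dimensions.append("T")  # tokens, or time steps
    if is_multi_voc:
        dimensions.append("C")  # sub-token classes (embedding pooling)
    return tuple(dimensions)


for multi_voc in (False, True):
    for one_stream in (False, True):
        print(multi_voc, one_stream, io_format(multi_voc, one_stream))
```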

Some tokenizer examples to illustrate:

  • TSD without config.use_programs will not have multiple vocabularies and will treat each track as a unique stream of tokens, hence it will convert music files to a list of miditok.TokSequence objects, (I,T) format.

  • TSD with config.use_programs being True will convert all tracks to a single stream of tokens, hence one miditok.TokSequence object, (T) format.

  • CPWord is a multi-voc tokenizer, without config.use_programs it will treat each track as a distinct stream of tokens, hence it will convert music files to a list of miditok.TokSequence objects with the (I,T,C) format.

  • Octuple is a multi-voc tokenizer and converts all tracks to a single stream of tokens, hence it will convert music files to a miditok.TokSequence object, (T,C) format.

Magic methods

Magic methods give intuitive access to a tokenizer’s attributes and methods. We list them here with some examples.

miditok.MusicTokenizer.__call__(self, obj: Score | TokSequence | list[TokSequence, int, list[int]] | np.ndarray, *args, **kwargs) TokSequence | list[TokSequence] | Score

Tokenize a music file (MIDI/abc), or decode tokens into a symusic.Score.

Calling a tokenizer directly converts a music file (MIDI/abc) to tokens or vice-versa. The method automatically detects symusic.Score and miditok.TokSequence objects, as well as paths to music or json files. It will call the miditok.MusicTokenizer.encode() method if you provide a symusic.Score object or path to a music file, or the miditok.MusicTokenizer.decode() method otherwise.

Parameters:

obj – a symusic.Score object, a miditok.TokSequence object, or a path to a music or tokens json file.

Returns:

the converted object.

tokens = tokenizer(score)
score2 = tokenizer(tokens)

miditok.MusicTokenizer.__getitem__(self, item: int | str | tuple[int, int | str]) str | int | list[int]

Convert a token (int) to an event (str), or vice-versa.

Parameters:

item – a token (int) or an event (str). For tokenizers with embedding pooling/multiple vocabularies (tokenizer.is_multi_voc), you must either provide a string (token) that is within all vocabularies (e.g. special tokens), or a tuple where the first element is the index of the vocabulary and the second the element to index.

Returns:

the converted object.

pad_token = tokenizer["PAD_None"]

miditok.MusicTokenizer.__len__(self) int

Return the length of the vocabulary.

If the tokenizer uses embedding pooling/has multiple vocabularies, it will return the sum of their lengths. If the tokenizer has been trained, this method returns the length of its model’s vocabulary, i.e. the proper number of possible token ids. Otherwise, it will return the length of the base vocabulary. Use the miditok.MusicTokenizer.len property (tokenizer.len) to get the list of lengths.

Returns:

length of the vocabulary.

num_classes = len(tokenizer)
num_classes_per_vocab = tokenizer.len  # applicable to tokenizer with embedding pooling, e.g. CPWord or Octuple

miditok.MusicTokenizer.__eq__(self, other: MusicTokenizer) bool

Check that two tokenizers are identical.

This is done by comparing their vocabularies, and configuration.

Parameters:

other – tokenizer to compare.

Returns:

True if the vocabulary(ies) are identical, False otherwise.

if tokenizer1 == tokenizer2:
    print("The tokenizers have the same vocabulary and configurations!")

Save / Load a tokenizer

You can save and load a tokenizer, including its configuration and vocabulary. This is especially useful after Training a tokenizer.

miditok.MusicTokenizer.save(self, out_path: str | Path, additional_attributes: dict | None = None, filename: str | None = 'tokenizer.json') None

Save tokenizer in a Json file.

This can be useful to keep track of how a dataset has been tokenized.

Parameters:
  • out_path – output path to save the file. This can be either a path to a file (with a name and extension), or a path to a directory in which case the filename argument will be used.

  • additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)

  • filename – name of the file to save, to be used in case out_path leads to a directory. (default: "tokenizer.json")

To load a tokenizer from saved parameters, just use the params argument when creating it:

from pathlib import Path
from miditok import REMI

tokenizer = REMI(params=Path("to", "tokenizer.json"))