Bases of MidiTok¶
This page introduces the bases of MidiTok: how a tokenizer works and the basic elements that make up the library.
MidiTok’s workflow¶
MidiTok uses a common workflow for all its tokenizers, which is as follows:
1. Music file preprocessing: time is downsampled to match the tokenizer’s time resolution, tracks with the same programs are merged, notes with pitches outside the tokenizer’s pitch range are removed, note velocities and tempos are downsampled, and finally notes, tempos and time signatures are deduplicated;
2. Parsing of global events: tempo and time signature tokens are created;
3. Parsing of the track events: notes, chords, controls (pedals…) and tokens specific to each track are parsed to create their associated tokens;
4. Creating time tokens: the tokens representing time are created in order to bind the previously created global and track tokens.
The resulting tokens are provided by the tokenizer as one or several miditok.TokSequence objects, depending on the tokenizer’s IO format (see Tokens & TokSequence input / output format).
The first three steps are common to all tokenizers, while the fourth is handled independently by each tokenizer. The first step formats the music file so that its content fits the tokenizer’s vocabulary before it is parsed.
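In practice, the whole workflow is run by calling a single method. Here is a minimal sketch, using REMI as an example tokenizer and a hypothetical file path:
from pathlib import Path

from miditok import REMI, TokenizerConfig

# Create a tokenizer with the default configuration
tokenizer = REMI(TokenizerConfig())
# Run the full workflow: preprocessing, global/track event parsing, time tokens
tokens = tokenizer.encode(Path("to", "file.mid"))  # hypothetical path
# Convert the tokens back into a symusic.Score
score = tokenizer.decode(tokens)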
Vocabulary¶
As introduced in Tokens and vocabulary, the vocabulary acts as a lookup table between the tokens (string) and their ids (integers).
It can be accessed with tokenizer.vocab to get the string to id mapping.
For tokenizers with embedding pooling (e.g. CPWord or Octuple), tokenizer.vocab will be a list of dictionaries, and the tokenizer.is_multi_voc property will be True.
With a trained tokenizer:
tokenizer.vocab holds all the basic tokens describing the note and time attributes of music. By analogy with text, this vocabulary can be seen as the alphabet of unique characters.
After Training a tokenizer, a new vocabulary is built with newly created tokens from pairs of basic tokens. This vocabulary can be accessed with tokenizer.vocab_model, and maps tokens as bytes (string) to their associated ids (int). This is the vocabulary of the 🤗tokenizers model.
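A minimal sketch of inspecting the vocabulary (the Pitch_60 token exists with the default pitch range, which covers the piano keys):
tokenizer = REMI()
pitch_60_id = tokenizer.vocab["Pitch_60"]  # id (int) of the token "Pitch_60"
# After training, the model's byte -> id vocabulary is available:
# tokenizer.vocab_model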
TokSequence¶
The methods of MidiTok use miditok.TokSequence objects as inputs and outputs. A miditok.TokSequence holds tokens as strings, integers, miditok.Event objects and bytes (the latter used internally to encode the token ids with trained tokenizers). TokSequences are subscriptable, can be sliced and concatenated, and implement the __len__ magic method.
You can use the miditok.MusicTokenizer.complete_sequence() method to automatically fill the non-initialized attributes of a miditok.TokSequence.
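A short sketch of manipulating a TokSequence, assuming a tokenizer that returns one sequence per track (the default for REMI) and a hypothetical file path:
seq = tokenizer.encode(Path("to", "file.mid"))[0]  # sequence of the first track
print(len(seq), seq.tokens[:4], seq.ids[:4])
first_half = seq[: len(seq) // 2]  # TokSequences are subscriptable and sliceable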
- class miditok.TokSequence(tokens: list[str | list[str]] = <factory>, ids: list[int | list[int]] = <factory>, bytes: str = <factory>, events: list[~miditok.classes.Event | list[~miditok.classes.Event]] = <factory>, are_ids_encoded: bool = False, _ticks_bars: list[int] = <factory>, _ticks_beats: list[int] = <factory>, _ids_decoded: list[int | list[int]] = <factory>)¶
Sequence of tokens.
A TokSequence can represent tokens in several forms:
- tokens (list of str): tokens as a sequence of strings;
- ids (list of int): these are the ones to be fed to models;
- events (list of Event): Event objects that can carry time or other information useful for debugging;
- bytes (str): ids converted into unique bytes, all joined together in a single string. Bytes are used internally by MidiTok for the tokenizer’s model (BPE, Unigram, WordPiece).
The are_ids_encoded attribute tells whether ids is encoded. miditok.MusicTokenizer.complete_sequence() can be used to complete the non-initialized attributes.
- split_per_bars() list[TokSequence]¶
Split the sequence into subsequences corresponding to each bar.
This method can only be called on properly tokenized sequences, otherwise it will raise an error.
- Returns:
list of subsequences for each bar.
- split_per_beats() list[TokSequence]¶
Split the sequence into subsequences corresponding to each beat.
This method can only be called on properly tokenized sequences, otherwise it will raise an error.
- Returns:
list of subsequences for each beat.
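A sketch of both methods, assuming seq was produced by the tokenizer as in the sketch above (so its internal bar and beat ticks are populated):
bar_seqs = seq.split_per_bars()    # list of TokSequence, one per bar
beat_seqs = seq.split_per_beats()  # list of TokSequence, one per beat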
The MusicTokenizer class¶
MidiTok features several MIDI tokenizations, all inheriting from the miditok.MusicTokenizer class.
You can customize your tokenizer by creating it with a custom miditok.TokenizerConfig.
- class miditok.MusicTokenizer(*args, **kwargs)¶
Base music tokenizer class, acting as a common framework.
This is the base class of all tokenizers, containing the common methods and attributes. It serves as a framework and implements most of the global tokenization workflow. Child classes should only implement specific methods for their specific behaviors, leaving most of the logic here.
- Parameters:
tokenizer_config – the tokenizer’s configuration, as a miditok.TokenizerConfig object.
params – path to a tokenizer config file. This will override other arguments and load the tokenizer based on the config file. This is particularly useful if the tokenizer learned Byte Pair Encoding. (default: None)
- add_attribute_control(attribute_control: AttributeControl) None¶
Add a miditok.attribute_control.AttributeControl to the tokenizer.
The tokens of the attribute control will also be added to the vocabulary.
- Parameters:
attribute_control – miditok.attribute_control.AttributeControl to add to the tokenizer.
- add_to_vocab(token: str | Event, special_token: bool = False, vocab_idx: int | None = None, byte_: str | None = None) None¶
Add an event to the vocabulary. Its id will be the length of the vocab.
- Parameters:
token – token to add, as a formatted string of the form “Type_Value”, e.g. Pitch_80, or an Event.
special_token – whether the token is special. (default: False)
vocab_idx – idx of the vocabulary (in case of embedding pooling). (default: None)
byte_ – unique byte associated to the token. The associated byte of a token is used to encode-decode ids with the tokenizer’s model (BPE, Unigram, WordPiece). If None is given, it will default to chr(id_ + CHR_ID_START). (default: None)
- complete_sequence(seq: TokSequence, complete_bytes: bool = False) None¶
Complete (inplace) a miditok.TokSequence.
The input sequence can have some of its attributes (ids, tokens) not initialized (i.e. None). This method will initialize them from the present ones. The events attribute will not be filled as it is only intended for debug purposes. The bytes attribute will be created if complete_bytes is provided as True and if the tokenizer has been trained.
- Parameters:
seq – input miditok.TokSequence, must have at least one attribute defined.
complete_bytes – will complete the bytes form of each token. This is only applicable if the tokenizer has been trained.
- decode(tokens: TokSequence | list[TokSequence] | list[int | list[int]] | ndarray, programs: list[tuple[int, bool]] | None = None, output_path: str | Path | None = None)¶
Detokenize one or several sequences of tokens into a symusic.Score.
You can give the token sequences either as miditok.TokSequence objects, lists of integers, numpy arrays or PyTorch/Jax/Tensorflow tensors. The Score’s time division will be the same as the tokenizer’s: tokenizer.time_division.
- Parameters:
tokens – tokens to convert. Can be either a list of miditok.TokSequence, a Tensor (PyTorch and Tensorflow are supported), a numpy array or a Python list of ints. The first dimension represents tracks, unless the tokenizer handles tracks altogether as a single token sequence (tokenizer.one_token_stream == True).
programs – programs of the tracks. If none is given, will default to piano, program 0. (default: None)
output_path – path to save the file. (default: None)
- Returns:
the symusic.Score object.
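A hedged usage sketch, assuming tokens came from tokenizer.encode() or a model, and writing the result to a hypothetical path with symusic:
score = tokenizer.decode(tokens)
score.dump_midi(Path("to", "decoded.mid"))  # symusic method writing a MIDI file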
- decode_token_ids(seq: TokSequence | list[TokSequence]) None¶
Decode the ids of a miditok.TokSequence with BPE, Unigram or WordPiece.
This method only modifies the .ids attribute of the input sequence(s) and does not complete it. This method can be used recursively on lists of miditok.TokSequence.
- Parameters:
seq – token sequence(s) to decode.
- encode(score: Score | Path, encode_ids: bool = True, no_preprocess_score: bool = False, attribute_controls_indexes: Mapping[int, Mapping[int, Sequence[int] | bool]] | None = None) TokSequence | list[TokSequence]¶
Tokenize a music file (MIDI/abc), given as a symusic.Score or a file path.
You can provide a Path to the file to tokenize, or a symusic.Score object. This method returns a (list of) miditok.TokSequence.
If you are implementing your own tokenization by subclassing this class, override the protected _score_to_tokens method.
- Parameters:
score – the symusic.Score object to convert.
encode_ids – the backbone model (BPE, Unigram, WordPiece) will encode the tokens and compress the sequence. Can only be used if the tokenizer has been trained. (default: True)
no_preprocess_score – whether to preprocess the symusic.Score. If this argument is provided as True, make sure that the corresponding music file / symusic.Score has already been preprocessed by the tokenizer (miditok.MusicTokenizer.preprocess_score()) or that its content is aligned with the tokenizer’s vocabulary, otherwise the tokenization is likely to crash. This argument is useful in cases where you need to use the preprocessed symusic.Score along with the tokens, to avoid preprocessing it twice, as this method preprocesses it inplace. (default: False)
attribute_controls_indexes – indices of the attribute controls to compute and the associated tracks and bars. This argument has to be provided as a dictionary mapping track indices to dictionaries mapping attribute control indices (indexing tokenizer.attribute_controls) to a sequence of bar indices if the AC is “bar-level”, or anything if it is “track-level”. Its structure is: {track_idx: {ac_idx: Any (track ac) | [bar_idx, ...] (bar ac)}}. This argument is meant to be used when training a model, in order to make it learn to generate tokens according to the attribute controls. For maximum safety, it should be used with no_preprocess_score and an already preprocessed symusic.Score, in order to make sure that the provided track indices remain correct, as the preprocessing might delete or merge tracks depending on the tokenizer’s configuration.
- Returns:
a miditok.TokSequence if tokenizer.one_token_stream is True, else a list of miditok.TokSequence objects.
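A sketch of both accepted input types (the file path is hypothetical):
from symusic import Score

tokens = tokenizer.encode(Path("to", "file.mid"))  # from a file path
score = Score(Path("to", "file.mid"))              # or load with symusic first
tokens = tokenizer.encode(score)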
- encode_token_ids(seq: TokSequence | list[TokSequence]) None¶
Encode a miditok.TokSequence with BPE, Unigram or WordPiece.
The method works inplace and only alters the sequence’s .ids. The method also works with lists of miditok.TokSequence. If a list is given, the model will encode all sequences in one batch to speed up the operation.
- Parameters:
seq – the miditok.TokSequence whose ids to encode.
- property io_format: tuple[str, ...]¶
Return the i/o format of the tokenizer.
The characters for each dimension returned are: *
I: track or instrument; *T: token, or time step; *C: class of token, when using embedding pooling.- Returns:
i/o format of the tokenizer, as a tuple of strings which represent:
- property is_multi_voc: bool¶
Indicate if the tokenizer uses embedding pooling / has multiple vocabularies.
- Returns:
True if the tokenizer uses embedding pooling, else False.
- property is_trained: bool¶
Indicate if the tokenizer is trained (True).
- Returns:
a boolean, equal to True if the tokenizer is trained, False otherwise.
- property len: int | list[int]¶
Return the length of the vocabulary.
If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the list of their lengths. Use the miditok.MusicTokenizer.__len__() magic method (len(tokenizer)) to get the sum of the lengths.
- Returns:
length of the vocabulary.
- load_tokens(path: str | Path, raw: bool = False) TokSequence | list[TokSequence] | dict¶
Load tokens saved as JSON files.
- Parameters:
path – path of the file to load.
raw – if given True, will return the raw content of the json file. (default: False)
- Returns:
the tokens, along with the associated information they were saved with.
- property pad_token_id: int¶
Return the id of the padding token (PAD_None). It is usually 0.
- Returns:
id of the padding token (PAD_None).
- preprocess_score(score: Score)¶
Pre-process a symusic.Score object to resample its time and event values.
This method is called before parsing a Score’s contents for tokenization. Its note attributes (times, pitches, velocities) will be downsampled and sorted, and duplicated notes will be removed, as well as tempos. Empty tracks (with no note) will be removed from the symusic.Score object. Notes with pitches outside self.config.pitch_range will be deleted. Tracks with programs not supported by the tokenizer will be deleted.
This method is not inplace and does not alter the provided score object.
- Parameters:
score – symusic.Score object to preprocess.
- Returns:
the preprocessed score.
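A sketch pairing this method with encode(), so the preprocessed Score and the tokens stay aligned (the path is hypothetical, Score imported from symusic as above):
score = Score(Path("to", "file.mid"))
processed = tokenizer.preprocess_score(score)  # returns a new, resampled Score
tokens = tokenizer.encode(processed, no_preprocess_score=True)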
- save(out_path: str | Path, additional_attributes: dict | None = None, filename: str | None = 'tokenizer.json') None¶
Save the tokenizer in a JSON file.
This can be useful to keep track of how a dataset has been tokenized.
- Parameters:
out_path – output path to save the file. This can be either a path to a file (with a name and extension), or a path to a directory, in which case the filename argument will be used.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)
filename – name of the file to save, to be used in case out_path leads to a directory. (default: "tokenizer.json")
- save_params(*args, **kwargs) None¶
DEPRECATED: save a tokenizer as a JSON file (calling tokenizer.save).
- Parameters:
args – positional arguments.
kwargs – keyword arguments.
- save_pretrained(save_directory: str | Path, *, repo_id: str | None = None, push_to_hub: bool = False, **push_to_hub_kwargs) str | None¶
Save the tokenizer in a local directory.
Overridden from huggingface_hub.ModelHubMixin. Since v0.21, this method automatically saves self.config after calling self._save_pretrained, which is unnecessary in our case.
- Parameters:
save_directory – Path to directory in which the model weights and configuration will be saved.
push_to_hub – Whether to push your model to the Huggingface Hub after saving it.
repo_id – ID of your repository on the Hub. Used only if push_to_hub=True. Will default to the folder name if not provided.
push_to_hub_kwargs – Additional keyword arguments passed along to the ModelHubMixin.push_to_hub method.
- save_tokens(tokens: TokSequence | list[int] | ndarray, path: str | Path, programs: list[tuple[int, bool]] | None = None, **kwargs) None¶
Save tokens as a JSON file.
In order to reduce disk space usage, only the ids are saved. Use kwargs to save any additional information within the JSON file.
- Parameters:
tokens – tokens, as list, numpy array, torch or tensorflow Tensor.
path – path of the file to save.
programs – (optional) programs of the associated tokens, to be given as tuples (int, bool) for (program, is_drum).
kwargs – any additional information to save within the JSON file.
- score_has_time_signatures_not_in_vocab(score: Score) bool¶
Check if a symusic.Score contains unsupported time signatures.
- Parameters:
score – symusic.Score object.
- Returns:
boolean indicating whether the score can be processed by the tokenizer.
- property special_tokens: list[str]¶
Return the special tokens in the vocabulary.
- Returns:
special tokens of the tokenizer
- property special_tokens_ids: list[int]¶
Return the ids of the special tokens in the vocabulary.
- Returns:
ids of the special tokens of the tokenizer
- to_dict() dict¶
Return the serializable dictionary form of the tokenizer.
- token_id_type(id_: int, vocab_id: int | None = None) str¶
Return the type of the given token id.
- Parameters:
id_ – token id to get the type of.
vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)
- Returns:
the type of the token, as a string
- token_ids_of_type(token_type: str, vocab_id: int | None = None) list[int]¶
Return the list of token ids of the given type.
- Parameters:
token_type – token type to get the associated token ids.
vocab_id – index of the vocabulary associated to the token, if applicable. (default: None)
- Returns:
list of token ids.
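A small sketch of these two methods together (the "Pitch" type exists in most tokenizers’ vocabularies):
pitch_ids = tokenizer.token_ids_of_type("Pitch")
token_type = tokenizer.token_id_type(pitch_ids[0])  # "Pitch"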
- tokenize_dataset(files_paths: str | Path | Sequence[str | Path], out_dir: str | Path, overwrite_mode: bool = True, validation_fn: Callable[[Score], bool] | None = None, save_programs: bool | None = None, verbose: bool = True) None¶
Tokenize a dataset or list of music files and save them as JSON files.
The resulting JSON files will have an ids entry containing the token ids. The format of the ids will correspond to the format of the tokenizer (tokenizer.io_format). Note that the file tree of the source files, up to the deepest common root directory if files_paths is given as a list of paths, will be reproduced in out_dir. The config of the tokenizer will be saved as a file named tokenizer_config_file_name (default: tokenizer.json) in the out_dir directory.
- Parameters:
files_paths – paths of the music files (MIDI, abc). It can also be a path to a directory, in which case this method will recursively find the MIDI and abc files within (.mid, .midi and .abc extensions, case insensitive).
out_dir – output directory to save the converted files.
overwrite_mode – if True, will overwrite files if they already exist when trying to save the new ones created by the method. This is enabled by default, as it is good practice to use dedicated directories for each tokenized dataset. If False and a file already exists, the new one will be saved in the same directory with the same name and a number appended at the end. Both token files and the tokenizer config are concerned. (default: True)
validation_fn – a function checking that a music file satisfies your conditions (e.g. time signature, minimum/maximum length, instruments…). (default: None)
save_programs – will save the programs of the tracks of the files as an entry in the JSON file. This option is probably unnecessary when using a multitrack tokenizer (config.use_programs), as the program information is present within the tokens and tracks having the same programs are likely to have been merged. (default: False if config.use_programs, else True)
verbose – will emit warnings when errors occur while loading files, or if some file’s content is incorrect or needs your attention. (default: True)
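A usage sketch with hypothetical paths:
files = sorted(Path("to", "dataset").glob("**/*.mid"))
tokenizer.tokenize_dataset(files, Path("to", "dataset_tokens"))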
- tokens_errors(tokens: TokSequence | list[TokSequence] | list[int | list[int]] | ndarray) float | list[float]¶
Return the ratio of errors of prediction in a sequence of tokens.
Check if a sequence of tokens is made of good token type successions and return the error ratio (lower is better).
- Parameters:
tokens – sequence of tokens to check.
- Returns:
the error ratio (lower is better).
- train(vocab_size: int, model: Literal['BPE', 'Unigram', 'WordPiece'] | Model | None = None, iterator: Iterable | None = None, files_paths: Sequence[Path] | None = None, **kwargs) None¶
Train the tokenizer to build its vocabulary with BPE, Unigram or WordPiece.
The data used for training can either be given through the iterator argument as an iterable object yielding strings, or by files_paths as a list of paths to music files that will be tokenized. You can read the Hugging Face 🤗tokenizers documentation and 🤗tokenizers course for more details about the iterator and input type.
If splitting the token sequences per bar or beat, a “Metaspace” pre-tokenizer and decoder will be used. Each chunk of tokens will be prepended with a special “▁” (U+2581) character to mark its beginning, as a word would be.
A few considerations to note:
1. The WordPiece model has a max_input_chars_per_word attribute, which controls the maximum number of “base tokens” a sequence of ids can contain before the model discards it and replaces it with a predefined “unknown” token (the unk_token model attribute). This means that, depending on the base sequence lengths of your files, the tokenizer may discard them. This can be addressed by either: 1) splitting the token sequences per bar or beat before encoding the ids (highly recommended) into smaller subsequences, whose lengths will likely be lower than the model’s max_input_chars_per_word attribute; or 2) setting the model’s max_input_chars_per_word attribute to a value higher than the lengths of most of the sequences of ids encoded by the WordPiece model. A high max_input_chars_per_word value will however drastically increase the encoding and decoding times, reducing its interest. The default values set by MidiTok are 400 when splitting ids into bar subsequences and 100 when splitting ids into beat subsequences. The max_input_chars_per_word and unk_token model attributes can be set by referencing them in the keyword arguments of this method (kwargs).
2. The Hugging Face Unigram model training is not 100% deterministic. As such, if you are using Unigram, you should train your tokenizer only once before using it to save tokenized files or train a model. Otherwise, some token ids might be swapped, resulting in incoherent encodings-decodings.
The training progress bar will not appear with non-proper terminals (cf. GitHub issue).
- Parameters:
vocab_size – size of the vocabulary to learn / build.
model – backbone model to use to train the tokenizer. MidiTok relies on the Hugging Face tokenizers library, and supports the BPE, Unigram and WordPiece models. This argument can be either a string indicating the model to use, an already initialized model, or None if you want to retrain a tokenizer that is already trained. (default: None; defaults to BPE if the tokenizer is not already trained, otherwise keeps the same model)
iterator – an iterable object yielding the training data, as lists of strings. It can be a list or a Generator. This iterator will be passed to the model for training. It must implement the __len__ method. If None is given, you must use the files_paths argument. (default: None)
files_paths – paths of the music files to load and use. (default: None)
kwargs – any additional arguments to pass to the trainer or model. See the tokenizers docs for more details.
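A minimal training sketch, assuming files is a list of music file paths as in the tokenize_dataset sketch above (the vocabulary size and output path are hypothetical):
tokenizer.train(vocab_size=30000, model="BPE", files_paths=files)
tokenizer.save(Path("to", "trained_tokenizer.json"))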
- property vocab: dict[str, int] | list[dict[str, int]]¶
Get the base vocabulary, as a dictionary mapping tokens (str) to their ids.
The different (hidden / protected) vocabulary attributes of the class are:
- ._vocab_base : Dict[str: int] token -> id - Registers all known base tokens;
- .__vocab_base_inv : Dict[int: str] id -> token - Inverse of ._vocab_base, to go the other way;
- ._vocab_base_id_to_byte : Dict[int: str] id -> byte - Links ids to their associated unique bytes;
- ._vocab_base_byte_to_token : Dict[str: str] - similar to the above but for tokens;
- ._vocab_learned_bytes_to_tokens : Dict[str: List[str]] byte(s) -> token(s) - used to decode BPE/Unigram/WordPiece token ids;
- ._model.get_vocab() : Dict[str: int] byte -> id - the BPE/Unigram/WordPiece model vocabulary, based on unique bytes.
Before training the tokenizer, bytes are obtained by running chr(id). After training, if we started from an empty vocabulary, some base tokens might be removed from ._vocab_base if they were never found in the training samples. As the base vocabulary has changed, chr(id) would then bind to incorrect bytes (on which byte successions would not have been learned). We register the original id/token/byte associations in ._vocab_base_id_to_byte and ._vocab_base_byte_to_token.
- Returns:
the base vocabulary.
- property vocab_model: None | dict[str, int]¶
Return the vocabulary learnt with BPE.
In case the tokenizer has not been trained with BPE, it returns None.
- Returns:
the BPE model’s vocabulary.
- property vocab_size: int¶
Return the size of the vocabulary, by calling len(tokenizer).
- Returns:
size of the vocabulary.
Tokens & TokSequence input / output format¶
Depending on the tokenizer in use, the format of the tokens returned by the miditok.MusicTokenizer.encode() method may vary, as may the expected format for the miditok.MusicTokenizer.decode() method. The format is given by the miditok.MusicTokenizer.io_format property. For any tokenizer, the format is the same for both methods.
The format is deduced from the miditok.MusicTokenizer.is_multi_voc and one_token_stream tokenizer attributes.
one_token_stream determines whether the tokenizer outputs a unique miditok.TokSequence covering all the tracks of a music file, or one miditok.TokSequence per track. It is equal to tokenizer.config.one_token_stream_for_programs, except for miditok.MMM for which it is enabled while one_token_stream_for_programs is False.
miditok.MusicTokenizer.is_multi_voc being True means that each “token” within a miditok.TokSequence is actually a list of C “sub-tokens”, C being the number of sub-token classes.
This results in four situations, where I (instrument) is the number of tracks, T (token) is the number of tokens and C (class) the number of subtokens per token step:
- is_multi_voc and one_token_stream are both False: [I, (T)];
- is_multi_voc is False and one_token_stream is True: (T);
- is_multi_voc is True and one_token_stream is False: [I, (T, C)];
- is_multi_voc and one_token_stream are both True: (T, C).
Note that if there is no I dimension in the format, the output of miditok.MusicTokenizer.encode() is a miditok.TokSequence object, otherwise it is a list of miditok.TokSequence objects (one per token stream / track).
Some tokenizer examples to illustrate:
- TSD without config.use_programs will not have multiple vocabularies and will treat each track as a unique stream of tokens, hence it will convert music files to a list of miditok.TokSequence objects, (I, T) format.
- TSD with config.use_programs being True will convert all tracks to a single stream of tokens, hence one miditok.TokSequence object, (T) format.
- CPWord is a multi-voc tokenizer; without config.use_programs it will treat each track as a distinct stream of tokens, hence it will convert music files to a list of miditok.TokSequence objects with the (I, T, C) format.
- Octuple is a multi-voc tokenizer and converts all tracks to a single stream of tokens, hence it will convert music files to a miditok.TokSequence object, (T, C) format.
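A quick check of these attributes; the printed values depend on the tokenizer and its configuration:
tokenizer = REMI()  # not multi-voc; one sequence per track with the default config
print(tokenizer.io_format)  # ("I", "T") in this case
print(tokenizer.one_token_stream, tokenizer.is_multi_voc)  # False False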
Magic methods¶
Magic methods allow intuitive access to a tokenizer’s attributes and methods. We list them here with some examples.
- miditok.MusicTokenizer.__call__(self, obj: Score | TokSequence | list[TokSequence, int, list[int]] | np.ndarray, *args, **kwargs) TokSequence | list[TokSequence] | Score
Tokenize a music file (MIDI/abc), or decode tokens into a symusic.Score.
Calling a tokenizer allows you to directly convert a music file (MIDI/abc) to tokens, or vice-versa. The method automatically detects symusic.Score and miditok.TokSequence objects, as well as paths to music or json files. It will call miditok.MusicTokenizer.encode() if you provide a symusic.Score object or a path to a music file, or the miditok.MusicTokenizer.decode() method otherwise.
- Parameters:
obj – a symusic.Score object, a miditok.TokSequence object, or a path to a music or tokens json file.
- Returns:
miditok.TokSequenceobject, or a path to a music or tokens json file.- Returns:
the converted object.
tokens = tokenizer(score)
score2 = tokenizer(tokens)
- miditok.MusicTokenizer.__getitem__(self, item: int | str | tuple[int, int | str]) str | int | list[int]
Convert a token (int) to an event (str), or vice-versa.
- Parameters:
item – a token (int) or an event (str). For tokenizers with embedding pooling / multiple vocabularies (tokenizer.is_multi_voc), you must either provide a string (token) that is within all vocabularies (e.g. special tokens), or a tuple where the first element is the index of the vocabulary and the second the element to index.
- Returns:
the converted object.
pad_token = tokenizer["PAD_None"]
- miditok.MusicTokenizer.__len__(self) int
Return the length of the vocabulary.
If the tokenizer uses embedding pooling / has multiple vocabularies, it will return the sum of their lengths. If the tokenizer has been trained, this method returns the length of its model’s vocabulary, i.e. the proper number of possible token ids. Otherwise, it will return the length of the base vocabulary. Use the miditok.MusicTokenizer.len property (tokenizer.len) to get the list of lengths.
- Returns:
length of the vocabulary.
num_classes = len(tokenizer)
num_classes_per_vocab = tokenizer.len # applicable to tokenizer with embedding pooling, e.g. CPWord or Octuple
- miditok.MusicTokenizer.__eq__(self, other: MusicTokenizer) bool
Check that two tokenizers are identical.
This is done by comparing their vocabularies and configurations.
- Parameters:
other – tokenizer to compare.
- Returns:
True if the vocabulary(ies) are identical, False otherwise.
if tokenizer1 == tokenizer2:
print("The tokenizers have the same vocabulary and configurations!")
Save / Load a tokenizer¶
You can save and load a tokenizer, including its configuration and vocabulary. This is especially useful after Training a tokenizer.
- miditok.MusicTokenizer.save(self, out_path: str | Path, additional_attributes: dict | None = None, filename: str | None = 'tokenizer.json') None
Save the tokenizer in a JSON file.
This can be useful to keep track of how a dataset has been tokenized.
- Parameters:
out_path – output path to save the file. This can be either a path to a file (with a name and extension), or a path to a directory, in which case the filename argument will be used.
additional_attributes – any additional information to store in the config file. It can be used to override the default attributes saved in the parent method. (default: None)
filename – name of the file to save, to be used in case out_path leads to a directory. (default: "tokenizer.json")
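For instance, a minimal sketch with a hypothetical path:
tokenizer.save(Path("to", "tokenizer.json"))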
To load a tokenizer from saved parameters, just use the params argument when creating it:
tokenizer = REMI(params=Path("to", "tokenizer.json"))