EasyTPP Preprocess Modules
- class preprocess.TPPDataLoader(data_config, backend, **kwargs)[source]
Bases: object
- __init__(data_config, backend, **kwargs)[source]
Initialize the data loader.
- Parameters:
data_config (EasyTPP.DataConfig) – data config.
backend (str) – backend engine, e.g., tensorflow or torch.
- get_loader(split='train', **kwargs)[source]
Get the corresponding data loader.
- Parameters:
split (str, optional) – denotes which of the train, valid, or test sets to load. Defaults to ‘train’.
num_event_types (int, optional) – number of event types in the data, passed via **kwargs. Defaults to None.
- Raises:
NotImplementedError – raised when the given ‘num_event_types’ is inconsistent with the data.
- Returns:
the data loader for TPP data.
- Return type:
EasyTPP.DataLoader
- train_loader(**kwargs)[source]
Return the train loader
- Returns:
data loader for train set.
- Return type:
EasyTPP.DataLoader
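The loaders above compose as follows. A minimal usage sketch (the import path and the pre-built data_config object are assumptions; only the signatures documented here are relied on):
```python
from easy_tpp.preprocess import TPPDataLoader  # assumed import path

# data_config is an EasyTPP.DataConfig built elsewhere, e.g. parsed from an
# experiment YAML file; it is a placeholder in this sketch.
loader = TPPDataLoader(data_config=data_config, backend='torch')

train_loader = loader.train_loader()             # shorthand for the train split
valid_loader = loader.get_loader(split='valid')  # same mechanism, other splits
```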
- class preprocess.EventTokenizer(config)[source]
Bases: object
Base class for tokenizing event sequences, vendored from huggingface/transformers.
- padding_side: str = 'right'
- truncation_side: str = 'right'
- model_input_names: List[str] = ['time_seqs', 'time_delta_seqs', 'type_seqs', 'seq_non_pad_mask', 'attention_mask', 'type_mask']
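For orientation, a tokenized batch is a dict keyed by the model_input_names above. A hypothetical example (the values are made up for illustration; the mask entries are typically produced later, during padding):
```python
# Hypothetical tokenized input for two event sequences of length 3; keys follow
# model_input_names, values are illustrative only.
encoded_inputs = {
    'time_seqs':       [[0.0, 1.2, 2.5], [0.0, 0.7, 1.1]],  # absolute event times
    'time_delta_seqs': [[0.0, 1.2, 1.3], [0.0, 0.7, 0.4]],  # inter-event times
    'type_seqs':       [[3, 1, 0], [2, 2, 4]],              # event-type ids
}
```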
- pad(encoded_inputs: Dict[str, Any] | Dict[str, List], padding: bool | str | PaddingStrategy = True, truncation: bool | str | TruncationStrategy = False, max_length: int | None = None, return_attention_mask: bool | None = None, return_tensors: str | TensorType | None = None, verbose: bool = False) BatchEncoding [source]
Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length in the batch.
Padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than encoding the sequences first and then calling the pad method to get a padded encoding.
Note: If the encoded_inputs passed are a dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
- Parameters:
encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) –
Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]), so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function.
Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors); see the note above for the return type.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –
- Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’: No padding (i.e., the output batch can contain sequences of different lengths).
max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
return_attention_mask (bool, optional) – Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to False) – Whether or not to print more information and warnings.
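A short sketch of a pad() call, reusing the hypothetical encoded_inputs dict from above; config is a placeholder for the tokenizer’s config object:
```python
tokenizer = EventTokenizer(config)  # config: placeholder tokenizer config

batch = tokenizer.pad(
    encoded_inputs,              # dict of lists keyed by model_input_names
    padding='longest',           # pad to the longest sequence in the batch
    return_attention_mask=True,  # also build the attention mask
    return_tensors='pt',         # return PyTorch tensors
)
```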
- static make_pad_sequence(seqs, pad_token_id, padding_side, max_len, dtype=<class 'numpy.float32'>, group_by_event_types=False)[source]
Pad the sequence batch-wise.
- Parameters:
seqs (list) – list of sequences with variable length
pad_token_id (int, float) – optional, a value used to pad the sequences. If None, the pad index is set to be the event_num_with_pad.
max_len (int) – optional, the maximum length of the sequence after padding. If None, the length is set to be the max length of all input sequences.
pad_at_end (bool) – optional, whether to pad the sequence at the end. If False, the sequence is padded at the beginning.
- Returns:
a numpy array of padded sequence
Example:
```python
seqs = [[0, 1], [3, 4, 5]]
pad_sequence(seqs, 100)
>>> [[0, 1, 100], [3, 4, 5]]

pad_sequence(seqs, 100, max_len=5)
>>> [[0, 1, 100, 100, 100], [3, 4, 5, 100, 100]]
```
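The documented behavior (pad value, max_len, padding side) can be pictured with a small numpy sketch; this is an illustration, not the library’s exact implementation:
```python
import numpy as np

def pad_sequence_sketch(seqs, pad_token_id, max_len=None,
                        padding_side='right', dtype=np.float32):
    # Default to the longest input sequence, as documented above.
    if max_len is None:
        max_len = max(len(seq) for seq in seqs)
    out = np.full((len(seqs), max_len), pad_token_id, dtype=dtype)
    for i, seq in enumerate(seqs):
        if padding_side == 'right':
            out[i, :len(seq)] = seq            # pad at the end
        else:
            out[i, max_len - len(seq):] = seq  # pad at the beginning
    return out
```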
- make_attn_mask_for_pad_sequence(pad_seqs, pad_token_id)[source]
Make the attention masks for the sequence.
- Parameters:
pad_seqs (tensor) – list of sequences that have been padded to a fixed length
pad_token_id (int) – optional, a value used to pad the sequences. If None, the pad index is set to be the event_num_with_pad.
- Returns:
a bool matrix of the same size as the input, denoting the masks of the sequence (True: not masked, False: masked)
- Return type:
np.array
Example:
```python
seqs = [[ 1,  6,  0,  7, 12, 12],
        [ 1,  0,  5,  1, 10,  9]]
make_attn_mask_for_pad_sequence(seqs, pad_index=12)
>>>
batch_non_pad_mask
[[ True,  True,  True,  True, False, False],
 [ True,  True,  True,  True,  True,  True]]
attention_mask
[[[ True,  True,  True,  True,  True,  True],
  [False,  True,  True,  True,  True,  True],
  [False, False,  True,  True,  True,  True],
  [False, False, False,  True,  True,  True],
  [False, False, False, False,  True,  True],
  [False, False, False, False,  True,  True]],

 [[ True,  True,  True,  True,  True,  True],
  [False,  True,  True,  True,  True,  True],
  [False, False,  True,  True,  True,  True],
  [False, False, False,  True,  True,  True],
  [False, False, False, False,  True,  True],
  [False, False, False, False, False,  True]]]
```
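The printed masks follow two rules: a position holding the pad index is always masked as a key, and each query position can only attend to strictly earlier positions (in the attention_mask above, True means masked). A numpy sketch that reproduces the example, again as an illustration rather than the library’s code:
```python
import numpy as np

def attn_mask_sketch(pad_seqs, pad_token_id):
    pad_seqs = np.asarray(pad_seqs)
    batch_non_pad_mask = pad_seqs != pad_token_id  # True at real events
    seq_len = pad_seqs.shape[1]
    # Upper triangle incl. diagonal: a query attends only to strictly earlier keys.
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool))
    # Padded key positions stay masked for every query position.
    pad_key_mask = ~batch_non_pad_mask[:, None, :]           # (batch, 1, seq_len)
    attention_mask = causal_mask[None, :, :] | pad_key_mask  # (batch, L, L)
    return batch_non_pad_mask, attention_mask
```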
- class preprocess.TPPDataset(data: Dict)[source]
Bases: Dataset
- to_tf_dataset(data_collator: TPPDataCollator, **kwargs)[source]
Generate a dataset for use in TensorFlow.
- Parameters:
data_collator (TPPDataCollator) – collator to tokenize the event data.
- Raises:
ImportError – TensorFlow is not installed.
- Returns:
tf Dataset object for TPP data.
- Return type:
tf.keras.utils.Sequence
- preprocess.get_data_loader(dataset: TPPDataset, backend: str, tokenizer: EventTokenizer, **kwargs)[source]
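Judging from the signature, get_data_loader wires a TPPDataset and an EventTokenizer into a loader for the chosen backend. An end-to-end sketch (data and config are placeholders, and batch_size is assumed to be forwarded to the underlying loader via **kwargs):
```python
dataset = TPPDataset(data)          # data: dict of event sequences (placeholder)
tokenizer = EventTokenizer(config)  # config: placeholder tokenizer config

loader = get_data_loader(dataset, backend='torch', tokenizer=tokenizer,
                         batch_size=32)  # batch_size: assumed pass-through kwarg
```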