EasyTPP Preprocess Modules
- class preprocess.TPPDataLoader(data_config, backend, **kwargs)[source]
Bases: object
- __init__(data_config, backend, **kwargs)[source]
Initialize the data loader.
- Parameters:
data_config (EasyTPP.DataConfig) – data config.
backend (str) – backend engine, e.g., tensorflow or torch.
- get_loader(split='train', **kwargs)[source]
Get the corresponding data loader.
- Parameters:
split (str, optional) – denotes which of the train, valid, or test sets to load. Defaults to ‘train’.
num_event_types (int, optional) – number of event types in the data, passed via **kwargs. Defaults to None.
- Raises:
NotImplementedError – raised when the given ‘num_event_types’ is inconsistent with the data.
- Returns:
the data loader for TPP data.
- Return type:
EasyTPP.DataLoader
- train_loader(**kwargs)[source]
Return the train loader
- Returns:
data loader for train set.
- Return type:
EasyTPP.DataLoader
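The loaders above compose as follows. A minimal usage sketch (the import path and the pre-built data_config object are assumptions; only the signatures documented here are relied on):
```python
from easy_tpp.preprocess import TPPDataLoader  # assumed import path

# data_config is an EasyTPP.DataConfig built elsewhere, e.g. parsed from an
# experiment YAML file; it is a placeholder in this sketch.
loader = TPPDataLoader(data_config=data_config, backend='torch')

train_loader = loader.train_loader()             # shorthand for the train split
valid_loader = loader.get_loader(split='valid')  # same mechanism, other splits
```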
- class preprocess.EventTokenizer(config)[source]
Bases: object
Base class for tokenizing event sequences, vendored from huggingface/transformers.
- padding_side: str = 'right'
- truncation_side: str = 'right'
- model_input_names: List[str] = ['time_seqs', 'time_delta_seqs', 'type_seqs', 'seq_non_pad_mask', 'attention_mask', 'type_mask']
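For orientation, a tokenized batch is a dict keyed by the model_input_names above. A hypothetical example (the values are made up for illustration; the mask entries are typically produced later, during padding):
```python
# Hypothetical tokenized input for two event sequences of length 3; keys follow
# model_input_names, values are illustrative only.
encoded_inputs = {
    'time_seqs':       [[0.0, 1.2, 2.5], [0.0, 0.7, 1.1]],  # absolute event times
    'time_delta_seqs': [[0.0, 1.2, 1.3], [0.0, 0.7, 0.4]],  # inter-event times
    'type_seqs':       [[3, 1, 0], [2, 2, 4]],              # event-type ids
}
```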
- pad(encoded_inputs: Dict[str, Any] | Dict[str, List], padding: bool | str | PaddingStrategy = True, truncation: bool | str | TruncationStrategy = False, max_length: int | None = None, return_attention_mask: bool | None = None, return_tensors: str | TensorType | None = None, verbose: bool = False) BatchEncoding [source]
Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length in the batch.
Padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).
Please note that with a fast tokenizer, using the __call__ method is faster than encoding the sequences first and then calling the pad method to get a padded encoding.
Note: If the encoded_inputs passed are a dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.
- Parameters:
encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) –
Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]), so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function.
Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors); see the note above for the return type.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –
- Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
True or ‘longest’: Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
’max_length’: Pad to a maximum length specified with the argument max_length, or to the maximum acceptable input length for the model if that argument is not provided.
False or ‘do_not_pad’: No padding (i.e., the output batch can contain sequences of different lengths).
max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
return_attention_mask (bool, optional) – Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of list of python integers. Acceptable values are:
’tf’: Return TensorFlow tf.constant objects.
’pt’: Return PyTorch torch.Tensor objects.
’np’: Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to False) – Whether or not to print more information and warnings.
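A short sketch of a pad() call, reusing the hypothetical encoded_inputs dict from above; config is a placeholder for the tokenizer’s config object:
```python
tokenizer = EventTokenizer(config)  # config: placeholder tokenizer config

batch = tokenizer.pad(
    encoded_inputs,              # dict of lists keyed by model_input_names
    padding='longest',           # pad to the longest sequence in the batch
    return_attention_mask=True,  # also build the attention mask
    return_tensors='pt',         # return PyTorch tensors
)
```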
- static make_pad_sequence(seqs, pad_token_id, padding_side, max_len, dtype=<class 'numpy.float32'>, group_by_event_types=False)[source]
Pad the sequence batch-wise.
- Parameters:
seqs (list) – list of sequences with variable length
pad_token_id (int, float) – optional, a value used to pad the sequences. If None, the pad index is set to be the event_num_with_pad.
max_len (int) – optional, the maximum length of the sequence after padding. If None, the length is set to be the max length of all input sequences.
pad_at_end (bool) – optional, whether to pad the sequence at the end. If False, the sequence is padded at the beginning.
- Returns:
a numpy array of padded sequence
Example:
```python
seqs = [[0, 1], [3, 4, 5]]
pad_sequence(seqs, 100)
>>> [[0, 1, 100], [3, 4, 5]]

pad_sequence(seqs, 100, max_len=5)
>>> [[0, 1, 100, 100, 100], [3, 4, 5, 100, 100]]
```
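The documented behavior (pad value, max_len, padding side) can be pictured with a small numpy sketch; this is an illustration, not the library’s exact implementation:
```python
import numpy as np

def pad_sequence_sketch(seqs, pad_token_id, max_len=None,
                        padding_side='right', dtype=np.float32):
    # Default to the longest input sequence, as documented above.
    if max_len is None:
        max_len = max(len(seq) for seq in seqs)
    out = np.full((len(seqs), max_len), pad_token_id, dtype=dtype)
    for i, seq in enumerate(seqs):
        if padding_side == 'right':
            out[i, :len(seq)] = seq            # pad at the end
        else:
            out[i, max_len - len(seq):] = seq  # pad at the beginning
    return out
```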
- make_attn_mask_for_pad_sequence(pad_seqs, pad_token_id)[source]
Make the attention masks for the sequence.
- Parameters:
pad_seqs (tensor) – list of sequences that have been padded to a fixed length
pad_token_id (int) – optional, a value used to pad the sequences. If None, the pad index is set to be the event_num_with_pad.
- Returns:
a bool matrix of the same size as the input, denoting the masks of the sequence (True: not masked, False: masked)
- Return type:
np.array
Example:
```python
seqs = [[ 1,  6,  0,  7, 12, 12],
        [ 1,  0,  5,  1, 10,  9]]
make_attn_mask_for_pad_sequence(seqs, pad_index=12)
>>>
batch_non_pad_mask
[[ True,  True,  True,  True, False, False],
 [ True,  True,  True,  True,  True,  True]]
attention_mask
[[[ True,  True,  True,  True,  True,  True],
  [False,  True,  True,  True,  True,  True],
  [False, False,  True,  True,  True,  True],
  [False, False, False,  True,  True,  True],
  [False, False, False, False,  True,  True],
  [False, False, False, False,  True,  True]],

 [[ True,  True,  True,  True,  True,  True],
  [False,  True,  True,  True,  True,  True],
  [False, False,  True,  True,  True,  True],
  [False, False, False,  True,  True,  True],
  [False, False, False, False,  True,  True],
  [False, False, False, False, False,  True]]]
```
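The printed masks follow two rules: a position holding the pad index is always masked as a key, and each query position can only attend to strictly earlier positions (in the attention_mask above, True means masked). A numpy sketch that reproduces the example, again as an illustration rather than the library’s code:
```python
import numpy as np

def attn_mask_sketch(pad_seqs, pad_token_id):
    pad_seqs = np.asarray(pad_seqs)
    batch_non_pad_mask = pad_seqs != pad_token_id  # True at real events
    seq_len = pad_seqs.shape[1]
    # Upper triangle incl. diagonal: a query attends only to strictly earlier keys.
    causal_mask = np.triu(np.ones((seq_len, seq_len), dtype=bool))
    # Padded key positions stay masked for every query position.
    pad_key_mask = ~batch_non_pad_mask[:, None, :]           # (batch, 1, seq_len)
    attention_mask = causal_mask[None, :, :] | pad_key_mask  # (batch, L, L)
    return batch_non_pad_mask, attention_mask
```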
- class preprocess.TPPDataset(data: Dict)[source]
Bases: Dataset
- to_tf_dataset(data_collator: TPPDataCollator, **kwargs)[source]
Generate a dataset for use in TensorFlow.
- Parameters:
data_collator (TPPDataCollator) – collator to tokenize the event data.
- Raises:
ImportError – TensorFlow is not installed.
- Returns:
tf Dataset object for TPP data.
- Return type:
tf.keras.utils.Sequence
- preprocess.get_data_loader(dataset: TPPDataset, backend: str, tokenizer: EventTokenizer, **kwargs)[source]
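Judging from the signature, get_data_loader wires a TPPDataset and an EventTokenizer into a loader for the chosen backend. An end-to-end sketch (data and config are placeholders, and batch_size is assumed to be forwarded to the underlying loader via **kwargs):
```python
dataset = TPPDataset(data)          # data: dict of event sequences (placeholder)
tokenizer = EventTokenizer(config)  # config: placeholder tokenizer config

loader = get_data_loader(dataset, backend='torch', tokenizer=tokenizer,
                         batch_size=32)  # batch_size: assumed pass-through kwarg
```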