EasyTPP Preprocess Modules

class preprocess.TPPDataLoader(data_config, backend, **kwargs)[source]

Bases: object

__init__(data_config, backend, **kwargs)[source]

Initialize the dataloader

Parameters:
  • data_config (EasyTPP.DataConfig) – data config.

  • backend (str) – backend engine, e.g., tensorflow or torch.

build_input_from_pkl(source_dir: str, split: str)[source]
get_loader(split='train', **kwargs)[source]

Get the corresponding data loader.

Parameters:
  • split (str, optional) – which split to load: the train, valid or test set. Defaults to ‘train’.

  • num_event_types (int, optional) – num of event types in the data. Defaults to None.

Raises:

NotImplementedError – raised when the given ‘num_event_types’ is inconsistent with the data.

Returns:

the data loader for tpp data.

Return type:

EasyTPP.DataLoader

train_loader(**kwargs)[source]

Return the train loader

Returns:

data loader for train set.

Return type:

EasyTPP.DataLoader

valid_loader(**kwargs)[source]

Return the valid loader

Returns:

data loader for valid set.

Return type:

EasyTPP.DataLoader

test_loader(**kwargs)[source]

Return the test loader

Returns:

data loader for test set.

Return type:

EasyTPP.DataLoader
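A minimal usage sketch based only on the methods documented above. The import path and the way the data_config object is constructed (e.g., parsed from an experiment YAML) are assumptions, not part of this API reference.

```python
from easy_tpp.preprocess import TPPDataLoader  # import path assumed

# `data_config` is an EasyTPP.DataConfig built elsewhere (assumption: typically
# parsed from the experiment's YAML config before this point).
data_loader = TPPDataLoader(data_config=data_config, backend='torch')

train_dl = data_loader.train_loader()            # loader over the train split
valid_dl = data_loader.valid_loader()            # loader over the valid split
test_dl = data_loader.get_loader(split='test')   # equivalent to test_loader()
```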

class preprocess.EventTokenizer(config)[source]

Bases: object

Base class for tokenizing event sequences, vendored from huggingface/transformers.

__init__(config)[source]
padding_side: str = 'right'
truncation_side: str = 'right'
model_input_names: List[str] = ['time_seqs', 'time_delta_seqs', 'type_seqs', 'seq_non_pad_mask', 'attention_mask', 'type_mask']
pad(encoded_inputs: Dict[str, Any] | Dict[str, List], padding: bool | str | PaddingStrategy = True, truncation: bool | str | TruncationStrategy = False, max_length: int | None = None, return_attention_mask: bool | None = None, return_tensors: str | TensorType | None = None, verbose: bool = False) → BatchEncoding[source]

Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the max sequence length in the batch.

The padding side (left/right) and the padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).

Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.

Note: if the encoded_inputs passed are a dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.

Parameters:
  • encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) –

    Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]), so you can use this method during preprocessing as well as in a PyTorch DataLoader collate function.

    Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors); see the note above for the return type.

  • padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –

    Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:

    • True or ‘longest’: Pad to the longest sequence in the batch (or apply no padding if only a single sequence is provided).

    • ’max_length’: Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.

    • False or ‘do_not_pad’: No padding (i.e., the output batch can contain sequences of different lengths).

  • max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).

  • return_attention_mask (bool, optional) – Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute.

  • return_tensors (str or [~utils.TensorType], optional) –

    If set, will return tensors instead of list of python integers. Acceptable values are:

    • ’tf’: Return TensorFlow tf.constant objects.

    • ’pt’: Return PyTorch torch.Tensor objects.

    • ’np’: Return Numpy np.ndarray objects.

  • verbose (bool, optional, defaults to False) – Whether or not to print more information and warnings.
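A hedged sketch of a pad call: the key names follow model_input_names above, the numeric values are made up, and the construction of config is assumed to have happened elsewhere.

```python
tokenizer = EventTokenizer(config)  # `config`: data/tokenizer config (assumed pre-built)

# A batch of two event sequences of unequal length, keyed by the names listed
# in `model_input_names`.
encoded_inputs = {
    'time_seqs':       [[0.0, 1.2], [0.0, 0.7, 2.1]],
    'time_delta_seqs': [[0.0, 1.2], [0.0, 0.7, 1.4]],
    'type_seqs':       [[1, 0],     [2, 1, 0]],
}

# Pad to the longest sequence in the batch and return numpy arrays.
batch = tokenizer.pad(encoded_inputs, padding=True, return_tensors='np')
```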

static make_pad_sequence(seqs, pad_token_id, padding_side, max_len, dtype=<class 'numpy.float32'>, group_by_event_types=False)[source]

Pad the sequence batch-wise.

Parameters:
  • seqs (list) – list of sequences with variable length.

  • pad_token_id (int, float) – optional, the value used to pad the sequences. If None, the pad index is set to event_num_with_pad.

  • padding_side (str) – optional, ‘right’ to pad each sequence at the end, ‘left’ to pad it at the beginning.

  • max_len (int) – optional, the maximum length of each sequence after padding. If None, it is set to the maximum length of all input sequences.

Returns:

a numpy array of padded sequence

Example:

```python
seqs = [[0, 1], [3, 4, 5]]
pad_sequence(seqs, 100)
>>> [[0, 1, 100], [3, 4, 5]]

pad_sequence(seqs, 100, max_len=5)
>>> [[0, 1, 100, 100, 100], [3, 4, 5, 100, 100]]
```

make_attn_mask_for_pad_sequence(pad_seqs, pad_token_id)[source]

Make the attention masks for the sequence.

Parameters:
  • pad_seqs (tensor) – list of sequences that have been padded to a fixed length.

  • pad_token_id (int) – optional, the value used to pad the sequences. If None, the pad index is set to event_num_with_pad.

Returns:

a bool matrix of the same size as the input, denoting the masks of the sequence (True: not masked, False: masked)

Return type:

np.array

Example:

```python
seqs = [[ 1,  6,  0,  7, 12, 12],
        [ 1,  0,  5,  1, 10,  9]]
make_attn_mask_for_pad_sequence(seqs, pad_token_id=12)
>>>
batch_non_pad_mask
[[ True,  True,  True,  True, False, False],
 [ True,  True,  True,  True,  True,  True]]
attention_mask
[[[ True  True  True  True  True  True]
  [False  True  True  True  True  True]
  [False False  True  True  True  True]
  [False False False  True  True  True]
  [False False False False  True  True]
  [False False False False  True  True]]

 [[ True  True  True  True  True  True]
  [False  True  True  True  True  True]
  [False False  True  True  True  True]
  [False False False  True  True  True]
  [False False False False  True  True]
  [False False False False False  True]]]
```

make_type_mask_for_pad_sequence(pad_seqs)[source]

Make the type mask.

Parameters:

pad_seqs (tensor) – a list of event sequences of equal length (i.e., padded sequences)

Returns:

a 3-dim matrix, where the last dim (one-hot vector) indicates the type of event

Return type:

np.array
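An illustrative sketch of the expected output shape; the input values are made up and the exact treatment of padded positions depends on the tokenizer's configuration.

```python
pad_seqs = [[1, 0, 2, 2],
            [0, 1, 2, 2]]  # event-type ids, already padded to equal length

type_mask = tokenizer.make_type_mask_for_pad_sequence(pad_seqs)
# type_mask has shape (batch_size, seq_len, num_event_types); along the last
# dimension each position is a one-hot vector marking the event type.
```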

class preprocess.TPPDataset(data: Dict)[source]

Bases: Dataset

__init__(data: Dict)[source]
to_tf_dataset(data_collator: TPPDataCollator, **kwargs)[source]

Generate a dataset for use in TensorFlow.

Parameters:

data_collator (TPPDataCollator) – collator to tokenize the event data.

Raises:

ImportError – TensorFlow is not installed.

Returns:

tf Dataset object for TPP data.

Return type:

tf.keras.utils.Sequence
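A hedged sketch of converting the dataset for TensorFlow. The constructor of TPPDataCollator is not documented on this page, so its arguments below are assumptions.

```python
# Assumption: the collator wraps an EventTokenizer so batches can be padded
# into the tensors named in `model_input_names`.
data_collator = TPPDataCollator(tokenizer=tokenizer)

# Raises ImportError if TensorFlow is not installed.
tf_dataset = dataset.to_tf_dataset(data_collator=data_collator)
```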

preprocess.get_data_loader(dataset: TPPDataset, backend: str, tokenizer: EventTokenizer, **kwargs)[source]
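A hedged end-to-end sketch tying the pieces together. Only the call signatures come from this page; the structure of data, the config objects, and the extra keyword arguments forwarded through **kwargs (e.g. batch_size) are assumptions.

```python
dataset = TPPDataset(data)          # `data`: Dict of event sequences (layout assumed)
tokenizer = EventTokenizer(config)  # `config`: data/tokenizer config (assumed pre-built)

# Build a backend-specific loader over the dataset; extra kwargs are assumed to
# be forwarded to the underlying torch / tensorflow loader.
loader = get_data_loader(dataset, backend='torch', tokenizer=tokenizer, batch_size=32)

for batch in loader:
    # Each batch holds the padded tensors named in EventTokenizer.model_input_names.
    ...
```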