Supported Tokenizers

py_entitymatching.tok_qgram(input_string, q)

This function splits the input string into a list of q-grams. Note that, by default, the input string is padded before it is tokenized.

Parameters:
  • input_string (string) – Input string that should be tokenized.
  • q (int) – q value (the q-gram length) that should be used to tokenize the input string.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
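
A minimal usage sketch (the input string is made up; the '#' and '$' padding characters shown are the defaults of the underlying py_stringmatching q-gram tokenizer, so treat the exact output as illustrative):

>>> import py_entitymatching as em
>>> em.tok_qgram('database', 3)
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']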

py_entitymatching.tok_delim(input_string, d)

This function splits the input string into a list of tokens, using the given delimiter string.

Parameters:
  • input_string (string) – Input string that should be tokenized.
  • d (string) – Delimiter string.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
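
A minimal usage sketch with a made-up input:

>>> import py_entitymatching as em
>>> em.tok_delim('data,base,systems', ',')
['data', 'base', 'systems']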

py_entitymatching.tok_wspace(input_string)

This function splits the input string into a list of tokens, based on whitespace.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
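
A minimal usage sketch with a made-up input (a NaN input would pass through as NaN, per the contract above):

>>> import py_entitymatching as em
>>> em.tok_wspace('data base systems')
['data', 'base', 'systems']
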
py_entitymatching.tok_alphabetic(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphabetical characters.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
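
For illustration, a sketch with a made-up input; digits and punctuation act as token separators here:

>>> import py_entitymatching as em
>>> em.tok_alphabetic('data2science!')
['data', 'science']
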
py_entitymatching.tok_alphanumeric(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
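
For illustration, a sketch with a made-up input; only non-alphanumeric characters act as token separators:

>>> import py_entitymatching as em
>>> em.tok_alphanumeric('data-2science!')
['data', '2science']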