Supported Tokenizers

py_entitymatching.tok_qgram(input_string, q)

This function splits the input string into a list of q-grams. Note that, by default, the input string is padded before it is tokenized.

Parameters:
  • input_string (string) – Input string that should be tokenized.
  • q (int) – q value (the q-gram length) that should be used to tokenize the input string.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
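
A minimal usage sketch (the input string is made up; the '#' and '$' padding characters shown are the defaults of the underlying py_stringmatching q-gram tokenizer, so treat the exact output as illustrative):

>>> import py_entitymatching as em
>>> em.tok_qgram('database', 3)
['##d', '#da', 'dat', 'ata', 'tab', 'aba', 'bas', 'ase', 'se$', 'e$$']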

py_entitymatching.tok_delim(input_string, d)

This function splits the input string into a list of tokens, using the given delimiter string.

Parameters:
  • input_string (string) – Input string that should be tokenized.
  • d (string) – Delimiter string.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
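
A minimal usage sketch with a made-up input:

>>> import py_entitymatching as em
>>> em.tok_delim('data,base,systems', ',')
['data', 'base', 'systems']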

py_entitymatching.tok_wspace(input_string)

This function splits the input string into a list of tokens, based on whitespace.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
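
A minimal usage sketch with a made-up input (a NaN input would pass through as NaN, per the contract above):

>>> import py_entitymatching as em
>>> em.tok_wspace('data base systems')
['data', 'base', 'systems']
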
py_entitymatching.tok_alphabetic(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphabetical characters.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
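
For illustration, a sketch with a made-up input; digits and punctuation act as token separators here:

>>> import py_entitymatching as em
>>> em.tok_alphabetic('data2science!')
['data', 'science']
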
py_entitymatching.tok_alphanumeric(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:
  • input_string (string) – Input string that should be tokenized.
Returns:

A list of tokens, if the input string is not NaN, else returns NaN.
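
For illustration, a sketch with a made-up input; only non-alphanumeric characters act as token separators:

>>> import py_entitymatching as em
>>> em.tok_alphanumeric('data-2science!')
['data', '2science']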