Supported Tokenizers
py_entitymatching.tok_qgram(input_string, q)

This function splits the input string into a list of q-grams. Note that, by default, the input string is padded before it is tokenized.

Parameters:
- input_string (string) – Input string that should be tokenized.
- q (int) – Length of the q-grams to generate from the input string.

Returns: A list of tokens if the input string is not NaN; otherwise NaN.
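A minimal usage sketch (the em import alias is the library's usual convention; the sample string is illustrative, and the '#'/'$' padding characters shown are an assumption carried over from the underlying py_stringmatching q-gram tokenizer):

    >>> import py_entitymatching as em
    >>> # 2-grams of 'data'; default padding adds prefix/suffix characters
    >>> em.tok_qgram('data', 2)
    ['#d', 'da', 'at', 'ta', 'a$']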
py_entitymatching.tok_delim(input_string, d)

This function splits the input string into a list of tokens based on the delimiter.

Parameters:
- input_string (string) – Input string that should be tokenized.
- d (string) – Delimiter string.

Returns: A list of tokens if the input string is not NaN; otherwise NaN.
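A short illustrative call, again assuming the em import alias; the sample string is made up:

    >>> import py_entitymatching as em
    >>> # split on commas
    >>> em.tok_delim('data,base,systems', ',')
    ['data', 'base', 'systems']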
py_entitymatching.tok_wspace(input_string)

This function splits the input string into a list of tokens based on whitespace.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns: A list of tokens if the input string is not NaN; otherwise NaN.
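A short illustrative call with a hypothetical sample string:

    >>> import py_entitymatching as em
    >>> # split on runs of whitespace
    >>> em.tok_wspace('data base systems')
    ['data', 'base', 'systems']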
py_entitymatching.tok_alphabetic(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphabetical characters.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns: A list of tokens if the input string is not NaN; otherwise NaN.
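A sketch of the behavior implied by the description above (digits act as separators here; the sample string is made up):

    >>> import py_entitymatching as em
    >>> # digits are not alphabetical, so they break the token runs
    >>> em.tok_alphabetic('data2base3systems')
    ['data', 'base', 'systems']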
py_entitymatching.tok_alphanumeric(input_string)

This function returns a list of tokens that are maximal sequences of consecutive alphanumeric characters.

Parameters:
- input_string (string) – Input string that should be tokenized.

Returns: A list of tokens if the input string is not NaN; otherwise NaN.
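A sketch of the behavior implied by the description above (punctuation and whitespace act as separators; the sample string is made up):

    >>> import py_entitymatching as em
    >>> # letters and digits stay together; everything else splits tokens
    >>> em.tok_alphanumeric('data2, base-3 systems!')
    ['data2', 'base', '3', 'systems']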