Creating the Features Manually
py_entitymatching.get_feature_fn(feature_string, tokenizers, similarity_functions)
This function creates a feature in a declarative manner.
Specifically, this function parses the feature string and compiles it into a function using the given tokenizers and similarity functions. The compiled function takes two tuples and returns a feature value (typically a number).
Parameters: - feature_string (string) – A feature expression to be converted into a function.
- tokenizers (dictionary) – A Python dictionary containing tokenizers. Specifically, the dictionary contains tokenizer names as keys and tokenizer functions as values. The tokenizer function typically takes in a string and returns a list of tokens.
- similarity_functions (dictionary) – A Python dictionary containing similarity functions. Specifically, the dictionary contains similarity function names as keys and similarity functions as values. The similarity function typically takes in two strings or two lists of tokens and returns a number.
Returns: A Python dictionary that contains sufficient information (such as attributes, tokenizers, and function code) to be added to the feature table.
Specifically, the Python dictionary contains the following keys: ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
For all keys except ‘function’ and ‘function_source’, the value is either a valid string (if the input feature string was parsed correctly) or PARSE_EXP (if parsing was not successful). The value of ‘function’ is a valid Python function, and the value of ‘function_source’ is that function’s source code as a string.
The created function is self-contained: the tokenizers and similarity functions that it calls are bundled with the returned function code.
Raises: - AssertionError – If feature_string is not of type string.
- AssertionError – If the input tokenizers is not of type dictionary.
- AssertionError – If the input similarity_functions is not of type dictionary.
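The compile step described above can be pictured with a small pure-Python sketch. It is an illustration only, not the library's implementation: the helper compile_feature, the tokenizer whitespace_tokens, and the jaccard function are hypothetical stand-ins, plain dictionaries stand in for tuples, and the feature-string syntax that refers to the two inputs as ltuple and rtuple is an assumption that the text above does not spell out.

```python
def whitespace_tokens(text):
    """Hypothetical tokenizer: lowercase a string and split it on whitespace."""
    return str(text).lower().split()


def jaccard(left_tokens, right_tokens):
    """Jaccard similarity of two token lists, treated as sets."""
    left, right = set(left_tokens), set(right_tokens)
    if not left and not right:
        return 1.0
    return len(left & right) / len(left | right)


def compile_feature(feature_string, tokenizers, similarity_functions):
    """Illustrative stand-in for the compile step; not the library's code."""
    # The same type checks that the Raises section documents.
    assert isinstance(feature_string, str)
    assert isinstance(tokenizers, dict)
    assert isinstance(similarity_functions, dict)

    # Names used in the feature string resolve to the supplied functions.
    namespace = {**tokenizers, **similarity_functions}
    code = compile(feature_string, "<feature>", "eval")

    def feature(ltuple, rtuple):
        scope = dict(namespace, ltuple=ltuple, rtuple=rtuple)
        return eval(code, {"__builtins__": {}}, scope)

    return feature


tokenizers = {"wspace": whitespace_tokens}
similarity_functions = {"jaccard": jaccard}
# Assumed syntax: the two input tuples are called ltuple and rtuple.
feature_string = 'jaccard(wspace(ltuple["name"]), wspace(rtuple["name"]))'

name_feature = compile_feature(feature_string, tokenizers, similarity_functions)
score = name_feature({"name": "Dave Smith"}, {"name": "David Smith"})
print(score)
```

Here the two names share one token out of three distinct tokens, so the score is 1/3. The dictionary keys ("wspace", "jaccard") are what the feature string refers to, which is why both dictionaries map names to functions.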
py_entitymatching.add_feature(feature_table, feature_name, feature_dict)
Adds a feature to the feature table.
Specifically, this function is used in combination with get_feature_fn(). First the user creates a dictionary using get_feature_fn(), then the user uses this function to add feature_dict to the feature table.
Parameters: - feature_table (DataFrame) – A DataFrame containing features.
- feature_name (string) – The name that should be given to the feature.
- feature_dict (dictionary) – A Python dictionary, typically the one returned by get_feature_fn().
Returns: A Boolean value of True is returned if the addition was successful.
Raises: - AssertionError – If the input feature_table is not of type pandas DataFrame.
- AssertionError – If feature_name is not of type string.
- AssertionError – If feature_dict is not of type Python dictionary.
- AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
- AssertionError – If the feature_name is already present in the feature table.
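The relationship between feature_dict and the feature table's columns can be sketched without the library: the required columns are ‘feature_name’ plus the seven keys of the dictionary. The following is a rough illustration under stated assumptions. The helper build_feature_row is hypothetical, the dictionary is written by hand rather than returned by get_feature_fn(), and all values in it are made up for the example.

```python
# Columns the feature table is documented to require.
REQUIRED_COLUMNS = [
    "feature_name", "left_attribute", "right_attribute",
    "left_attr_tokenizer", "right_attr_tokenizer",
    "simfunction", "function", "function_source",
]

# Illustrative source text for the feature function (hypothetical).
function_source = (
    "def name_jaccard(ltuple, rtuple):\n"
    "    left = set(ltuple['name'].lower().split())\n"
    "    right = set(rtuple['name'].lower().split())\n"
    "    union = left | right\n"
    "    return len(left & right) / len(union) if union else 1.0\n"
)
scope = {}
exec(function_source, scope)  # turn the source text into a callable

# Hand-written stand-in for a dictionary of the documented shape.
feature_dict = {
    "left_attribute": "name",
    "right_attribute": "name",
    "left_attr_tokenizer": "wspace",
    "right_attr_tokenizer": "wspace",
    "simfunction": "jaccard",
    "function": scope["name_jaccard"],
    "function_source": function_source,
}


def build_feature_row(feature_name, feature_dict, existing_names):
    """Hypothetical helper mirroring some documented checks; not library code."""
    assert isinstance(feature_name, str)
    assert isinstance(feature_dict, dict)
    assert feature_name not in existing_names, "feature name already present"
    # One row of the table: the name plus every key of the dictionary.
    return {"feature_name": feature_name, **feature_dict}


row = build_feature_row("name_name_jac_wspace", feature_dict, existing_names=[])
print(sorted(row) == sorted(REQUIRED_COLUMNS))
```

The sketch keeps ‘function’ and ‘function_source’ in step with each other by building the callable from the source string, which mirrors the documented pairing of a function with its source in string format.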
py_entitymatching.add_blackbox_feature(feature_table, feature_name, feature_function)
Adds a black box feature to the feature table.
Parameters: - feature_table (DataFrame) – The input DataFrame (typically a feature table) to which the feature must be added.
- feature_name (string) – The name that should be given to the feature.
- feature_function (Python function) – A Python function for the black box feature.
Returns: A Boolean value of True is returned if the addition was successful.
Raises: - AssertionError – If the input feature_table is not of type DataFrame.
- AssertionError – If the input feature_name is not of type string.
- AssertionError – If the feature_table does not have necessary columns such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’ in the DataFrame.
- AssertionError – If the feature_name is already present in the feature table.
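A black box feature is an ordinary Python function whose logic does not have to be expressed as a feature string. The sketch below shows what such a function might look like. It assumes, as with compiled features, that the function receives the two tuples being compared and returns a number; the function name, the ‘name’ attribute, and the use of plain dictionaries in place of tuples are all hypothetical.

```python
def name_length_difference(ltuple, rtuple):
    """Hypothetical black box feature: absolute difference in name length."""
    return abs(len(str(ltuple["name"])) - len(str(rtuple["name"])))


# Plain dictionaries stand in for the two tuples being compared.
left = {"name": "Dave Smith"}
right = {"name": "David Smith"}
difference = name_length_difference(left, right)
print(difference)
```

A function like this would be passed as feature_function to add_blackbox_feature(), together with the feature table and the name to give the feature.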