Creating the Features Manually

py_entitymatching.get_feature_fn(feature_string, tokenizers, similarity_functions)

This function creates a feature in a declarative manner.

Specifically, this function parses the feature string and compiles it into a function using the given tokenizers and similarity functions. The compiled function takes two tuples and returns a feature value (typically a number).

Parameters:
  • feature_string (string) – A feature expression to be converted into a function.
  • tokenizers (dictionary) – A Python dictionary containing tokenizers. Specifically, the dictionary contains tokenizer names as keys and tokenizer functions as values. The tokenizer function typically takes in a string and returns a list of tokens.
  • similarity_functions (dictionary) – A Python dictionary containing similarity functions. Specifically, the dictionary contains similarity function names as keys and similarity functions as values. A similarity function typically takes in two strings or two lists of tokens and returns a number.
Returns:

This function returns a Python dictionary which contains sufficient information (such as attributes, tokenizers, function code) to be added to the feature table.

Specifically the Python dictionary contains the following keys: ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.

For all keys except ‘function’ and ‘function_source’, the value is either a valid string (if the input feature string was parsed correctly) or PARSE_EXP (if parsing was not successful). The value of ‘function’ is a valid Python function, and the value of ‘function_source’ is that function’s source code as a string.

The created function is self-contained, which means that the tokenizers and similarity functions it calls are bundled along with the returned function code.

Raises:
  • AssertionError – If feature_string is not of type string.
  • AssertionError – If the input tokenizers is not of type dictionary.
  • AssertionError – If the input similarity_functions is not of type dictionary.
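
To make the idea of a declarative feature string concrete, the following is a minimal pure-Python sketch. It is not py_entitymatching's implementation: the tokenizer `qgm_3`, the similarity function `jaccard`, and the helper `compile_feature` are all hypothetical names defined here for illustration, and `eval` stands in for the library's own parser. It only shows how an expression over `ltuple` and `rtuple` can be bound to name-to-function dictionaries shaped like the `tokenizers` and `similarity_functions` parameters.

```python
def qgm_3(text):
    """Hypothetical tokenizer: split a string into overlapping 3-character grams."""
    text = str(text)
    return [text[i:i + 3] for i in range(len(text) - 2)]


def jaccard(left_tokens, right_tokens):
    """Hypothetical similarity function: Jaccard overlap of two token lists."""
    left, right = set(left_tokens), set(right_tokens)
    if not left and not right:
        return 0.0
    return len(left & right) / len(left | right)


# Name -> function dictionaries, in the shape the parameters describe.
tokenizers = {"qgm_3": qgm_3}
similarity_functions = {"jaccard": jaccard}


def compile_feature(feature_string, tokenizers, similarity_functions):
    """Illustrative stand-in: turn a feature expression into a two-tuple function."""
    assert isinstance(feature_string, str), "feature_string must be a string"
    assert isinstance(tokenizers, dict), "tokenizers must be a dictionary"
    assert isinstance(similarity_functions, dict), "similarity_functions must be a dictionary"

    # Names used inside the expression resolve against these dictionaries.
    namespace = {**tokenizers, **similarity_functions}
    code = compile(feature_string, "<feature>", "eval")

    def feature(ltuple, rtuple):
        # Evaluate the expression with the two tuples bound as local names.
        return eval(code, dict(namespace), {"ltuple": ltuple, "rtuple": rtuple})

    return feature


name_sim = compile_feature(
    "jaccard(qgm_3(ltuple['name']), qgm_3(rtuple['name']))",
    tokenizers,
    similarity_functions,
)
score = name_sim({"name": "David Smith"}, {"name": "David Smyth"})
```

Here the tuples are plain dictionaries; any object that supports `ltuple['name']` lookups would work with this sketch. Because `eval` executes arbitrary code, this approach is suitable only for feature strings you wrote yourself.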
py_entitymatching.add_feature(feature_table, feature_name, feature_dict)

Adds a feature to the feature table.

Specifically, this function is used in combination with get_feature_fn(). The user first creates a feature dictionary using get_feature_fn(), and then calls this function to add that dictionary (feature_dict) to the feature table.

Parameters:
  • feature_table (DataFrame) – A DataFrame containing features.
  • feature_name (string) – The name that should be given to the feature.
  • feature_dict (dictionary) – A Python dictionary that is typically returned by get_feature_fn().
Returns:

A Boolean value of True is returned if the addition was successful.

Raises:
  • AssertionError – If the input feature_table is not of type pandas DataFrame.
  • AssertionError – If feature_name is not of type string.
  • AssertionError – If feature_dict is not of type Python dictionary.
  • AssertionError – If the feature_table does not have the necessary columns, such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
  • AssertionError – If the feature_name is already present in the feature table.
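
The contract described above (type checks, required fields, and unique feature names) can be sketched in plain Python. This is an illustration only, not the library's code: `add_feature_sketch` is a hypothetical helper, the feature table is modeled as a list of dictionaries rather than a pandas DataFrame, and the required-field check is applied to the new row instead of to the table's columns.

```python
REQUIRED_COLUMNS = [
    "feature_name", "left_attribute", "right_attribute",
    "left_attr_tokenizer", "right_attr_tokenizer",
    "simfunction", "function", "function_source",
]


def add_feature_sketch(feature_rows, feature_name, feature_dict):
    """Illustrative stand-in: append one feature to a list-of-dicts feature table."""
    assert isinstance(feature_rows, list), "feature table must be a list of rows here"
    assert isinstance(feature_name, str), "feature_name must be a string"
    assert isinstance(feature_dict, dict), "feature_dict must be a dictionary"

    # Feature names must be unique within the table.
    existing = {row["feature_name"] for row in feature_rows}
    assert feature_name not in existing, "feature_name is already present"

    # The new row must supply every field the table is expected to carry.
    row = {"feature_name": feature_name, **feature_dict}
    missing = [col for col in REQUIRED_COLUMNS if col not in row]
    assert not missing, f"feature_dict is missing keys: {missing}"

    feature_rows.append(row)
    return True


def name_jaccard(ltuple, rtuple):
    """Hypothetical feature: Jaccard overlap of whitespace-separated name tokens."""
    left, right = set(ltuple["name"].split()), set(rtuple["name"].split())
    union = left | right
    return len(left & right) / len(union) if union else 0.0


# A hand-built dictionary with the keys listed in the Returns section above.
feature_dict = {
    "left_attribute": "name",
    "right_attribute": "name",
    "left_attr_tokenizer": "wspace",
    "right_attr_tokenizer": "wspace",
    "simfunction": "jaccard",
    "function": name_jaccard,
    "function_source": "def name_jaccard(ltuple, rtuple): ...",
}

feature_table = []
added = add_feature_sketch(feature_table, "name_name_jac_wspace", feature_dict)
```

Adding the same `feature_name` a second time fails the uniqueness assertion, mirroring the last entry in the Raises list.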
py_entitymatching.add_blackbox_feature(feature_table, feature_name, feature_function)

Adds a black box feature to the feature table.

Parameters:
  • feature_table (DataFrame) – The input DataFrame (typically a feature table) to which the feature must be added.
  • feature_name (string) – The name that should be given to the feature.
  • feature_function (Python function) – A Python function for the black box feature.
Returns:

A Boolean value of True is returned if the addition was successful.

Raises:
  • AssertionError – If the input feature_table is not of type DataFrame.
  • AssertionError – If the input feature_name is not of type string.
  • AssertionError – If the feature_table does not have the necessary columns, such as ‘feature_name’, ‘left_attribute’, ‘right_attribute’, ‘left_attr_tokenizer’, ‘right_attr_tokenizer’, ‘simfunction’, ‘function’, and ‘function_source’.
  • AssertionError – If the feature_name is already present in the feature table.
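
As a rough illustration of what a black box feature function might look like, the sketch below assumes it has the same shape as the compiled functions described for get_feature_fn(): it takes two tuples and returns a number. The function name `age_diff` and the `age` attribute are hypothetical, and the tuples are modeled as plain dictionaries.

```python
def age_diff(ltuple, rtuple):
    """Hypothetical black box feature: absolute difference of the 'age' attribute.

    Assumes both tuples carry a numeric 'age' value.
    """
    return abs(ltuple["age"] - rtuple["age"])


# Calling the function directly on two example tuples.
difference = age_diff({"age": 42}, {"age": 37})
```

A function like this would be passed as the feature_function argument, with the feature table and a feature name of your choosing as the other two arguments.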