Blocking
class py_entitymatching.AttrEquivalenceBlocker
Blocks based on the equivalence of attribute values.
block_candset(candset, l_block_attr, r_block_attr, allow_missing=False, verbose=False, show_progress=True, n_jobs=1)
Blocks an input candidate set of tuple pairs based on attribute equivalence.
Finds tuple pairs from an input candidate set of tuple pairs such that the value of attribute l_block_attr of the left tuple in a tuple pair exactly matches the value of attribute r_block_attr of the right tuple in the tuple pair.
Parameters: - candset (DataFrame) – The input candidate set of tuple pairs.
- l_block_attr (string) – The blocking attribute in left table.
- r_block_attr (string) – The blocking attribute in right table.
- allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If candset is not of type pandas DataFrame.
- AssertionError – If l_block_attr is not of type string.
- AssertionError – If r_block_attr is not of type string.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_block_attr is not in the ltable columns.
- AssertionError – If r_block_attr is not in the rtable columns.
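The n_jobs rule described above can be written out as a small function. This is a minimal sketch of the documented arithmetic only, not the library's code: the name effective_n_jobs is hypothetical, and passing positive values through unchanged is an assumption, since the text does not spell that case out.

```python
import os


def effective_n_jobs(n_jobs, n_cpus=None):
    """Resolve n_jobs to a worker count following the documented rule."""
    if n_cpus is None:
        n_cpus = os.cpu_count() or 1
    if n_jobs in (0, 1):
        return 1                    # no parallel computation
    if n_jobs == -1:
        return n_cpus               # all CPUs
    if n_jobs < -1:
        # n_cpus + 1 + n_jobs, falling back to 1 when that is below 1
        return max(1, n_cpus + 1 + n_jobs)
    return n_jobs                   # assumption: positive values are used as given
```

On an 8-CPU machine, n_jobs = -2 resolves to 7 workers and n_jobs = -20 falls back to a single worker.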
block_tables(ltable, rtable, l_block_attr, r_block_attr, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, n_jobs=1)
Blocks two tables based on attribute equivalence.
Finds tuple pairs from left and right tables such that the value of attribute l_block_attr of a tuple from the left table exactly matches the value of attribute r_block_attr of a tuple from the right table. This is similar to equi-join of two tables.
Parameters: - ltable (DataFrame) – The left input table.
- rtable (DataFrame) – The right input table.
- l_block_attr (string) – The blocking attribute in left table.
- r_block_attr (string) – The blocking attribute in right table.
- l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
- r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
- l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
- r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
- allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If ltable is not of type pandas DataFrame.
- AssertionError – If rtable is not of type pandas DataFrame.
- AssertionError – If l_block_attr is not of type string.
- AssertionError – If r_block_attr is not of type string.
- AssertionError – If l_output_attrs is not of type list.
- AssertionError – If r_output_attrs is not of type list.
- AssertionError – If the values in l_output_attrs are not of type string.
- AssertionError – If the values in r_output_attrs are not of type string.
- AssertionError – If l_output_prefix is not of type string.
- AssertionError – If r_output_prefix is not of type string.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If allow_missing is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_block_attr is not in the ltable columns.
- AssertionError – If r_block_attr is not in the rtable columns.
- AssertionError – If l_output_attrs are not in the ltable.
- AssertionError – If r_output_attrs are not in the rtable.
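The equi-join behaviour of block_tables, including the effect of allow_missing and the output prefixes, can be illustrated without the library. The sketch below is an illustration under stated assumptions, not the library's implementation: tables are plain lists of dicts rather than DataFrames, equi_block and is_missing are hypothetical helper names, and None or NaN is assumed to be what counts as a missing value.

```python
from collections import defaultdict


def is_missing(value):
    # Assumption: None and NaN count as missing (NaN is not equal to itself).
    return value is None or value != value


def equi_block(ltable, rtable, l_block_attr, r_block_attr,
               l_output_attrs, r_output_attrs,
               l_output_prefix="ltable_", r_output_prefix="rtable_",
               allow_missing=False):
    # Hash the right table on its blocking attribute.
    index = defaultdict(list)
    r_missing = []
    for rrow in rtable:
        if is_missing(rrow[r_block_attr]):
            r_missing.append(rrow)
        else:
            index[rrow[r_block_attr]].append(rrow)

    pairs = []
    for lrow in ltable:
        if is_missing(lrow[l_block_attr]):
            # A left tuple with a missing value pairs with every right tuple.
            matches = rtable if allow_missing else []
        else:
            matches = index.get(lrow[l_block_attr], [])
            if allow_missing:
                matches = matches + r_missing
        for rrow in matches:
            out = {l_output_prefix + a: lrow[a] for a in l_output_attrs}
            out.update({r_output_prefix + a: rrow[a] for a in r_output_attrs})
            pairs.append(out)
    return pairs


A = [{"id": 1, "zip": "53703"}, {"id": 2, "zip": None}]
B = [{"id": 10, "zip": "53703"}, {"id": 11, "zip": "94107"}]
strict = equi_block(A, B, "zip", "zip", ["id"], ["id"])
lenient = equi_block(A, B, "zip", "zip", ["id"], ["id"], allow_missing=True)
```

With allow_missing=False only the pair with equal zip values survives; with allow_missing=True the left tuple whose zip is missing is additionally paired with both right tuples.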
block_tuples(ltuple, rtuple, l_block_attr, r_block_attr, allow_missing=False)
Blocks a tuple pair based on attribute equivalence.
Parameters: - ltuple (Series) – The input left tuple.
- rtuple (Series) – The input right tuple.
- l_block_attr (string) – The blocking attribute in left tuple.
- r_block_attr (string) – The blocking attribute in right tuple.
- allow_missing (boolean) – A flag to indicate whether a tuple pair with missing value in at least one of the blocking attributes should be blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has missing value in l_block_attr or rtuple has missing value in r_block_attr or both.
Returns: A status indicating if the tuple pair is blocked, i.e., the values of l_block_attr in ltuple and r_block_attr in rtuple are different (boolean).
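Note the direction of the return value: True means the pair is blocked (dropped), False means it is kept. A minimal sketch of that decision, using plain dicts in place of Series and a hypothetical function name, and assuming that None or NaN is what counts as missing:

```python
def is_missing(value):
    # Assumption: None and NaN count as missing.
    return value is None or value != value


def blocks_pair(ltuple, rtuple, l_block_attr, r_block_attr, allow_missing=False):
    """Return True when the pair is blocked, False when it is kept."""
    l_val, r_val = ltuple[l_block_attr], rtuple[r_block_attr]
    if is_missing(l_val) or is_missing(r_val):
        # Kept only when allow_missing is set.
        return not allow_missing
    return l_val != r_val
```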
class py_entitymatching.OverlapBlocker
Blocks based on the overlap of token sets of attribute values.
block_candset(candset, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False, verbose=False, show_progress=True, n_jobs=1)
Blocks an input candidate set of tuple pairs based on the overlap of token sets of attribute values.
Finds tuple pairs from an input candidate set of tuple pairs such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of the left tuple in a tuple pair, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of the right tuple in the tuple pair, contains at least overlap_size tokens.
Parameters: - candset (DataFrame) – The input candidate set of tuple pairs.
- l_overlap_attr (string) – The overlap attribute in left table.
- r_overlap_attr (string) – The overlap attribute in right table.
- rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
- q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
- word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
- overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
- allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple pair with missing value in either blocking attribute will be retained in the output candidate set.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If candset is not of type pandas DataFrame.
- AssertionError – If l_overlap_attr is not of type string.
- AssertionError – If r_overlap_attr is not of type string.
- AssertionError – If q_val is not of type int.
- AssertionError – If word_level is not of type boolean.
- AssertionError – If overlap_size is not of type int.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If allow_missing is not of type boolean.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_overlap_attr is not in the ltable columns.
- AssertionError – If r_overlap_attr is not in the rtable columns.
- SyntaxError – If q_val is set to a valid value and word_level is set to True.
- SyntaxError – If q_val is set to None and word_level is set to False.
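The interplay of q_val, word_level and overlap_size may be easier to see in code. The following is a rough sketch, not the library's tokenizer: the function names are hypothetical, q-grams are taken without any padding, and no case folding is applied, all of which are simplifying assumptions.

```python
def tokenize(value, q_val=None, word_level=True):
    # Exactly one tokenization mode must be selected, mirroring the
    # two SyntaxError cases listed above.
    if q_val is not None and word_level:
        raise SyntaxError("q_val is set, so word_level must be False")
    if q_val is None and not word_level:
        raise SyntaxError("word_level is False, so q_val must be set")
    text = str(value)
    if word_level:
        return set(text.split())          # whitespace-delimited words
    # Unpadded q-grams (assumption; the library may pad the string).
    return {text[i:i + q_val] for i in range(len(text) - q_val + 1)}


def survives_overlap(l_value, r_value, overlap_size=1, q_val=None, word_level=True):
    shared = tokenize(l_value, q_val, word_level) & tokenize(r_value, q_val, word_level)
    return len(shared) >= overlap_size
```

For example, "red apple pie" and "green apple tart" share one word token, so the pair survives with overlap_size=1 and is dropped with overlap_size=2.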
block_tables(ltable, rtable, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', allow_missing=False, verbose=False, show_progress=True, n_jobs=1)
Blocks two tables based on the overlap of token sets of attribute values.
Finds tuple pairs from left and right tables such that the overlap between (a) the set of tokens obtained by tokenizing the value of attribute l_overlap_attr of a tuple from the left table, and (b) the set of tokens obtained by tokenizing the value of attribute r_overlap_attr of a tuple from the right table, contains at least overlap_size tokens.
Parameters: - ltable (DataFrame) – The left input table.
- rtable (DataFrame) – The right input table.
- l_overlap_attr (string) – The overlap attribute in left table.
- r_overlap_attr (string) – The overlap attribute in right table.
- rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
- q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
- word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
- overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
- l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
- r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
- l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
- r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
- allow_missing (boolean) – A flag to indicate whether tuple pairs with missing value in at least one of the blocking attributes should be included in the output candidate set (defaults to False). If this flag is set to True, a tuple in ltable with missing value in the blocking attribute will be matched with every tuple in rtable and vice versa.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If ltable is not of type pandas DataFrame.
- AssertionError – If rtable is not of type pandas DataFrame.
- AssertionError – If l_overlap_attr is not of type string.
- AssertionError – If r_overlap_attr is not of type string.
- AssertionError – If l_output_attrs is not of type list.
- AssertionError – If r_output_attrs is not of type list.
- AssertionError – If the values in l_output_attrs are not of type string.
- AssertionError – If the values in r_output_attrs are not of type string.
- AssertionError – If l_output_prefix is not of type string.
- AssertionError – If r_output_prefix is not of type string.
- AssertionError – If q_val is not of type int.
- AssertionError – If word_level is not of type boolean.
- AssertionError – If overlap_size is not of type int.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If allow_missing is not of type boolean.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_overlap_attr is not in the ltable columns.
- AssertionError – If r_overlap_attr is not in the rtable columns.
- AssertionError – If l_output_attrs are not in the ltable.
- AssertionError – If r_output_attrs are not in the rtable.
- SyntaxError – If q_val is set to a valid value and word_level is set to True.
- SyntaxError – If q_val is set to None and word_level is set to False.
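One common way to find overlapping pairs across two tables without comparing every pair is an inverted index from tokens to rows. The sketch below shows that general technique for word-level tokens; it is not taken from the library's source, the function name is hypothetical, tables are lists of dicts, and missing values are not handled.

```python
from collections import Counter, defaultdict


def overlap_block(ltable, rtable, l_overlap_attr, r_overlap_attr, overlap_size=1):
    # Inverted index: token -> positions of right rows containing it.
    index = defaultdict(set)
    for j, rrow in enumerate(rtable):
        for token in set(str(rrow[r_overlap_attr]).split()):
            index[token].add(j)

    pairs = []
    for i, lrow in enumerate(ltable):
        shared = Counter()               # right row position -> shared token count
        for token in set(str(lrow[l_overlap_attr]).split()):
            for j in index.get(token, ()):
                shared[j] += 1
        pairs.extend((i, j) for j, n in sorted(shared.items()) if n >= overlap_size)
    return pairs


L = [{"title": "intro to data mining"}, {"title": "deep learning"}]
R = [{"title": "data mining concepts"}, {"title": "learning to rank"}]
loose = overlap_block(L, R, "title", "title", overlap_size=1)
tight = overlap_block(L, R, "title", "title", overlap_size=2)
```

Raising overlap_size from 1 to 2 keeps only the pair that shares both "data" and "mining".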
block_tuples(ltuple, rtuple, l_overlap_attr, r_overlap_attr, rem_stop_words=False, q_val=None, word_level=True, overlap_size=1, allow_missing=False)
Blocks a tuple pair based on the overlap of token sets of attribute values.
Parameters: - ltuple (Series) – The input left tuple.
- rtuple (Series) – The input right tuple.
- l_overlap_attr (string) – The overlap attribute in left tuple.
- r_overlap_attr (string) – The overlap attribute in right tuple.
- rem_stop_words (boolean) – A flag to indicate whether stop words (e.g., a, an, the) should be removed from the token sets of the overlap attribute values (defaults to False).
- q_val (int) – The value of q to use if the overlap attribute values are to be tokenized as qgrams (defaults to None).
- word_level (boolean) – A flag to indicate whether the overlap attributes should be tokenized as words (i.e., using whitespace as delimiter) (defaults to True).
- overlap_size (int) – The minimum number of tokens that must overlap (defaults to 1).
- allow_missing (boolean) – A flag to indicate whether a tuple pair with missing value in at least one of the blocking attributes should be blocked (defaults to False). If this flag is set to True, the pair will be kept if either ltuple has missing value in l_overlap_attr or rtuple has missing value in r_overlap_attr or both.
Returns: A status indicating if the tuple pair is blocked (boolean).
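The effect of rem_stop_words on a single pair can be sketched as follows. This is an illustration only: the stop-word set below is a three-word stand-in taken from the examples in the parameter description (the library's actual list is not given here), the function name is hypothetical, and only word-level tokenization is shown.

```python
STOP_WORDS = {"a", "an", "the"}   # illustrative stand-in, not the library's list


def is_missing(value):
    # Assumption: None and NaN count as missing.
    return value is None or value != value


def blocks_pair_by_overlap(ltuple, rtuple, l_overlap_attr, r_overlap_attr,
                           rem_stop_words=False, overlap_size=1,
                           allow_missing=False):
    """Return True when the pair is blocked, False when it is kept."""
    l_val, r_val = ltuple[l_overlap_attr], rtuple[r_overlap_attr]
    if is_missing(l_val) or is_missing(r_val):
        return not allow_missing
    l_tokens = set(str(l_val).split())
    r_tokens = set(str(r_val).split())
    if rem_stop_words:
        l_tokens -= STOP_WORDS
        r_tokens -= STOP_WORDS
    return len(l_tokens & r_tokens) < overlap_size
```

The pair ("the office", "the wire") is kept by default because both values contain "the", and is blocked once stop words are removed.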
class py_entitymatching.RuleBasedBlocker(*args, **kwargs)
Blocks based on a sequence of blocking rules supplied by the user.
add_rule(conjunct_list, feature_table=None, rule_name=None)
Adds a rule to the rule-based blocker.
Parameters: - conjunct_list (list) – A list of conjuncts specifying the rule.
- feature_table (DataFrame) – A DataFrame containing all the features that are being referenced by the rule (defaults to None). If the feature_table is not supplied here, then it must have been specified during the creation of the rule-based blocker or using set_feature_table function. Otherwise an AssertionError will be raised and the rule will not be added to the rule-based blocker.
- rule_name (string) – The name of the rule to be added (defaults to None). If the rule_name is not specified then a name will be automatically chosen. If there is already a rule with the specified rule_name, then an AssertionError will be raised and the rule will not be added to the rule-based blocker.
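The naming behaviour described for rule_name (automatic names, and an AssertionError on duplicates) can be modelled with a small registry. This is a toy sketch and not the library's class: RuleBook and the "_rule_N" naming scheme are hypothetical, conjuncts are plain Python callables rather than expressions over a feature table, returning the chosen name is a convenience of the sketch, and a rule is assumed to fire when all of its conjuncts are True.

```python
class RuleBook:
    """Toy registry mimicking the documented add/get/delete behaviour."""

    def __init__(self):
        self.rules = {}       # rule name -> function, in insertion order
        self._counter = 0

    def add_rule(self, conjunct_list, rule_name=None):
        if rule_name is None:
            # Pick the next unused automatic name (hypothetical scheme).
            while "_rule_%d" % self._counter in self.rules:
                self._counter += 1
            rule_name = "_rule_%d" % self._counter
        assert rule_name not in self.rules, "a rule with this name already exists"
        conjuncts = list(conjunct_list)
        # The rule returns True only if every conjunct returns True.
        self.rules[rule_name] = lambda lt, rt: all(c(lt, rt) for c in conjuncts)
        return rule_name

    def get_rule(self, rule_name):
        return self.rules[rule_name]

    def delete_rule(self, rule_name):
        del self.rules[rule_name]

    def get_rule_names(self):
        return list(self.rules)


book = RuleBook()
auto_name = book.add_rule([lambda lt, rt: lt["year"] != rt["year"]])
book.add_rule([lambda lt, rt: abs(lt["price"] - rt["price"]) > 20],
              rule_name="far_price")
```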
block_candset(candset, verbose=False, show_progress=True, n_jobs=1)
Blocks an input candidate set of tuple pairs based on a sequence of blocking rules supplied by the user.
Finds tuple pairs from an input candidate set of tuple pairs that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked (dropped).
Parameters: - candset (DataFrame) – The input candidate set of tuple pairs.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If candset is not of type pandas DataFrame.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If l_block_attr is not in the ltable columns.
- AssertionError – If r_block_attr is not in the rtable columns.
- AssertionError – If there are no rules to apply.
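The survival condition (a pair is kept only if no rule returns True for it) can be sketched as a filter over a candidate set. The representation here is a simplification chosen for the example: the candidate set is a list of (left id, right id) pairs, the two tables are dicts keyed by id, and rules are plain callables; the function name is hypothetical.

```python
def block_candset_by_rules(candset, ltable_by_id, rtable_by_id, rules):
    assert rules, "there are no rules to apply"
    survivors = []
    for l_id, r_id in candset:
        ltuple, rtuple = ltable_by_id[l_id], rtable_by_id[r_id]
        # The pair is blocked as soon as any rule returns True.
        if not any(rule(ltuple, rtuple) for rule in rules):
            survivors.append((l_id, r_id))
    return survivors


ltable = {1: {"year": 1999, "price": 10.0}, 2: {"year": 2005, "price": 12.0}}
rtable = {7: {"year": 1999, "price": 11.0}, 8: {"year": 1999, "price": 90.0}}
rules = [
    lambda lt, rt: lt["year"] != rt["year"],
    lambda lt, rt: abs(lt["price"] - rt["price"]) > 20,
]
kept = block_candset_by_rules([(1, 7), (1, 8), (2, 7)], ltable, rtable, rules)
```

Only the pair (1, 7) survives: (1, 8) is dropped by the price rule and (2, 7) by the year rule.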
block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_jobs=1)
Blocks two tables based on the sequence of rules supplied by the user.
Finds tuple pairs from left and right tables that survive the sequence of blocking rules. A tuple pair survives the sequence of blocking rules if none of the rules in the sequence returns True for that pair. If any of the rules returns True, then the pair is blocked.
Parameters: - ltable (DataFrame) – The left input table.
- rtable (DataFrame) – The right input table.
- l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
- r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
- l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
- r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived the sequence of blocking rules (DataFrame).
Raises: - AssertionError – If ltable is not of type pandas DataFrame.
- AssertionError – If rtable is not of type pandas DataFrame.
- AssertionError – If l_output_attrs is not of type list.
- AssertionError – If r_output_attrs is not of type list.
- AssertionError – If the values in l_output_attrs are not of type string.
- AssertionError – If the values in r_output_attrs are not of type string.
- AssertionError – If the input l_output_prefix is not of type string.
- AssertionError – If the input r_output_prefix is not of type string.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_output_attrs are not in the ltable.
- AssertionError – If r_output_attrs are not in the rtable.
- AssertionError – If there are no rules to apply.
block_tuples(ltuple, rtuple)
Blocks a tuple pair based on a sequence of blocking rules supplied by the user.
Parameters: - ltuple (Series) – The input left tuple.
- rtuple (Series) – The input right tuple.
Returns: A status indicating if the tuple pair is blocked by applying the sequence of blocking rules (boolean).
delete_rule(rule_name)
Deletes a rule from the rule-based blocker.
Parameters: rule_name (string) – Name of the rule to be deleted.
get_rule(rule_name)
Returns the function corresponding to a rule.
Parameters: rule_name (string) – Name of the rule.
Returns: A function object corresponding to the specified rule.
get_rule_names()
Returns the names of all the rules in the rule-based blocker.
Returns: A list of names of all the rules in the rule-based blocker (list).
class py_entitymatching.BlackBoxBlocker(*args, **kwargs)
Blocks based on a black box function specified by the user.
block_candset(candset, verbose=True, show_progress=True, n_jobs=1)
Blocks an input candidate set of tuple pairs based on a black box blocking function specified by the user.
Finds tuple pairs from an input candidate set of tuple pairs that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair, otherwise the tuple pair is dropped.
Parameters: - candset (DataFrame) – The input candidate set of tuple pairs.
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to True, as shown in the signature).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If candset is not of type pandas DataFrame.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If l_block_attr is not in the ltable columns.
- AssertionError – If r_block_attr is not in the rtable columns.
block_tables(ltable, rtable, l_output_attrs=None, r_output_attrs=None, l_output_prefix='ltable_', r_output_prefix='rtable_', verbose=False, show_progress=True, n_jobs=1)
Blocks two tables based on a black box blocking function specified by the user.
Finds tuple pairs from left and right tables that survive the black box function. A tuple pair survives the black box blocking function if the function returns False for that pair, otherwise the tuple pair is dropped.
Parameters: - ltable (DataFrame) – The left input table.
- rtable (DataFrame) – The right input table.
- l_output_attrs (list) – A list of attribute names from the left table to be included in the output candidate set (defaults to None).
- r_output_attrs (list) – A list of attribute names from the right table to be included in the output candidate set (defaults to None).
- l_output_prefix (string) – The prefix to be used for the attribute names coming from the left table in the output candidate set (defaults to ‘ltable_’).
- r_output_prefix (string) – The prefix to be used for the attribute names coming from the right table in the output candidate set (defaults to ‘rtable_’).
- verbose (boolean) – A flag to indicate whether the debug information should be logged (defaults to False).
- show_progress (boolean) – A flag to indicate whether progress should be displayed to the user (defaults to True).
- n_jobs (int) – The number of parallel jobs to be used for computation (defaults to 1). If -1 all CPUs are used. If 0 or 1, no parallel computation is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used (where n_cpus is the total number of CPUs in the machine). Thus, for n_jobs = -2, all CPUs but one are used. If (n_cpus + 1 + n_jobs) is less than 1, then no parallel computation is used (i.e., equivalent to the default).
Returns: A candidate set of tuple pairs that survived blocking (DataFrame).
Raises: - AssertionError – If ltable is not of type pandas DataFrame.
- AssertionError – If rtable is not of type pandas DataFrame.
- AssertionError – If l_output_attrs is not of type list.
- AssertionError – If r_output_attrs is not of type list.
- AssertionError – If the values in l_output_attrs are not of type string.
- AssertionError – If the values in r_output_attrs are not of type string.
- AssertionError – If l_output_prefix is not of type string.
- AssertionError – If r_output_prefix is not of type string.
- AssertionError – If verbose is not of type boolean.
- AssertionError – If show_progress is not of type boolean.
- AssertionError – If n_jobs is not of type int.
- AssertionError – If l_output_attrs are not in the ltable.
- AssertionError – If r_output_attrs are not in the rtable.
block_tuples(ltuple, rtuple)
Blocks a tuple pair based on a black box blocking function specified by the user.
Takes a tuple pair as input, applies the black box blocking function to it, and returns True (if the intention is to drop the pair) or False (if the intention is to keep the tuple pair).
Parameters: - ltuple (Series) – The input left tuple.
- rtuple (Series) – The input right tuple.
Returns: A status indicating if the tuple pair should be dropped or kept, based on the black box blocking function (boolean).
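A black box function in the sense used above takes a left and a right tuple and returns True to drop the pair or False to keep it. The sketch below shows one such function and a naive way of applying it to every pair of two small tables. Both function names are hypothetical, the tables are lists of dicts rather than DataFrames, and how the function is registered with the blocker is not shown because it is not covered in this section.

```python
def different_decade(ltuple, rtuple):
    """Black box function: True drops the pair, False keeps it."""
    return ltuple["year"] // 10 != rtuple["year"] // 10


def apply_black_box(ltable, rtable, black_box_function):
    # Keep the pairs for which the black box function returns False.
    return [(lt, rt) for lt in ltable for rt in rtable
            if not black_box_function(lt, rt)]


A = [{"title": "x", "year": 1994}, {"title": "y", "year": 2001}]
B = [{"title": "z", "year": 1999}, {"title": "w", "year": 2008}]
kept_pairs = apply_black_box(A, B, different_decade)
```

Here the two pairs whose years fall in the same decade are kept and the other two are dropped.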