Downsampling
py_entitymatching.down_sample(table_a, table_b, size, y_param, show_progress=True, verbose=False)
This function down samples two tables A and B into smaller tables A’ and B’, respectively.
Specifically, it first randomly selects size tuples from table B to form table B’. Next, it builds an inverted index I (token, tuple_id) on table A. For each tuple x ∈ B’, the algorithm finds a set P of k/2 tuples from I that match x, and a set Q of k/2 tuples randomly selected from A - P, where k corresponds to y_param. Table A’ is made up of the tuples collected in these P and Q sets. The idea is for A’ and B’ to share some matches yet be as representative of A and B as possible.
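The procedure described above can be illustrated with a minimal, standard-library-only sketch. It is not the library's implementation: the names down_sample_sketch and tokenize, the seed parameter, the representation of tuples as plain dicts, and the choice to treat "match" as "shares the most whitespace tokens" are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def tokenize(row):
    """Lowercased whitespace tokens from the string values of a row (assumed tokenizer)."""
    return {
        tok
        for value in row.values()
        if isinstance(value, str)
        for tok in value.lower().split()
    }


def down_sample_sketch(table_a, table_b, size, k, seed=0):
    """Hypothetical sketch: return (A', B') as lists of rows (plain dicts)."""
    rng = random.Random(seed)

    # Step 1: randomly select `size` tuples from B to form B'.
    b_prime = rng.sample(table_b, min(size, len(table_b)))

    # Step 2: build the inverted index I: token -> ids (positions) of tuples in A.
    index = defaultdict(set)
    for a_id, row in enumerate(table_a):
        for tok in tokenize(row):
            index[tok].add(a_id)

    half = max(k // 2, 1)  # assumption: pick at least one tuple per set
    selected = set()
    for x in b_prime:
        # Count how many tokens each candidate tuple of A shares with x.
        overlap = defaultdict(int)
        for tok in tokenize(x):
            for a_id in index.get(tok, ()):
                overlap[a_id] += 1

        # P: up to k/2 tuples of A with the largest token overlap with x.
        p = sorted(overlap, key=lambda a_id: (-overlap[a_id], a_id))[:half]

        # Q: up to k/2 tuples drawn at random from A - P.
        rest = [a_id for a_id in range(len(table_a)) if a_id not in p]
        q = rng.sample(rest, min(half, len(rest)))

        selected.update(p)
        selected.update(q)

    # A' is the union of all P and Q sets, kept in original order.
    a_prime = [table_a[a_id] for a_id in sorted(selected)]
    return a_prime, b_prime
```

Because the P and Q sets of different tuples in B’ can overlap, A’ in this sketch holds at most size * k tuples (for even k), which is in line with the "close to size * y_param" wording used for the real function.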
Parameters:
- table_a, table_b (DataFrame) – The input tables A and B.
- size (int) – The size that table B should be down sampled to.
- y_param (int) – The parameter to control the down sample size of table A. Specifically, the down sampled size of table A should be close to size * y_param.
- show_progress (boolean) – A flag to indicate whether a progress bar should be displayed.
- verbose (boolean) – A flag to indicate whether the debug information should be displayed.
Returns: Down sampled tables A and B as pandas DataFrames.
Raises:
- AssertionError – If any of the input tables (table_a, table_b) is empty or not a DataFrame.
- AssertionError – If size or y_param is empty, 0, or not a valid integer value.