Splitting Data into Train and Test
py_entitymatching.split_train_test(labeled_data, train_proportion=0.5, random_state=None, verbose=True)
This function splits the input data into train and test sets. Specifically, it is a wrapper around scikit-learn's train_test_split function, and it also takes care of copying the metadata from the input table to the train and test splits.

Parameters:
- labeled_data (DataFrame) – The input pandas DataFrame that needs to be split into train and test.
- train_proportion (float) – A number between 0 and 1, indicating the proportion of tuples that should be included in the train split (defaults to 0.5).
- random_state (object) – A number or a random number object (as in scikit-learn).
- verbose (boolean) – A flag to indicate whether debug information should be displayed.

Returns:
A Python dictionary containing two keys: train and test. The value for the key 'train' is a pandas DataFrame containing the tuples allocated from the input table based on train_proportion. Similarly, the value for the key 'test' is a pandas DataFrame containing the remaining tuples, used for evaluation. The function sets the metadata (properties) of the output DataFrames (train, test) to be the same as that of the input DataFrame.
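For reference, the sketch below shows a typical call. The file names, keys, and foreign-key column names are illustrative assumptions; it presumes a labeled candidate set has been read together with its metadata:

```python
import py_entitymatching as em

# Read the two input tables and the labeled candidate set along with
# their metadata. File names and column names here are illustrative.
A = em.read_csv_metadata('tableA.csv', key='id')
B = em.read_csv_metadata('tableB.csv', key='id')
G = em.read_csv_metadata('labeled_data.csv', key='_id',
                         ltable=A, rtable=B,
                         fk_ltable='ltable_id', fk_rtable='rtable_id')

# Split the labeled data: 70% of the tuples go to the train split.
IJ = em.split_train_test(G, train_proportion=0.7, random_state=0)

# The result is a dictionary with 'train' and 'test' DataFrames,
# each carrying the same metadata as the input table.
I = IJ['train']
J = IJ['test']
```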