
    Dataset: Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

    Alternate identifier:
    (KITopen-DOI) 10.5445/IR/1000148891
    Related identifier:
    (Is Identical To) https://publikationen.bibliothek.kit.edu/1000148891 - URL
    Creator/Author:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798 [Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)]

    Zoller, Kolja [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]

    Schulz, Katrin [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]
    Contributors:
    -
    Title:
    Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"
    Additional titles:
    -
    Description:
    (Abstract) These are the experimental data for the paper

    > Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

    published in the journal [*SN Computer Science*](https://www.springer.com/journal/42979). You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection). See the `README` for details. Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed. Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.



    (Technical Remarks)

    # Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

    These are the experimental data for the paper

    > Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

    accepted at the journal [*SN Computer Science*](https://www.springer.com/journal/42979). Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection) for the code and instructions to reproduce the experiments.

    The data were obtained on a server with an `AMD EPYC 7551` [CPU](https://www.amd.com/en/products/cpu/amd-epyc-7551) (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM. The Python version was `3.8`. Our paper contains two studies, and we provide data for both of them. Running the experimental pipeline for the study with synthetic constraints (`syn_pipeline.py`) took several hours. The commit hash for the last run of this pipeline is [`acc34cf5d2`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/acc34cf5d22b0a8427852a01288bb8b34f5d8c98). The commit hash for the last run of the corresponding evaluation (`syn_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a). Running the experimental pipeline for the case study in materials science (`ms_pipeline.py`) took less than one hour. The commit hash for the last run of this pipeline is [`ba30bf9f11`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/ba30bf9f11703e2a8a942425e2cd4b9f36ead513). The commit hash for the last run of the corresponding evaluation (`ms_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a). All these commits are also tagged.

    In the following, we describe the structure and content of each data file. All files are plain CSVs, so you can read them with `pandas.read_csv()`.

    ## `ms/`

    The input data for the case study in materials science (`ms_pipeline.py`). Output of the script `prepare_ms_dataset.py`. As the raw simulation dataset is quite large, we only provide a pre-processed version of it (we do not provide the input to `prepare_ms_dataset.py`). In this pre-processed version, the feature and target parts of the data are already separated into two files: `voxel_data_predict_glissile_X.csv` and `voxel_data_predict_glissile_y.csv`. In `voxel_data_predict_glissile_X.csv`, each column is a numeric feature. `voxel_data_predict_glissile_y.csv` contains only one column, the numeric prediction target (reaction density of glissile reactions).

    ## `ms-results/`

    Contains only one result file (`results.csv`) for the case study in materials science. Output of the script `ms_pipeline.py`, input to the script `ms_evaluation.py`. The columns of the file mostly correspond to evaluation metrics used in the paper; see Appendix A.1 there for definitions.

    - `objective_value` (float): Objective `Q(s, X, y)`, the sum of the qualities of the selected features.
    - `num_selected` (int): `n_{se}`, the number of selected features.
    - `selected` (string, but actually a list of strings): Names of the selected features.
    - `num_variables` (int): `n`, the total number of features in the dataset.
    - `num_constrained_variables` (int): `n_{cf}`, the number of features involved in constraints.
    - `num_unique_constrained_variables` (int): `n_{ucf}`, the number of unique features involved in constraints.
    - `num_constraints` (int): `n_{co}`, the number of constraints.
    - `frac_solutions` (float): `n_{so}^{norm}`, the number of valid (regarding constraints) feature sets relative to the total number of feature sets.
    - `linear-regression_train_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
    - `linear-regression_test_r2` (float): `R^2` for linear-regression models, trained with the selected features, predicting on the test set.
    - `regression-tree_train_r2` (float): `R^2` for regression-tree models, trained with the selected features, predicting on the training set.
    - `regression-tree_test_r2` (float): `R^2` for regression-tree models, trained with the selected features, predicting on the test set.
    - `xgb-linear_train_r2` (float): `R^2` for linear XGBoost models, trained with the selected features, predicting on the training set.
    - `xgb-linear_test_r2` (float): `R^2` for linear XGBoost models, trained with the selected features, predicting on the test set.
    - `xgb-tree_train_r2` (float): `R^2` for tree-based XGBoost models, trained with the selected features, predicting on the training set.
    - `xgb-tree_test_r2` (float): `R^2` for tree-based XGBoost models, trained with the selected features, predicting on the test set.
    - `evaluation_time` (float): Runtime (in s) for evaluating one set of constraints.
    - `split_idx` (int): Index of the cross-validation fold.
    - `quality_name` (string): Measure of feature quality (absolute correlation or mutual information).
    - `constraint_name` (string): Name of the constraint type (see paper).
    - `dataset_name` (string): Name of the dataset.

    ## `openml/`

    The input data for the study with synthetic constraints (`syn_pipeline.py`). Output of the script `prepare_openml_datasets.py`. We downloaded 35 datasets from [OpenML](https://www.openml.org) and removed non-numeric columns. Also, we separated the feature part (`*_X.csv`) and the target part (`*_y.csv`) of each dataset. `_data_overview.csv` contains meta-data for the datasets, including dataset id, dataset version, and uploader.

    **Licensing:** Please consult each dataset's website on [OpenML](https://www.openml.org) for licensing information and citation requests. According to OpenML's [terms](https://www.openml.org/terms), OpenML datasets fall under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. The datasets used in our study were uploaded by:

    - Jan van Rijn (user id: 1)
    - Joaquin Vanschoren (user id: 2)
    - Rafael Gomes Mantovani (user id: 64)
    - Tobias Kuehn (user id: 94)
    - Richard Ooms (user id: 8684)
    - R P (user id: 15317)

    See `_data_overview.csv` to match each dataset to its uploader.

    ## `openml-results/`

    Result files for the study with synthetic constraints. Output of the script `syn_pipeline.py`, input to the script `syn_evaluation.py`. One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, `results.csv`. The columns of the result files are those of `ms-results/results.csv`, minus `selected` and `evaluation_time`; see above for detailed descriptions.
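Since every file in the dataset is a plain CSV with separated feature (`*_X.csv`) and target (`*_y.csv`) parts, loading a dataset is a two-call affair with `pandas.read_csv()`. A minimal sketch; the in-memory stand-ins and their column names are illustrative, not the actual file contents:

```python
import io

import pandas as pd

# Stand-ins for a feature file (*_X.csv) and a target file (*_y.csv);
# real usage would pass the file paths instead of StringIO objects.
x_csv = io.StringIO("feature_1,feature_2\n0.1,1.2\n0.4,0.9\n")
y_csv = io.StringIO("target\n0.3\n0.7\n")

X = pd.read_csv(x_csv)  # each column is a numeric feature
y = pd.read_csv(y_csv)  # a single column: the numeric prediction target

print(X.shape)  # rows x features
print(y.shape)  # rows x 1
```

The same pattern applies to `ms/voxel_data_predict_glissile_X.csv` / `_y.csv` and to the per-dataset files in `openml/`.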

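The many `*_train_r2` / `*_test_r2` columns in the result files all report the coefficient of determination. As a reminder of what those values mean, here is a small sketch of the standard `R^2 = 1 - SS_res / SS_tot` formula on made-up predictions (this is not the paper's pipeline, just the metric's definition):

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up ground truth and predictions for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

print(round(r2_score(y_true, y_pred), 3))
```

A value of 1 means perfect prediction, 0 means no better than predicting the mean; test-set values can be negative.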

    Keywords:
    Feature selection
    Constraints
    Domain knowledge
    Theory-guided data science
    Related information:
    -
    Language:
    -
    Publishers:
    Karlsruhe Institute of Technology
    Production year:
    2021
    Subject areas:
    Computer Science
    Resource type:
    Dataset
    Data source:
    -
    Software used:
    -
    Data processing:
    -
    Publication year:
    2023
    Rights holders:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798

    Zoller, Kolja

    Schulz, Katrin
    Funding:
    -
    Status:
    Published
    Uploaded by:
    kitopen
    Created on:
    2023-04-20
    Archiving date:
    2023-06-21
    Archive size:
    266.9 MB
    Archive creator:
    kitopen
    Archive checksum:
    213185fcdd4b34111aa2319a3848f4eb (MD5)
    Embargo period:
    -
    The metadata was corrected retroactively. The original metadata will be available after download of the dataset.
    dataset/Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"
    DOI: 10.35097/1345
    Publication date: 2023-06-21
    Rights statement for the dataset
    This work is licensed under
    CC BY 4.0
    Cite Dataset
    Bach, Jakob; Zoller, Kolja; Schulz, Katrin (2023): Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection". Karlsruhe Institute of Technology. DOI: 10.35097/1345

    RADAR4KIT is an internet-based service for archiving and publishing research data from completed scientific studies and projects, available to researchers at KIT. It is operated by the Karlsruhe Institute of Technology (KIT). RADAR4KIT builds on the RADAR service provided by FIZ Karlsruhe. The data are stored exclusively on KIT's IT infrastructure at the Steinbuch Centre for Computing (SCC).

    Content assessment and quality control are carried out exclusively by the data providers.

    1. The usage relationship between you (the "data user") and KIT is limited to downloading data packages or metadata. KIT reserves the right to restrict the use of RADAR4KIT or to discontinue the service entirely.
    2. If you register as a data user or authenticate via Shibboleth, the data provider may also grant you access to unpublished documents.
    3. The protection of your personal data is governed by the privacy policy.
    4. KIT assumes no warranty or liability for the correctness, currency, or reliability of the provided content, except in cases of mandatory statutory liability.
    5. KIT does not charge data users for searching RADAR4KIT or for downloading data packages.
    6. You must comply with the license terms associated with the data package.