
    Dataset: Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

    Alternate identifier:
    (KITopen-DOI) 10.5445/IR/1000148891
    Related identifier:
    (Is Identical To) https://publikationen.bibliothek.kit.edu/1000148891 - URL
    Creator/Author:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798 [Institut für Programmstrukturen und Datenorganisation (IPD), Karlsruher Institut für Technologie (KIT)]

    Zoller, Kolja [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]

    Schulz, Katrin [Computational Materials Science (IAM-CMS), Karlsruher Institut für Technologie (KIT)]
    Contributors:
    -
    Title:
    Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"
    Additional titles:
    -
    Description:
    (Abstract) These are the experimental data for the paper

    > Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

    published in the journal [*SN Computer Science*](https://www.springer.com/journal/42979). You can find the paper [here](https://doi.org/10.1007/s42979-022-01338-z) and the code [here](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection). See the `README` for details. Some of the datasets used in our study (which we also provide here) originate from [OpenML](https://www.openml.org) and are CC-BY-licensed. Please see the paragraph `Licensing` in the `README` for details, e.g., on the authors of these datasets.



    (Technical Remarks)

    # Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"

    These are the experimental data for the paper

    > Bach, Jakob, et al. "An Empirical Evaluation of Constrained Feature Selection"

    accepted at the journal [*SN Computer Science*](https://www.springer.com/journal/42979). Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection) for the code and instructions to reproduce the experiments.

    The data were obtained on a server with an `AMD EPYC 7551` [CPU](https://www.amd.com/en/products/cpu/amd-epyc-7551) (32 physical cores, base clock of 2.0 GHz) and 128 GB RAM. The Python version was `3.8`. Our paper contains two studies, and we provide data for both of them. Running the experimental pipeline for the study with synthetic constraints (`syn_pipeline.py`) took several hours. The commit hash for the last run of this pipeline is [`acc34cf5d2`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/acc34cf5d22b0a8427852a01288bb8b34f5d8c98). The commit hash for the last run of the corresponding evaluation (`syn_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a). Running the experimental pipeline for the case study in materials science (`ms_pipeline.py`) took less than one hour. The commit hash for the last run of this pipeline is [`ba30bf9f11`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/ba30bf9f11703e2a8a942425e2cd4b9f36ead513). The commit hash for the last run of the corresponding evaluation (`ms_evaluation.py`) is [`c1a7e7e99e`](https://github.com/Jakob-Bach/Constrained-Filter-Feature-Selection/tree/c1a7e7e99e56c1a178a602596c13641d7771df0a). All these commits are also tagged.

    In the following, we describe the structure and content of each data file. All files are plain CSVs, so you can read them with `pandas.read_csv()`.

    ## `ms/`

    The input data for the case study in materials science (`ms_pipeline.py`). Output of the script `prepare_ms_dataset.py`. As the raw simulation dataset is quite large, we only provide a pre-processed version of it (we do not provide the input to `prepare_ms_dataset.py`). In this pre-processed version, the feature and target parts of the data are already separated into two files: `voxel_data_predict_glissile_X.csv` and `voxel_data_predict_glissile_y.csv`. In `voxel_data_predict_glissile_X.csv`, each column is a numeric feature. `voxel_data_predict_glissile_y.csv` contains only one column, the numeric prediction target (reaction density of glissile reactions).

    ## `ms-results/`

    Contains only one result file (`results.csv`) for the case study in materials science. Output of the script `ms_pipeline.py`, input to the script `ms_evaluation.py`. The columns of the file mostly correspond to evaluation metrics used in the paper; see Appendix A.1 there for definitions.

    - `objective_value` (float): Objective `Q(s, X, y)`, the sum of the qualities of the selected features.
    - `num_selected` (int): `n_{se}`, the number of selected features.
    - `selected` (string, but actually a list of strings): Names of the selected features.
    - `num_variables` (int): `n`, the total number of features in the dataset.
    - `num_constrained_variables` (int): `n_{cf}`, the number of features involved in constraints.
    - `num_unique_constrained_variables` (int): `n_{ucf}`, the number of unique features involved in constraints.
    - `num_constraints` (int): `n_{co}`, the number of constraints.
    - `frac_solutions` (float): `n_{so}^{norm}`, the number of valid (regarding constraints) feature sets relative to the total number of feature sets.
    - `linear-regression_train_r2` (float): `R^2` (coefficient of determination) for linear-regression models, trained with the selected features, predicting on the training set.
    - `linear-regression_test_r2` (float): `R^2` for linear-regression models, trained with the selected features, predicting on the test set.
    - `regression-tree_train_r2` (float): `R^2` for regression-tree models, trained with the selected features, predicting on the training set.
    - `regression-tree_test_r2` (float): `R^2` for regression-tree models, trained with the selected features, predicting on the test set.
    - `xgb-linear_train_r2` (float): `R^2` for linear XGBoost models, trained with the selected features, predicting on the training set.
    - `xgb-linear_test_r2` (float): `R^2` for linear XGBoost models, trained with the selected features, predicting on the test set.
    - `xgb-tree_train_r2` (float): `R^2` for tree-based XGBoost models, trained with the selected features, predicting on the training set.
    - `xgb-tree_test_r2` (float): `R^2` for tree-based XGBoost models, trained with the selected features, predicting on the test set.
    - `evaluation_time` (float): Runtime (in s) for evaluating one set of constraints.
    - `split_idx` (int): Index of the cross-validation fold.
    - `quality_name` (string): Measure of feature quality (absolute correlation or mutual information).
    - `constraint_name` (string): Name of the constraint type (see paper).
    - `dataset_name` (string): Name of the dataset.

    ## `openml/`

    The input data for the study with synthetic constraints (`syn_pipeline.py`). Output of the script `prepare_openml_datasets.py`. We downloaded 35 datasets from [OpenML](https://www.openml.org) and removed non-numeric columns. Also, we separated the feature part (`*_X.csv`) and the target part (`*_y.csv`) of each dataset. `_data_overview.csv` contains meta-data for the datasets, including dataset id, dataset version, and uploader.

    **Licensing:** Please consult each dataset's website on [OpenML](https://www.openml.org) for licensing information and citation requests. According to OpenML's [terms](https://www.openml.org/terms), OpenML datasets fall under the [CC-BY](https://creativecommons.org/licenses/by/4.0/) license. The datasets used in our study were uploaded by:

    - Jan van Rijn (user id: 1)
    - Joaquin Vanschoren (user id: 2)
    - Rafael Gomes Mantovani (user id: 64)
    - Tobias Kuehn (user id: 94)
    - Richard Ooms (user id: 8684)
    - R P (user id: 15317)

    See `_data_overview.csv` to match each dataset to its uploader.

    ## `openml-results/`

    Result files for the study with synthetic constraints. Output of the script `syn_pipeline.py`, input to the script `syn_evaluation.py`. One result file for each combination of the 10 constraint generators and the 35 datasets, plus one overall (merged) file, `results.csv`. The columns of the result files are those of `ms-results/results.csv`, minus `selected` and `evaluation_time`; see above for detailed descriptions.
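Since every file in the dataset is a plain CSV with separated feature (`*_X.csv`) and target (`*_y.csv`) parts, loading a dataset is a two-call affair with `pandas.read_csv()`. A minimal sketch; the in-memory stand-ins and their column names are illustrative, not the actual file contents:

```python
import io

import pandas as pd

# Stand-ins for a feature file (*_X.csv) and a target file (*_y.csv);
# real usage would pass the file paths instead of StringIO objects.
x_csv = io.StringIO("feature_1,feature_2\n0.1,1.2\n0.4,0.9\n")
y_csv = io.StringIO("target\n0.3\n0.7\n")

X = pd.read_csv(x_csv)  # each column is a numeric feature
y = pd.read_csv(y_csv)  # a single column: the numeric prediction target

print(X.shape)  # rows x features
print(y.shape)  # rows x 1
```

The same pattern applies to `ms/voxel_data_predict_glissile_X.csv` / `_y.csv` and to the per-dataset files in `openml/`.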

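The many `*_train_r2` / `*_test_r2` columns in the result files all report the coefficient of determination. As a reminder of what those values mean, here is a small sketch of the standard `R^2 = 1 - SS_res / SS_tot` formula on made-up predictions (this is not the paper's pipeline, just the metric's definition):

```python
import numpy as np

def r2_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Coefficient of determination: R^2 = 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

# Made-up ground truth and predictions for illustration.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])

print(round(r2_score(y_true, y_pred), 3))
```

A value of 1 means perfect prediction, 0 means no better than predicting the mean; test-set values can be negative.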

    Keywords:
    Feature selection
    Constraints
    Domain knowledge
    Theory-guided data science
    Related information:
    -
    Language:
    -
    Publishers:
    Karlsruhe Institute of Technology
    Production year:
    2021
    Subject areas:
    Computer Science
    Resource type:
    Dataset
    Data source:
    -
    Software used:
    -
    Data processing:
    -
    Publication year:
    2023
    Rights holders:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798

    Zoller, Kolja

    Schulz, Katrin
    Funding:
    -
    Status:
    Published
    Uploaded by:
    kitopen
    Created on:
    2023-04-20
    Archiving date:
    2023-06-21
    Archive size:
    266.9 MB
    Archive creator:
    kitopen
    Archive checksum:
    213185fcdd4b34111aa2319a3848f4eb (MD5)
    Embargo period:
    -
    The metadata was corrected retroactively. The original metadata will be available after download of the dataset.
    dataset/Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection"
    DOI: 10.35097/1345
    Publication date: 2023-06-21
    Rights statement for the dataset
    This work is licensed under
    CC BY 4.0
    Cite Dataset
    Bach, Jakob; Zoller, Kolja; Schulz, Katrin (2023): Experimental Data for the Paper "An Empirical Evaluation of Constrained Feature Selection". Karlsruhe Institute of Technology. DOI: 10.35097/1345

    RADAR4KIT is an internet-based service for archiving and publishing research data from completed scientific studies and projects, available to researchers at KIT. It is operated by the Karlsruhe Institute of Technology (KIT). RADAR4KIT builds on the RADAR service provided by FIZ Karlsruhe. The data are stored exclusively on KIT's IT infrastructure at the Steinbuch Centre for Computing (SCC).

    Content assessment and quality control are carried out exclusively by the data providers.

    1. The usage relationship between you (the "data user") and KIT is limited to downloading data packages or metadata. KIT reserves the right to restrict the use of RADAR4KIT or to discontinue the service entirely.
    2. If you register as a data user or authenticate via Shibboleth, the data provider may also grant you access to unpublished documents.
    3. The protection of your personal data is governed by the privacy policy.
    4. KIT assumes no warranty or liability for the correctness, currency, or reliability of the provided content, except in cases of mandatory statutory liability.
    5. KIT does not charge data users for searching RADAR4KIT or for downloading data packages.
    6. You must comply with the license terms associated with the data package.