
    Dataset: Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"

    Alternate identifier:
    -
    Related identifier:
    (Is Identical To) https://publikationen.bibliothek.kit.edu/1000171166 - URL
    Creator/Author:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798 [Bach, Jakob]
    Contributors:
    -
    Title:
    Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"
    Additional titles:
    -
    Description:
    (Abstract) These are the experimental data for the paper

    > Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"

    published on [arXiv](https://arxiv.org/) in 2024. You can find the paper [here](https://doi.org/10.48550/arXiv.2406.01411) and the code [here](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery). See the `README` for details. The datasets used in our study (which we also provide here) originate from [PMLB](https://epistasislab.github.io/pmlb/). The corresponding [GitHub repository](https://github.com/EpistasisLab/pmlb) is MIT-licensed ((c) 2016 Epistasis Lab at UPenn). Please see the file `LICENSE` in the folder `datasets/` for the license text.



(Technical Remarks)

# Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"

These are the experimental data for the paper

> Bach, Jakob. "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"

published on [arXiv](https://arxiv.org/) in 2024. If we create further versions of this paper in the future, these experimental data may cover them as well. Check our [GitHub repository](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery) for the code and instructions to reproduce the experiments.

We obtained the experimental results on a server with an `AMD EPYC 7551` CPU (32 physical cores, base clock of 2.0 GHz) and 160 GB RAM. The operating system was `Ubuntu 20.04.6 LTS`. The Python version was `3.8`. With this configuration, running the experimental pipeline (`run_experiments.py`) took about 34 hours.

The commit hash for the last run of the experimental pipeline (`run_experiments.py`) is [0a57bcd529](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/0a57bcd52938dce6285e8113d777360c2b17f30f). The commit hash for the last run of the evaluation pipeline (`run_evaluation_arxiv.py`) is [48f2465b4c](https://github.com/Jakob-Bach/Constrained-Subgroup-Discovery/tree/48f2465b4cabf2d6657e4ff4b73c28c240d0b883). We also tagged both commits (`run-2024-05-13` and `evaluation-2024-05-15`).

The experimental data are stored in two folders, `datasets/` and `results/`. Further, the console output of `run_evaluation_arxiv.py` is stored in `Evaluation_console_output.txt` (manually copied from the console to a file). In the following, we describe the structure and content of each data file.
## `datasets/`

These are the input data for the experimental pipeline `run_experiments.py`, i.e., the prediction datasets. The folder contains one overview file, one license file, and two files for each of the 27 datasets. The original datasets were downloaded from [PMLB](https://epistasislab.github.io/pmlb/) with the script `prepare_datasets.py`. Note that we do not own the copyright for these datasets. However, the [GitHub repository of PMLB](https://github.com/EpistasisLab/pmlb), which stores the original datasets, is MIT-licensed ((c) 2016 Epistasis Lab at UPenn). Thus, we include the file `LICENSE` from that repository.

After downloading from `PMLB`, we split each dataset into the feature part (`_X.csv`) and the target part (`_y.csv`), which we save separately. Both file types are CSVs that only contain numeric values (categorical features are ordinally encoded in `PMLB`), except for the column names. There are no missing values. Each row corresponds to a data object (= instance, sample), and each column corresponds either to a feature (in `_X`) or the target (in `_y`). The first line in each `_X` file contains the names of the features as strings; `_y` files contain only one column, always named `target`. For the prediction target, we ensured that the minority (i.e., less frequent) class is the positive class (i.e., has the class label `1`), so the labeling may differ from PMLB. `_dataset_overview.csv` contains meta-data for the datasets, like the number of instances and features.

## `results/`

These are the output data of the experimental pipeline in the form of CSVs, produced by the script `run_experiments.py`. `_results.csv` contains all results merged into one file and acts as input for the script `run_evaluation_arxiv.py`. The remaining files are subsets of the results, as the experimental pipeline parallelizes over 27 datasets, 5 cross-validation folds, and 6 subgroup-discovery methods.
Thus, there are `27 * 5 * 6 = 810` files containing subsets of the results. Each row in a result file corresponds to one subgroup. One can identify individual subgroup-discovery runs with a combination of multiple columns, i.e.:

- dataset `dataset_name`
- cross-validation fold `split_idx`
- subgroup-discovery method `sd_name`
- feature-cardinality threshold `param.k` (missing value if no feature-cardinality constraint)
- solver timeout `param.timeout` (missing value if not solver-based search)
- number of alternatives `param.a` (missing value if only original subgroup searched)
- dissimilarity threshold `param.tau_abs` (missing value if only original subgroup searched)

For each value combination of these seven columns, there is either one subgroup (search for original subgroups) or six subgroups (search for alternative subgroup descriptions, in which case the column `alt.number` identifies individual subgroups within a search run). Further, note that the last four mentioned columns contain missing values, which should be treated as a category of their own. In particular, if you use `groupby()` from `pandas` for analyzing the results and you want to include any of the last four mentioned columns in the grouping, you should either fill in the missing values with an (arbitrary) placeholder value or use `dropna=False`, because the grouping (by default) otherwise ignores the rows with missing values in the group columns.

The remaining columns represent results and evaluation metrics. In detail, all result files contain the following columns:

- `objective_value` (float in `[-0.25, 1]` + missing values): Objective value of the subgroup-discovery method on the training set. WRAcc when searching original subgroups and normalized Hamming similarity when searching alternative subgroup descriptions. Missing value for *MORS* as the subgroup-discovery method, since *MORS* does not explicitly compute an objective when searching for subgroups.
- `optimization_status` (string, 2 different values + missing values): For *SMT*, `sat` if optimal solution found and `unknown` if timeout. Missing value for all other subgroup-discovery methods (which do not use solver timeouts).
- `optimization_time` (non-negative float): The runtime of optimization in the subgroup-discovery method, i.e., without pre- and post-processing steps.
- `fitting_time` (non-negative float): The complete runtime of the subgroup-discovery method (as reported in the paper), i.e., including pre- and post-processing steps. Very similar to `optimization_time` except for *SMT* as the subgroup-discovery method, which may spend a considerable amount of time formulating the optimization problem.
- `train_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the training set.
- `test_wracc` (float in `[-0.25, 0.25]`): The weighted relative accuracy (WRAcc) of the subgroup description on the test set.
- `train_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the training set.
- `test_nwracc` (float in `[-1, 1]`): The normalized weighted relative accuracy (WRAcc divided by its dataset-dependent maximum) of the subgroup description on the test set.
- `box_lbs` (list of floats, e.g., `[-inf, 0, -inf, -2, 8]`): The lower bounds for each feature in the subgroup description. Negative infinity if a feature's lower bound did not exclude any data objects from the subgroup.
- `box_ubs` (list of floats, e.g., `[inf, 10, inf, 5, 9]`): The upper bounds for each feature in the subgroup description. Positive infinity if a feature's upper bound did not exclude any data objects from the subgroup.
- `selected_feature_idxs` (list of non-negative ints, e.g., `[0, 4, 5]`): The indices (starting from 0) of the features selected (= restricted) in the subgroup description.
  Is an empty list, i.e., `[]`, if no feature was restricted (thus, the subgroup contains all data objects).
- `dataset_name` (string, 27 different values): The name of the `PMLB` dataset used for subgroup discovery.
- `split_idx` (int in `[0, 4]`): The index of the cross-validation fold of the dataset used for subgroup discovery.
- `sd_name` (string, 6 different values): The name of the subgroup-discovery method (`Beam`, `BI`, `MORS`, `PRIM`, `Random`, or `SMT`).
- `param.k` (int in `[1, 5]` + missing values): The feature-cardinality threshold for subgroup descriptions. Missing value if unconstrained subgroup discovery. Always `3` if alternative subgroup descriptions searched.
- `param.timeout` (int in `[1, 2048]` + missing values): For *SMT*, solver timeout (in seconds) for optimization (not including formulation of the optimization problem). Missing value for all other subgroup-discovery methods.
- `alt.hamming` (float in `[0, 1]` + missing values): Normalized Hamming similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched. Missing value if only original subgroup searched.
- `alt.jaccard` (float in `[0, 1]` + missing values): Jaccard similarity between the current subgroup (original or alternative) and the original subgroup if alternative subgroup descriptions searched. Missing value if only original subgroup searched.
- `alt.number` (int in `[0, 5]` + missing values): The number of the current alternative if alternative subgroup descriptions searched. Missing value if only original subgroup searched. Thus, original subgroups have either `0` or a missing value in this column (i.e., for experimental settings where alternative subgroup descriptions are searched, there is no separate search for an original subgroup, only a joint sequential search for the original and its alternatives).
- `param.a` (int with value `5` + missing values): The number of desired alternative subgroup descriptions, not counting the original (zeroth) subgroup description. Missing value if only original subgroup searched.
- `param.tau_abs` (int in `[1, 3]` + missing values): The dissimilarity threshold for alternatives, corresponding to the absolute number of features that have to be deselected from the original subgroup description and each prior alternative. Missing value if only original subgroup searched.

You can easily read in any of the result files with `pandas`:

```python
import pandas as pd

results = pd.read_csv('results/_results.csv')
```

All result files are comma-separated and contain plain numbers and unquoted strings, apart from the columns `box_lbs`, `box_ubs`, and `selected_feature_idxs` (which represent lists and whose values are quoted, except for empty lists). The first line in each result file contains the column names. You can use the following code to make sure that the lists of feature indices are treated as such (rather than strings):

```python
import ast

results['selected_feature_idxs'] = results['selected_feature_idxs'].apply(ast.literal_eval)
```

Note that this conversion does not work for `box_lbs` and `box_ubs`, where the lists contain not only ordinary numbers but also `-inf` and `inf`; see [this *Stack Overflow* post](https://stackoverflow.com/questions/64773836/error-converting-string-list-to-list-when-it-contains-inf) for potential alternatives.
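One such alternative is a small hand-written parser: unlike `ast.literal_eval()`, Python's built-in `float()` accepts the strings `inf` and `-inf`, so mapping each list element to `float` handles the bound columns directly. This is a minimal sketch (not taken from the accompanying code base); the example values are hypothetical but follow the format described above.

```python
def parse_float_list(s):
    """Parse a string like '[-inf, 0, -inf, -2, 8]' into a list of floats."""
    inner = s.strip()[1:-1].strip()  # drop the surrounding brackets
    if not inner:  # empty list, i.e., '[]'
        return []
    return [float(v) for v in inner.split(',')]  # float() understands 'inf'/'-inf'

print(parse_float_list('[-inf, 0, -inf, -2, 8]'))  # [-inf, 0.0, -inf, -2.0, 8.0]
print(parse_float_list('[]'))                      # []
```

Applied to a loaded result frame, this would look like `results['box_lbs'] = results['box_lbs'].apply(parse_float_list)`, and analogously for `box_ubs`.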

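To illustrate the `groupby()` caveat from the technical remarks with a minimal sketch: by default, `pandas` silently drops rows whose group column contains a missing value, so runs without a feature-cardinality constraint (missing `param.k`) would vanish from grouped statistics. The toy frame below is hypothetical and only mimics the column structure of the result files.

```python
import pandas as pd

# Toy frame mimicking the results: `param.k` is missing for unconstrained runs.
df = pd.DataFrame({'param.k': [1.0, 1.0, None],
                   'test_wracc': [0.1, 0.2, 0.3]})

default = df.groupby('param.k')['test_wracc'].mean()                 # NaN group dropped
with_nan = df.groupby('param.k', dropna=False)['test_wracc'].mean()  # NaN group kept

print(len(default))   # 1 -> the unconstrained run is silently ignored
print(len(with_nan))  # 2 -> missing `param.k` treated as its own group
```

Filling the missing values with a placeholder (e.g., `df['param.k'].fillna(-1)`) before grouping achieves the same effect.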
    Keywords:
    subgroup discovery
    alternatives
    constraints
    satisfiability modulo theories
    explainability
    interpretability
    XAI
    Related information:
    -
    Language:
    -
    Publishers:
    Karlsruhe Institute of Technology
    Production year:
    2024
    Subject areas:
    Computer Science
    Resource type:
    Dataset
    Data source:
    -
    Software used:
    -
    Data processing:
    -
    Publication year:
    2024
    Rights holders:
    Bach, Jakob https://orcid.org/0000-0003-0301-2798
    Funding:
    -
    Status:
    Published
    Uploaded by:
    kitopen
    Created on:
    2024-05-30
    Archiving date:
    2024-06-03
    Archive size:
    5.4 MB
    Archive creator:
    kitopen
    Archive checksum:
    17861381b7cd00ff3879809403149944 (MD5)
    Embargo period:
    -
    The metadata was corrected retroactively. The original metadata will be available after download of the dataset.
    dataset/Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions"
    DOI: 10.35097/caKKJCtoKqgxyvqG
    Publication date: 2024-06-03

    Rights statement for the dataset
    This work is licensed under
    CC BY 4.0
    Cite Dataset
    Bach, Jakob (2024): Experimental Data for the Paper "Using Constraints to Discover Sparse and Alternative Subgroup Descriptions". Karlsruhe Institute of Technology. DOI: 10.35097/caKKJCtoKqgxyvqG
