{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8bc02404-8cd1-46d9-8237-2d035ebb3e79",
|
|
"metadata": {},
|
|
"source": [
|
|
"# **[Project] Cancer Subtype Classification**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0c5076f4",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Introduction"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8a599748",
|
|
"metadata": {},
|
|
"source": [
|
|
"The [TCGA Kidney Cancers Dataset](https://archive.ics.uci.edu/dataset/892/tcga+kidney+cancers) is a bulk RNA-seq dataset that contains transcriptome profiles (i.e., gene expression quantification data) of patients diagnosed with three different subtypes of kidney cancers.\n",
|
|
"This dataset can be used to make predictions about the specific subtype of kidney cancers given the normalized transcriptome profile data.\n",
|
|
"\n",
|
|
"The normalized transcriptome profile data is given as **TPM** and **FPKM** for each gene.\n",
|
|
"\n",
|
|
"> TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) are two common methods for quantifying gene expression in RNA sequencing data.\n",
|
|
"> They both aim to account for the differences in sequencing depth and transcript length when estimating gene expression levels.\n",
|
|
">\n",
|
|
"> **TPM** (Transcripts Per Million):\n",
|
|
"> - TPM is a measure of gene expression that normalizes for both library size (sequencing depth) and transcript length.\n",
|
|
"> - The main idea behind TPM is to express the abundance of a transcript relative to the total number of transcripts in a sample, scaled to one million.\n",
|
|
">\n",
|
|
"> **FPKM** (Fragments Per Kilobase Million):\n",
|
|
"> - FPKM is another method for quantifying gene expression, which is commonly used in older RNA-seq analysis pipelines. It's similar in concept to TPM but differs in the way it's calculated.\n",
|
|
"> - FPKM also normalizes for library size and transcript length, but it measures gene expression as the number of fragments (in paired-end sequencing, one fragment corresponds to one read pair) per kilobase of exon model per million mapped fragments.\n",
|
|
">\n",
|
|
"> Because the TPM values of every sample sum to one million, they can be compared across samples as proportions, which is why TPM is generally preferred over FPKM in many modern RNA-seq analysis workflows.\n",
|
|
"\n",
|
|
"We provide one dataset for each kidney cancer subtype:\n",
|
|
"\n",
|
|
"- [TCGA-KICH](https://portal.gdc.cancer.gov/projects/TCGA-KICH): kidney chromophobe (chromophobe renal cell carcinoma)\n",
|
|
"- [TCGA-KIRC](https://portal.gdc.cancer.gov/projects/TCGA-KIRC): kidney renal clear cell carcinoma\n",
|
|
"- [TCGA-KIRP](https://portal.gdc.cancer.gov/projects/TCGA-KIRP): kidney renal papillary cell carcinoma\n",
|
|
"\n",
|
|
"> This and _much_ more data is openly available on the [NCI Genomic Data Commons (GDC) Data Portal](https://portal.gdc.cancer.gov/)."
|
|
]
|
|
},
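{
"cell_type": "markdown",
"id": "added-tpm-fpkm-sketch",
"metadata": {},
"source": [
"To make the two normalizations concrete, the following minimal sketch computes FPKM and TPM for three made-up genes. The fragment counts and transcript lengths are illustrative assumptions and are not taken from the TCGA data; `fpkm` and `tpm` are hypothetical helper names.\n",
"\n",
"```python\n",
"def fpkm(counts, lengths_bp):\n",
"    # Fragments per kilobase of transcript per million mapped fragments\n",
"    total = sum(counts)\n",
"    return [c * 1e9 / (total * length) for c, length in zip(counts, lengths_bp)]\n",
"\n",
"\n",
"def tpm(counts, lengths_bp):\n",
"    # Normalize by transcript length first, then rescale so the values sum to 1e6\n",
"    rates = [c / length for c, length in zip(counts, lengths_bp)]\n",
"    total_rate = sum(rates)\n",
"    return [r * 1e6 / total_rate for r in rates]\n",
"\n",
"\n",
"# Made-up fragment counts and transcript lengths (in base pairs) for three genes\n",
"counts = [100, 200, 300]\n",
"lengths_bp = [1000, 2000, 1000]\n",
"\n",
"fpkm_values = fpkm(counts, lengths_bp)\n",
"tpm_values = tpm(counts, lengths_bp)\n",
"\n",
"print(fpkm_values)\n",
"print(tpm_values)\n",
"print(sum(tpm_values))\n",
"```\n",
"\n",
"In this sketch the TPM values sum to one million by construction, whereas the sum of the FPKM values depends on the counts and lengths in the sample."
]
},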
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "16712787",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data access"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6421ef6c",
|
|
"metadata": {},
|
|
"source": [
|
|
"There are two ways to access the data: via the TNT homepage or the GDC Data Portal."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b977e8b8",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Download from the TNT homepage (_recommended_)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "800fa7bd",
|
|
"metadata": {},
|
|
"source": [
|
|
"The download from the TNT homepage is straightforward:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "dda97b16",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-cancer-classification.tar.gz\n",
|
|
"! tar -xzvf project-cancer-classification.tar.gz\n",
|
|
"! mv -v project-cancer-classification/ data/\n",
|
|
"! rm -v project-cancer-classification.tar.gz"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "bc2db880",
|
|
"metadata": {},
|
|
"source": [
|
|
"In the `data/` folder you will now find many files in the [TSV format](https://en.wikipedia.org/wiki/Tab-separated_values) ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)-like with tabs as delimiter) containing the normalized transcriptome profile data.\n",
|
|
"\n",
|
|
"To start, you can read a TSV file into a [pandas](https://pandas.pydata.org) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function with the `sep` parameter set to `\\t`:"
|
|
]
|
|
},
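{
"cell_type": "markdown",
"id": "added-read-tsv-sketch",
"metadata": {},
"source": [
"A minimal sketch of such a call is shown here. To stay self-contained it parses a small in-memory string instead of a downloaded file. The file layout (one comment line, one header line, four leading summary rows) is an assumption inferred from the loading code used later in this notebook (`header=1`, skipping four rows), and all row names and values are made up.\n",
"\n",
"```python\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# Made-up miniature file: one comment line, one header line,\n",
"# four summary rows without TPM values, then one row per gene\n",
"sample_tsv = (\n",
"    \"# gene-model: example\\n\"\n",
"    \"gene_id\\tgene_name\\ttpm_unstranded\\n\"\n",
"    \"N_unmapped\\t\\t\\n\"\n",
"    \"N_multimapping\\t\\t\\n\"\n",
"    \"N_noFeature\\t\\t\\n\"\n",
"    \"N_ambiguous\\t\\t\\n\"\n",
"    \"gene_a\\tA\\t20.331\\n\"\n",
"    \"gene_b\\tB\\t0.0\\n\"\n",
"    \"gene_c\\tC\\t25.1806\\n\"\n",
")\n",
"\n",
"# header=1 uses the second line as column names and ignores the line before it\n",
"df = pd.read_csv(io.StringIO(sample_tsv), sep=\"\\t\", header=1)\n",
"\n",
"# Drop the four summary rows and keep the TPM column as a NumPy array\n",
"tpm = df[\"tpm_unstranded\"].iloc[4:].to_numpy()\n",
"print(tpm)\n",
"```\n",
"\n",
"For a real file, the `io.StringIO(...)` argument would be replaced by the path to one of the `.tsv` files in the `data/` folder."
]
},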
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ed50d396-fe33-47a7-ad19-8eb975ef0fa5",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Reading the gene expression files and saving them to a file"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "2adae4ff",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Read 1034 files.\n"
]
}
],
"source": [
"import os\n",
"import pickle\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Cancer subtypes; each one has its own folder, e.g. './data/tcga-kirp-geq'\n",
"labels = [\"kirp\", \"kirc\", \"kich\"]\n",
"n_files = 0\n",
"\n",
"rick = []  # TPM vectors only (one NumPy array per sample)\n",
"data = []  # [TPM vector, subtype label] pairs\n",
"\n",
"for l in labels:\n",
"    root_folder = f\"./data/tcga-{l}-geq\"\n",
"    for root, dirs, files in os.walk(root_folder):\n",
"        for file in files:\n",
"            if file.endswith('.tsv'):\n",
"                n_files += 1\n",
"                # Full path to the file\n",
"                file_path = os.path.join(root, file)\n",
"                # The second line of each file holds the column names (header=1)\n",
"                df = pd.read_csv(filepath_or_buffer=file_path, sep=\"\\t\", header=1)\n",
"                df = df['tpm_unstranded']\n",
"\n",
"                # Skip the first four rows (summary statistics rather than genes)\n",
"                df = df[4:]\n",
"                df = np.array(df)\n",
"                rick.append(df)\n",
"\n",
"                data.append([df, l])\n",
"\n",
"print(f\"Read {n_files} files.\")\n",
"#tsv_file_path = \"data/tcga-kich-geq/0ba21ef5-0829-422e-a674-d3817498c333/4868e8fc-e045-475a-a81d-ef43eabb7066.rna_seq.augmented_star_gene_counts.tsv\"\n",
"\n",
"# Read the TSV file into a DataFrame\n",
"#df = pd.read_csv(filepath_or_buffer=tsv_file_path, sep=\"\\t\", header=1)\n",
"\n",
"# Display the first few rows of the DataFrame\n",
"#print(df.head(n=20))\n",
"#rick = np.array(rick)\n",
"\n",
"# Save the list to a pickle file\n",
"#with open('rick.pickle', 'wb') as f:\n",
"#    pickle.dump(rick, f)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "dfe4f964-6068-46da-8103-194525086f01",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>genome_frequencies</th>\n",
|
|
" <th>cancer_type</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>[20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>[37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>[45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>[15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>[35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" genome_frequencies cancer_type\n",
|
|
"0 [20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,... kirp\n",
|
|
"1 [37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29.... kirp\n",
|
|
"2 [45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8... kirp\n",
|
|
"3 [15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66... kirp\n",
|
|
"4 [35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18.... kirp"
|
|
]
|
|
},
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data_Frame = pd.DataFrame(data, columns=[\"genome_frequencies\", \"cancer_type\"])\n",
|
|
"data_Frame.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "0f5cc92a-4485-4184-845e-116ea9a9776d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Save the data to a local file\n",
"with open('rick.pickle', 'wb') as f:\n",
"    pickle.dump(data_Frame, f)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "b7b79958-baba-4630-9def-cf47afe43d9f",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pickle\n",
|
|
"\n",
|
|
"# Load the DataFrame from the pickle file\n",
"with open('rick.pickle', 'rb') as f:\n",
"    data_Frame = pickle.load(f)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "f6608b92-8ace-4a52-a3dc-70c578e56f0d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>genome_frequencies</th>\n",
|
|
" <th>cancer_type</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>[20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>[37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>[45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>[15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>[35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" genome_frequencies cancer_type\n",
|
|
"0 [20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,... kirp\n",
|
|
"1 [37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29.... kirp\n",
|
|
"2 [45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8... kirp\n",
|
|
"3 [15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66... kirp\n",
|
|
"4 [35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18.... kirp"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data_Frame.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c60cbf60-d904-4ee0-8f70-588bb109368b",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "583e39c8-13ba-422e-9c39-9cf1c8d63d5b",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Training set & validation set"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "38695a70-86e9-4dd0-b622-33e3762372eb",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"DataSet shape: (1034, 2)\n",
|
|
"Training set\n",
|
|
"------------\n",
|
|
"Dataframe shape: (827, 2)\n",
|
|
"Dataframe head:\n",
|
|
" genome_frequencies cancer_type\n",
|
|
"518 [25.0645, 0.1125, 56.3997, 3.3108, 1.6061, 12.... kirc\n",
|
|
"355 [32.6449, 2.1789, 63.4954, 6.3228, 2.109, 40.9... kirc\n",
|
|
"528 [46.024, 0.0, 85.8077, 7.2567, 2.1301, 9.6509,... kirc\n",
|
|
"445 [153.0064, 1.6403, 99.3267, 7.3736, 1.3668, 10... kirc\n",
|
|
"986 [65.5167, 18.2363, 77.2126, 5.0375, 2.4628, 21... kich\n",
|
|
"\n",
|
|
"Validation set\n",
|
|
"--------------\n",
|
|
"Dataframe shape: (207, 2)\n",
|
|
"Dataframe head:\n",
|
|
" genome_frequencies cancer_type\n",
|
|
"294 [50.8994, 0.4635, 131.5049, 5.7193, 3.103, 15.... kirp\n",
|
|
"453 [35.857, 0.1018, 94.5681, 5.2997, 1.9388, 17.6... kirc\n",
|
|
"638 [11.3865, 0.2313, 28.5961, 3.0169, 0.7851, 8.2... kirc\n",
|
|
"139 [41.6119, 0.2207, 55.4377, 4.4395, 0.884, 3.56... kirp\n",
|
|
"539 [63.1646, 18.8107, 63.2703, 4.6696, 0.9466, 5.... kirc\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import os\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"\n",
|
|
"train_df, val_df = train_test_split(data_Frame, train_size=0.8, random_state=42)\n",
|
|
"\n",
|
|
"print(f\"DataSet shape: {data_Frame.shape}\")\n",
|
|
"print(f\"Training set{os.linesep}------------\")\n",
|
|
"print(f\"Dataframe shape: {train_df.shape}\")\n",
|
|
"print(f\"Dataframe head:{os.linesep}{train_df.head()}\")\n",
|
|
"print(\"\")\n",
|
|
"print(f\"Validation set{os.linesep}--------------\")\n",
|
|
"print(f\"Dataframe shape: {val_df.shape}\")\n",
|
|
"print(f\"Dataframe head:{os.linesep}{val_df.head()}\")"
|
|
]
|
|
},
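{
"cell_type": "markdown",
"id": "added-stratified-split-sketch",
"metadata": {},
"source": [
"The split above is purely random. If the three subtypes are not equally frequent, a stratified split can keep the class proportions the same in both subsets. The following sketch uses the `stratify` parameter of `train_test_split` on a made-up toy DataFrame; `toy_df` and its class counts are illustrative assumptions and not the real dataset.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"\n",
"# Made-up toy data: 30 kirc, 15 kirp and 5 kich samples\n",
"toy_df = pd.DataFrame({\n",
"    \"sample_id\": range(50),\n",
"    \"cancer_type\": [\"kirc\"] * 30 + [\"kirp\"] * 15 + [\"kich\"] * 5,\n",
"})\n",
"\n",
"# stratify keeps the share of each subtype equal in both subsets\n",
"toy_train, toy_val = train_test_split(\n",
"    toy_df, train_size=0.8, random_state=42, stratify=toy_df[\"cancer_type\"]\n",
")\n",
"\n",
"print(toy_train[\"cancer_type\"].value_counts().to_dict())\n",
"print(toy_val[\"cancer_type\"].value_counts().to_dict())\n",
"```\n",
"\n",
"Applied to this notebook, the same idea would amount to passing `stratify=data_Frame[\"cancer_type\"]` to the call above."
]
},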
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4903244b-548f-4672-967d-1c62825b6fce",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Building a custom PyTorch dataset"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7e333251-c4e7-41f0-a086-12a3d95b723f",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Opening the file with the collected sequences"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "e2f78725-cda6-4e8d-9029-a4a31f6f9ab7",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import numpy as np\n",
"import torch\n",
"from sklearn.preprocessing import LabelEncoder\n",
"from torch.utils.data import Dataset\n",
"\n",
"class GenomeDataset(Dataset):\n",
"    def __init__(self, dataframe):\n",
"        self.dataframe = dataframe\n",
"\n",
"        # Stack the per-sample vectors into one array first; building a tensor\n",
"        # directly from a list of NumPy arrays is very slow\n",
"        features = np.stack(dataframe['genome_frequencies'].tolist()).astype(np.float32)\n",
"        self.genome_frequencies = torch.from_numpy(features)\n",
"\n",
"        # Encode the cancer types as integer class labels\n",
"        self.label_encoder = LabelEncoder()\n",
"        self.cancer_types = torch.tensor(self.label_encoder.fit_transform(dataframe['cancer_type']), dtype=torch.long)\n",
"\n",
"    def __getitem__(self, index):\n",
"        # Return a tuple of the expression vector and the corresponding cancer type\n",
"        return self.genome_frequencies[index], self.cancer_types[index]\n",
"\n",
"    def __len__(self):\n",
"        return len(self.dataframe)\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "aaa2c50c-c79e-4bca-812f-1a06c9f485d5",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Example usage with the training and validation DataFrames from the split above\n",
"train_dataset = GenomeDataset(train_df)\n",
"valid_dataset = GenomeDataset(val_df)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "a7fb59af-bd06-42d4-acce-03266a85bf36",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Genome frequency from dataframe:\n",
|
|
"[2.50645e+01 1.12500e-01 5.63997e+01 ... 0.00000e+00 1.29000e-02\n",
|
|
" 2.47100e-01]\n",
|
|
"\n",
|
|
"Cancer type from dataframe: kirc\n",
|
|
"\n",
|
|
"Genome frequency from dataset:\n",
|
|
"tensor([2.5065e+01, 1.1250e-01, 5.6400e+01, ..., 0.0000e+00, 1.2900e-02,\n",
|
|
" 2.4710e-01])\n",
|
|
"\n",
|
|
"Cancer type from dataset: 1\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Inspect the first item from the training dataframe\n",
|
|
"train_df_head = train_df.head(n=1)\n",
|
|
"train_df_genome_frequence = train_df_head.iloc[0][\"genome_frequencies\"]\n",
|
|
"train_df_cancer_type = train_df_head.iloc[0][\"cancer_type\"]\n",
|
|
"print(f\"Genome frequency from dataframe:{os.linesep}{train_df_genome_frequence}{os.linesep}\")\n",
|
|
"print(f\"Cancer type from dataframe: {train_df_cancer_type}{os.linesep}\")\n",
|
|
"\n",
|
|
"# Inspect the first item from the training dataset\n",
|
|
"datapoint_features, datapoint_label = train_dataset[0]\n",
|
|
"print(f\"Genome frequency from dataset:{os.linesep}{datapoint_features}{os.linesep}\")\n",
|
|
"print(f\"Cancer type from dataset: {datapoint_label}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9199fdeb-0d48-44c2-8bec-db2a7d7cbd4d",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Neural network definition"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e53132b9-6222-4739-be49-7628e5a37709",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Simple neural network"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "76b8eec8-d24b-4696-82bf-ebb286e7d1e7",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"from torch.utils.data import DataLoader\n",
"\n",
"# Model definition: one hidden layer with a ReLU activation\n",
"class SimpleNN(nn.Module):\n",
"    def __init__(self, input_size, hidden_size, num_classes):\n",
"        super(SimpleNN, self).__init__()\n",
"        self.fc1 = nn.Linear(input_size, hidden_size)\n",
"        self.relu = nn.ReLU()\n",
"        self.fc2 = nn.Linear(hidden_size, num_classes)\n",
"\n",
"    def forward(self, x):\n",
"        out = self.fc1(x)\n",
"        out = self.relu(out)\n",
"        # Raw class scores (logits); no softmax is applied here\n",
"        out = self.fc2(out)\n",
"        return out"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e2e9e0dd-3d4f-4999-9e65-704266d5e4a2",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
|
"### Complex neural network"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 32,
|
|
"id": "944d463e-12ed-4447-8587-ee9c60ce3eb6",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"from torch.utils.data import DataLoader\n",
"\n",
"class ComplexNN(nn.Module):\n",
"    def __init__(self, input_size, hidden_size, num_classes):\n",
"        super(ComplexNN, self).__init__()\n",
"        # Layer definitions (hidden_size is accepted for compatibility with SimpleNN but not used)\n",
"        self.fc1 = nn.Linear(input_size, 1024)   # input layer\n",
"        self.fc2 = nn.Linear(1024, 512)          # hidden layer\n",
"        self.fc3 = nn.Linear(512, 256)           # further hidden layer\n",
"        self.fc4 = nn.Linear(256, num_classes)   # output layer\n",
"        # Activation and dropout are modules: create them once, then call them on tensors\n",
"        self.relu = nn.ReLU()\n",
"        self.dropout = nn.Dropout(p=0.5)\n",
"\n",
"    def forward(self, x):\n",
"        # Forward pass\n",
"        x = self.dropout(self.relu(self.fc1(x)))\n",
"        x = self.dropout(self.relu(self.fc2(x)))\n",
"        x = self.relu(self.fc3(x))\n",
"        # Return raw logits: nn.CrossEntropyLoss applies log-softmax internally\n",
"        x = self.fc4(x)\n",
"        return x"
]
},
{
"cell_type": "code",
"execution_count": 33,
"id": "60789428-7d6e-4737-a83a-1138f6a650f7",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Assumption: input_size is the length of each expression vector and num_classes is the number of cancer subtypes\n",
"#model = SimpleNN(input_size=60660, hidden_size=5000, num_classes=3)\n",
"model = ComplexNN(input_size=60660, hidden_size=5000, num_classes=3)\n",
"\n",
"# Data loaders\n",
"train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)\n",
"valid_loader = DataLoader(dataset=valid_dataset, batch_size=64, shuffle=False)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"id": "de6e81de-0096-443a-a0b6-90cddecf5f88",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Loss function and optimizer\n",
"criterion = nn.CrossEntropyLoss()\n",
"optimizer = optim.Adam(model.parameters(), lr=0.001)\n",
"num_epochs = 70"
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "a5deb2ed-c685-4d80-bc98-d6dd27334d82",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"# Lists for storing the losses\n",
"train_losses = []\n",
"valid_losses = []\n",
"\n",
"for epoch in range(num_epochs):\n",
"    model.train()\n",
"    train_loss = 0.0\n",
"    for i, (inputs, labels) in enumerate(train_loader):\n",
"        optimizer.zero_grad()\n",
"        outputs = model(inputs)\n",
"        loss = criterion(outputs, labels)\n",
"        loss.backward()\n",
"        optimizer.step()\n",
"        train_loss += loss.item()\n",
"\n",
"    # Average training loss per batch\n",
"    train_loss /= len(train_loader)\n",
"    train_losses.append(train_loss)\n",
"\n",
"    # Validation loss\n",
"    model.eval()\n",
"    valid_loss = 0.0\n",
"    with torch.no_grad():\n",
"        for inputs, labels in valid_loader:\n",
"            outputs = model(inputs)\n",
"            loss = criterion(outputs, labels)\n",
"            valid_loss += loss.item()\n",
"\n",
"    # Average validation loss per batch\n",
"    valid_loss /= len(valid_loader)\n",
"    valid_losses.append(valid_loss)\n",
"\n",
"    print(f'Epoch [{epoch+1}/{num_epochs}], training loss: {train_loss:.4f}, validation loss: {valid_loss:.4f}')"
|
|
]
|
|
},
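{
"cell_type": "markdown",
"id": "added-cross-entropy-sketch",
"metadata": {},
"source": [
"`nn.CrossEntropyLoss` combines a log-softmax with the negative log-likelihood, so it expects raw, unnormalized scores (logits) from the model. The following plain-Python sketch reproduces that computation for a single sample; the helper name `cross_entropy_from_logits` and the example logits are illustrative assumptions.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"\n",
"def cross_entropy_from_logits(logits, target_index):\n",
"    # Subtract the maximum for numerical stability before exponentiating\n",
"    largest = max(logits)\n",
"    shifted = [z - largest for z in logits]\n",
"    log_sum_exp = math.log(sum(math.exp(z) for z in shifted))\n",
"    # Negative log-probability of the target class under the softmax\n",
"    return -(shifted[target_index] - log_sum_exp)\n",
"\n",
"\n",
"# Made-up logits for three classes\n",
"logits = [0.5, 2.0, -1.0]\n",
"\n",
"loss_best = cross_entropy_from_logits(logits, target_index=1)   # target has the highest logit\n",
"loss_worst = cross_entropy_from_logits(logits, target_index=2)  # target has the lowest logit\n",
"print(loss_best, loss_worst)\n",
"```\n",
"\n",
"The loss is small when the target class has the highest logit and grows as the target logit falls behind the others, which is why no sigmoid or softmax layer is needed in front of this loss."
]
},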
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "baf1caa8-d3d9-48e8-9339-81194521528d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import matplotlib.pyplot as plt\n",
|
|
"\n",
|
|
"plt.plot(train_losses, label='Training loss')\n",
"plt.plot(valid_losses, label='Validation loss')\n",
"plt.xlabel('Epoch')\n",
"plt.ylabel('Loss')\n",
"plt.title('Training and validation loss over time')\n",
|
|
"plt.legend()\n",
|
|
"plt.show()\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8e339354-a7cc-4e8a-9323-4be41ef62117",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Load the pickled data from the file\n",
"with open('rick.pickle', 'rb') as f:\n",
"    rick = pickle.load(f)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "be10a487-728e-4953-a081-9103d485378c",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Principal component analysis (PCA)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "088db0b3-8c33-41ff-a543-1b1e50c5e589",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Transformed data: [[-6.02552113e+01 4.57642675e+01 1.11957079e+02 ... 2.58331825e+00\n",
" 9.99342571e-01 -2.77477317e-01]\n",
" [-1.64705386e+01 9.03712725e+00 1.04837673e+01 ... 4.06859167e+00\n",
" 2.01083350e+00 1.49404086e+00]\n",
" [ 7.52348753e+00 -1.55853934e+01 -4.76301782e+01 ... -7.87604764e+00\n",
" -7.56801224e-02 8.37028680e+00]\n",
" ...\n",
" [-2.72012678e+01 4.44526098e+00 2.60063820e+01 ... 3.08321694e-01\n",
" 2.28939485e+00 -7.14920382e+00]\n",
" [-3.48027066e+01 2.27021639e+01 5.51486742e+01 ... -1.77955416e+01\n",
" 6.24722406e+00 2.32101665e+01]\n",
" [-3.98223613e+01 1.88534866e+01 5.32794498e+01 ... -1.45806809e+00\n",
" 1.18270903e+01 -2.84291311e+00]]\n",
"Explained variance ratio per component: [0.15056597 0.0997506 0.06070173 0.03658789 0.03530275 0.0263503\n",
" 0.02322747 0.01705354 0.01534278 0.01281486 0.01116959 0.0107472\n",
" 0.00989894 0.00906208 0.00871621 0.00813403 0.0074718 0.00708769\n",
" 0.00667045 0.00633275 0.00579241 0.00556758 0.00532382 0.00519289\n",
" 0.00476404 0.00472014 0.00457837 0.00414668 0.00399478 0.00380604\n",
" 0.00362433 0.00349278 0.00336446 0.00323228 0.00310834 0.00300595\n",
" 0.00297408 0.00285178 0.00280688 0.00273987 0.00268256 0.00263102\n",
" 0.00250513 0.00248987 0.0024505 0.0023979 0.00235971 0.00218554\n",
" 0.00217143 0.00212775 0.00210793 0.00205678 0.00202224 0.00200579\n",
" 0.00194754 0.00189606 0.00187714 0.00184969 0.00180133 0.00178537\n",
" 0.00176576 0.00172542 0.00168211 0.00167483 0.00162565 0.00159444\n",
" 0.00158667 0.00155982 0.00155534 0.00151929 0.00149558 0.00147549\n",
" 0.00146982 0.00146262 0.00143338 0.00142085 0.00140628 0.00139744\n",
" 0.00136563 0.00136169 0.00134972 0.00132027 0.00129168 0.00127963\n",
" 0.00126629 0.0012562 0.00123608 0.00122899 0.0012035 0.0011899\n",
" 0.00118094 0.00117162 0.00116552 0.00114295 0.00112631 0.00111896\n",
" 0.00110193 0.00109004 0.00108523 0.00106574 0.00106381 0.001051\n",
" 0.00104179 0.00103669 0.00103248 0.00101669 0.00100527 0.00099315\n",
" 0.00097478 0.00096486 0.00096244 0.00094792 0.00094463 0.00093107\n",
" 0.00092485 0.00090851 0.00089848 0.00089134 0.00087855 0.00087068\n",
" 0.00086397 0.00085563 0.00084342 0.00083406 0.00083064 0.00081791\n",
" 0.00080368 0.00080183 0.00079167 0.00079072 0.00078868 0.00078028\n",
" 0.00077115 0.00076662 0.00076043 0.00075196 0.0007447 0.0007332\n",
" 0.0007252 0.00072345 0.00071902 0.00070594 0.00070125 0.00069603\n",
" 0.00069029 0.00068619 0.00068012 0.00067224 0.00066615 0.00066017]\n"
]
}
],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Stack the per-sample TPM vectors from data_Frame into a matrix\n",
"# of shape (n_samples, n_genes)\n",
"X = np.stack(data_Frame[\"genome_frequencies\"].tolist())\n",
"\n",
"# Standardize the data (zero mean and unit variance per gene)\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"# Create the PCA object\n",
"pca = PCA(n_components=150)  # keep 150 principal components\n",
"\n",
"# Run the PCA\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"# The samples projected onto the principal components\n",
"print(\"Transformed data:\", X_pca)\n",
"\n",
"# Share of the total variance explained by each component\n",
"print(\"Explained variance ratio per component:\", pca.explained_variance_ratio_)\n"
|
|
]
|
|
},
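{
"cell_type": "markdown",
"id": "added-cumulative-variance-sketch",
"metadata": {},
"source": [
"One way to judge how many principal components are worth keeping is the cumulative explained variance. The following plain-Python sketch uses only the first five ratios printed above; the helper name `components_for_variance` and the target values are illustrative assumptions.\n",
"\n",
"```python\n",
"from itertools import accumulate\n",
"\n",
"\n",
"def components_for_variance(ratios, target):\n",
"    # Return the smallest number of leading components whose cumulative\n",
"    # explained variance ratio reaches the target, or None if it never does\n",
"    for n_components, cumulative in enumerate(accumulate(ratios), start=1):\n",
"        if cumulative >= target:\n",
"            return n_components\n",
"    return None\n",
"\n",
"\n",
"# First five explained variance ratios from the PCA output above\n",
"ratios = [0.15056597, 0.0997506, 0.06070173, 0.03658789, 0.03530275]\n",
"\n",
"n_for_30 = components_for_variance(ratios, target=0.30)\n",
"n_for_90 = components_for_variance(ratios, target=0.90)\n",
"print(n_for_30, n_for_90)\n",
"```\n",
"\n",
"The same function can be applied to the full `pca.explained_variance_ratio_` array to pick `n_components` for a chosen variance target."
]
},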
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b11bbe20-0494-4e7a-83ff-3cb0bfa82f3b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.10"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|