{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8bc02404-8cd1-46d9-8237-2d035ebb3e79",
|
|
"metadata": {},
|
|
"source": [
|
|
"# **[Project] Cancer Subtype Classification**"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "0c5076f4",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Introduction"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "8a599748",
|
|
"metadata": {},
|
|
"source": [
|
|
"The [TCGA Kidney Cancers Dataset](https://archive.ics.uci.edu/dataset/892/tcga+kidney+cancers) is a bulk RNA-seq dataset that contains transcriptome profiles (i.e., gene expression quantification data) of patients diagnosed with three different subtypes of kidney cancers.\n",
|
|
"This dataset can be used to make predictions about the specific subtype of kidney cancers given the normalized transcriptome profile data.\n",
|
|
"\n",
|
|
"The normalized transcriptome profile data is given as **TPM** and **FPKM** for each gene.\n",
|
|
"\n",
|
"> TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase of transcript per Million mapped fragments) are two common methods for quantifying gene expression in RNA sequencing data.\n",
|
|
"> They both aim to account for the differences in sequencing depth and transcript length when estimating gene expression levels.\n",
|
|
">\n",
|
|
"> **TPM** (Transcripts Per Million):\n",
|
|
"> - TPM is a measure of gene expression that normalizes for both library size (sequencing depth) and transcript length.\n",
|
|
"> - The main idea behind TPM is to express the abundance of a transcript relative to the total number of transcripts in a sample, scaled to one million.\n",
|
|
">\n",
|
"> **FPKM** (Fragments Per Kilobase of transcript per Million mapped fragments):\n",
"> - FPKM is another method for quantifying gene expression, commonly used in older RNA-seq analysis pipelines. It is similar in concept to TPM but is calculated differently.\n",
"> - FPKM also normalizes for library size and transcript length: it measures gene expression as the number of fragments (single reads, or read pairs in paired-end sequencing) per kilobase of exon model per million mapped fragments.\n",
">\n",
"> Because TPM values sum to one million in every sample, they are easier to compare across samples than FPKM values, which makes TPM the preferred choice in many modern RNA-seq analysis workflows.\n",
|
|
"\n",
|
|
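"A minimal sketch of the two normalizations described above, using made-up fragment counts and gene lengths (all names and numbers here are illustrative assumptions, not values from the TCGA files):\n",
"\n",
"```python\n",
"# Toy example: three genes in one sample (hypothetical numbers).\n",
"counts = [100, 200, 700]       # fragments mapped to each gene\n",
"lengths_kb = [1.0, 2.0, 7.0]   # gene lengths in kilobases\n",
"\n",
"# FPKM: normalize by sequencing depth (per million fragments), then by length.\n",
"total_fragments = sum(counts)\n",
"fpkm = [c / (total_fragments / 1e6) / l for c, l in zip(counts, lengths_kb)]\n",
"\n",
"# TPM: normalize by length first, then rescale so the values sum to one million.\n",
"rates = [c / l for c, l in zip(counts, lengths_kb)]\n",
"tpm = [r / sum(rates) * 1e6 for r in rates]\n",
"\n",
"print(fpkm)\n",
"print(sum(tpm))\n",
"```\n",
"\n",
"In this toy sample every gene has the same count-to-length ratio, so all three genes receive the same FPKM and the same TPM; the TPM values sum to one million by construction.\n",
"\n",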
"We provide one dataset for each kidney cancer subtype:\n",
|
|
"\n",
|
"- [TCGA-KICH](https://portal.gdc.cancer.gov/projects/TCGA-KICH): kidney chromophobe (chromophobe renal cell carcinoma)\n",
|
|
"- [TCGA-KIRC](https://portal.gdc.cancer.gov/projects/TCGA-KIRC): kidney renal clear cell carcinoma\n",
|
|
"- [TCGA-KIRP](https://portal.gdc.cancer.gov/projects/TCGA-KIRP): kidney renal papillary cell carcinoma\n",
|
|
"\n",
|
|
"> This and _much_ more data is openly available on the [NCI Genomic Data Commons (GDC) Data Portal](https://portal.gdc.cancer.gov/)."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "16712787",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data access"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "6421ef6c",
|
|
"metadata": {},
|
|
"source": [
|
|
"There are two ways to access the data: via the TNT homepage or the GDC Data Portal."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b977e8b8",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Download from the TNT homepage (_recommended_)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "800fa7bd",
|
|
"metadata": {},
|
|
"source": [
|
|
"The download from the TNT homepage is straightforward:"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "dda97b16",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"ename": "SyntaxError",
|
|
"evalue": "invalid syntax (2666948873.py, line 6)",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;36m Cell \u001b[0;32mIn[1], line 6\u001b[0;36m\u001b[0m\n\u001b[0;31m from IPython.display import clear_output(wait=True)\u001b[0m\n\u001b[0m ^\u001b[0m\n\u001b[0;31mSyntaxError\u001b[0m\u001b[0;31m:\u001b[0m invalid syntax\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-cancer-classification.tar.gz\n",
|
|
"! tar -xzvf project-cancer-classification.tar.gz\n",
|
|
"! mv -v project-cancer-classification/ data/\n",
|
|
"! rm -v project-cancer-classification.tar.gz"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "bc2db880",
|
|
"metadata": {},
|
|
"source": [
|
|
"In the `data/` folder you will now find many files in the [TSV format](https://en.wikipedia.org/wiki/Tab-separated_values) ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)-like with tabs as delimiter) containing the normalized transcriptome profile data.\n",
|
|
"\n",
|
"To start, you can read a TSV file into a [pandas](https://pandas.pydata.org) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function with the `sep` parameter set to `\\t`:"
|
|
]
|
|
},
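{
"cell_type": "markdown",
"id": "sketch-read-csv",
"metadata": {},
"source": [
"The `read_csv` call described above can be sketched on a small in-memory table. The two-column layout, the leading comment line, and the gene names below are assumptions made for illustration; the real files contain more columns.\n",
"\n",
"```python\n",
"import io\n",
"\n",
"import pandas as pd\n",
"\n",
"# A tiny, made-up TSV: one comment line, one header line, then one row per gene.\n",
"tsv_text = (\n",
"    '# gene-model: example\\n'\n",
"    'gene_id\\ttpm_unstranded\\n'\n",
"    'GENE_A\\t20.331\\n'\n",
"    'GENE_B\\t0.0\\n'\n",
")\n",
"\n",
"# header=1 tells pandas that the column names are on the second line.\n",
"df = pd.read_csv(io.StringIO(tsv_text), sep='\\t', header=1)\n",
"print(df['tpm_unstranded'].tolist())\n",
"```\n",
"\n",
"Passing a file path instead of the `io.StringIO` object reads a TSV file from disk in the same way."
]
},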
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ed50d396-fe33-47a7-ad19-8eb975ef0fa5",
|
|
"metadata": {},
|
|
"source": [
|
"## Reading the gene expression files and saving them to a single file"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "2adae4ff",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
"Read 1034 files.\n"
]
}
],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import pickle\n",
"\n",
"\n",
"import os\n",
"#'./data/tcga-kirp-geq'\n",
"\n",
"labels = [\"kirp\", \"kirc\", \"kich\"] # Subtype labels; each one maps to a folder ./data/tcga-<label>-geq\n",
"n_files = 0\n",
"y = list()\n",
"x = list()\n",
"\n",
"rick = list()  # TPM vectors only\n",
"data = []      # [TPM vector, label] pairs\n",
"\n",
"for l in labels:\n",
"    root_folder = f\"./data/tcga-{l}-geq\"\n",
"    for root, dirs, files in os.walk(root_folder):\n",
"        for file in files:\n",
"            if file.endswith('.tsv'):\n",
"                n_files += 1\n",
"                # Full path to the file\n",
"                file_path = os.path.join(root, file)\n",
"                # Read the file; header=1 takes the column names from the second line\n",
"                df = pd.read_csv(filepath_or_buffer=file_path, sep=\"\\t\", header=1)\n",
"                df = df['tpm_unstranded']\n",
"\n",
"                # Skip the first four rows (presumably the alignment summary rows, which carry no TPM value)\n",
"                df = df[4:]\n",
"                df = np.array(df)\n",
"                rick.append(df)\n",
"\n",
"                data.append([df, l])\n",
"\n",
"print(f\"Read {n_files} files.\")\n",
"#tsv_file_path = \"data/tcga-kich-geq/0ba21ef5-0829-422e-a674-d3817498c333/4868e8fc-e045-475a-a81d-ef43eabb7066.rna_seq.augmented_star_gene_counts.tsv\"\n",
"\n",
"# Read the TSV file into a DataFrame\n",
"#df = pd.read_csv(filepath_or_buffer=tsv_file_path, sep=\"\\t\", header=1)\n",
"\n",
"# Display the first few rows of the DataFrame\n",
"#print(df.head(n=20))\n",
"#rick = np.array(rick)\n",
"\n",
"# Save the 'rick' list to a pickle file\n",
"#with open('rick.pickle', 'wb') as f:\n",
"#    pickle.dump(rick, f)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "dfe4f964-6068-46da-8103-194525086f01",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>genome_frequencies</th>\n",
|
|
" <th>cancer_type</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>[20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>[37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>[45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>[15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>[35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" genome_frequencies cancer_type\n",
|
|
"0 [20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,... kirp\n",
|
|
"1 [37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29.... kirp\n",
|
|
"2 [45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8... kirp\n",
|
|
"3 [15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66... kirp\n",
|
|
"4 [35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18.... kirp"
|
|
]
|
|
},
|
|
"execution_count": 2,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data_Frame = pd.DataFrame(data, columns=[\"genome_frequencies\", \"cancer_type\"])\n",
|
|
"data_Frame.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "0f5cc92a-4485-4184-845e-116ea9a9776d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"# Save the data to a local file\n",
"with open('rick.pickle', 'wb') as f:\n",
"    pickle.dump(data_Frame, f)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "b7b79958-baba-4630-9def-cf47afe43d9f",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"import pickle\n",
"\n",
"# Load the DataFrame from the pickle file\n",
"with open('rick.pickle', 'rb') as f:\n",
"    data_Frame = pickle.load(f)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "f6608b92-8ace-4a52-a3dc-70c578e56f0d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"data": {
|
|
"text/html": [
|
|
"<div>\n",
|
|
"<style scoped>\n",
|
|
" .dataframe tbody tr th:only-of-type {\n",
|
|
" vertical-align: middle;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe tbody tr th {\n",
|
|
" vertical-align: top;\n",
|
|
" }\n",
|
|
"\n",
|
|
" .dataframe thead th {\n",
|
|
" text-align: right;\n",
|
|
" }\n",
|
|
"</style>\n",
|
|
"<table border=\"1\" class=\"dataframe\">\n",
|
|
" <thead>\n",
|
|
" <tr style=\"text-align: right;\">\n",
|
|
" <th></th>\n",
|
|
" <th>genome_frequencies</th>\n",
|
|
" <th>cancer_type</th>\n",
|
|
" </tr>\n",
|
|
" </thead>\n",
|
|
" <tbody>\n",
|
|
" <tr>\n",
|
|
" <th>0</th>\n",
|
|
" <td>[20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>1</th>\n",
|
|
" <td>[37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>2</th>\n",
|
|
" <td>[45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>3</th>\n",
|
|
" <td>[15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66...</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" <tr>\n",
|
|
" <th>4</th>\n",
|
|
" <td>[35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18....</td>\n",
|
|
" <td>kirp</td>\n",
|
|
" </tr>\n",
|
|
" </tbody>\n",
|
|
"</table>\n",
|
|
"</div>"
|
|
],
|
|
"text/plain": [
|
|
" genome_frequencies cancer_type\n",
|
|
"0 [20.331, 0.0, 25.1806, 1.1301, 0.4836, 7.3269,... kirp\n",
|
|
"1 [37.0405, 0.5002, 77.4246, 4.2188, 1.0408, 29.... kirp\n",
|
|
"2 [45.4456, 0.0903, 74.9545, 4.843, 1.5188, 11.8... kirp\n",
|
|
"3 [15.2345, 0.3393, 62.0003, 2.4412, 0.932, 2.66... kirp\n",
|
|
"4 [35.0709, 0.2333, 62.8022, 2.8872, 1.0547, 18.... kirp"
|
|
]
|
|
},
|
|
"execution_count": 3,
|
|
"metadata": {},
|
|
"output_type": "execute_result"
|
|
}
|
|
],
|
|
"source": [
|
|
"data_Frame.head()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "c60cbf60-d904-4ee0-8f70-588bb109368b",
|
|
"metadata": {},
|
|
"source": [
|
|
"# Data preprocessing"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "583e39c8-13ba-422e-9c39-9cf1c8d63d5b",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Training set & validation set"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "38695a70-86e9-4dd0-b622-33e3762372eb",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"DataSet shape: (1034, 2)\n",
|
|
"Training set\n",
|
|
"------------\n",
|
|
"Dataframe shape: (827, 2)\n",
|
|
"Dataframe head:\n",
|
|
" genome_frequencies cancer_type\n",
|
|
"518 [25.0645, 0.1125, 56.3997, 3.3108, 1.6061, 12.... kirc\n",
|
|
"355 [32.6449, 2.1789, 63.4954, 6.3228, 2.109, 40.9... kirc\n",
|
|
"528 [46.024, 0.0, 85.8077, 7.2567, 2.1301, 9.6509,... kirc\n",
|
|
"445 [153.0064, 1.6403, 99.3267, 7.3736, 1.3668, 10... kirc\n",
|
|
"986 [65.5167, 18.2363, 77.2126, 5.0375, 2.4628, 21... kich\n",
|
|
"\n",
|
|
"Validation set\n",
|
|
"--------------\n",
|
|
"Dataframe shape: (207, 2)\n",
|
|
"Dataframe head:\n",
|
|
" genome_frequencies cancer_type\n",
|
|
"294 [50.8994, 0.4635, 131.5049, 5.7193, 3.103, 15.... kirp\n",
|
|
"453 [35.857, 0.1018, 94.5681, 5.2997, 1.9388, 17.6... kirc\n",
|
|
"638 [11.3865, 0.2313, 28.5961, 3.0169, 0.7851, 8.2... kirc\n",
|
|
"139 [41.6119, 0.2207, 55.4377, 4.4395, 0.884, 3.56... kirp\n",
|
|
"539 [63.1646, 18.8107, 63.2703, 4.6696, 0.9466, 5.... kirc\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import os\n",
|
|
"from sklearn.model_selection import train_test_split\n",
|
|
"\n",
|
|
"train_df, val_df = train_test_split(data_Frame, train_size=0.8, random_state=42)\n",
|
|
"\n",
|
|
"print(f\"DataSet shape: {data_Frame.shape}\")\n",
|
|
"print(f\"Training set{os.linesep}------------\")\n",
|
|
"print(f\"Dataframe shape: {train_df.shape}\")\n",
|
|
"print(f\"Dataframe head:{os.linesep}{train_df.head()}\")\n",
|
|
"print(\"\")\n",
|
|
"print(f\"Validation set{os.linesep}--------------\")\n",
|
|
"print(f\"Dataframe shape: {val_df.shape}\")\n",
|
|
"print(f\"Dataframe head:{os.linesep}{val_df.head()}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "4903244b-548f-4672-967d-1c62825b6fce",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Building a custom PyTorch dataset"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "7e333251-c4e7-41f0-a086-12a3d95b723f",
|
|
"metadata": {},
|
|
"source": [
|
"## Opening the file with the collected sequences"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "e2f78725-cda6-4e8d-9029-a4a31f6f9ab7",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"from torch.utils.data import Dataset\n",
"import torch\n",
"import pandas as pd\n",
"from sklearn.preprocessing import LabelEncoder\n",
"\n",
"class GenomeDataset(Dataset):\n",
"    def __init__(self, dataframe):\n",
"        self.dataframe = dataframe\n",
"\n",
"        # Convert the expression vectors (TPM values) into a single float tensor\n",
"        self.genome_frequencies = torch.tensor(dataframe['genome_frequencies'].tolist(), dtype=torch.float32)\n",
"\n",
"        # Encode the cancer types as integer class labels\n",
"        self.label_encoder = LabelEncoder()\n",
"        self.cancer_types = torch.tensor(self.label_encoder.fit_transform(dataframe['cancer_type']), dtype=torch.long)\n",
"\n",
"    def __getitem__(self, index):\n",
"        # Return a tuple of the expression vector and the corresponding class label\n",
"        return self.genome_frequencies[index], self.cancer_types[index]\n",
"\n",
"    def __len__(self):\n",
"        return len(self.dataframe)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "aaa2c50c-c79e-4bca-812f-1a06c9f485d5",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"/tmp/ipykernel_343/2483914749.py:11: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:245.)\n",
|
|
" self.genome_frequencies = torch.tensor(dataframe['genome_frequencies'].tolist(), dtype=torch.float32)\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
"# Example usage\n",
"# train_df and val_df are the training and validation dataframes created above\n",
|
|
"train_dataset = GenomeDataset(train_df)\n",
|
|
"valid_dataset = GenomeDataset(val_df)"
|
|
]
|
|
},
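{
"cell_type": "markdown",
"id": "sketch-np-stack",
"metadata": {},
"source": [
"The `UserWarning` recorded above suggests converting the list of arrays to a single `numpy.ndarray` first. A minimal sketch of that step with made-up vectors (the names `vectors` and `features` are illustrative and not part of the notebook):\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Three made-up expression vectors standing in for the per-sample arrays.\n",
"vectors = [\n",
"    np.array([20.3, 0.0, 25.2]),\n",
"    np.array([37.0, 0.5, 77.4]),\n",
"    np.array([45.4, 0.1, 75.0]),\n",
"]\n",
"\n",
"# Stack the list into one (n_samples, n_genes) array before building a tensor.\n",
"features = np.stack(vectors).astype(np.float32)\n",
"print(features.shape)\n",
"```\n",
"\n",
"Inside `GenomeDataset`, the same idea would presumably apply to the values of `dataframe['genome_frequencies']`, with `torch.from_numpy(features)` wrapping the stacked array."
]
},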
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "a7fb59af-bd06-42d4-acce-03266a85bf36",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Genome frequency from dataframe:\n",
|
|
"[2.50645e+01 1.12500e-01 5.63997e+01 ... 0.00000e+00 1.29000e-02\n",
|
|
" 2.47100e-01]\n",
|
|
"\n",
|
|
"Cancer type from dataframe: kirc\n",
|
|
"\n",
|
|
"Genome frequency from dataset:\n",
|
|
"tensor([2.5065e+01, 1.1250e-01, 5.6400e+01, ..., 0.0000e+00, 1.2900e-02,\n",
|
|
" 2.4710e-01])\n",
|
|
"\n",
|
|
"Cancer type from dataset: 1\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Inspect the first item from the training dataframe\n",
|
|
"train_df_head = train_df.head(n=1)\n",
|
"train_df_genome_frequence = train_df_head.iloc[0][\"genome_frequencies\"]\n",
|
|
"train_df_cancer_type = train_df_head.iloc[0][\"cancer_type\"]\n",
|
|
"print(f\"Genome frequency from dataframe:{os.linesep}{train_df_genome_frequence}{os.linesep}\")\n",
|
|
"print(f\"Cancer type from dataframe: {train_df_cancer_type}{os.linesep}\")\n",
|
|
"\n",
|
|
"# Inspect the first item from the training dataset\n",
|
|
"datapoint_features, datapoint_label = train_dataset[0]\n",
|
|
"print(f\"Genome frequency from dataset:{os.linesep}{datapoint_features}{os.linesep}\")\n",
|
|
"print(f\"Cancer type from dataset: {datapoint_label}\")"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "418bc6a0-2ddb-4596-87d1-3e670195297c",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
"## Principal component analysis (PCA)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "e6672e50-47e6-48fc-9e1e-cac0f0a606f1",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# X is the dataset: the list of TPM vectors collected while reading the files\n",
"X = rick\n",
"\n",
"# Standardize the data\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"# Create the PCA object\n",
"pca = PCA(n_components=150)  # Keep 150 principal components\n",
"\n",
"# Run the PCA\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"# The resulting principal components\n",
"print(\"Transformed data:\", X_pca)\n",
"\n",
"# Explained variance ratio of each component\n",
"print(\"Variance explained by each component:\", pca.explained_variance_ratio_)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9199fdeb-0d48-44c2-8bec-db2a7d7cbd4d",
|
|
"metadata": {},
|
|
"source": [
|
"# Neural network definition"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e53132b9-6222-4739-be49-7628e5a37709",
|
|
"metadata": {},
|
|
"source": [
|
"### Simple neural network"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "76b8eec8-d24b-4696-82bf-ebb286e7d1e7",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"import torch\n",
"import torch.nn as nn\n",
"import torch.optim as optim\n",
"from torch.utils.data import DataLoader\n",
"\n",
"# Model definition\n",
"class SimpleNN(nn.Module):\n",
"    def __init__(self, input_size, hidden_size, num_classes):\n",
"        super(SimpleNN, self).__init__()\n",
"        self.fc1 = nn.Linear(input_size, hidden_size)\n",
"        self.relu = nn.ReLU()\n",
"        self.fc2 = nn.Linear(hidden_size, num_classes)\n",
"\n",
"    def forward(self, x):\n",
"        out = self.fc1(x)\n",
"        out = self.relu(out)\n",
"        out = self.fc2(out)\n",
"        return out"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "e2e9e0dd-3d4f-4999-9e65-704266d5e4a2",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"source": [
|
"### Complex neural network"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 36,
|
|
"id": "944d463e-12ed-4447-8587-ee9c60ce3eb6",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"import torch\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"\n",
"class ComplexNN(nn.Module):\n",
"    def __init__(self, input_size, hidden_size, num_classes):\n",
"        super(ComplexNN, self).__init__()\n",
"        # Define the layers (note: hidden_size is accepted but not used; the widths are fixed)\n",
"        self.fc1 = nn.Linear(input_size, 1024)  # Input layer\n",
"        self.fc2 = nn.Linear(1024, 512)  # Hidden layer\n",
"        self.fc3 = nn.Linear(512, 256)  # Second hidden layer\n",
"        self.fc4 = nn.Linear(256, num_classes)  # Output layer\n",
"        self.dropout = nn.Dropout(p=0.5)  # Dropout\n",
"\n",
"    def forward(self, x):\n",
"        # Define the forward pass\n",
"        x = F.relu(self.fc1(x))\n",
"        x = self.dropout(x)\n",
"        x = F.relu(self.fc2(x))\n",
"        x = self.dropout(x)\n",
"        x = F.relu(self.fc3(x))\n",
"        # Sigmoid output; alternatively F.log_softmax(x, dim=1) for multi-class classification.\n",
"        # Note: nn.CrossEntropyLoss expects raw logits, so returning self.fc4(x) directly is the usual pairing.\n",
"        x = torch.sigmoid(self.fc4(x))\n",
"        return x"
|
|
]
|
|
},
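{
"cell_type": "markdown",
"id": "sketch-raw-logits",
"metadata": {},
"source": [
"`nn.CrossEntropyLoss` applies a softmax to the scores it receives, so it is normally given raw logits. The pure-Python sketch below (helper names and numbers are illustrative assumptions) suggests what can happen when the scores are first squashed by a sigmoid: saturated outputs give a loss of ln 3 ≈ 1.0986 for three classes, and even an ideal prediction cannot push the loss below roughly 0.55.\n",
"\n",
"```python\n",
"import math\n",
"\n",
"def cross_entropy(scores, target):\n",
"    # Softmax cross-entropy for a single sample, computed from the given scores.\n",
"    log_sum = math.log(sum(math.exp(s) for s in scores))\n",
"    return log_sum - scores[target]\n",
"\n",
"def sigmoid(z):\n",
"    return 1.0 / (1.0 + math.exp(-z))\n",
"\n",
"logits = [50.0, 80.0, 120.0]             # hypothetical large pre-activations\n",
"squashed = [sigmoid(z) for z in logits]  # every value saturates at about 1.0\n",
"\n",
"saturated_loss = cross_entropy(squashed, target=2)\n",
"best_case_loss = cross_entropy([0.0, 0.0, 1.0], target=2)\n",
"print(saturated_loss, math.log(3))\n",
"print(best_case_loss)\n",
"```\n",
"\n",
"This is one possible explanation for a training loss that stays near 1.0986; unscaled input values may contribute as well."
]
},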
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 37,
|
|
"id": "60789428-7d6e-4737-a83a-1138f6a650f7",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"# Assumption: input_size is the length of the expression vectors and num_classes is the number of cancer subtypes\n",
"#model = SimpleNN(input_size=60660, hidden_size=5000, num_classes=3)\n",
"model = ComplexNN(input_size=60660, hidden_size=5000, num_classes=3)\n",
"\n",
"# Data loaders\n",
|
|
"train_loader = DataLoader(dataset=train_dataset, batch_size=64, shuffle=True)\n",
|
|
"valid_loader = DataLoader(dataset=valid_dataset, batch_size=64, shuffle=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 38,
|
|
"id": "de6e81de-0096-443a-a0b6-90cddecf5f88",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
"# Loss function and optimizer\n",
|
|
"criterion = nn.CrossEntropyLoss()\n",
|
|
"optimizer = optim.Adam(model.parameters(), lr=0.001)\n",
|
|
"num_epochs = 70"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 39,
|
|
"id": "a5deb2ed-c685-4d80-bc98-d6dd27334d82",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Epoch [1/70], Trainingsverlust: 1.1040, Validierungsverlust: 1.0986\n",
|
|
"Epoch [2/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [3/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [4/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [5/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [6/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [7/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [8/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [9/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [10/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [11/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [12/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [13/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [14/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [15/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [16/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n",
|
|
"Epoch [17/70], Trainingsverlust: 1.0986, Validierungsverlust: 1.0986\n"
|
|
]
|
|
},
|
|
{
|
|
"ename": "KeyboardInterrupt",
|
|
"evalue": "",
|
|
"output_type": "error",
|
|
"traceback": [
|
|
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
|
|
"\u001b[0;31mKeyboardInterrupt\u001b[0m Traceback (most recent call last)",
|
|
"Cell \u001b[0;32mIn[39], line 13\u001b[0m\n\u001b[1;32m 11\u001b[0m loss \u001b[38;5;241m=\u001b[39m criterion(outputs, labels)\n\u001b[1;32m 12\u001b[0m loss\u001b[38;5;241m.\u001b[39mbackward()\n\u001b[0;32m---> 13\u001b[0m \u001b[43moptimizer\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43mstep\u001b[49m\u001b[43m(\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 14\u001b[0m train_loss \u001b[38;5;241m+\u001b[39m\u001b[38;5;241m=\u001b[39m loss\u001b[38;5;241m.\u001b[39mitem()\n\u001b[1;32m 16\u001b[0m \u001b[38;5;66;03m# Durchschnittlicher Trainingsverlust\u001b[39;00m\n",
|
|
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:280\u001b[0m, in \u001b[0;36mOptimizer.profile_hook_step.<locals>.wrapper\u001b[0;34m(*args, **kwargs)\u001b[0m\n\u001b[1;32m 276\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 277\u001b[0m \u001b[38;5;28;01mraise\u001b[39;00m \u001b[38;5;167;01mRuntimeError\u001b[39;00m(\u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;132;01m{\u001b[39;00mfunc\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m must return None or a tuple of (new_args, new_kwargs),\u001b[39m\u001b[38;5;124m\"\u001b[39m\n\u001b[1;32m 278\u001b[0m \u001b[38;5;124mf\u001b[39m\u001b[38;5;124m\"\u001b[39m\u001b[38;5;124mbut got \u001b[39m\u001b[38;5;132;01m{\u001b[39;00mresult\u001b[38;5;132;01m}\u001b[39;00m\u001b[38;5;124m.\u001b[39m\u001b[38;5;124m\"\u001b[39m)\n\u001b[0;32m--> 280\u001b[0m out \u001b[38;5;241m=\u001b[39m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 281\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_optimizer_step_code()\n\u001b[1;32m 283\u001b[0m \u001b[38;5;66;03m# call optimizer step post hooks\u001b[39;00m\n",
|
|
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/torch/optim/optimizer.py:33\u001b[0m, in \u001b[0;36m_use_grad_for_differentiable.<locals>._use_grad\u001b[0;34m(self, *args, **kwargs)\u001b[0m\n\u001b[1;32m 31\u001b[0m \u001b[38;5;28;01mtry\u001b[39;00m:\n\u001b[1;32m 32\u001b[0m torch\u001b[38;5;241m.\u001b[39mset_grad_enabled(\u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39mdefaults[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mdifferentiable\u001b[39m\u001b[38;5;124m'\u001b[39m])\n\u001b[0;32m---> 33\u001b[0m ret \u001b[38;5;241m=\u001b[39m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43margs\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[38;5;241;43m*\u001b[39;49m\u001b[43mkwargs\u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 34\u001b[0m \u001b[38;5;28;01mfinally\u001b[39;00m:\n\u001b[1;32m 35\u001b[0m torch\u001b[38;5;241m.\u001b[39mset_grad_enabled(prev_grad)\n",
|
|
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/torch/optim/adam.py:141\u001b[0m, in \u001b[0;36mAdam.step\u001b[0;34m(self, closure)\u001b[0m\n\u001b[1;32m 130\u001b[0m beta1, beta2 \u001b[38;5;241m=\u001b[39m group[\u001b[38;5;124m'\u001b[39m\u001b[38;5;124mbetas\u001b[39m\u001b[38;5;124m'\u001b[39m]\n\u001b[1;32m 132\u001b[0m \u001b[38;5;28mself\u001b[39m\u001b[38;5;241m.\u001b[39m_init_group(\n\u001b[1;32m 133\u001b[0m group,\n\u001b[1;32m 134\u001b[0m params_with_grad,\n\u001b[0;32m (...)\u001b[0m\n\u001b[1;32m 138\u001b[0m max_exp_avg_sqs,\n\u001b[1;32m 139\u001b[0m state_steps)\n\u001b[0;32m--> 141\u001b[0m \u001b[43madam\u001b[49m\u001b[43m(\u001b[49m\n\u001b[1;32m 142\u001b[0m \u001b[43m \u001b[49m\u001b[43mparams_with_grad\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 143\u001b[0m \u001b[43m \u001b[49m\u001b[43mgrads\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 144\u001b[0m \u001b[43m \u001b[49m\u001b[43mexp_avgs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 145\u001b[0m \u001b[43m \u001b[49m\u001b[43mexp_avg_sqs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 146\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_exp_avg_sqs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 147\u001b[0m \u001b[43m \u001b[49m\u001b[43mstate_steps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 148\u001b[0m \u001b[43m \u001b[49m\u001b[43mamsgrad\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mamsgrad\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 149\u001b[0m \u001b[43m \u001b[49m\u001b[43mbeta1\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbeta1\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 150\u001b[0m \u001b[43m \u001b[49m\u001b[43mbeta2\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbeta2\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 151\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mlr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mlr\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 152\u001b[0m \u001b[43m \u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mweight_decay\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 153\u001b[0m \u001b[43m \u001b[49m\u001b[43meps\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43meps\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 154\u001b[0m \u001b[43m \u001b[49m\u001b[43mmaximize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mmaximize\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 155\u001b[0m \u001b[43m \u001b[49m\u001b[43mforeach\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mforeach\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 156\u001b[0m \u001b[43m \u001b[49m\u001b[43mcapturable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mcapturable\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 157\u001b[0m \u001b[43m 
\u001b[49m\u001b[43mdifferentiable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mdifferentiable\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 158\u001b[0m \u001b[43m \u001b[49m\u001b[43mfused\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgroup\u001b[49m\u001b[43m[\u001b[49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[38;5;124;43mfused\u001b[39;49m\u001b[38;5;124;43m'\u001b[39;49m\u001b[43m]\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 159\u001b[0m \u001b[43m \u001b[49m\u001b[43mgrad_scale\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mgrad_scale\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 160\u001b[0m \u001b[43m \u001b[49m\u001b[43mfound_inf\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;28;43mgetattr\u001b[39;49m\u001b[43m(\u001b[49m\u001b[38;5;28;43mself\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[38;5;124;43mfound_inf\u001b[39;49m\u001b[38;5;124;43m\"\u001b[39;49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[38;5;28;43;01mNone\u001b[39;49;00m\u001b[43m)\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 161\u001b[0m \u001b[43m \u001b[49m\u001b[43m)\u001b[49m\n\u001b[1;32m 163\u001b[0m \u001b[38;5;28;01mreturn\u001b[39;00m loss\n",
|
|
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/torch/optim/adam.py:281\u001b[0m, in \u001b[0;36madam\u001b[0;34m(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, foreach, capturable, differentiable, fused, grad_scale, found_inf, amsgrad, beta1, beta2, lr, weight_decay, eps, maximize)\u001b[0m\n\u001b[1;32m 278\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 279\u001b[0m func \u001b[38;5;241m=\u001b[39m _single_tensor_adam\n\u001b[0;32m--> 281\u001b[0m \u001b[43mfunc\u001b[49m\u001b[43m(\u001b[49m\u001b[43mparams\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 282\u001b[0m \u001b[43m \u001b[49m\u001b[43mgrads\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 283\u001b[0m \u001b[43m \u001b[49m\u001b[43mexp_avgs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 284\u001b[0m \u001b[43m \u001b[49m\u001b[43mexp_avg_sqs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 285\u001b[0m \u001b[43m \u001b[49m\u001b[43mmax_exp_avg_sqs\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 286\u001b[0m \u001b[43m \u001b[49m\u001b[43mstate_steps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 287\u001b[0m \u001b[43m \u001b[49m\u001b[43mamsgrad\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mamsgrad\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 288\u001b[0m \u001b[43m \u001b[49m\u001b[43mbeta1\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbeta1\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 289\u001b[0m \u001b[43m \u001b[49m\u001b[43mbeta2\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mbeta2\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 290\u001b[0m \u001b[43m \u001b[49m\u001b[43mlr\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mlr\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 291\u001b[0m \u001b[43m \u001b[49m\u001b[43mweight_decay\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mweight_decay\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 292\u001b[0m \u001b[43m 
\u001b[49m\u001b[43meps\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43meps\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 293\u001b[0m \u001b[43m \u001b[49m\u001b[43mmaximize\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mmaximize\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 294\u001b[0m \u001b[43m \u001b[49m\u001b[43mcapturable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mcapturable\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 295\u001b[0m \u001b[43m \u001b[49m\u001b[43mdifferentiable\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mdifferentiable\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 296\u001b[0m \u001b[43m \u001b[49m\u001b[43mgrad_scale\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mgrad_scale\u001b[49m\u001b[43m,\u001b[49m\n\u001b[1;32m 297\u001b[0m \u001b[43m \u001b[49m\u001b[43mfound_inf\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[43mfound_inf\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"File \u001b[0;32m/opt/conda/lib/python3.10/site-packages/torch/optim/adam.py:393\u001b[0m, in \u001b[0;36m_single_tensor_adam\u001b[0;34m(params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, grad_scale, found_inf, amsgrad, beta1, beta2, lr, weight_decay, eps, maximize, capturable, differentiable)\u001b[0m\n\u001b[1;32m 390\u001b[0m \u001b[38;5;28;01melse\u001b[39;00m:\n\u001b[1;32m 391\u001b[0m denom \u001b[38;5;241m=\u001b[39m (exp_avg_sq\u001b[38;5;241m.\u001b[39msqrt() \u001b[38;5;241m/\u001b[39m bias_correction2_sqrt)\u001b[38;5;241m.\u001b[39madd_(eps)\n\u001b[0;32m--> 393\u001b[0m \u001b[43mparam\u001b[49m\u001b[38;5;241;43m.\u001b[39;49m\u001b[43maddcdiv_\u001b[49m\u001b[43m(\u001b[49m\u001b[43mexp_avg\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mdenom\u001b[49m\u001b[43m,\u001b[49m\u001b[43m \u001b[49m\u001b[43mvalue\u001b[49m\u001b[38;5;241;43m=\u001b[39;49m\u001b[38;5;241;43m-\u001b[39;49m\u001b[43mstep_size\u001b[49m\u001b[43m)\u001b[49m\n",
|
|
"\u001b[0;31mKeyboardInterrupt\u001b[0m: "
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Lists for storing the per-epoch losses\n",
"train_losses = []\n",
"valid_losses = []\n",
"\n",
"for epoch in range(num_epochs):\n",
"    # Training phase\n",
"    model.train()\n",
"    train_loss = 0.0\n",
"    for inputs, labels in train_loader:\n",
"        optimizer.zero_grad()\n",
"        outputs = model(inputs)\n",
"        loss = criterion(outputs, labels)\n",
"        loss.backward()\n",
"        optimizer.step()\n",
"        train_loss += loss.item()\n",
"\n",
"    # Average training loss over all batches\n",
"    train_loss /= len(train_loader)\n",
"    train_losses.append(train_loss)\n",
"\n",
"    # Validation phase (no gradient tracking)\n",
"    model.eval()\n",
"    valid_loss = 0.0\n",
"    with torch.no_grad():\n",
"        for inputs, labels in valid_loader:\n",
"            outputs = model(inputs)\n",
"            loss = criterion(outputs, labels)\n",
"            valid_loss += loss.item()\n",
"\n",
"    # Average validation loss over all batches\n",
"    valid_loss /= len(valid_loader)\n",
"    valid_losses.append(valid_loss)\n",
"\n",
"    print(f'Epoch [{epoch+1}/{num_epochs}], training loss: {train_loss:.4f}, validation loss: {valid_loss:.4f}')"
|
|
]
|
|
},
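{
"cell_type": "markdown",
"id": "sketch-early-stopping-note",
"metadata": {},
"source": [
"The training loop above records one training loss and one validation loss per epoch. The next cell is a minimal sketch of how such a validation-loss history could drive patience-based early stopping. It is an illustration only and not part of the original training code: the helper `early_stop_epoch`, its `patience` parameter, and the `demo_losses` values are hypothetical."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-early-stopping-code",
"metadata": {},
"outputs": [],
"source": [
"# Sketch (assumption): patience-based early stopping on a list of validation losses.\n",
"def early_stop_epoch(losses, patience=3):\n",
"    # Return the 0-based epoch at which training would stop, or None if it never stops.\n",
"    best = float('inf')\n",
"    epochs_without_improvement = 0\n",
"    for epoch, loss in enumerate(losses):\n",
"        if loss < best:\n",
"            # New best validation loss: reset the counter\n",
"            best = loss\n",
"            epochs_without_improvement = 0\n",
"        else:\n",
"            epochs_without_improvement += 1\n",
"            if epochs_without_improvement >= patience:\n",
"                return epoch\n",
"    return None\n",
"\n",
"# Hypothetical loss history; in the notebook one would pass valid_losses instead\n",
"demo_losses = [0.9, 0.7, 0.6, 0.65, 0.62, 0.61, 0.5]\n",
"stop_epoch = early_stop_epoch(demo_losses, patience=3)\n",
"print('Stop at epoch index:', stop_epoch)"
]
},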
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "baf1caa8-d3d9-48e8-9339-81194521528d",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [],
|
|
"source": [
|
|
"import matplotlib.pyplot as plt\n",
|
|
"\n",
|
|
"plt.plot(train_losses, label='Training loss')\n",
"plt.plot(valid_losses, label='Validation loss')\n",
"plt.xlabel('Epoch')\n",
"plt.ylabel('Loss')\n",
"plt.title('Training and validation loss over time')\n",
|
|
"plt.legend()\n",
|
|
"plt.show()\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "8e339354-a7cc-4e8a-9323-4be41ef62117",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"import pickle\n",
"\n",
"# Load the data stored in 'rick.pickle'\n",
"with open('rick.pickle', 'rb') as f:\n",
"    rick = pickle.load(f)\n"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "be10a487-728e-4953-a081-9103d485378c",
|
|
"metadata": {},
|
|
"source": [
|
|
"## Principal Component Analysis (PCA)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "088db0b3-8c33-41ff-a543-1b1e50c5e589",
|
|
"metadata": {
|
|
"tags": []
|
|
},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Transformed data: [[-6.02552113e+01 4.57642675e+01 1.11957079e+02 ... 2.58331825e+00\n",
|
|
" 9.99342571e-01 -2.77477317e-01]\n",
|
|
" [-1.64705386e+01 9.03712725e+00 1.04837673e+01 ... 4.06859167e+00\n",
|
|
" 2.01083350e+00 1.49404086e+00]\n",
|
|
" [ 7.52348753e+00 -1.55853934e+01 -4.76301782e+01 ... -7.87604764e+00\n",
|
|
" -7.56801224e-02 8.37028680e+00]\n",
|
|
" ...\n",
|
|
" [-2.72012678e+01 4.44526098e+00 2.60063820e+01 ... 3.08321694e-01\n",
|
|
" 2.28939485e+00 -7.14920382e+00]\n",
|
|
" [-3.48027066e+01 2.27021639e+01 5.51486742e+01 ... -1.77955416e+01\n",
|
|
" 6.24722406e+00 2.32101665e+01]\n",
|
|
" [-3.98223613e+01 1.88534866e+01 5.32794498e+01 ... -1.45806809e+00\n",
|
|
" 1.18270903e+01 -2.84291311e+00]]\n",
|
|
"Explained variance ratio per component: [0.15056597 0.0997506 0.06070173 0.03658789 0.03530275 0.0263503\n",
|
|
" 0.02322747 0.01705354 0.01534278 0.01281486 0.01116959 0.0107472\n",
|
|
" 0.00989894 0.00906208 0.00871621 0.00813403 0.0074718 0.00708769\n",
|
|
" 0.00667045 0.00633275 0.00579241 0.00556758 0.00532382 0.00519289\n",
|
|
" 0.00476404 0.00472014 0.00457837 0.00414668 0.00399478 0.00380604\n",
|
|
" 0.00362433 0.00349278 0.00336446 0.00323228 0.00310834 0.00300595\n",
|
|
" 0.00297408 0.00285178 0.00280688 0.00273987 0.00268256 0.00263102\n",
|
|
" 0.00250513 0.00248987 0.0024505 0.0023979 0.00235971 0.00218554\n",
|
|
" 0.00217143 0.00212775 0.00210793 0.00205678 0.00202224 0.00200579\n",
|
|
" 0.00194754 0.00189606 0.00187714 0.00184969 0.00180133 0.00178537\n",
|
|
" 0.00176576 0.00172542 0.00168211 0.00167483 0.00162565 0.00159444\n",
|
|
" 0.00158667 0.00155982 0.00155534 0.00151929 0.00149558 0.00147549\n",
|
|
" 0.00146982 0.00146262 0.00143338 0.00142085 0.00140628 0.00139744\n",
|
|
" 0.00136563 0.00136169 0.00134972 0.00132027 0.00129168 0.00127963\n",
|
|
" 0.00126629 0.0012562 0.00123608 0.00122899 0.0012035 0.0011899\n",
|
|
" 0.00118094 0.00117162 0.00116552 0.00114295 0.00112631 0.00111896\n",
|
|
" 0.00110193 0.00109004 0.00108523 0.00106574 0.00106381 0.001051\n",
|
|
" 0.00104179 0.00103669 0.00103248 0.00101669 0.00100527 0.00099315\n",
|
|
" 0.00097478 0.00096486 0.00096244 0.00094792 0.00094463 0.00093107\n",
|
|
" 0.00092485 0.00090851 0.00089848 0.00089134 0.00087855 0.00087068\n",
|
|
" 0.00086397 0.00085563 0.00084342 0.00083406 0.00083064 0.00081791\n",
|
|
" 0.00080368 0.00080183 0.00079167 0.00079072 0.00078868 0.00078028\n",
|
|
" 0.00077115 0.00076662 0.00076043 0.00075196 0.0007447 0.0007332\n",
|
|
" 0.0007252 0.00072345 0.00071902 0.00070594 0.00070125 0.00069603\n",
|
|
" 0.00069029 0.00068619 0.00068012 0.00067224 0.00066615 0.00066017]\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"import numpy as np\n",
|
|
"from sklearn.decomposition import PCA\n",
|
|
"from sklearn.preprocessing import StandardScaler\n",
|
|
"\n",
|
|
"# Use the data loaded from 'rick.pickle' as the feature matrix X\n",
"X = rick\n",
"\n",
"# Standardize each feature to zero mean and unit variance\n",
"scaler = StandardScaler()\n",
"X_scaled = scaler.fit_transform(X)\n",
"\n",
"# Create the PCA object, keeping the first 150 principal components\n",
"pca = PCA(n_components=150)\n",
"\n",
"# Fit the PCA and project the data onto the principal components\n",
"X_pca = pca.fit_transform(X_scaled)\n",
"\n",
"# The data expressed in principal-component coordinates\n",
"print(\"Transformed data:\", X_pca)\n",
"\n",
"# Fraction of the total variance explained by each component\n",
"print(\"Explained variance ratio per component:\", pca.explained_variance_ratio_)\n"
|
|
]
|
|
},
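{
"cell_type": "markdown",
"id": "sketch-pca-variance-note",
"metadata": {},
"source": [
"The PCA cell above keeps a fixed number of 150 components. One common alternative is to keep the smallest number of components whose cumulative explained variance reaches a chosen threshold. The next cell is a minimal sketch of that idea on synthetic data. The names `X_demo`, `threshold`, and `n_keep`, as well as the 90% threshold, are assumptions for illustration and are not taken from the original analysis."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "sketch-pca-variance-code",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn.decomposition import PCA\n",
"from sklearn.preprocessing import StandardScaler\n",
"\n",
"# Synthetic stand-in for the expression matrix (assumption: 200 samples, 50 features)\n",
"rng = np.random.default_rng(0)\n",
"X_demo = rng.normal(size=(200, 50))\n",
"X_demo_scaled = StandardScaler().fit_transform(X_demo)\n",
"\n",
"# Fit a full PCA and accumulate the explained variance ratios\n",
"pca_full = PCA().fit(X_demo_scaled)\n",
"cumulative = np.cumsum(pca_full.explained_variance_ratio_)\n",
"\n",
"# Smallest number of components whose cumulative ratio reaches the threshold\n",
"threshold = 0.90  # hypothetical choice\n",
"n_keep = int(np.searchsorted(cumulative, threshold) + 1)\n",
"print(f'Components needed for {threshold:.0%} of the variance: {n_keep}')"
]
},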
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": null,
|
|
"id": "b11bbe20-0494-4e7a-83ff-3cb0bfa82f3b",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": []
|
|
}
|
|
],
|
|
"metadata": {
|
|
"kernelspec": {
|
|
"display_name": "Python 3 (ipykernel)",
|
|
"language": "python",
|
|
"name": "python3"
|
|
},
|
|
"language_info": {
|
|
"codemirror_mode": {
|
|
"name": "ipython",
|
|
"version": 3
|
|
},
|
|
"file_extension": ".py",
|
|
"mimetype": "text/x-python",
|
|
"name": "python",
|
|
"nbconvert_exporter": "python",
|
|
"pygments_lexer": "ipython3",
|
|
"version": "3.10.10"
|
|
}
|
|
},
|
|
"nbformat": 4,
|
|
"nbformat_minor": 5
|
|
}
|