# **[Project] Cancer Subtype Classification**

# Introduction

The [TCGA Kidney Cancers Dataset](https://archive.ics.uci.edu/dataset/892/tcga+kidney+cancers) is a bulk RNA-seq dataset that contains transcriptome profiles (i.e., gene expression quantification data) of patients diagnosed with three different subtypes of kidney cancers.
This dataset can be used to predict the specific kidney cancer subtype from the normalized transcriptome profile data.

The normalized transcriptome profile data is given as **TPM** and **FPKM** for each gene.

> TPM (Transcripts Per Million) and FPKM (Fragments Per Kilobase Million) are two common methods for quantifying gene expression in RNA sequencing data.
> They both aim to account for the differences in sequencing depth and transcript length when estimating gene expression levels.
>
> **TPM** (Transcripts Per Million):
> - TPM is a measure of gene expression that normalizes for both library size (sequencing depth) and transcript length.
> - The main idea behind TPM is to express the abundance of a transcript relative to the total number of transcripts in a sample, scaled to one million.
>
> **FPKM** (Fragments Per Kilobase Million):
> - FPKM is another method for quantifying gene expression, which is commonly used in older RNA-seq analysis pipelines. It's similar in concept to TPM but differs in the way it's calculated.
> - FPKM also normalizes for library size and transcript length, but it measures gene expression as the number of fragments (for paired-end data, one fragment corresponds to one read pair) per kilobase of exon model per million mapped fragments.
>
> Because TPM values sum to one million in every sample, TPM is generally considered easier to compare across samples than FPKM, making it a preferred choice in many modern RNA-seq analysis workflows.
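
To make the two definitions above concrete, here is a minimal sketch on toy numbers. The counts and transcript lengths are made up for illustration, and the helper names `fpkm` and `tpm` are hypothetical; real pipelines may differ in details such as which fragments are counted and how the gene length is defined.

```python
import numpy as np

# Hypothetical toy data: fragment counts and transcript lengths (in bases) for three genes.
counts = np.array([100.0, 200.0, 300.0])
lengths_bp = np.array([1000.0, 2000.0, 500.0])


def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million counted fragments."""
    per_million = counts.sum() / 1e6          # library size in millions of fragments
    return counts / per_million / (lengths_bp / 1e3)


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to one million."""
    rate = counts / (lengths_bp / 1e3)        # fragments per kilobase
    return rate / rate.sum() * 1e6


fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)

print("FPKM:", fpkm_values)
print("TPM: ", tpm_values)
print("TPM total:", tpm_values.sum())
```

In this sketch the TPM values always sum to one million, whereas the FPKM total depends on the sample. TPM can also be obtained by rescaling the FPKM values so that they sum to one million.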

We provide one dataset for each kidney cancer subtype:

- [TCGA-KICH](https://portal.gdc.cancer.gov/projects/TCGA-KICH): kidney chromophobe (chromophobe renal cell carcinoma)
- [TCGA-KIRC](https://portal.gdc.cancer.gov/projects/TCGA-KIRC): kidney renal clear cell carcinoma
- [TCGA-KIRP](https://portal.gdc.cancer.gov/projects/TCGA-KIRP): kidney renal papillary cell carcinoma

> This and _much_ more data is openly available on the [NCI Genomic Data Commons (GDC) Data Portal](https://portal.gdc.cancer.gov/).

# Data access

There are two ways to access the data: via the TNT homepage or the GDC Data Portal.

## Download from the TNT homepage (_recommended_)

The download from the TNT homepage is straightforward:

In [1]:
! wget http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-cancer-classification.tar.gz
! tar -xzvf project-cancer-classification.tar.gz
! mv -v project-cancer-classification/ data/
! rm -v project-cancer-classification.tar.gz

--2023-12-14 13:25:57--  http://www.tnt.uni-hannover.de/edu/vorlesungen/AMLG/data/project-cancer-classification.tar.gz
Resolving www.tnt.uni-hannover.de (www.tnt.uni-hannover.de)... 130.75.31.71
Connecting to www.tnt.uni-hannover.de (www.tnt.uni-hannover.de)|130.75.31.71|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1113432357 (1.0G) [application/x-gzip]
Saving to: ‘project-cancer-classification.tar.gz’


2023-12-14 13:26:19 (52.5 MB/s) - ‘project-cancer-classification.tar.gz’ saved [1113432357/1113432357]

project-cancer-classification/
project-cancer-classification/gdc-client_v1.6.1_Ubuntu_x64.zip
project-cancer-classification/tcga-kich-geq/
project-cancer-classification/tcga-kich-geq/0e3f97a7-63b5-4d66-bc64-1cfba1a7c32e/
project-cancer-classification/tcga-kich-geq/0e3f97a7-63b5-4d66-bc64-1cfba1a7c32e/2b3f591a-b826-4a4e-999e-1cf50172e56d.rna_seq.augmented_star_gene_counts.tsv
project-cancer-classification/tcga-kich-geq/0e3f97a7-63b5-4d66-bc64-1cfba1a7c32e/l

In the `data/` folder you will now find many files in the [TSV format](https://en.wikipedia.org/wiki/Tab-separated_values) ([CSV](https://en.wikipedia.org/wiki/Comma-separated_values)-like with tabs as delimiter) containing the normalized transcriptome profile data.
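
As a quick sanity check after extraction, the number of TSV files per subtype can be counted. This is a minimal sketch: the function name `count_tsv_files` is hypothetical, and the folder pattern `tcga-<subtype>-geq` is assumed from the archive listing above.

```python
from pathlib import Path


def count_tsv_files(data_dir="data", subtypes=("kich", "kirc", "kirp")):
    """Return {subtype: number of .tsv files} found under data_dir/tcga-<subtype>-geq."""
    counts = {}
    for subtype in subtypes:
        folder = Path(data_dir) / f"tcga-{subtype}-geq"
        # A missing folder simply counts as zero files.
        if folder.is_dir():
            counts[subtype] = sum(1 for _ in folder.rglob("*.tsv"))
        else:
            counts[subtype] = 0
    return counts


if __name__ == "__main__":
    print(count_tsv_files())
```

The per-subtype counts also show how balanced the classes are, which is worth knowing before training a classifier.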

To start, you can read a TSV file into a [pandas](https://pandas.pydata.org) [`DataFrame`](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) using the [`pandas.read_csv()`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html#pandas-read-csv) function with the `sep` parameter set to `\t`:

In [60]:
import os
import pickle

import numpy as np
import pandas as pd

# Cancer subtypes; each one has its own folder ./data/tcga-<label>-geq
labels = ["kirp", "kirc", "kich"]

rick = list()
for l in labels:
    root_folder = f"./data/tcga-{l}-geq"
    for root, dirs, files in os.walk(root_folder):
        for file in files:
            if file.endswith('.tsv'):
                # Full path to the file
                file_path = os.path.join(root, file)
                # Read the TSV file; header=1 means the column names are on the second line
                df = pd.read_csv(filepath_or_buffer=file_path, sep="\t", header=1)
                # Keep only the TPM column
                df = df['tpm_unstranded']
                # Drop the first four rows (STAR summary rows such as N_unmapped, not genes)
                df = df[4:]
                df = np.array(df)
                rick.append(df)


# Example: read and inspect a single file
#tsv_file_path = "data/tcga-kich-geq/0ba21ef5-0829-422e-a674-d3817498c333/4868e8fc-e045-475a-a81d-ef43eabb7066.rna_seq.augmented_star_gene_counts.tsv"

# Read the TSV file into a DataFrame
#df = pd.read_csv(filepath_or_buffer=tsv_file_path, sep="\t", header=1)

# Display the first few rows of the DataFrame
#print(df.head(n=20))

# Combine all samples into one array (one row per sample, one column per gene)
rick = np.array(rick)
print(rick)

# Save the array to a pickle file
with open('rick.pickle', 'wb') as f:
    pickle.dump(rick, f)


ValueError: all the input array dimensions for the concatenation axis must match exactly, but along dimension 1, the array at index 0 has size 1 and the array at index 1 has size 60660

In [None]:
# Load the array from the pickle file
with open('rick.pickle', 'rb') as f:
    rick = pickle.load(f)
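
For classification, each sample also needs its subtype label. The following is a minimal sketch that collects the TPM vectors together with a label array. The function name `load_tpm_matrix` is hypothetical, and the file layout (column names on the second line, four non-gene rows before the gene entries, a `tpm_unstranded` column) is assumed from the loading cell above.

```python
import os

import numpy as np
import pandas as pd


def load_tpm_matrix(data_dir="data", subtypes=("kich", "kirc", "kirp")):
    """Return (X, y): X has one row per sample, y holds the subtype of each row."""
    features, labels = [], []
    for subtype in subtypes:
        root_folder = os.path.join(data_dir, f"tcga-{subtype}-geq")
        # Sorting makes the sample order reproducible across runs.
        for root, _dirs, files in sorted(os.walk(root_folder)):
            for name in sorted(files):
                if not name.endswith(".tsv"):
                    continue
                df = pd.read_csv(os.path.join(root, name), sep="\t", header=1)
                # Assumed layout: the first four rows are not gene entries.
                tpm = df["tpm_unstranded"].iloc[4:].to_numpy(dtype=float)
                features.append(tpm)
                labels.append(subtype)
    if not features:
        raise ValueError(f"no .tsv files found under {data_dir!r}")
    if len({len(f) for f in features}) != 1:
        raise ValueError("samples have different numbers of genes")
    return np.vstack(features), np.array(labels)
```

A call such as `X, y = load_tpm_matrix("data")` would then yield a feature matrix and a matching label vector. Checking that all samples have the same length before stacking gives a clearer error message than a failed array construction.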


In [53]:
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# X is the data set (one row per sample, one column per gene);
# `kirp` must already be defined in the notebook session
X = kirp

print(X)

# Standardize the data (zero mean, unit variance per gene)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Create the PCA object, keeping 150 principal components
pca = PCA(n_components=150)

# Run the PCA
X_pca = pca.fit_transform(X_scaled)

# The resulting principal components ("Transformierte Daten" = transformed data)
print("Transformierte Daten:", X_pca)

# Explained variance ratio of each component
print("Varianz erklärt durch jede Komponente:", pca.explained_variance_ratio_)


[[2.03310e+01 0.00000e+00 2.51806e+01 ... 0.00000e+00 0.00000e+00
  1.43200e-01]
 [3.70405e+01 5.00200e-01 7.74246e+01 ... 0.00000e+00 1.92000e-02
  9.15500e-01]
 [4.54456e+01 9.03000e-02 7.49545e+01 ... 0.00000e+00 4.85000e-02
  7.54000e-01]
 ...
 [4.00416e+01 4.67600e-01 5.27965e+01 ... 0.00000e+00 5.98000e-02
  1.71170e+00]
 [3.78835e+01 1.42560e+00 6.00608e+01 ... 0.00000e+00 1.56000e-02
  1.25250e+00]
 [4.08749e+01 0.00000e+00 6.17930e+01 ... 0.00000e+00 1.41000e-02
  1.21190e+00]]
Transformierte Daten: [[-1.19442877e+02  6.73161120e+01  7.25198401e+00 ...  4.14152129e+00
   4.59599544e+00  5.09706886e+00]
 [-1.20146010e+01  1.00991944e+00 -1.46544493e+01 ...  5.30832262e-01
  -8.17291005e+00 -4.58423143e+00]
 [ 5.36208095e+01 -4.94641674e+01 -6.31142039e+00 ... -9.77985446e+00
   5.76202141e-02 -6.98175867e+00]
 ...
 [ 7.68526642e+01 -1.67290906e+01 -3.56488589e+01 ...  3.92227734e+00
   1.61191707e+00 -2.11556078e+00]
 [ 2.08641623e+01 -6.02519312e+00 -2.06334035e+01 ... -3.2248