Friendly Python

Analysis of GEO(Gene Expression Omnibus) data using python

June 10, 2018, 11:10 a.m.

    Today I am going to show how to upload data from GEO database and perform statistical analysis on it. GEO is a public functional genomics data repository supporting MIAME-compliant data submissions. https://www.ncbi.nlm.nih.gov/geo/ For this example I will use data from publication. The oncogene EVI1 enhances transcriptional and biological responses of human myeloid cells to all-trans retinoic acid. GEO accession number: GDS5614. To upload data I use a GEOparse package. Which you can install using pip (pip install GEOparse). All documentation you can find in this site.

import GEOparse # Python package to upload a geo data
import pandas as pd
import numpy as np
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# Uploading a dataset from GEO database

gse = GEOparse.get_GEO(geo="GDS5614", destdir="./")
# All metadata related to experiment

# Description about an experiment

gse.metadata['description'] 

# Title of an experiment

gse.metadata['title'] 

# Value type of an experiment.

gse.metadata['value_type'] 

# Experimental data

data = gse.table

# Labeling of columns

columns = gse.columns

# For this analysis we would only use Control and All trans retinoic acid(ATRA) treated samples data.

analysis_data.columns = ["ATRA", "ATRA","ATRA", "C","C","C"]
analysis_data.index = gse.table.iloc[:,1].values
analysis_data = gse.table.iloc[:,8:]

# We also perform a statistical analysis

# T-test

analysis_data["Ttest"] = stats.ttest_ind(analysis_data.T.iloc[0:3, :],analysis_data.T.iloc[3:6, :], equal_var=True, nan_policy="omit")[1]

# Oneway anova

analysis_data["ANOVA_one"] = [stats.f_oneway(analysis_data.T.iloc[0:3, x],analysis_data.T.iloc[3:6, x])[1] for x in range(analysis_data.shape[0])]

# Here we check for downregulation of some known genes.

print(analysis_data.T['MYC'])
print(analysis_data.T['TLR2'])

# Returns

analysis_data.T['MYC']
ATRA         12.210000
ATRA         12.120000
ATRA         12.050000
C            12.760000
C            12.740000
C            12.820000
Ttest         0.000244
ANOVA_one     0.000244
Name: MYC, dtype: float64


ATRA          9.72000
ATRA          9.76000
ATRA          9.70000
C            10.12000
C            10.01000
C            10.02000
Ttest         0.00119
ANOVA_one     0.00119
Name: TLR2, dtype: float64

# We confirm that MYC and TLR2 genes are downregulated.

# To observe weather some genes were upregulated or downregulated.

# We transform data.

analysis_data_d = pd.DataFrame()
analysis_data_d['C'] = analysis_data.T.iloc[3:6,:].mean()
analysis_data_d['ATRA'] = analysis_data.T.iloc[0:3,:].mean()
analysis_data_d['Fold'] = analysis_data_d['Diff']/analysis_data_d['C']
analysis_data_d['Diff'] = analysis_data_d['ATRA'] - analysis_data_d['C']

# Now we plot the differences.


# Plotting differences

diff_up = analysis_data_d[analysis_data_d['Fold'] >= 0.5]
diff_down = analysis_data_d[analysis_data_d['Fold'] <= -0.2]
diff = pd.concat([diff_up,diff_down])
diff_vals = diff['Fold'].sort_values()
counter = np.arange(len(diff_vals.values))

fig, ax = plt.subplots(figsize = (20,10))
ax.bar(counter,diff_vals.values, width=0.5)
ax.set_xticks(counter)
ax.set_xticklabels(diff_vals.index.values, rotation=90)
ax.set_title("Gene expression differences of Control vs ATRA treated HL-60 cells fold difference to Control")
ax.set_ylabel("fold difference")
plt.show()


# Plotting data in log 2 scale

analysis_data_d['log2_ATRA'] = analysis_data_d['ATRA'].apply(np.log2)
analysis_data_d['log2_C'] = analysis_data_d['C'].apply(np.log2)
analysis_data_d['log2_diff'] = analysis_data_d['log2_ATRA'] - analysis_data_d['log2_C']

diff_up = analysis_data_d[analysis_data_d['log2_diff'] >= 0.5]
diff_down = analysis_data_d[analysis_data_d['log2_diff'] <= -0.5]
diff = pd.concat([diff_up,diff_down])
diff_vals = diff['log2_diff'].sort_values()
counter = np.arange(len(diff_vals.values))

fig, ax = plt.subplots(figsize = (20,10))
ax.bar(counter,diff_vals.values, width=0.5)
ax.set_xticks(counter)
ax.set_xticklabels(diff_vals.index.values, rotation=90)
ax.set_title("Gene expression differences of Control vs ATRA treated HL-60 cells")
ax.set_ylabel("log2 difference")
plt.show()