Disease-Symptom Data Cleaner and Flattener

This repository provides a Python script for cleaning, normalizing, and flattening disease-symptom datasets Unified Medical Language System (UMLS) specific https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html. It is designed to process medical data from Excel files (especially those with composite codes separated by ^), remove unwanted junk characters, and output a clean, analysis-ready CSV.

Features

Directly loads medical data from a GitHub-hosted Excel file
Fills missing disease names using last valid entry (forward fill)
Removes unnecessary columns (e.g., occurrence counts)
Eliminates rows with missing symptoms
Handles composite disease codes: splits codes joined by ^ into separate records
Removes junk characters (e.g., Â) from disease and symptom columns
Exports a clean, flattened CSV ready for ML or analytics

Input

"raw_data.xlsx" The script loads raw data which is obtained in .xlsx format from https://people.dbmi.columbia.edu/~friedma/Projects/DiseaseSymptomKB/index.html from below github link:

https://raw.githubusercontent.com/anujdutt9/Disease-Prediction-from-Symptoms/master/notebook/dataset/raw_data.xlsx

Output

A cleaned and flattened CSV named:

flattened_url.csv

Each row contains a single disease code and its associated symptoms.

Usage

Clone this repository or copy the script into your project.
Ensure you have Python 3.9+ and the following packages installed:
```
pip install pandas
```
Run the script:
```
python flatten.py
```
Result: You’ll get a flattened_url.csv file in your working directory.

Example

Input disease string:

UMLS:C0376358_malignant neoplasm of prostate^UMLS:C0600139_carcinoma prostate^UMLS:C0600159_carcinoma digitailis

Becomes three rows:

UMLS:C0376358_malignant neoplasm of prostate, symptom_1, symptom_2, ...
UMLS:C0600139_carcinoma prostate, symptom_1, symptom_2, ...
UMLS:C0600159_carcinoma digitailis, symptom_1, symptom_2, ...

Full Script

import pandas as pd
import re

# Download Excel file directly from GitHub RAW link
file_url = 'https://raw.githubusercontent.com/anujdutt9/Disease-Prediction-from-Symptoms/master/notebook/dataset/raw_data.xlsx'
df = pd.read_excel(file_url)

# Fill down Disease names
df['Disease'] = df['Disease'].ffill()

# Drop 'Count of Disease Occurrence' if exists
if 'Count of Disease Occurrence' in df.columns:
    df = df.drop(columns=['Count of Disease Occurrence'])

# Remove rows where Symptom is NaN
df = df[df['Symptom'].notna()]

# Replace '^' with ',' in symptoms
df['Symptom'] = df['Symptom'].str.replace('^', ',', regex=False)

# --- Remove junk symbols like Â etc. from Disease and Symptom ---
def clean_text(text):
    # Remove any character that is not printable ASCII or common punctuation/space
    return re.sub(r'[^\x20-\x7E]', '', str(text))

df['Disease'] = df['Disease'].apply(clean_text)
df['Symptom'] = df['Symptom'].apply(clean_text)

# Prepare new rows for disease splits
expanded_rows = []

for disease, group in df.groupby('Disease', sort=False):
    symptoms = ','.join(group['Symptom'])
    disease_codes = [d.strip() for d in disease.split('^')]
    for code in disease_codes:
        expanded_rows.append({'Disease': code, 'Symptom': symptoms})

# Create result DataFrame preserving order
result = pd.DataFrame(expanded_rows)

# (Optional) Clean junk symbols from the final result again, just in case
result['Disease'] = result['Disease'].apply(clean_text)
result['Symptom'] = result['Symptom'].apply(clean_text)

# Save to CSV
output_csv = 'flattened_url.csv'
result.to_csv(output_csv, index=False)

result.head()

Raw data to flattended data

Before Raw data

After

Flattened data (Comma separated)

UMLS:C0008031_pain chest,UMLS:C0392680_shortness of breath,UMLS:C0012833_dizziness,UMLS:C0004093_asthenia,UMLS:C0085639_fall,UMLS:C0039070_syncope,UMLS:C0042571_vertigo,UMLS:C0038990_sweat,UMLS:C0700590_sweating increased,UMLS:C0030252_palpitation,UMLS:C0027497_nausea,UMLS:C0002962_angina pectoris,UMLS:C0438716_pressure chest

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
LICENSE		LICENSE
README.md		README.md
flatten.py		flatten.py
flattened_url.csv		flattened_url.csv
flattened_url.xlsx		flattened_url.xlsx
raw_data.xlsx		raw_data.xlsx

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Disease-Symptom Data Cleaner and Flattener

Features

Input

Output

Usage

Example

Full Script

Raw data to flattended data

License

References

About

Uh oh!

Releases

Languages

License

ParthaPRay/Disease-Symptom-Knowledge-Database_Flattening

Folders and files

Latest commit

History

Repository files navigation

Disease-Symptom Data Cleaner and Flattener

Features

Input

Output

Usage

Example

Full Script

Raw data to flattended data

License

References

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Languages