π Hi, I'm a Data Analyst with experience in data manipulation, business intelligence, and process automation. I love turning complex data into insights that drive positive change.
Visit my LinkedIn
My fascination with healthcare dataβs complexities and its potential to yield insightful observations for patient care and hospital management inspired this project. The healthcare sector, with its intricate data structures and significant impacts, provides a unique analytical challenge. This project primarily aims to practice SQL skills by using advanced techniques like joins, CASE statements, GROUP BY and HAVING clauses, and subqueries and CTEs to extract and analyze data from a comprehensive healthcare dataset. While I aim to uncover patterns that demonstrate the power of data analysis in healthcare, it is important to note that this project is purely educational and not intended to influence actual healthcare decisions.
The dataset used in this project spans ten years (1999-2008) of clinical care from 130 U.S. hospitals and integrated delivery networks, focusing on hospital admissions for patients diagnosed with diabetes. It comprises over 100,000 instances and 50 features, covering an extensive array of data points including patient demographics, the medical specialty of the admitting physician, type of admission, length of hospital stay, laboratory tests performed, medications administered, and patient outcomes like readmission rates. This comprehensive dataset is available on Kaggle, a prominent platform hosting datasets and competitions for data science and machine learning. It is publicly accessible for analysis and educational uses, making it an excellent resource for exploring healthcare analytics.
The initial step in preparing the dataset for analysis involved loading the data into a MySQL database, which was structured into two main tables: demographics
and health
. This organization facilitated a more efficient querying process by separating patient demographic information from their health-related data.
health
table:
SELECT
column_name, DATA_TYPE
FROM
INFORMATION_SCHEMA.COLUMNS
WHERE
table_schema = 'patient'
AND table_name = 'health';
Below are the SQL queries used to examine time distributions, medical specialities, and potential disparities in treatment along with its respective output:
Analyzed the distribution of the length of hospital stays by simulating a histogram, identifying a predominance of short-term stays (1-4 days), which highlights the operational dynamics and potential areas for improving patient care efficiency.
SELECT
ROUND(time_in_hospital, 1) AS length_of_stay,
COUNT(*) AS count,
RPAD('', COUNT(*) / 100, '*') AS bar
FROM
health
GROUP BY length_of_stay
ORDER BY length_of_stay;
This pattern is important for understanding hospital workflows and patient turnover.
This query identifies medical specialties that typically involve a higher number of procedures per patient, indicating areas with intensive procedural demands.
SELECT
medical_specialty,
ROUND(AVG(num_procedures), 1) AS avg_procedures,
COUNT(*) AS count
FROM
health
GROUP BY medical_specialty
HAVING count > 50 AND avg_procedures > 2.5
ORDER BY avg_procedures DESC;
This analysis enhances understanding of procedural frequencies and aids in the hypothetical allocation of resources.
This query explores the differences in the number of laboratory procedures performed across different racial demographic groups.
SELECT
race,
ROUND(AVG(num_lab_procedures), 1) AS avg_num_lab_procedures
FROM
health AS h
JOIN
demographics AS d ON h.patient_nbr = d.patient_nbr
GROUP BY race
ORDER BY avg_num_lab_procedures DESC;
While the differences were nominal, ongoing monitoring and study of these trends are crucial to ensure healthcare equity and to address any emerging disparities.
This query investigates the correlation between the number of lab procedures and the length of hospital stays.
SELECT
ROUND(AVG(time_in_hospital), 1) AS avg_time,
CASE
WHEN num_lab_procedures >= 0 AND num_lab_procedures < 25 THEN 'few'
WHEN num_lab_procedures >= 25 AND num_lab_procedures < 55 THEN 'average'
ELSE 'many'
END AS procedure_frequency
FROM
health
GROUP BY procedure_frequency
ORDER BY avg_time DESC;
This relationship is analyzed to demonstrate how to detect and interpret healthcare data patterns, emphasizing the potential impact of procedural load on hospital stay durations.
This analysis examines the efficiency of emergency admissions by focusing on the duration of hospital stays compared to the average.
WITH avg_time AS (
SELECT AVG(time_in_hospital) AS avg_hospital_time
FROM health
)
-- Count patients staying less and more than the average duration and calculate percentages
SELECT
ROUND(100.0 * SUM(CASE
WHEN time_in_hospital < avg_hospital_time THEN 1
ELSE 0
END) / COUNT(*),
2) AS pct_below_average_stays,
ROUND(100.0 * SUM(CASE
WHEN time_in_hospital >= avg_hospital_time THEN 1
ELSE 0
END) / COUNT(*),
2) AS pct_above_or_equal_average_stays
FROM
health,
avg_time;
This helps to understand the distribution of hospital stays but also to quantify how often emergency admissions result in shorter-than-average stays, potentially indicating efficiency in patient management and resource utilization.
This project unveiled several insights into the healthcare management of diabetic patients:
Reflecting on this project, I used my SQL knowledge to navigate the complexities of a healthcare dataset focused on diabetes care. The process involved data cleaning and preparation, including correcting data types and managing missing values, which were necessary for effective analysis. Utilizing SQL techniques like subqueries, CTEs, and complex joins deepened my understanding of real-world data applications. Engaging with healthcare data underscored the potential of data analytics in improving operational efficiencies. Overall, this experience has not only bolstered my analytical capabilities but also piqued my interest in the intersection of data science and healthcare.