MBA 611 Data Analytics 10004
Section 1: Learning Outcomes
The MISSION of Concordia University Irvine
Concordia University, guided by the Great Commission of Christ Jesus and the Lutheran Confessions, empowers students through the liberal arts and professional studies for lives of learning, service and leadership.
Institutional Learning Outcomes for Graduate Students (GLOs)
Scholarly Research (SR): Generate scholarly research on problems and issues in their field of study.
Integrated Learning (IL): Integrate knowledge and skills from current research, scholarship, and/or techniques in their field with other disciplines.
Ethical Leadership (EL): Apply standards of ethics informed by Christian teachings as they fulfill their vocations as leaders within society.
Effective Communication (EC): Elucidate disciplinary knowledge and findings in professional and academic contexts through written, oral, and digital media.
Reflective Practice (RP): Balance evidence-based decision making, logical thinking, and consideration of human potential to take appropriate actions that advance their field.
Community Engagement (CE): Assess and develop cogent positions on significant issues in their field to respond to diverse needs in their respective communities.
Program Learning Outcomes (PLOs)
MBA graduates will be able to demonstrate the following:
Course Learning Outcomes
Section 2: Course Management
Course Description
This is a blended course incorporating both online and in-class components. The in-class lecture and discussions will build upon the information learned in the readings and from the online modules.
In this course, we will examine how data analysis can be used to improve decision-making. We will study the fundamental principles and techniques of data mining, with real-world examples and cases to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.
Course Components, Evaluation & Grading
Required Texts/Material:
Assignment Map: See Online Modules for specific assignments and their due dates. The instructor reserves the right to change assignments and the grading schema during the semester. Students will be made aware of any changes made to the syllabus or the assignment map.
Class Participation: Actively join the class discussions on the case studies and discussion questions. There are no wrong opinions.
Online Modules: Prior to the start of each class:
There are a total of 6 modules:
Project: In week 7 your final project is due. You will submit your work and give a visual presentation describing your work and your findings. For more information on the project, see Attachment One.
Your final grade is a weighted average of each component as shown below:
Component | % of Total Grade
Class participation | 10%
Project & Presentation | 25%
Online Homework | 15%
Case Study Assignments | 20%
Misc. Assignments | 30%
Total | 100%
Letter Grade | Percentage
A | 93-100
A- | 90-92
B+ | 87-89
B | 84-86
B- | 80-83
C+ | 77-79
C | 74-76
F | < 74
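As a worked illustration of the weighting, the sketch below computes a final percentage and a letter grade in R. The component scores are hypothetical, and rounding to the nearest whole percent before the lookup is an assumption, since the table lists only whole-number ranges.

```r
# Component weights from the grading table (they sum to 1)
weights <- c(participation = 0.10, project = 0.25, homework = 0.15,
             case_studies = 0.20, misc = 0.30)

# Hypothetical component scores on a 0-100 scale (illustration only)
scores <- c(participation = 95, project = 88, homework = 92,
            case_studies = 85, misc = 90)

# Weighted average of the components
final_score <- sum(weights * scores[names(weights)])

# Lower bound of each letter band, taken from the letter-grade table
cutoffs <- c(74, 77, 80, 84, 87, 90, 93)
letters_by_band <- c("F", "C", "C+", "B-", "B", "B+", "A-", "A")

# Assumption: round to a whole percent before looking up the letter
letter <- letters_by_band[findInterval(round(final_score), cutoffs) + 1]

print(final_score)  # 89.3
print(letter)       # "B+"
```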
Attendance at all regularly scheduled classes is expected (or required). Failure to attend classes does not constitute withdrawal. Missing two (2) classes or eight hours of class time will result in failing the course.
Preparation Time & the Carnegie Hour Standard
This is an accelerated course. As a result, you are expected to do more work and preparation outside of the classroom than you may have been required to do in other courses. Accordingly, the average student should expect to spend approximately 46 hours (equivalent to 55 Carnegie hours) on related course work outside of class.
Policy on Honesty and Plagiarism
This course seeks to empower students for independent learning, resourcefulness, clear thinking, and perception. All submitted work and activities should be genuine reflections of individual achievement from which the student should derive personal satisfaction and a sense of accomplishment. Plagiarism and cheating subvert these goals and will be treated according to the policy stated in the Concordia University Irvine Student Handbook.
The instructor reserves the right to utilize electronic means to help prevent plagiarism. Students agree that by taking this course all assignments are subject to submission for textual similarity review to SafeAssign via Blackboard. Assignments submitted to SafeAssign will be included as source documents in SafeAssign’s restricted access database solely for the purpose of detecting plagiarism in such documents.
“Reasonable Accommodation” Statement
Students desiring accommodations on the basis of physical, learning, or psychological disability for this class are to contact the Disability and Learning Resource Center (DLRC). The DLRC is located in Suite 114 on the 1st floor of the Administration Building. You can reach the DLRC by dialing extension 1586.
Passwords:
Concordia provides a self-help password assistance program that is available 24 hours a day, 7 days a week. To access this service, go to myaccount.cui.edu. If you need further assistance, please email ITS@cui.edu or call ITS at 949 214-3175.
CELT: Concordia University has developed a student resource page named the Center for Excellence in Learning and Teaching (CELT) found at celt.cui.edu/Student.htm. This page provides a wide variety of resources ranging from links to MyRecords (Banner) to help with Microsoft Office. There are also hundreds of video tutorials for a large number of software packages. This tutorial repository can be accessed directly at http://movies.atomiclearning.com/highed/highed. For your class, the username for the service is "cui" and the password is "eagles".
Library Resources
The library services at Concordia can be accessed online at cui.edu/Library. There are hundreds of thousands of journals, ebooks and other titles that can be found in the 30+ research databases.
Resources for Writers
DO NOT try to access Grammarly from the internet directly or the system will try to charge you a fee. Instead, access Grammarly through the CUI Writing Center website link.
Communication with your instructor outside of the classroom shall be via Blackboard and/or CUI email. Your instructor shall reply to any email within 1 working day. If you do not receive a response, assume your instructor did not receive the communication.
Research questions could originate from any of a multitude of study areas. In medicine/healthcare, datasets could be obtained from sites such as HealthData.gov (http://www.healthdata.gov/dataset/search) or the Centers for Disease Control website (http://www.cdc.gov/nchs/data_access/sets/available_data.htm). Students with an interest in biology or genetics/genomics could work with datasets from organizations such as the National Cancer Institute (http://epi.grants.cancer.gov/dac/) and the Broad Institute (MIT) (http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi). Students passionate about a particular sport could search the data sets that are available at http://databasesports.com/, a site which claims to have the largest sports statistics database online, while students who prefer focusing more on business issues could choose to work with data from sites such as USA.gov (http://www.usa.gov/Business/Business-Data.shtml), the National Bureau of Economic Research (http://www.nber.org/data/), or The Data Page of the Stern School (http://pages.stern.nyu.edu/~adamodar/New_Home_Page/data.html). For research involving social media, freely available data sets can be obtained via the Stanford Network Analysis Project (SNAP) (http://snap.stanford.edu/data/index.html) or from Yelp (https://www.yelp.com/academic_dataset). In short, once the student has come up with a question of interest, there should be sufficient options for finding appropriate data to examine.
Our preference is for students to investigate questions of their own; however, students can also look for “ready-to-go” options similar to the ones described below:
Project Design
1. Perform statistical analyses of the data: The student will use R, Excel, or another programming language to perform statistical analyses of the data. At a minimum, basic hypothesis testing and regression modeling will be done. You are encouraged to work on clustering your data or on more extensive building and testing of models.
2. Interpret results of data analysis: The student will learn to examine research results critically to decide whether or not the results obtained are as relevant to the question under investigation as was anticipated. The student will consider the potential differences between “statistical significance” and “practical significance.”
3. Write up results for presentation: Since data visualization is a critical subfield of data science, students will use the graphics capabilities of R, Python, and/or other packages to create quality graphics to highlight their research results. Additionally, students will show the statistical output generated from the statistical package used. These data visualizations will be included in both the written and oral presentations prepared by the student(s).
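The difference between statistical and practical significance mentioned in step 2 can be illustrated with a small simulation. This is only a sketch: the data are simulated rather than taken from any real study, the seed and sample size are arbitrary, and Cohen's d is used here as one common effect-size measure, not one the project requires.

```r
set.seed(611)  # arbitrary seed so the simulation is reproducible

# Simulated data: two large groups whose true means differ by
# only 0.05 standard deviations
n <- 100000
group_a <- rnorm(n, mean = 0.00, sd = 1)
group_b <- rnorm(n, mean = 0.05, sd = 1)

# A basic hypothesis test (Welch two-sample t-test)
test_result <- t.test(group_b, group_a)

# An effect size (Cohen's d) to judge practical importance
pooled_sd <- sqrt((var(group_a) + var(group_b)) / 2)
cohens_d <- (mean(group_b) - mean(group_a)) / pooled_sd

print(test_result$p.value)  # very small with samples this large
print(cohens_d)             # close to the simulated 0.05, a tiny effect
```

With samples this large the test rejects the null hypothesis, yet the estimated difference is a small fraction of a standard deviation, which may not matter for the decision at hand.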
HealthData.gov: http://www.healthdata.gov/dataset/search
Centers for Disease Control: http://www.cdc.gov/nchs/data_access/sets/available_data.htm
National Cancer Institute: http://epi.grants.cancer.gov/dac/
Broad Institute (MIT): http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
databaseSports.com: http://databasesports.com/
USA.gov: http://www.usa.gov/Business/Business-Data.shtml
National Bureau of Economic Research: http://www.nber.org/data/
The Data Page (NYU): http://pages.stern.nyu.edu/~adamodar/New_Home_Page/data.html
Stanford Network Analysis Project (SNAP): http://snap.stanford.edu/data/index.html
Yelp: https://www.yelp.com/academic_dataset
UCI
The R Programming Language: http://www.r-project.org/
R Studio (Integrated Development Environment for R): http://www.rstudio.com/
Python and IPython: https://www.python.org/, http://ipython.org/index.html
GitHub and git: http://www.git-scm.com/, https://github.com/
Online tutorials:
Topic: Health care data (hospital)
Research Question: What factors affect the total cost to the hospital for a patient?
The analysis is explained through the code that follows.
```{r libraries, echo=TRUE, message=FALSE, warning=FALSE}
library(stats) #for regression
library(caret) #for data partition
library(car) #for VIF
library(sandwich) #for variance, covariance matrix
```
```{r readData, echo=TRUE,tidy=TRUE}
raw_df <- read.csv("D:/hospital.csv", header = TRUE,sep = ",",na.strings = c(""," ", "NA"))
raw_df <- raw_df[,-c(58:62)]
raw_df
```
```{r summarizeData, echo=TRUE,tidy=TRUE}
str(raw_df)
summary(raw_df)
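# Convert to character first so the recode cannot create an invalid factor level
raw_df$PAST.MEDICAL.HISTORY.CODE <- as.character(raw_df$PAST.MEDICAL.HISTORY.CODE)
# Merge the inconsistently capitalised label; which() skips NA comparisons
raw_df$PAST.MEDICAL.HISTORY.CODE[which(raw_df$PAST.MEDICAL.HISTORY.CODE == "Hypertension1")] <- "hypertension1"
# Treat a missing history as its own category
raw_df$PAST.MEDICAL.HISTORY.CODE[is.na(raw_df$PAST.MEDICAL.HISTORY.CODE)] <- "None"
raw_df$PAST.MEDICAL.HISTORY.CODE <- as.factor(raw_df$PAST.MEDICAL.HISTORY.CODE)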
```
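The recoding of PAST.MEDICAL.HISTORY.CODE in the chunk above can be tried on a toy vector before running it on the full data. The values below are made up for illustration. Working on a character copy avoids the situation where a replacement value is not an existing factor level and is silently turned into NA.

```r
# Toy vector (hypothetical values) with inconsistent capitalisation and a gap
history <- factor(c("Hypertension1", "hypertension1", NA, "Diabetes1"))

# Work on a character copy so any replacement value is accepted
history_chr <- as.character(history)

# Merge the two spellings; which() ignores the NA comparison
history_chr[which(history_chr == "Hypertension1")] <- "hypertension1"

# Give missing entries their own explicit category
history_chr[is.na(history_chr)] <- "None"

# Convert back to a factor for modelling
history_clean <- factor(history_chr)
print(table(history_clean))
```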
```{r createDataCopy, echo=TRUE,tidy=TRUE}
new_df <- raw_df[,c(-1,-4,-5,-7,-9:-21,-23,-25,-31:-36,-41,-42,-44,-46,-48,-56)]
new_df <- na.omit(new_df) # listwise deletion of missing
head(new_df)
```
Among the numeric attributes in the data, it is of interest to identify the variables that are correlated with each other. High correlation among predictor variables may result in **multi-collinearity** in the model.
```{r corMatrix, echo=TRUE,tidy=TRUE}
correlationMatrix <- cor(new_df[,c(1,7:10,12:14,18:24,26)])
print(correlationMatrix)
# find attributes that are highly correlated (absolute correlation above 0.7)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.7, names = TRUE)
print(highlyCorrelated)
```
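To see which pairs of variables are behind a high correlation, the matrix can also be scanned directly in base R. The sketch below uses the built-in `mtcars` data as a stand-in for the hospital data, and the 0.7 cutoff simply mirrors the chunk above. It lists pairs and does not reproduce the removal heuristic of `findCorrelation()`.

```r
# Stand-in numeric data (built-in dataset, not the hospital file)
numeric_df <- mtcars[, c("mpg", "disp", "hp", "wt", "qsec")]
cor_matrix <- cor(numeric_df)

# Keep each pair once (upper triangle) and filter on absolute correlation
high_idx <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)

high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high_idx[, "row"]],
  var2 = colnames(cor_matrix)[high_idx[, "col"]],
  r = cor_matrix[high_idx]
)
print(high_pairs)
```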
BMI is derived so that Weight and Height can be dropped as variables, since both were highly correlated with age. Creatinine is also dropped as a variable because it is highly correlated with age.
```{r}
new_df$BMI <- new_df$BODY.WEIGHT/((new_df$BODY.HEIGHT/100) ^ 2)
new_df$I_COST.OF.IMPLANT <- model.matrix(~new_df$IMPLANT.USED..Y.N.)[,2]*new_df$COST.OF.IMPLANT
filter_df <- new_df[,c(-5:-6)]
```
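The two derived columns can be checked by hand on a couple of rows. The patients below are hypothetical, and the sketch assumes weight is recorded in kilograms and height in centimetres, which is what the division by 100 in the chunk above implies.

```r
# Hypothetical patients (illustration only); weight in kg, height in cm
toy_df <- data.frame(
  BODY.WEIGHT = c(70, 85),
  BODY.HEIGHT = c(175, 160),
  IMPLANT.USED..Y.N. = c("Y", "N"),
  COST.OF.IMPLANT = c(40000, 1500)
)

# BMI = weight (kg) / height (m)^2
toy_df$BMI <- toy_df$BODY.WEIGHT / ((toy_df$BODY.HEIGHT / 100)^2)

# Implant cost counts only when an implant was used (indicator times cost)
implant_used <- as.numeric(toy_df$IMPLANT.USED..Y.N. == "Y")
toy_df$I_COST.OF.IMPLANT <- implant_used * toy_df$COST.OF.IMPLANT

print(toy_df[, c("BMI", "I_COST.OF.IMPLANT")])
```

The first patient has a BMI of about 22.86 (70 / 1.75^2), and the second patient's implant cost is zeroed out because no implant was used.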
By default, the base (reference) category is the first level in alphabetical order. In this code chunk we change the base category for the PAST.MEDICAL.HISTORY.CODE variable.
The base category can be changed using the function **relevel()**.
```{r relevelCategory, echo=TRUE,tidy=TRUE}
filter_df$PAST.MEDICAL.HISTORY.CODE <- relevel(filter_df$PAST.MEDICAL.HISTORY.CODE, ref = "None")
```
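The effect of changing the base category can be seen on a toy factor. The values and the `cost` column below are made up for illustration. After releveling, the regression reports one coefficient for each remaining category, each measured relative to "None".

```r
# Hypothetical data (illustration only)
history <- factor(c("Diabetes", "None", "Hypertension", "None"))
cost <- c(10, 12, 15, 11)

# Default reference level is the first level alphabetically
print(levels(history)[1])

# Make "None" the reference level
history <- relevel(history, ref = "None")
print(levels(history)[1])

# The reference level has no coefficient of its own; it is absorbed
# into the intercept
fit <- lm(cost ~ history)
print(names(coef(fit)))
```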
```{r createDataPartition, echo=TRUE,tidy=TRUE}
set.seed(2341)
trainIndex <- createDataPartition(filter_df$TOTAL.COST.TO.HOSPITAL, p = 0.80, list = FALSE)
train_df <- filter_df[trainIndex,]
test_df <- filter_df[-trainIndex,]
```
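For readers without the caret package, an 80/20 split can be sketched in base R. Note that this is a plain random split on the built-in `mtcars` data, used here as a stand-in. It is not equivalent to `createDataPartition()`, which samples within groups of the outcome so that the two sets have similar outcome distributions.

```r
set.seed(2341)  # same seed as above, reused only for reproducibility

# Stand-in data (built-in dataset, not the hospital file)
n_rows <- nrow(mtcars)

# Simple random sample of 80% of the row indices
demo_train_idx <- sample(seq_len(n_rows), size = floor(0.80 * n_rows))

demo_train <- mtcars[demo_train_idx, ]
demo_test <- mtcars[-demo_train_idx, ]

print(c(train = nrow(demo_train), test = nrow(demo_test)))
```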
```{r}
train_df$Log.Cost.Treatment <- log(train_df$TOTAL.COST.TO.HOSPITAL)
test_df$Log.Cost.Treatment <- log(test_df$TOTAL.COST.TO.HOSPITAL)
```
```{r variableUsedinTraining, echo=TRUE,tidy=TRUE}
reg_train_df <- as.data.frame(train_df[,c("AGE",
"HR.PULSE",
"BP..HIGH",
"RR",
"HB",
"UREA",
#"TOTAL.LENGTH.OF.STAY",
"BMI",
#"COST.OF.IMPLANT",
#"IMPLANT.USED..Y.N.",
"I_COST.OF.IMPLANT",
"GENDER",
"MARITAL.STATUS",
"KEY.COMPLAINTS..CODE",
"PAST.MEDICAL.HISTORY.CODE",
"MODE.OF.ARRIVAL",
#"STATE.AT.THE.TIME.OF.ARRIVAL",
"TYPE.OF.ADMSN",
"TOTAL.COST.TO.HOSPITAL"
#"Log.Cost.Treatment"
)])
```
```{r variableUsedinTesting, echo=TRUE, tidy=TRUE}
reg_test_df <- as.data.frame(test_df[,c("AGE",
"HR.PULSE",
"BP..HIGH",
"RR",
"HB",
"UREA",
#"TOTAL.LENGTH.OF.STAY",
"BMI",
#"COST.OF.IMPLANT",
#"IMPLANT.USED..Y.N.",
# Must match the training columns, otherwise predict() cannot find this variable
"I_COST.OF.IMPLANT",
"GENDER",
"MARITAL.STATUS",
"KEY.COMPLAINTS..CODE",
"PAST.MEDICAL.HISTORY.CODE",
"MODE.OF.ARRIVAL",
#"STATE.AT.THE.TIME.OF.ARRIVAL",
"TYPE.OF.ADMSN",
"TOTAL.COST.TO.HOSPITAL"
#"Log.Cost.Treatment"
)])
```
```{r buildModel, echo=TRUE, message=FALSE, warning=FALSE, tidy=TRUE}
#Null Model
no_model <- lm(TOTAL.COST.TO.HOSPITAL ~ 1,data = reg_train_df)
#Full Model
reg_full_model <- lm(TOTAL.COST.TO.HOSPITAL ~., data = reg_train_df)
#Stepwise selection starting from the null model (terms can be added or dropped)
reg_step_model <- step(no_model, list(lower = formula(no_model),
upper = formula(reg_full_model)),
direction = "both",trace = 0)
#Stepwise selection starting from the full model; the lower bound is the null model so that terms can be dropped
reg_step_model1 <- step(reg_full_model, list(lower = formula(no_model),
upper = formula(reg_full_model)),
direction = "both",trace = 0)
```
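The stepwise search can be tried end to end on a small built-in dataset. The sketch below uses `mtcars` and an arbitrary set of predictors as stand-ins for the hospital data. `step()` adds or drops one term at a time and keeps a change only when it lowers the AIC.

```r
# Stand-in models on a built-in dataset (not the hospital file)
demo_null <- lm(mpg ~ 1, data = mtcars)
demo_full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)

# Search between the null and the full model in both directions
demo_step <- step(demo_null,
                  scope = list(lower = formula(demo_null),
                               upper = formula(demo_full)),
                  direction = "both", trace = 0)

# Terms retained by the search, and the AIC before and after
print(formula(demo_step))
print(c(null = AIC(demo_null), selected = AIC(demo_step)))
```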
The model summary gives the equation of the model and tests, for each beta coefficient, whether it is statistically different from zero.
```{r modelStats,tidy=TRUE}
summary(reg_step_model)
```
The output of summary is given below:
Here we can see that Type of Admission (Emergency), I Cost of Implant (the product of the Implant Used Y/N indicator and Cost of Implant), and Age of the patient are highly significant, which suggests that Total Cost to Hospital depends on these 3 factors.
The R-squared value means that the model explains 56.35% of the variation in Total Cost to Hospital.
The p-value of HR Pulse is greater than 0.05, so it is not treated as a significant factor.
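The quantities read off the summary can also be extracted programmatically. The sketch below fits a stand-in model on the built-in `mtcars` data, so the numbers it prints are not those of the hospital model. The 0.05 threshold follows the text above.

```r
# Stand-in model on a built-in dataset (not the hospital model)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_summary <- summary(demo_fit)

# R-squared: share of the variation in the response explained by the model
r_squared <- demo_summary$r.squared

# The same value computed from its definition, 1 - RSS / TSS
rss <- sum(resid(demo_fit)^2)
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)
r_squared_manual <- 1 - rss / tss

# p-value of each coefficient and the terms below the 0.05 threshold
p_values <- coef(demo_summary)[, "Pr(>|t|)"]
significant_terms <- names(p_values)[p_values < 0.05]

print(c(r_squared, r_squared_manual))
print(significant_terms)
```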
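The Normal Q-Q plot shows whether the error terms (residuals) are normally distributed. Under a normal distribution, the dots should lie close to the straight line without much deviation. When they do, it is the residuals of the model, rather than the individual variables, that can be treated as approximately normally distributed.
Of the diagnostic plots available for the model, we generally look only at the Normal Q-Q plot.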
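A Q-Q plot of the residuals can be drawn with base R as sketched below. The model is a stand-in fitted on the built-in `mtcars` data. The Shapiro-Wilk test is added here as an optional numerical companion to the plot and is not part of the original analysis.

```r
# Stand-in model on a built-in dataset (not the hospital model)
demo_fit <- lm(mpg ~ wt + hp, data = mtcars)
demo_residuals <- resid(demo_fit)

# Points close to the reference line suggest approximately normal residuals
qqnorm(demo_residuals, main = "Normal Q-Q plot of residuals")
qqline(demo_residuals)

# plot(demo_fit, which = 2) draws the version based on standardised residuals

# Optional numerical check: a small p-value signals departure from normality
shapiro_result <- shapiro.test(demo_residuals)
print(shapiro_result$p.value)
```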
```{r}
vif(reg_step_model)
```
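For numeric predictors, the variance inflation factor has a simple definition that can be computed by hand: regress each predictor on the others and take 1 / (1 - R^2). The sketch below does this on the built-in `mtcars` data with an arbitrary set of predictors. For models with factor predictors, `car::vif()` reports a generalized version instead, so the numbers are not directly comparable in that case.

```r
# Stand-in predictors from a built-in dataset (not the hospital file)
predictors <- c("wt", "hp", "disp")

manual_vif <- sapply(predictors, function(p) {
  others <- setdiff(predictors, p)
  # Auxiliary regression of one predictor on all the others
  aux_fit <- lm(reformulate(others, response = p), data = mtcars)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(round(manual_vif, 2))
```

A value near 1 means the predictor is nearly uncorrelated with the others, and larger values indicate stronger multi-collinearity.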
```{r modelValidation, echo=TRUE,tidy=TRUE}
reg_test_df_predict = predict(reg_step_model, reg_test_df,
interval = "confidence",
level = 0.95,
type = "response")
data.frame(reg_test_df_predict, reg_test_df$TOTAL.COST.TO.HOSPITAL)
```
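Beyond listing predictions next to actual values, hold-out accuracy can be summarised in a single number. The sketch below computes RMSE and MAPE for a stand-in model on the built-in `mtcars` data. Both metrics are common choices rather than ones prescribed by the analysis above. Note also that `interval = "confidence"` gives an interval for the mean response, while `interval = "prediction"` gives the wider interval for an individual case.

```r
set.seed(2341)  # arbitrary seed for a reproducible split

# Stand-in data and model (not the hospital file)
demo_idx <- sample(seq_len(nrow(mtcars)), size = floor(0.80 * nrow(mtcars)))
demo_train <- mtcars[demo_idx, ]
demo_test <- mtcars[-demo_idx, ]
demo_fit <- lm(mpg ~ wt + hp, data = demo_train)

# Predict on rows the model has not seen
demo_pred <- predict(demo_fit, newdata = demo_test)
demo_errors <- demo_test$mpg - demo_pred

# Root mean squared error, in the units of the response
rmse <- sqrt(mean(demo_errors^2))

# Mean absolute percentage error
mape <- mean(abs(demo_errors / demo_test$mpg)) * 100

print(c(RMSE = rmse, MAPE = mape))
```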
Conclusion:
From the above analysis, we can say that the Total Cost to Hospital for a patient depends mainly on 3 variables: