10004 MBA 611 Data Analytics: Health Care Data Assessment Answer


Question :

MBA 611 Data Analytics 10004

Section 1: Learning Outcomes

The MISSION of Concordia University Irvine

Concordia University, guided by the Great Commission of Christ Jesus and the Lutheran Confessions, empowers students through the liberal arts and professional studies for lives of learning, service and leadership.

Institutional Learning Outcomes for Graduate Students (GLOs)

Scholarly Research (SR): Generate scholarly research on problems and issues in their field of study.

Integrated Learning (IL): Integrate knowledge and skills from current research, scholarship, and/or techniques in their field with other disciplines.

Ethical Leadership (EL): Apply standards of ethics informed by Christian teachings as they fulfill their vocations as leaders within society.

Effective Communication (EC): Elucidate disciplinary knowledge and findings in professional and academic contexts through written, oral, and digital media.

Reflective Practice (RP): Balance evidence-based decision making, logical thinking, and consideration of human potential to take appropriate actions that advance their field.

Community Engagement (CE): Assess and develop cogent positions on significant issues in their field to respond to diverse needs in their respective communities.

Program Learning Outcomes (PLOs)

MBA graduates will be able to demonstrate the following:

  1. Recognize problems (RP)
  2. Integrate theory and practice for the purpose of strategic analysis (IL)
  3. Employ and apply quantitative techniques and methods in the analysis of real-world business situations (SR)
  4. Communicate to relevant audiences (EC); graduates should be able to:
    • Compose clear, consistent, and effective written forms of communication
    • Compose and present effective oral business presentations
  5. Work effectively with a team of colleagues on diverse projects (CE)
  6. Identify and analyze the ethical obligations and responsibilities of business (EL)

Course Learning Outcomes

  1. Understand the value, challenges and opportunities posed by Data. Understand the changing nature of data (volume, velocity, and variety) and what it means to your business. Determine how and where Big Data challenges arise in a number of domains, including social media, transportation, finance, and medicine. (1) (2)
  2. Recognize the ethical challenges posed in harnessing Big Data. (6)
  3. Explore the relational model, SQL, and capabilities of new relational systems in terms of scalability and performance. (2)
  4. Understand the benefit from proper visualization of Big Data. (6)
  5. Understand and be able to apply basic statistical techniques employed in analyzing data. (3)
  6. Comprehend the concepts of fast algorithms for data analytics (6) (3)

Section 2: Course Management

Course Description

This is a blended course incorporating both online and in-class components. The in-class lecture and discussions will build upon the information learned in the readings and from the online modules.

In this course, we will examine how data analysis can be used to improve decision-making. We will study the fundamental principles and techniques of data mining, with real-world examples and cases to place data-mining techniques in context, to develop data-analytic thinking, and to illustrate that proper application is as much an art as it is a science.

Course Components, Evaluation & Grading

Required Texts/Material:

Assignment Map:
See Online Modules for specific assignments and their due dates. The instructor reserves the right to change assignments and the grading schema during the semester. Students will be made aware of any changes made to the syllabus or the assignment map.


Class Participation: Actively join the class discussions on the case studies and discussion questions. There are no wrong opinions.
Online Modules: Prior to the start of each class:
  • View the Online Module
  • Read the e-text and any other articles found in the module (e-text can be found under the Readings and Resource section.)
  • Watch all videos attached to the modules
  • Complete the Online Homework
  • Complete the Turn-in Homework (due at the start of class.)
Prior to attending the 1st class in this course, complete both the “Welcome to your Wiley Course” and ALL of Module 1.
There are a total of 6 modules:


  • Week 1: The Value of Data and Its Changing Nature
  • Week 2: Data Typologies and Governance
  • Week 3: Working with Data: Extraction, Curation, Classification, and Management
  • Week 4: Business Statistics
  • Week 5: Optimization and Forecasting
  • Week 6: Data Visualization

Project: In week 7 your final project is due. You will submit your work and provide a visual presentation describing your work and your findings.
For more information on the project, see Attachment One.

Your final grade is a weighted average of each component as shown below:

| Component | % of Total Grade |
|---|---|
| Class participation | 10% |
| Project & Presentation | 25% |
| Online Homework | 15% |
| Case Study Assignments | 20% |
| Misc. Assignments | 30% |
| Total | 100% |



| Letter Grade | Percentage |
|---|---|
| A | 93-100 |
| A- | 90-92 |
| B+ | 87-89 |
| B | 84-86 |
| B- | 80-83 |
| C+ | 77-79 |
| C | 74-76 |
| F | < 74 |

Section 3: Class Policies

Attendance and Participation

Attendance at all regularly scheduled classes is expected (or required). Failure to attend classes does not constitute withdrawal. Missing two (2) classes or eight hours of class time will result in failing the course.

Preparation Time & the Carnegie Hour Standard

This is an accelerated course. As a result, you are expected to do more work and preparation outside of the classroom than you may have been required to do in other courses. Accordingly, the average student should expect to spend approximately 46 hours (equivalent to 55 Carnegie hours) on related course work outside of class.

Policy on Honesty and Plagiarism

This course seeks to empower students for independent learning, resourcefulness, clear thinking, and perception. All submitted work and activities should be genuine reflections of individual achievement from which the student should derive personal satisfaction and a sense of accomplishment. Plagiarism and cheating subvert these goals and will be treated according to the policy stated in the Concordia University Irvine Student Handbook.

The instructor reserves the right to utilize electronic means to help prevent plagiarism. Students agree that by taking this course all assignments are subject to submission for textual similarity review to SafeAssign via Blackboard. Assignments submitted to SafeAssign will be included as source documents in SafeAssign’s restricted access database solely for the purpose of detecting plagiarism in such documents.

“Reasonable Accommodation” Statement

Students desiring accommodations on the basis of physical, learning, or psychological disability for this class are to contact the Disability and Learning Resource Center (DLRC). The DLRC is located in Suite 114 on the 1st floor of the Administration Building. You can reach the DLRC by dialing extension 1586.

Passwords:

Concordia provides a 24 hour 7 days a week self-help password assistance program. To access this service, go to myaccount.cui.edu. If you need further assistance, please email ITS@cui.edu or call ITS at 949 214-3175.

CELT: Concordia University has developed a student resource page named the Center for Excellence in Learning and Teaching (CELT) found at celt.cui.edu/Student.htm. This page provides a wide variety of resources ranging from links to MyRecords (Banner) to help with Microsoft Office. There are also hundreds of video tutorials for a large number of software packages. This tutorial repository can be accessed directly at http://movies.atomiclearning.com/highed/highed. For your class, the username for the service is "cui" and the password is "eagles".

Library Resources

The library services at Concordia can be accessed online at cui.edu/Library. There are hundreds of thousands of journals, ebooks and other titles that can be found in the 30+ research databases.

Resources for Writers

  • The Writing Studio: Meet with a Writing Studio consultant to receive feedback on your writing and to develop a revision plan that will strengthen your paper. The Writing Studio is located on the main floor of the library. On-site and on-line appointments can be made by going to www.cui.edu/studentlife/writing-studio. The site also has links to more than 60 helpful resources for writers, guidelines for documentation styles, sample papers showing all citation styles, information on plagiarism, and more.
  • The Online Writing Lab (OWL): If you are a graduate student, or are taking an online class, you are eligible to send your paper to The Online Writing Lab. A writing consultant will read your paper and return it to you with written feedback. Go to  www.cui.edu/studentlife/writing-studio  for submission guidelines. Always send the directions and criteria for your writing assignment with your paper.
  • Grammarly (Free for CUI students): Get help with grammar from this automated grammar tutorial and revision support tool. As a CUI student, you can upload segments—or drafts—of your paper to receive immediate feedback, assess the editorial suggestions, and make the necessary corrections. Simply log-in using your CUI email.

DO NOT try to access Grammarly from the internet directly or the system will try to charge you a fee. Instead, access Grammarly through the CUI Writing Center website link.

Student-Instructor Communication

Communication outside of the classroom with your instructor shall be via Blackboard and/or CUI emails. Your instructor shall reply to any email sent no later than 1 working day later.  If you do not receive a response, assume your instructor did not receive the communication.

ATTACHMENT ONE – Project Guidance

Research questions could originate from any of a multitude of study areas. In medicine/healthcare, datasets could be obtained from sites such as HealthData.gov (http://www.healthdata.gov/dataset/search) or the Centers for Disease Control website (http://www.cdc.gov/nchs/data_access/sets/available_data.htm). Students with an interest in biology or genetics/genomics could work with datasets from organizations such as the National Cancer Institute (http://epi.grants.cancer.gov/dac/) and the Broad Institute (MIT) (http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi). Students passionate about a particular sport could search the data sets that are available at http://databasesports.com/, a site which claims to have the largest sports statistics database online, while students who prefer focusing more on business issues could choose to work with data from sites such as USA.gov (http://www.usa.gov/Business/Business-Data.shtml), the National Bureau of Economic Research (http://www.nber.org/data/), or The Data Page of the Stern School (http://pages.stern.nyu.edu/~adamodar/New_Home_Page/data.html). For research involving social media, freely available data sets can be obtained via the Stanford Network Analysis Project (SNAP) (http://snap.stanford.edu/data/index.html) or from Yelp (https://www.yelp.com/academic_dataset). In short, once the student has come up with a question of interest, there should be sufficient options for finding appropriate data to examine.

Our preference would be for students to investigate questions of their own. However, students can also look for “ready-to-go” options similar to the ones described below:

  1. Medicine: a. Question: Can gene expression data be used to gain insights into inducing differentiation in leukemia cells? b. Data sets: Data on gene expression in chemically treated and untreated leukemia cells from the Broad Institute website.
  2. Social Media: a. Question: Can sentiment analysis of Yelp reviews be used to predict star ratings? b. Data set: The Yelp academic dataset.
  3. Economics: a. Question: How do employment status and income level correlate with expectations of crime victimization? b. Data set: The Survey of Economic Expectations data available on the National Bureau of Economic Research website.

Project Design

  1. Identify a research problem:
  2. Identify available data sources:
  3. Examine the data sources and clean data as needed: The student will use a variety of tools (text editors, Excel, R, Python) to perform exploratory data analysis and data “munging” (parsing and formatting data) to determine the adequacy of the data for the research question being investigated and to ensure that the data are in usable formats.
  4. Formulate hypotheses to test: Based on the initial overview of the data, the student will identify one or more hypotheses to test.

  5. Perform statistical analyses of the data: The student will use R, Excel, or another programming language to perform statistical analyses of the data. At a minimum, basic hypothesis testing and regression modeling will be done. You are encouraged to work on clustering your data or doing more extensive building and testing of models.
  6. Interpret results of data analysis: The student will learn to examine research results critically to decide whether or not the results obtained are as relevant to the question under investigation as was anticipated. The student will consider the potential differences between “statistical significance” and “practical significance.”
  7. Write up results for presentation: Since data visualization is a critical subfield of data science, students will use the graphics capabilities of R, Python and/or other packages to create quality graphics to highlight their research results. Additionally, students will show the statistical output generated from the statistical package used. These data visualizations will be included in both the written and oral presentations prepared by the student(s).

ATTACHMENT TWO – Data/Tool Sources

Online sources of publicly available datasets:

  • HealthData.gov: http://www.healthdata.gov/dataset/search
  • Centers for Disease Control: http://www.cdc.gov/nchs/data_access/sets/available_data.htm
  • National Cancer Institute: http://epi.grants.cancer.gov/dac/
  • Broad Institute (MIT): http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi
  • databaseSports.com: http://databasesports.com/
  • USA.gov: http://www.usa.gov/Business/Business-Data.shtml
  • National Bureau of Economic Research: http://www.nber.org/data/
  • The Data Page (NYU): http://pages.stern.nyu.edu/~adamodar/New_Home_Page/data.html
  • Stanford Network Analysis Project (SNAP): http://snap.stanford.edu/data/index.html
  • Yelp: https://www.yelp.com/academic_dataset

Other Data Sources

Web Sites & Blogs

UCI

Software Tools:

The R Programming Language: http://www.r-project.org/

R Studio (Integrated Development Environment for R): http://www.rstudio.com/

Python and IPython: https://www.python.org/, http://ipython.org/index.html

GitHub and git: http://www.git-scm.com/, https://github.com/

Online tutorials:

Python:

R:

GitHub/git:

Additional Online Stat courses:

Work Opportunities:


Answer :

Topic: Health care Data (Hospital)

Research Question: What are the factors impacting the total cost to the hospital for a patient?

The explanation is given step by step alongside the code.

  • The below code is used to import libraries for various functions.

```{r libraries, echo=TRUE, message=FALSE, warning=FALSE}

library(stats)    #for regression

library(caret)    #for data partition

library(car)      #for VIF

library(sandwich) #for variance, covariance matrix

```

  • The code below imports the data and reads empty fields, single spaces and the string "NA" as NA values. It also drops columns 58 to 62, which are of no use.

```{r readData, echo=TRUE,tidy=TRUE}

raw_df <- read.csv("D:/hospital.csv", header = TRUE,sep = ",",na.strings = c(""," ", "NA"))

raw_df <- raw_df[,-c(58:62)]

raw_df

```
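To see what the `na.strings` argument does without needing the hospital file, here is a minimal sketch on a made-up three-row CSV held in memory. The column names and values are hypothetical and are not taken from the actual data set.

```r
# Hypothetical CSV content (not the hospital data), built line by line.
csv_text <- paste("AGE,GENDER,HISTORY",
                  "58,M,Diabetes1",
                  " ,F,",
                  "45,F,NA",
                  sep = "\n")

# Same na.strings setting as in the chunk above: empty fields, single
# spaces and the literal string "NA" are all read as missing values.
toy_df <- read.csv(text = csv_text, header = TRUE, sep = ",",
                   na.strings = c("", " ", "NA"))

print(toy_df)

# Count the missing values per column.
print(colSums(is.na(toy_df)))
```

Counting missing values per column right after import is a quick way to confirm that blanks were read as NA rather than as empty strings.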

  • The code below prints the structure and a summary of the data, then cleans the Past Medical History Code column. The value "Hypertension1" is recoded to "hypertension1" so that both spellings fall into one category. There are 175 NA values in this column. Rather than treating them as missing, we take them to mean that the patient has no past medical history and mark them as "None". If the column is stored as a factor, assigning "None" directly fails with an invalid-factor-level warning and leaves the values as NA, because "None" is not an existing level of **raw_df$PAST.MEDICAL.HISTORY.CODE**. To add the new level, we first typecast the variable to character, fill in "None", and then typecast it back to a factor.

```{r summarizeData, echo=TRUE,tidy=TRUE}

str(raw_df)

summary(raw_df)

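# Convert to character first so that recoding works whether or not the column was read as a factor

raw_df$PAST.MEDICAL.HISTORY.CODE <- as.character(raw_df$PAST.MEDICAL.HISTORY.CODE)

# Merge the two spellings of the same code (%in% is safe when the column contains NA)

raw_df$PAST.MEDICAL.HISTORY.CODE[raw_df$PAST.MEDICAL.HISTORY.CODE %in% "Hypertension1"] <- "hypertension1"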

raw_df$PAST.MEDICAL.HISTORY.CODE[is.na(raw_df$PAST.MEDICAL.HISTORY.CODE)] <- "None"

raw_df$PAST.MEDICAL.HISTORY.CODE <- as.factor(raw_df$PAST.MEDICAL.HISTORY.CODE)

```
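The factor-level issue described above can be reproduced on a tiny made-up vector. This is only an illustrative sketch; `history` is a hypothetical stand-in for the Past Medical History Code column.

```r
# Hypothetical factor with two missing values.
history <- factor(c("Diabetes1", NA, "hypertension1", NA))

# Direct assignment of a value that is not an existing level:
# R warns ("invalid factor level") and the entries stay NA.
attempt <- history
suppressWarnings(attempt[is.na(attempt)] <- "None")
print(sum(is.na(attempt)))   # the two NA values are still there

# Round trip through character: fill in "None", then rebuild the factor.
history_chr <- as.character(history)
history_chr[is.na(history_chr)] <- "None"
history_fixed <- factor(history_chr)

print(levels(history_fixed))
print(table(history_fixed))
```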

  • The code below creates a new data frame that keeps only the columns needed for the analysis and removes rows with missing values (listwise deletion). The raw data frame is left intact in case further manipulation is needed.

```{r createDataCopy, echo=TRUE,tidy=TRUE}

new_df <- raw_df[,c(-1,-4,-5,-7,-9:-21,-23,-25,-31:-36,-41,-42,-44,-46,-48,-56)]

new_df <- na.omit(new_df) # listwise deletion of missing

head(new_df)

```

  • The code below computes the correlation between the numeric variables of the data set. When two variables are highly correlated, we try to remove one of them from the data set. Here the high correlations are between Age, Actual Receivable Amount, Total Cost To Hospital, Length Of Stay ICU, Total Length Of Stay & BP High.

Among the numeric attributes in the data, it is of interest to identify the variables that are correlated with each other. High correlation among predictor variables may result in **multi-collinearity** in the model.

```{r corMatrix, echo=TRUE,tidy=TRUE}

correlationMatrix <- cor(new_df[,c(1,7:10,12:14,18:24,26)])

print(correlationMatrix)

# find attributes that are highly correlated (absolute correlation above 0.7)

highlyCorrelated <- findCorrelation(correlationMatrix, cutoff = 0.7, names = TRUE)

print(highlyCorrelated)

```
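For readers who want to see which pairs exceed the cutoff rather than only the names suggested for removal, the following base R sketch lists them. It runs on simulated data with hypothetical columns (`weight`, `height`, `pulse`), not on the hospital data, and it is not a reimplementation of `findCorrelation()`.

```r
set.seed(1)
n <- 200
weight <- rnorm(n, mean = 60, sd = 10)
toy <- data.frame(
  weight = weight,
  height = 100 + weight + rnorm(n, sd = 3),  # constructed to track weight closely
  pulse  = rnorm(n, mean = 80, sd = 8)       # independent of the other two
)

cm <- cor(toy)

# Keep each pair once (upper triangle) and flag absolute correlations above 0.7.
high <- which(abs(cm) > 0.7 & upper.tri(cm), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
print(pairs_df)
```

Seeing the pairs makes it easier to decide which variable of each pair to drop or combine, which is what the next step does with weight and height.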

  • Body Weight and Body Height were highly correlated, so the code below combines both fields into a single new field, BMI, to reduce the correlation. Similarly, a new variable I Cost of Implant is created because Implant Used and Cost of Implant are highly correlated.

Deriving BMI lets us drop Weight and Height as variables; both were highly correlated with age. Creatinine is also dropped because it is highly correlated with age.

```{r}

new_df$BMI <- new_df$BODY.WEIGHT/((new_df$BODY.HEIGHT/100) ^ 2)

new_df$I_COST.OF.IMPLANT <- model.matrix(~new_df$IMPLANT.USED..Y.N.)[,2]*new_df$COST.OF.IMPLANT

filter_df <- new_df[,c(-5:-6)]

```
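As a worked example of the two derived variables, here is a small sketch with two made-up patients. It assumes weight is in kilograms and height in centimetres, which the division by 100 in the chunk above suggests but the source does not state.

```r
# Two hypothetical patients (values are invented for illustration).
toy <- data.frame(
  BODY.WEIGHT        = c(70, 55),     # assumed to be in kg
  BODY.HEIGHT        = c(175, 160),   # assumed to be in cm
  IMPLANT.USED..Y.N. = factor(c("Y", "N")),
  COST.OF.IMPLANT    = c(38000, 0)
)

# BMI = weight in kg divided by the square of height in metres.
toy$BMI <- toy$BODY.WEIGHT / (toy$BODY.HEIGHT / 100)^2

# Interaction term: implant cost counts only when an implant was used.
implant_flag <- as.numeric(toy$IMPLANT.USED..Y.N. == "Y")
toy$I_COST.OF.IMPLANT <- implant_flag * toy$COST.OF.IMPLANT

print(round(toy$BMI, 2))
print(toy$I_COST.OF.IMPLANT)
```

The explicit 0/1 flag plays the same role as the second column of `model.matrix()` in the chunk above when the factor has the two levels "N" and "Y".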

  • The code below relevels the Past Medical History Code column so that "None" is the first (reference) level.

By default, the base category/reference category is the first level in alphabetical order. In this code chunk we are just changing the base category for the PAST.MEDICAL.HISTORY.CODE variable.

The base category can be releveled using the function **relevel()**.

```{r relevelCategory, echo=TRUE,tidy=TRUE}

filter_df$PAST.MEDICAL.HISTORY.CODE <- relevel(filter_df$PAST.MEDICAL.HISTORY.CODE, ref = "None")

```
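The effect of `relevel()` is easy to check on a small made-up factor; the values below are hypothetical.

```r
# Hypothetical history codes.
history <- factor(c("Diabetes1", "None", "hypertension1", "None"))

# Before: the reference level is whichever level sorts first.
print(levels(history))

# After: "None" is moved to the front and becomes the reference category.
history <- relevel(history, ref = "None")
print(levels(history))
```

In a regression, the coefficients of the other levels are then interpreted as differences from patients with no past medical history.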

  • The code below splits the data set into training and test sets. The proportion assigned to training is p = 0.80, a common choice; any other reasonable proportion can be used.

```{r createDataPartition, echo=TRUE,tidy=TRUE}

set.seed(2341)

trainIndex <- createDataPartition(filter_df$TOTAL.COST.TO.HOSPITAL, p = 0.80, list = FALSE)

train_df <- filter_df[trainIndex,]

test_df <- filter_df[-trainIndex,]

```
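For comparison, an 80/20 split can also be sketched in base R with `sample()`. Note that this is plain random sampling on simulated data, a simpler technique swapped in for illustration; `createDataPartition()` additionally tries to balance the distribution of the outcome across the two sets.

```r
set.seed(2341)

# Simulated stand-in for the modelling data (100 hypothetical patients).
toy <- data.frame(id = 1:100, cost = rexp(100, rate = 1 / 50000))

# Draw 80% of the row indices at random for training.
train_rows <- sample(seq_len(nrow(toy)), size = floor(0.80 * nrow(toy)))

train_toy <- toy[train_rows, ]
test_toy  <- toy[-train_rows, ]

print(c(train = nrow(train_toy), test = nrow(test_toy)))
```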

  • Transformation of variables may be needed to satisfy the model assumptions. The code below adds the natural log of Total Cost to Hospital to both the training and the test set.

```{r}

train_df$Log.Cost.Treatment <- log(train_df$TOTAL.COST.TO.HOSPITAL)

test_df$Log.Cost.Treatment <- log(test_df$TOTAL.COST.TO.HOSPITAL)

```

  • We can pull the specific attributes needed to build the model into another data frame. This again is more of a hygiene practice, so that the **train** and **test** data sets are not touched directly.

```{r variableUsedinTraining, echo=TRUE,tidy=TRUE}

reg_train_df <- as.data.frame(train_df[,c("AGE",

                                             "HR.PULSE",

                                             "BP..HIGH",

                                             "RR",

                                             "HB",

                                             "UREA",

                                             #"TOTAL.LENGTH.OF.STAY",

                                             "BMI",

                                             #"COST.OF.IMPLANT",

                                             #"IMPLANT.USED..Y.N.",

                                             "I_COST.OF.IMPLANT",

                                             "GENDER",

                                             "MARITAL.STATUS",

                                             "KEY.COMPLAINTS..CODE",

                                             "PAST.MEDICAL.HISTORY.CODE",

                                             "MODE.OF.ARRIVAL",

                                             #"STATE.AT.THE.TIME.OF.ARRIVAL",

                                             "TYPE.OF.ADMSN",

                                             "TOTAL.COST.TO.HOSPITAL"

                                             #"Log.Cost.Treatment"

)])

```

```{r variableUsedinTesting, echo=TRUE, tidy=TRUE}

reg_test_df <- as.data.frame(test_df[,c("AGE",

                                             "HR.PULSE",

                                             "BP..HIGH",

                                             "RR",

                                             "HB",

                                             "UREA",

                                             #"TOTAL.LENGTH.OF.STAY",

                                             "BMI",

                                             #"COST.OF.IMPLANT",

                                             #"IMPLANT.USED..Y.N.",

                                             "I_COST.OF.IMPLANT",

                                             "GENDER",

                                             "MARITAL.STATUS",

                                             "KEY.COMPLAINTS..CODE",

                                             "PAST.MEDICAL.HISTORY.CODE",

                                             "MODE.OF.ARRIVAL",

                                             #"STATE.AT.THE.TIME.OF.ARRIVAL",

                                             "TYPE.OF.ADMSN",

                                             "TOTAL.COST.TO.HOSPITAL"

                                             #"Log.Cost.Treatment"

)])

```

  • The actual model building starts now. Note that we are demonstrating the strategy of building a stepwise model (forward selection and backward elimination) using the **lm()** and **step()** functions.

```{r buildModel, echo=TRUE, message=FALSE, warning=FALSE, tidy=TRUE}

#Null Model

no_model <- lm(TOTAL.COST.TO.HOSPITAL ~ 1,data = reg_train_df)

#Full Model

reg_full_model = lm(TOTAL.COST.TO.HOSPITAL ~., data = reg_train_df)

#Stepwise - Forward selection backward elimination

reg_step_model <- step(no_model, list(lower = formula(no_model),

                                         upper = formula(reg_full_model)),

                           direction = "both",trace = 0)

# Start from the full model; the lower bound must be the null model so that terms can be dropped

reg_step_model1 <- step(reg_full_model, list(lower = formula(no_model),

                                         upper = formula(reg_full_model)),

                           direction = "both",trace = 0)

```
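The same stepwise pattern can be tried on R's built-in `mtcars` data, which is used here only because it ships with R. This is a sketch of the technique, not a rerun of the hospital model.

```r
data(mtcars)

# Null model (intercept only) and full model (all predictors).
null_fit <- lm(mpg ~ 1, data = mtcars)
full_fit <- lm(mpg ~ ., data = mtcars)

# Stepwise search starting from the null model; terms may be added or
# dropped at each step, and the search stops when AIC no longer improves.
step_fit <- step(null_fit,
                 scope = list(lower = formula(null_fit),
                              upper = formula(full_fit)),
                 direction = "both", trace = 0)

print(formula(step_fit))
print(c(null = AIC(null_fit), stepwise = AIC(step_fit), full = AIC(full_fit)))
```

Setting `trace = 1` instead of `trace = 0` prints each step, which helps in following why a variable was added or removed.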

  • Next we check whether the model satisfies the assumptions of the linear regression model. Note that this evaluation is on training data.

The model summary gives the equation of the model and helps test whether the beta coefficients are statistically different from zero.

```{r modelStats,tidy=TRUE}

summary(reg_step_model)

```

The output of summary is given below:

[Image: output of summary(reg_step_model)]

Here we can see clearly that Type of Admission Emergency, I Cost of Implant (which is a combination of Implant Used Y N and Cost of Implant) and Age of the Patient are highly significant, which means Total Cost to Hospital depends mainly on these 3 factors.

An R-squared of 56.35% means that the model explains 56.35% of the variance in Total Cost to Hospital on the training data.

The p-value of HR Pulse is greater than 0.05, so it is not treated as a significant factor.
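To make the meaning of R-squared concrete, the sketch below recomputes it by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses the built-in `mtcars` data and a hypothetical two-predictor model, not the hospital model.

```r
# Hypothetical model on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)                      # unexplained variation
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)     # total variation
r2_manual <- 1 - rss / tss

print(r2_manual)
print(summary(fit)$r.squared)   # same quantity as reported by summary()
```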

  • The diagnostic plots of the model are examined next.

[Image: diagnostic plots of the model]

The Normal Q-Q plot shows whether the error terms are normally distributed. If they are, the points should lie close to the straight line without much deviation. Here the points stay close to the line, so the normality assumption for the residuals appears reasonable.

Of the diagnostic plots, we generally look only at the Normal Q-Q plot.
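The plotting code is not shown in the source. A minimal sketch of how a Normal Q-Q plot is typically drawn for an `lm` object follows, again on `mtcars` with a hypothetical model. The correlation between theoretical and sample quantiles is an extra numeric check added here for illustration and is not part of the original analysis.

```r
# Hypothetical model on built-in data.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# When run non-interactively, send graphics to a null device so no file is written.
if (!interactive()) pdf(NULL)

# which = 2 selects the Normal Q-Q plot from the standard lm diagnostics.
plot(fit, which = 2)

if (!interactive()) invisible(dev.off())

# Numeric companion to the plot: points on a straight line give a
# correlation close to 1 between theoretical and sample quantiles.
qq <- qqnorm(rstandard(fit), plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)
print(qq_cor)
```

Calling `plot(fit)` without `which` cycles through all the standard diagnostic plots.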

  • The code below checks for multicollinearity using the variance inflation factor (VIF). Since the values are low, there is no sign of problematic multicollinearity.

```{r}

vif(reg_step_model)

```
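What VIF measures can be shown by computing it from its definition: each predictor is regressed on the other predictors, and VIF = 1 / (1 - R²) of that auxiliary regression. The sketch below does this in base R for a hypothetical model with numeric predictors on `mtcars`; for factor predictors, `car::vif()` reports a generalized version instead.

```r
# Hypothetical set of numeric predictors on built-in data.
predictors <- c("wt", "hp", "qsec")

vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  # Auxiliary regression of one predictor on all the others.
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. Values above roughly 5 to 10 are commonly treated as a warning sign, though that threshold is a rule of thumb rather than a fixed rule.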

  • The code below applies the model fitted on the training data to the test data set and shows the predictions, with 95% confidence intervals, next to the actual costs.

```{r modelValidation, echo=TRUE,tidy=TRUE}

reg_test_df_predict = predict(reg_step_model, reg_test_df,

                            interval = "confidence",

                            level = 0.95,

                            type = "response")

data.frame(reg_test_df_predict, reg_test_df$TOTAL.COST.TO.HOSPITAL)

```
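Comparing predictions with actual values row by row is hard to summarise, so a single error figure is often computed as well. The sketch below shows one way to do that with RMSE and MAPE on the built-in `mtcars` data and a hypothetical model. These metrics are an illustrative addition and do not appear in the original analysis.

```r
set.seed(2341)

# Simple 80/20 random split of built-in data.
idx   <- sample(seq_len(nrow(mtcars)), size = floor(0.8 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

fit  <- lm(mpg ~ wt + hp, data = train)
pred <- predict(fit, newdata = test, interval = "confidence", level = 0.95)

# Prediction errors on the held-out rows.
errors <- test$mpg - pred[, "fit"]
rmse <- sqrt(mean(errors^2))                  # root mean squared error
mape <- mean(abs(errors / test$mpg)) * 100    # mean absolute percentage error

print(head(data.frame(pred, actual = test$mpg)))
print(c(RMSE = rmse, MAPE = mape))
```

Note that `interval = "confidence"` gives an interval for the mean response at each set of predictor values; `interval = "prediction"` gives the wider interval for an individual new observation.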

Conclusion:

From the above analysis we can say that the Total Cost to Hospital for a patient depends mainly on 3 variables:

  1. Type of Admission Emergency.
  2. I Cost of Implant (Which is a combination of Implant Used Y N and Cost of Implant).
  3. Age of the patient.