FIT1043 Assignment 1: Description
Aim
This assignment aims to explore and visualise data using Python as a data science tool. It will test your ability to:
Data
COVID-19 is a respiratory illness caused by a new virus that has changed our lives significantly. We aim to explore two datasets that contain relevant information about the virus and see whether decisions and features such as applying a lockdown, or the GDP of a country, had any effect on the spread of the coronavirus.
For this analysis, we need the new and total confirmed cases and deaths due to coronavirus, as well as the GDP and the lockdown date of each country.
We use the following two datasets in this assignment:
Moreover, the data set has the following columns:
The lockdown dataset (CountryLockdowndates.csv) which contains information about the lockdown date and the name of the country which applied the lockdown.
Hand-in Requirements
Please hand in three files: a PDF file containing your answers, a CSV file containing the cleansed data set, and a Jupyter notebook file (.ipynb) containing your Python code for all the questions. Please consider the following requirements for your submission:
PDF file should contain:
To generate a PDF report, you can write your report in Word, but you need to convert it to PDF before submission. Alternatively, an easier way is to generate a PDF version of your Jupyter notebook by pressing Ctrl+P in the Jupyter notebook. This PDF file is mandatory because Monash University uses it for the Turnitin check.
1. Ipynb file should contain:
Your Python code for this assignment. Please use the template provided under Assignment 1 resources on Moodle (‘StudentID_FIT1043_Assignment1_Template.ipynb’).
2. CSV file should contain:
You will need to submit three separate files. “Zip”, “rar” or any other similar file compression format is not acceptable and will incur a penalty of 10%.
You will be penalized by 5% of the assignment mark (5% out of 10 marks) for every day that your submission is late. If you cannot submit your assignment before the due date, please make sure to submit your files no later than 7 days after the due date; we do not mark assignments submitted after 11:55 pm on the 11th of September.
Assignment Tasks:
There are two tasks that you need to complete for this assignment. You need to use Python to complete the tasks.
Task 1-Data wrangling
First, you need to extract the required information from two data sources, namely “Covid-data.csv” and “CountryLockdowndates.csv”, based on the analysis requirements mentioned in the previous section, ‘Data’. Then, you need to clean the data and integrate the data sets. We call this process data wrangling! Please note that you should not delete any row from the dataset Covid-data.csv during the data wrangling process.
Regarding the cleansing of the data set Covid-data.csv, you need to check all the columns one by one and make sure their values are correct. For example, we do not expect to see any value higher than 100% in a column which shows a percentage. Moreover, if some values are missing, you may be able to find the correct values based on the values of other columns. This is an important part of data science, so make sure you check every column, detect its errors and fix them.
Please note that the lockdown information may list different dates for different states/provinces of a country. Consider the earliest (minimum) date as the lockdown date for a country.
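The two kinds of check described above can be sketched as follows. This is a minimal illustration on made-up values, not the expected solution: the column `pct_example` is a hypothetical percentage column, and rebuilding `total_cases` from the running sum of `new_cases` is an assumption that only holds if a country's records start from its first reported case.

```python
import pandas as pd

# Made-up stand-in for a few rows of Covid-data.csv.
covid = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"]
    ),
    "new_cases": [1, 2, 3, 5, 5],
    "total_cases": [1, None, 6, 5, 10],
    "pct_example": [12.0, 150.0, 30.0, 99.0, 100.0],  # hypothetical column
})

# Flag (do not delete) rows whose percentage is outside 0-100.
# Note: missing percentages are flagged too, because between() is False for NaN.
bad_pct = ~covid["pct_example"].between(0, 100)

# Rebuild a missing total from the running sum of new_cases within each country.
covid = covid.sort_values(["location", "date"])
running_total = covid.groupby("location")["new_cases"].cumsum()
covid["total_cases"] = covid["total_cases"].fillna(running_total)
```

Flagging suspicious rows first, rather than dropping them, keeps every row of Covid-data.csv in place, as the task requires.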
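One way to reduce the lockdown data to a single date per country and attach it to the COVID data is sketched below. The column names `Country/Region`, `Province` and `Date`, the day-first date format, and all values are assumptions for illustration; check them against the actual CountryLockdowndates.csv.

```python
import pandas as pd

# Made-up stand-in for CountryLockdowndates.csv; column names are assumed.
lockdown = pd.DataFrame({
    "Country/Region": ["Australia", "Australia", "Italy", "Sweden"],
    "Province": ["Victoria", "Tasmania", None, None],
    "Date": ["23/03/2020", "19/03/2020", "09/03/2020", None],
})

# Parse day-first date strings; missing or unparseable values become NaT.
lockdown["Date"] = pd.to_datetime(
    lockdown["Date"], format="%d/%m/%Y", errors="coerce"
)

# One row per country, keeping the earliest date across states/provinces.
country_lockdown = (
    lockdown.groupby("Country/Region", as_index=False)["Date"]
    .min()
    .rename(columns={"Country/Region": "location", "Date": "lockdown_date"})
)

# Made-up stand-in for Covid-data.csv.
covid = pd.DataFrame({
    "location": ["Australia", "Australia", "Italy", "Brazil"],
    "date": ["2020-03-18", "2020-03-19", "2020-03-09", "2020-03-20"],
    "new_cases": [10, 20, 30, 40],
})

# A left merge keeps every row of the COVID data, even for countries
# that have no lockdown record (their lockdown_date is NaT).
merged = covid.merge(country_lockdown, on="location", how="left")
```

A left merge is used here because an inner merge would silently drop countries that are absent from the lockdown file.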
You need to export the cleansed dataframe that results from this task as a CSV file at the end of the task and submit it in Moodle with the other two files as required. Please name the file as follows:
<student_ID>_Task1DataSet.csv.
Following is a screenshot showing the columns of a dataframe which is the required output of this task (Order of columns is important).
The required column names and their order in the CSV file are as follows: (location, date, total_cases, new_cases, total_deaths, new_deaths, gdp_per_capita, population, lockdown_date)
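The export step might look like the sketch below. The helper `export_task1`, the student ID `12345678` and the one-row dataframe are hypothetical; only the column list and the file name pattern come from this description.

```python
import tempfile
from pathlib import Path

import pandas as pd

REQUIRED_COLUMNS = [
    "location", "date", "total_cases", "new_cases", "total_deaths",
    "new_deaths", "gdp_per_capita", "population", "lockdown_date",
]


def export_task1(df, student_id, out_dir="."):
    """Write df with the required columns, in the required order."""
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    path = Path(out_dir) / f"{student_id}_Task1DataSet.csv"
    # index=False avoids writing the dataframe index as an extra column.
    df[REQUIRED_COLUMNS].to_csv(path, index=False)
    return path


# Made-up one-row dataframe with its columns deliberately out of order.
demo = pd.DataFrame([{
    "date": "2020-03-19", "location": "A", "population": 1000,
    "new_cases": 2, "total_cases": 5, "new_deaths": 0, "total_deaths": 1,
    "lockdown_date": "2020-03-19", "gdp_per_capita": 100.0,
}])

with tempfile.TemporaryDirectory() as tmp:
    written = export_task1(demo, "12345678", tmp)
    header = written.read_text().splitlines()[0]
```

Selecting `df[REQUIRED_COLUMNS]` both fixes the column order and drops any helper columns created during cleaning.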
Task 2-Exploration
In this part, you need to explore the dataset which you generated in Task 1. Please note that exploration is more than a visualisation with a brief explanation. You can watch the assignment explanation recording provided in Moodle for further clarification about how a good exploration can be performed on a dataset.
1. Create a line chart to show the trend of the daily number of new cases for each country and explore the result of the visualisation (create one line chart for each country).
2. Add a vertical line for the lockdown date to the line chart of each country which you created in the previous question, and explore whether the lockdown affected the trend shown in the plot. Is the effect similar for all countries? Why do you think so?
Following is an example of the expected plot for this question.
Figure 1. Example for the output of question 2 of Task 2
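A plot of this kind could be produced along the following lines. This is a sketch on made-up data, assuming the Task 1 column names and that `date` and `lockdown_date` have already been converted to datetimes; the helper `plot_new_cases` is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line in a Jupyter notebook
import matplotlib.pyplot as plt
import pandas as pd


def plot_new_cases(df, country):
    """Line chart of daily new cases for one country, with its lockdown date."""
    sub = df[df["location"] == country].sort_values("date")
    fig, ax = plt.subplots(figsize=(8, 3))
    ax.plot(sub["date"], sub["new_cases"], label="new cases")

    # The lockdown date is repeated on every row, so take the first one.
    lockdown = sub["lockdown_date"].iloc[0]
    if pd.notna(lockdown):  # some countries have no recorded lockdown
        ax.axvline(lockdown.to_pydatetime(), color="red",
                   linestyle="--", label="lockdown")

    ax.set_title(country)
    ax.set_xlabel("date")
    ax.set_ylabel("new cases")
    ax.legend()
    fig.autofmt_xdate()
    return ax


# Made-up data for a single country "A".
demo = pd.DataFrame({
    "location": ["A"] * 6,
    "date": pd.date_range("2020-03-01", periods=6),
    "new_cases": [1, 3, 8, 12, 9, 5],
    "lockdown_date": [pd.Timestamp("2020-03-04")] * 6,
})
ax = plot_new_cases(demo, "A")
```

To get one chart per country, the function can be called in a loop over `df["location"].unique()`.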
3. Explore whether there is a relation between the daily new case/death rate and the GDP of a country. To this end, you need to calculate:
Then, you need to create two line charts, one which shows the new case rate of groups "AboveGDP" and "BelowGDP"; and, another line chart to show the death rate of the two groups ("AboveGDP" and "BelowGDP").
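The grouping might be computed as in the sketch below. The quantities to calculate are not listed in this description, so everything specific here is an assumption: the threshold is taken as the mean GDP per capita across countries, and each rate is taken as the group's daily count divided by the group's total population. The data is made up.

```python
import pandas as pd

# Made-up Task 1 style data for three countries over two days.
df = pd.DataFrame({
    "location": ["A", "A", "B", "B", "C", "C"],
    "date": pd.to_datetime(["2020-03-01", "2020-03-02"] * 3),
    "new_cases": [10, 20, 1, 3, 4, 6],
    "new_deaths": [1, 2, 0, 1, 0, 0],
    "gdp_per_capita": [50000.0] * 2 + [10000.0] * 2 + [15000.0] * 2,
    "population": [1_000_000] * 2 + [2_000_000] * 2 + [3_000_000] * 2,
})

# Assumed threshold: mean GDP per capita, counting each country once.
per_country = df.drop_duplicates("location")
threshold = per_country["gdp_per_capita"].mean()

# Countries with missing GDP would fall into "BelowGDP" here,
# because a comparison with NaN is False.
df["gdp_group"] = df["gdp_per_capita"].gt(threshold).map(
    {True: "AboveGDP", False: "BelowGDP"}
)

# Assumed rate definition: daily group total divided by group population.
daily = df.groupby(["date", "gdp_group"])[
    ["new_cases", "new_deaths", "population"]
].sum()
daily["new_case_rate"] = daily["new_cases"] / daily["population"]
daily["death_rate"] = daily["new_deaths"] / daily["population"]

# One column per group, one row per date: ready for a line chart.
case_rate = daily["new_case_rate"].unstack("gdp_group")
death_rate = daily["death_rate"].unstack("gdp_group")
```

With the tables in this shape, calling `case_rate.plot()` and `death_rate.plot()` draws one line per group on each chart.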