
comp-1709-datascience-coursework: Introduction

Course: COMP1709 Information Visualisation

This is my attempt at this coursework; it took me around 10 hours to complete.

Requirements

Anaconda Python distribution recommended for best results; Python 3.7 or above.

To install dependencies (scikit-learn and xgboost are needed for the modelling steps):

pip install pandas matplotlib numpy hvplot notebook scikit-learn xgboost

The code can be run as a notebook in Google Colab or as a standalone script with the Python executable.

Objective of this exercise: to analyse and model publicly available data using the RandomForest (bagging) and XGBoost (boosting) machine learning algorithms to predict real-time market correlations.

Methodology Followed: The methodology followed for this exercise can be summarized as follows:

  1. Collecting the data (not included)
  2. Compiling the data into a single CSV file for analysis (not included, since the dataset is publicly available and accessible over HTTPS)
  3. Importing, inspecting and cleaning the data (not included)
  4. Visual inspection of the data
  5. Hyperparameter tuning and cross-validation for RandomForest and XGBoost
  6. Predicting customer-visit correlations, outliers and standard deviations using the best estimators and visualisations
  7. Visualising and exporting the final results
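Step 5 above can be sketched with scikit-learn's GridSearchCV. The synthetic data, parameter grid and scoring choice below are illustrative assumptions, not the settings actually used in the notebook:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the compiled store data (illustrative only).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 4))
y = X[:, 0] * 3 + rng.normal(scale=0.1, size=200)

# A small grid; real tuning would explore more values per parameter.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=3,                                # 3-fold cross-validation
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
best_model = search.best_estimator_      # the "best estimator" used in step 6
print(search.best_params_)
```

The same pattern applies to XGBoost by swapping in its regressor and an XGBoost-specific grid.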

Traditionally, time series data is modelled using statistical approaches such as Exponential Smoothing, ARIMA or SARIMAX models. An important peculiarity of time series data is autocorrelation, i.e. the dependency of the current value on past values. Further, time series data is time-stamped, which means there is a chronological order in the data that cannot be directly recognised by machine learning models. Thus, we use a technique called reduction, which transforms the available features to account for the time dependency in the data.
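The reduction idea can be sketched in a few lines of pandas: lagged copies of the series become ordinary feature columns, and the time stamp is encoded as a categorical feature. The sample values are hypothetical:

```python
import pandas as pd

# Daily customer visits for one store (illustrative values).
visits = pd.Series(
    [120, 135, 128, 150, 142, 160, 155],
    index=pd.date_range("2021-01-01", periods=7, freq="D"),
    name="customers",
)

# "Reduction": turn the series into a supervised-learning table by
# adding lagged copies of itself, so a model can see past values.
df = pd.DataFrame({"customers": visits})
for lag in (1, 2):
    df[f"lag_{lag}"] = df["customers"].shift(lag)
df["dayofweek"] = df.index.dayofweek   # encode the time stamp as a feature
df = df.dropna()                       # the first rows have no lag history

print(df)
```

A tree-based model trained on `lag_1`, `lag_2` and `dayofweek` can then predict `customers` without any notion of chronological order.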

The notebook associated with this coursework is divided into 3 sections as follows:

  1. Visual inspection of data & feature engineering
  2. Implementing predictions (to do)
  3. Report of the coursework (to be added later)

Course requirements and information

This coursework counts for 100% of the course Information Visualisation (Term 2). A PDF file is required; a ZIP file is also required.
Greenwich Course Leader: Dr Chris Walshaw
Due date: 22nd April 2021

Learning Outcomes

  1. Identify and discuss fundamental concepts related to visualisation.
  2. Demonstrate an understanding of different types of information visualisation and identify appropriate types of visualisation for various types of data.
  3. Design, implement and evaluate interactive visualisation systems.
  4. Apply visualisation tools and techniques to obtain insight from datasets.

Plagiarism

Plagiarism is presenting somebody else's work as your own. It includes: copying information directly from the Web or books without referencing the material; submitting joint coursework as an individual effort; copying another student's coursework; stealing or buying coursework from someone else and submitting it as your own work. Suspected plagiarism will be investigated and, if found to have occurred, will be dealt with according to the procedures set down by the University. All material copied or amended from any source (e.g. internet, books) must be referenced correctly according to the reference style you are using. Your work will be submitted for electronic plagiarism checking. Any attempt to bypass our plagiarism detection systems will be treated as a severe Assessment Offence.

Coursework Submission Requirements

An electronic copy of your work for this coursework should be fully uploaded by midnight (local time) on the Deadline Date. The last version you upload will be the one that is marked. For this coursework you must submit a single Acrobat PDF document. In general, any text in the document must not be an image (i.e. must not be scanned) and would normally be generated from other documents (e.g. MS Office using "Save As .. PDF"). You must also upload a single ZIP file containing supporting evidence. There are limits on the file size; the current limits are displayed on the coursework submission page on the Intranet. Make sure that any files you upload are virus-free and not protected by a password or corrupted, otherwise they will be treated as null submissions. Comments on your work will be available from the Coursework page on the Intranet, and the grade will be made available in the portal. You must NOT submit a paper copy of this coursework; all coursework must be submitted as above. The University website has details of the current Coursework Regulations, including details of penalties for late submission, procedures for Extenuating Circumstances, and penalties for Assessment Offences. See http://www2.gre.ac.uk/current-students/regs for details.

Detailed specification

You are to carry out a data exploration for ChrisCo, the fictional company whose sales and website data we have been analysing throughout the course, using a Python Notebook (in Colab or Jupyter) and producing visualisations of store / customer data. The dataset concerns the company's 40 stores in the North of the country, each identified by a unique 3-letter code (e.g. ABC, XYZ, etc.). However, each student on the course has their own randomised dataset to explore, and the codes are randomised so that a store code in one student's dataset is very unlikely to represent the same store in another student's.

Data

You will find your data in the following CSV files, where BannerID is your student ID number (e.g. 001234567):

  • https://tinyurl.com/ChrisCoNorth/BannerID/DailyCustomers.csv - the daily number of customer visits to the company's 40 stores
  • https://tinyurl.com/ChrisCoNorth/BannerID/StoreMarketing.csv - the total annual spend on local marketing for each store
  • https://tinyurl.com/ChrisCoNorth/BannerID/StoreOverheads.csv - the total annual cost of overheads for each store
  • https://tinyurl.com/ChrisCoNorth/BannerID/StoreSize.csv - the store size (floor space) in metres squared for each store
  • https://tinyurl.com/ChrisCoNorth/BannerID/StoreStaff.csv - the total number of full-time staff employed at each store

Please contact your tutor if you cannot find your data files. You should compile your data into two dataframes: one containing daily customer data (one row for each date); the other compiled from all of the CSV files into a dataframe of summary data (with a row for each store).
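The two-dataframe compilation described above might look like the sketch below. The inline CSV text stands in for the downloaded files, and the column names and values are assumptions:

```python
import io
import pandas as pd

# Inline stand-ins for the downloaded CSV files (column names assumed).
daily_csv = io.StringIO(
    "Date,ABC,XYZ\n2021-01-01,120,80\n2021-01-02,135,95\n"
)
marketing_csv = io.StringIO("Store,Marketing\nABC,5000\nXYZ,3200\n")
staff_csv = io.StringIO("Store,Staff\nABC,12\nXYZ,7\n")

# Dataframe 1: daily customers, one row per date, one column per store.
daily = pd.read_csv(daily_csv, index_col="Date", parse_dates=True)

# Dataframe 2: summary data, one row per store, joined across the files
# on the 3-letter store code.
summary = (
    pd.read_csv(marketing_csv, index_col="Store")
    .join(pd.read_csv(staff_csv, index_col="Store"))
)
summary["MeanCustomers"] = daily.mean()  # aligns on store code

print(summary)
```

In practice each `pd.read_csv` call would take the tinyurl address (with your BannerID substituted) instead of the `StringIO` object, and the remaining summary files would be joined the same way.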

Report

Your task is to investigate the data visually and present some conclusions about any characteristics you discover, including correlations, seasonal behaviour, outliers, etc., together with a suggestion about how the data might best be segmented. The company is most interested in the large and medium-sized stores but would like a summary of the small stores plus any anomalies you identify in the data. You should also identify new stores that have been opened during the year or stores that the company has closed during the year. You should present your findings in the form of a PDF report for the company, i.e. based on the assumption that the reader knows nothing about data visualisation. The report should include:

  • A brief introduction to data visualisation (no more than ½ a page).
  • A discussion of your findings, including a total of 8 visualisations (no more, no less). Each visualisation should be accompanied by a paragraph of text presenting: a justification for including that particular visualisation; and a description of what the visualisation reveals about the data. Do not assume that the reader will necessarily recognise and understand correlations, seasonality and anomalies.
  • A critical review of your work, with a discussion of how best practices were demonstrated and applied (about ½ a page).
  • A summary of the conclusions you have made about the data (you are not required to make any business recommendations). The summary may contain conclusions as bullet points (no more than ½ a page).

For the 8 visualisations you include, you should choose your most illuminating charts / plots and paste in a screenshot. It is strongly recommended to use Insert > Screenshot in Word or the Windows snipping tool (or similar) and to carefully crop each screenshot so that it shows only the visualisation. Each visualisation should be carefully numbered and labelled, with a self-explanatory title and legend (if appropriate) and should be referred to in the text (e.g. "Figure 1 shows that ..."). Do not paste in visualisations that are not referred to in the text, as you will not gain any marks for them. The order of the visualisations should be carefully considered, leading the reader through the data exploration step by step, ideally with each visualisation leading on to the next one.

Notebook

Your Python Colab / Jupyter notebook should contain the details of your data exploration and support the report. The markdown should indicate the purpose of each preceding / following code section, but you do not have to present your findings here. The code should be written efficiently, so that you do not repeat unnecessary code in each section. At least 2 of the visualisations in the notebook should be interactive and provide functionality to explore the data in more detail. The markdown for these must include a clear description of available user interactions.

Deliverables

You must upload a single ZIP file containing:

  • The PDF report containing your 8 chosen visualisations
  • A supporting Python notebook (.ipynb) containing your data exploration

Marking scheme

The report will be marked on the discussion and analysis, together with both the quality and impact of the visualisations. The notebook will be marked on its organisation, presentation and efficiency of coding. There are also marks for the interactive visualisations. Each task is assessed as achieved well, partially achieved, or poorly / not achieved:

  • Report text (50%): introduction to data visualisation /10; discussion, justification of visualisations chosen /10; discussion, description of findings /10; critical review /10; data conclusions /10
  • Report visualisations (20%): presentation quality (labelling, legends, etc.) /10; impact (as part of the exploration) /10
  • Notebook (30%): organisation and presentation /10; code efficiency (non-duplication) /5; interactive visualisations, functionality /10; interactive visualisations, description /5

Grading criteria

  • 70-100%: All requirements completed to an excellent standard.
  • 60-69%: All requirements completed. However, there are a number of minor deficiencies in significant areas.
  • 50-59%: All requirements completed. However, significant improvements could be made in many areas.
  • 40-49%: All requirements completed. However, significant improvements could be made in all areas.
  • 30-39%: All requirements attempted but the overall level of understanding and performance is poor.
  • 0-29%: There are requirements missing or completed to a very inadequate standard, which indicates a very poor or non-existent level of understanding.

The report should be succinct and so must not contain more than 8 visualisations, although you may use the technique of facetting (i.e. a number of subplots in a single figure). Reports with 9-10 visualisations will be capped at 60% and those with 11 or more visualisations will be capped at 30%. However, your notebook may contain as many visualisations as you need to carry out the investigation.
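The specification asks for stores opened or closed during the year to be identified. One hedged way to do this from the daily customer dataframe is to look for leading or trailing runs of zero counts; the assumption that zero means "not trading", and the sample values, are hypothetical:

```python
import pandas as pd

# Daily customers per store; zeros before opening / after closing (assumed).
daily = pd.DataFrame(
    {
        "ABC": [0, 0, 50, 60, 55],    # opened partway through the period
        "XYZ": [40, 45, 42, 0, 0],    # closed partway through the period
        "DEF": [30, 32, 31, 29, 33],  # trading the whole time
    },
    index=pd.date_range("2021-01-01", periods=5, freq="D"),
)

active = daily.gt(0)
first_day = active.idxmax()        # first date each store was active
last_day = active[::-1].idxmax()   # last date each store was active

opened_late = first_day[first_day > daily.index[0]].index.tolist()
closed_early = last_day[last_day < daily.index[-1]].index.tolist()
print("opened during year:", opened_late)
print("closed during year:", closed_early)
```

The resulting store codes can then be called out in the report alongside the other anomalies.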

Contributors

peterkuria

