
This repository is the final product of assignment 3, requested by the course MSB1015 (Scientific Programming). It contains information on activating the Windows Subsystem for Linux, installing Java, and installing and running Nextflow.

License: GNU General Public License v3.0


MSB1015-Assignment-3

Problem statement

Big data requires a lot of computing. Computer clusters and video cards can perform many calculations in parallel. But that requires that your computing task actually allows data to be processed in parallel. In this assignment you will use Nextflow to calculate LogP values for molecules encoded as SMILES that you retrieved from Wikidata.

What is this project about

This repository is the final product of assignment 3, requested by the course MSB1015 (Scientific Programming).

Project structure

The query retrieves information from Wikidata using the SPARQL language, in the same way as the dedicated Wikidata Query Service. Data on Wikidata is published under the Creative Commons Zero license, which states that 'others may freely build upon, enhance and reuse the works for any purposes without restriction under copyright or database law'. This information is processed in Nextflow (released under the Apache 2.0 license, see the paper). Data is parsed using the Chemistry Development Kit (released under the GNU Lesser General Public License).
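
As an illustration of what a SPARQL request to Wikidata can look like, the sketch below builds (but does not send) a request for compounds with a SMILES string. It is not the query used in this repository: the choice of property P233 (canonical SMILES), the variable names, the User-Agent string and the helper build_request are assumptions made for the example.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Public Wikidata SPARQL endpoint.
WIKIDATA_ENDPOINT = "https://query.wikidata.org/sparql"

# Assumption: P233 (canonical SMILES) is the property of interest.
# The repository's own query may select different or additional fields.
SMILES_QUERY = """
SELECT ?compound ?smiles WHERE {
  ?compound wdt:P233 ?smiles .
}
""".strip()


def build_request(query: str, endpoint: str = WIKIDATA_ENDPOINT) -> Request:
    """Build a GET request that asks the endpoint for tab-separated results."""
    url = endpoint + "?" + urlencode({"query": query})
    headers = {
        "Accept": "text/tab-separated-values",
        # Hypothetical identifier; replace with your own contact details.
        "User-Agent": "msb1015-example/0.1",
    }
    return Request(url, headers=headers)


request = build_request(SMILES_QUERY)
print(request.full_url)
```

Passing the request to urllib.request.urlopen would perform the actual download; the response could then be saved as a TSV file for the later steps.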

How is data shared, in what format, with what protocols?

Using the tool developed in this project, data is shared using the wikidata-sdk.

Workflow

The following workflow is applied:

  1. Data extraction from Wikidata.

  2. Data parsing using the Chemistry Development Kit in Nextflow.

  3. Extract logP value for every SMILES (around 150,000).

  4. Re-run the steps 1, 2 and 3 with 1, 2 and 4 cores.

  5. Compare running times.
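
The idea behind steps 3 to 5 can be sketched outside Nextflow as well: apply the same per-molecule function with 1, 2 and 4 workers and record how long each run takes. The sketch below is only an analogy under stated assumptions. The function placeholder_descriptor is a hypothetical stand-in and does not compute logP, the three SMILES strings are illustrative, and a thread pool is used so that the example runs anywhere (for CPU-bound work in CPython, a process pool would be needed to see an actual speedup).

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple


def timed_map(
    work: Callable[[str], float],
    items: Sequence[str],
    workers: int,
) -> Tuple[List[float], float]:
    """Apply `work` to every item using `workers` threads.

    Returns the results (in input order) and the elapsed wall-clock seconds.
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(work, items))  # map preserves input order
    return results, time.perf_counter() - start


def placeholder_descriptor(smiles: str) -> float:
    """Hypothetical stand-in for the CDK logP call; NOT a real logP value."""
    return float(len(smiles))


smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]  # ethanol, benzene, acetic acid
for workers in (1, 2, 4):
    values, seconds = timed_map(placeholder_descriptor, smiles_list, workers)
    print(f"{workers} worker(s): {len(values)} molecules in {seconds:.4f}s")
```

With only a handful of short strings, the measured times mostly reflect pool start-up overhead. A meaningful comparison needs a workload of realistic size, such as the roughly 150,000 SMILES mentioned above.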

Installation

This script is run in Nextflow, which is Linux based. Linux can be run on Windows in several ways, for example in a virtual machine; this example uses the Windows Subsystem for Linux (WSL). A restart may be required during installation; if so, restart and follow the instructions given in the interface. To install WSL, Java and Nextflow on Windows, follow the steps under "Ubuntu Linux", "Java" and "Nextflow" below. If these are already installed, start at "Clone GitHub repository".

Ubuntu Linux (Windows 10 only)

Before running the following steps, make sure "Windows Subsystem for Linux" is enabled under "Windows Features". Windows Features can be found by typing "windows features" in the Windows search bar.

  1. Open Windows PowerShell as administrator
  2. Run this line in PowerShell: Enable-WindowsOptionalFeature -Online -FeatureName Microsoft-Windows-Subsystem-Linux
  3. Run this line in PowerShell: Invoke-WebRequest -Uri https://aka.ms/wsl-ubuntu-1804 -OutFile Ubuntu.appx -UseBasicParsing
  4. Run this line in PowerShell: Add-AppxPackage .\Ubuntu.appx
  5. Search for "Ubuntu" in the Windows search bar and start the application that appears. This starts Linux; on the first run the OS will be installed.

Java

  1. Optional: Start Ubuntu Linux (step 5 under "Ubuntu Linux")
  2. Run this line in terminal: sudo apt-get update
  3. Run this line in terminal: sudo apt-get install default-jdk

Nextflow

  1. Optional: Start Ubuntu Linux (step 5 under "Ubuntu Linux")
  2. Create the folder 'NxtFl' in the /home/ folder by running this line in terminal: mkdir /home/NxtFl
  3. Go to the folder by running this line in terminal: cd /home/NxtFl
  4. Run this line in terminal: wget -qO- https://get.nextflow.io | bash

Clone GitHub repository

  1. Create a directory for the GitHub repository by running this line in terminal: mkdir /home/GitRepo
  2. Clone the GitHub repository into the created folder by running this line in terminal: git clone https://github.com/Rrtk2/MSB1015-Assignment-3 /home/GitRepo

Usage

When the installation is completed, Nextflow and the GitHub repository will be in the locations used above (/home/NxtFl and /home/GitRepo). If you used different locations, change the paths below accordingly.

If no changes are required, run the following lines. These will assess the running time when using 1, 2 and 4 CPUs. The scripts use the supplied long.tsv, which is a TSV file containing the result of this specific query call.

time /home/NxtFl/nextflow run /home/GitRepo/Linux_files/runtime_test_1cpu.nf

time /home/NxtFl/nextflow run /home/GitRepo/Linux_files/runtime_test_2cpu.nf

time /home/NxtFl/nextflow run /home/GitRepo/Linux_files/runtime_test_4cpu.nf

Each line runs one script; the time command reports the running time once the script has finished.
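
The same three measurements could also be scripted. The sketch below is one possible way to do that, not part of the repository: build_commands and run_timed are hypothetical helpers, the paths are the ones from the installation steps, and os.times() is assumed to report child-process CPU time, which holds on Linux (including WSL).

```python
import os
import subprocess
import time
from typing import List, Tuple

# Paths as used in the installation steps; adjust them if yours differ.
NEXTFLOW = "/home/NxtFl/nextflow"
SCRIPT_DIR = "/home/GitRepo/Linux_files"


def build_commands(cores: Tuple[int, ...] = (1, 2, 4)) -> List[List[str]]:
    """Return one 'nextflow run' command per core count."""
    return [
        [NEXTFLOW, "run", f"{SCRIPT_DIR}/runtime_test_{n}cpu.nf"]
        for n in cores
    ]


def run_timed(command: List[str]) -> Tuple[float, float]:
    """Run a command and return (wall-clock seconds, child CPU seconds).

    The CPU figure is user + sys of the finished child processes,
    comparable to adding the 'user' and 'sys' lines of the time command.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, check=True)
    real = time.perf_counter() - start
    after = os.times()
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return real, cpu
```

Calling run_timed on each entry of build_commands() yields both the wall-clock time and the summed CPU time for every run, which makes it easy to collect the numbers in one place.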

Results / expected output

When following the usage instructions, the expected output is the summed runtime of 'user' and 'sys', which represents the CPU runtime. The scripts do not print the logP values, because printing to the terminal takes more time than the actual extraction of the logP value (which is performed in parallel). If the values are required, the corresponding lines can be uncommented in the scripts.

The following results were obtained:

  • 1cpu: 197.672s
  • 2cpu: 192.406s
  • 4cpu: 208.829s

Contact

[email protected]

License and contributing guidelines

License

Contributing guidelines

Who is involved, and what are their roles.

RRtK2 (owner and contributor)

Status of project

Final. No edits expected, possible patches and bugfixes only.

Copyright and authors

All code and documents in the MSB1015-Assignment-3 folder were created by these author(s).
