Custom_Sort_Large_File

Usage.

To compile, type make. To run the program, type "./sortme FileA FileB".

Introduction.

The command-line program sortme (source file sortme.cpp) takes two parameters: an input filename and an output filename.

For example, running "./sortme FileA FileB" reads FileA and writes the result to FileB.

The input file specified by the input filename should be a large text file having many lines of text. Each line of the input file is unique. The sortme program will output the sorted lines to the new file specified by the output filename.

Lines are sorted in ascending order by the number of occurrences of the letter "s" on each line. (That is, the line with the most "s" occurrences appears at the end of the output file.)
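The sort key described above can be sketched as follows. This is a minimal illustration, not the code in sortme.cpp: the helper names count_s and fewer_s are hypothetical, and the sketch assumes the match is case-sensitive (only lowercase "s"), which the text does not specify.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Count occurrences of the lowercase letter 's' in one line.
// Assumption: uppercase 'S' is not counted.
std::size_t count_s(const std::string& line) {
    return static_cast<std::size_t>(std::count(line.begin(), line.end(), 's'));
}

// Ordering predicate: true when line `a` must appear before line `b`.
// Lines with equal counts are not ordered relative to each other.
bool fewer_s(const std::string& a, const std::string& b) {
    return count_s(a) < count_s(b);
}
```

Because only the count matters, two lines with the same number of "s" characters may appear in either order.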

The input file is assumed to be 100x the amount of working memory available to your program. (That is, if the maximum amount of memory available to the process is 1GB, the file is 100GB in size. Simply put, assume the entire contents of the file cannot be stored in memory.)

Algorithm.

Dynamically allocate an array whose size is 1% of the size of the original input file. For each line in the original text file, store in the array the number of "s" characters in that line and the position in the file where the line begins. When the array is full, sort it by the number of "s" characters using merge sort, then write the sorted array to a scratch file. Repeat this process until every line of the original text file has been processed. The program allows for up to 110 scratch files to leave some headroom.

Finally, use a min heap to perform a k-way merge, where k is the number of scratch files to be merged. Create a min heap with k heap nodes, each holding the first element of its corresponding scratch file. Take the root of the min heap and read the position of the corresponding line in the original text file. Seek (fseek) to that position, read the line from the original text file, and write it to the output file. Replace the root of the min heap with the next element from the same scratch file and reheapify. Repeat until all scratch files reach EOF.

Testing.

You can test sortme on the file mylog.txt. This file has a line consisting of 80 "s" characters, followed by a line consisting of 79 "s" characters, and so on down to a line consisting of 1 "s" character. The same pattern is then repeated for each of the characters "a" through "g".

The output of sortme should begin with the lines of mylog.txt made up of the characters "a" through "g", in any order, since none of them contains an "s". These are followed by the line consisting of 1 "s" character, then the line consisting of 2 "s" characters, and so on up to the line consisting of 80 "s" characters.

There is also a handy filesize program that lets you check the size of your files. You will have to compile it separately with g++ -o filesize filesize.cpp. Typing "./filesize filename" outputs the size of the file in bytes.

Finally, there is also a handy bash script autoremove.sh that removes all your temporary scratch files. Remember to chmod u+x autoremove.sh and then type ./autoremove.sh to remove the temporary scratch files.

Contributors.

ameera3
