Shortest Path Dataset

The Shortest Path dataset enumerates the shortest paths between US cities and is intended to be used as training data for ML work.

Generating the dataset

First, run the cityDatasetGenerator.py script, which generates a pickled object called cityPaths.p:

python3 cityDatasetGenerator.py

There are two controllable parameters here:

  • connectedRadius controls the radius below which we consider two cities to be connected. For instance, connectedRadius = 70 means two cities are considered connected if their GPS coordinates are within 70 km of each other (illustrated in the sketch after this list).

  • Desired number of points controls the number of shortest-path solutions to generate for the dataset. The main reason this exists is that the full dataset contains ~30,000 cities, and enumerating all ~900 million possible paths is computationally costly, so this parameter bounds the number of paths in the final output.
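
For concreteness, here is a minimal sketch of what that connectivity check could look like, using the standard haversine formula on radian coordinates. The function names and the 6371 km Earth radius are assumptions for illustration, not code taken from cityDatasetGenerator.py.

import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

def haversineKm(coordA, coordB):
    # Great-circle distance in km between two (lat, lng) pairs given in radians
    latA, lngA = coordA
    latB, lngB = coordB
    a = (math.sin((latB - latA) / 2) ** 2
         + math.cos(latA) * math.cos(latB) * math.sin((lngB - lngA) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def isConnected(coordA, coordB, connectedRadius=70):
    # Two cities count as adjacent if they lie within connectedRadius km
    return haversineKm(coordA, coordB) <= connectedRadius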

The pickled object is a dictionary with the following keys:

  • paths: An array containing the generated solutions. For each row, the first two entries represent the start and end city, and the remaining entries represent the solution path.
  • indexMapping: A dictionary mapping the indices used to represent the cities to the corresponding city names.
  • coordsDict: A dictionary mapping the city indices to their latitude and longitude coordinates, expressed in radians.

Unpickling the data

import pickle

# Quick helper
def getPickleData(path='cityPaths.p'):
    with open(path, 'rb') as f:
        return pickle.load(f)

unpickledData = getPickleData(path='cityPaths.p')
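
As a quick sanity check, the structure described above can be inspected like this (exact counts and indices depend on the generation parameters):

# Top-level keys of the pickled dictionary
print(unpickledData.keys())

# Each row: [startCityIndx, endCityIndx, <solution city indices ...>]
firstPath = unpickledData['paths'][0]
print('Number of paths:', len(unpickledData['paths']))
print('Example row:', firstPath)

# Map the start city of that row back to its name and radian coordinates
startIndx = firstPath[0]
print(unpickledData['indexMapping'][startIndx], unpickledData['coordsDict'][startIndx])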

Example Usage

Here's some code that takes in the unpickled data and constructs a PyTorch Dataset, which a DataLoader can then draw batches from:

import torch
from torch.utils.data import Dataset


class ShortestPathDataset(Dataset):
    def __init__(self, cityPathData, train=True, maxLength=45):
        # Index --> name and index --> (lat, lng) coordinate mappings
        self.ix2NameDict = cityPathData['indexMapping']
        self.ix2CoordsDict = cityPathData['coordsDict']  # Coords in radians
        
        #Load Paths
        fullPaths = cityPathData['paths']
        self.EoSToken = cityPathData['maxCityIndx']
        self.ix2NameDict[self.EoSToken] = '<EOS>'

        # Fill data array with data
        self.data = [each + [self.EoSToken]
                     for each in fullPaths if len(each[2:]) <= maxLength]

        # 80/20 Training/Test Split
        splitIndx = int(0.8*len(self.data))
        self.dataLen = splitIndx if train else len(self.data) - splitIndx
        self.data = self.data[:splitIndx] if train else self.data[splitIndx:]

    def __len__(self):
        return self.dataLen

    def __getitem__(self, indx):
        return self.data[indx]

and some sample usage ...

#Load in dataset
shortestPathData = ShortestPathDataset(unpickledData)

#Pick the 5th path stored in the data (index 4)
chosenPath = shortestPathData[4]

#First two entries always represent start and end cities
startCityIndx, endCityIndx = chosenPath[0], chosenPath[1]

#Solution is always stored in the remaining entries
solution = chosenPath[2:]

#Helper for mapping indexes to human readable city names
toName  = lambda x : shortestPathData.ix2NameDict[x]
toCoordinates = lambda x: shortestPathData.ix2CoordsDict[x]

#Print out the names 
print(f'Start City: {toName(startCityIndx)}, End City:{toName(endCityIndx)}')

#print out solution
solutionText = '->'.join([toName(each) for each in solution])
print(f'Solution: {solutionText}')

Start City: Baltimore:MD, End City:Valparaiso:IN

Solution: Baltimore:MD->Westminster:MD->Shippensburg:PA->Huntingdon:PA->Altoona:PA->Indiana:PA->Butler:PA->Boardman:OH->Tallmadge:OH->Medina:OH->Norwalk:OH->Fremont:OH->Bowling Green:OH->Defiance:OH->Auburn:IN->Warsaw:IN->Plymouth:IN->Valparaiso:IN-><EOS>
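
The coordinate mapping can be put to work too, e.g. to sanity-check a solution by summing the great-circle distances along the path. Below is a rough sketch reusing the same assumed haversine helper as above (6371 km Earth radius):

import math

def haversineKm(coordA, coordB):
    # Great-circle distance in km between two (lat, lng) pairs given in radians
    latA, lngA = coordA
    latB, lngB = coordB
    a = (math.sin((latB - latA) / 2) ** 2
         + math.cos(latA) * math.cos(latB) * math.sin((lngB - lngA) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Drop the <EOS> token, then sum the leg lengths along the solution
cityStops = [ix for ix in solution if ix != shortestPathData.EoSToken]
totalKm = sum(haversineKm(toCoordinates(a), toCoordinates(b))
              for a, b in zip(cityStops, cityStops[1:]))
print(f'{len(cityStops)} stops, ~{totalKm:.0f} km end to end')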

Some potential helper methods

Below is a quick-and-dirty collate method (written as a closure for convenience; PyTorch expects just the inner function) that the DataLoader will apply to each batch. Its purpose is twofold:

  1. Transform each sequence into a teacher-forced language-modeling example: for each input sequence, a target sequence is created by shifting the input down by one position.
  2. Pad each batch with the padding token so that all entries in the batch are the same length.

def makeCollateFn(paddingIndx):
    pIndx = paddingIndx

    def collateLanguageModelSequence(batch):
        endpoints, paths, lengths = [], [], []
        batchLen = len(batch)
        for each in batch:
            endpoints.append(each[0:2])    # start/end city indices
            paths.append(each[2:])         # solution path (including <EOS>)
            lengths.append(len(each[2:]))

        seqLen = max(lengths)
        endpoints = torch.tensor(endpoints).long()

        # Pre-fill the input/target tensors with the padding index
        xxPad = torch.full((batchLen, seqLen), pIndx, dtype=torch.long)
        yyPad = torch.full((batchLen, seqLen), pIndx, dtype=torch.long)

        # Inputs are the path minus its last token; targets are the path
        # shifted down by one (teacher forcing)
        for i, sequence in enumerate(paths):
            xxPad[i, :lengths[i] - 1] = torch.tensor(sequence[:-1]).long()
            yyPad[i, :lengths[i] - 1] = torch.tensor(sequence[1:]).long()

        return endpoints, xxPad, yyPad
    return collateLanguageModelSequence
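
Finally, here's a rough sketch of wiring the dataset and the collate function into a DataLoader. The padding index shown (one past the <EOS> token) is an assumption; use whatever index your model reserves for padding.

from torch.utils.data import DataLoader

# Assumed padding index: one past the <EOS> token so it never collides
# with a real city index
paddingIndx = shortestPathData.EoSToken + 1

trainLoader = DataLoader(shortestPathData,
                         batch_size=32,
                         shuffle=True,
                         collate_fn=makeCollateFn(paddingIndx))

endpoints, xx, yy = next(iter(trainLoader))
# endpoints: (batchSize, 2) start/end city indices
# xx, yy:    (batchSize, longest path in batch) padded input/target sequences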
