
find_my_plot's Introduction

Physics-Project

Group 17a's repository for the group project.

The following is the full process by which the data processing team converted the data provided in directories into a fully embedded vector database, ready for vector searches to be run on it.

Paper Data Extraction

The first part of the code extracts the necessary information from the directories of paper files. This is done with the Python script get-mentions.py, which is part of the data scraping algorithm in the GitHub repository. The script takes the directory of papers for one experiment, such as ATLAS or CMS, and writes the generated data to a file location of your choice. Each entry in the output should have the following format:

{
    "name": "Figure 1",
    "mentions": [
        "Figure 1(a) shows the local \\(p\\)-value as a function of \\(m_{X}\\) for a narrow resonance that decays into a pair of SM Higgs bosons whose decay branching ratios are predicted under \\(m_{h}=125\\,\\mathrm{GeV}\\)",
        "Figure 1(b) shows the upper limits at the 95% confidence level (CL) on the resonant \\(hh\\) production cross section as a function of \\(m_{X}\\), assuming that \\(h\\) is the SM Higgs boson",
        "Figure 1: (a) Local \\(p\\)-value and (b) observed and expected upper limits at the 95% CL on the resonant Higgs boson pair production cross section as a function of the resonance mass \\(m_{X}\\)"
    ],
    "atlusUrl": "https://cds.cern.ch/record/2882365",
    "paper": "CDS_Record_2882365",
    "paperName": "Combination of searches for resonant Higgs boson pair production using $pp$ collisions at $\\sqrt{s}=13$ TeV with the ATLAS detector"
},

Each entry therefore records the name of the figure; its caption and in-text mentions, under the key mentions; the CDS URL of the paper the image came from, under atlusUrl; the CDS_Record name of that paper, under paper; and the full title of the paper, under paperName.
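As an illustration of how such an entry could be assembled, the following is a minimal sketch and not the code in get-mentions.py. It assumes the paper text is already available as a plain string, that sentences can be split naively on end punctuation, and that a mention is any sentence containing "Figure N". The function names find_figure_mentions and build_record are hypothetical.

```python
import re


def find_figure_mentions(text: str, figure_number: int) -> list[str]:
    """Return every sentence in `text` that refers to the given figure."""
    # Naive sentence split: break after '.', '!' or '?' followed by whitespace.
    # Assumption: good enough for a sketch; real paper text needs more care.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    # Match "Figure 1" and "Figure 1(a)", but not "Figure 10".
    pattern = re.compile(rf"\bFigure\s+{figure_number}(?!\d)")
    return [s.strip() for s in sentences if pattern.search(s)]


def build_record(text: str, figure_number: int, record_id: str,
                 url: str, title: str) -> dict:
    """Assemble one entry using the keys shown in the sample output."""
    return {
        "name": f"Figure {figure_number}",
        "mentions": find_figure_mentions(text, figure_number),
        "atlusUrl": url,
        "paper": record_id,
        "paperName": title,
    }
```

The negative lookahead `(?!\d)` is what keeps sentences about Figure 10 out of the mentions for Figure 1.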

Image URL extraction

The next scripts are tailored individually to each experiment, because the image metadata files are formatted differently for each one. Each script walks the paper directory, opens each metadata file and finds the URL it contains. That URL leads to an HTML page which normally links to a further page listing the locations of all the images in that paper. In some cases the image URLs are embedded in the initial page, so no second page needs to be followed. The Python scripts for this process are in the image scraping folder of the GitHub repository: ATLAS CONFERENCE NOTES URL EXTRACTION.py, ATLAS IMAGE URL EXTRACTION.py and CMS URL Extraction.py. The output should look like this:

{
    "name": ".thumb_fig_01.png",
    "url": "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SOFT-2010-01///.thumb_fig_01.png"
},
{
    "name": ".thumb_fig_02.png",
    "url": "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SOFT-2010-01///.thumb_fig_02.png"
},
{
    "name": ".thumb_fig_03.png",
    "url": "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SOFT-2010-01///.thumb_fig_03.png"
}

The code takes in the paper directory and returns a directory of JSON files, each named after the paper from which the images were extracted.
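The link-collection step can be sketched with the standard library alone. This is an assumption-laden illustration rather than the repository's scraping code: it takes HTML that has already been downloaded, treats any src or href ending in .png as an image, and resolves it against a base URL. The class ImageLinkParser and the function extract_image_entries are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collect src/href attribute values that point at PNG files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for key, value in attrs:
            if key in ("src", "href") and value and value.lower().endswith(".png"):
                self.links.append(value)


def extract_image_entries(html: str, base_url: str) -> list[dict]:
    """Return {"name", "url"} entries in the shape of the sample output."""
    parser = ImageLinkParser()
    parser.feed(html)
    entries, seen = [], set()
    for link in parser.links:
        url = urljoin(base_url, link)  # resolve relative links
        if url in seen:                # skip duplicate links
            continue
        seen.add(url)
        entries.append({"name": url.rsplit("/", 1)[-1], "url": url})
    return entries
```

Downloading the page and following the intermediate link are left out here, since those steps differ between experiments.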

Image URL merging with paper data

This step takes the image URL directory for each paper set, together with the paper data output for that set, and attaches the image URLs to the matching figure entries. The code is again available in the GitHub repository, in the merge folder: ATLAS CONFERENCE MERGE.py, ATLAS MERGE.py and CMS MERGE.py. The output has this format:

{
    "name": "Figure 1",
    "mentions": [
        "Figure 1(a) shows the local \\(p\\)-value as a function of \\(m_{X}\\) for a narrow resonance that decays into a pair of SM Higgs bosons whose decay branching ratios are predicted under \\(m_{h}=125\\,\\mathrm{GeV}\\)",
        "Figure 1(b) shows the upper limits at the 95% confidence level (CL) on the resonant \\(hh\\) production cross section as a function of \\(m_{X}\\), assuming that \\(h\\) is the SM Higgs boson",
        "Figure 1: (a) Local \\(p\\)-value and (b) observed and expected upper limits at the 95% CL on the resonant Higgs boson pair production cross section as a function of the resonance mass \\(m_{X}\\)"
    ],
    "atlusUrl": "https://cds.cern.ch/record/2882365",
    "paper": "CDS_Record_2882365",
    "paperName": "Combination of searches for resonant Higgs boson pair production using $pp$ collisions at $\\sqrt{s}=13$ TeV with the ATLAS detector",
    "imageUrls": [
        "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/HDBS-2023-17///.thumb_fig_01a.png",
        "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/HDBS-2023-17///.thumb_fig_01b.png"
    ]
}
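One way the matching could work is to compare the figure number in the entry name with the number in the image file name. The sketch below assumes file names of the form fig_01a.png, as in the sample output; the actual merge scripts may match differently, and other experiments may use other naming schemes. All function names are hypothetical.

```python
import re


def figure_number_from_name(name: str):
    """'Figure 1' -> 1; returns None if the name holds no number."""
    match = re.search(r"(\d+)", name)
    return int(match.group(1)) if match else None


def figure_number_from_url(url: str):
    """'.../.thumb_fig_01a.png' -> 1; returns None if the file name differs."""
    match = re.search(r"fig_0*(\d+)[a-z]?\.png$", url)
    return int(match.group(1)) if match else None


def merge_image_urls(figures: list[dict], images: list[dict]) -> list[dict]:
    """Attach an "imageUrls" list to each figure entry of one paper."""
    by_number = {}
    for image in images:
        number = figure_number_from_url(image["url"])
        if number is not None:
            by_number.setdefault(number, []).append(image["url"])
    merged = []
    for figure in figures:
        number = figure_number_from_name(figure["name"])
        # Figures with no matching image get an empty list.
        merged.append({**figure, "imageUrls": by_number.get(number, [])})
    return merged
```

Sub-panels such as 01a and 01b share a figure number, so both URLs end up under the same entry.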

Merging the JSON databases

The following code merges however many JSON databases you have produced. It is currently set up to merge 3 JSON files into 1, but it can easily be adapted. This code is the first section of the Python script Combining and clearing code.py.
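A merge of this kind might look like the sketch below, which accepts any number of input files. It assumes each file holds a JSON array of entries; the function name merge_json_databases is hypothetical and is not taken from the script.

```python
import json


def merge_json_databases(input_paths, output_path):
    """Concatenate several JSON arrays into one file and return the result."""
    merged = []
    for path in input_paths:
        with open(path, encoding="utf-8") as handle:
            merged.extend(json.load(handle))  # each file is a list of entries
    with open(output_path, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=4)
    return merged
```

Passing a list of paths rather than fixed file names is what lets the same function cover 3 inputs or any other number.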

Embedding the database

The following code embeds the database with vectors so that the vector search can be run on it. The script is called Embedding.py. After running it, each entry in the database should have the following format:

{
    "name": "Figure 1",
    "mentions": [
        "Figure 1(a) shows the local \\(p\\)-value as a function of \\(m_{X}\\) for a narrow resonance that decays into a pair of SM Higgs bosons whose decay branching ratios are predicted under \\(m_{h}=125\\,\\mathrm{GeV}\\)",
        "Figure 1(b) shows the upper limits at the 95% confidence level (CL) on the resonant \\(hh\\) production cross section as a function of \\(m_{X}\\), assuming that \\(h\\) is the SM Higgs boson",
        "Figure 1: (a) Local \\(p\\)-value and (b) observed and expected upper limits at the 95% CL on the resonant Higgs boson pair production cross section as a function of the resonance mass \\(m_{X}\\)"
    ],
    "atlusUrl": "https://cds.cern.ch/record/2882365",
    "paper": "CDS_Record_2882365",
    "paperName": "Combination of searches for resonant Higgs boson pair production using $pp$ collisions at $\\sqrt{s}=13$ TeV with the ATLAS detector",
    "imageUrls": [
        "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/HDBS-2023-17///.thumb_fig_01a.png",
        "https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/HDBS-2023-17///.thumb_fig_01b.png"
    ],
    "embedded vector": [
        -0.09764965623617172,
        0.052059356123209,
        0.0385596826672554,
        ....
    ]
}
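The shape of this step can be sketched without committing to a particular model. The README does not say which embedding model Embedding.py uses or which text it embeds, so both are assumptions here: the sketch embeds the joined mentions, and toy_embed is a hashed bag-of-words stand-in that only exists to keep the example runnable. In practice a real sentence-embedding function would be passed as the embed argument.

```python
import hashlib
import math


def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Stand-in embedding (NOT a language model): hashed bag of words."""
    vector = [0.0] * dim
    for word in text.lower().split():
        digest = hashlib.md5(word.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0  # bucket each word by its hash
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def embed_database(entries: list[dict], embed=toy_embed) -> list[dict]:
    """Return copies of the entries with an "embedded vector" key added."""
    embedded = []
    for entry in entries:
        # Assumption: the text to embed is all mentions joined together.
        text = " ".join(entry.get("mentions", []))
        embedded.append({**entry, "embedded vector": embed(text)})
    return embedded
```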

Cleaning up the database

This code removes entries whose image URL lists were found to be empty because the images were not available on the corresponding websites. It is the second part of the Python script Combining and clearing code.py.
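The filter itself can be very small. The sketch below assumes the database is a list of entries that use the imageUrls key from the sample output, and treats a missing key the same as an empty list. The function name is hypothetical.

```python
def drop_entries_without_images(entries: list[dict]) -> list[dict]:
    """Keep only the entries that have at least one image URL."""
    return [entry for entry in entries if entry.get("imageUrls")]
```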

Having run all of this code in this order, you should now have successfully built a vector database.
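To show what a vector search over the finished database involves, here is a minimal brute-force sketch using cosine similarity. It is an illustration only: the project's search code is not described in this README, the query vector is assumed to come from the same embedding function as the database, and the function names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(entries: list[dict], query_vector: list[float], top_k: int = 3) -> list[dict]:
    """Return the top_k entries most similar to the query vector."""
    ranked = sorted(
        entries,
        key=lambda e: cosine_similarity(e["embedded vector"], query_vector),
        reverse=True,
    )
    return ranked[:top_k]
```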

find_my_plot's People

Contributors

williame33445, georgedoumenisramos, edoardousai, pasawishk, t8ema, oran223
