
Research experience of Sushma Akoju

  1. University of Arizona: PhD, ongoing since Fall 2022.

  2. University of Colorado Boulder: Master's in Data Science, 2021 - 2022.

  3. Carnegie Mellon University: Research Programmer, 2016 - 2021.

  4. University of Pittsburgh: Master's in Information Science, 2015.

  5. University of Hyderabad: Post Graduate Diploma in Forensic Science and Criminal Justice. Credential verification from World Education Services (WES).

  6. Jawaharlal Nehru Technological University, Hyderabad: Bachelor of Technology, Computer Science and Engineering. Credential verification from World Education Services (WES).

🎐 🎐 Research interests, skills and works-in-progress:

Research Interests:

I am interested in developing proofs of correctness for logical reasoning over language models and LLMs. I have the required skill set and the experience to work on, and explore further, the logical reasoning aspects of LLMs.

Looking for potential advisors and research alignment:

I am looking to continue my studies in a way that puts my past research experience (see my posts on Twitter or the Research Experience section above) and acquired skills to good use, while also giving me chances to learn and explore. Nothing is set in stone, but I am interested in finding a research alignment where my acquired skills as well as my interests help me progress.

https://github.com/sushmaakoju/research-experience

I am looking for potential collaborators. If you know of any universities still accepting PhD applications for Fall 2024, or any potential advisors looking for students, please let me know.

I recently completed reading a few notes on LLMs, Automated Reasoning, Neurosymbolic AI and Generative AI.

🔆 Acquired skills in NLP and AI:

  1. Named Entity Extraction & Recognition
  2. Binary and Multi-label classification
  3. Prompt design
  4. Generative AI
  5. Language Models and Large Language Models
  6. Text extraction from OCR scanned Handwritten documents
  7. Zero-shot, few-shot learning
  8. Teachable systems
  9. Machine Unlearning and Large Language Models
  10. Semantic Parsing
  11. Natural Language Inference
  12. Tableau based proof generation
  13. Statistical Data Analysis
  14. Domain-specific Machine Learning algorithms implementation
  15. Machine Consciousness
  16. Mathematical Consciousness

📚 Studied the following topics out of interest in the subject:

  1. First Order Logic
  2. Mathematical Logic
  3. Combinatory Categorial Grammar (CCG)
  4. Abstract Meaning Representation (AMR)
  5. Logical Neural Networks
  6. Markov Logic Networks
  7. Lambda Calculus
  8. Tableau based proofs
  9. Automated Reasoning
  10. Natural Logic
  11. Monotonicity Calculus
  12. Moral reasoning in AI systems
  13. Traditional Algorithms for detecting AI generated images and AI generated texts
  14. Automata, Grammars, Languages
  15. Coalgebra
  16. Research "Handbooks" on various topics

Education:

  1. Post Graduate Diploma in Forensic Science and Criminal Justice, University of Hyderabad, India.
  2. Master's in Information Science, University of Pittsburgh, USA.
  3. Master's in Data Science, University of Colorado Boulder, USA.
  4. Diploma in Journalism, Hyderabad, India.

Hobbies:

Interested in political data analysis, statistical data analysis, news-related analysis, voting systems, election data analysis, and skiing data analysis.

Works-in-progress

Presently I am working on a survey of methods involving First Order Logic-aligned natural language data on

  1. Transformer Language Models and
  2. Large Language Models (LLMs).

These works are theoretical, at the intersection of natural language tasks and logical reasoning, so I am looking for collaborators and advisors.

πŸ“ƒ πŸ† Achievements list & Self-learning on logical reasoning:

🎇 NLP Course Project: My submission ranked #1 on the leaderboard during the first round of SemEval-2022 Task 4, "Patronizing and Condescending Language Detection":

My submission for the binary classification track of SemEval-2022 Task 4, "Patronizing and Condescending Language Detection", was ranked 1st on the leaderboard during the first round in Dec 2021. I fine-tuned a pre-trained RoBERTa for binary PCL classification, and a pre-trained SpanBERT with KeyBERT for multi-label PCL classification. During this submission I suffered an unfortunate fracture to my finger, which made it challenging to type the code and the report. Still, I secured the #1 rank for this course project on the competition's leaderboard for the first round, as well as in the entire class (~90 students), during the NLP course at University of Colorado Boulder, under the supervision of Prof. James H. Martin. Competition results: https://competitions.codalab.org/competitions/34344#results My work for this competition from 2021: https://github.com/sushmaakoju/nlp-final-2021-pcl-semeval2022-task
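For flavor, here is a minimal sketch of the general recipe (not my actual competition code) for fine-tuning a pre-trained RoBERTa as a binary classifier with Hugging Face transformers. The two in-memory examples and all hyperparameters below are placeholders:

```python
# Minimal sketch: fine-tuning RoBERTa for binary classification.
# The tiny in-memory dataset and hyperparameters are illustrative only.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # 0 = not PCL, 1 = PCL

train = Dataset.from_dict({
    "text": ["These poor souls cannot fend for themselves.",   # condescending
             "The council approved the housing budget."],      # neutral
    "label": [1, 0],
}).map(lambda batch: tokenizer(batch["text"], truncation=True,
                               padding="max_length", max_length=128),
       batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="pcl-roberta",
                           per_device_train_batch_size=8,
                           num_train_epochs=3),
    train_dataset=train,
)
trainer.train()
```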


The submission for this project was in Dec 2021. My name is Sushma Akoju, so I used the username "sua"; this work was done while I was a student at University of Colorado Boulder. My team member, Waad Alharthi, worked on both binary and multi-label classification using BERT, and her submission ("waad") also ranked within the top 5 for this competition. Soon after the first round, like several others in Boulder, Colorado, we faced a natural disaster situation due to the wildfires there, which kept us from participating in the second round.

Prof. James H. Martin's fun lectures, and the book he co-authored, Speech and Language Processing, were great and helpful. We are thankful to the Professor and the TAs.

🎐 Course: Independent study at CU Boulder on the study and design of text extraction from OCR-scanned handwritten Slave Trade volumes, with named entity extraction and recognition tasks:

During my independent study at CU Boulder, I worked on the study and design of a process for text extraction from OCR-scanned handwritten Slave Trade volumes; the extracted text then required named entity extraction and recognition: https://github.com/sushmaakoju/named-entity-text-extraction-ocr-slave-trade-volumes It was a novel subject for me, since I had no prior experience in Digital Humanities or historical corpora. Although it was a very complex task, I connected well with the goal of building the NLP and named entity extraction pipeline from scratch, and I had a goal and an idea in mind. This was only possible because of help from my supervisor Prof. Henry Lovejoy, Digital Slavery Research Lab, University of Colorado Boulder, who explained the intended end goal. I also received guidelines about the data and the priority of the goals from Kartikay Chadha, Doctoral Candidate, School of Information Studies, McGill University & CEO, Walk With Web Inc. Before coming up with the NLP pipeline, I had to explore PDF-to-text and Image2Text tools, as well as Transkribus labelling for segment-level labels and Regions of Interest (ROIs) in the handwritten PDFs. Combining the two, I came up with these subtasks:

  1. "Extracting text from OCR scanned Slave trade volumes from British Parliamentary papers". (I used Google Document AI api, in 2021 it was no-cost:)
  2. "Text clean up and correction"
  3. "Extract BIO tags from the ground truth"
  4. "Mark all named entities for each sentence in extracted sentences - with BIO tags and POS tags"
  5. Finetuning the model.
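As referenced in subtask 1, here is a minimal sketch of how OCR text extraction with the Google Document AI Python client generally looks. The project id, location, and processor id are placeholders, and the client surface may have changed since 2021:

```python
# Minimal sketch: OCR text extraction via Google Document AI
# (google-cloud-documentai). PROJECT_ID, LOCATION, and PROCESSOR_ID are
# placeholders; a Document OCR processor must already exist in the project.
from google.cloud import documentai

PROJECT_ID, LOCATION, PROCESSOR_ID = "my-project", "us", "my-processor-id"

client = documentai.DocumentProcessorServiceClient()
name = client.processor_path(PROJECT_ID, LOCATION, PROCESSOR_ID)

with open("slave_trade_volume_page.pdf", "rb") as f:
    raw_document = documentai.RawDocument(
        content=f.read(), mime_type="application/pdf")

result = client.process_document(
    request=documentai.ProcessRequest(name=name, raw_document=raw_document))

# The processor returns the full recognized text plus layout/ROI metadata.
print(result.document.text[:500])
```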

Up to subtask 4 it was easy, but BERT failed on this data. The text was 17th-19th century English: for example, city names and person names were different back then, and BERT would not correctly tag parts of speech for words like "whence" or "cometh" when a sentence had a long-tail dependency. So I used the MacBERTh model, which was specifically pre-trained on historical English (1450-1950): https://macberth.netlify.app It was during this project that I learnt we do not need to completely extract text before the named entity extraction step of the pipeline; there are methods that combine image-to-text and text-to-POS & BIO tags in a single pipeline, though they require other intermediate steps. Given my interest in taking on challenges, and my interest in this novel goal with a purpose, this project was one of the best experiences for expanding my NLP skills.
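For a flavor of the difference, here is a minimal sketch of querying MacBERTh as a masked language model through Hugging Face transformers. I am assuming the checkpoint id emanjavacas/MacBERTh; this illustrates only how a historically pre-trained model is queried, not the actual tagging pipeline:

```python
# Minimal sketch: querying MacBERTh (pre-trained on 1450-1950 English) as a
# masked LM. "emanjavacas/MacBERTh" is assumed to be the Hugging Face id.
from transformers import pipeline

fill = pipeline("fill-mask", model="emanjavacas/MacBERTh")

# Early Modern English phrasing that trips up models trained on modern text.
for prediction in fill("From whence [MASK] he?"):
    print(prediction["token_str"], round(prediction["score"], 3))
```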

I am grateful for the chance to learn, and thanks to Prof. Henry Lovejoy (African Diaspora / Digital History), Dr. Jane Wall (Director, MSDS), Kartikay Chadha (Walk With Web Inc), and Kaitlyn Rye (MSDS coordinator + Online Program Manager).

Another set of smaller model examples I worked on: BERT & MacBERTh!

🎐 Course Project: Survey of Visualizing learning process in Language models:

During the Information Visualization project at University of Colorado Boulder, I worked on finding existing methods for visualizing the processes underlying Natural Language Processing data and tasks. Prof. Abram Handler supported and encouraged me toward this goal. I am truly grateful for the chances to learn.

During this course project, I compared this with how layers learn across training epochs on image data such as the MNIST dataset. I found that it is not easy to visualize Natural Language Processing tasks, even over pre-trained models and architectures such as BERT, LSTMs, and RNNs. Not only is the goal of visual explanation a complex task, it is also vastly different across language models. Report: https://github.com/sushmaakoju/research-experience/blob/main/university-of-colorado-boulder/sushma-akoju-Analyzing_tools_for_Visualizing_Deep_learning_Models_for_Natural_Language_text___info_viz_project_report.pdf Skiing data visualization, done to understand the concepts of visualization: https://github.com/sushmaakoju/skiing-data-visualization

On another note, most applications built on language models for downstream tasks in for-profit companies may not necessarily emphasize understanding these methods at the level of detail required to actually implement them.

🎯 Course Project, STATS 5000 & STATS 5010: Comparison of traditional statistical regression analysis with nearly black-box neural network models on an XOR dataset:

During the STATS 5010 Statistical Methods and Applications II course at University of Colorado Boulder, I worked on a course project comparing traditional statistical regression analysis with nearly black-box neural network models on an XOR dataset, in R. This course project was under the supervision of Prof. Brian Zaharatos. It was an interesting project and I learnt a lot.

During the STATS 5000 course project, I worked with Prof. James Bird to understand Markov Logic Networks. I chose this topic because I wanted to learn about these methods: https://github.com/sushmaakoju/markov-logic-networks It was another interesting project, and I am grateful to Prof. James Bird for the chances to learn.
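For context, the central object I was trying to understand is the MLN joint distribution over possible worlds; a standard way to write it (following Richardson & Domingos) is:

```latex
% Probability of a world x in a Markov Logic Network:
% w_i is the weight of formula F_i, n_i(x) counts its true groundings in x.
P(X = x) = \frac{1}{Z} \exp\Big( \sum_{i} w_i \, n_i(x) \Big),
\qquad
Z = \sum_{x'} \exp\Big( \sum_{i} w_i \, n_i(x') \Big)
```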

I wanted to learn about the Universal Approximation Theorem and verify it in practice. I explored the interpretability aspects of linear regression, as well as XOR data, over well-known regression methods and neural networks: https://github.com/sushmaakoju/stat-5010-final-project
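The XOR contrast is easy to reproduce. Here is a minimal Python sketch (my course project was in R, so this is an illustration, not the project code) showing that a linear classifier cannot fit XOR while a one-hidden-layer network can, which is the Universal Approximation Theorem in miniature:

```python
# XOR is linearly inseparable: a linear model fails, a tiny MLP fits it.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # XOR truth table

linear = LogisticRegression().fit(X, y)
print("linear accuracy:", linear.score(X, y))  # stuck around 0.5

mlp = MLPClassifier(hidden_layer_sizes=(4,), activation="tanh",
                    max_iter=10000, random_state=0).fit(X, y)
print("MLP accuracy:", mlp.score(X, y))  # typically 1.0
```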

I am grateful to Prof. Brian Zaharatos for the chances to learn and understand the statistical methods.

🏯 🎐 🎐 Course Project for Data Mining: Knowledge Representation & Reasoning:

During the Data Mining course taught by Prof. Di Wu at University of Colorado Boulder, I (Sushma Akoju) wanted to work on knowledge representation, so at first I started working on it by myself. Later Waad joined as my teammate, and we worked on a literature survey over Knowledge Representation and Natural Language Processing. I studied the StrategyQA and EntailmentBank papers and datasets from AllenAI's Aristo project, wrote some scripts, and worked on understanding the RoBERTa and BERT models over the StrategyQA and BoolQ datasets. It was still one of the best experiences. I also studied the Semantic Web and NELL papers.
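As a hint of what those scripts looked like, here is a minimal sketch (not the actual course scripts) of loading BoolQ from the Hugging Face hub and pushing question/passage pairs through a BERT-style classifier; the dataset id google/boolq and the untrained classification head are assumptions for illustration:

```python
# Minimal sketch: BoolQ question/passage pairs through a BERT-style encoder.
import torch
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

boolq = load_dataset("google/boolq", split="validation[:2]")

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)  # yes/no head, untrained here

for example in boolq:
    inputs = tokenizer(example["question"], example["passage"],
                       truncation=True, max_length=256, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    print(example["question"], "->", logits.softmax(-1).tolist())
```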

During this course project I reached out to

  1. Prof. Di Wu (Semantic Web),
  2. Prof. Tom Mitchell (Logical Neural Networks (LNNs) & LSTMs),
  3. Prof. Martha Palmer (on Abstract Meaning Representation (AMR)),
  4. Dr. Peter Clark (StrategyQA and EntailmentBank) from the Allen Institute for AI.

I received encouraging support and guidelines from them. I am very grateful for the guidance and the chances to learn and work with these professors and mentors.

I previously worked on the Yahoo InMind project as a Research Programmer at Carnegie Mellon University. The InMind project was supervised by Prof. Tom Mitchell and Prof. Justine Cassell. During this project, I attended a seminar course on Never Ending Language Learning as an employee, along with the rest of the team. I learnt about Knowledge Representation, Q-learning, and First Order Logic from the professors, team members, and TA Abulhair Saparov. I am grateful for the chances to learn.

It was my inquisitive nature, and discussing so many aspects with Prof. Jim Martin of University of Colorado Boulder, that led him to suggest I apply for a PhD after the NLP course and the top rank on the SemEval task leaderboard. I am truly grateful.

🎐 Course project for Big Data Architecture: Microsoft Z3 Theorem Provers and the Entscheidungsproblem:

During the Big Data Architecture course at University of Colorado Boulder, I wanted to learn about theorem provers and the Entscheidungsproblem, so I worked on understanding the Entscheidungsproblem for proofs over simple Lambda-calculus-style proof statements. Using a similar proof system over a popular natural language example, "smoking causes cancer" (from Prof. Pedro Domingos, Markov Logic Networks), I learnt the theory and implemented an application that drives Microsoft's Z3 provers and solvers. I learnt about the theoretical aspects of satisfiability and SMT problems over TPTP (Thousands of Problems for Theorem Provers). I also trained an RNN model on mathematical proofs based on First Order Logic & tableau proof methods; however, the researcher who provided the dataset did not want it made public, so I never added the RNN model results to this project.

I am grateful to Gregory Greenstreet and Brian Newsom for allowing me to work on this project on my own. My project repository: https://github.com/sushmaakoju/demo-ATLS5214

I would like to share the fun example and demo of the applications I created, designed, and hosted on Google Cloud, available for as long as CU Boulder allows it: https://theorem-prover-4182022.uc.r.appspot.com

This example shows basic proof sequents from the Z3 prover and a pseudo-Entscheidungsproblem implementation. #logic #firstorderlogic #theoremprovers #z3 #sat #smt
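For a taste of what such a proof looks like, here is a minimal sketch (my own illustration, not the demo's code) of the "smoking causes cancer" entailment in Z3's Python bindings (z3-solver): to prove the goal, assert its negation and check for unsatisfiability.

```python
# Prove that {forall x. Smokes(x) -> Cancer(x), Smokes(alice)} entails
# Cancer(alice) by refutation: assert the negated goal and expect unsat.
from z3 import (BoolSort, Const, DeclareSort, ForAll, Function, Implies,
                Not, Solver, unsat)

Person = DeclareSort("Person")
Smokes = Function("Smokes", Person, BoolSort())
Cancer = Function("Cancer", Person, BoolSort())
x = Const("x", Person)
alice = Const("alice", Person)

solver = Solver()
solver.add(ForAll([x], Implies(Smokes(x), Cancer(x))))  # smoking causes cancer
solver.add(Smokes(alice))                               # alice smokes
solver.add(Not(Cancer(alice)))                          # negated goal

print("entailed" if solver.check() == unsat else "not entailed")  # "entailed"
```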

πŸ† My First first Author publication, at University of Arizona:

University of Arizona: I worked with Prof. Mihai Surdeanu on Natural Logic, and had the chance to learn and study the "Natural logic for textual inference" work from Bill MacCartney and Johan van Benthem's texts on Monotonicity Calculus.

In my work https://arxiv.org/abs/2307.05034 under the supervision of Prof. Mihai Surdeanu, we introduced Sentences Involving Complex Compositional Knowledge (SICCK) and a novel analysis that investigates how well Natural Language Inference (NLI) models understand compositionality in logic. We selected modifier phrases based on universal quantifiers, existential quantifiers, negation, and other concept modifiers in Natural Logic (NL) (MacCartney, 2009); used these phrases to modify the subject, verb, and object parts of the premise and hypothesis; and annotated the modified texts with the corresponding entailment labels following NL rules. We conducted a preliminary verification of how well neural NLI models capture the change in structural and semantic composition, in both zero-shot and fine-tuned scenarios. Even after fine-tuning on this dataset, we observed that models continue to perform poorly over negation and over existential and universal modifiers. Code and data: https://github.com/sushmaakoju/clulab-releases/tree/master/acl2023-nlrse-sicck
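The zero-shot side of such an evaluation looks roughly like the following minimal sketch, which assumes a generic public NLI checkpoint (roberta-large-mnli) and is not the SICCK evaluation code:

```python
# Minimal sketch: zero-shot NLI scoring of one premise/hypothesis pair.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "roberta-large-mnli"  # a generic public NLI checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)

premise = "All dogs are sleeping near the fireplace."    # universal quantifier
hypothesis = "Some dogs are sleeping near the fireplace."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1).squeeze()

for idx, label in model.config.id2label.items():
    print(label, round(float(probs[idx]), 3))
```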

This paper stands as proof of work under certain extreme situations, like a test of resilience for me. Thanks to Prof. Mihai Surdeanu for the chances to learn and for the guidance.

🎐 Work as a Research Assistant on a DARPA project at Clulab, University of Arizona, Aug 19th, 2022 - July 22nd, 2023:

This work is evidence of a great collaboration with Prof. Mihai Surdeanu and Dr. Enrique Noriega-Atala. The main goals were to learn Scala and the Odin rule system. We cannot reveal many details of the "what" and the "how".

  1. I worked on developing a small module within existing software written in Scala, and learnt to implement modules in Scala.

  2. I worked on a task involving text extraction from Covid papers, using a well-established software pipeline implemented by experts.

  3. I annotated datasets for the task in item 2.

  4. It was a nice project; all good things must come to an end.

My contributions, based on my findings during the RA work; not implemented, but they seemed fun to me:

Although most of the tasks here were well directed, I had some interesting findings of my own. The following are my ideas, which remain unexplored, that I contributed and wanted to share, for the love of language, historical newspapers, and Abstract Meaning Representations:

Jan 9th 2023

a) I found that a lot of out-of-vocabulary words existed. This was expected, since Covid introduced several new terms and words, so I proposed to extract new vocabulary by appealing to neologisms: Linguistic analysis of neologisms related to coronavirus (COVID-19), https://www.sciencedirect.com/science/article/pii/S2590291121000978 Both Oxford and Cambridge maintained neologisms for Covid, and their APIs gave both new words and new word senses related to Covid. This one for the love of language.
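A minimal sketch of how such out-of-vocabulary terms surface in practice (my own illustration, not the project code): words a pre-trained tokenizer shatters into many subword pieces are a rough proxy for OOV neologisms.

```python
# Minimal sketch: flagging likely OOV terms by subword fragmentation.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Real Covid-era neologisms alongside an established word for contrast.
for word in ["quarantini", "covidiot", "doomscrolling", "vaccine"]:
    pieces = tokenizer.tokenize(word)
    print(word, "->", pieces, "(likely OOV)" if len(pieces) > 2 else "")
```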

Feb 6th 2023

b) I also identified that Covid papers came in different types of document structures, depending on the conference/paper format. The context and organization of the text depended heavily on structure and headings, which was very different from AI/NLP conference papers and their structure (sections, etc.). So I referred back to my work from my independent study at the Digital Slavery Research Lab, University of Colorado Boulder, and studied the Impresso project on historical newspaper texts and named entity extraction, available as the Digital Humanities shared task HIPE (Identifying Historical People, Places and other Entities). This one for the historical newspapers and their lasting impressions.

Mar 18th 2023

c) I found some interesting works on Abstract Meaning Representation extended to document-level structures. The idea is similar to known approaches for historical or OCR-scanned documents: label the segments and related Regions of Interest (ROIs), such as sentence spans and paragraph spans, with keywords/events. This one for the "Abstract Meaning Representation" of Martha Palmer.

Summer 2023

d) We did not implement c) exactly; after a guided review, we followed a different strategy under the supervision of Prof. Mihai Surdeanu. My aforementioned three ideas remain unexplored.
