Comments (14)

zh217 avatar zh217 commented on June 14, 2024

That's a very good idea! However, a casual glance at the generated doc shows that it still hallucinates quite a lot, and does not have a grasp on the grammar of the query language (it gets it right only in the simplest cases). The structure of the documentation is good though, and probably using retrieval-augmented generation with a stronger model (like GPT-4) will make it better.

from cozo.

dustyatx avatar dustyatx commented on June 14, 2024

Thank you for taking a look. I was testing Claude 2 100k to see if the larger context would give it enough information to fill in the gaps. I fed it most of the documentation and then started trying to build off of that as a foundation. So a bit of a failed experiment but maybe I didn't find the right use for that model.

I have another technique I can try that goes a bit beyond plain document retrieval, though it takes me a while to work it out for specific use cases. What I have is all prototypes and needs a lot of intervention right now, but that's the reason I'm trying to learn Cozo.

I'll give it another try, I think this other approach does a better job with controlling hallucinations.

dustyatx avatar dustyatx commented on June 14, 2024

It would really help if I had more example queries. If you have any you can share, I have an idea of how I can use them.

infogulch avatar infogulch commented on June 14, 2024

Ideally you'd be able to randomly generate syntactically valid queries. Then you could train it with as much data as it wants. :)
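A minimal sketch of what such a generator could look like, in Python. It only covers the single query shape quoted later in this thread (`?[...] := *relation{...}`, optionally with `count`), and the schema and function names are made up for illustration. It is not the project's generator and does not cover the real CozoScript grammar.

```python
import random

# Hypothetical schema (relation name -> column names), invented for this sketch.
SCHEMA = {
    "personnel": ["department", "employee"],
    "orders": ["customer", "item", "quantity"],
}


def random_query(rng: random.Random) -> str:
    """Build one query of the form ?[head, ...] := *relation{col, ...}."""
    relation = rng.choice(sorted(SCHEMA))
    columns = SCHEMA[relation]
    # Pick a non-empty random subset of the columns, kept in schema order.
    k = rng.randint(1, len(columns))
    picked = set(rng.sample(columns, k))
    chosen = [c for c in columns if c in picked]
    head = list(chosen)
    # Sometimes aggregate the last head variable; count is the only
    # aggregate that appears in this thread, so it is the only one used.
    if len(head) > 1 and rng.random() < 0.5:
        head[-1] = "count(" + head[-1] + ")"
    return "?[" + ", ".join(head) + "] := *" + relation + "{" + ", ".join(chosen) + "}"


def generate(n: int, seed: int = 0) -> list[str]:
    """Return n queries; the same seed always gives the same list."""
    rng = random.Random(seed)
    return [random_query(rng) for _ in range(n)]


if __name__ == "__main__":
    for query in generate(5):
        print(query)
```

A real generator would have to be driven by the actual grammar and checked against the Cozo parser. The point here is only that producing queries from a grammar avoids hand-writing training examples.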

ansarizafar avatar ansarizafar commented on June 14, 2024

@dustyatx That's a very good idea! I am also interested in a simple, easy-to-understand tutorial, as the CozoDB documentation is incomplete and not beginner friendly.

dustyatx avatar dustyatx commented on June 14, 2024

I just made another attempt at this, but the documentation seems to be triggering a lot more hallucinations than I normally get. This happens no matter which model or parameters I use: GPT-4, Bison, Claude 2.

I use this process all the time at work and it usually provides a far better result than a person can accomplish. It starts with extracting the rules & core concepts. Then I inject those into a prompt that uses that information to rewrite and expand the other sections. I've had good success with this in the past, but unfortunately it's not getting the rules & core concepts extracted properly.
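The two-stage process described above can be sketched roughly like this in Python. The function names and prompt wording are hypothetical, and `llm` stands for any callable that takes a prompt string and returns text; no particular vendor API is assumed.

```python
from typing import Callable, List

# Any function mapping a prompt string to a completion string (assumption:
# the real model client is wrapped to fit this shape).
LLM = Callable[[str], str]


def extract_concepts(llm: LLM, reference_doc: str) -> str:
    """Stage 1: ask the model for the rules and core concepts in the doc."""
    prompt = (
        "List the rules and core concepts stated in the documentation below. "
        "Use only what the text says.\n\n" + reference_doc
    )
    return llm(prompt)


def rewrite_section(llm: LLM, concepts: str, section: str) -> str:
    """Stage 2: inject the extracted concepts into the rewrite prompt."""
    prompt = (
        "Using only these rules and core concepts:\n" + concepts
        + "\n\nRewrite and expand the following section:\n" + section
    )
    return llm(prompt)


def rewrite_docs(llm: LLM, reference_doc: str, sections: List[str]) -> List[str]:
    """Run stage 1 once, then stage 2 for every section."""
    concepts = extract_concepts(llm, reference_doc)
    return [rewrite_section(llm, concepts, s) for s in sections]
```

Because every rewrite depends on the stage 1 output, a bad extraction contaminates all later sections, which matches the compounding hallucinations described here.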

Here is my best attempt to troubleshoot why the LLMs are hallucinating so much. I'd say the feedback tracks with my own experience trying to understand the documentation.

Troubleshooting prompt (after the prompt, I passed in the queries.rst doc):

Please analyze the provided Cozo DB software documentation with a focus on identifying issues with the documentation that could be triggering hallucinations.
We need feedback on how to improve the documentation, not an explanation of the limitations of a large language model.
The model has a foundational understanding of Datalog.
Is there any key or critical information missing?
Is this document complete?
We are trying to troubleshoot why this document triggers hallucinations that compound quickly.
Evaluate the progression and structure of the examples provided. Are there any issues that would confuse a large language model?
Assess the explanations of key concepts. Are these concepts explained in a way that would be clear to a large language model?
Examine the use of syntax in the documentation. Is the syntax used consistently and explained clearly?
Examine the examples provided, do they give a clear understanding of the concepts they are describing?
Review the explanations of specific concepts. Are these concepts introduced and explained in a way that is clear and understandable?
Provide SPECIFIC examples from the text to illustrate your points and explain why these aspects of the documentation may be causing issues.
DO NOT FOCUS ON THE CAPABILITIES OR THE ABILITIES OF THE MODEL, ITS TRAINING DATA OR SPECULATE ON WHY IT MIGHT GET CONFUSED.
ONLY EXPLAIN ISSUES WITH THE DOCUMENTATION, DO NOT SPECULATE OR GUESS ABOUT THINGS THAT ARE NOT CONTAINED IN THE TEXT.
DO NOT COMMENT ON URLS, EXTERNAL DOCUMENTS OR VISUALIZATIONS! THE LARGE LANGUAGE MODEL DOES NOT HAVE THE ABILITY TO USE THOSE.


In general, the documentation provided is clear and explains key concepts of CozoScript. However, there are some potential areas of improvement that might be causing confusion for the AI. Here are some possible issues:

  1. Explanation of atoms and syntax: The documentation attempts to cover a lot of ground in explaining the complex topic of atoms. However, the explanations are a bit dense and may be overwhelming to someone (or an AI) unfamiliar with the subject. This could potentially cause confusion, leading to the reported hallucinations. For example:

    Atoms come in various flavours. In the example above:
    rule_a['constant_string', b]
    is an atom representing a *rule application*: a rule named rule_a must exist in the same query and have the correct arity (2 here). Each row in the named rule is then *unified* with the bindings given as parameters in the square bracket...

    A more structured breakdown and categorization of the different types of atoms and their syntax could be helpful.

  2. Lack of definitions for critical terms: Terms such as "atom", "rule", "relation", "arity", and "unification" are used throughout the documentation. However, their definitions are not clearly introduced. This lack of clear definitions could cause a model to be uncertain or make incorrect assumptions about their meanings.

  3. Use of complex language and jargon: The documentation uses a lot of complex language and specific jargon, which might be difficult to parse without a detailed understanding of the underlying principles. This could lead to difficulties in comprehending and thus misinterpretations. For instance:

    Recursion cannot occur in negated positions (*safety rule*): r[a] := not r[a] is not allowed.

    It would be beneficial to explain what "negated positions" means before stating the rule about recursion.

  4. Lack of explicit examples: While there are examples provided, they often involve complex concepts and syntaxes introduced without much context or breakdown. The examples would be clearer if they were tied to simpler concepts first, with explanations of how the syntax represents those concepts. For example:

    ?[department, count(employee)] := *personnel{department, employee}

    In this example, explaining how this line represents a query for counting employees in each department could help with understanding.

  5. Insufficient distinction between types of rules: Inline and fixed rules are described but the differentiation between them is not entirely clear. The unique traits of each type should be more clearly defined and explained, possibly with examples demonstrating the different characteristics of each.

  6. Lack of contextual linking between topics: The document jumps from topic to topic without much linkage. For example, the section on recursion directly follows the section on negation, but there's no transition to help a reader (or an AI) understand how these concepts are related.

  7. Inconsistent use of punctuation in syntax: The documentation uses both square brackets [] and curly braces {} in the syntax without clearly explaining when to use one over the other. For example:

    *stored_relation[bind1, bind2]

    *stored_relation{col1: bind1, col2: bind2}

  8. Potentially confusing formatting: Use of :=, <~, <- and [] can potentially be confusing. It would be beneficial to consistently use a certain format to express certain actions in the language.

  9. One of the noticeable issues with the provided documentation is its heavy reliance on a reader's prior understanding of Datalog, and to a certain extent, the specific nuances of the Cozo DB system.

Summary:
Datalog is a logic programming language, and its concepts form the core of CozoScript, the query language described in the documentation. Knowledge of Datalog is assumed implicitly, for example, in the sections describing the relationship between the rule head and rule body, the concept of 'relations', the use of conjunctions, the binding and unifying of variables, and the use of rule applications.

Similarly, the document uses Cozo DB specific terms such as 'inline rules', 'fixed rules', 'constant rules' and Cozo-specific operators like :=, <~, and <- without providing an introduction or context to these terms. For someone familiar with Datalog and Cozo DB, these terms and operators would be clear. However, a reader who lacks prior knowledge may struggle to understand their precise meaning and function.

The document could be greatly improved with a brief introduction to Datalog and an explanation of how CozoScript builds upon it. Key concepts of Datalog, such as rules, relations, and variables, should be explicitly explained to ensure that the reader has a firm grasp of the underlying concepts before moving on to CozoScript's specific implementation and syntax. Similarly, introducing and explaining the unique terms, rules, and operators of CozoScript would be helpful to ensure readers are not left confused or misinterpreting the subsequent sections.
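To make point 7 concrete, here is a small Python model of the two binding forms. It assumes the square-bracket form binds variables by position and the curly-brace form binds them by column name, which is how the two quoted examples read. The relation, its rows, and the helper names are invented, and the Cozo documentation remains the authority on the actual rules.

```python
# Hypothetical stored relation modelled in plain Python: column names plus rows.
COLUMNS = ("col1", "col2")
ROWS = [("a", 1), ("b", 2)]


def bind_positional(variables):
    """Mimic *stored_relation[bind1, bind2]: variables matched by position.

    Assumption: the number of variables must equal the number of columns.
    """
    if len(variables) != len(COLUMNS):
        raise ValueError("positional binding must cover every column")
    return [dict(zip(variables, row)) for row in ROWS]


def bind_named(mapping):
    """Mimic *stored_relation{col1: bind1, ...}: variables matched by column name.

    Assumption: any subset of columns may be named, in any order.
    """
    indexes = {var: COLUMNS.index(col) for col, var in mapping.items()}
    return [{var: row[i] for var, i in indexes.items()} for row in ROWS]


if __name__ == "__main__":
    print(bind_positional(["x", "y"]))
    print(bind_named({"col2": "y"}))
```

Under these assumptions the named form keeps working when columns are reordered or added, while the positional form is shorter but depends on column order.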

dustyatx avatar dustyatx commented on June 14, 2024

Not to criticize the author(s) of the docs, but the current state of the documentation does make it a challenge to get started with Cozo. Given that it's best to get people using it and providing feedback as much as possible at this stage, this is probably a good place to invest a bit more time and energy.

As much as writing documentation is painful and takes up valuable coding time, I think the current state creates a significant blocker for onboarding new users.

I'm happy to help review updates and give feedback.

zh217 avatar zh217 commented on June 14, 2024

I agree that the docs are not up to standard, but an even bigger problem faced by all new languages/DSLs is that they are new, and hence GPT and other LLMs cannot write correct code for them. This has been troubling me a lot for several months and no amount of prompting can fix it. (Making Cozo speak SQL instantly solves all problems, of course, but that defeats the purpose of this project.)

We actually have a half-decent generator of valid CozoScript, but training or fine-tuning LLMs on its output is infeasible at the moment due to the high cost. I think we can just wait a bit longer, as LLMs are currently evolving faster than anyone can write documentation.

gisborne avatar gisborne commented on June 14, 2024

ansarizafar avatar ansarizafar commented on June 14, 2024

I totally agree with @dustyatx on all points. Most developers know SQL, so it would be of great help if we could add an SQL to CozoScript mapping chart to the documentation, like this one:
mongodb.com/docs/manual/reference/sql-comparison
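As a rough idea of what one row of such a chart could hold, here is a Python sketch that pairs an SQL statement with the CozoScript line quoted earlier in this thread and runs only the SQL side in SQLite. The SQL is an assumed, approximate equivalent, the sample data is invented, and the CozoScript side is not executed here.

```python
import sqlite3

# One chart row: (SQL, CozoScript). The CozoScript string is the one quoted
# in this thread; the SQL is an assumed rough equivalent, not an official mapping.
MAPPING = [
    (
        "SELECT department, COUNT(employee) FROM personnel GROUP BY department",
        "?[department, count(employee)] := *personnel{department, employee}",
    ),
]

# Invented sample rows for the personnel table.
SAMPLE = [("eng", "ann"), ("eng", "bob"), ("ops", "cy")]


def run_sql_side(sql):
    """Run the SQL half of a chart row against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE personnel (department TEXT, employee TEXT)")
        conn.executemany("INSERT INTO personnel VALUES (?, ?)", SAMPLE)
        return sorted(conn.execute(sql).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    for sql, cozo in MAPPING:
        print(sql)
        print(cozo)
        print(run_sql_side(sql))
```

With this sample data the SQL side returns one row per department with its employee count. Checking that the CozoScript side gives the same rows would need a Cozo instance.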

gisborne avatar gisborne commented on June 14, 2024

dustyatx avatar dustyatx commented on June 14, 2024

@gisborne If you could put together a couple of step by step user journeys, I'm happy to run through them to test to make sure it's a standalone document. Start to finish installing, loading data & querying the data (for all the major use cases), saving and reloading the database, etc.

Am I correct in thinking that Graph, Search (Full-text & Similarity), Analytics are the most important capabilities that have to be demonstrated for onboarding? Maybe my bias but that is why I'm focusing on CozoDB since the other vector DBs don't provide all of these capabilities.

The other thing I was trying to figure out is how the CozoScript variant of Datalog differs from the standard one. I had a pretty good sense of the basics of Datalog after going through the tutorial Learn Datalog Today. If I had the mapping, I think I might have been able to get GPT-4 to figure out how to transform the Datalog tutorial into a CozoScript tutorial. Maybe not the first thing to be done, but since information is so scarce on this topic, it would be good if the community could leverage the Datalog information to learn CozoScript.

athanhat avatar athanhat commented on June 14, 2024

The documentation would become a lot better with definitions followed by simple examples that illustrate the concepts, syntax, and use cases of CozoDB's Datalog dialect. Moreover, since CozoDB is trying to become truly relational, it would be beneficial to demonstrate how relational operators translate into CozoScript.
Most importantly, I would like to see:

  • how a user can construct and use customized complex types in CozoDB relations
  • how CozoDB behaves with operations on relations that have null values
  • any cases that break the relational closure property
  • how you set and use variables and input parameters
  • how constraints can be implemented, in particular type constraints vs database constraints, and especially those that interrelate, e.g. "No supplier with status less than 20 can supply part P6."

Moreover I agree 100% with @zh217 - Ziyang, stay away from any SQL implementation, but it is a good idea to create some SQL to CozoScript mapping. For example after a lot of experimentation I found how to write Outer Join queries but I am not sure if that is the optimal way in CozoScript. Does it make sense to create an operator to simplify this?

Kudos on the effort to create a relational datalog DBMS, wish you a successful continuation

gisborne avatar gisborne commented on June 14, 2024
