
outlines's People

Contributors

7flash, alonsosilvaallende, benlipkin, bettybas, brandonwillard, brosand, dtiarks, eitanturok, harsh-sprinklr, herrivan, isamu-isozaki, jqueguiner, jrysana, julesgm, kstathou, ksvladimir, lapp0, lassiraa, leloykun, mattkindy, milo157, mondaychen, perdu, posionus, rlouf, robinpicard, rshah713, saattrupdan, shawnz, tscholak

outlines's Issues

Use Pydantic to suggest and enforce response format

AutoGPT's and BabyAGI's prompts contain instructions regarding the expected output format, but then use custom parsing code for the response. I suggest letting the user define the expected response format with Pydantic, and building the prompt directly from the Pydantic schema.

So the following instruction:

{
    "thoughts":
    {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
}

Could be passed by first defining and decorating the following schema:

import outlines.text as text
from pydantic import BaseModel, Field

@text.response
class Thoughts(BaseModel):
    text: str = Field(description="thought")
    reasoning: str = Field(description="reasoning")
    plan: str = Field(description="short bulleted list that conveys long-term plan")
    criticism: str = Field(description="constructive self-criticism")
    speak: str = Field(description="thoughts summary to say to users")

So we can do:

import outlines.text as text

@text.prompt
def prompt(schema):
    """RESPONSE FORMAT:
    
     {{schema.description}}
    """

prompt(Thoughts)

And on an LLM response:

answer = Thoughts.parse(llm(query))

Add HF causal models integration

outlines.text.models.hugging_face("name", **config)

We can also alias a couple of common models such as outlines.text.lm.GPT2.

Symbolic or not?

What can we implement without a symbolic representation, what can't we?

Syntactic sugar / string semantics

We can subclass Python's str implementation and thus add syntactic sugar for string manipulation, including .words, .lines, etc. methods. We can add a decorator so users don't have to instantiate this type explicitly. Operators can convert input strings to our string type. For this we will need to wrap the functions that define the Ops with a decorator.

This is not very different from creating a "StringVariable" in a symbolic setting and could serve as the basis for a later symbolic implementation.
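
A minimal sketch of what this could look like, assuming a hypothetical OutlinesString subclass and a string_op decorator (both names are illustrative):

# Hypothetical sketch: a str subclass with .words/.lines sugar and a
# decorator that coerces plain strings passed to Ops into this type.
import functools


class OutlinesString(str):
    @property
    def words(self):
        return self.split()

    @property
    def lines(self):
        return self.splitlines()


def string_op(fn):
    """Wrap an Op so that all string arguments become OutlinesString."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        args = [OutlinesString(a) if isinstance(a, str) else a for a in args]
        kwargs = {
            k: OutlinesString(v) if isinstance(v, str) else v
            for k, v in kwargs.items()
        }
        return fn(*args, **kwargs)
    return wrapper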

Logging

By logging I mean returning to the user the full execution path of the program. In a non-symbolic world we can propagate this state within the custom string instance.
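
One way to propagate that state is to carry an execution trace alongside the string value. A rough sketch, with a hypothetical TracedString type:

# Hypothetical sketch: carry the execution path along with the string value.
class TracedString(str):
    def __new__(cls, value, trace=None):
        obj = super().__new__(cls, value)
        obj.trace = trace or []
        return obj

    def record(self, op_name, output):
        """Return the Op's output with the updated execution path attached."""
        return TracedString(output, self.trace + [(op_name, str(self), str(output))])


prompt = TracedString("Q: What is 2 + 2?")
answer = prompt.record("llm", "A: 4")
print(answer.trace)  # the full execution path, returned to the user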

Random variables and inference

I think this is where it hurts in a non-symbolic context, and the reason why cascades was designed the way it is. Unless you return the variables of interest explicitly, there is no way to know which variables are random, and you'd have to parse the AST.

f-strings

Add interactive execution mode

Display the completion in real time in the terminal.

Debug mode. Stop execution, navigate up the trace, modify a node and re-execute. This can avoid wasting expensive calls to generative models.

This may need to compile to a Program class which contains a stack of frames. __call__ runs the program, possibly with an interactive kwarg.
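
A rough sketch of what compiling to such a Program could look like; every name here is hypothetical:

# Hypothetical sketch: a Program holds a stack of frames; __call__ executes
# them in order, optionally pausing between frames in interactive mode.
class Frame:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn


class Program:
    def __init__(self, frames):
        self.frames = list(frames)

    def __call__(self, state, interactive=False):
        for i, frame in enumerate(self.frames):
            state = frame.fn(state)
            print(f"[{i}] {frame.name}: {state}")  # display completions in real time
            if interactive and input("continue? [Y/n] ").lower() == "n":
                break  # stop here; the user can inspect/modify `state` and re-run
        return state


program = Program([Frame("upper", str.upper), Frame("exclaim", lambda s: s + "!")])
program("hello", interactive=False)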

Vectorize the model and function calls

Below is a proposal to vectorize the outlines code, which would pave the way to interesting workflows with large language models.

Vectorizing means that models can accept arrays as inputs, and map over these inputs to get as many outputs:

import numpy as np
import outlines.models as models

llm = models.text_completion.openai("text-davinci-003")
inputs  = np.array(["A first input", "a second input", "a third input"])

llm(inputs)
# np.array(["first output", "second output", "third output"])

They can also do so when several samples are generated per prompt:

llm(inputs, samples=2)
# np.array([
#    ["first output sample 1", "first output sample 2"],
#    ["second output sample 1", "second output sample 2"],
#    ["third output sample 1", "third output sample 2"]
# ])

The execution method will vary between API-based models and local models: in the first case we will need to batch async calls; in the second case batched generations on arrays are already available.

We would then need to be able to map other functions over the resulting arrays. We can provide a convenience function outlines.map which reduces to a for loop for synchronous functions, and executes async functions in parallel. We can adopt the same syntax as JAX's vmap:

import outlines
import outlines.models as models
import outlines.tools as tools


llm = models.text_completion.openai("text-davinci-003")
inputs  = np.array(["A first input", "a second input", "a third input"])

results = llm(inputs)
print(results)
# np.array(["first output", "second output", "third output"])

results = outlines.map(tools.google_search)(results)

The cleanest possible way to implement this is to implement the equivalent of numpy.vectorize. We can initially support only functions with scalar input and output core dimensions; we will also need vector core dimensions for when we generate several samples with models.
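
A minimal sketch of the numpy.vectorize-style behaviour for synchronous, scalar-in/scalar-out functions (the implementation is only illustrative; vector core dimensions for multiple samples would extend it):

import functools

import numpy as np


def outlines_map(fn):
    """Map a scalar function over arrays of inputs, numpy.vectorize-style."""
    @functools.wraps(fn)
    def wrapper(inputs):
        inputs = np.asarray(inputs)
        outputs = [fn(x) for x in inputs.ravel()]
        return np.array(outputs).reshape(inputs.shape)
    return wrapper


shout = outlines_map(lambda s: s.upper())
print(shout(np.array(["a first input", "a second input"])))
# ['A FIRST INPUT' 'A SECOND INPUT']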

Consider implementing a more powerful templating engine that includes Ops

Consider the following example taken from this paper:

def fill_in_the_blanks(question, llm):
    meta_prompt = compose("""
    ${question}
    To solve this problem, we will analyze each of the options and determine
    """, question=question)
    goal = llm(meta_prompt)

    prompt = compose("""
    ${meta_prompt}${goal}. Let's begin.
    """, meta_prompt=meta_prompt, goal=goal)
    answer = llm(prompt)

    return goal, answer

direction = compose("""
Directions: In the following question, a related
pair of words or phrases is followed by five
pairs of words or phrases. Choose the pair
that best expresses a relationship similar to
that in the original pair.
BRAGGART :: MODESTY
A) FLEDGLING : EXPERIENCE
B) EMBEZZLER : GREED
C) WALLFLOWER : TIMIDITY
D) INVALID : MALADY
E) CANDIDATE : AMBITION
""")

llm = OpenAI("text-davinci-001")
fn = outlines.chain([], fill_in_the_blanks(direction, llm))

The Outlines implementation is understandable, but it may be nicer if we could include the LLM answers in the prompt directly, for instance like so:

def fill_in_the_blanks(question, llm):
    goal, answer = compose("""
        ${question}

        To solve this problem, we will analyze each of the options and determine #{goal}. Let's begin.
        #{answer}
        """,
        question = question,
        goal=llm,
        answer=llm,
    )
    return goal, answer

direction = compose("""
Directions: In the following question, a related
pair of words or phrases is followed by five
pairs of words or phrases. Choose the pair
that best expresses a relationship similar to
that in the original pair.
BRAGGART :: MODESTY
A) FLEDGLING : EXPERIENCE
B) EMBEZZLER : GREED
C) WALLFLOWER : TIMIDITY
D) INVALID : MALADY
E) CANDIDATE : AMBITION
""")

llm = OpenAI("text-davinci-001")
fn = outlines.chain([], fill_in_the_blanks(direction, llm))

Prompting often consists in the recursive application of LLMs to the previously-evaluated prompt, and such a templating language would reduce the back-and-forth between prompting and evaluating and make the prompting strategy more directly apparent.

Persist the cache between sessions

We currently use functools to cache expensive model calls, but this is hardly useful as one rarely makes identical calls during one session.

Instead we need to persist the cache between sessions. This will allow users to prototype more quickly and cheaply.

Current specs:

  • Implement outlines.cache.get to get the Memory object initialized with the cache directory. This function is called by default everywhere there is an expensive API or model call.
  • Allow users to override the cache location with an OUTLINES_CACHE_DIR environment variable;
  • Implement outlines.cache.clear to manually clear the cache.
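
A minimal sketch of these specs, assuming the Memory object mentioned above is joblib's Memory (the module layout follows the proposal, the details are illustrative):

# Hypothetical outlines/cache.py built on joblib.Memory.
import os

from joblib import Memory

_memory = None


def get():
    """Return the Memory object initialized with the cache directory."""
    global _memory
    if _memory is None:
        cache_dir = os.environ.get(
            "OUTLINES_CACHE_DIR", os.path.expanduser("~/.cache/outlines")
        )
        _memory = Memory(cache_dir, verbose=0)
    return _memory


def clear():
    """Manually clear the persisted cache."""
    get().clear()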

Generative models output different results for the same input when sampling. In some cases this behavior might be desirable and we may want to disable the cache temporarily. We could implement an outlines.cache.disable function to disable the cache for the current session.

However, we can (and probably should) see it from a different perspective: sampling runs are actually reproducible if we fix the PRNG seed. We can pass the PRNG explicitly as an argument to the generating functions, fix the seed by default, and instead allow users to change the seed or use a random seed. This way we don't need to fiddle with the cache at all: if the seed is the same the cache will be hit; if it is different the cache won't be hit and the user will get a different result.

The only annoying thing is that OpenAI, for instance, does not allow passing a seed parameter to their generation APIs. Until they add it we will need to adopt a hybrid approach.

Implement the prompt encoding functionalities as Jinja custom filters

It feels clunky to need to decorate functions for the sole purpose of making them renderable in prompts. We should instead implement Jinja custom filters that can extract some information from functions (but also, possibly, Pydantic models) to render them in a template.

The code that is currently in the README:

from typing import Callable, List
import outlines
import outlines.text as text


@outlines.tool
def google_search(query: str):
    """Google Search"""
    pass


@outlines.tool
def wikipedia_search(query: str):
    """Wikipedia Search"""
    pass


@text.prompt
def my_commands(tools: List[Callable]):
    """AVAILABLE COMMANDS:

    {% for tool in tools %}
    {{loop.index}}. {{tool.name}}, {{tool.description}}, args: {{tool.signature}}
    {% endfor %}
    """

prompt = my_commands([google_search, wikipedia_search])

Would become

from typing import Callable, List
import outlines.text as text


def google_search(query: str):
    """Google Search"""
    pass

def wikipedia_search(query: str):
    """Wikipedia Search"""
    pass


@text.prompt
def my_commands(tools: List[Callable]):
    """AVAILABLE COMMANDS:

    {% for tool in tools %}
    {{loop.index}}. {{tool | fn_name}}, {{tool | fn_description}}, args: {{tool | fn_signature}}
    {% endfor %}
    """

prompt = my_commands([google_search, wikipedia_search])

This is more succinct, and confines the prompt logic to the text module.
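
A sketch of how such filters could be implemented with inspect and registered on a Jinja environment; the filter names match the template above, the rest is illustrative:

import inspect

from jinja2 import Environment


def fn_name(fn):
    return fn.__name__


def fn_description(fn):
    return inspect.getdoc(fn) or ""


def fn_signature(fn):
    return str(inspect.signature(fn))


env = Environment()
env.filters["fn_name"] = fn_name
env.filters["fn_description"] = fn_description
env.filters["fn_signature"] = fn_signature

template = env.from_string(
    "{{ tool | fn_name }}, {{ tool | fn_description }}, args: {{ tool | fn_signature }}"
)


def google_search(query: str):
    """Google Search"""


print(template.render(tool=google_search))
# google_search, Google Search, args: (query: str)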

Allow users to seed random sequence generations

When integrating HF's GPT2 in #30 I used Python's random.seed() and random.randint to generate a random seed value for jax.random.PRNGKey(). This method does not allow the user to seed the sequence generation, which is however necessary for reproducibility.

We can either:

  • Ask users to set the seed value globally;
  • Pass PRNG keys explicitly in the code.
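
The second option could look like the following sketch, where the key is threaded through the generation call explicitly (the generate function and its rng keyword are hypothetical):

import jax


def generate(prompt, rng=None):
    """Sample a sequence; a fixed default key keeps runs reproducible."""
    if rng is None:
        rng = jax.random.PRNGKey(0)  # deterministic default seed
    rng, subkey = jax.random.split(rng)
    # ... `subkey` would be passed to the HF/JAX generation call here ...
    return subkey


# Users who want a different (or random) seed pass their own key:
generate("A prompt", rng=jax.random.PRNGKey(42))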

Implement LMQL Fig.4

This will evolve with the specifications in #8. On top of the specs this introduces:

  1. Arbitrary constraints implemented as a function;
  2. The words property of the StringVariable type; lines is a natural extension;
  3. The newline method of the StringVariable type;
  4. A nicer interface to define simple programs;
  5. The condition argument to txt.lm.constrain, which allows passing arbitrary functions that return a boolean as additional constraints. Not a big fan of the name, but it will do for now.
import txt

llm = txt.lm.Normal()

expert = txt.lm.constrain(
        llm,
        stop_at=["\n"],
        conditions= [ lambda x: len(x.words) <= 3]
    )
answer = llm

def meta_prompt(question):
    prompt = question
    prompt.newline("I believe the best person to answer this is ")
    expert_rv = expert(prompt)
    prompt += f"{expert_rv}.\n Indeed, {expert_rv} addressed this question: "
    answer_rv = answer(prompt)
    prompt += answer_rv
    return prompt

out = txt.lm.beam_search(meta_prompt("What is the Earth's diameter?"))
out.eval()

This leaves open the possibility to add more structure to the program like control flow. However, we can expect that many programs will have a simple sequential structure like in this example, and we should provide a simpler way to define these programs. I suggest the following:

import txt

llm = txt.lm.Normal()

def meta_prompt(question):
    prompt = txt.prompt("""{{ question }}
    I believe the best person to answer this question is {{ expert }}.
    Indeed, {{ expert }} addressed this question: {{ answer }}""")

    expert = txt.lm.constrain(
        llm,
        stop_at=["\n"],
        conditions=[lambda x: len(x.words) <= 3]
    )
    answer = llm

    return prompt(question=question, expert=expert, answer=answer)

out = txt.lm.beam_search(meta_prompt("What is the Earth's diameter?"))
out.eval()

The prompts implicitly define a program; it is parsed into a directed acyclic graph. beam_search transforms this program into one that decodes the outputs of the LMs. eval runs the evaluation. This interface should be seen as the equivalent of flax.linen.Sequential.

Auto-document functions using a decorator

We can instruct LLMs how to use tool functions by passing the name of the corresponding Python function, its description and the list of its arguments. We can simplify this manual task by using the information already present in the function definition and wrapping the function in a decorator:

import outlines


@outlines.function
def google_search(query: str):
    """Google search."""
    pass


@outlines.prompt
def prompt(tools):
    """AVAILABLE COMMANDS:

    {% for fn  in tools %}
    {{loop.index}}. {{fn.description}}, "{{fn.name}}", args: {{fn.args}}
    {% endfor %}
    """

Implement LMQL Fig.11

The following currently contains errors, and probably requires some formatting functions, but the gist of how this would work with outlines is here:

  • Completions are returned by models;
  • text.prompt decorated functions can be used to generate prompts (and thus update completions)
  • We can use any control flow
  • We can use external tools
import outlines
import outlines.text as text


@text.prompt
def reAct(question):
    """What is the elevation range for the area that the eastern sector of the Colorado orogeny extends into?
    Tho 1: I need to search Colorado orogeny, find the area that the eastern sector of the Colorado ...
    Act 2: Search 'Colorado orogeny'
    Obs 2: The Colorado orogeny was an episode of mountain building (an orogeny) ...
    Tho 3: It does not mention the eastern sector. So I need to look up eastern sector.
    ...
    Tho 4: High Plains rise in elevation from around 1,800 to 7,000 ft, so the answer is 1,800 to 7,000 ft.
    Act 5: Finish '1,800 to 7,000 ft'
    {{ question }}
    """"


@text.prompt
def mode(i, mode, object, completion):
    """{{completion}}
    {{mode}} {{i}}: {{object}}
    """


model = text.completion("openai/davinci", stop_at=["\"])


question = "..."  # the new question to answer
completion = ""

for i in range(10):
    next_mode = model(reAct(question))
    if next_mode == "Tho":
        answer, completion = model(mode(i, "Tho", "", completion))
    elif next_mode == "Act":
        action, completion = model(mode(i, "Act", "", completion))
        subject, completion = model(action(completion))
        if action == "Search":
            result = outlines.tools.wikipedia(subject)
            completion = mode(i, "Obs", result, completion)
        else:
            break

Add advanced prompting techniques to the DSL

From this gist:

  • Prompt alternating
  • Prompt weighing
  • Prompt fusion

This will require extending the DSL to do more than what Jinja offers, which involves a bit of software machinery. And not all of these are implementable with closed-source models.

Add chat completion

Chat completions are essentially the same as standard text completion, but they use a slightly different API. Instead of a single prompt, they take a query, an "instruction" prompt and a "chat history". These three elements are combined upstream to form a prompt that is fed to a language model (see the documentation for Anthropic's Claude). I'd rather not overfit the library's API on this particular kind of interaction pattern, but this format is ubiquitous today and there is no way around it for OpenAI's endpoints.

This has implications for the API of text completions if we want to maintain a consistent API across both outlines.completion and outlines.chat_completion.

import outlines.text as text


query = "What is Mt Everest's height?"
instructions = "You are a question answering model"
history = [{"query": "", "answer": ""}]

model = text.chat_completion(
    "openai/gpt-4",
    max_tokens=128,
    stop_at=["."]
)
answer, new_history = model(query, instructions, history)

For the sake of consistency we may want to turn the text.completion decorator into a simple function that outputs a model that should be called with a str:

import outlines.text as text


prompt = "What is Mt Everest's height?"

model = text.completion(
    "openai/text-davinci-003",
    max_tokens=128,
    stop_at=["."]
)
answer, completed_prompt = model(prompt)

Which, with hindsight, is not necessarily a bad thing. It's hard (when at all possible) to un-decorate a function in Python, and allowing such strong coupling between the model and the prompt may involuntarily prevent prompt re-use. The @text.prompt decorator would remain, and can be used for the prompt, the instructions or even the history. Example of usage:

import outlines.text as text


@text.prompt
def query(question):
    """I have a question for your:
    {{question}}"""

@text.prompt
def instruction(type):
    """You are a question-answering {{type}}"""

model = text.chat_completion(
    "openai/gpt-4",
    max_tokens=128,
    stop_at=["."]
)
answer, new_history = model(query("What is Mt Everest's height?"), instruction("model"))

Particular cases

Note that chat completion APIs behave as standard completion endpoints when instructions=None and history=None, so the corresponding models should also be made available via text.completion, handling the None cases within the model integrations.

Anthropic's Claude can emulate chat behavior when we format the prompt in a specific way. Since it is specifically mentioned in their documentation we could offer a "chat completion" interface for their model as well.

All that to say, we are pretty much overfitting on a particular use case / OpenAI's API.

TBD

  • The order of the arguments when calling chat completion models.

Ops

Ops are elements in the graph that create one or several new Variables from one or several variables.

The most elementary ones are simple string manipulation:

text = txt.text()
text + "stuff"
text < "stuff"
text << "stuff"

We also have Boolean operators:

txt.endswith(".")
"." in txt

Or operators that can return an int:

txt.words
len(txt)
txt.lines

Non-trivial Ops can be anything. For instance storage and retrieval in a vector store:

vector_store.store(text)
vector_store.query(text)

Writing/reading from a file:

file.read(text)
file.write(text)

Search:

engine.search(text)

Calling an API (we can have a set of API whose interaction patterns have already been implemented):

requests.get(text)

Executing code:

python.execute(text)

Add the possibility to seed the generations

Generative models return samples from a distribution; if no seed is set they should return different results each time they are called. The current caching implementation however does not take the random state into account. It is only based on the other inputs, so two successive runs will return the same value even though no seed has been set. This behavior is unintuitive and should be corrected.

Instead, we should add a mechanism so users can set the seed globally. If the seed is set the cache will be hit at the next run, otherwise a new result will be returned. This is however made slightly complicated by the fact that it is not possible to pass a seed value to some API providers like OpenAI and Anthropic, so they will generate a different value each time they are called when the temperature is different from 0.

I have thought of a possible workaround. We can manually define cache keys in the form:

random_seed + model_name + call # + argument values
  • The random_seed can be set globally by the user with outlines.seed(42), otherwise it is drawn at random when requested. This ensures that consecutive runs will not hit the cache when no seed is set.
  • model_name is here to ensure that we're not hitting the cache if we've changed the model used for generation between two runs.
  • argument_values so the cache is not hit when the same model is called twice with different arguments.
  • call # so that calls to closed APIs with non-zero temperature can still be cached while repeated calls within the same program execution yield different results.
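
A sketch of how such a key could be assembled; the global seed mechanism and the call counter are hypothetical:

import hashlib
import itertools
import random

_global_seed = None                # set by outlines.seed(42)
_call_counter = itertools.count()  # "call #", incremented at every model call


def seed(value):
    global _global_seed
    _global_seed = value


def cache_key(model_name, *args, **kwargs):
    """Build the key: random_seed + model_name + call # + argument values."""
    run_seed = _global_seed if _global_seed is not None else random.getrandbits(64)
    payload = f"{run_seed}:{model_name}:{next(_call_counter)}:{args}:{sorted(kwargs.items())}"
    return hashlib.sha256(payload.encode()).hexdigest()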

Implement LMQL Fig.10

import outlines.text as text


@text.prompt
def chain_of_thought(examples, options):
    """
    {% for example in examples %}
    Pick the odd word out: {{example.options | join(", ")}}.
    {{example.reasoning}}
    So the odd one is {{example.odd_one}}.
    {% endfor %}

    Pick the odd word out: {{options}}
    """


model = text.completion("openai/davinci", stop_out=["Pick the odd word out"])

Make outlines generate its own documentation

We can iterate on the following prompt. The goal is to have outlines generate its own documentation in CI.

import outlines.text as text

@text.prompt
def build_documentation_prompt(fn, calling_functions):
    """
    Write a concise, high level, high quality description of the Python function below:

    FUNCTION TO DOCUMENT:
    {{ fn | source }}

    To give you some context here are a few functions that use the function you need to document:

    CONTEXT:
    {% for fn in calling_functions %}
    {{ fn | source }}
    {% endfor %}

    You can now describe the function:
    """

Language programs

Language models are distributions over sequences

A language model is a distribution over sequences of tokens. Sampling from a language model returns sequences of tokens that follow the model's distribution. The output of a pre-trained language model parametrized by a prompt $P$ is a random variable:

$$ sequence \sim \operatorname{LM}_\theta(P) $$

What would this look like in code? In the following $s_{rv}$ represents a random variable:

model = lm.Normal()

prompt = "test"
s_rv = model(prompt)
type(s_rv)
# RandomString

Constrained language models

We can further constrain the output of the LM, in which case we are defining a new distribution.

$$ sequence \sim \operatorname{LM}^c_\theta(P) $$

Say we want the sequences to stop after a set of tokens has been found, or to start with a given set of tokens. The constraints apply to the LM distribution itself:

model = constrain(
    lm.Normal(),
    stops_at = ["\n", "."],
    starts_with = ["The"],
)

prompt = "test"
s_rv = model(prompt)

We can expand these constraints to add more complex validation methods, for example for code-generation tasks (see this, this and this paper for instance). The LMQL paper suggests an efficient way to apply these constraints.

An interesting case is when we limit the output of the LM to a finite number of tokens. In this case we define a new random variable we can truly sample from. Syntax is not yet clear in my mind, but I feel we should distinguish this case from the starts_with and stops_at constraints above:

model = lm.Normal()

prompt = "test"
s_rv = model(prompt).choose_between(["beach towel", "watch"])

Language generators

A language generator is a function that returns a token sequence given an input token sequence. It may be deterministic or stochastic, and may or may not be parametrized. The combination of an LLM with a decoding method (argmax, sample, beam search, nucleus sampling, etc.) is a language generator. Decoders can be seen as program transformations, the same way joint_logprob is in AePPL: they produce an execution graph that returns a string.

model = lm.Normal()

prompt = "test"
s_rv = model(prompt)

s = argmax(s_rv)  # greedy, tries to get the "best" sequence
s = beam_search(s_rv)  # greedy, tries to get the "best" sequence
s = self_consistency(s_rv) # greedy as well
s = ancestral_sampling(s_rv) 

Self-consistency is defined in this paper

Language programs

Language programs are Directed Acyclic Graphs that link different LM-distributed random variables together. They are typically applied recursively to an initial prompt that is augmented with the RVs:

model = txt.llm.Normal()

prompt = "Q: "
q_rv = model(prompt)
prompt += q_rv + "\nA: "
a_rv = model(prompt)
prompt += a_rv

In theory, executing this graph with e.g. prompt.eval() should return random strings (maybe with ancestral_sampling?). In practice, we often want to get an optimal-ish output. In this case we can transform the graph using the previously-defined operators. Different operators behave in different ways. For instance, argmax greedily decodes the graph, so this program:

prompt = "Q: "
q_rv = model(prompt)
prompt += q_rv + "\nA: "
a_rv = model(prompt)
prompt += a_rv

out = argmax(prompt)
out.eval()

is equivalent to this one:

prompt = "Q: "
q = argmax(model(prompt))
prompt += q + "\nA: "
a = argmax(model(prompt))
prompt += a
prompt.eval()

Other program transformations, like beam_search, yield different results when they're applied to a whole graph or to individual LM rvs. When applied to a graph with multiple LM calls, the beams used to decode a variable are continued when decoding the next variable, thus trying to find the most likely sequence for the program as a whole (called scripted beam search in the LMQL paper). When applied to the LM calls individually, the beams are re-initialized after each decoding:

prompt = "Q: "
q_rv = model(prompt)
prompt += q_rv + "\nA: "
a_rv = model(prompt)
prompt += a_rv

out = beam_search(prompt)
out.eval()

# is NOT equivalent to

prompt = "Q: "
q = beam_search(model(prompt))
prompt += q + "\nA: "
a = beam_search(model(prompt))
prompt += a
prompt.eval()

Other random variables

Other random variables can be part of a language program. They are not affected by generators, in the sense that an .eval() call on the output will consist in first drawing from the random variables' distribution and then decoding. Example of a random variable:

a_rv = choice(["The", "A", "All"])
a_rv.eval()
# The
a_rv.eval()
# All

This also applies to llm(prompt).choose_between(["The", "A", "All"]) types of random variables. Such variables can be used in a context where we want to infer the best few-shot prompts for a given task, for instance.

Infer the posterior distribution of values

In a program where we do not apply a generating transformation (such as beam_search) to graphs containing LM-distributed random variables like a_rv = model(prompt), it is not clear how to perform efficient inference, because defining good proposal distributions in this space is non-trivial afaik. It remains nevertheless an open possibility with this implementation.

It is however possible to perform simulation-based inference when using one of the generators, thus treating language programs as simulators. We can use humans in the loop to validate the sample, or apparently even use LMs as discriminators.

Use tools (like web search)

Tools are operators that take a sequence as an input and return a sequence (or a list of sequences). They can thus easily be added to the graph:

model = llm.Normal()

p = "Prompt"
a_rv = model(p)
res = google_search(a_rv)
p += res
b_rv = model(p)

Here is a survey on augmented language models. We could use web search, API calls, code execution, etc.

We can even add humans in the loop with for instance a human_input operator.

Multi-modality

Multi-modality is achieved by defining an ImageVariable type, and defining operators that act on/return these types. For instance with a stable_diffusion operator:

prompt = "Q: "
q = argmax(model(prompt))
prompt += q + "\nA: "
a = argmax(model(prompt))
prompt += a

img_prompt = beam_search(prompt)
img = stable_diffusion(img_prompt)

img.eval()

Support async model calls

Since many outlines functionalities rely on API calls, the execution of programs is mostly I/O bound. The performance of #22 would thus be greatly improved if we could call the models asynchronously. Most SDKs support asyncio out-of-the-box so this should be straightforward.
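
A sketch of what batching the calls could look like with asyncio; the acomplete coroutine stands in for whichever async SDK call ends up being used:

import asyncio


async def acomplete(prompt):
    """Placeholder for an async SDK call (e.g. an HTTP request to a provider)."""
    await asyncio.sleep(0.1)  # simulate network latency
    return f"completion for: {prompt}"


async def batch_complete(prompts):
    """Run all completions concurrently instead of sequentially."""
    return await asyncio.gather(*(acomplete(p) for p in prompts))


results = asyncio.run(batch_complete(["A first input", "a second input"]))
print(results)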

LM interaction

  • LM = token generator given a prompt.
  • Output is a random variable $x_t \sim \operatorname{LLM}(prompt, \theta, x_{<t})$
  • Should be able to choose the generation method (decoder) between argmax and beam_search; this can be passed as a StringVariable (and it can eventually be random)
  • By extension, output of the generation method (which is a deterministic transformation), is a random variable.
  • LLM + decoder can be considered a simulator. We are limited to simulation-based inference.
  • LM + decoder is an Op that takes text as an input and returns text.
  • Constrained generation: https://arxiv.org/abs/1804.06609
  • Choice of LM could be stochastic as well and we could infer which gives best results
# `constrain` and `is_in` modify the decoding process so they may apply
# to the decoder itself (not represented here)
lm_2 = txt.lm.constrain(
    txt.lm.Normal(),
    lambda x: len(x.words) <= 3
)
lm_1 = txt.lm.is_in(
    txt.lm.Normal(),
    ["Tho", "Act"]
)


s = txt.text()
# Make sure that the generation is only executed *after* the choice is made
l = txt.random.choice([lm_1.generate(s), lm_2.generate(s)])

Refactor the completion and chat completion interfaces

Work in progress

Outlines currently proposes two interfaces with language models: outlines.text.completion which wraps simple completions with a language model, and outlines.text.chat_completion which wraps the interface with chat-like APIs. However, I believe that the distinction as it is currently made is artificial:

  • Chat-like APIs can also be used for simple text completion by passing a single user query and no prefix;
  • Chat-like APIs are a wrapper around an autoregressive completion process;
  • Completion APIs can be used in an autoregressive completion process, possibly involving different models.

We should thus refactor the interface with language models, and make a distinction based on the way they are used:

  • One-off completions, i.e. taking one sample from the distribution that these models represent.
  • Compound completion processes (chains?) in which we mix calls to different models, calls to models followed by user inputs, etc.

Taking a single sample

The current API makes it easy to generate samples

import outlines.text as text

@text.completion("openai/text-davinci-003", stop_at=["\n"], max_tokens=128)
def complete_task(objective, task):
   `"""You are an AI who performs one task based on the following objective: {{objective}}.

    Your task: {{task}}

    Response:
    """

result, completed = complete_task("Something", "something")

However:

  1. The decorator means the prompt can only be used by one model, and/or with one parametrization.
  2. It is hard to debug the prompt in isolation, and this is partly why we currently return the concatenation of the prompt and the result.

Consider the alternative way of getting the same result:

import outlines.text as text
import outlines.models as models

@text.prompt
def complete_task(objective, task):
   `"""You are an AI who performs one task based on the following objective: {{objective}}.

    Your task: {{task}}

    Response:
    """

model = models.OpenAICompletion("text-davinci-003", stop_at=["\n"], max_tokens=128)
prompt = complete_task("Something", "something")
result = model(prompt)

It is more explicit, and not substantially more complicated. To make model discovery easier, we could turn text.completion into a function that returns the model instead.

import outlines.text as text
import outlines.models as models

@text.prompt
def complete_task(objective, task):
   `"""You are an AI who performs one task based on the following objective: {{objective}}.

    Your task: {{task}}

    Response:
    """

model = text.completion.openai("text-davinci-003", stop_at=["\n"], max_tokens=128)
prompt = complete_task("Something", "something")
result = model(prompt)

For convenience we can add an outlines.models.completion function which returns the model class corresponding to a model name:

import outlines.text as text
import outlines.models as models

@text.prompt
def complete_task(objective, task):
   `"""You are an AI who performs one task based on the following objective: {{objective}}.

    Your task: {{task}}

    Response:
    """

model = models.completion("openai/text-davinci-003", stop_at=["\n"], max_tokens=128)
prompt = complete_task("Something", "something")
result = model(prompt)

which may be more convenient when providers distinguish between completion and chat completion in their APIs.

Compound completion processes

Consider the following meta-prompting workflow:

import outlines.models as models
import outlines.text as text

expert_model = models.OpenAICompletion(stop_at=["\n", "."])
answer_model = models.OpenAICompletion()


@text.prompt
def find_expert(question):
    """
    Q: {{question}}
    A: A good person to answer this question would be
    """


@text.prompt
def get_answer(expert, memory):
    """
    {{memory}}.

    For instance,{{expert}} would answer
    """


expert_ppt = find_expert("What is the Earth's diameter?")
expert = expert_model(expert_ppt)

history = expert_ppt + expert
answer_ppt = get_answer(expert, history)
answer, completed = answer_model(answer_ppt)

which is very similar to interacting with e.g. OpenAI's ChatCompletion API, where at each step we send a list that contains the previous interactions and the new query, and get back the model's completion.

This interface could be improved, as the user currently has to:

  1. Manually concatenate the result to the prompt
  2. Manually manage the context length of each model
  3. Keep track of the concatenations, which quickly becomes difficult

Instead, we could let models have a state parameter which contains previous queries and results, and add convenience functions to manipulate this state.

prompt = ""
answer, state = first_model(prompt)

prompt = text.concatenate(state) + ""
answer, state = second_model(prompt)

We could also let models concatenate the state internally:

prompt = ""
answer, state = first_model(prompt)

prompt = ""
answer, state = second_model(prompt, state)

This way the distinction between completion and chat completion APIs disappears, as the model would interpret the state as a succession of user queries and model outputs:

prompt = ""
answer, state = chat_model(prompt)

prompt = ""
answer, state = chat_model(prompt, state)

The output of models can be accessed by indexing the state:

state[first_model]   # The first model's answer
state[second_model]  # The second model's answer
state[chat_model]  # The whole interaction trace

We can also add a name keyword argument when calling the model, which allows retrieving the state more easily down the line and points to a particular answer when the same model instance is used several times:

prompt = ""
answer, state = model(prompt, name="first")

prompt = ""
answer, state = model(prompt, state, name="second")

state["first"]

The state itself can be modified, or initialized. The model can fetch the history up to the size of its context window. We can initialize the state with a prefix prompt which is never cut out when the context window has been reached.
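
A sketch of such a state object, indexable by name (or by model) and able to truncate the history to a context window; everything here is illustrative:

# Hypothetical sketch of the interaction state discussed above.
class State:
    def __init__(self, prefix=""):
        self.prefix = prefix       # never cut out when the context window is reached
        self.interactions = []     # (key, prompt, answer) triples

    def append(self, key, prompt, answer):
        self.interactions.append((key, prompt, answer))

    def __getitem__(self, key):
        # A single answer, or the whole trace when the key was used several times.
        answers = [a for k, _, a in self.interactions if k == key]
        return answers if len(answers) > 1 else answers[0]

    def render(self, max_characters=None):
        """Concatenate the history, keeping only the most recent characters if needed."""
        body = "\n".join(f"{p}{a}" for _, p, a in self.interactions)
        if max_characters is not None and len(body) > max_characters:
            body = body[len(body) - max_characters:]
        return self.prefix + body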

Random variables

Builds on the specs defined in #8.

import txt

few_shot_examples = [
    {"question": "What is?", "answer": "It is"},
    {"question": "What is?", "answer": "It is"}
]

llm = txt.lm.Normal()

def random_prompt(prompt):
    chosen = txt.random.choose(few_shot_examples)
    for ex in chosen:
        prompt.newline(f"{ex['question']}: {ex['answer']}")

    return llm(prompt)

result = txt.lm.beam_search(random_prompt("Please answer the following question"))
result.eval()
# Something
result.eval()
# Something different since the few shot examples are drawn at random

We can use simulation-based inference to infer the best choice of few-shot examples for a given use-case.

Implement a local "vector store"

Calling a remote vector database for exploratory work on a few 1k-100k examples feels like complete overkill, and any potential performance gain would probably be eaten by latency anyway. We should add both an in-memory store and a simple disk store.
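
A minimal in-memory version, assuming embeddings come from some user-provided embedding function; class and method names are illustrative. A disk store could simply persist the same arrays and texts with numpy.save and JSON:

import numpy as np


class InMemoryVectorStore:
    """Brute-force cosine-similarity search over a few thousand embeddings."""

    def __init__(self, embed):
        self.embed = embed      # callable: str -> 1-D np.ndarray
        self.vectors = []
        self.texts = []

    def store(self, text):
        vector = np.asarray(self.embed(text), dtype=float)
        self.vectors.append(vector / np.linalg.norm(vector))
        self.texts.append(text)

    def query(self, text, k=5):
        query = np.asarray(self.embed(text), dtype=float)
        query /= np.linalg.norm(query)
        scores = np.stack(self.vectors) @ query
        best = np.argsort(-scores)[:k]
        return [self.texts[i] for i in best]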

Return extra information with model calls

We currently return the completed prompt in addition to the model's return value. However models often return more information than this: the total logprob value for the completion, API usage stats, etc. Instead of simply returning the completed prompt we should return an object that contains more information, e.g.:

from typing import NamedTuple

class OpenAICompletionInfo(NamedTuple):
    prompt: str
    completion: str
    logprob: float
    api_usage: OpenAICompletionUsageInfo

model = text.completion("openai/davinci")
answer, info = model("Who was Arianna Rosenbluth?")
