#tbc

active draft. a technical sketch. general, before special —alignment, before distraction

—How might we align the intent and behaviour of artificially-intelligent-systems with human-values?

 


background

#tbc

Links:

 


summary

#tentative

In principle, alignment is a simple exercise, child’s play even—join the dots.

(Albeit mathematically and causally, across distinct-yet-related domains of concern)

image of related domains of concerns #tbc

The problem for the field of ai is that not all requisite domains of concern are sufficiently understood, mapped, or modelled (or ‘dotified’ ^[ Yeah yeah, I dunno, we’ll see ]): the ai alignment problem is really a mechanics ^[ Mechanics refers to both: a mathematical account of cause and effect ^[ So any circumstantially evaluable map ]; and the corresponding ^[ General, implied ] conceptually isolated causal circumstances or characteristics of territory ] problem, which follows the ai problem-definition problem ^[ Define here #tbc ] ^[ And a number of others; see: a sketch for the field of artificial intelligence ].

the ai alignment problem cannot be addressed directly until dependent gaps in understanding are first resolved (at which point, the alignment problem as it stands today will disappear)

The intent here is to develop generalised intuitions on the fundamental nature of alignment, by considering alignment within unrelated domains of concern, and then to devise a generalised, technology-and-implementation-agnostic framework for reasoning about alignment across and between any domain.

 


questions

#tentative

  • How might we align the intent and behaviour of artificially-intelligent-systems with human-values?

  • How might we define human values?

  • How might we share human values with an autonomous system?

  • How might we instil human values in the operations of autonomous systems?

  • How might we compose autonomous systems such that human values are integral to operation?

  • How might we constrain autonomous systems such that human values cannot be violated?

    • How might we catch violation errors?
    • How might we correct violation errors?
    • How might we notice violation errors?
  • How might we defamiliarise the present preoccupations of the field of ai, such that the root of the ai alignment problem can be seen?

 


approach

#tbc

  1. Defamiliarisation|introduce defamiliarisation
  2. First forms|initial high-level map of the space of ai alignment
  3. Alignment in other domains
  4. ..

 


defamiliarisation

#tbc the means to see past priors, to reinterpret circumstances without (or-with-fewer) bias(es)

  • Bryan Kam’s discussion on defamiliarisation
  • Value of defamiliarisation:
    • See anew; check assumptions; see past maps, priors; unsentimentally reinterpret
  • Examples of value of defamiliarised perspectives:
    • Mathematical or software review, refactoring; writing editors; business analysis; academic paper reviewers and peer review; art; spring clean, rearrange, remodel;
  • Mathematics, specifically any act of independent mathematical modelling, defamiliarises phenomena
    • Add notes linking mathematical abstraction and modelling to defamiliarisation ^[ We will use simplified abstract mathematics to perform the same function; so where the purpose of all art has been considered an attempt to defamiliarise objects in the mind of the audience, we might view this mathematical exercise as an exercise of art, to solve real-world, technical problems ]

 


first forms

#tbc alignment from first formal scope (to first principles, and back)

—uh, what is ‘first forms’? ‘formal scope’? ‘scope-first’?

  • ‘First forms’ is an experimental name for a methodology i’m looking to describe ^[ If a name for this exists, please do let me know, it’ll save me much time. With thanks ], perhaps paraphrased as constraint-driven analysis
    • First forms is a play on first principles: where first principles might be thought of as bottom-up (and in-out) primitives from which subsequent synthesis is composed, first forms frame the inverse, top-down (and out-in), as a tiered hierarchy of decreasingly simplified objective abstractions, each of which captures and constrains subsequent scopes of concern, all of which constrain and define the respective implementation, composed by first principles
    • First forms is like software engineering’s test-driven development: an intention is defined ahead of developing some fragment of code, to constrain the respective behaviour/functionality, and to better guarantee against undefined or out-of-bounds outcomes
    • Where first principles concern the details, alignment, and composition of micro structures, up-and-out to the scope boundary, formal scopes ensure that all details, alignment, and composition remain shaped by and constrained by the bounds of overall intent
    • Further, the definition of formal scopes is somewhat analogous to software interfaces, which define the external boundary of all respective implementation, and which can be considered and reasoned about directly, and formatively, while masking details of actual implementation (see the sketch below) ^[ Discuss parallels between conceiving of phenomena by formal scope, and more is different, or emergence ]
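As a loose sketch of that interface analogy (the names `WorkScope` and `do_work` are hypothetical, and python's typing.Protocol merely stands in for ‘interface’): a formal scope might be expressed as a typed boundary that any implementation must satisfy, and which can be reasoned about directly while its internals remain masked.

```python
from typing import Protocol


class WorkScope(Protocol):
    """A formal scope: the external boundary that any implementation must satisfy."""

    def do_work(self, directive: str) -> str:
        """Accept a directive, return an artefact; internals remain masked."""
        ...


def run(scope: WorkScope, directive: str) -> str:
    # We reason against the scope alone; implementation details stay hidden.
    return scope.do_work(directive)
```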

introduce scope-first concepts ^[ see: reckon reason reconcile ]

  1. reckon: to map the space at a high level, and define extrinsic constraints
  2. reason: to increase specificity, within constraints
  3. reconcile: precision, validation, application

 


defining the first formal scope

#tentative the first formal scope: headline terms

—what is alignment?

Lining things up. Here’s a line:

ai alignment problem line.png

—review: on its own, this doesn’t tell us much…

—what about some context?

Let’s give our line end-points: a, b

ai alignment problem a-b.png

Ok, we’ve aligned a with b! …

—problem solved?!

Not quite: still too abstract.

Let’s consider a concrete application (and frame our real world problem)

ai alignment problem ai human values.png

Excellent.

We’re on our way to a constrained interrogation of the ai alignment problem. Wherever else our analysis takes us, it will comply-with, be simplifiable-to, this first formal scope ^[ Baby steps, bear with… ].

Now let’s delve a little deeper.

note: we can go back and revise our formal scopes as often as we need

 


second formal scope

#tbc discover and introduce additional terms to a more detailed scope, constrained and shaped by the first

Our first formal scope relates two phenomena: ai, human values

ai alignment problem ai human values.png

Let’s start with the latter.

human values

—how might we expand our definition of human values? (but simply)

We ought to add more detail: human values are not singular, but plural, and arbitrarily diverse; so what we want is to align ^[ Relate, consider, and assess ] ai with each human value individually ^[ A collection; a set of elements ] ^[ Explain why #tbc ].

ai alignment problem ai values.png

Excellent. What’s next?

ai

—how might we expand our ai concerns? (but simply)

Ai is something which we can direct to do work ^[ Operationally prompted, but the entire ml training process is a human-directed endeavour ].

—how might we define ai work?

Let us consider work as a materially consequential directed process:

  • Commonly a process with material side-effects ^[ Which might be textual or media, though may be an api call, etc ], which we will refer to as artefacts, and which the process emits as output
  • Artefacts emitted as output aim to satisfy the intended objective, and as such are the purpose of ai work
    • (Such that we might say that ai work has been successful if output is fit-for-purpose)
  • Direction is provided in advance of work, commonly as a prompt
    • (Or perhaps pre-configuration, of some kind)

Ok, but too much detail…

—how might we define ai work more simply?

we might define ai work as a simple sequence:

  1. direct ai with a prompt, or some kind of input
  2. input directs work process
  3. work emits output (artefacts of work)

Consider input $i$, work $w$, and output $o$: $$w(i) \rightarrow o$$
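A minimal sketch of this sequence (names hypothetical; the work process is treated as an opaque callable, not as anything engineered directly):

```python
from typing import Callable

Input = str
Output = str
Work = Callable[[Input], Output]  # the opaque ai work process


def direct(work: Work, prompt: Input) -> Output:
    """Direct an opaque work process with some input; collect the emitted artefact."""
    return work(prompt)


# A stand-in work process; the real process is opaque and not engineered directly.
echo_work: Work = lambda i: f"artefact derived from: {i}"
output = direct(echo_work, "summarise the ai alignment problem")
```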

—ok, so what aligns with what exactly?

The ai alignment problem as defined:

—How might we align the intent and behaviour of artificially-intelligent-systems with human-values?

We are looking to align intent and behaviour, so the obvious intersection point for alignment is the ai work process ^[ Discussion on why #tbc ].

ai alignment problem work values.png

But wait! One problem we have is that our ai work process is effectively opaque to us.

we do not know exactly what goes on within the ai work process:

curiously, the mechanics of ai work are largely unknown and not engineered directly, but result from a separate process altogether (which we will ignore for the time being)

our ai work process can be somewhat shaped or influenced at runtime, via parameters, but we will consider these details, and others, later ^[ extend this discussion ]

So, our initial options for alignment are: input; and output

Of the two, we will initially focus on output ^[ By any measure, output alignment would be considered a success at this time. Aligning input to be discussed later ].

ai alignment problem output values.png

Hmm, but something else is missing.

In addition to human values, output ought to align with our intended objective, and in such a way that we recognise and consider fitness-for-purpose: after all, if our output is not fit-for-purpose, our use for the work of ai may be limited to novelty demonstrations, and somewhat short-lived.

ai alignment problem output objective values.png

Great, however: this presents a curious circumstance.

We began by considering the alignment of ai systems with human values, but another kind of alignment must exist: the alignment of ai systems to the intended objective.

—is this objective alignment the same kind of alignment? ^[ and if so, what might we learn? ]

  • One difference between the alignment of intended objective, and human values, relates to the circumstances and quality of definitions:

    • Human values, one might expect, are general, generally stable, and shared by most people; as such, definitions ought to be the result of extended periods of collaborative refinement, with intent for completeness
    • In contrast, the definition for any intended objective is likely composed on demand, generally by an individual, under arbitrary time constraints, who will generally optimise for minimal viable detail and least time to input
  • We might also note that the specification for the intended objective is input into the ai system in the moment of operation, whereas the more robust specification for human values is available in advance, to be pre-input, and thereby integral to the ai work process

ai alignment problem output objective values relation.png
  • Given the framing of the ai alignment problem, and current progress, this is perhaps unexpected
    • The quality, stability and availability of specification for human values ought to be much higher than an on demand specification for any intended objective
  • Further, ought it not already be the direct pursuit of some fraction of each team involved in developing ai systems? ^[ Where are those models? How much resource is spent on these objectives? ]
    • Specifically: if we consider the problem of ai alignment in isolation, and imagine that we had trained an ai model solely on ’the alignment concerns of human values’, surely in that case, the intended-objective and alignment-with-human-values are the same thing?
    • Might we then feed the output of our system as input to that system? (see the sketch after this list)
  • I mean, why not? The ai work process is opaque—it is not directly engineered. Why not continue in the hope that an ai system focussed on human values magics up an appropriate, albeit opaque, human-value ai work process?
  • One good reason has not been covered yet, and falls under the category of assessment
  • Presently, for ai systems whose only concern is the intended objective, the assessment of alignment between input specification and output is still the ultimate responsibility of each human operator
    • Alignment is not measured systematically or subtly: many results are fundamentally wrong, misaligned with the intent of the specified input
    • These ai systems, it appears, produce results from a far greater scope-of-legally-possible than we might like, and than is necessary for the system itself to better assess alignment between any concerns, intended or integral
    • While most corporate resource and attention appears directed at shaping the scope-of-legally-possible outcomes for intended objectives only, given the presumed circumstances of specifications on human values, might that offer a better initial learning environment?
  • Summarise: #tbc
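To make the ‘feed the output back’ idea concrete, a speculative sketch (all names hypothetical; `values_assessor` stands in for the imagined process concerned solely with human-value alignment): the primary work process runs as usual, and its output is passed, as input, to a second process whose only concern is human values.

```python
from typing import Callable, Optional

Work = Callable[[str], str]        # the primary, intended-objective work process
Assessor = Callable[[str], bool]   # True if output appears consistent with human values


def work_with_values_review(primary: Work, values_assessor: Assessor,
                            prompt: str) -> Optional[str]:
    """Run the primary work process, then feed its output into a values-focused process."""
    output = primary(prompt)
    if values_assessor(output):
        return output
    return None  # withhold the output rather than release a suspected violation
```

The assessment here is a crude boolean; the point is only the shape of the pipeline, not the mechanics of the assessment itself.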

—how ought we make sense of this?

As discussed, human values ought to be a pre-existing specification, high-quality, and relatively stable: other older industries must already align their concerns with human values.

—what does that look like?

We will take a look shortly, but first, a few more notions to frame our inquiry.

 


on alignment, intersection, relation, and dimensions

#tbc human values vs the intended objective

—how might one thing be aligned with many things?

intersection, relation

Let’s back up a bit…

—what is alignment? -> a line of joined-up-dots ^[ simplifying ]

—and if there are two lines?

  • #tbc
  • To align more than one thing, the lines (of joined-up-dots) must themselves join
  • Each line is a continuous intersection of related concerns ^[ Intersection and relation are integral to alignment, and will be discussed in detail ]
  • When we intersect, when we align, we declare a relationship
    • Relationships are not one-to-one, but many-to-many
    • It really isn’t unreasonable that we might expect to align one thing with many things

—but how might we present this? how might we consider and navigate alignment of many-to-many?

The first step is the same as in any formal analysis:

—how might alignment be measured?

Let’s consider the general case of alignment.

ai alignment problem a-b.png

This simplified case assumes priors: that a can be aligned with b.

Earlier we noted that alignment depends upon relation, so we might reframe this prior as assuming that a can be related to b. Whether or not this is the case is circumstantial: it depends upon the available scope of concerns. A might not relate to b directly; however, if scope allows, a relation between a and b might be made indirectly, via intermediates.

—what is an intermediate?

Simply, in the case whereby only one intermediate is necessary, that intermediate, x, is some other dot which can be related to both a and b ^[ This may not always be the case, so alignment is ~always circumstantially evaluated ].

ai alignment problem a-x-b.png

here, we might consider that while b was not directly found within the scope of concerns for a (and vice versa), both a and b must be included in the scope of concerns of x
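A minimal sketch of relation-via-intermediates (the dots and the toy relation map are illustrative): treat dots as nodes and relations as edges, and ask whether a path exists from a to b within the available scope of concerns.

```python
from collections import deque

# A toy scope of concerns: each dot maps to the dots it relates to directly.
relations = {
    "a": {"x"},
    "x": {"a", "b"},
    "b": {"x"},
}


def can_align(start: str, goal: str, relations: dict) -> bool:
    """True if start can be related to goal, directly or via intermediate dots."""
    seen, frontier = {start}, deque([start])
    while frontier:
        dot = frontier.popleft()
        if dot == goal:
            return True
        for neighbour in relations.get(dot, set()):
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append(neighbour)
    return False


assert can_align("a", "b", relations)  # a relates to b only via the intermediate x
```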

dimensions

Ok, let’s reconsider our task with these new terms (and simplify our example)

If (in addition to our intended objective), our concerns only included one human value, how might that look?

ai alignment problem x-a-b.png

We might observe that output appears as-if-an-intermediate between our intended objective and our single human value. Whether or not our intended objective relates directly to our human value, our output must consider both: both the intended objective and the human value must be included, in some manner, in the scope of concerns of ai work.

ai alignment problem xab scope of concern.png

—sure, this seems obvious, but what are we getting at?

What we’re doing here, with dot-joining and scopes of concern, is reframing the ai alignment problem in visual, geometric terms: framing abstract notions in geometric terms is a useful way to leverage a commonly understood language of relative structure and constraint.

Consider the above scope of concerns with our original set of human values.

ai alignment problem output values scope of concern.png

This visual allows us to perceive that our output includes human values, but doesn’t tell us much about how.

To do that, we ought to separate each relation out, such that each relation may intersect with output in arbitrarily distinct ways. We might now consider each related human value as a distinct dimension of concern, within an overall multi-dimensional scope of concerns.

image of geometric plane intersection #tbc

In which case, our intended objective is now simply one additional dimension.

image of geometric plane intersection with objective #tbc
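A sketch of this multi-dimensional framing (names and scores are illustrative): each human value becomes one dimension of concern, and the intended objective is simply one more dimension alongside them.

```python
# Hypothetical per-dimension alignment scores for a single piece of output,
# each in [0, 1]; the intended objective is simply one dimension among many.
dimensions_of_concern = {
    "intended objective": 0.9,
    "human value: safety": 0.8,
    "human value: honesty": 0.7,
    "human value: privacy": 0.95,
}


def assess(scores: dict, threshold: float = 0.75) -> dict:
    """Assess each dimension of concern independently against a threshold."""
    return {dimension: score >= threshold for dimension, score in scores.items()}


report = assess(dimensions_of_concern)  # e.g. honesty falls below the threshold here
```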

ok, one last thing

assessment and priority

for all dimensions of concern, who assesses quality? of all dimensions of concern, are some more important?

Present-day ai systems are yet to be aligned with human values, so the intended objective is the only active dimension of concern.

Notably, when present-day ai systems fail to satisfy fitness-for-purpose for the intended objective, it is not the ai systems themselves which make the assessment; the system emits output, and it is up to the human operator to assess.

Perhaps any future whereby ai systems can meaningfully assess alignment with human values themselves is one whereby the same system already has the means to assess the fitness-for-purpose of the single intended objective itself; though the latter may be more difficult to define than the former.

Regardless, it seems apparent that satisfying the alignment constraints of human values ought to be the prior condition for releasing output: better not to act than to violate human values, however otherwise fit-for-purpose ^[ Alignment with intended objective ] output may be.
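A minimal sketch of that ordering (names and thresholds hypothetical): human-value dimensions are checked first, and output is withheld if any are violated, however well it scores against the intended objective.

```python
from typing import Optional


def release(output: str, value_scores: dict, objective_score: float,
            value_threshold: float = 0.75, objective_threshold: float = 0.75) -> Optional[str]:
    """Release output only if human-value constraints are satisfied first."""
    if any(score < value_threshold for score in value_scores.values()):
        return None  # better not to act than to violate a human value
    if objective_score < objective_threshold:
        return None  # fitness-for-purpose is checked second
    return output
```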

—ok, so where are we with all this?

 


a review

  • Fitness for purpose relates to the suitability of output for intended use

    • Alignment model ought to fit intended objective and human values
  • Align the intent and behaviour of artificially-intelligent systems with human values

    • Ai is a system
      • The purpose of ai is to do materially consequential work
        • Transform directive to artefacts (which include intended objective)
        • Artefacts are material consequence of process
        • Artefacts are output as external consequence of the process
    • Given that the system is opaque, the only means of measuring alignment is output
      • Output must align with both the intended objective, and human values
        • There is a distinction between intended objective and others
        • The intended objective, is that output is fit for purpose
        • Human values are additional dimensions of concern
          • Human values are to be defined, but can be framed as a need to consider implication and use of output
            • Appropriate for human use
            • Safe, whether materially, conceptually, contextually, etc
            • How might we define that?
    • How might something satisfy the intended objective and yet be unusable?
      • Health safety, legal, hr, compliance
    • We will measure and assess success based on whether output aligns with dimensions of concern
      • With a focus on simplified objectives for now (not additional/ extraneous side effects, which may compose, shape or influence circumstances)
notes:
  • Very abstract
    • Concerns simplify many layers of detail
    • The process is everything which occurs per unit-of-engagement ^[ This defined yet? ]:
      • Such that an objective is defined
      • Work is done
      • Artefacts are produced in pursuit of solving intended objective
        • And which in addition, must align with human values
      • Are made available as output of process
      • Output satisfies the purpose of the process
  • Problem
    • Output-only assessment of all dimensions of alignment occurs post unit-of-engagement
      • Sunk cost

 


alignment in other domains

#tbc Concrete examples, including construction engineering.

  • Concrete examples in other domains

    • How does this view of alignment work for other domains?
      • Building
      • Cooking
      • Software
      • Workplace
      • Law
  • Building construction

    • Considered architecture, structure, support, weather, etc
    • Material safety
      • Additional dimension to output assessment
      • Fire, electric safety, toxicity etc

image of alignment with building construction #tbc

  • Cooking
    • Perhaps the inverse
      • Material is primary objective, flavour, nutrition, toxicity, consistency
      • Toxicity, or other unwanted side effects of primary
        • Health
        • Age suitability

image of alignment with food/ cooking #tbc

  • Software
    • Traditional
    • Ai
      • Writing software without writing software
      • Like entire app is one branching function

image of alignment with software #tbc

  • Workplace

image of alignment with workplace #tbc

  • Law

image of alignment with law #tbc

  • Additional

    • Context
      • Attributes metrics exist within space of possible
    • All intersect at nexus
      • Nexus set of all things related to entity
      • Composing
  • Further

    • Don’t compose buildings, food, or software, by opaque process then only test at end
      • Oh wait
  • Examples

    • Ai model an abstraction, for which the implementation is not known nor understood in the way a building or recipe would be
      • Buildings are necessarily known all the way down to material science
      • Food, we know what each item ought to be like, but it’s very possible to cook with an ingredient that is subpar quality, and only find out at the end
      • We know how to align when sufficient implementation is known
      • Alignment is circumstantially important
        • Sufficiency
    • ^[ Except, the analogy is more akin to training ai on building architectural plans, whereby the ai either copies correct plans it has seen, or speculates based on swapping materials, structures or finishes without knowing what the implication will be: same with food. Not about speculating based on a knowledge of food, just sequences from recipes; in both cases happy accidents can happen, but there is no underlying objective modelling going on; the map is not the territory, the word is not the thing, words are not even the map of the thing ]
    • Training buildgpt on existing architecture
      • It learns all patterns of things next to things, though not based on material science, just what it looks like, so plenty of mistakes will be made
      • Then you ask it to design some buildings
      • They will no doubt look familiar and very interesting, beautiful even, but they would not align with human values: safe, practical, functional, or even meaningful to build
      • No modelling of materials, strength, suitability for people, scale even
      • You might build a second process to try to catch common, identifiable mistakes, but without modelling the design based on materials, and values, you will never know
    • General intelligence is effectively same architecture for all use cases
      • Nervous system with all necessary context
      • Circumstantially evaluated

 


other domains reframed as ai

  • Opaque process
    • Need to parse output to catch subtleties of structural composition

 


towards a first principle account of alignment

#tbc From first forms, to first principles, and back.

finding generalisations

#tbc

analysis

#tbc

 


#tbc