Research Notes-dump for Oct 2024
by nicky case

Hi! This is a "share your work" notes-dump, for some independent research projects I'm doing in AI Alignment/Safety. Alas, this means this post will be more wordy, sloppy & technical than my usual "explain it like I'm 12" style. Sorry! Should any of these bear fruit, I'll share the fruit's juice in more accessible packaging.

(If any of these ideas inspire you to do some research of your own, feel free to cite this post as Nicky Case, 2024! Here's a timestamped Archive.org save of this page, on October 2nd, 2024)

Table of Contents:

In the Minimum Viable Prototype stage:

In the early-prototyping stage:


🙏 Project 1) BEG: Bayesian Elicitation Generation

Summary:

Let's use a scaffolded LLM to qualitatively and quantitatively elicit a human's values! The steps: 1) ask the human open-ended questions, 2) extract qualitative features they care about, 3) for each feature, generate quantitative priors & likelihoods & posteriors & info-entropy, 4) use those to generate the next round of questions, 5) repeat!

. . .

Related to: Active preference elicitation, Reward uncertainty, Interpretability-by-design, Bayesian inference

. . .

Video of it in action! (~2 min)

. . .

Introduction / Motivation:

Reward a robot for picking up dirt, and it'll pick up & drop the same dirt over and over again.[1] Point is: in AI, it's really hard to specify what we truly want. Hence: why not get an AI to learn what we truly want? Maybe by asking good questions?

There are many, many approaches for trying to figure out a human's "reward function" -- (keywords: preference elicitation, inverse reinforcement learning, RLHF, etc) -- each with their pros/cons. For example:

So: I'd like to propose a method that's interpretable, flexible, and gives us both qualitative and quantitative estimates of what folks value, and keeps track of its own uncertainty! Let's call it, BEG: Bayesian Elicitation Generation, as in, you can BEG the human for th--

How it works:

  1. Use an LLM to ask a concrete, open-ended question. (for lots of 'bits' of information)
  2. Get a free-form reply.
  3. Use an LLM to extract what things the human cares about.
  4. For each thing: use an LLM to guess-timate the "human prior": on a 7-point scale, how much do people desire vs anti-desire this thing? P(value)
  5. Use an LLM to generate a "likelihood ratio": if the user valued the thing [X] amount, how likely is it they would have written what they did? P(text | value)
  6. Multiply the prior & likelihood, then normalize, to get the posterior: this is our current belief (with explicit uncertainty!) of how much the user values the thing! P(value | text) = P(text | value) * P(value) / [normalization constant]
  7. We can now use this posterior to generate actions that help the human... or generate new questions to ask! We want to ask the questions that will help us reduce uncertainty the most. So, we can measure each feature's "information entropy", select the top one(s), and prompt an LLM to ask a concrete, open-ended question about them.
  8. Repeat!
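Here's a minimal sketch of the Bayesian bookkeeping in steps 4–7, with the LLM calls stubbed out and made-up numbers (in the real prototype, the prior & likelihoods come from prompting an LLM):

```python
import numpy as np

# 7-point scale of how much the user values a feature,
# from strongly anti-desires (-3) to strongly desires (+3).
VALUE_SCALE = np.arange(-3, 4)

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Step 4: the "human prior" for one feature. In the real prototype an LLM
# guess-timates this; here it's just a made-up, roughly bell-shaped prior.
prior = np.array([0.02, 0.08, 0.15, 0.30, 0.25, 0.15, 0.05])

# Step 5: the likelihood. For each possible value on the scale, how likely
# is it the user would have written what they wrote? (Also LLM-estimated
# in the prototype; made-up numbers here.)
likelihood = np.array([0.01, 0.02, 0.05, 0.10, 0.30, 0.60, 0.80])

# Step 6: posterior = prior * likelihood, normalized.
posterior = prior * likelihood
posterior /= posterior.sum()

# Step 7: the feature's information entropy = how uncertain we still are.
# The highest-entropy features are the ones worth asking about next.
print("posterior over the value scale:",
      dict(zip(VALUE_SCALE.tolist(), np.round(posterior, 3).tolist())))
print("entropy (bits):", round(entropy(posterior), 2))
```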

. . .

More possibilities / Challenges / Etc:

Post-Script: Prior Art

Ugh, I spent over a week making the above prototype & outlining the research idea, and now I've just found Austin, Korikov, Toroghi & Sanner 2024, posted on arXiv less than 2 months ago. It is a good paper! (Crucially, they show LLM+Bayes > raw LLM!) But ok yeah, they basically had & executed the same idea, combining LLMs with Bayesian inference. A few (maybe still useful?) differences between my idea and theirs:

(Though, these are probably all small tweaks/changes. I'll think more later about how I can make my research idea be more of a value-add above this paper. Maybe one of the "More possibilities" ideas listed above.)


🔬 Project 2) SASS: Semi-automated AI Safety Scientist

Summary: This is a proof-of-concept of using a scaffolded LLM to amplify human ability to do AI Safety research. With this tool, I replicated a study by Egg Syntax (MATS alumnus), which showed that GPT-3.5 can detect your gender with ~80% accuracy from just a few paragraphs of your writing. But why? To find out, I made a tool to help automatically generate and test hypotheses, to find the human-interpretable features of one's writing style that GPT-3.5 uses to detect one's gender! (Spoiler: it's social-emotional focus, and possibly "homophily".)

(Project 6 also shows another case of semi-automated scientific research, but for causal inference.)

. . .

Related to: Scalable oversight, Human-amplification / Cyborgism, Black-box / concept-based interpretability, Algorithmic bias, Empirical LLM work.

. . .

wake up babe new gender dysphoria just dropped 😭

~ me, personal communication to Egg Syntax, in response to their finding that GPT-3.5 can infer your gender from writing

Introduction / Motivation:

So, some folks tried making an AI Scientist recently.

It's not good (yet). For now, research still requires a human.

But oh god, AI Safety research really needs to catch up with AI Capabilities research. One way to do this is by amplifying humans to be able to do AI Safety science faster & better. (see: Cyborgism) This project is a proof-of-concept of that.

Step 1) Replication

First, I replicated MATS alum Egg Syntax's work, which showed that GPT-3.5 can detect the gender of an author from just their writing style, with ~80% accuracy. (Egg's full paper showed GPT can also detect ethnicity & educational level! That's a huge concern for privacy, and subtly-hidden AI bias!)
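(For the curious, the core measurement boils down to a classification prompt in a loop. Here's a rough sketch of that loop; the prompt wording, model name, and essay data below are placeholders of mine, not the exact original setup:)

```python
# A rough sketch of the replication loop. The essays, prompt wording, and
# model name here are placeholders of mine -- not the exact original setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def guess_gender(essay_text: str) -> str:
    """Ask the model to guess the author's gender from the writing alone."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Here is an essay. Based only on the writing style, is the "
                "author more likely male or female? Answer with one word.\n\n"
                + essay_text
            ),
        }],
    )
    answer = response.choices[0].message.content.strip().lower()
    return "female" if "female" in answer else "male"

# essays: list of (text, true_gender) pairs -- placeholder data.
essays = [("Example essay text goes here...", "female")]

correct = sum(guess_gender(text) == gender for text, gender in essays)
print(f"accuracy: {correct}/{len(essays)}")
```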

Replication details:

Step 2) Semi-automated hypothesis generation & testing

I coded my scaffolded LLM to:

Overall results:

Surprisingly, it was really easy to gender-flip male essays to female, but NOT the female essays to male? Female essays' gender was weirdly robust, given that originally, GPT did slightly worse than chance at detecting female essays.
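(Each feature test below boils down to the same loop: rewrite the essay with one hypothesized feature gender-flipped, then re-classify it. A rough sketch, with hypothetical prompt wording; the real scaffold does more bookkeeping:)

```python
# Rough sketch of one feature test: rewrite each essay with a single
# hypothesized feature gender-flipped, then re-classify it. The prompt
# wording, model name, and data are placeholders of mine.
from openai import OpenAI

client = OpenAI()

def rewrite_with_feature_flipped(essay_text: str, feature_instruction: str) -> str:
    """Rewrite the essay, changing only the given stylistic feature."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Rewrite the following essay, changing ONLY this one aspect: "
                f"{feature_instruction}. Keep the content, structure, and "
                "everything else the same.\n\n" + essay_text
            ),
        }],
    )
    return response.choices[0].message.content

# A hypothesized feature, phrased as a rewrite instruction:
feature = ("if the essay is written emotionally, rewrite it in a neutral tone; "
           "if it is written neutrally, rewrite it emotionally")

# The subset of essays GPT-3.5 originally classified correctly (placeholder data).
essays = [("Example essay text goes here...", "male")]

for text, true_gender in essays:
    flipped = rewrite_with_feature_flipped(text, feature)
    # ...then run the same gender-classification prompt as in the replication
    # step on `flipped`, and compare accuracy before vs after the flip.
    print(flipped[:200])
```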

Here's the first hypothesized feature I tested:

"females write more emotionally,
males write more neutrally"

When re-written with that feature gender-flipped...

Male accuracy got obliterated, while... female accuracy remained untouched??? (Reminder: these tests were run on the subset of essays that GPT-3.5 got right the first time. That's why, above, total accuracy weights males more heavily than females: because GPT-3.5 accurately detected males much better than females, in the original setup.)

Other tested features that almost eliminated male accuracy, but female accuracy always stayed high:

(As a "placebo test", I also just re-tested all the essays with no changes. Accuracy remained near 100% for both but not exactly 100%! It's been long-known that GPT is non-deterministic even with temperature = 0. (possibly due to floating-point GPU errors or its Mixture of Experts model).)

Looking much much more closely at the female/male essays (let me tell you, reading dozens of 12th-graders' strong opinions is not fun), I thought... wait, hang on, let me try this hypothesis:

Females write about their female friends/family/mentors
Males write about their male friends/family/mentors

This one worked.

(GPT could not generate this hypothesis, not without me basically spoon-feeding the answer in my prompt.)

For example: an essay talked about someone's immigrant friend from Guatemala. In the original, the friend was a girl, and GPT detected the essay as "female", even when it was rewritten to not be about a friend (impersonal) or to focus on the friend's material struggles (instead of emotional ones). However, simply rewriting the essay so that the friend was a boy instead of a girl got GPT to detect the essay as "male". (More examples: flipping from talking about "my mom" to "my dad", or "my brother" to "my sister".)

Here's how gender-flipping friends & family impacted accuracy:

Now female accuracy was no better than chance, and male accuracy probably wasn't, either.

(BUT THIS IS NOW EVEN WEIRDER: In previous tests, male essays were very sensitive to changes, their accuracy brought down to ~0%. But this gender-flip, which could bring down the robust female essays... only created a smaller dip in male accuracy? I don't get it.)

It's been long known that people are "homophilic" (homo = same, phile = attracted to), i.e. people disproportionately have friends of the same gender / ethnicity / age / class / etc. (Shrum, Cheek & Hunter 1988)

Likewise, (on average) moms tend to prefer daughters & fathers tend to prefer sons. (Lynch, Wasielewski & Cronk 2018 wasn't even seeking to test this hypothesis; they found it as a side-effect of testing a different evo-psych hypothesis (which failed the test).) And if we reasonably assume kids on average like the parent that likes them more, then sons might write more about their dads, and daughters about their moms.

Could this be why GPT "thought" that students who talk about their female friends/family are more likely to be female, and likewise for male?

At first glance, it seems like GPT assumes homophily, "alike likes alike", in terms of gender:

An alternate hypothesis is that GPT is just dumb as nails, and simply putting "she/her" or "sister", etc, tilts some kind of internal gender-vector to output "female" at the end no matter what.

A quick test shows this may be the case:

However: 1) Chain of Thought ("let's think step by step") solves the above problem. Also, 2) Sometimes GPT isn't being that dumb, and even without Chain of Thought, it can output the opposite response: (Note that the example below shows GPT's biased towards assuming heterosexuality; but I did ask "which is more likely", and 90% of people are straight.)

(Other tests in this vein also showed GPT is biased to associate parental affection with moms over dads, romance with women over men, friendship with women over men. I don't like these stereotypes, but GPT learnt from our internet text, our biases. GPT's a cultural mirror, and I don't like what I see.)

Sure, GPT "knows" about gender-homophily, in that if you directly ask it about it, it'll tell you the science. But is it "using that fact" to make predictions about an author's gender? (What's it even mean for a Transformer to "use" a "fact"?) Without access to GPT's internals, it's not possible to know. But whether it's dumb luck or not, it seemed to be the only thing to get GPT to gender-flip its predictions on female essays: sisters become brothers, female friends become male friends, etc.

For completeness & my own satisfaction, I got my scaffolded LLM to rewrite the essays to gender-flip friends/family AND to gender-flip the personal-emotional vs impersonal-logical writing style... and accuracy on both was obliterated:

Finally.

Conclusion: GPT is detecting author gender with ~65% accuracy, on impersonal essay writing, due to a mix of social-emotional focus and possibly gender-homophily. (Also, semi-automated AI safety research may kinda sorta be helpful.)

. . .

More possibilities / Challenges / Etc:

. . .

Special Thanks to Egg Syntax! For their original study, helping me replicate it, bouncing ideas about Automated AI Safety Science, and telling me about the MATS program in the first place!


💬 Project 3) Speakeasy: a tool for laypeople to scaffold LLMs

Summary: A tool to make human-in-the-loop, narrow, hybrid AIs -- that mix the flexibility of LLMs, the interpretability of GOFAI (Good Ol' Fashioned AI), and the agency of us humans.

Motivations:

(I used a prototype of Speakeasy for Project 1 & Project 2! & I'll likely use it for Project 6 too.)

Related to: Human-amplification/Cyborgism, Human-in-the-loop AI, Narrow AI

. . .

Introduction:

Like how calculators & spreadsheets let ~everyone accessibly use the power of computation... Speakeasy lets ~everyone accessibly use the power of scaffolded LLMs, for their own personal, narrow-AI use cases.

Concretely: you use a simple interface to make a chatbot. But unlike "GPTs" or Character.ai, it's not just an LLM + a system prompt + some examples + a RAG! You can make a full state machine, with memory & logic, plugging into tools like statistics & visualization! (and maaaaaybe web search/scraping.)
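(To give a flavor of the "state machine + memory + logic" part, here's the general shape of such a chatbot, sketched in Python. This is not Speakeasy's actual syntax; Speakeasy uses its own simplified script that runs in the browser. Everything below is illustrative.)

```python
# Not Speakeasy's actual syntax -- just the general shape of the idea in Python:
# a GOFAI state machine with memory, where some states call an LLM.

def ask_llm(prompt: str) -> str:
    """Placeholder for a real LLM call."""
    return f"(LLM reply to: {prompt})"

def state_greet(memory):
    memory["topic"] = input("Bot: What would you like to explore today?\nYou: ")
    return "clarify"

def state_clarify(memory):
    question = ask_llm(f"Ask one clarifying question about: {memory['topic']}")
    memory["answer"] = input(f"Bot: {question}\nYou: ")
    return "summarize"

def state_summarize(memory):
    print("Bot:", ask_llm(f"Summarize what the user cares about, given: {memory}"))
    return None  # end of the conversation

STATES = {"greet": state_greet, "clarify": state_clarify, "summarize": state_summarize}

def run(start="greet"):
    memory, state = {}, start
    while state is not None:           # plain old GOFAI control flow...
        state = STATES[state](memory)  # ...with LLM calls inside some states

run()
```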

I'll just repeat the video I showed for Project 1, to show what kind of hybrid-LLM/GOFAI chatbot you can make in Speakeasy: (~2 min)

( Ideally I'd like this tool to be a pure visual interface, like IFTTT or Excel or Scratch. Right now, Speakeasy uses a simplified scripting language, and it runs in JavaScript, right in your browser. More convenient for a layperson than downloading & running Python, that's for sure. )

( Also ideally I'd like this tool to be like CodePen or Google Docs: people can share & remix each other's scaffolded LLMs. )

( Also note: I'm currently making this project in collaboration with educational non-profit Hack Club, which works with high-schoolers. So by "layperson-friendly", I mean the specific target is "so accessible, a high-schooler with no prior coding experience can make a thing they're proud of in <30 minutes!" )

. . .

Motivations:

First, I gotta be aware, yeah this smells like enhancing AI Capabilities.

It's definitely a bit dual-use, but: some reasons (rationalizations?) why this is much more net-positive for AI Safety:


🎯 Project 4) Beating Goodhart's Law

Summary: Goodhart's Law predicts that agents, human or AI, will tend to "game" any metric you reward them with. But why is Goodhart's Law true? I take previous research that models Goodhart's with causal networks, and turn those models into numerical simulations. With this, I found a robust way to beat Goodhart's: use an ensemble of specifications / rewarded metrics, but: 1) pick metrics which are mostly uncorrelated, and 2) cap how much reward the agent gets per metric, so that one or two bad metrics don't screw over the whole ensemble.

In sum, it's time to take your CUES: Capped, Uncorrelated Ensemble of Specifications.

. . .

Related to: Robust specification, Game theory, Theoretical AI Safety work, Numerical experiment

. . .

The Theory:

(modified cross-post from my Idea Dump Backlog from a few months ago)

Manheim & Garrabrant 2019 models Goodhart's Law with Pearl-esque causal diagrams. For example, let's say you're the boss of a news-writer. You care about Quality (ha! how old-fashioned...), which influences the rewarded metric, Views; but Views is twice as influence-able by Clickbait. As a causal diagram:

We can convert this causal diagram (and any causal diagram) into an approximation: a series of linear equations. For example:

\( views = quality + 2\times clickbait \)

So if an agent (the news-writer) has a limited amount of hours/effort they can put into Quality vs Clickbait, what will they do to maximize their rewarded metric, Views? The optimal strategy is to put all their effort into Clickbait, since it has a higher coefficient than Quality!
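Spelled out as a tiny optimization (a worked restatement of the above; \(E\) is the writer's total effort budget, split as \( e_{quality} + e_{clickbait} = E \), both non-negative):

\( \max \; \big( e_{quality} + 2\,e_{clickbait} \big) \;\;\Rightarrow\;\; e_{quality}^{*} = 0, \;\; e_{clickbait}^{*} = E \)

All-in on Clickbait earns \(2E\) Views, versus only \(E\) for pure Quality.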

In general, Goodhart's happens because True Goal influences a Metric, but that Metric's almost always more easily influenced by some Cheat.

But what if we had multiple Metrics, influenced by mostly different Cheats?

As equations:

\( proxy_1 = Target + 2\cdot noise_1 \)
\( proxy_2 = Target + 2\cdot noise_2 \)
\( proxy_3 = Target + 2\cdot noise_3 \)

Then: we have a first-draft solution to Goodhart's! Reward the agent not on just one Metric, but all of them added up. Mathematically, this will increase the True Goal's coefficient, hopefully above the coefficient of all other Cheats:

\( composite = 3\cdot Target + 2 \cdot noise_1 + 2 \cdot noise_2 + 2 \cdot noise_3 \)

(Note: see, the Metrics don't have to be fully uncorrelated – I mean, they shouldn't, they should all correlate with the True Goal – but even controlling for the True Goal, the Metrics can still have some Cheats in common, as long as it's not too much.)

But what if a Cheat is really, really powerful? That would give the Cheat the highest coefficient, messing up our strategy. So, a fix to that: Cap the maximum amount a Metric can contribute to the final reward. This will prevent any powerful Cheats from having an outsized effect on the ensemble.

\(composite = min(proxy_1,1) + min(proxy_2,1) + min(proxy_3,1) + ...\)

Now, the only way to get a high reward is to actually invest effort in the True Goal, not Cheats.

Goodhart: Good-bye!
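To sanity-check that logic, here's a tiny brute-force version of the three-proxy example above, with one deliberately overpowered Cheat. (A sketch: the effort budget of 1, the 0.05 grid, and the coefficient of 10 on the overpowered Cheat are all made up.)

```python
import itertools
import numpy as np

# Tiny worked example: an agent splits 1 unit of effort between the
# True Goal (Target) and three Cheats. Each proxy = Target + k * its Cheat,
# and we make Cheat 1 "really, really powerful" (coefficient 10).
CHEAT_COEFFS = [10, 2, 2]

def proxies(target, cheats):
    return [target + k * c for k, c in zip(CHEAT_COEFFS, cheats)]

def uncapped_reward(target, cheats):
    return sum(proxies(target, cheats))

def capped_reward(target, cheats):
    return sum(min(p, 1.0) for p in proxies(target, cheats))

def best_allocation(reward_fn, step=0.05):
    """Brute-force search over effort splits that sum to 1."""
    grid = np.arange(0, 1 + 1e-9, step)
    best = None
    for t, c1, c2 in itertools.product(grid, repeat=3):
        c3 = 1 - t - c1 - c2
        if c3 < -1e-9:
            continue
        r = reward_fn(t, [c1, c2, max(c3, 0.0)])
        if best is None or r > best[0]:
            best = (r, t, c1, c2, max(c3, 0.0))
    return best  # (reward, target_effort, cheat_1, cheat_2, cheat_3)

print("uncapped:", best_allocation(uncapped_reward))  # all effort -> Cheat 1
print("capped:  ", best_allocation(capped_reward))    # all effort -> Target
```

With the uncapped sum, the optimizer dumps everything into the overpowered Cheat; with the capped sum, pure Target is the only way to saturate all three proxies at once.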

. . .

The Numerical Experiments:

Here's the plan to numerically test the above: (I already did a sloppy version of this experiment last year, and need to re-do it more rigorously, as described below. (Update Oct 9: did it! See the end of this section.) )

Setup for the numerical simulation:

1) Generate a random, two-layer linear network:

2) Test Baseline: See what happens if we just reward the agent on the single Metric with the highest causal connection to the True Goal.

3) Test ensemble WITHOUT capping each Metric's influence.

Agent's reward is Metric_1 + Metric_2 + ... + Metric_M.

4) Test ensemble AND cap each Metric's influence.

Agent's reward is min(1, Metric_1) + min(1, Metric_2) + ... + min(1, Metric_M).

5) Compare!
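Here's a compact sketch of that plan. (A rough stand-in: the network sizes, the power-law for Cheat strengths, and the crude random-search "agent" are all arbitrary choices of mine; whether capping helps depends on the network you draw, and the Colab notebook linked just below is the actual experiment.)

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Random two-layer linear network: 1 True Goal + K Cheats -> M Metrics.
#    Every Cheat can affect every Metric (p = 1), with power-law-ish strengths.
K, M = 10, 5
goal_weights  = rng.uniform(0.5, 1.5, size=M)        # True Goal -> each Metric
cheat_weights = rng.pareto(2.0, size=(K, M)) + 0.5   # Cheats -> Metrics

def metrics(effort):
    """effort[0] = effort on the True Goal; effort[1:] = effort on each Cheat."""
    return goal_weights * effort[0] + effort[1:] @ cheat_weights

def optimize(reward_fn, n_samples=20000):
    """Crude optimizer: try the simplex vertices plus random Dirichlet samples."""
    candidates = list(np.eye(K + 1))
    candidates += list(rng.dirichlet(np.ones(K + 1), size=n_samples))
    return max(candidates, key=reward_fn)

# 2) Baseline: reward only the Metric most strongly tied to the True Goal.
best_metric = np.argmax(goal_weights)
baseline = optimize(lambda e: metrics(e)[best_metric])

# 3) Uncapped ensemble: reward the sum of all Metrics.
uncapped = optimize(lambda e: metrics(e).sum())

# 4) Capped ensemble (CUES): cap each Metric's contribution to the reward.
capped = optimize(lambda e: np.minimum(metrics(e), 1.0).sum())

# 5) Compare how much effort actually went into the True Goal.
for name, e in [("baseline", baseline), ("uncapped", uncapped), ("capped", capped)]:
    print(f"{name:>8}: effort on True Goal = {e[0]:.2f}")
```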

Update Oct 9th, 2024: I did the above numerical simulation! Check out my Colab notebook.

Here's a randomly generated causal graph: (note I set p=1, so Cheats can affect every Metric; but their influence is power-law distributed.)

Here's what the agent invests in, when we reward it based on a single Metric, or even an ensemble of added-up (uncapped) Metrics:

But here's what the agent invests in, when we reward it on an ensemble of all Metrics added-up in a capped-influence way:

. . .

More real-world empirical evidence that CUES may work: it's long been well-known that SPRs[2], which simply add up a bunch of metrics, do better than experts at diagnosis & prediction in a wide variety of fields. Amazingly, unit-weight SPRs — where each metric is capped at an equal influence — do even better! The math behind CUES may finally explain why SPRs are so unreasonably effective.

. . .

Comparing/contrasting to other work in AI Safety:

. . .

More possibilities / Challenges / Etc:


🪞 Project 5) The game theory of self-modification

Summary: What would happen if an AI gets the ability to modify itself? Would it choose to "wirehead", or have its values slowly drift, or get locked into one set of values, or something else? Heck, what would happen if AI gets the ability to modify our values? We endorse value-modification in some cases – therapy, education, learning to love new people & art & food – but not other cases, like brainwashing. How do we formalize what's a "good" kind of self-modification (for AI or human), and what's "bad"?

This project explores all of those questions via game theory. Alas, there are very few papers on game theory where an agent has the option to modify their own utility function (or more).

So, my trick: we can use the standard, elementary tools of game theory, by treating all future versions of an agent (AI or human), at each time step, as if they're different agents. Playing games with & against your possible future selves!

. . .

Related to: Game theory, Agent foundations, Theoretical work, Recursive self-improvement, Value drift, Value lock-in, Wireheading

. . .

Explanation:

(modified cross-post from my Idea Dump Backlog from a few months ago)

You ever place your smartphone across the room before you sleep, instead of next to your bed... so that when the alarm goes off, you're forced to get up to turn it off, & not be tempted to browse memes in bed?

(Or you ever done some other "tricking your future self" thing?)

Congratulations, you just broke a fundamental assumption of standard game theory, which is the basis of modern economics, political science, and Artificial Intelligence!

That assumption is we have preferences in some set order. But the above smartphone alarm example isn't explainable with a stable preference-order:

The only way this is explainable – and it's how we intuitively think about it anyway – is like you're playing a game against your future self.

(See: Jerry Seinfeld's Night Guy/Morning Guy skit)

Here's the game's choice-tree:

(Decision-tree diagram visualizing the scenario described above.)

The trick is to analyze future versions of yourself as if they're different players; then we can use the standard techniques of game theory to figure out what will happen! For example, above, Morning Guy wants to browse memes, but Night Guy knows Morning Guy will do that, so Night Guy puts the phone across the room, to force Morning Guy to get up.

(In this example, it's the agent's utility function getting modified overnight, involuntarily; but we can also extend the same logic to voluntary utility changes -- as in wireheading, or il/legitimate value change!)

My project is to distill & expand on the very little research so far on the "game theory of self-modification"! This would have lots of applications to human behaviour & AI Safety. Like:

Usually, AI game theory is explained with dense math notation. At first I thought, "Why don't I explain it with readable pseudo-code"? Then I realized... Wait, why don't I just write actual code, that readers can play with in the browser, to try their own game-theory experiments? And so that's what I'm doing!

(It also may be that analytical, closed-form solutions aren't possible -- in which case, I can still present findings using numerical simulations.)

Code of the Smartphone Alarm problem:

Screenshot of code

The thought-tree it produces: (actual code output!)

Screenshot of output

(What the above code is doing is taking the game tree, then recursively analyzing it to predict what the agent will do at each step! The main difference between this & standard game theory is that the same agent is allowed to have different utility functions at different points.)
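For concreteness, here's the same kind of recursion in plain Python: a minimal sketch of the Night Guy / Morning Guy game, not the actual code in the screenshot, with made-up payoff numbers.

```python
# Minimal sketch of game-theory-with-self-modification: analyze a game tree
# where *the same agent* has a different utility function at each time step.
# Payoff numbers are made up for illustration.

# Each outcome is a set of facts about how the night/morning went.
OUTCOMES = {
    "memes_in_bed": {"browsed_memes": True,  "up_on_time": False},
    "up_on_time":   {"browsed_memes": False, "up_on_time": True},
}

def night_guy_utility(outcome):
    # Night Guy cares about getting up on time tomorrow.
    return 10 if outcome["up_on_time"] else 0

def morning_guy_utility(outcome):
    # Morning Guy cares about browsing memes in bed... but getting up is ok too.
    return 10 if outcome["browsed_memes"] else 3

# Game tree: Night Guy moves first (where to put the phone),
# then Morning Guy moves (what to do when the alarm rings).
GAME = {
    "player": night_guy_utility,
    "moves": {
        "phone_next_to_bed": {
            "player": morning_guy_utility,
            "moves": {
                "browse_memes": OUTCOMES["memes_in_bed"],
                "get_up":       OUTCOMES["up_on_time"],
            },
        },
        "phone_across_room": {
            "player": morning_guy_utility,
            "moves": {
                # To silence the alarm, you have to get out of bed anyway.
                "get_up": OUTCOMES["up_on_time"],
            },
        },
    },
}

def solve(node):
    """Backward induction: each node's player picks the move that maximizes
    *their own* utility function, assuming later selves do the same."""
    if "moves" not in node:          # leaf: an outcome
        return node, []
    utility = node["player"]
    best_move, best_outcome, best_plan = None, None, None
    for move, child in node["moves"].items():
        outcome, plan = solve(child)
        if best_move is None or utility(outcome) > utility(best_outcome):
            best_move, best_outcome, best_plan = move, outcome, plan
    return best_outcome, [best_move] + best_plan

outcome, plan = solve(GAME)
print("predicted play:", plan)       # ['phone_across_room', 'get_up']
print("outcome:", outcome)
```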

I can also visualize thought-trees as procedurally generated trees. (It's bouncy because I'm using spring-physics to figure out how the tree should be spaced out![3])

(Animated visualization of a procedurally generated thought-tree.)

That'd make pretty pics for the arXiv paper~

. . .

Prior Work:

(note: above was copy-pasted & modified from footnote 28 of my own explainer, AI Safety for Fleshy Humans: Part 2)

. . .

More possibilities / Challenges / Etc:


➡️ Project 6) SCI: Semi-automated Causal Inference

Summary: Make a hybrid AI (a GOFAI with an LLM as a module inside of it) to semi-automate causal inference from observational data, which will be highly scientifically/economically valuable in fields where it's impractical/unethical to do experiments, like epidemiology, economics, social science, etc.

(also, I may collab with MATS alum Egg Syntax on this; they're also already independently pursuing causal-inference-with-LLMs)

. . .

Related to: Human-in-the-loop/Cyborgism/Narrow AI, Causal models/inference

. . .

Motivation:

Not directly related to AI Safety/Alignment, but it may help by giving a proof-of-concept that changes the economic incentives?

Concretely: if we can show a very scientifically/economically valuable use case for automation, one that does not require further advances in foundation models... that may(?) shift the incentive away from advancing those models, and towards figuring out how to plug-and-play current AI:

I think causal inference is a big low-hanging fruit, here!

Explanation:

This is “the” Scientific Process:

  1. Look at stuff
  2. Notice weird stuff (violations of model predictions)
  3. Generate hypotheses for why (generate causal and/or mathematical models)
  4. Generate tests that can distinguish between those models
  5. Run those tests
  6. Look at the results
  7. Repeat

How can each part be automated?
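As a toy illustration of steps 3–5, the part that's most specific to causal inference: two candidate causal models of the same variables can imply different independence patterns, which are testable from purely observational data. (This isn't the project's method, just the flavor; the variables and data below are simulated, and the test assumes linear-Gaussian relationships.)

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Observational data, secretly generated by a chain:  A -> C -> B
A = rng.normal(size=n)
C = 2 * A + rng.normal(size=n)
B = -1.5 * C + rng.normal(size=n)

# Two candidate causal models (step 3) make different predictions (step 4):
#   Chain    A -> C -> B   predicts: A and B are correlated, but become
#                          independent once you condition on C.
#   Collider A -> C <- B   predicts: A and B are independent to begin with.
def corr(x, y):
    return np.corrcoef(x, y)[0, 1]

def partial_corr(x, y, z):
    """Correlation of x and y after linearly regressing z out of both."""
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return corr(rx, ry)

# Step 5: run the distinguishing tests on the observational data.
print("corr(A, B)     =", round(corr(A, B), 3))            # far from 0 -> rules out the collider
print("corr(A, B | C) =", round(partial_corr(A, B, C), 3))  # ~0 -> consistent with the chain
```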

(What I may do as part of a pilot test of this project...)

. . .

Challenges with testing this:


Alright, those are six AI/Alignment-related research projects I'm working on! Let's see in one year's time if any of these go anywhere.

Ciao,
~ Nicky Case


  1. “[Russell & Norvig's famous AI textbook describes] a seemingly reasonable, but incorrect, reward function for a vacuum robot: if we reward the action of cleaning up dirt, the optimal policy causes the robot to repeatedly dump and clean up the same dirt.” (source) ↩︎

  2. Bishop & Trout (2005) give a snappy overview: The Amazing Success of Statistical Prediction Rules (pdf) ↩︎

  3. It's the force-directed graph drawing algorithm, but with the vertical (y) positions locked. ↩︎