UMAP
I found the talk by Barbara Tversky to be really calming. Something about the way she talks, from her tone, to her choice of words, to her warmth and candidness. It's interesting how a lot of the stuff we've been reading / watching really is saying very similar things, each from its own point of view. This is something that Tversky mentions too — how we usually learn "vicariously", from other people's experiences and opinions, and how ideas and concepts kind of propagate down the line from a first-hand experience and so on.
Talking about spatial language made me think of the book "A Pattern Language" by Christopher Alexander (and a few others who contributed as well). They were a group of interdisciplinary architects who took it upon themselves to detail the many patterns they observed in the way we humans have structured our lives, from the way cities are built, to the positioning of windows in every room of a house, to the very materiality of the bricks that make our pavements. For each observable pattern, aside from outlining it, they identify things that are problematic and then suggest solutions. Some solutions have greater scientific backing than others (each pattern is annotated accordingly). Anyway, a lot of the patterns and solutions really come down to our physicality — one example:
They later go on to explain how, essentially, the bigger the distance between an individual and the person or institution holding power over them, the worse it is for society. This is not just physical distance but also a mental one — if the mechanism that dictates so much of our lives is opaque, then it is beyond our reach and we cannot meaningfully participate in such a society.
In Tversky's video, she mentions how classroom arrangement affects participation. She explains how, on Zoom, almost counter-intuitively, many students seemed to participate more. She suggested this was because they could finally see each other equally — everyone's faces were equally distributed on the screen. The screen actually allowed everyone to face each other at the same time — something that is pretty much impossible to do in the physical realm with a group of 20 students. The screen, compressing the physical dimensions into a 2D display, loses things like body language and minute subtext, but what we gain is the possibility of meaningfully interfacing (literally) with larger groups of people in a way we just could not before. A Pattern Language was published in the '70s. I wonder if and how the authors' observations would have changed since. I wonder how our brains might be slowly changing as a result of that too. What do we lose and what do we gain? (We probably don't have to worry as much about something hunting us down anymore, but more about being able to focus effectively for long periods of time. Actually seeing a drastic change is going to take some time, though...)
It's quite nice listening to a Stanford professor talk so fondly about design and layout and typography. Science-y people usually don't pay much attention to that stuff, or think it's just unnecessary fluff. I'm a designer: I went through 4 years of design school, and this April I'll have been a freelance designer for 10 years. I can try to explain some rules of thumb I've learned, but in reality most of the work is just intuitive. The rules can back my decisions when I'm unsure, but I think it really comes down to a combination of intuition and "muscle memory". The same way we make gestures that are more in tune with how we feel than with how we talk about our feelings, I can't always explain why I make certain design choices; I just aim toward creating a certain feeling / atmosphere and then work and gradually tweak things to achieve it.
The level of specificity one can achieve through design choices is really astounding to me. By employing a certain layout, paragraph indentation, a certain font, letter spacing, etc., one can create an incredibly intricate experience that lies on top of the content and maybe even acts as an emotional context layer. Most viewers won't be able to say anything about these aspects, but they will feel them nonetheless, which is exactly what "good design" is (at least to me).
At some point in the video they talk about how some spatial arrays of cells in the brain relate to physical spaces, which immediately sounded similar to embedding models. It might be interesting to somehow map this out — like taking a stroll through the inter-dimensional neighborhood. I imagine starting from one point, like New York, and traveling all the way to Chicago — how closely correlated are the things in between in the real world and in the model? Could we draw out maps and interesting routes that emerge? Will we discover that the way from A to C unpredictably goes through M? Will it make sense to us? And similarly to how Zoom lets us interface with many people at once, embedding models let us interface with where these "concepts" are located — not just things that make sense in our physical world, but also things that don't. My only objection is that I'm not entirely buying that any of these models are representative of our collective consciousness. So I'm hesitant to accept at face value that the output we get reveals anything actually significant (deep? inherent?) about ourselves beyond our biases and tendencies and the process through which we trained and built the models. But I'm more than happy to be convinced otherwise...
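For what it's worth, here's a minimal sketch of what that stroll could look like. I'm using the sentence-transformers library here purely because it's easy to run; the model name and the tiny vocabulary list are placeholder choices of mine, not anything from the project:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# A made-up vocabulary standing in for whatever corpus you'd actually search.
vocab = ["New York", "Chicago", "Boston", "Detroit", "Cleveland",
         "Philadelphia", "Pittsburgh", "skyscrapers", "jazz", "a big lake"]
vocab_emb = model.encode(vocab, normalize_embeddings=True)

start = model.encode("New York", normalize_embeddings=True)
end = model.encode("Chicago", normalize_embeddings=True)

# Walk the straight line between the two embeddings; at each step,
# report which vocabulary entry the current point is closest to.
for t in np.linspace(0.0, 1.0, 6):
    point = (1 - t) * start + t * end
    point = point / np.linalg.norm(point)   # back onto the unit sphere
    sims = vocab_emb @ point                # cosine similarity (all normalized)
    print(f"t={t:.1f} -> {vocab[int(np.argmax(sims))]}")
```

With a vocabulary this tiny the route is trivial, of course; the interesting version would use a real corpus, where the question of whether the way from A to C goes through M actually has an answer.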
Edited after the making part: OK, I got some insight into how things are actually positioned in the model, and it does feel somewhat revealing. It makes complete sense, actually, that things we consider opposites end up pretty close to each other in that space.
I went deep this entire weekend with embeddings, as per Dan's recommendation (didn't get to try UMAP yet though! I had to stop so I could write this post) — trying to convert my Red Line project to work with embeddings rather than with an LLM. I didn't like just requesting scores from an LLM without any points of reference, and came to the conclusion that I have two options:
I started with the second option. It took a while to make my existing code work with the CLIP model, but once it was working, that's when the real trouble began... And by trouble I mean banging-your-head-against-the-table kind of trouble...
I was comparing each news article's embedding to both the best-case and the worst-case embeddings, calculating where it falls between them, and giving it a normalized score accordingly. But the numbers didn't feel right — they were all kind of the same. After reviewing, and adding very detailed logging courtesy of Copilot, I realized what was happening: my supposed extreme cases were in fact very close to each other! I've also come to realize that debugging embedding output is not easy — it's all relative, and not something you can clearly see.
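For reference, a minimal sketch of the kind of scoring I mean, assuming the open_clip library; the prompt strings are illustrative placeholders, not my actual ones:

```python
import torch
import open_clip

model, _, _ = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def embed(texts):
    # Note: CLIP's text encoder truncates input at 77 tokens, so a long
    # news article gets cut off well before the end.
    with torch.no_grad():
        feats = model.encode_text(tokenizer(texts))
    return feats / feats.norm(dim=-1, keepdim=True)

worst, best = embed(["placeholder worst-case prompt",
                     "placeholder best-case prompt"])

# Sanity check: if the anchors themselves are nearly identical,
# every article will land at roughly the same score.
print("anchor cosine similarity:", (worst @ best).item())

article = embed(["placeholder article text"])[0]

# Project the article onto the worst->best axis and clamp to [0, 1].
axis = best - worst
t = torch.dot(article - worst, axis) / torch.dot(axis, axis)
score = float(t.clamp(0.0, 1.0))
print("normalized score:", score)
```

That anchor-similarity print is the check that would have saved the head-banging: if it comes out close to 1, the worst-to-best axis is tiny and every article lands at roughly the same score.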
This is what Copilot had to say:
I tried changing up the prompts — I realized CLIP was trained on image-text pairs, so I changed all my prompts to be more descriptive, like image captions. No improvement. I went as far as listening to Copilot (which, by the way, kept declining my questions because I was dealing with "sensitive materials" or some bs), which desperately suggested just using generic descriptions of light and darkness.
When I finally caved, just to try it out, it did sort of work, but any attempt to slowly bring it back to something remotely related to what I'm trying to do immediately resulted in clusters that are too close together.
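A quick way to watch that drift happen, reusing the embed() helper from the sketch above (the prompt pairs are placeholders again):

```python
# Compare candidate anchor pairs: the lower the cosine similarity,
# the more usable the worst->best axis between them.
pairs = [
    ("a dark photo", "a bright photo"),                    # generic light/darkness pair
    ("placeholder worst-case", "placeholder best-case"),   # domain-specific pair
]
for worst_text, best_text in pairs:
    worst, best = embed([worst_text, best_text])
    print(f"{worst_text!r} vs {best_text!r}: cos = {(worst @ best).item():.3f}")
```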
For the integrity of my project, I cannot use unrelated prompts like these as points of reference. So this all sort of gloriously failed. Sorry I didn't get to try UMAP or do something special.
As for this project, the next things I'm going to try are:
The current code is available on GitHub.