naming things

Naming things is often cited as one of the hardest problems in computer science. But the concept conflates two separate issues: identifying things, and finding things. Traditionally these have been mixed interchangeably in filing systems; when you create a file, you have to choose both a filename and a filepath. Combined, the two distinguish the file from all the others, and also serve to find that particular file among them.

With the advent of computer cryptography, we can now properly separate the two. The computer needs to identify each file in order to present the user interface (UI) for operating on it. But this identification only needs to be machine-readable. The UI facilitates all operations from the user's perspective. Think about a file browser in an operating system: you click the icon of the file, and you need not know its internal ID in the filesystem hierarchy in order to operate on it. It can be opened, altered, dragged and dropped, and it doesn't matter what it's named.

The name, then, is merely the method by which the human operator finds the file. This is why names have had to be human-readable. But with the proliferation of search facilities and the steadily increasing performance of new machines, computers can now sift through the data for us, and again present a UI for browsing through it. As of writing, these UIs are still in their infancy. Their life began as hierarchies, then tag systems, and only now are we seeing machine learning used to analyze and categorize content automatically.

identifying things

Since human readability is no longer a concern, we only need to ensure that each file has a globally unique identifier among all computers on all worlds. Computers are finite machines, however, so guaranteeing this is mathematically impossible. We can approximate it, though.

Content-based addressing has the right core idea in its cryptographic hash functions: randomness. But naming files by their contents does not suit every application. Consider a log file, for example: its name would change every time a program writes to it. We don't want that.
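To make the log file problem concrete, here is a minimal sketch using SHA-256 from Python's standard library. The function name `content_address` and the log lines are made up for illustration; real content-addressed systems differ in the details.

```python
import hashlib


def content_address(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()


log = b"service started\n"
name_before = content_address(log)

# Appending a single line gives the file a completely different address.
log += b"request handled\n"
name_after = content_address(log)

print(name_before)
print(name_after)
print("name changed:", name_before != name_after)
```

Anything that referred to the log by its old address now points at a stale snapshot, which is exactly the behaviour we don't want for a file that keeps growing.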

The only sensible way to identify files is to name them randomly, using strong, cryptographic randomness. For identifiers of the same length, the probability of collision is comparable to content hashing, but we've freed the identifier from the file contents. Indeed, all addressable resources on this website have a unique 64-bit ID. Provided I've used a strong enough generator, it is extremely unlikely that anything else in the whole world has the same ID.
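As a rough sketch of what random naming can look like, the snippet below draws 64-bit IDs from Python's `secrets` module and estimates the collision chance with the standard birthday approximation. This is not the generator this website actually uses; `new_id` and `collision_probability` are hypothetical names chosen for the example.

```python
import math
import secrets


def new_id(bits: int = 64) -> str:
    """Return a random identifier of the given bit length as a hex string."""
    return secrets.token_hex(bits // 8)


def collision_probability(count: int, bits: int = 64) -> float:
    """Approximate the chance that at least two of `count` random IDs collide.

    Uses the birthday approximation 1 - exp(-n(n-1) / 2^(bits+1)).
    """
    exponent = -count * (count - 1) / 2 ** (bits + 1)
    return -math.expm1(exponent)


print(new_id())
print(collision_probability(10**6))   # a million IDs
print(collision_probability(2**32))   # about 4.3 billion IDs
```

With 64 bits, a collision among a million IDs is vanishingly unlikely, but the odds become substantial once the count approaches 2^32. A system expecting billions of resources would want longer IDs, which only means raising `bits`.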

(Some automated listing pages have predictable IDs; they're not intended to be addressed globally.)

finding things

While I believe the future of AI holds great promise, for now, humans are responsible for curating and organizing content. Hierarchical filing systems are paradoxical. That is, they produce situations where no option is correct. Suppose you have a photo of a cat. Does it go into photos/cats, or cats/photos? Well, both. But hierarchies can't do that. Some platforms provide a way around this issue with filesystem links, which make a file seem to appear in multiple locations when it actually doesn't. But that's a hack.

The real solution is tag-based categorization. And indeed, all content on this website is organized by tags. To increase flexibility, tags can be content themselves. Or content can be tags. Works both ways. A tag is just an identifiable, findable relation between files.
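Here is a toy sketch of the idea, assuming an in-memory store where every item, tags included, gets a random 64-bit ID. The `Store` class and its methods are hypothetical and only illustrate the relation; they are not how this website is implemented.

```python
import secrets
from collections import defaultdict


class Store:
    """Toy in-memory store in which tags are ordinary items."""

    def __init__(self) -> None:
        self.items = {}                 # item ID -> content
        self.tagged = defaultdict(set)  # tag ID -> IDs of items carrying that tag

    def add(self, content: str) -> str:
        """Store content under a fresh random 64-bit ID and return the ID."""
        item_id = secrets.token_hex(8)
        self.items[item_id] = content
        return item_id

    def tag(self, item_id: str, tag_id: str) -> None:
        """Relate an item to a tag; the tag is just another item ID."""
        self.tagged[tag_id].add(item_id)

    def find(self, *tag_ids: str) -> set:
        """Return the IDs of items that carry every given tag."""
        sets = [self.tagged.get(t, set()) for t in tag_ids]
        return set.intersection(*sets) if sets else set()


store = Store()
photos = store.add("photos")
cats = store.add("cats")
animals = store.add("animals")
picture = store.add("<image data>")

store.tag(picture, photos)
store.tag(picture, cats)
store.tag(cats, animals)  # a tag is content too, so it can carry tags itself

# The order of tags doesn't matter, unlike photos/cats versus cats/photos.
print(store.find(photos, cats) == store.find(cats, photos))
```

The cat photo is stored once and found under either tag, or both, with no links involved.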

We still have the problem of localization when naming things. The world has many languages. My work here concludes incomplete; I'm lazy, so I'll just write mostly in English, and rarely in Finnish when what's written won't translate well.
