The next incarnation of the Internet will liberate both the content
and the CPU cycles from the actual hardware that performs storage 
and computation.  That is, both the data and the compute power
will be "virtualized" even further away from the hardware it has been
traditionally bound to.  The popular P2P file-trading systems 
already hint at what distributed storage might look like.  
Efforts such as ZeroInstall show that one might be able to 
run operating systems without actually having to 'install' them,
and 'Stateless Linux' shows how one might be able to access one's
"desktop" from any handily available keyboard and monitor. 
Distributed computing efforts such as SETI@Home hint at how 
CPU cycles can be extracted from the vast repository of idle 
computers attached to the net.
Keeping this article up-to-date is difficult. A lot has happened 
since the first draft of this article: one-fourth of "the next few
decades" has already gone by.  Some links below may be dead,
and some statements may appear quaint.
2001 Draft of this paper.
- Eternity Service
- Ross Anderson described the Eternity Service
    as a distributed filesystem that could survive damage to its storage
    infrastructure in analogy to how the Internet can survive 
    damage to its network.
    .... 
    The Eternity Service (prototype) and related concepts, such as 
    FreeNet, eMule, and GriPhiN, all provide ways of publishing
    information on distributed networks.  Each technology enables a 
    user's home computer to participate in a broader network
    to supply distributed storage.  If you think about it, this is 
    very, very different from the de facto Internet today, where web 
    pages are firmly rooted to the web servers that serve them up.
    
    If you are reading this web page near the turn of the century,
    chances are good that your browser fetched it off of the web 
    server I run at home.  Chances are also good that you got it off
    some caching proxy.   I know my ISP runs one.  The caching proxy 
    stores a copy of the web page ... for a while.  Not very long.
    If my server dies, chances are you won't see my page either.
    The caching proxy helps with bandwidth costs for my Internet
    provider, but doesn't help me much.
    ...
    But I know that my life
    would be a lot better if I didn't actually have to be the sysadmin for the
    server I run at home (I hate replacing broken disk drives, etc.).
    I would like it much better if I could just
    publish this page, period. Not worry about maintaining the server,
    about doing backups.  
    Just publish it on FreeNet or Publius. 
    If everyone's home computer was automatically a node/server on
    Publius, and if Publius required zero system administration,
    then I, as a writer/publisher, would be very happy.  I could just
    write these thoughts, and not worry about the computing
    infrastructure to make sure that you can read this.
    We conclude that the Eternity Service is an important component
    of Gelernter's Manifesto, which he sadly fails to name as an
    important, contributing technology.
     
    A crucial component of this idea is that of 'zero administration':
    the ultimate system must be so simple that any PC connected to the
    net could become a node, a part of the distributed storage infrastructure.
    The owner of a PC (e.g. my mom) should not have to give it much 
    thought: if it's hooked up to the Internet, it's a part of the system.
     
    Aspects:
     
    - What type of storage is it focused on: public, private, or
        commercial?  Each has different characteristics:
        I want my private storage to be accessible from
        anywhere, to endure even if the network/servers are damaged.
        But I want it to remain private, to stay in my possession.
        I want my public writings to be robust against network
        damage as well, and I also want them to be hard-to-censor.
        I might want to be able to engage in anonymous speech, 
        so that I could, for example, blast the sitting president
        (or the RIAA) without feeling I could get in trouble for it.
        The third, "commercial storage", would be a system that allowed
        me to access commercial content from anywhere, for a fee. 
        This is the system that the RIAA is failing to build,
        failing to support: a way to get at the music that I paid for,
        wherever I might be.
    
- Does it provide eternity?  Will a file get stored forever,
        or can it vanish?  There are two types of eternity: protection 
        against censorship, and protection against apathy.
        
        - Censorship Protection: content cannot be (easily) 
            removed by authorities objecting to the content, 
            e.g. political speech, state secrets, bomb-making plans.
        
- Apathy Protection: no one cares about the content at this
            time, and thus, it will slowly get purged from various
            caches and stores until the last copy disappears forever.
        
 Note that one can implement censorship protection, and still
        not get apathy protection: FreeNet works like this.   
        One can also implement a system that is censorable (so 
        that the sysadmins can explicitly purge spam), and still 
        get apathy protection: as long as a file is not actively
        hunted down and terminated, it will stick around forever.
        These are orthogonal concepts.
- Provides anonymity protections to the poster, if desired. 
        This would allow whistle-blowers and political rabble-rousers
        to remain anonymous without fear of intimidation/reprisal.
        This would also allow posters of spam, viruses and other
        criminal content to remain anonymous and beyond the reach 
        of the law.
    
- Allows censorship of content by editor or network operator.
        This would allow police authorities to remove child pornography
        or other objectionable content.  This would also allow copyright 
        holders or their agents to remove content.  This also allows
        the removal of old, out-of-date content and a general cleanup
        of e.g. spam or viruses that have clogged the system.
    
- Identifies the downloader.  This can potentially enable payment
        for downloads, or otherwise hook into a subscription service.
    
- Provides file download popularity statistics.  Of interest for
        a variety of legitimate and nefarious reasons.
    
- Appears to the operating system as a filesystem.  Thus, for
        example, I could put a binary into it, and then run that binary
        on my desktop.  ZeroInstall tries to do
        this.
    
- Versioning/Version Control (Gelernter's "Lifestreams").
        Can I get earlier versions of my file?  Is my file tagged
        with date meta-info?  Can I get an earlier draft of this 
        paper?
    
- Support for extended File Attributes; storage/serving of 
        file meta-data along with the file.  Can I mark up the file
        with info that is important to me, such as where I was 
        (geographically) when I last looked at it?  Can I categorize
        it in many different ways, e.g. if it's a hospital bill, 
        can I put it in my "hospital" folder, as well as my "finances"
        folder?  Note that folders do not need to be literally folders:
        they could in fact be fancy search queries: as long as the
        object responds to the query, it's a part of that folder.
        This is how a given file might be in many folders at once.
        (A small sketch of this kind of tagging follows this list.)
    
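    As a small illustration of the "folders as queries" and file-tagging
    aspects above, here is a minimal sketch using Linux extended attributes
    from Python (3.3+).  The "user.tags" attribute name and the file names
    are conventions invented for this example, not part of any existing
    system:
     
        import os

        def tag_file(path, *tags):
            # Merge the new tags into whatever tags the file already carries.
            existing = set()
            if "user.tags" in os.listxattr(path):
                existing = set(os.getxattr(path, "user.tags").decode().split(","))
            os.setxattr(path, "user.tags", ",".join(existing | set(tags)).encode())

        def folder(root, wanted_tag):
            # A "folder" is just the set of files whose tags satisfy a query.
            for dirpath, _dirs, files in os.walk(root):
                for name in files:
                    path = os.path.join(dirpath, name)
                    try:
                        tags = os.getxattr(path, "user.tags").decode().split(",")
                    except OSError:
                        continue            # file carries no tags at all
                    if wanted_tag in tags:
                        yield path

        tag_file("hospital-bill.pdf", "hospital", "finances")
        print(list(folder(".", "finances")))   # the same file shows up in both "folders"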
 
    See also:
     
 
- Search and Query
- Gelernter goes on at length about content-addressable memory,
    about how it should be possible to retrieve information from 
    the Eternity Service based on its content, and not based on its file
    name.  Search is important not only for finding a
    needle-in-the-haystack on the Internet, but also for finding
    the mislaid file on one's own computer.  In a different but still
    important sense, querying is used, for example, to report your
    bank balance, out of the sea of other transactions and investments 
    and accounts one may have.   The importance and centrality 
    of search and data sharing for general application development
    is further discussed in the 
    Why-QOF
    web page.
    
    What are the pieces that are needed, and available?
     
    - Natural language query parsers.  
        Gnome Storage
        is looking to provide natural language query for desktop
        applications.
    
- Distributed databases and distributed query.  DNS (the Domain Name
        System) is a distributed database for performing IP address lookup.
        Unfortunately, there is no straightforward generalization to
        arbitrary data.  LDAP (the Lightweight Directory Access Protocol)
        in theory can handle more generic data, but it remains difficult 
        to set up and use.  (A one-line illustration of DNS-as-database
        follows this list.)
    
- My personal entry on this chart is 
        QOF, the goal of which 
        is to make it trivial for programmers to work with persistent,
        globally-unique, versionable, queryable OOP-type 'objects'.
    
- Massively scalable search already has a proof-of-concept with
        Google.
        Curiously, though, the Google page rank is the result
        of a carefully hand-tuned and highly proprietary algorithm.
        This indicates that search by content alone is not enough;
        search-by-content has to be ranked to provide results that
        are meaningful to users.  And it seems that it's the ranking,
        and not the search, that is the hard part.
    
- Google focuses on free-text search.  If you want prices,
        you need Froogle (http://www.google.com/froogle).
        Google is useless for binaries: if you want binary content,
        you go to specialized sites:
        rpmfind.net to locate RPM's, 
        tucows to locate shareware, or mp3.com or scour.net to find audiovisual 
        content.  Each of these systems is appallingly poor at what it does:
        the RPM Spec file is used to build the rpmfind directories, but doesn't
        really contain adequate information.  
        The mp3 and shareware sites are essentially built by hand: that
        part of the world doesn't even have the concept of an LSM to classify 
        and describe content! (LSM is a machine-readable format used by 
        metalab.unc.edu to classify the content of packages in its software 
        repository.)
    
- Searchable meta-data, and automatic time and (geographic) 
        place tagging of a file when it's created, viewed, and edited. 
        If I created a file while I was drinking coffee in a
        coffee-house, I want it tagged, so that I can find it later
        when I go searching for the words "coffee house, 2 months ago".
        If I happened to create three versions of that file, 
        I'd like to be able to call up each: there should have been
        (semi-)automatic file versioning, a "continuous backup"
        of sorts. A 
        Wayback 
        Machine for my personal data.
    
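    As promised above, here is the one-line illustration of DNS as a
    distributed database: the caller supplies a key (a host name) and gets
    back a value (an IP address), without knowing anything about the
    hierarchy of name servers that actually answers the query.  The host
    name is just an example:
     
        import socket

        # One key lookup against a globally distributed database.
        print(socket.gethostbyname("www.example.com"))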
 
    Here are some additional references:
     
    - gPulp provides a framework
        for distributed searching.  Derived from Gnutella Next Generation.
        See the 
        Wired Article.  European consortium, standards body, costs 
        real money to join.  They seem to be working on specs, not
        implementations.  The main spec is a P2P 'data discovery
        protocol'.
    
 
 
- LSM's, Name Spaces and Self-Describing Objects
- There is another way to look at the problem of searching and finding
    an object based on its content, rather than its 'unique identifier'.
    Filenames/filepaths/URL's are essentially unique identifiers that
    locate an object.  Unfortunately, they only reference it, and maybe 
    provide only the slimmest of additional data.  For example, in Unix,
    the file system only provides the filename, owner, read/write 
    privileges, and modification/access times.  By looking at the file 
    suffix one can guess the mime-type, maybe: .txt .ps .doc .texi .html
    .exe and so on.  File 'magic' can also help guess at the content.
    URL's don't even provide that much, although the HTTP/1.1 specification 
    describes a number of optional header fields that provide similar 
    information.  See, for example, 
    Towards the Anti-Mac
    or The Anti-Mac 
    Interface for some discussion of this problem.
    
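    To see just how thin the suffix-based guessing described above really
    is, here is a sketch using only the Python standard library; the file
    names are invented for the example:
     
        import mimetypes

        # Guess the content type from the file suffix alone: essentially all
        # the "self-description" a filename carries.
        for name in ("paper.ps", "thesis.doc", "song.mp3", "mystery-data"):
            mime, _encoding = mimetypes.guess_type(name)
            print(name, "->", mime)    # "mystery-data" yields None: no clue at all
     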
    What is really needed is an infrastructure for more closely defining
    the content of a 'file' in both machine-readable and human-understandable 
    terms.  At the very least, there is the concept of mime-types.  Web-page
    designers can use the <meta> tags to define some additional
    info about an object.  With the growing popularity of XML, there is some
    hope that the XML DTD's can be used to understand the type of an object.
    There is the semi-forgotten, semi-ignored concept of 'object naming' 
    and 'object trading brokers' as defined by CORBA, which attempt to match 
    object requests to any object that might fill that request, rather 
    than to an individually named object.  Finally, there are sporadic attempts
    to classify content: LSM's used by metalab.unc.edu, RPM Spec files used by
    rufus.w3.org, deb's used by the Debian distribution.  MP3's have an
    extremely poor content description mechanism: one can store the name of the 
    artist, the title, the year and the genre.  But these are isolated examples
    with no unifying structure.
     
    Unfortunately, Gelernter is right: there is no all-encompassing object
    description framework or proposal in existence that can fill these needs.
    We need something more than a mime-type, and something less than a free-text 
    search engine, to help describe and locate an object.  The system must be
    simple enough to use everywhere:  one might desire to build it into the
    filesystem, in the same way that 'owner' and 'modification date' are file
    attributes.  It will have to become a part of the 'finder', such as the
    Apple Macintosh Finder or 
    Nautilus, the 
    Eazel finder.  It must be general enough
    to describe non-ASCII files, so that search engines (such as Google) could
    perform intelligent searches for binary content.  Today, Google cannot 
    classify nor return content based on LSM's, RPM's, deb's, or the MP3 
    artist/title/genre fields. 
     
 
- distributed.net and SETI@home
- distributed.net 
    runs a distributed RC5-64 cracking / Golomb ruler effort.  
    SETI@Home
    runs a distributed search of radio telescope data for interesting
    sources of extraterrestrial electromagnetic signals.  Both of these
    efforts are quite popular with the general public:  they have built
    specialized clients/screen-savers that have chewed through a 
    quadrillion trillion CPU cycles.   Anyone who is happy running 
    a distributed.net client, or a SETI@Home client, might be happy 
    running a generic client for performing massively parallel 
    computations.  Why limit ourselves to SETI and cypher cracking?
    Any problem that requires lots of CPU cycles to solve could,
    in theory, benefit from this kind of distributed computing.
    These high-CPU-usage problems need not be scientific 
    in nature.
    A good example of a non-science high-CPU-cycle application is
    the animation/special effects rendering needed for Hollywood
    movies.
    The problem may not even be commercial or require that
    many CPU cycles: distributed gaming servers,
    whether for role-playing games, shoot-em-ups, or civilization/war
    games, currently require dedicated servers with good bandwidth 
    connections, administered by knowledgeable sysadmins. 
    
    The gotcha is that there is currently no distributed computing
    client that is 'foolproof': providing generic services, 
    easy to install and operate, hard for a
    cracker/hacker to subvert.  There are no easy programming API's.
    (But this may be changing now; see BOINC, below.)
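     
    For concreteness, the core of such a generic client would be little more
    than the loop below.  The work-server URL, its /fetch and /report
    endpoints, and the compute() placeholder are entirely hypothetical; all
    of the hard parts (sandboxing, cheat detection, fault tolerance, a real
    programming API) live outside this sketch:
     
        import time
        import urllib.request

        WORK_SERVER = "http://work.example.org"     # hypothetical project server

        def compute(work_unit):
            # Placeholder for the actual science/rendering/search kernel.
            return work_unit[::-1]

        while True:
            try:
                work = urllib.request.urlopen(WORK_SERVER + "/fetch").read()
                result = compute(work)
                urllib.request.urlopen(WORK_SERVER + "/report", data=result)
            except OSError:
                time.sleep(60)      # server unreachable; idle politely, retry later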
     
    Other clients: 
     
- BOINC, the software underlying SETI@Home;
        see listing below.
    
- Xenoservers, reference below.
    
- Climate
        Dynamics at RAL.
    
- United Devices
        Purely commercial, totally proprietary.
    
- PVM & MPI are older technologies, optimized for cluster
        and parallel computing.  They are rather heavyweight,  
        demanding of bandwidth, and unable  to deal with clients
        come and go (unrealiable clients).
    
- Folding@Home
        is attempting to solve protein folding problems with pure-custom 
        software.
    
- Popular Power attempted to 
        pay for CPU cycles, as did 
        Process Tree Network.  Both defunct.
    
- Cosm attempted to define distributed 
        computing API's.  Defunct.
    
 
    
 
- ERights and Sandbox Applets
- Java still seems to be a technology waiting to fulfill its promise.
    However, it (and a number of other interpreters) does have one
    tantalizing concept built in: the sandbox, the chroot jail,
    the honeypot.  Run an unsafe program in the chrooted jail, 
    and we pretty much don't care what the program does, as long
    as we bothered to put some caps on its CPU and disk usage.
    Let it go berserk.  But unfortunately, the chroot jail is 
    a sysadmin concept that takes brains and effort to install. 
    It's not something that your average Red Hat or Debian
    install script sets up.  Hell, we have to chroot named and
    httpd and dnetc and so on by hand.  We are still a long ways
    off from being able to publish a storage and cpu-cycle playground
    on our personal computers that others could make use of as they
    wished.  It is not until these sorts of trust and 
    erights systems are set up
    that the kind of computing that Gelernter talks about is possible.
    
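    A rough sketch of the jail mechanics follows (Linux/Unix, must be run as
    root).  The jail directory and program path are invented for the example,
    and this is nowhere near a complete sandbox; that is exactly the point
    made above: doing it properly still takes real sysadmin effort.
     
        import os
        import resource

        JAIL = "/var/jail/untrusted"            # directory prepared beforehand

        pid = os.fork()
        if pid == 0:                            # child: lock itself in, then run
            os.chroot(JAIL)
            os.chdir("/")
            resource.setrlimit(resource.RLIMIT_CPU, (60, 60))          # cap CPU at 60 s
            resource.setrlimit(resource.RLIMIT_FSIZE, (10**7, 10**7))  # cap file size ~10 MB
            os.setuid(65534)                    # drop privileges to 'nobody'
            os.execv("/untrusted-program", ["/untrusted-program"])
        os.waitpid(pid, 0)
     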
    References:
     
 
- Streaming Media & Broadcast: Bandwidth Matters
- The most naive promise of 'digital convergence' is that soon, you'll
    watch TV on your computer. Or something like that.  There are
    a multitude of blockers for the roll-out of these kinds of services,
    and one of them is the bandwidth strain put on the broadcaster and the 
    intervening Internet backbone.   Given the way that people
    (or rather, operating systems and software applications) use the
    Internet today, if a thousand people want to listen to or view
    a streaming media broadcast, then the server must send out a
    thousand duplicate, identical streams.   This puts a huge burden
    on the server as well as nearby routers.
    
    The traditional proposed solution
    for this problem is MBONE, but MBONE has yet to see widespread 
    deployment.  (MBONE is the Internet 'multicast backbone' which
    allows a broadcast server to serve up one packet, and then have
    Internet routers make copies of the packet as it gets sent to
    receiving clients.  Clients receive packets by 'subscribing' 
    to 'channels'.)
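     
    To make the 'subscribing to channels' idea concrete, here is the classic
    multicast-receiver recipe in Python.  The group address and port are
    arbitrary examples; whether any packets ever arrive depends entirely on
    whether the local network actually routes multicast:
     
        import socket
        import struct

        GROUP, PORT = "239.1.2.3", 5004          # arbitrary example group and port

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", PORT))
        # Tell the local routers we have "subscribed" to this channel; they,
        # not the sender, take care of duplicating packets to each subscriber.
        membership = struct.pack("4s4s", socket.inet_aton(GROUP),
                                 socket.inet_aton("0.0.0.0"))
        sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, membership)

        while True:
            packet, sender = sock.recvfrom(65536)
            print(len(packet), "bytes from", sender)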
     
    There are two other approaches to distributing the bandwidth 
    load: ephemeral file server and distributed
    streaming.  Both leverage the idea that if 
    someone else is receiving the same data that you want, 
    then they can rebroadcast the data to you.  The difference 
    between these two is whether you get the data in order, and 
    possibly whether you keep a permanent copy of it.  
    In either case, you get your data in "chunks" or pieces,
    rather than as a whole.
    For streamed media, e.g. a radio broadcast,  it is assumed that 
    you are listening as it is broadcast, rather than waiting for 
    a download to "finish", and then listening to it.  For
    streamed media, the data must arrive in order, and must arrive in a
    timely manner.  I don't know of any examples at this time.
     
    An ephemeral file server, by contrast, can (and usually will) 
    deliver data out-of-order (sometimes called "scatter-gather"). 
    A good example might be 
    BitTorrent, which only shares the 
    file that you are currently downloading, instead of sharing all
    of your files.  It is "ephemeral" in the sense that sharing usually
    stops shortly after download completes.  BitTorrent explicitly
    delivers chunks of the data out of order: the goal is to
    make sure that everyone has something to share, rather than,
    e.g. everyone having the first half but not the second half of
    a file.  "Ephemeral" does not mean short-term: torrents can
    (and do) exist for months: they exist as long as a file is
    popular, and as long as at least one client is up on the net.
    Equally interesting are the things that BitTorrent doesn't do,
    or guarantee: for starters, there is no 'eternity':
    if there are no clients offering the file, it is effectively gone.
    BitTorrent keeps neither a master index of files offered,
    nor even a searchable index of offered torrents.  One must
    locate the torrent one wants in some other fashion: e.g. through
    web pages or traditional search engines.  In the same vein,
    it's not a file system: there is no hierarchy of files that are
    kept or can be browsed.  The goal of BitTorrent really is to
    balance the network load in a distributed fashion.
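     
    BitTorrent's real piece-selection logic is considerably more involved,
    but the "rarest first" idea behind the out-of-order delivery described
    above fits in a few lines; the function and variable names here are
    invented for the illustration:
     
        from collections import Counter

        def next_piece(my_pieces, peer_piece_sets):
            # Count how many known peers hold each chunk we still lack, and
            # fetch the rarest one, so no chunk becomes a bottleneck and every
            # peer ends up holding something the others still want.
            availability = Counter()
            for pieces in peer_piece_sets:
                availability.update(pieces - my_pieces)
            if not availability:
                return None                     # nothing new is on offer
            return min(availability, key=availability.get)

        print(next_piece({0, 1}, [{0, 1, 2}, {1, 2, 3}, {2, 3}]))   # prints 3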
     
    To summarize the technical points:
     
    - The search problem:  Can the user browse a list of available
        content?  Can the user search for particular content?
        (BitTorrent relies on web pages and web search engines to
        solve these problems.)
    
- The peer discovery problem: Once a particular bit of 
        content has been identified, how does a client discover
        the other clients that are ready to share?
        
        - BitTorrent and PDTP solve this problem by having a 
            tracker
            for each offered file.  Clients register with the tracker 
            and tell it what chunks of the file they already have; 
            the tracker responds with a list of clients that might
            have the chunks they don't yet have.  Clients keep the 
            tracker up-to-date as the download proceeds.  Conceptually,
            there is one tracker per offered file.  Note, however,
            that the tracker is vulnerable: if it goes down, new
            clients are shut out.  (A toy sketch of this tracker
            bookkeeping appears after this list.)
        
- Swarmcast uses a Forward-Error
            Correction (FEC) algorithm to create packets that occupy
            a data space that is orders of magnitude larger than the 
            offered file.  Thus, the receiver can reconstruct the
            whole file after having received only a very small
            portion of the total packets in the space.  The 
            use of FEC encoding eliminates the need for a 
            chunk tracker: all packets in the data space are 
            "guaranteed" to contain data that the client does not 
            yet have.  This works by encoding into a very large data 
            space: the probability that the client receives data
            that it already has is equal to the ratio of the 
            file size to the data space size, and this ratio can be 
            made arbitrarily small.  (It's kind of like a hologram;
            you need only some of it to reproduce the whole.)
            
 
            The downside to this approach is that it's CPU-intensive,
            and it can inflate the total number of bytes that need
            to be delivered by a fair amount.   The upside is that it 
            can roll encryption and encoding into one.
 
- The streaming problem. For streaming to work, data must
        be delivered in order.  (BitTorrent doesn't do that.)
    
- Bandwidth allocation/balancing between peers.  BitTorrent 
        tries to load-balance by using a tit-for-tat strategy:
        a client will only offer chunks to those clients that 
        are sending chunks to it.  For streaming media, this
        strategy clearly can't work: sharing must be transitive,
        not reciprocal.
    
- The 'dropped frames' problem: The viewer/receiver of a 
        real-time stream must be able to get data in a timely 
        manner, so that they can watch their show/movie without
        interruption.  The viewer is potentially willing to trade 
        disproportionate amounts of upload bandwidth in exchange
        for a guaranteed download bandwidth.  The receiver is 
        mostly interested in having multiple redundant streaming
        servers handy.
    
 
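    As mentioned above, here is a toy sketch of the tracker bookkeeping for a
    single offered file: clients announce the chunks they hold and ask who
    else has the chunks they still need.  A real BitTorrent tracker speaks an
    HTTP-based protocol with many more details; the class and method names
    here are invented for the illustration:
     
        class Tracker:
            def __init__(self):
                self.peers = {}                  # peer address -> set of chunks held

            def announce(self, peer, chunks_held, chunks_wanted):
                # Record what this peer has, and return the other peers that
                # hold at least one of the chunks it still wants.
                self.peers[peer] = set(chunks_held)
                return [other for other, held in self.peers.items()
                        if other != peer and held & set(chunks_wanted)]

        t = Tracker()
        t.announce("10.0.0.1:6881", {0, 1}, {2, 3})
        print(t.announce("10.0.0.2:6881", {2, 3}, {0, 1}))   # ['10.0.0.1:6881']
     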
    I am not yet aware of any generally available streaming-media 
    reflectors, other than those based on MBONE.  
     
    - Swarmcast, now defunct, may be
        unique in having been the first to use a scatter-gather
        type algorithm for delivering a file by chopping it up into
        chunks.  (Swarmcast predates BitTorrent).  GPL license.
    
- BitTorrent, described below,
        is an 'ephemeral fileserver', serving up files in a 
        distributed fashion for the few moments that they are
        popular and being actively downloaded by others.  
    
- PDTP is a distributed file system, 
        with hierarchical directories, but offers network 
        load balancing through distributed file-piece delivery.
    
 
 
- The Internet for the Rest of Us
- To understand the future, it is sometimes useful to look at the
    past. Remember UUCP? It used to tie the Unix world together, as
    did BITNET for the VAX's and Cray's, or the VM network for
    mainframes.  They were all obsoleted by the IP protocols of
    the Internet.  But for a long time, they lived side by side,
    even attached to the Internet through gateways.  
    The ideas that powered these
    networks were subsumed into, became a part of, the Internet:
    The King is Dead, Long Live the King!  The spread of the types
    of technologies that Gelernter talks about will be evolutionary, 
    not revolutionary.
    
    Similarly, remember 'The Computer for the Rest of Us'?  
    Well, before the web exploded, Marc Andreessen used to talk about
    'The Internet for the Rest of Us'.   Clearly, some GUI slapped
    on the Internet would make it far more palatable, as opposed to the
    'command-line' of telnet and ftp.  But a web browser is not just
    a pretty GUI slapped on telnet or ftp, and if it had been, the 
    WWW still wouldn't exist (what happened to 'gopher'? Simple:
    no pictures, no 'home pages').  The success of the WWW
    needed a new, simple, easy technology: HTTP and hyperlinks, to 
    make it go. The original HTTP and HTML  were dirt-simple, and that
    was half the power of the early Internet.  Without this simplicity 
    and ease of use, the net wouldn't have happened.
     
    What about 'the rest of us'?  It wasn't just technology that made 
    the Internet explode, it was what the technology could do.  It 
    allowed (almost) anyone to publish anything at a tiny fraction 
    of the cost of traditional print/radio/TV publishing.  It gave
    power to the people.  It was a fundamentally democratic movement
    that was inclusive, that allowed anyone to participate, not just
    the rich, the powerful, or the members of traditional media
    establishments.  In a bizarrely different
    way, it is these same forces that power music file trading:  
    even if the 
    music publishing industry hadn't fallen asleep at the wheel, it 
    is the democratization that drives file traders.  Rather than listening
    to what the music industry wants me to listen to, I can finally listen  
    to what I want to listen to.  At long last, I am able to match
    artist to the artists work, rather than listening to the radio and
    scratching my head 'gee I liked that song, but what the hell was 
    the name of the artist?'  Before Napster, I didn't know what 
    music CD to buy, even when I wanted to buy one.  I wasn't hip enough to 
    have friends who knew the names of the cool bands, the CD's that
    were worth buying.  Now, finally, I know the names of the bands 
    that I like.  Napster gave control back to the man in the street. 
     
    Similarly, the final distributed storage/computation infrastructure
    will have to address similar populist goals: it must be inclusive, 
    not exclusive.  Everyone must be able to participate.  It must 
    be for 'the rest of us'. 
     
 
- Commercialization
- Like the early days of the net, the work of volunteers drove the 
    phenomenon.  Only later did it become commercialized.  Unlike then,
    we currently have a Free Software community that is quite conscious
    of its own existence.  It's a more powerful force.  Once the
    basic infrastructure gets built, large companies will come
    to make use of and control that infrastructure.  But meanwhile,
    we, as engineers, can build it.
    
 
I guess the upshot of this little diatribe is that Gelernter talks 
about his changes in a revolutionary manner, leading us to believe
that the very concept of an operating system will have to be
re-invented.  He is wrong. The very concept of an operating system
*will* be reinvented, someday.  In the meanwhile, we have a perfectly
evolutionary path from here to there, based not only on present
technologies and concepts, but, furthermore, based on the principles 
of free software.