We have seen it often, with almost every known social network and platform built for self-expression - at some point it reaches a milestone that turns the platform into the opposite of its original vision.
"Don't be evil" becomes THE evil, the outlaw becomes the law enforcer, the independent form alliances and the neutral become ferocious fighters against neutrality. There are still islands of free speech whose borders are wide enough to be called unbiased, but the people behind them keep them going at serious personal risk and significant expense.
In this essay I'll try to analyse the history, trends and risks of self-expression platforms, and provide some food for thought for those who will attempt to build the next platform.
The most important aspect of this work is technology, the second is sociology and the third - economics. We'll touch on the law a bit, but the main premise will be that whatever you are doing is illegal somewhere, and if not - then the content certainly will be. And so we'll use technology to obey the law when possible, and to protect ourselves from it when it's not.
But first - let's look at existing platforms and their pitfalls. This will help us to list the known problems and caveats, and see how titans fell and why.
Current state of social platforms
How many people around the world were imprisoned because they posted something, or even liked it? How did Facebook become the second largest censorship machine in the world after Google? And why are people still using it, knowing what it is? Because they need a social platform and hope that there is safety in numbers. I don't like speaking about politics and I'm not going to push my views in this blog, but the main problem of social networks lies in the political area, so I'll have to turn on the Snowden mode for a bit.
From the user's point of view the caveats of modern platforms are numerous, but the first of them doesn't look like much of a problem until your right to speak is replaced with responsibility for what you said: Facebook (or almost any platform out there) is not anonymous. Moreover, it asks you to provide documents to use certain features. You have a choice between the carrot and the stick, and most people prefer the carrot, as they don't take the stick seriously. Facebook has widgets on most websites on the Internet, so it knows everything about you, probably even if you aren't using Facebook at all. The only other company with such penetration is Google. Even this page contains a Google script that allows me to get some information about you - where you are from, how you found this page, whether or not you stayed long enough to count as interested, and so on. This is a fraction of the information that Google gets about you, because they analyse you in the context of all the websites you visit, not just this one. That is, of course, if your browser is not blocking that script. And your browser isn't from Google, is it? It's not the Chromium-based Microsoft Edge either? I'm not saying it watches you, I'm talking about the penetration these two companies have achieved and the advantage it provides. To them.
Having this much information in the Information Era is worth more than gold. That's the reason why organisations that work with information - NSA, CIA and so on - became so powerful and important: the value of their data increased so much that it became the most precious thing to any government, provided it has the means to process and analyse that data. And while the USA and Russia have successful Big Data engines, the main providers of that data are social networks, including messengers and mail providers.
What Facebook is to the English segment of the Internet, Yandex, VK and Telegram are to the Russian one, and Baidu, Sogou and Shenma to the Chinese one. They store and process such vast amounts of data that it is extremely improbable that this data is not for sale, and by "sale" I mean either money or favours from those who may turn off the business of these companies in a particular segment of the Internet, or improve that business significantly enough to be worth the risk. If that "off switch" didn't exist, the company wouldn't have to comply, unless there were other means to force it into compliance.
Many countries have laws that require data providers to store the personal data of their citizens within the physical borders of the country. US, Brazil, EU, Russia, China - that's almost the whole Internet, and companies like Facebook have to comply and split the data between their data centres so that the personal information of an EU citizen doesn't leave the EU. Do they actually comply? No one knows, but one thing is for sure - requests to Facebook from the EU are served by their servers in the EU, and if the EU needs to access or delete that information, it's much easier to do so. Additional filters may be applied to analyse Facebook traffic, because the servers are local and local legislation allows network filters to be installed if needed. Facebook (or any other data provider) doesn't even have to hand your data to the authorities - chances are, the authorities have their own copy.
When Russia asked LinkedIn to comply, LinkedIn said "no" and was banned in Russia. Personally, I don't like them for other reasons, as you have seen in related posts from the past, and I deleted my account there a long time ago, but that's what I have in common with Russian citizens - we don't have LinkedIn accounts. Many of them don't even know about it, as such bans tend to erase the memory of a website, so no one cares.
When Russia asked Telegram to comply, Telegram said "no" too, though in this case it wasn't about storing data, it was about handing the private encryption keys to the government so that it could keep track of who is saying what. Before you raise your eyebrows - that's what Skype was doing for years, and other American messengers provide logs and video footage to the NSA. It's not a conspiracy theory but a fact, made known by Edward Snowden. That's just the way it is everywhere, and Telegram, at least publicly, decided to be different. So Telegram refused to comply and was banned. Sort of. The architecture of Telegram isn't as centralized as Facebook's, and blocking IP addresses didn't help. Users are still able to use Telegram in Russia, as there is no law that punishes people for using that messenger. We are not talking about the moral aspects of the government's and the service provider's decisions, as it might be equally bad for someone to read your messages and to provide a messaging service to terrorists. Instead, we are reviewing this as a technological problem that was solved by Telegram and wasn't solved by the government.
In this case the government didn't ask Google to stop providing the Android app in Russia. Google would comply, but it wouldn't have any significant effect - people would download and install APK packages directly onto their phones, or use an alternative store. This way, at least, the government knows whether you have this app at all, not that it provides much useful information.
Meanwhile Facebook is playing a game with the Russian government, promising to comply but not lifting a finger to build a data centre in Russia. Chances are they will be banned, together with their messenger, and people will use VPNs at first but will then migrate to local services, as happened in Ukraine, where some people migrated from banned Russian services to Facebook when local attempts to create a social network failed.
So, what do we get from this?
If your service is centralized, if you have significant dependency on infrastructure, those who control that infrastructure may use that control... OK, they WILL use that control to force you into compliance.
Because they need your data, which you store for some reason. But what if you didn't have to store much? What if all user data were stored on user premises, thereby complying with the formal rules of homeland storage and reducing your requirements for storage infrastructure? We'll get to it later on.
And where there is a crowd, there are bots. It's a plague of social networks: applications that act like human users and perform operations ranging from farming data to a whole range of offensive actions. They are owned by people with totally different backgrounds and agendas, but they have one thing in common - they are no more interested in the health and survival of your platform than viruses care about their host. Therefore the platform, at the level of the API, should include protection from them, at least from basic bots that consume traffic and muddy the water. It's a very delicate balance between being open, transparent and fair, and being safe.
It's technology and people that the platform should care about. We'll discuss each topic in detail here.
Technology consists of:
- Your servers
- Your domain names
- Your server software
- Your mobile apps and their distribution
- Search engines
- Advertisement
People are:
- Your employees
- Your users
- Compromised employees
- Rogue users
- Hackers
And there are organisations, most powerful being government, but also media companies and tech giants, sometimes acting as proxies for governments.
There are some basic technology risks worth mentioning, though the full list would take quite a few pages, and the solutions a book (fortunately, some books are already written, starting with the CISSP course):
- Your servers can be seized. You can host your websites in the cloud, but these clouds belong to large corporations, and you can easily be kicked off them for the slightest violation of contract, and your "free as in speech" platform will violate one or more items in a mile-long list of obligations that no one cares to read in full. You have web servers, mail servers and DNS servers, the most important being web. Though there are many data centres in the world that would allow you to place physical servers for colocation, there are only a few cloud providers, like Azure and AWS. Both will eventually show you the door, so your platform should be able to switch locations on short notice.
- Domain names used to be a big problem, because previously there were just a few domain zones - com, net, org and the national zones. National zones are obviously controlled by their respective governments, and .com may be seized by the USA. Fortunately, there are now hundreds of new zones, and so we can draw the rule - you should have more domain names in different zones than you need. And the platform should be able to switch between domains. You may have a central server that plays the role of a pointsman - redirecting requests to the correct website depending on location and purpose. This could help with other problems as well.
- Server software is the backend code of your platform. You need to control it completely and have audits performed - penetration tests, stability and performance tests. There should be no backdoors, even for you, and nothing should be left to be "finished later". Imagine that someone has a particular set of permissions - how much damage could they do? The point is that a popular free social platform will be targeted by skilled hackers, either to steal the data or to compromise it. This includes the API for third-party applications.
- If your mobile application is available in the Google or Apple stores, then these companies control your presence in the mobile market. I have witnessed quite a few great apps disappearing from the stores for no good reason, because Google or Apple decided so. More often than not, they were competing with the company's own offers. Coincidence, of course. Either way, the sudden disappearance of your app for a bogus reason is a likely scenario, so the new platform should be ready for this from day 1.
It is hardly a problem for Telegram, as its client is Open Source and there are two different apps for the same purpose. Technically, anyone can make their own client for Telegram, and I assume some have. Therefore it's reasonable to make Open Source a requirement. Note that we are talking about the front-end, not the back-end, so you don't have to make your whole platform open-source. The client app should use an API that exposes all or most functions of the platform to 3rd party developers. The official app may use its own version of the API, but it would be cheaper to make one stable API that serves all.
- Search engines belong to large tech companies, and they tend to censor the content they show to users. Your content is likely to be censored. Your platform shouldn't depend much on search engines, though it may take longer to grow in that case. Besides, how long will it take for search engines to begin censoring your platform? There will be a window, so make sure you don't start slow and steady.
- At some point advertisement networks, such as Google, will decide not to show adverts for your platform. That is - they may decide so. Hence don't rely on them beyond the beginning of the project's life.
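The "pointsman" from the domain names item above could be sketched roughly like this. It is only a minimal sketch under my own assumptions: the domain names, the region codes and the `pick_domain` / `redirect_url` helpers are all hypothetical, and the list of blocked domains is assumed to come from some health check that isn't shown here.

```python
from typing import Iterable, Optional

# Hypothetical mirrors in different domain zones, in order of preference.
MIRRORS = {
    "eu": ["example.eu", "example.org"],
    "ru": ["example.net", "example.io"],
}
# Hypothetical last-resort domains, tried for every region.
FALLBACK = ["example.org", "example.io"]


def pick_domain(region: str, blocked: Iterable[str]) -> Optional[str]:
    """Return the first mirror for the region that is not known to be blocked."""
    blocked_set = set(blocked)
    for domain in MIRRORS.get(region, []) + FALLBACK:
        if domain not in blocked_set:
            return domain
    return None  # every known domain is blocked or seized


def redirect_url(region: str, path: str, blocked: Iterable[str] = ()) -> Optional[str]:
    """Build the URL the pointsman would redirect this request to."""
    domain = pick_domain(region, blocked)
    return f"https://{domain}{path}" if domain else None
```

The point of keeping the mirror list as plain data is that it can be changed on short notice, which is exactly what the rule "more domain names than you need" is for.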
And speaking of people:
- Your employees will be targeted. By government and competitors.
- Your users will be targeted. By government and other users. (Do I use the word "government" too often? I wish I didn't have to).
- Some employees will be rogue from the start. Think government and competitors, or simple malevolence. I'm not talking about your developers, but rather administrators and moderators. Neutrality is a virtue that's hard to keep.
- The same goes for users. They will look for any possibility to bring your platform down.
- Hackers - the easiest kind of threat, they will attempt to hack into your system using technology or social engineering.
In other words, you need to protect everyone from everyone, including themselves. In order to work out the defence mechanism, we need to assess the threat. What do they want?
- Competitors want to bring you down. Think performance, stability and security. As well as, yes, politics.
- Government wants your data, the ability to get all information about a user, to censor users and user groups, to push its own agenda and to use your platform in information warfare.
- Rogue employees want the same as government, either for their own pleasure or to sell that service to others.
- Rogue users want to have a troller-coaster, bully other users, silence opponents and make your platform their own playground, and when it doesn't work - break what they can.
- Hackers want to dissect your code, find weak spots and exploit them to gain unauthorised access or data. Compared to the other groups, they are a lovely bunch.
The Media and The Data
Let's imagine that our media platform is akin to Facebook. It has text information, photos, some services that are based on user data and 3rd party data, and probably videos. The data can be separated into 3 groups - user private, user public and system. User private information is what the user has provided to us but what we aren't showing to anyone: address, financial data, list of private contacts, private messages. Everything that is not public and that belongs to this user, whether or not they can change or remove it. Public information is what everyone is supposed to see, or almost everyone - either way, more than just this user: public messages, public images, the public timeline. And system information is the user's password, access logs, and all kinds of traces and breadcrumbs, usually related to security and access.
Access to public information as a whole may have its risks, hence Twitter doesn't allow you to download a whole timeline. You can still do it by scraping the website, but not through the API. Personally, I don't think it's a big deal, but let's imagine it's a feature worth having, so we'll have it as an option and let the user turn it on or off. Would you like others to easily see your ancient goofs and have blasts from the past? Why not, someone would get them anyway.
Private information should be secure at all levels. Passwords, contact information, all of it. If there is private information that we don't need, we shouldn't have it. Some information is worth asking for twice instead of saving.
System data should never be exposed to the API or public services, and storage must be protected from unauthorised access. Support users shouldn't be able to get what they don't need. That's basics, right? Not for the known social networks - they have all had this problem at some stage.
How is the data acquired by rogue agents?
- Scraping for public information
- Using exploits in our software, and that includes sniffing
- Posing as owner of information - technologically or socially
- Getting physical access to database and media storage
This makes our server the vulnerable place. A typical large social media platform has a data centre with a layered structure of servers:
- Web farm, an array of web servers which serve the website's HTML and provide API access to browsers and external apps
- Database servers, which store information about users and references to storage blobs with the actual data
- Storage servers, which store the actual text, photo and perhaps video data for users
Access to the first layer allows an attacker to plant spy code and to read the access logs of the web server, if there is such a thing. Access logs may also be stored on a database server, but that is irrelevant in our case. What matters is that access logs are valuable to whoever would like to dig into a particular user, because they contain information about the user's actual physical location, and can usually tell where exactly they live, what their habits are (e.g. at 10am each Wednesday they are at a particular café) and so on. If we can, we should analyse logs for whatever information we need the day after they are written, then compress them and store them somewhere where they are automatically deleted after some time. There may be legislation that requires us to store that information for some time, but that legislation doesn't cover what should be in these logs. Hence we should ensure that access logs store only the required information, and irrelevant stuff is omitted.
But an even more serious problem of that layer is that it could be disabled, or used to disable your service, if misused or exploited. It has to be replaced with something that couldn't be:
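To make the idea of "only required information" a bit more concrete, here is a minimal sketch of trimming a log entry before it is stored. Everything in it is an assumption of mine rather than a rule: the list of required fields, the 90 day retention period, and the choice to coarsen the IP address to its network instead of dropping it altogether.

```python
import ipaddress
from datetime import datetime, timedelta, timezone

# Assumption: these are the only fields we are obliged to keep.
REQUIRED_FIELDS = ("timestamp", "user_id", "action")
RETENTION = timedelta(days=90)  # hypothetical retention period


def coarsen_ip(ip: str) -> str:
    """Zero the host part of an address so it no longer points at one household."""
    addr = ipaddress.ip_address(ip)
    prefix = 24 if addr.version == 4 else 48
    network = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
    return str(network.network_address)


def minimise(entry: dict) -> dict:
    """Keep only the required fields plus a coarsened IP address."""
    slim = {key: entry[key] for key in REQUIRED_FIELDS if key in entry}
    if "ip" in entry:
        slim["ip"] = coarsen_ip(entry["ip"])
    return slim


def expired(entry: dict, now: datetime) -> bool:
    """True when the entry is older than the retention period and can be deleted."""
    return now - entry["timestamp"] > RETENTION


raw = {
    "timestamp": datetime(2024, 1, 1, tzinfo=timezone.utc),
    "user_id": 7,
    "action": "login",
    "ip": "203.0.113.77",
    "user_agent": "SomeBrowser/1.0",
}
print(minimise(raw))
```

A log trimmed this way still answers "who did what and when", but it is far less useful to someone who wants to know which café you visit on Wednesdays.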
- seized, powered off or hijacked
- locked out of country by firewall
- used to harvest private information
And if we are talking about replacing it, how about we add some benefits, like:
- It should increase performance in places where Internet access is scarce
- It should serve popular media faster. Unpopular posters, like me, should be served too, always, but in zones where they are more popular they will be served faster
- Data must be encrypted better than SSL. End-to-end encryption is the way.
- There should be potential geographic and age limit for data. Not in form of warning, but enforced.
- Messages must have TTL, or "time to live", in them - you may want them to self-destruct after some time.
So, it's not a data centre anymore, or at least not such a huge one. And it's not a centralized environment. Not a star, but a mesh. It's quite clearly BitTorrent, where torrent chunks are encrypted and you only share what you are subscribed to, hence popular voices have more BitTorrent peers. BT can't be stopped by a firewall; it only shares public information using end-to-end encryption (more about it later); it increases performance because you are getting data from a few peers that are closer to you; and you won't get BT chunks that aren't meant for your geographic region, unless you are using a VPN, of course.
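The enforced limits from the wish list above (region, age and TTL) could be checked by a node before it serves a chunk. What follows is a sketch under my own assumptions, not a protocol: the `ChunkPolicy` structure and the `may_serve` function are hypothetical, and it assumes the node somehow already knows the reader's region and age.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ChunkPolicy:
    """Hypothetical metadata that travels with every chunk."""
    regions: Optional[FrozenSet[str]] = None  # None means no geographic limit
    min_age: int = 0                          # minimum reader age in years
    expires_at: Optional[datetime] = None     # TTL expressed as an absolute time


def may_serve(policy: ChunkPolicy, reader_region: str, reader_age: int,
              now: datetime) -> bool:
    """Decide whether a node should hand the chunk to this reader."""
    if policy.expires_at is not None and now >= policy.expires_at:
        return False  # the message has outlived its TTL
    if policy.regions is not None and reader_region not in policy.regions:
        return False  # enforced geographic limit, not a warning
    return reader_age >= policy.min_age


policy = ChunkPolicy(regions=frozenset({"EU"}), min_age=18,
                     expires_at=datetime(2030, 1, 1, tzinfo=timezone.utc))
now = datetime(2029, 6, 1, tzinfo=timezone.utc)
print(may_serve(policy, "EU", 21, now))
```

A check like this only works if nodes are honest, so in practice it would have to be combined with the encryption discussed next: a chunk you were never given the key for is useless even if a rogue node serves it.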
End-to-end encryption may be implemented by creating a key pair for each user, whose public key is used to encrypt the symmetric key that encrypts the actual data. So, if you are sharing a photo, it is split into BT chunks whose data is encrypted with a symmetric key (say, AES), and that key is then encrypted with your public key when you, as a reader, request it. Similar to PGP or SSL, really. If that photo is not shared with you, you physically can't read it, even if you have its BT chunks saved on your computer.
The question then is how to authorise requests to read data if the server isn't allowed to decrypt it. Only those holding the private key could provide it to a new reader, which is hardly an option. And if the server is allowed to access public information, then private information, like actually private messages, would have to have known recipients from the start. Using Facebook as an example, your public post would go out signed but not encrypted, so it couldn't be changed by the platform, and it would still be encrypted at the transport level. Hence no decryption would occur at the application level, only decoding of the BT chunk that was delivered over a secure channel. Private messages would be encrypted before distribution, and would have to pass through as few hops (or intermediate storage containers) as possible.
But what if a public message is of a nature that would get the user arrested in a particular location? In that case users should be able to turn on storage encryption, which would ensure that even public messages are stored in encrypted form. The transport level is secure anyway, so we are not expecting man-in-the-middle exploits at the Internet Service Provider level, but storage has to be secure and the user shouldn't have control over it. They should be able to create an encrypted container, but not to decrypt it.
Private messages sent from one country to another may be blocked by firewalls if we attempt to send them directly from the sender's node to the recipient's, hence we should employ BT proxy nodes that act like postmen, holding temporary distributed "mailboxes" with encrypted chunks that can only be served to a particular recipient. Each private message should contain a key that, once the message is received and decrypted, allows it to be deleted from all nodes; a simple GUID would do.
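A "postman" node with such a mailbox might look roughly like the sketch below. The `Mailbox` class and its methods are hypothetical, recipient authentication is not shown at all, and storing only a hash of the GUID on the node (so that the node itself cannot use it to delete anything early) is my own assumption on top of the scheme described above.

```python
import hashlib
import uuid
from typing import Dict, List, Tuple


def token_digest(token: str) -> str:
    """Hash of the delete token; the node stores this, never the token itself."""
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


class Mailbox:
    """A hypothetical proxy node holding encrypted chunks for known recipients."""

    def __init__(self) -> None:
        # digest of delete token -> (recipient id, encrypted chunks)
        self._messages: Dict[str, Tuple[str, List[bytes]]] = {}

    def deposit(self, recipient_id: str, digest: str, chunks: List[bytes]) -> None:
        """Store chunks that were already encrypted by the sender."""
        self._messages[digest] = (recipient_id, list(chunks))

    def collect(self, recipient_id: str) -> List[bytes]:
        """Serve chunks only to the recipient they are addressed to."""
        found: List[bytes] = []
        for owner, chunks in self._messages.values():
            if owner == recipient_id:
                found.extend(chunks)
        return found

    def delete(self, token: str) -> bool:
        """Remove the message if the presented token matches a stored digest."""
        return self._messages.pop(token_digest(token), None) is not None


# The sender picks a GUID and would embed it in the message before encrypting it.
delete_token = str(uuid.uuid4())
box = Mailbox()
box.deposit("bob", token_digest(delete_token), [b"encrypted-1", b"encrypted-2"])
print(len(box.collect("bob")), box.delete(delete_token))
```

After decrypting the message the recipient learns the GUID and can present it to every node that holds a copy, which is all the "delete from all nodes" step needs.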
Speaking of nodes. As I see it, the node acts like a web server for the local machine or home network, akin to a proxy, except that it combines a BT node and a web server. This would eliminate excessive traffic to central nodes, because all requests would go to your computer or at least to one in close vicinity. The BT node would mainly store chunks from users you are subscribed to, plus what you would like to share with the network. E.g. you may choose to devote 5 GB of your storage and 500 MB of internet traffic per day, and skip any video media; then, if those you follow consume only a fraction of that quota, the rest could be filled with chunks that are interesting to your neighbours.
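The storage part of that quota could be planned along these lines. Again, this is a sketch with made-up names (`Chunk`, `plan_storage`), it only models the storage limit and not the daily traffic limit, and it assumes the node already has a list of candidate chunks to choose from.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    author: str
    size: int            # bytes
    is_video: bool = False


def plan_storage(chunks: Iterable[Chunk], subscriptions: Set[str],
                 quota: int, skip_video: bool = True) -> List[Chunk]:
    """Pick chunks to keep: subscribed authors first, neighbours' interests after."""
    candidates = [c for c in chunks if not (skip_video and c.is_video)]
    # False sorts before True, so subscribed authors come first; the sort is
    # stable, so the original order is preserved within each group.
    candidates.sort(key=lambda c: c.author not in subscriptions)
    kept: List[Chunk] = []
    used = 0
    for chunk in candidates:
        if used + chunk.size <= quota:
            kept.append(chunk)
            used += chunk.size
    return kept


GB = 1024 ** 3
available = [
    Chunk("a", "stranger", 2 * GB),
    Chunk("b", "friend", 2 * GB),
    Chunk("c", "friend", 1 * GB, is_video=True),
    Chunk("d", "stranger", 2 * GB),
]
print([c.chunk_id for c in plan_storage(available, {"friend"}, quota=5 * GB)])
```

With a 5 GB quota the node keeps its friend's chunk first and then fills what is left with one chunk for the neighbours; the video is skipped and the last chunk no longer fits.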
Such a node wouldn't require SSL, unless we want protection against local sniffers, including those on the LAN. In that case, an optional self-signed SSL mechanism would be required.
As a result, private information could be stored in encrypted form on the user's node, with encrypted backups on central nodes. The service owners wouldn't be able to retrieve and use it unless a request is sent to the user's node and accepted by the user. Obviously, some user-related information never leaves the central servers, and an encrypted backup of that information could potentially be stored on the user's node. This would help us solve the problem of deleted user data - when removal of user data is enforced by the administration of the platform (with the help of government, because of broken rules, or simply because an administrator's account was hacked), a copy of it could be kept on the user's node, but marked as deleted.
How would the node operate, technically? It could be either an application or a JavaScript applet within the browser. There are a hundred ways to skin this cat, including an application on home network attached storage, a mobile phone or a special device based on Raspberry Pi / Arduino, but one thing is certain - it would require an investment of time and physical resources that most users don't have. Therefore we come to a scenario with two types of nodes - client and system - where the difference is in tenancy: personal nodes may serve a small group of users, while system nodes belong to the platform owners and their partners. Who may become such a partner and donate their computing resources to the platform, and why, is rather a business and political question, but such a mesh of nodes would allow "light" clients, like browsers and people in internet cafés, to participate in the platform without hassle.
To be continued ...