ShinTakezou's Blog: Beyond 140

Twitter limits each tweet to 140 characters, and the limit shrinks further when you add media content such as images. Now that Twitter is “consumed” through the web, there would be no serious reason not to raise that limit.

The magic number comes from the SMS limit (160), decreased by 20 characters to keep room for the user name. Jack Dorsey explains it in the interview “Twitter creator Jack Dorsey illuminates the site's founding document. Part I”:

It was really SMS that inspired the further direction — the particular constraint of 140 characters was kind of borrowed. You have a natural constraint with the couriers when you update your location or with IM when you update your status. But SMS allowed this other constraint, where most basic phones are limited to 160 characters before they split the messages. So in order to minimize the hassle and thinking around receiving a message, we wanted to make sure that we were not splitting any messages. So we took 20 characters for the user name, and left 140 for the content. That’s where it all came from.

If we want to write more than 140 characters, we can't (there are exceptions that don't concern common users). But we are no longer restricted to text-only tweets, so you can easily imagine tricks to make people read more than 140 characters in your tweets: you can attach an image with your long text “printed” on it.

Of course there are at least two disadvantages: ① you need to run your image editor of choice, write your text as an image (see the text tool in GIMP as an example), save the image, add it to your tweet… (a special Twitter client or an app could do this for you, of course); ② it's harder to index and search the textual content of an image, even if you can use the tweet's text to scatter keywords and tags, minimizing this disadvantage.

Moreover, a “textual image” surely needs more storage space than plain text. Wouldn't it be better for Twitter to allow, say, 256 characters? I would also add light markup features, e.g. italics and bold.

They wouldn't run short of space or bandwidth because of these changes. But for some reason Twitter's limits seem to have become a sort of core value.

Here I'm not going to inspect why, though I think it would be interesting to discuss whether and when textual shortness actually became part of the brand, whether it is a value¹, and why.

Instead, here I want to find a way to go beyond that limit.

140, twice

When I wrote characters, I really meant characters, not bytes/octets. Once, the web happily used Latin-1 as its default. Nowadays the “standard” encoding on the web (and elsewhere) is UTF-8, a multibyte encoding that can encode every character in the Universal Coded Character Set.

What does it mean? It means that a character you read in a tweet may take more than a single byte, but Twitter counts it as 1 anyway. (Which is the logical choice.)
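To make the difference concrete, here is a minimal Ruby sketch (Ruby because the client library used later is written in it). The sample string is just an illustration; the `\u00E8` escape stands for “è”.

```ruby
# Characters vs. bytes in a UTF-8 string.
s = "\u00E8 bello"   # "è bello"

chars = s.length     # counts characters (code points)
bytes = s.bytesize   # counts the bytes of the UTF-8 encoding

puts "#{chars} characters, #{bytes} bytes"   # prints "7 characters, 8 bytes"
```

The “è” alone accounts for the extra byte: it is one character but two bytes in UTF-8.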

How can we exploit this?

Let's take our UTF-8 encoded string, interpret its bytes as a stream of UCS-4 codes (4 bytes each), and re-encode that stream as UTF-8… We have tricked Twitter into thinking that each four-byte code is a single character. Hence we could pack into a tweet a text of circa 4×140 characters²!

No, wait…

There are problems, because the UCS has no code points beyond 0x10FFFF, and UTF-8 is currently constrained to this limit (see RFC 3629), even though the original proposal was not. Encoding the code point 0xF00D4A11 in UTF-8, for example, is not possible, and not only because that code point does not exist and thus wouldn't be valid…

Could we be happy with 3×140 characters, padding each group of three bytes with a zero byte to obtain again a UCS-4 code in the range 0x00000000 - 0x00FFFFFF? No, because code points beyond 0x10FFFF are not valid in any case…
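The limit can be probed from Ruby. This is only a sketch: it relies on `Integer#chr` raising `RangeError` for numbers that are not valid code points in the requested encoding, and the helper name `utf8_encodable?` is made up for illustration.

```ruby
# Can this number be encoded as a single UTF-8 character?
# Integer#chr raises RangeError when the number is not a valid code point.
def utf8_encodable?(code)
  code.chr(Encoding::UTF_8)
  true
rescue RangeError
  false
end

[0x00E8, 0x10FFFF, 0x110000, 0x00FFFFFF, 0xF00D4A11].each do |code|
  puts format("0x%08X -> %s", code, utf8_encodable?(code))
end
```

Only the first two values pass, so both the 4-byte and the 3-byte groupings produce numbers that cannot be turned into characters.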

So we must stick to 2×140 and interpret every pair of bytes as a UCS-2 code. Unfortunately not all the code points in the range 0x0000 - 0xFFFF are valid; nonetheless, in this case UTF-8 has no problem representing them mechanically (strict encoders, though, reject the surrogate range 0xD800 - 0xDFFF, which RFC 3629 forbids in UTF-8). Problems could arise when an application tries to interpret those codes.
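A minimal Ruby sketch of the 2×140 idea follows. It is not the code used for this post (the post uses iconv below); `pack_text` is a hypothetical name, and the little-endian byte order is an assumption that matches the tweet shown later, where “In” becomes U+6E49.

```ruby
# Reinterpret the bytes of a UTF-8 string as 16-bit units and re-encode
# each unit as one UTF-8 character.
def pack_text(text)
  bytes = text.b                          # binary copy of the string's bytes
  raise ArgumentError, "odd number of bytes" if bytes.bytesize.odd?

  units = bytes.unpack("v*")              # 16-bit little-endian units
  if units.any? { |u| (0xD800..0xDFFF).cover?(u) }
    raise ArgumentError, "a byte pair falls in the surrogate range"
  end
  units.pack("U*")                        # one UTF-8 character per unit
end

packed = pack_text("In I")
puts packed.length     # what a character counter would see
puts packed.bytesize   # what actually travels
```

Four source bytes become two “characters”, at the price of six bytes on the wire.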

If Twitter does not strip them away, we shouldn't worry about that.

Ok, let's start doing something for real (almost…).

Preparing the data

Quick with iconv. E.g.

echo "hello world" |iconv -f UCS-2 -t UTF-8 >data

Take care that the number of bytes (I said bytes, not characters) is even. Otherwise iconv will complain:

iconv: incomplete character or shift sequence at end of buffer
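In the command above the count is even only because echo appends a newline to the 11 bytes of “hello world”. A client would have to guarantee this itself; one possible approach, sketched here with a hypothetical helper name and an arbitrary choice of filler, is to pad with a trailing space.

```ruby
# Append one filler byte when the UTF-8 byte count is odd, so that the
# bytes can be grouped in pairs. The filler must be a single-byte character.
def pad_to_even_bytes(text, filler = " ")
  text.bytesize.odd? ? text + filler : text
end

puts pad_to_even_bytes("hello world").bytesize   # 11 bytes padded to 12
puts pad_to_even_bytes("\u00E8").bytesize        # already even, left alone
```

The trailing space is easy to strip again on the reading side.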

Preparing the app

I'm not going to write the tweet by copy-pasting the text. The idea is to write and read tweets through an app. Its work would be the core of a client that lets you write and read longer tweets, provided they are written by the same app (we'll likely need to waste a single character as a marker).
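The marker idea could look like the sketch below. Everything in it is hypothetical: the helper names, and the choice of U+25FF as the marker. That code point is picked only because its low byte is 0xFF, a byte that never occurs in valid UTF-8, so packing pairs of UTF-8 bytes can never produce it by accident.

```ruby
MARKER = "\u25FF"   # hypothetical marker character, see the note above
TWEET_LIMIT = 140

# Prefix a packed text with the marker, refusing anything too long.
def mark(packed)
  tweet = MARKER + packed
  raise ArgumentError, "too long for one tweet" if tweet.length > TWEET_LIMIT
  tweet
end

# Was this tweet written by the same kind of client?
def marked?(tweet)
  tweet.start_with?(MARKER)
end

# Return the packed payload, or nil for an ordinary tweet.
def unmark(tweet)
  marked?(tweet) ? tweet[MARKER.length..-1] : nil
end
```

With one character spent on the marker, 139 characters remain, i.e. room for 278 bytes of real text.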

For this PoC I'll use chatterbot — I've just finished authorizing an app of mine to access shintakezou's Twitter account.

Write a tweet

I will use the following text:

In Italian language the subject isn't mandatory; so we say “è bello” and the grammatical subject (he/it) is already contained in the conjugation of the verb, “è” — that is, present indicative tense, third singular person of the verb “essere” (to be).

It's 266 bytes long (so, 133 fake characters long). When encoded using iconv as shown above, it will be 399 bytes long.
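The arithmetic can be checked with a small Ruby sketch. `utf8_length` is a hypothetical helper and the four sample units are taken from the tweet's JSON shown further down. Every pair whose second byte is a printable character yields a code of at least 0x2000, which UTF-8 encodes with 3 bytes.

```ruby
# Number of bytes UTF-8 needs for a code point in the 16-bit range.
def utf8_length(code)
  if code < 0x80 then 1
  elsif code < 0x800 then 2
  else 3
  end
end

units = [0x6E49, 0x4920, 0x2065, 0x209D]   # sample units from the tweet
lengths = units.map { |u| utf8_length(u) }

source_bytes  = 266
fake_chars    = source_bytes / 2    # two source bytes per fake character
encoded_bytes = fake_chars * 3      # three UTF-8 bytes per fake character

puts "#{fake_chars} fake characters, #{encoded_bytes} bytes"
```

So the tweet carries 399 bytes for 266 bytes of text, while the character counter sees only 133.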

Since I'm using chatterbot, which is written in Ruby, let's fire up irb and do a quick check.

txt = File.read("data")
puts txt.length

The output is 133, as expected.

Ok, let's tweet it now.

湉䤠慴楬湡氠湡畧条⁥桴⁥畳橢捥⁴獩❮⁴慭摮瑡牯㭹猠⁯敷猠祡鲀ꣃ戠汥潬胢₝湡⁤桴⁥牧浡慭楴慣⁬畳橢捥⁴栨⽥瑩⁰獩愠牬慥祤挠湯慴湩摥椠⁮桴⁥潣橮杵瑡潩⁮景琠敨瘠牥Ɫ鲀ꣃ胢₝胢ₔ桴瑡椠ⱳ瀠敲敳瑮椠摮捩瑡癩⁥整獮ⱥ琠楨摲猠湩畧慬⁲数獲湯漠⁦桴⁥敶扲鲀獥敳敲胢₝琨⁯敢⸩
— Mauro Panigada (@shintakezou) May 8, 2016

Read the tweet

In order to read back the specific tweet, I will use curl; something like this:

curl -H "Authorization: Bearer ACCESSTOKEN" \
  "https://api.twitter.com/1.1/statuses/show.json?id=729363443107631104" \
  -o tweet.json

Of course ACCESSTOKEN is the bearer access token. Here you should learn everything you need to get one. (I haven't checked if chatterbot can get a specific tweet by its id.)

We obtain the JSON object representing the tweet; its text property contains our text, encoded as 2-byte “Unicode” “entities” (just the way you specify “unicode” chars in a literal string in JSON):

\u6e49\u4920\u6174\u696c\u6e61\u6c20\u6e61\u7567
\u6761\u2065\u6874\u2065\u7573\u6a62\u6365\u2074
\u7369\u276e\u2074\u616d\u646e\u7461\u726f\u3b79
\u7320\u206f\u6577\u7320\u7961\ue220\u9c80\ua8c3
\u6220\u6c65\u6f6c\u80e2\u209d\u6e61\u2064\u6874
\u2065\u7267\u6d61\u616d\u6974\u6163\u206c\u7573
\u6a62\u6365\u2074\u6828\u2f65\u7469\u2070\u7369
\u6120\u726c\u6165\u7964\u6320\u6e6f\u6174\u6e69
\u6465\u6920\u206e\u6874\u2065\u6f63\u6a6e\u6775
\u7461\u6f69\u206e\u666f\u7420\u6568\u7620\u7265
\u2c62\ue220\u9c80\ua8c3\u80e2\u209d\u80e2\u2094
\u6874\u7461\u6920\u2c73\u7020\u6572\u6573\u746e
\u6920\u646e\u6369\u7461\u7669\u2065\u6574\u736e
\u2c65\u7420\u6968\u6472\u7320\u6e69\u7567\u616c
\u2072\u6570\u7372\u6e6f\u6f20\u2066\u6874\u2065
\u6576\u6272\ue220\u9c80\u7365\u6573\u6572\u80e2
\u209d\u7428\u206f\u6562\u2e29

That is already our fake UCS-2 stream!

We could extract the bytes from there and interpret them as UTF-8 in order to obtain our original text. (Beware: this way we introduce an undesired dependency on how Twitter encodes the strings in the JSON it returns. You must carefully avoid this in production code, while it is fine for quickly checking that the bytes are those we expect.)

Instead let's put it into a file using Ruby again; something like:

require 'json'
h = JSON.parse(File.read("tweet.json"))
File.open("original-u8.txt", "w") { |f|
  f.write(h["text"])
}

In irb we should read

=> 399

Good! Let's extract the real original text:

iconv -f utf-8 -t ucs2 <original-u8.txt >original.txt

And now we can read our original text in original.txt.
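The reading side can also be done entirely in Ruby, without iconv. This is a sketch under assumptions: `unpack_text` is a hypothetical name, the byte order is little-endian as in the tweet above, and the JSON literal is a shortened stand-in holding only the first five units of the real response.

```ruby
require 'json'

# Turn each "character" back into its two original bytes and read the
# resulting byte string as UTF-8.
def unpack_text(packed)
  units = packed.unpack("U*")                       # code points of the tweet
  units.pack("v*").force_encoding(Encoding::UTF_8)  # two bytes per code point
end

# Shortened stand-in for tweet.json (first five units only).
json = '{"text":"\u6e49\u4920\u6174\u696c\u6e61"}'

packed   = JSON.parse(json)["text"]   # the JSON parser decodes the \u escapes
original = unpack_text(packed)
puts original                         # prints "In Italian"
```

Because the JSON parser does the decoding, this does not depend on whether the server sends `\u` escapes or raw UTF-8.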

What's next

The next step is to write a usable client to exploit this. I have the name already: JACNAF, i.e. Just Another Client Nobody Asked For.

How does it look

If you have a font that can represent those characters, my tweet looks like this:

Pay attention to the few characters that are in fact placeholders showing the code of a character there is no glyph for (namely, 2072 and 209D in my case, but it depends on your fonts).

¹ I have my thesis: no, it's not a value if we use Twitter to discuss complex topics. In fact, complex arguments are too easily and wildly reduced to slogans and aphorisms. Forcing people to express complex thoughts in few words makes them more stupid, less accustomed to reasoning, to dialogical confrontation and to the evaluation of nuances. Misunderstandings become easier, attention spans shorter. If you sincerely try to use it to discuss with people who have a positive attitude, you will both likely produce several tweets for a single argument. The resulting fragmentation of the discourse makes it hard, especially for anybody who stumbles upon just one of the tweets, to look at the discourse as a whole, holistically. Briefly, Twitter can be used effectively only for slogans, aphorisms, self-promotion, link and image sharing, and short self-contained black-or-white opinions. In all other cases it becomes wasteful and almost dangerous, increasing that sort of “communication hysteria” which is partly a consequence of the speed at which we expect to communicate in order to consume faster an unmanageable amount of unstructured information.
² Circa, since if we write in a language that already uses more than one byte for a single character, then the trick is less effective for that character.


2016-05-08
