Category Archives: Artificial intelligence

Can AI understand webcomics?

Vision language models (VLMs) are a new type of artificial intelligence that combines computer vision with the natural language processing capabilities of large language models. VLMs can often chat, but they are designed specifically to excel at understanding and generating text about images.

Over the past several months, many VLMs have been released, including open-source ones that can be run locally on home computers. In this post, I will test their capabilities using this webcomic:

Based on the original “Illusion” webcomic by Abstruse Goose (CC BY-NC 3.0 US).

Here is the prompt that I used:

What is the message being conveyed by this image?

And here are the results:

Chameleon:

The image is a cartoon-style diagram that depicts a simple flowchart. The main subject is a sheep, which is depicted as a black outline with a white body. The sheep is situated in the center of the image and faces to the left. The flowchart itself is drawn on a white background and features three main branches, each of which is labeled with a word or phrase. The branches originate from a central node labeled "This is a flowchart." The leftmost branch is labeled "Chickens," while the middle branch is labeled "Foxes." The rightmost branch is labeled "Flies." Each branch then splits into smaller sub-branches, which are labeled with additional words or phrases. The leftmost branch has two sub-branches: "Lay eggs" and "Get eaten." The middle branch has three sub-branches: "Catch mice," "Catch rabbits," and "Get eaten." The rightmost branch has two sub-branches: "Bite humans," and "Get eaten." Each sub-branch ends in a small circle or oval, which adds a sense of finality to the flow

ChartGemma:

The image is a satirical depiction of the illusion of free choice, highlighting the potential limitations of democratic systems.

Idefics2-8B:

There are two cows, one on each side of a wall. The cow on the left represents Democrats and the cow on the right represents Republicans. They both want to get to the "Slaughter House" but there's a big obstacle in their way. This illustrates how both parties work together for their own interests instead of working together as a whole.

Idefics2-8B-chatty:

The message being conveyed by this image is that there are limited options or choices available for both Democrats and Republicans, which can be seen as a metaphorical representation of the political landscape in the United States. The cow standing at the bottom of the image symbolizes the American people who are often faced with only two main political parties to choose from when voting. This limited choice can lead to a sense of frustration or dissatisfaction among voters, as they may feel that their preferences and concerns are not fully represented by either party.

The image also highlights the idea that both Democrats and Republicans might have some similarities or shared beliefs, despite their differences. This can be seen through the fact that they are both standing on the same side of the wall, suggesting that they share common ground or goals. However, the wall itself represents the division between these two groups, emphasizing the polarization and partisanship that often characterize American politics.

In conclusion, the image of the cow standing below the two political parties and the wall between them serves as a reminder of the limited choices and the division that exists within the American political system, while also hinting at potential areas of agreement or commonality between the two major parties.

LLaVA 1.5:

The image conveys a message about the concept of "free choice" and the illusion of it. It features a cow standing in front of a sign that says "Slaughter House" and another sign that says "Republicans." The cow is looking at the signs, which suggests that it is aware of its situation and the potential consequences of its actions. The image might be suggesting that even though the cow has the freedom to choose, its choices are ultimately limited by its environment and the decisions made by others. This can be interpreted as a commentary on the limitations of free choice in certain situations or the influence of external factors on individual decision-making.

OpenGPT 4o:

The image is a cartoon that illustrates a political divide between Democrats and Republicans. It features a cow standing in the middle of a road with a sign that reads "SLAUGHTER HOUSE" on one side and "THE ILLUSION OF FREE CHOICE" on the other. The cow appears to be looking at the sign on the "SLAUGHTER HOUSE" side, which suggests that the cow is being influenced by the illusion of free choice on that side. The message being conveyed is that the illusion of free choice can be misleading and that the true power in politics is held by those who control the "SLAUGHTER HOUSE," which is a metaphor for the concentration of power in the hands of a small elite.

In order to make things a bit more challenging for the AI, I then repeated the same experiment but with the explanation text removed:

The edited version with explanation text removed.

Here are the results of that second experiment:

Chameleon:

This image is conveying a message that when you make a big change, it can cause a lot of pain and discomfort, but in the long run, it will lead to great rewards.

Chameleon (another attempt):

The image is a cartoon illustration that shows three different types of borders and their purpose. The illustration is accompanied by a text that reads "Borders serve a purpose". The illustration is a clear representation of how borders help define and protect different areas or entities.

ChartGemma:

The image suggests that Republicans are moving further apart from Democrats, while Democrats are moving further apart from Republicans.

Idefics2-8B:

This cartoon shows a cow looking at two doors. One door has "Democrats" written on it and the other one has "Republicans." The cow represents the American people who are trying to decide which party they want to vote for in an election.

Idefics2-8B-chatty:

The message being conveyed by this image is that there are two distinct groups of people or political parties - Democrats and Republicans - who have different opinions and beliefs. These differences are symbolized by the cow standing in front of the two separate walls, representing the divide between these groups. The cow looking at the wall of the "Slaughter House" suggests that the Democrats might feel threatened or endangered by the actions or policies of the Republicans, while the cow looking at the wall of the "Republicans" indicates that the Republicans may also have concerns or disagreements with the Democrats. Overall, the image highlights the polarization and division within politics, where opposing viewpoints often lead to conflict and tension.

LLaVA 1.5:

The image conveys a message about the political divide in the United States, specifically between Democrats and Republicans. The cow is standing in front of two doors, each labeled with the name of a political party. The cow represents a neutral observer, and the doors symbolize the two opposing political parties. The image suggests that the political divide is as deep as the ocean, with no easy way to bridge the gap between the two parties. This can be interpreted as a commentary on the challenges faced in achieving consensus and cooperation between the two parties in the political arena.

OpenGPT 4o:

The image is a cartoon that illustrates a humorous take on the political divide in the United States. It features a cow standing in the middle of a road with two signs on either side, labeled "Democrats" and " Republicans". The cow appears to be facing the "Democrats" sign, but is facing away from the " Republicans" sign. This could be interpreted as a commentary on how people may be more interested in the policies or ideas of the party they align with, rather than the party itself. The image is not meant to be taken seriously and is likely intended as a light-hearted joke.

As you can see, any perceived understanding immediately falls apart once the explanation text is removed. And even with the text in place, there is still room for improvement: Chameleon described the cow as a sheep, while Idefics2-8B saw two cows when there is only one. Chameleon in particular seems to be quite prone to hallucinating. Given that it’s actually one of the more recent models, I’m not sure why it did so badly.
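One crude way to make this comparison concrete is to check whether an answer even mentions the comic's central idea. The sketch below does that for ChartGemma's two answers quoted above. The keyword list is an assumption chosen for illustration, and a keyword match is much weaker evidence than actually reading the answer (a model can repeat the phrase and still misread the comic).

```python
# Crude keyword check: does an answer mention the comic's central idea?
# KEY_PHRASES is an assumed list chosen for illustration only.
KEY_PHRASES = ("illusion", "free choice")

# ChartGemma's answers, copied verbatim from the two experiments above.
answers = {
    "ChartGemma (with text)": (
        "The image is a satirical depiction of the illusion of free choice, "
        "highlighting the potential limitations of democratic systems."
    ),
    "ChartGemma (text removed)": (
        "The image suggests that Republicans are moving further apart from "
        "Democrats, while Democrats are moving further apart from Republicans."
    ),
}


def mentions_theme(answer: str, phrases=KEY_PHRASES) -> bool:
    """Return True if any key phrase appears in the answer (case-insensitive)."""
    text = answer.lower()
    return any(phrase in text for phrase in phrases)


results = {name: mentions_theme(text) for name, text in answers.items()}
for name, hit in results.items():
    print(f"{name}: {'mentions' if hit else 'misses'} the theme")
```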

If you know of any other free models that I could test for this experiment, feel free to post them in the comments. As the technology improves, I’m sure better VLMs will be released in the future that will easily pass this test. Until then, it’s probably fair to say that the answer to the question posed by this post’s title is “no” – i.e. current AIs aren’t quite intelligent enough to understand webcomics yet.

Turn (almost) any article into a podcast with Podcastify

I was browsing through the many spaces on Hugging Face recently and found this interesting one that claims to be able to convert any text-based article into a simulated podcast featuring two AI speakers who take turns speaking.

It uses a range of open-source technologies, including Jina AI’s Reader for converting URLs into AI-friendly text, the large language model Hermes-2-Pro-Llama-3-8B by Nous Research for function calling support, and the multilingual text-to-speech library MeloTTS by MyShell to generate the actual voices.
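To give a feel for how the pieces could fit together, here is a minimal sketch of the first and last steps of such a pipeline. It assumes that Reader is called by prefixing the article URL with https://r.jina.ai/ and that the script is split between two alternating speakers. The function names and speaker labels are hypothetical, and none of this is confirmed to be how Podcastify itself works. The language model and text-to-speech steps are left out.

```python
# Sketch of two pipeline steps (assumptions, not Podcastify's actual code).

# Assumption: Reader is used by prefixing the target URL with this address.
READER_PREFIX = "https://r.jina.ai/"


def reader_url(article_url: str) -> str:
    """Build the Reader URL that would return the article as plain text."""
    return READER_PREFIX + article_url


def assign_speakers(lines, speakers=("Host", "Guest")):
    """Alternate (hypothetical) speaker labels over the lines of a script."""
    return [(speakers[i % len(speakers)], line) for i, line in enumerate(lines)]


if __name__ == "__main__":
    print(reader_url("https://example.com/post"))
    script = ["Welcome to the show.", "Glad to be here.", "Let's get started."]
    for speaker, line in assign_speakers(script):
        print(f"{speaker}: {line}")
```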

At first I tried pasting in the Wikipedia article for AI but that brought up an error. Sensing that perhaps length could be the issue, I tried the same with the much shorter Simple English version but still got the same error. But trying a third time with an earlier article I wrote for this blog about generative AI ended up working!

Here is the result of that experiment (unlike a normal podcast, it’s only 80 seconds long).

As you can hear, it’s not perfect. The voices are still somewhat robotic-sounding and some of the pronunciations are wrong. But it’s still a very interesting technology. I can see tools like this being useful for those who don’t want to read through pages upon pages of text but would rather get the gist of a subject while waiting in traffic or riding the bus, etc. That being said, a big part of the appeal of podcasts is that they’re actually entertaining to listen to, and I don’t think this AI is quite there yet.

And finally, after this post has been published, I will ask Podcastify to generate a new podcast episode based on it – making this the world’s first self-referential podcast episode!

Looking for a mobile generative AI workstation on a budget? Consider buying an old gaming laptop

This gaming laptop from 2017 has more VRAM (8 GB) and storage (512 GB + 1 TB) than most current laptops.

If you’re interested in running text-to-image models and large language models locally but buying a desktop isn’t an option then consider purchasing a used gaming laptop instead.

The newer RTX 20, 30, and 40 series GPUs are great cards that provide big performance boosts for games that take advantage of their newer features such as hardware ray tracing (in fact, the “R” in “RTX” stands for ray tracing) and DLSS. If you intend to play recent games at medium to high settings with high FPS, then by all means go for these modern cards.

But when it comes to AI, VRAM is king, and newer GPUs by Nvidia have been very lacking in the VRAM department. Basically, the AI boom caused Nvidia to become extremely stingy with the amount of VRAM that they put in their consumer products because they don’t want to cannibalize sales of their much more profitable enterprise products designed for corporate clients and data centers.

Case in point, the mobile GTX 1070 and RTX 4070 were mid-range cards released in 2016 and 2023 respectively. Yet, they both have the same amount of VRAM – 8 GB. If you want more VRAM in a portable form factor, Nvidia forces you to splurge on the much more expensive mobile RTX 4080 or 4090. And until AMD can prove itself to be a viable competitor in the mobile space, this situation is unlikely to improve.
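To see why the amount of VRAM matters so much, a rough rule of thumb is that a model's weights take about (number of parameters) × (bytes per parameter) of memory. The sketch below applies that rule to an 8 GB card. The model sizes and precisions are illustrative assumptions rather than figures from this post, and real usage will be higher once activations and other overhead are added.

```python
# Rough weight-only memory estimate: parameters x bytes per parameter.
# The model sizes and precisions used here are illustrative assumptions.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weights_gb(params_billions: float, precision: str) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billions * BYTES_PER_PARAM[precision]


def fits(params_billions: float, precision: str, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM (ignores overhead)."""
    return weights_gb(params_billions, precision) <= vram_gb


if __name__ == "__main__":
    for size in (7, 13):
        for precision in ("fp16", "int8", "int4"):
            verdict = "fits" if fits(size, precision, 8) else "does not fit"
            gb = weights_gb(size, precision)
            print(f"{size}B @ {precision}: ~{gb:.1f} GB, {verdict} in 8 GB")
```

By this estimate, a 7-billion-parameter model at 16-bit precision needs about 14 GB for its weights alone, which is why quantized versions are the usual choice on 8 GB cards.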

Storage is another reason why these older laptops are great for AI tasks. As laptops switched from having both a hard drive and a solid state drive (SSD) to having just an SSD, their storage capacities actually decreased. Most gaming laptops from 2016 to 2019 should be fitted with both an SSD and a mechanical hard drive. The idea was that you would have your operating system and commonly used applications installed on the SSD while things like your photos, videos, music, and documents that don’t benefit from the speed advantages of the SSD can be stored on the hard drive where storage is slower but cheap and plentiful.

For generative AI, the applications you run will likewise be limited by the GPU rather than by drive speed, so the performance benefits of the SSD would be wasted. Furthermore, you will most likely need to keep large datasets and media files on hand, which calls for plenty of cheap storage space where speed does not matter.

And finally, if you’re wondering whether the new laptops marketed with “AI” branding are any better for generative AI then unfortunately, they’re not. Compared to GPUs, the neural processing units (NPUs) inside these laptops are still fairly useless for generative AI tasks. A decent NPU might achieve 40 trillion operations per second (TOPS) while a dedicated GPU can achieve hundreds of TOPS to over 1,200 TOPS depending on the model. Instead, these NPUs are designed to assist in office tasks that might benefit from a light sprinkling of AI such as face detection and reducing background noise during video and voice calls while consuming very little energy in the process and thus reserving the GPU for more demanding tasks.
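To put those numbers in perspective, the short calculation below compares the two ballpark figures quoted above. It is only a sketch: peak TOPS ratings are theoretical and are often quoted at different numeric precisions, so the ratio is indicative at best.

```python
# Ratio of GPU to NPU throughput using the ballpark figures quoted above.
NPU_TOPS = 40          # "a decent NPU"
GPU_TOPS_HIGH = 1200   # upper figure quoted for a dedicated GPU


def speedup(gpu_tops: float, npu_tops: float) -> float:
    """How many times more operations per second the GPU can perform."""
    return gpu_tops / npu_tops


if __name__ == "__main__":
    ratio = speedup(GPU_TOPS_HIGH, NPU_TOPS)
    print(f"A {GPU_TOPS_HIGH} TOPS GPU is about {ratio:.0f}x a {NPU_TOPS} TOPS NPU")
```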

Depending on where you live, you should be able to score a gaming laptop from 2016 to 2019 for about $250-500 on the second-hand market. Try to aim for at least 16 GB RAM with a 256 GB SSD and a 1 TB mechanical hard drive. The CPU is unlikely to be a bottleneck because gaming laptops rarely come with crappy Celerons or pre-Ryzen AMD APUs. That being said, try to aim for an Intel Core i5 or i7. When it comes to graphics cards, some excellent choices are the 6 GB version of the GTX 1060 (avoid the 3 GB version), the 8 GB GTX 1070, and the 8 GB GTX 1080. AMD GPUs aren’t recommended due to their lack of CUDA support.
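If you are sifting through a lot of listings, the checklist above is easy to turn into a filter. The sketch below is one way to do it. The `Listing` structure and the sample listings are made up for illustration, and the thresholds simply mirror the recommendations in this post (16 GB RAM, an Nvidia GPU with at least 6 GB VRAM, and a price of up to $500).

```python
from dataclasses import dataclass


@dataclass
class Listing:
    """A second-hand laptop listing (hypothetical structure for illustration)."""
    name: str
    ram_gb: int
    gpu_vendor: str
    vram_gb: int
    price_usd: int


def meets_checklist(listing: Listing, max_price_usd: int = 500) -> bool:
    """Apply the post's rules of thumb: RAM, Nvidia GPU, VRAM and budget."""
    return (
        listing.ram_gb >= 16
        and listing.gpu_vendor.lower() == "nvidia"  # needed for CUDA support
        and listing.vram_gb >= 6                    # rules out the 3 GB GTX 1060
        and listing.price_usd <= max_price_usd
    )


# Made-up sample listings, not real offers.
samples = [
    Listing("Laptop A (GTX 1070)", ram_gb=16, gpu_vendor="Nvidia", vram_gb=8, price_usd=400),
    Listing("Laptop B (GTX 1060 3 GB)", ram_gb=16, gpu_vendor="Nvidia", vram_gb=3, price_usd=300),
    Listing("Laptop C (AMD GPU)", ram_gb=16, gpu_vendor="AMD", vram_gb=8, price_usd=350),
]

shortlist = [s.name for s in samples if meets_checklist(s)]
print(shortlist)
```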

“Alexology” according to text-to-image AI models

An image generated from the prompt “Alexology”.

Just for fun, I decided to ask a few text-to-image artificial intelligence models to generate images using just the prompt “Alexology” with their default negative prompts. Here are some of the images that they generated:

Even though it still messes up often, AI has come a long way in being able to generate coherent text. Something like this would have been difficult to make just a year or two ago.

Also what’s that on his head?

This one shows one of the most defining traits of AI-generated images. At first glance, everything looks fine but if you zoom in then a lot of the details don’t really make sense.

This one looks like it could be a pretty good album cover for a retro-inspired band (shame about the mangled text though).

This one could also be an album cover or a great promotional image for a video game.

What are your thoughts on AI-generated images? Love them? Hate them? Feel free to share your thoughts and creations in the comments.