Web Scraping to feed @CastroTiempo Twitter Bot

A program that crawls a webpage, grabs some weather information about my city, and then feeds a Twitter bot (@CastroTiempo) that I also coded myself.



This week's entry is dedicated to something I learnt a few weeks ago and wanted to implement in one of my projects, but until now I didn't have any idea worth spending my time on (not saying that this one was).

It will be tweeting for some time; I haven't decided how long yet. I have a "computer" dedicated to running the code forever, so it's basically until:

  1. I get tired of it, or
  2. it doesn't get much feedback, or
  3. there is an error in the code that makes it stop (this is the most likely one), or
  4. I pull the power cord and the "computer" switches off (I guess this is the second most likely one).*

So I was thinking of coding a little program that crawls a webpage, grabs some information, and then does something with it. This technique is called web scraping, and if you want to learn more about it, I'll leave some resources at the end of the entry as usual.

For the web scraping part of the code I used a Python library called Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#), which probably has the most unpleasant webpage to read of all those I came across, but at least they have some documentation. It usually comes together with another library called Requests (https://requests.readthedocs.io/en/master/), or as they call it, "HTTP for Humans".
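Just to give you a flavour of how those two work together before the real code below, here is a minimal sketch (not part of the bot; the URL is just the page I scrape, and soup.title is one of many things you can query):

import requests
from bs4 import BeautifulSoup

# Download the page and parse its HTML
res = requests.get("https://aqicn.org/city/spain/cantabria/castro-urdiales/es/")
res.raise_for_status()  # complain loudly if the download failed
soup = BeautifulSoup(res.text, "html.parser")

# Once parsed, the document can be queried, e.g. the <title> tag
print(soup.title.string)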

For the tweeting part of the project I used the most awesome of them all, Tweepy (https://www.tweepy.org/), which has very nice documentation but not too many examples, so better go search for those somewhere else. And this is not enough to make a script of yours tweet: you also need to open a Twitter account, ask for API keys, and tell the Twitter people what you are going to use them for, and so on and so on. After some hours they will review your application, and if you didn't say you would use it to spam the hell out of people, they'll probably grant you the almighty secret token, auth tokens, and developer keys.

Some other libraries that I used are datetime and re (regular expressions), which come built in, so you don't have to worry about installing them. The code also uses pytz for time zones, which does need to be installed.
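By the way, if you want to run this yourself, the third-party libraries can be installed with pip (the usual command; your setup may differ):

pip install requests beautifulsoup4 tweepy pytz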

As usual, some code for you to copy if you want to (I know you won't but let's just pretend, ok?)

import requests
from bs4 import BeautifulSoup
import re
import tweepy
from datetime import datetime
import pytz


# Authenticate with the Twitter API using the keys Twitter gives you
auth = tweepy.OAuthHandler("secret", "secret_2")
auth.set_access_token("token_1", "token_2")
api = tweepy.API(auth)

# Post the bot's first tweet
api.update_status("Hola castreños y castreñas, soy @CastroTiempo, el bot con el tiempo de Castro-Urdiales actualizado cada hora y en tiempo real.")

The very first lines just import the libraries I mentioned before; then I generate an API object (this is where you put your keys, tokens, or whatever you call them) and post the very first message of the account.

def update():
    import time
    while True:
        # Download and parse the weather page for Castro-Urdiales
        res = requests.get("https://aqicn.org/city/spain/cantabria/castro-urdiales/es/")
        soup = BeautifulSoup(res.text, 'html.parser')

        # Grab the HTML elements that hold each measurement
        ACI = str(soup.find(id="aqiwgtvalue"))
        humid = str(soup.find(id="cur_h"))
        wind = str(soup.find(id="cur_w"))
        pressure = str(soup.find(id="cur_p"))

        # RegEx takes 0 to 4 digits (or a dash) between > and <
        pattern = re.compile(r"(>\d{2}<)|(>\d{3}<)|(>\d{1}<)|(>\d{0}<)|(>-<)|(>\d{4}<)")

        ACI_str = (pattern.search(ACI)).group()  # AQI number, string version
        calidad = str(f"El nivel de Calidad del Aire es {ACI_str[1:len(ACI_str)-1]}\n")

        # The temperature lives in a <span class="temp"> tag: one or two digits,
        # with an optional minus sign, inside its temp="" attribute
        pattern_temp_curr = re.compile(r"((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{1})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{1})+(\"))")

        temp = str(soup.body)
        temp_str = (pattern_temp_curr.search(temp)).group()  # temperature, string version
        temperature = str(f"La temperatura es de {temp_str[37:len(temp_str)-1]} ºC\n")

        humid_str = (pattern.search(humid)).group()
        humidity = str(f"La humedad es del {humid_str[1:len(humid_str)-1]} %\n")

        wind_str = (pattern.search(wind)).group()
        viento = str(f"Tenemos un viento de {wind_str[1:len(wind_str)-1]} km/h\n")

        pressure_str = (pattern.search(pressure)).group()
        press = str(f"La presion es de {pressure_str[1:len(pressure_str)-1]} hPa\n")

        # Current time in Madrid's time zone
        tz_Madrid = pytz.timezone('Europe/Madrid')
        datetime_Madrid = datetime.now(tz_Madrid)
        hora = str("¡Hola! Son las " + datetime_Madrid.strftime("%H:%M") + "\n")

        # Build the tweet, dump it to a file, then read it back and post it
        mensaje = hora + temperature + humidity + press + viento + calidad

        with open("temp.txt", "w") as f:
            f.write(mensaje)

        with open("temp.txt", "r") as f:
            api.update_status(f.read())

        # Wait one hour before the next update
        time.sleep(60*60)

update()

And here comes the chicha, as we say in Spanish: the bulk, the part with some enjundia. OK, enough.

It is just a function called update, because it updates my Twitter with the weather conditions.

First I create a soup object with the parsed webpage that I crawl, which is this one by the way (https://aqicn.org/city/spain/cantabria/castro-urdiales/es/), and then give my crawler, scraper, spider (call it as you please) some parameters to look for. In this case, I was looking for the:

  • Air Quality Index
  • Temperature
  • Humidity
  • Wind
  • Atmospheric pressure

This was a pain in the ass: the webpage construction was not consistent, so I didn't know what I had to look for when I moved from one parameter to another. But it was so pleasant to finally get the one I was aiming for (believe me, it was).

So I also had to use Regular Expressions (a.k.a. RegEx). These are expressions that search for matches in strings of characters; that's why that chunk of r"(>\d{2}<)|(>\d{3}<)|(>\d{1}<)|(>\d{0}<)|(>-<)|(>\d{4}<)" appears there, to look for the very specific characters I was after. And the same applies to r"((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{1})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{1})+(\"))", which was even more complicated to get.
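If you have never touched RegEx before, here is a tiny self-contained sketch of what that first pattern does (the HTML snippet is made up for illustration, it's not the real markup of the page):

import re

# The same pattern the bot uses: digits (or a dash) enclosed between > and <
pattern = re.compile(r"(>\d{2}<)|(>\d{3}<)|(>\d{1}<)|(>\d{0}<)|(>-<)|(>\d{4}<)")

# A fake element like the ones the page serves
humid = '<div id="cur_h">78</div>'

match = pattern.search(humid).group()  # '>78<'
print(match[1:len(match)-1])  # '78', stripping the > and < around it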

Then it's more of the same basic stuff: it opens a .txt file where I write what I want to tweet, saves it, and then opens it again to update my timeline (i.e. to tweet).

Finally, there's just a waiting time of 60*60 = 3600 seconds, or in human language, 1 hour, before the loop starts over. The last line calls the function, which then runs in an infinite loop, since the bool True will always be so.

Where the code is running

So, now that the code is ready to go, I need somewhere to run it. I had three options:
  1. Have it running on my computer, which would imply never turning off my laptop (too noisy, and I don't want to wear it out).
  2. Have it running in the cloud (AWS, for instance). I couldn't do that, and I will explain why later on.
  3. Have another dedicated computer to run it. It seemed like a bit of an overkill, but that's what I went for.
The computer it's running on is a Raspberry Pi 3 B+, a tiny computer worth around 40 euros that you can find at Pimoroni, for instance; this is where I bought mine (https://shop.pimoroni.com/collections/raspberry-pi), and the precise RasPi where it's running is the one in the photo. I already had one, but it is being used for another thing in my 3D printer, which will come in another entry.
The RasPi with the internet cable and the charger to power it

I'm sorry I don't have a banana for scale, but it's about one pointing finger in length by one thumb in width; thickness can be neglected.

Why couldn't I run it in the cloud and save 50 €?

That's a legit question that I'll try to answer succinctly without looking like a complete idiot. My first approach was not only to report the weather in plain text but also to show a photo of Castro at the moment of reporting it. I live in the upper part of Castro, so I have very good views of it. I implemented the code so it would post the picture taken at that moment; just an old webcam connected to my laptop would do the job, and it was working perfectly. The picture quality could be better, but that's not important.
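That photo part of the code didn't make it into this version, but it looked roughly like this (a sketch from memory, assuming OpenCV for the capture and Tweepy's update_with_media for the upload; api is the authenticated object from before):

import cv2

# Grab a single frame from the first webcam attached to the machine
cam = cv2.VideoCapture(0)
ok, frame = cam.read()
cam.release()

if ok:
    # Save the frame to disk and tweet it with a caption
    cv2.imwrite("castro.jpg", frame)
    api.update_with_media("castro.jpg", status="Foto de Castro-Urdiales")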

I was very happy, my code was running smoothly, and it included pictures of Castro, which I think was a very good feature. Then I started setting it up on the Raspberry Pi, and it turns out I couldn't manage to install OpenCV (the package I used in previous entries) on the Pi, so I just quit trying.

To have the thing running forever and posting updated pictures of Castro, it needed a camera running forever; that's why I spent that money.

I think this is getting too long, but anyway: I also wanted to display the temperature, but the RasPi again wasn't working as I intended, so I had to remove the temperature from this initial launch. I will work on that, don't worry (I know you don't worry).

Conclusion

The thing tweets, so as far as that requirement goes, it's a success. The tweeting rhythm is low because the webpage I grab the data from refreshes it very infrequently (I should have checked that before).

I will definitely keep working on it to add all the features that couldn't make it into this first version. Hopefully, they will come in the next entries.

For the resources, I think that I've given you the main webpages I've used.

Here is some evidence that the program actually scrapes the webpage and it's not just me manually tweeting.

Screenshot taken at the moment of the first tweet (more or less)

I can't even remember properly, but in the last entry I promised something with 3D printing. It will come; I just had some issues and I'm very busy.

These projects are taking me a lot of time and effort. If you are liking them, have any ideas, or overall you are just my friend (or not) and want to support my work, consider subscribing to the blog, commenting, liking, or sharing it with your friends who might be interested. Also follow me (or the bot, or both, I don't even know); I leave you the button here.




Hope you liked it,

:D

*Who knows, maybe the computer explodes, or all of them happen at the same time.
