Web Scraping to feed @CastroTiempo Twitter Bot
A program that crawls a webpage, grabs some weather information about my city, and then feeds a Twitter bot (@CastroTiempo) that I also coded myself.
It will be tweeting for some time; I have not decided how long yet. I have a "computer" dedicated to running the code forever, so it's basically until:
- I get tired of it, or
- It doesn't get much feedback, or
- There is an error in the code that makes it stop (this one will likely happen first), or
- I pull the electric cord and the "computer" switches off (I guess this is the second most probable one).*
So I was thinking of coding a little program that crawls a webpage, grabs some information, and then does something with it. This technique is called Web Scraping, and if you want to learn more about it, I will leave some resources at the end of the entry, as usual.
For the Web Scraping part of the code I used a Python library called Beautiful Soup (https://www.crummy.com/software/BeautifulSoup/bs4/doc/#), which probably has the most unpleasant webpage to read of all those I came across, but well, at least they have some documentation. It usually comes together with another library called Requests (https://requests.readthedocs.io/en/master/), or as they call it, "HTTP for Humans".
For the tweeting part of the project I used the most awesome of them all, Tweepy (https://www.tweepy.org/), which has very nice documentation but not many examples, so you'd better go search for those somewhere else. And this is not enough to make a script of yours tweet: you also need to open a Twitter account, ask for API keys, and tell the Twitter people what you are going to use them for, and so on. After some hours they will review your application, and if you didn't say that you would use it to spam the hell out of people, they'll probably grant you the almighty secret token, auth tokens, and developer keys.
Some other libraries that I used are datetime and re (RegEx), which come built in, so you don't have to worry about installing them; pytz, which handles time zones, does need to be installed.
As usual, here is some code for you to copy if you want to (I know you won't, but let's just pretend, OK?):
import requests
from bs4 import BeautifulSoup
import re
import tweepy
from datetime import datetime
import pytz
auth = tweepy.OAuthHandler("secret", "secret_2")
auth.set_access_token("token_1", "token_2")
api = tweepy.API(auth)
api.update_status("Hola castreños y castreñas, soy @CastroTiempo, el bot con el tiempo de Castro-Urdiales actualizado cada hora y en tiempo real.")
The very first lines just import the libraries I mentioned before; then I generate an API object (this is where you put your keys, tokens, or whatever you call them) and post the very first message of the account.
def update():
    import time  # only needed inside this loop

    while True:
        res = requests.get("https://aqicn.org/city/spain/cantabria/castro-urdiales/es/")
        soup = BeautifulSoup(res.text, 'html.parser')

        ACI = str(soup.find(id="aqiwgtvalue"))
        humid = str(soup.find(id="cur_h"))
        wind = str(soup.find(id="cur_w"))
        pressure = str(soup.find(id="cur_p"))

        pattern = re.compile(r"(>\d{2}<)|(>\d{3}<)|(>\d{1}<)|(>\d{0}<)|(>-<)|(>\d{4}<)")  # matches 0 to 4 digits (or a dash) between > and <

        ACI_str = pattern.search(ACI).group()  # AQI value, still wrapped in > <
        calidad = f"El nivel de Calidad del Aire es {ACI_str[1:-1]}\n"

        pattern_temp_curr = re.compile(r"((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{1})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{1})+(\"))")
        temp = str(soup.body)
        temp_str = pattern_temp_curr.search(temp).group()  # current temperature, still wrapped in the span tag
        temperature = f"La temperatura es de {temp_str[37:-1]} °C\n"

        humid_str = pattern.search(humid).group()
        humidity = f"La humedad es del {humid_str[1:-1]} %\n"

        wind_str = pattern.search(wind).group()
        viento = f"Tenemos un viento de {wind_str[1:-1]} km/h\n"

        pressure_str = pattern.search(pressure).group()
        press = f"La presion es de {pressure_str[1:-1]} hPa\n"

        tz_Madrid = pytz.timezone('Europe/Madrid')
        datetime_Madrid = datetime.now(tz_Madrid)
        hora = "¡Hola! Son las " + datetime_Madrid.strftime("%H:%M") + "\n"

        mensaje = hora + temperature + humidity + press + viento + calidad

        with open("temp.txt", "w") as f:
            f.write(mensaje)
        with open("temp.txt", "r") as f:
            api.update_status(f.read())

        time.sleep(60 * 60)  # wait one hour before the next update
And here comes the chicha, as we say in Spanish: the bulk, the part with some enjundia. OK, enough.
It is just a function called update, because it will update my Twitter with the weather conditions.
First I create a soup object with the parsed webpage that I crawl, which is this one, by the way (https://aqicn.org/city/spain/cantabria/castro-urdiales/es/), and then give some parameters that my crawler, scraper, spider, call it as you please, has to look for. In this case, I was looking for the:
- Air Quality Index
- Temperature
- Humidity
- Wind
- Atmospheric pressure
This was a pain in the ass: the webpage construction was not consistent, so I didn't know what I had to look for when I moved from one parameter to another, but it was so pleasant to finally be able to get the one I was aiming for (believe me, it was).
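To make the find-by-id step concrete, here is a minimal, self-contained sketch. The HTML snippet is made up for illustration, but the ids (aqiwgtvalue, cur_h) are the real ones the code above looks for:

import requests  # not used here, shown only because the real code fetches the page first
from bs4 import BeautifulSoup

# Tiny made-up HTML snippet standing in for the fetched page;
# the real page uses these same ids (aqiwgtvalue for the Air
# Quality Index, cur_h for humidity).
html = '<div id="aqiwgtvalue">42</div><span id="cur_h">77</span>'
soup = BeautifulSoup(html, "html.parser")

aqi = soup.find(id="aqiwgtvalue").get_text()   # "42"
humidity = soup.find(id="cur_h").get_text()    # "77"
print(aqi, humidity)

Note that on a tag this simple, .get_text() already pulls the value out, with no regex needed; the live page is messier than this snippet, which is where the regex step below comes in.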
So I also had to use Regular Expressions (a.k.a. RegEx); these are expressions that search for matches in strings of characters. That's why that chunk of r"(>\d{2}<)|(>\d{3}<)|(>\d{1}<)|(>\d{0}<)|(>-<)|(>\d{4}<)" appears there: to look for the very specific characters I was after. And the same applies to r"((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(-)+(\d{1})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{2})+(\"))|((<span class=\"temp\" format=\"nu\" temp=\")+(\d{1})+(\"))", which was even more complicated to get.
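To see what that first pattern is doing, here is a small sketch with a made-up tag string. I use a simplified pattern, r">(-|\d{1,4})<", which captures roughly the same things as the long alternation above (a run of one to four digits, or a dash, sitting between > and <):

import re

# The scraped tags come back as strings like '<div id="cur_h">77</div>',
# with the value sandwiched between '>' and '<'. This simplified pattern
# captures a dash or a 1-4 digit run in that position.
pattern = re.compile(r">(-|\d{1,4})<")

print(pattern.search('<div id="cur_h">77</div>').group(1))  # 77
print(pattern.search('<div id="cur_w">-</div>').group(1))   # -

The {1,4} quantifier is greedy, so it grabs the whole digit run in one go, which is why the long list of alternatives for 1, 2, 3, and 4 digits can collapse into a single branch.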
Then it's more of the same basic stuff: it opens a .txt file where I write what I want to tweet, saves it, and then opens it again to update my timeline (i.e. to tweet).
Finally, there is a waiting time of 60*60 = 3600 seconds, or in human language, one hour, before the loop starts over; the function runs in an infinite loop, since the bool True will always be true.
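That loop-and-sleep pattern can be sketched on its own. The run_on_schedule helper below is hypothetical, not part of the bot; the try/except is an untested guard (one failed request or tweet shouldn't kill the bot), and the max_cycles parameter exists only so the sketch can be exercised without looping forever:

import time

def run_on_schedule(job, interval_seconds=60 * 60, max_cycles=None):
    # max_cycles=None reproduces the original infinite `while True`
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        try:
            job()  # e.g. the update() function above
        except Exception as e:
            # one bad cycle should not stop the whole bot
            print(f"update failed, retrying next cycle: {e}")
        time.sleep(interval_seconds)
        cycles += 1

# Exercise it quickly with a tiny interval and two cycles:
ticks = []
run_on_schedule(lambda: ticks.append(1), interval_seconds=0, max_cycles=2)
print(len(ticks))  # 2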
Where the code is running
- Have it running on my computer, which would imply never turning off my laptop (too noisy, and I don't want to wear it out).
- Have it running in the cloud (AWS, for instance); I couldn't do that, and I will explain why later on.
- Have another dedicated computer to run it; it seemed like a bit of overkill, but that's what I went for.
The RasPi with the internet cable and the charger to power it
Why couldn't I run it on the cloud and save 50 €?
That's a legitimate question that I'll try to answer succinctly without looking like a complete idiot. My first approach was not only to report the weather in plain text but also to show a photo of Castro at the moment of reporting it. I live in the upper part of Castro, so I have very good views of it, and I implemented the code so it would post a picture taken at that moment; just an old webcam connected to my laptop would do the job, and it was working perfectly. The picture quality could be better, but that's not important.
I was very happy: my code was running smoothly, and it included pictures of Castro, which I think was a very good feature. Then I started putting it on the Raspberry Pi, and it turns out I couldn't manage to install OpenCV (the package I used in previous entries) on the Pi, so I just quit trying.
To have the thing running forever and posting pictures of Castro, it needed a camera running forever too; that's why I spent that money.
I think this is getting too long, but anyway: I also wanted to display the temperature, but the RasPi again wasn't working as I intended, so I had to remove that from this initial launch. I will work on it, don't worry (I know you don't worry).
Conclusion
Screenshot taken at the moment of the first tweet (more or less)
*Who knows, maybe the computer explodes, or all of them happen at the same time.