
Web Scraping using HTTP-Get and Regex

a guide to scraping webpages to grab data you can then use in your ghost. I used this method to scrape weather.com and display the weather to the user.



the basic flow is as such: the user clicks a link in the ukagaka's balloon -> the link leads to the download function -> the download function leads to the parsing function. I cover the download and parsing functions here.
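for reference, the link itself is usually a \q choice tag; picking it raises OnChoiceSelect with the ID you gave it, and you can route that to your download function. a minimal sketch, where OnWeatherMenu, checkweather, and OnWeatherDownload are all placeholder names made up for this example:

OnWeatherMenu
{
    //a balloon link. picking it raises OnChoiceSelect with the ID "checkweather"
    "\0\s[0]Want the weather?\n\q[Check the weather,checkweather]"
}

OnChoiceSelect
{
    if reference[0] == "checkweather" {
        //hand off to the download function
        "\![raise,OnWeatherDownload]"
    }
}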

Download

this uses \![execute,http-get,URL,option,option,option,...]. full documentation is on the ukadoc project wiki, but I will go over the options I use here
  • URL
    • webpage you want to grab
  • --async = OnFunction
    • replace OnFunction with the name of your parsing function. I recommend using an On function for this. calls OnFunctionFailure if the get call fails.
  • --file = file name
    • whatever name you want the webpage saved as. I recommend using this option for readability when you're debugging
  • --nodescript
    • stops the balloon from showing the "Downloading..." marker at the bottom. also stops the balloon from appearing if this command is the only piece of dialogue.
  • --timeout = number of seconds
    • time to wait for a response from whatever URL you are pinging before it gives up.

Example Call

"\0\s[0]Downloading weather... \![execute,http-get,https://weather.com/weather/today/l/12345,--async=OnCurWeatherFound,--file=weather.html,--timeout=200]"

this command grabs the page at https://weather.com/weather/today/l/12345, calls the function OnCurWeatherFound if it succeeds, and OnCurWeatherFoundFailure if it doesn't. it saves the file in the ghost/master/var/ folder, named weather.html, and gives up after 200 seconds.
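the failure handler can be as simple as telling the user something went wrong. only the name is fixed (your --async function name plus Failure); the dialogue here is just a made-up example:

OnCurWeatherFoundFailure
{
    //called if the http-get call fails or times out
    "\0\s[0]Couldn't reach weather.com. Let's try again later.\e"
}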

you can make use of envelopes in the URL if you want user-personalized data. in my weather code, I use the URL https://weather.com/weather/today/l/%(locationcode), where locationcode was filled in earlier with an input box.
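if you haven't used input boxes before, the rough flow is: open the box with an ID, then catch the user's text in OnUserInput. a minimal sketch, assuming locationcode is the ID you picked (AskForLocation is a made-up function name):

AskForLocation
{
    //opens a text input box with the ID "locationcode" and no time limit
    "\0\s[0]What's your weather.com location code?\![open,inputbox,locationcode,-1]"
}

OnUserInput
{
    if reference[0] == "locationcode" {
        locationcode = reference[1] //save the answer for use in the URL later
        "\0\s[0]Got it, thanks!\e"
    }
}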

Parsing

there are two parts to the parsing function: getting lines from the file, and actually parsing the lines. the file I/O functions aren't too complicated.

Beginning Variables in your Parsing Function

  • _file = "var\\name_of_file"
    • replace name_of_file with whatever you put as the file name using the --file option in the download function. include the file extension (.html, .txt, etc).
  • _buff = ""
    • buffer that will hold each line of the file.
  • _regularexpressions = regular expressions
    • we'll go over regex in a bit. for now just know you should try to initialize these at the top of the parsing function.
  • FCHARSET(1)
    • the default character set for ukagaka file I/O is Shift JIS, while most english webpages are in UTF-8. this sets the file to be read as UTF-8.
  • some sort of variable to hold the data
    • basically whatever you want to get from the webpage. could be a string, an array, an int - whatever you need.
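put together, the top of a parsing function might look like this. a sketch based on my weather example; the regex here is just a placeholder (regex is covered below):

OnCurWeatherFound
{
    _file = "var\\weather.html" //name from the --file option
    _buff = "" //holds one line of the file at a time
    _regtemp = '<span data-testid="TemperatureValue">(\d+)°</span>' //placeholder regex
    _temperature = "" //will hold the scraped data
    FCHARSET(1) //read the file as UTF-8 instead of Shift JIS

    //file I/O loop goes here (next section)
}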

Basic File I/O Structure

if FOPEN(_file,'r') { //open the file with read only capabilities (we don't want to change the file)
    for _buff = FREAD(_file); _buff != -1; _buff = FREAD(_file) { //loop through each line in the file until there is no more file to read
        //parsing functions
    }
}
FCLOSE(_file) //close the file at the end

after FCLOSE you would display the results of your parsing in your ukagaka's dialogue.
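for example, assuming the _temperature variable from the earlier sketch, the last line of the function could be:

"\0\s[0]It's currently %(_temperature)° out.\e"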

using regular expressions requires a bit more legwork.

  1. download the webpage (as in, use the http-get call in script input)
  2. open the page in your browser, and right click -> view source
  3. find whatever piece of data you want to grab, and use inspect element or dig through the source itself to find that data and the surrounding HTML. an example of this is <span data-testid="TemperatureValue">68°</span>, where I'm looking for the temperature.
  4. that part of the HTML will be your regular expression, but you want to replace the data you're looking for with regex syntax. YAYA uses an old form of Perl regex, but I'll summarize some important parts here.
    1. all characters match themselves except for the following special characters .[{()\*+?|^$
    2. . will match any single character
    3. * after any character will match that character zero or more times. EX: a*b -> matches b, ab, aaaaaaaab
    4. + after any character will match that character one or more times. EX: a+b matches ab, aaaaaaab, but not b
    5. ? after any character will match that character zero or one times. EX: ca?b matches cb or cab, but not caab
    6. | will match either of its arguments. EX: abc|def will match abc or def
    7. brackets are used to create character sets. EX: [abc] matches any of the characters, 'a', 'b', or 'c'
    8. \d matches any digit. \w matches any alphanumeric character + the underscore. \s matches any whitespace character
    9. surround the part you want to extract in parentheses to make it a capture group
  5. wrap your regex in single quotes, not double quotes
  6. use RE_SEARCH or RE_GREP to check if the regex is in the current _buff
  7. use RE_GETSTR[1] to get the data you want ([0] just has the full match with the rest of the HTML bits, [1] is the part in parentheses)

an example using the moon phase checking code:

_regphase = ',"Phase: <span>([\w|\s]*)<\/span>'

the bit in parentheses will match any combination of word and whitespace characters.

if RE_SEARCH(_buff, _regphase) { //example parsing function
    phase = RE_GETSTR[1] //phase is the string that holds the data we want
    //Ex of RE_GETSTR array = [',"Phase: <span>Full Moon</span>', 'Full Moon']
}

put the parsing functions in your file I/O for loop.
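putting the pieces together, the whole moon phase parsing function might look something like this (a sketch; OnMoonPhaseFound and moon.html are placeholder names):

OnMoonPhaseFound
{
    _file = "var\\moon.html"
    _buff = ""
    _regphase = ',"Phase: <span>([\w|\s]*)<\/span>'
    phase = ""
    FCHARSET(1)

    if FOPEN(_file,'r') {
        for _buff = FREAD(_file); _buff != -1; _buff = FREAD(_file) {
            if RE_SEARCH(_buff, _regphase) {
                phase = RE_GETSTR[1] //grab the captured group
            }
        }
    }
    FCLOSE(_file)

    //read the result back in dialogue
    "\0\s[0]The moon is currently: %(phase).\e"
}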

RE_SEARCH will be fine for things like grabbing one piece of data from a site. if you're doing something like populating an array of data, you might have to use RE_GREP instead. the difference: RE_SEARCH grabs the first instance of the regex matching in a line, while RE_GREP gets all the matches in the line. so depending on the structure of the webpage you might need it.

RE_GREP fills RE_GETSTR with the whole match for each hit, not just the part in parentheses. so the array for that might look like: [',"Phase: <span>Full Moon</span>', ',"Phase: <span>New Moon</span>', ',"Phase: <span>Waning Crescent</span>']

when using RE_GREP, I then use RE_SEARCH to strip out the excess HTML bits. here's an example from the weather code, although I use a 2D array.

if RE_GREP(_buff, _reghour) {
    //clear out the singleday string, use a temp array to hold the results of the RE_GREP
    _singleday = ""
    _temp = RE_GETSTR

    //strip out HTML bits using RE_SEARCH. the comma is important to keep _singleday a pseudo-array
    for _i = 0; _i < ARRAYSIZE(_temp); _i++ {
        if RE_SEARCH(_temp[_i], _reghour) {
            _singleday = _singleday + RE_GETSTR[1] + ","
        }
    }

    //add singleday contents to the relevant row of data in the forecast array
    hourlyarray[0] = hourlyarray[0] + _singleday
}

your regex probably won't work perfectly when you first run it, but you can tweak it and fiddle with it until it does. it's better to be more specific than broad with your regex.

