Scraping the Kenya Gazette
Creating a database listing the issuance of new land title deeds in Kenya.
This project scrapes the HTML pages of weekly Kenya Gazette issues from 2010 onwards to retrieve and categorise land-related notices. The result is a database with more than 200,000 rows of land-related notice entries.
Getting the Data
The Kenya Gazette
This is what the home page of the Kenya Gazette looks like
This is what the landing page for one year of gazette notices looks like
How I got the urls for each year
```python
# From 2010 to 2020
# Weekly issues
# Every url for a year's worth of gazette notices starts with this url_year_base below
url_year_base = "http://kenyalaw.org/kenya_gazette/gazette/year/"

# I am scraping the years with HTML pages, that is 2010 to 2020
# Getting the range from 2010 to 2021 so 2020 can be included, then setting it into a list
years = list(range(2010, 2021))

# To get each url for each year containing a list of the year's gazette notices
year_link_dict_list = []

for year in years:
    year_link_dict = {}
    url_year = url_year_base + str(year)
    # print(url_year)
    year_link_dict['year'] = year
    year_link_dict['url_year'] = url_year
    year_link_dict_list.append(year_link_dict)

url_year_links = [x['url_year'] for x in year_link_dict_list]
```
How I got the HTML content on each year’s page for all gazette notices within it.
Here, as a list of dictionaries, I get the following information:
- year
- url for the year
- url for each gazette notice within the year. I then exploded this so that each weekly gazette notice's link gets its own row
```python
import requests
from bs4 import BeautifulSoup

for year_link_dict in year_link_dict_list:
    print("$$$$$$$$$$$$")
    url_year_link = year_link_dict['url_year']
    print(url_year_link)

    year_raw_html = requests.get(url_year_link).content
    print(type(year_raw_html))

    # Assign year_urls_soup_doc as the doc holding the parsed html
    year_urls_soup_doc = BeautifulSoup(year_raw_html, "html.parser")
    print(type(year_urls_soup_doc))
    print("________")

    # These are the links on the page for all gazette notices
    links_within_year = year_urls_soup_doc.select('#content')

    # Both weekly and special gazette notices sections
    sections_within_year = str(links_within_year).split('<p>')

    # To get weekly issues only, use sections_within_year[1]
    # Then split it by <tr> to get each entry of a gazette notice
    weekly_section = sections_within_year[1]
    weekly_section_entries = weekly_section.split('<tr>')

    link_href_list = []

    for link_within_section in weekly_section_entries:
        try:
            link_href = link_within_section.split('<td>')[2].split('"')[1]
            # print(link_href)
            # print("______")
            link_href_list.append(link_href)
            year_link_dict['gazette_links'] = link_href_list
        except:
            pass

    # print(year_link_dict)

# print(year_link_dict_list)
```
How I created a function that scrapes a url for a gazette notice to extract the following information:
- volume number: volume_num
- volume date: volume_date
- volume url: volume_url
- title of notice’s number: notice_num_title
- title of notice’s act: notice_act_title
- notice’s capital and section: notice_num_year
- notice’s subtitle: notice_sub_title
- notice’s full body of info: notice_body
- specific notice’s date: notice_date
- name of notice’s registrar: notice_registrar_name
- number and location of notice: notice_num_loc
- notice’s location: notice_loc
```python
import re
import requests
from bs4 import BeautifulSoup
from unicodedata import normalize

def scrape_url(my_url):
    try:
        raw_html = requests.get(my_url).content

        # Assign soup_doc as the doc holding the parsed html
        soup_doc = BeautifulSoup(raw_html, "html.parser")

        # Finding the capsules to keep track of what data is from what source in the larger csv:
        # a list of the divs that contain an <em> tag signed by a land registrar
        # (the parent of the <p> holding the <em> is a div)
        list_all_divs = [elem.parent.parent for elem in soup_doc.find_all(
            "em", text=re.compile("(Land Registrar|Registrar of Land|Registrar of Titles)", re.IGNORECASE))]
        # print("Found", len(list_all_divs), "potentials")

        # Function for making a dictionary out of each entry within the list of all divs
        def class_to_dictionary(list_of_items):
            # Initialise keys of the dictionary
            dict_titles = ['volume_num',
                           'volume_date',
                           'volume_url',
                           'notice_num_title',
                           'notice_act_title',
                           'notice_num_year',
                           'notice_sub_title',
                           'notice_body',
                           'notice_date',
                           'notice_registrar_name',
                           'notice_num_loc',
                           'notice_loc']

            # Initialise empty list which will contain all new notice entries
            all_new_notice_entries = []

            # list_of_items will usually be the large capsules containing gazette entries;
            # mostly, for the pages I have seen, two such capsules hold Land Registration Act entries
            for one_item in list_of_items:
                # If the words LAND REGISTRATION ACT or Land Registrar appear in the text
                if "LAND REGISTRATION ACT" in one_item.text or "Land Registrar" in one_item.text:
                    # Then split one_item at the <hr/> points,
                    # so entry_deets is a list of notices within the capsule
                    entry_deets = str(one_item).split('<hr/>')

                    for entry_deet in entry_deets:
                        # One entry_deet is a single notice within the capsule
                        # Initialise a list which I plan to zip with dict_titles to make a dictionary
                        entry_deet_full_list = []

                        # Since I converted to a string to split on <hr/>,
                        # I need to return the content to BeautifulSoup objects
                        entry_deet_soup = BeautifulSoup(entry_deet, "html.parser")

                        # Now make a list of all p's within the gazette notice
                        entry_deet_paragraphs = entry_deet_soup.find_all('p')

                        # When converting to text I kept getting \xa0;
                        # normalise the unicode data with normalize("NFKD", x.text)
                        # https://stackoverflow.com/questions/10993612/how-to-remove-xa0-from-string-in-python
                        entry_deet_paragraphs = [normalize("NFKD", x.text)
                                                 for x in entry_deet_paragraphs if x.text.strip() != '']
                        if len(entry_deet_paragraphs) == 0:
                            continue

                        # Volume number, date and url come from the page header;
                        # the rest come from the notice's paragraphs
                        volume_num = soup_doc.select(".gazette-content #content div")[2].text
                        volume_date = soup_doc.select(".gazette-content #content div")[3].text
                        volume_url = my_url
                        notice_num_title = entry_deet_paragraphs[0]
                        notice_act_title = entry_deet_paragraphs[1]
                        notice_num_year = entry_deet_paragraphs[2]
                        notice_sub_title = entry_deet_paragraphs[3]
                        notice_body = entry_deet_paragraphs[4]
                        notice_date = entry_deet_paragraphs[5]
                        notice_registrar_name = entry_deet_paragraphs[6]
                        notice_num_loc = entry_deet_paragraphs[7]
                        try:
                            notice_loc = (entry_deet_soup.find_all('p')[-1]).text.split(',')[1]
                        except:
                            notice_loc = "Na"

                        entry_deet_full_list.append(volume_num)
                        entry_deet_full_list.append(volume_date)
                        entry_deet_full_list.append(volume_url)
                        entry_deet_full_list.append(notice_num_title)
                        entry_deet_full_list.append(notice_act_title)
                        entry_deet_full_list.append(notice_num_year)
                        entry_deet_full_list.append(notice_sub_title)
                        entry_deet_full_list.append(notice_body)
                        entry_deet_full_list.append(notice_date)
                        entry_deet_full_list.append(notice_registrar_name)
                        entry_deet_full_list.append(notice_num_loc)
                        entry_deet_full_list.append(notice_loc)
                        # print(entry_deet_full_list)

                        first_rep = dict(zip(dict_titles, entry_deet_full_list))
                        # print(first_rep)
                        # print("######")
                        all_new_notice_entries.append(first_rep)

            return all_new_notice_entries

        data = class_to_dictionary(list_all_divs)
        return data

    except Exception as e:
        # raise e
        pass
```
How I scraped each gazette notice:
- Use the scrape_url function created above to scrape every gazette link that was collected earlier and saved in df_links. Save the results into a large list and then create a dataframe from it (see the sketch after this list)
- Then merge the data in df_links into the new dataframe df
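A minimal sketch of that loop, assuming df_links holds one exploded gazette link per row in a column called gazette_links; the column name and join key here are illustrative, not the exact code:

```python
import pandas as pd

# Loop over every gazette notice url collected earlier and scrape it
all_notice_entries = []
for gazette_link in df_links['gazette_links']:
    entries = scrape_url(gazette_link)
    if entries:  # scrape_url returns None when a page fails to parse
        all_notice_entries.extend(entries)

# One row per land-related notice
df = pd.DataFrame(all_notice_entries)

# Bring the year information from df_links back in, joining on the notice url
df = df.merge(df_links, left_on='volume_url', right_on='gazette_links', how='left')
```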
How I incorporated district data
- Read data on Districts, Population and Area and merge it with the location information in the dataframe
- This data is important so that I can normalise the notices by population, and so that I can get the province code I later use to group the data by province
- Clean up the districts’ names in the dataframe to make it possible to join with the districts dataframe
- Merge into the main dataframe to include the districts’ information (a sketch of this step follows the list)
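A hedged sketch of the district merge; the file name, column names and cleaning rules here are illustrative assumptions:

```python
import pandas as pd

# Hypothetical file with one row per district: name, population, area, province code
df_districts = pd.read_csv("districts_population_area.csv")

# Clean up the district names extracted from the notices so they match the districts file,
# e.g. strip whitespace, uppercase, and drop a trailing "DISTRICT"
df['district_clean'] = (df['notice_loc']
                        .str.strip()
                        .str.upper()
                        .str.replace(r'\s*DISTRICT$', '', regex=True))

# Merge the districts' population, area and province code into the main dataframe
df = df.merge(df_districts, left_on='district_clean', right_on='District', how='left')
```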
How I used regex and other text-extraction techniques to pull more information from the notices for categorisation
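The patterns below are illustrative only; the actual categories and regexes in the project differ, but this is the general shape of the extraction, pulling structured fields and flags out of notice_body:

```python
import re

# Pull a title/land reference number out of the notice body where one is quoted
df['title_number'] = df['notice_body'].str.extract(
    r'title\s+(?:No\.|number)\s*([A-Za-z0-9./\- ]+?)[,;]',
    flags=re.IGNORECASE, expand=False)

# Flag notices about lost or misplaced title deeds as one example category
df['is_lost_title'] = df['notice_body'].str.contains(
    r'lost|misplaced', case=False, na=False)
```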
How I merged the resulting data to data on provinces
- I wanted to have the option to group the data by province
- I made a dataset that matched every province code in the dataset to a province in Kenya
- I merged the resulting data into the main dataframe (a sketch of this step follows the list)
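A sketch of that lookup, assuming the province codes run from 1 to 8; the actual mapping may assign the codes differently:

```python
import pandas as pd

# Hypothetical lookup table matching each province code to one of Kenya's eight provinces
df_provinces = pd.DataFrame({
    'province_code': [1, 2, 3, 4, 5, 6, 7, 8],
    'province': ['Nairobi', 'Central', 'Coast', 'Eastern',
                 'North Eastern', 'Nyanza', 'Rift Valley', 'Western'],
})

# Merge the province names onto the main dataframe via the code that came with the district data
df = df.merge(df_provinces, on='province_code', how='left')
```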
How I mapped the dataset
Having grouped the dataset by province, I made a choropleth map showing the distribution of land-related gazette notice activity by province, by joining the dataset to a shapefile of Kenya’s provinces (see the sketch below).
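A sketch of the mapping step with geopandas; the shapefile path and its province-name column are assumptions:

```python
import geopandas as gpd

# Hypothetical shapefile of Kenya's provinces
provinces_shp = gpd.read_file("kenya_provinces.shp")

# Count land-related notices per province
notices_per_province = df.groupby('province').size().reset_index(name='notice_count')

# Join the counts to the shapefile and draw the choropleth
provinces_map = provinces_shp.merge(notices_per_province,
                                    left_on='PROVINCE', right_on='province', how='left')
provinces_map.plot(column='notice_count', legend=True, cmap='Blues')
```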