How to scrape a website which requires login using Python requests and BeautifulSoup.

I was preparing an article on free resources that students can access with just their school-associated email address. I wanted to list all of them, more than 80 in total, but copying each one by hand felt too tedious.

So I decided to write a scraper to collect all the necessary information for me.

If you are familiar with scraping, you probably know that to scrape data from a website that requires login, you usually have to use urllib, mechanize, or Selenium to submit your login information.

In this short tutorial, I will show you a simpler way that gets you there without Selenium or mechanize, using only requests and BeautifulSoup.

When you log in to a website, the site verifies you using the information you provide and then uses that same identity for every later interaction. That identity is stored in cookies and headers for a limited period of time.

If you try to scrape such a website without that identity, you will always end up scraping the login page, because you cannot access the site until you are authenticated.

What you need to do is send the same cookies and headers that your browser generated at login along with your own HTTP requests, and you'll have access to the website.
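
To make the idea concrete, here is a minimal sketch of reusing a browser's identity with requests. The cookie name `sessionid`, the header values, and the URL in the comment are all placeholders I made up for illustration; your site will use its own.

```python
import requests

# Placeholder values (assumptions): replace them with the cookies and
# headers copied from your own logged-in browser session.
COOKIES = {"sessionid": "PASTE-YOUR-SESSION-COOKIE-HERE"}
HEADERS = {"User-Agent": "Mozilla/5.0"}


def build_session(cookies, headers):
    """Return a requests.Session that sends the given cookies and
    headers with every request made through it."""
    session = requests.Session()
    session.headers.update(headers)
    session.cookies.update(cookies)
    return session


# Usage (not executed here, the URL is hypothetical):
#   session = build_session(COOKIES, HEADERS)
#   response = session.get("https://example.com/members-only")
#   print(response.status_code)
```

Using a `Session` is a convenience rather than a requirement: it simply saves you from passing the same cookies and headers to every call.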

Steps

  1. In your browser, open the developer tools.
  2. Go to the target site and log in.
  3. After logging in, go to the Network tab and refresh the page. At this point, you should see a list of requests, the top one being the actual site. That request is our focus, because it carries the identity data that Python and BeautifulSoup can use to scrape the website.
  4. Right-click the site request (the top one), hover over Copy, and then choose Copy as cURL.

  5. Then go to this site, which converts cURL commands into Python requests code, and paste the copied command: https://curl.trillworks.com/.

  6. Take the generated Python code and use its cookies and headers when sending your requests.
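
If you are curious what the conversion in step 5 roughly involves, the sketch below pulls the URL, headers, and cookies out of a copied cURL command using only the standard library. It is a simplified illustration, not how the converter site is actually implemented: `parse_curl` is a name I made up, it assumes the bash-style output with the URL first, and it ignores request bodies and most other curl options.

```python
import shlex


def parse_curl(command):
    """Extract (url, headers, cookies) from a 'Copy as cURL' command."""
    # Browsers split the command over several lines ending in a
    # backslash; join them before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    url, headers, cookies = None, {}, {}

    def add_cookies(cookie_string):
        # A cookie string looks like "name1=value1; name2=value2".
        for pair in cookie_string.split(";"):
            name, sep, value = pair.strip().partition("=")
            if sep:
                cookies[name] = value

    i = 1 if tokens and tokens[0] == "curl" else 0
    while i < len(tokens):
        token = tokens[i]
        if token in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            if name.strip().lower() == "cookie":
                add_cookies(value)
            else:
                headers[name.strip()] = value.strip()
            i += 2
        elif token in ("-b", "--cookie") and i + 1 < len(tokens):
            add_cookies(tokens[i + 1])
            i += 2
        else:
            # The first bare argument is taken to be the URL;
            # everything else is skipped.
            if url is None and not token.startswith("-"):
                url = token
            i += 1
    return url, headers, cookies
```

The returned `headers` and `cookies` dictionaries are in the shape that requests accepts through its `headers=` and `cookies=` arguments.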

After doing the above, you should have access to the website just like when you are logged in.
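
Putting it together, a sketch of the scraping part might look like the code below. The function names are my own, and the check for a password field is only a rough heuristic (an assumption on my part) for noticing that the cookies have expired and the site has sent you back to the login page.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, cookies, headers):
    """Download a page, sending the cookies and headers copied from the browser."""
    response = requests.get(url, cookies=cookies, headers=headers, timeout=10)
    response.raise_for_status()
    return response.text


def looks_like_login_page(html):
    """Heuristic: a page containing a password field is probably the login form."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("input", {"type": "password"}) is not None


def extract_links(html):
    """Return (text, href) pairs for every link on the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Usage (not executed here, the URL is hypothetical):
#   html = fetch_html("https://example.com/resources", cookies, headers)
#   if looks_like_login_page(html):
#       print("Cookies expired, copy a fresh cURL command from the browser.")
#   else:
#       for text, href in extract_links(html):
#           print(text, href)
```

Collecting every link is just a stand-in; in practice you would select whatever elements hold the data you are after.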

Thanks for reading.

Don't forget to connect with me on LinkedIn: Lawal Afeez.

stackoverflow.com/questions/23102833/how-to..