Web Scraping: Text Extraction from Infinite Scrolling Pages
Data Scraping Using JavaScript and Python
Recently, I've been receiving many messages about data scraping. When grouped, news websites and financial markets stand out as the most requested sources. I have also long wanted to work with a website that uses infinite scrolling, and a request in one of those messages gave me a good opportunity to do so [1].
The main purpose of the following example is to scrape data from the T24 news website. The process has several stages. In the first stage, I will extract the list of columns published by a columnist from a page that behaves inconsistently in the browser and cannot be handled through plain requests. Then, I will use that list to save the columns as HTML and Markdown files.
Data collected this way is often used for purposes such as archiving and analyzing the columns. In the coming days, I will also share some posts about no-code tools, automation solutions, and different Python packages that can be used for such purposes.
You can check out my article titled Website Scraping and Data Scraping with R for an example of web scraping with R.
Page Structure & Limitations
We can take the Tan Oral articles & drawings page as a sample [2]. When you view the page, you can see that it loads 12 more items each time the user scrolls down, which is its infinite scroll behavior.
If there are only a few posts, it is possible to scroll down manually and get the titles and links via the Browser Console. Once all articles are displayed (that is, when #MoreButton is no longer displayed), the following code can be used to extract the column list.
// Get all '_1fE_V' elements nested inside a '_2Mepd' element
var elements = document.querySelectorAll('._2Mepd ._1fE_V');
var results = [];
for (var i = 0; i < elements.length; i++) {
  var element = elements[i];
  // Get the 'h3' element inside the '_31Tbh' element and extract the text content
  var h3 = element.querySelector('._31Tbh h3');
  var h3Text = h3 ? h3.textContent : '';
  // Get the 'a' element inside the '_31Tbh' element and extract the 'href' attribute
  var a = element.querySelector('._31Tbh a');
  var href = a ? a.getAttribute('href') : '';
  // Get the 'p' element inside the '_31Tbh' element and extract the text content
  var p = element.querySelector('._31Tbh p');
  var pText = p ? p.textContent : '';
  // Get the second 'p' element inside the '_2J9OF' element and extract the text content
  var secondP = element.querySelector('._2J9OF p:nth-of-type(2)');
  var secondPText = secondP ? secondP.textContent : '';
  // Add the extracted information to an object and push it to the 'results' array
  var result = {
    title: h3Text,
    path: 'https://t24.com.tr' + href,
    description: pText,
    date: secondPText
  };
  results.push(result);
}
console.log('Results:', results);
In my previous post titled JavaScript Kullanarak Liste İçeriklerini Seperatör İle Yazdırmak (Printing List Contents with a Separator Using JavaScript), I used a similar method.
The disadvantage of this method is that it becomes impractical as the number of items on the page grows. In that case, additional controls can be added to the code, such as repeating the scroll movement, waiting for the specified elements to be displayed, or monitoring changes to them.
The Data Scraping repository is a collection of subdirectories, each containing a separate scraping project. You can find the repository homepage here, where you can learn more about our projects.
Data Scraping
When scroll behavior is controlled directly with Python or JavaScript, content may stop loading after a few iterations. In this case, either clicking #MoreButton or scrolling by a fixed numeric amount, such as document.documentElement.scrollTop += 500, can be used instead of jumping to document.body.scrollHeight. Of course, this slows down the process of retrieving content.
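The incremental scrolling described above can be sketched in Python roughly as follows. This is a minimal sketch, not the code from the repository: it assumes a Selenium-style driver object that exposes execute_script, and the function name scroll_until_stable and its parameters are hypothetical. It scrolls by a fixed step and stops once neither the scroll position nor the page height has changed for a few consecutive rounds.

```python
import time


def scroll_until_stable(driver, step=500, pause=1.0, max_idle_rounds=3, max_rounds=1000):
    """Scroll down in fixed steps until the page stops moving and growing.

    `driver` is assumed to be a Selenium-style object with execute_script().
    Returns the number of scroll rounds performed.
    """
    read_state = "return [document.documentElement.scrollTop, document.body.scrollHeight];"
    scroll_by = "document.documentElement.scrollTop += arguments[0];"

    state = driver.execute_script(read_state)
    idle_rounds = 0
    rounds = 0
    while idle_rounds < max_idle_rounds and rounds < max_rounds:
        # Scroll by a fixed amount instead of jumping to the bottom of the page.
        driver.execute_script(scroll_by, step)
        # Give the page time to request and render the next batch of items.
        time.sleep(pause)
        new_state = driver.execute_script(read_state)
        # A round is idle when both scroll position and page height are unchanged.
        idle_rounds = idle_rounds + 1 if new_state == state else 0
        state = new_state
        rounds += 1
    return rounds
```

The pause and max_idle_rounds values trade speed for reliability: a longer pause gives slow responses time to arrive, while max_rounds guards against a page that never stops loading.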
In the first step, I tried to get the content using Python, but I couldn't get a stable result, so I decided to proceed with JavaScript via the Console. The relevant code snippets are in the T24 Columnists repo.
The JavaScript and Python code in the repository achieves the same result in different ways.
[
  {
    "title": "Tan Oral Çiziyor",
    "path": "https://t24.com.tr/yazarlar/tan-oral/tan-oral-ciziyor,39208",
    "description": "Türkiye'nin önde gelen çizerlerinden Tan Oral, çizgileriyle Türkiye ve dünya gündemini yorumluyor",
    "date": "18 Mart 2023"
  },
  {
    "title": "Tan Oral Çiziyor",
    "path": "https://t24.com.tr/yazarlar/hasan-gogus/devekusu,39096",
    "description": "Türkiye'nin önde gelen çizerlerinden Tan Oral, çizgileriyle Türkiye ve dünya gündemini yorumluyor",
    "date": "17 Mart 2023"
  },
  {
    "title": "Tan Oral Çiziyor",
    "path": "https://t24.com.tr/yazarlar/hasan-gogus/deprem-diplomasisi-mi-diplomaside-deprem-mi,38976",
    "description": "Türkiye'nin önde gelen çizerlerinden Tan Oral, çizgileriyle Türkiye ve dünya gündemini yorumluyor",
    "date": "14 Mart 2023"
  },
  {
    "title": "'Birlik ve beraberlik' tekerlemesi…",
    "path": "https://t24.com.tr/yazarlar/tan-oral/birlik-ve-beraberlik-tekerlemesi,39119",
    "description": "Tek tek anlamlı olan bu iki kavram yada önerinin aynı anda, aynı yerde var olmaları imkânsızdır. İlk kez 22.07.2015 tarihinde T24'te yayımlanan bu yazının, bol bol seçim konuşmaları dinlemeye hazırlandığımız şu günlerde düşünceli bir pazar okumasına vesile olması dileği ile…",
    "date": "12 Mart 2023"
  },
  //...
]
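The date values in this output are Turkish-formatted strings such as "18 Mart 2023". For archiving or analysis it can help to turn them into real date objects. The following is a minimal sketch, assuming every date follows the "day month-name year" pattern seen in the sample above; the function name parse_turkish_date is hypothetical and not part of the repository.

```python
from datetime import date

# Turkish month names mapped to month numbers.
TURKISH_MONTHS = {
    "Ocak": 1, "Şubat": 2, "Mart": 3, "Nisan": 4,
    "Mayıs": 5, "Haziran": 6, "Temmuz": 7, "Ağustos": 8,
    "Eylül": 9, "Ekim": 10, "Kasım": 11, "Aralık": 12,
}


def parse_turkish_date(text):
    """Convert a string like '18 Mart 2023' into a datetime.date.

    Assumes the 'day month-name year' layout; raises ValueError otherwise.
    """
    parts = text.split()
    if len(parts) != 3 or parts[1] not in TURKISH_MONTHS:
        raise ValueError(f"Unexpected date format: {text!r}")
    day, month_name, year = parts
    return date(int(year), TURKISH_MONTHS[month_name], int(day))


# Example: sort a few entries from oldest to newest.
entries = [
    {"title": "Tan Oral Çiziyor", "date": "18 Mart 2023"},
    {"title": "'Birlik ve beraberlik' tekerlemesi…", "date": "12 Mart 2023"},
]
entries.sort(key=lambda entry: parse_turkish_date(entry["date"]))
```

With the dates parsed, the entries can be sorted or filtered by period before moving on to the next step.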
The JSON file exported after this process can be used for many purposes, of course. However, we will use it in the next step to scrape the columns and save them as HTML and Markdown files. The relevant code snippet is located in the same repository mentioned above. After running the code, the resulting folder and file structure will look like the following:
t24_scrapped_pages
└── tan-oral
    ├── 39208
    │   ├── page.html
    │   └── page.md
    ├── 39192
    │   ├── page.html
    │   └── page.md
    ├── 39152
    │   ├── page.html
    │   └── page.md
    └── 39119
        ├── page.html
        └── page.md
6 directories, 8 files
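The folder layout above can be derived from each entry's path value. The sketch below is one possible way to do it, not necessarily how the repository code works: it assumes each URL has the form /yazarlar/<columnist>/<slug>,<id> as in the sample output, and the helper names target_dir and save_page are hypothetical. Fetching the page and converting HTML to Markdown are left out; the caller passes both strings in.

```python
from pathlib import Path
from urllib.parse import urlparse


def target_dir(url, root="t24_scrapped_pages"):
    """Map a column URL to <root>/<columnist>/<id>.

    Assumes a URL path shaped like /yazarlar/<columnist>/<slug>,<id>.
    """
    parts = [part for part in urlparse(url).path.split("/") if part]
    if len(parts) < 3 or "," not in parts[-1]:
        raise ValueError(f"Unexpected column URL: {url!r}")
    columnist = parts[1]
    # The numeric ID follows the last comma in the final path segment.
    article_id = parts[-1].rsplit(",", 1)[1]
    return Path(root) / columnist / article_id


def save_page(url, html, markdown, root="t24_scrapped_pages"):
    """Write page.html and page.md into the folder derived from the URL."""
    folder = target_dir(url, root)
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "page.html").write_text(html, encoding="utf-8")
    (folder / "page.md").write_text(markdown, encoding="utf-8")
    return folder
```

Keying the folders on the numeric ID rather than the slug keeps the paths short and avoids problems with long or unusual titles.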
You can test the above sample flow with the code in the repository and you can improve and customize it as you wish.