Web Scraping: Extract and Save Links from HTML to CSV or Text File
Effortlessly Extract and Organize Web Links
As mentioned in the article titled Web Scraping: Text Extraction from Infinite Scrolling Pages, I will be sharing articles and examples related to data scraping for a while. In this article, I will use my Chrome bookmark folders and links, which have become unmanageable and need to be organized.
The example in this article shows how to extract the links from the exported bookmark file, reorganize them, and save them as a CSV file. If necessary, tasks such as link status checks and creating a new bookmark file can also be included. First, let's look at the format of the exported HTML file; we will use this format to collect the links.
Google Bookmarks
Google Chrome uses the following structure as a basis for importing and exporting bookmarks. In this structure, DT > H3 refers to the directories, DL refers to subdirectories, and P refers to content.
<DL><p>
    <DT><H3 ADD_DATE="..." LAST_MODIFIED="..." PERSONAL_TOOLBAR_FOLDER="true">Bookmarks Bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="..." LAST_MODIFIED="...">Training</H3>
        <DL><p>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
            <DT><A HREF="..." ADD_DATE="..." ICON="...">...</A>
        </DL><p>
    </DL><p>
</DL><p>
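To see how these tags can be traversed in practice, here is a minimal sketch using BeautifulSoup; extract_links and its logic are illustrative only, not the repo's implementation:

from bs4 import BeautifulSoup

def extract_links(html_path):
    """Collect (folder, title, href) tuples from an exported bookmark file."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")

    links = []
    for anchor in soup.find_all("a"):  # html.parser lowercases the <A> tags
        # The nearest preceding H3 is the folder the link belongs to.
        folder = anchor.find_previous("h3")
        links.append((
            folder.get_text(strip=True) if folder else "",
            anchor.get_text(strip=True),
            anchor.get("href", ""),
        ))
    return links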
The bookmarks.html file in the repo can also be examined as an example.
Using the Python file named organize-links in the google-chrome-bookmarks directory of the Data Scraping repo, the exported HTML file content can be reorganized and saved as CSV. During reorganization, the exclusion_domains parameter represents the domain names to be excluded from the process, and separate_domains represents the domain names to be exported as separate CSV files.
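The full implementation lives in the repo; the sketch below only illustrates how exclusion_domains and separate_domains could partition the collected links (partition_links is a hypothetical helper of my own, not the repo's code):

from urllib.parse import urlparse

def partition_links(links, exclusion_domains, separate_domains):
    """Split (folder, title, href) rows into a main list and per-domain lists."""
    main = []
    separated = {name: [] for name in separate_domains.values()}
    for row in links:
        domain = urlparse(row[2]).netloc.removeprefix("www.")
        if domain in exclusion_domains:
            continue  # excluded domains are dropped entirely
        if domain in separate_domains:
            separated[separate_domains[domain]].append(row)  # gets its own CSV
        else:
            main.append(row)  # goes into the main bookmark CSV
    return main, separated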
An example usage for scrape_html is provided below.
scrape_html(
    input_path="/Users/user/Desktop/Data-Scraping/google-chrome-bookmarks/bookmarks.html",
    output_dir="/Users/user/Desktop",
    filename="bookmark",
    extension="csv",
    strip_char="...",
    max_text_length=50,
    exclusion_domains=["facebook.com", "twitter.com"],
    separate_domains={
        "eksisozluk.com": "eksisozluk_links",
        "etsy.com": "etsy_links",
        "tr.pinterest.com": "pinterest_links",
        "imdb.com": "imdb_links",
    },
)
When the code is executed, output_dir will be checked, and the main directory and the related separate link files will be created as defined by the scrape_html parameters.
.
├── bookmark.csv
├── eksisozluk_links.csv
├── etsy_links.csv
├── github_links.csv
├── imdb_links.csv
├── medium_links.csv
├── pinterest_links.csv
└── youtube_links.csv
1 directory, 8 files
The result will be similar to the folder/file structure shown above.
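As mentioned earlier, a link status check can also be added on top of this output. Below is a minimal sketch using the requests library; check_link is a hypothetical helper, not part of the repo:

import requests

def check_link(url, timeout=5):
    """Return the HTTP status code of a URL, or None if the request fails."""
    try:
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        return response.status_code
    except requests.RequestException:
        return None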