
Looking for literature, articles, or sites about parsing websites in PHP (and about parsing in general)

Author
Peter Kovac
Description
Hello! I am writing a diploma thesis on the topic "Development of a universal parser in PHP". I am looking for any literature and articles on this topic. Can anyone recommend something?
Description
There is a book, "PHP Web Scraping" by Jacob Ward. You can download it here: it-ebooks.info/book/4297
on May 29th, 2020 (9:17 pm)
Description
A bit of general information on the topic

In general, site parsing consists of approximately the following steps:
Site analysis: determine the site structure and the data template. At this stage it is useful to examine the robots.txt file and the XML sitemap, try the site's own search, and look at the search engine results for the site.
Preparing expressions (XPath or CSS selectors) to get the necessary data from the pages.
Writing and debugging a parser.
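
As a rough illustration of the first two steps, here is a minimal Python sketch using only the standard library. Everything named in it is an assumption made for the example: the robots.txt content, the page fragment, the `example.com` URLs, and the helper names `is_allowed` and `extract_products`. Note that `xml.etree.ElementTree` only handles well-formed markup and a limited XPath subset, so a real site would normally need a tolerant HTML parsing library instead.

```python
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

# Hypothetical robots.txt content; in practice it is downloaded from the site root.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """\
<html><body>
  <div class="product"><h2>Widget</h2><span class="price">9.99</span></div>
  <div class="product"><h2>Gadget</h2><span class="price">19.50</span></div>
</body></html>
"""


def is_allowed(robots_txt, user_agent, url):
    """Step 1 (site analysis): check whether robots.txt permits fetching a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def extract_products(page):
    """Step 2 (expressions): pull name and price out of each product block."""
    root = ET.fromstring(page)
    products = []
    # XPath-style expression selecting every <div class="product"> in the page.
    for node in root.findall(".//div[@class='product']"):
        products.append({
            "name": node.findtext("h2"),
            "price": float(node.findtext("span[@class='price']")),
        })
    return products


if __name__ == "__main__":
    print(is_allowed(ROBOTS_TXT, "my-bot", "https://example.com/catalog"))
    print(extract_products(PAGE))
```

Keeping the expressions in one small function makes the third step (debugging) easier, because a change in the site's markup only touches that function.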


The parser itself may consist of the following parts:
Crawler. Following a set of rules, it walks through the pages and collects links; it can send pages directly to parsing (into the parsing queue) or simply download and save them whole.
Parsers. These blocks are responsible for extracting specific data and converting it into the desired format.
Supporting services that are responsible for parsing the HTML DOM, caching, HTTP requests, bypassing protection from parsing, saving data in the right format, and so on.
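
The split into crawler, parser, and supporting services described above might look roughly like this in Python. This is only a sketch under stated assumptions: `LinkAndTitleParser`, `crawl`, and `FAKE_SITE` are hypothetical names, and the `fetch` argument stands in for the HTTP service, so the example runs against an in-memory dictionary instead of a real site.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkAndTitleParser(HTMLParser):
    """Parser block: pulls the <title> text and all href links out of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Crawler: breadth-first walk over same-host links, returns {url: title}."""
    host = urlparse(start_url).netloc
    queue = deque([start_url])   # pages waiting to be fetched and parsed
    seen = {start_url}           # avoids queueing the same URL twice
    results = {}
    while queue and len(results) < max_pages:
        url = queue.popleft()
        html = fetch(url)        # supporting service: HTTP (or cache) lookup
        if html is None:
            continue             # page could not be fetched, skip it
        parser = LinkAndTitleParser()
        parser.feed(html)
        results[url] = parser.title.strip()
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return results


# Hypothetical in-memory "site" standing in for real HTTP requests.
FAKE_SITE = {
    "https://example.com/": '<title>Home</title><a href="/a">A</a>'
                            '<a href="https://other.test/">X</a>',
    "https://example.com/a": '<title>Page A</title><a href="/">Home</a>'
                             '<a href="/missing">gone</a>',
}

pages = crawl("https://example.com/", FAKE_SITE.get)
```

Passing `fetch` in as a parameter is a design choice for the sketch: caching, retries, or protection bypass can then be added in the supporting service without touching the crawler or the parser.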


Libraries are used for HTML parsing. Regular expressions are not used to parse HTML, but they are, of course, used to parse other data.
Sometimes JavaScript has to be executed, for example with the help of PhantomJS.
To bypass CAPTCHAs, people resort to services such as antigate/anti-captcha.
Sometimes you need to log in or bypass cookie-based protection.
For multithreaded parsing, use multi-cURL (the curl_multi_* functions in PHP).
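
The idea of downloading several pages in parallel can be sketched as follows. This is not multi-cURL itself: the sketch swaps in a thread pool from the Python standard library as an analogous technique. The names `fetch`, `fetch_all`, and the `workers` parameter are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Download one page and return its body as text (network access assumed)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def fetch_all(urls, fetch_one=fetch, workers=4):
    """Fetch many URLs concurrently; returns {url: body, or None on failure}."""
    urls = list(urls)  # materialize so the input can be iterated twice

    def safe_fetch(url):
        # One failed download should not abort the whole batch.
        try:
            return fetch_one(url)
        except Exception:
            return None

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves the input order of the URLs.
        bodies = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, bodies))
```

A call such as `fetch_all(["https://example.com/", "https://example.com/a"])` would download both pages at once. Keeping `workers` small is usually kinder to the target site.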

In general, PHP is not the most suitable language for parsing sites. After all, it is intended for other purposes. Python + Grab would be much more convenient and more productive here. Then again, almost any desktop language has the necessary libraries.

on May 29th, 2020 (9:18 pm)