Crawlee Technical Documentation Review
Documentation Title
Reviewer Information
- Name: Ayu
- Date of Review: 25 September 2024
- Review Level: Beginner
1. Summary
Crawlee is a web scraping and browser automation library that helps users build reliable crawlers. It makes HTTP requests that mimic browser headers and fingerprints, and it can switch crawlers from HTTP to headless browsers. These features allow crawlers to appear human-like and go undetected by modern bot protections.
2. Clarity and Comprehensiveness
- Clarity:
The information is easy to understand, and the language is appropriate for the intended audience.
However, not all sentences and paragraphs are clear and straightforward. Readers who go through the documentation line by line may need to reread some passages to understand them.
Below is an example:
In the "Creating a new project" section:
A prompt will be shown, asking you to select a template. Crawlee is written in TypeScript so if you're familiar with it, choosing a TypeScript template will give you better code completion and static type checking, but feel free to use JavaScript as well. Functionally they're identical.
Adding some punctuation and splitting the long sentence into shorter ones would improve clarity. Although Crawlee is written in TypeScript, it also supports JavaScript, so the sentence above could be improved to something like:
A prompt will be shown, asking you to select a template. Choosing a TypeScript template will give you better code completion and static type checking if you're familiar with it. Otherwise, you can select JavaScript. Functionally, they're identical.
Because the documentation targets readers who know JavaScript, and who likely know what TypeScript is, another option is to skip the explanation and go straight to the point, as below:
A prompt will be shown, asking you to choose between a TypeScript or JavaScript template.
- Comprehensiveness:
The documentation covers all necessary topics and provides sufficient examples and references. It even provides a dedicated chapter for examples. It's also very thoughtful to provide technical explanation pages within the "Guides" chapter and link them within the docs.
3. Accuracy and Relevance
- Accuracy:
I found at least one piece of outdated information, as described below:
The "JavaScript rendering" chapter explains the JavaScript rendering through a demonstration.
According to the documentation, I should get "ACTOR:" and nothing else on the console when I run the first code sample. Instead, I got the output below, with no "ACTOR:" printed anywhere:
INFO CheerioCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO CheerioCrawler: Final request statistics: {"requestsFinished" :1, "requestsFailed" :0, "retryHistogram":[1], "requestAvgFailedDurationMillis":null, "requestAvgFinishedDurationMillis":4688, "requestsFinishedPerMinute":9, "requestsFailedPerMinute" :0, "requestTotalDurationMillis":4688, "requestsTotal" :1, "crawlerRuntimeMillis":6821}
INFO CheerioCrawler: Finished! Total 1 requests: 1 succeeded, 0 failed. {"terminal":true}
It further says:
You can confirm this using Chrome DevTools. If you go to https://apify.com/store, right-click anywhere in the page, select View Page Source and search for ActorStoreItem you won't find any results.
Searching for ActorStoreItem in View Page Source returned 183 results, all of them CSS classes. If these classes are not what the documentation refers to, the documentation must be updated to specify which ActorStoreItem users should look for.
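The check described above can also be scripted instead of done by hand in the browser. Below is a minimal sketch, not something taken from the Crawlee documentation: `countOccurrences` and both sample strings are hypothetical names and data made up for illustration. It counts how often a search term appears in a string of HTML, so the count in the raw page source can be compared with the count in the rendered DOM.

```typescript
// Count non-overlapping occurrences of `needle` in `source`.
// Hypothetical helper for illustration; it is not part of Crawlee.
function countOccurrences(source: string, needle: string): number {
  if (needle.length === 0) {
    return 0; // an empty search term would match everywhere, so treat it as "no results"
  }
  let count = 0;
  let index = source.indexOf(needle);
  while (index !== -1) {
    count += 1;
    // Continue searching right after the current match.
    index = source.indexOf(needle, index + needle.length);
  }
  return count;
}

// Made-up samples: a page shell before JavaScript runs, and the same page after rendering.
const sampleRawSource = '<div id="app"></div>';
const sampleRenderedDom =
  '<div class="ActorStoreItem-title">A</div><div class="ActorStoreItem-title">B</div>';

console.log(countOccurrences(sampleRawSource, "ActorStoreItem")); // 0
console.log(countOccurrences(sampleRenderedDom, "ActorStoreItem")); // 2
```

To repeat the check against the live page, the raw HTML could come from a plain HTTP request (for example, the global `fetch` in Node.js 18 or later), which does not execute any JavaScript. A non-zero count there would confirm that the term is already present in the page source.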
- Relevance:
The content is relevant to the product's current state. In general, the documentation aligns with the latest Microsoft Writing Style Guide. For example, the documentation is written the way we speak. However, it can still be improved by following the best practices in "Top 10 tips for Microsoft style and voice".
One example is to improve the consistency of contractions, such as it's, you'll, you're, we're, and let's.
Taking from the "Pagination" section in the "Real-world project" chapter:
When switching between pages, you will see that the URL changes to:
And right in the following line:
Try clicking on the link to page 4. You'll see that the pagination links update and show more pages.
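A consistency check like this one could be automated. The sketch below is only an illustration under stated assumptions: `findExpandedForms`, `Finding`, and `PAIRS` are hypothetical names, and the list of pairs covers only the contractions named in this review. It reports every line where an expanded form appears, together with the contraction used elsewhere in the docs.

```typescript
interface Finding {
  line: number; // 1-based line number in the input text
  found: string; // expanded form exactly as it appears in the text
  suggestion: string; // contraction to consider instead
}

// Expanded forms paired with the contractions named in this review (assumed list).
const PAIRS: Array<[string, string]> = [
  ["it is", "it's"],
  ["you will", "you'll"],
  ["you are", "you're"],
  ["we are", "we're"],
  ["let us", "let's"],
];

// Scan the text line by line and collect every expanded form found.
function findExpandedForms(text: string): Finding[] {
  const findings: Finding[] = [];
  text.split("\n").forEach((content, index) => {
    PAIRS.forEach(([expanded, contraction]) => {
      // Case-insensitive, whole-word match; a fresh regex per line resets lastIndex.
      const pattern = new RegExp("\\b" + expanded + "\\b", "gi");
      let match = pattern.exec(content);
      while (match !== null) {
        findings.push({ line: index + 1, found: match[0], suggestion: contraction });
        match = pattern.exec(content);
      }
    });
  });
  return findings;
}
```

Run on the two "Pagination" sentences quoted above, this sketch would flag "you will" in the first sentence and leave "You'll" in the second untouched. It is a plain text match, so a writer still needs to review each finding; for example, "let us know" should not become "let's know".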
4. Structure and Organization
- Logical Flow:
The documentation is logically structured and well-organized. It's pleasant to follow.
However, some sections look like they belong under another heading. In the "JavaScript rendering" chapter, "Waiting for elements to render" reads like a subsection of "Headless browsers", but both have the same heading level.
- Navigation:
The documentation is easy to navigate because the headings, subheadings, and table of contents are clear, and links are generously provided.
5. Visual and Design Elements
- Visuals:
Screenshots are used effectively. They're also clear, properly labeled, and relevant.
- Design:
The documentation's aesthetics can be improved. Admonitions (notes, info, tips, warning cards) are very helpful for clarity and caution. However, some pages, e.g., Scraping the Store, have back-to-back admonitions, which can distract the audience.
Fonts and colors are consistent throughout the docs. Yet, some formatting can be improved for consistency. Let's take an example from the "Getting some real-world data" page. Links are formatted in two different ways in the documentation:
- As inline code
...when we already know that everything we want to extract can be found at the https://warehouse-theme-metal.myshopify.com/collections page.
- As an external link
Let's open DevTools by going to https://warehouse-theme-metal.myshopify.com/collections in Chrome...
6. Suggestions for Improvement
Actionable Suggestions
Based on the findings and examples in previous sections, below are some suggestions that can be applied:
- Some long sentences can be broken into shorter sentences for better reading and understanding.
- Using punctuation properly can help the audience understand the exact meaning of a sentence.
- There might be another way to avoid using heavy back-to-back admonitions so that the audience isn't distracted.
- The consistency of using "you" and "we" can be improved. For example, in the "Scraping the Store" page, some sections use "you", and some go with "we".
Sections that Need Expansion, Rephrasing, or Additional Content
The "Headless browsers" section contains the following sentences:
You can choose from two libraries to control your browser: Puppeteer or Playwright. The choice is simple. If you know one of them, choose the one you know. If you know both, or none, choose Playwright, because it's better in most cases.
Playwright is a little more pleasant to use, but both libraries will get the job done.
Judging by the wording here, the documentation recommends that the audience use Playwright. But it'd be great to explain why Playwright is better than Puppeteer so that the audience can confidently choose one. As written, these sentences read as a subjective opinion.
It would be even better to add a table that shows the differences, such as their strengths and weaknesses.
7. Notable Strengths
The documentation is quite informative for those who are new to web scraping. It provides detailed explanations about using Crawlee for web scraping and enough code examples and screenshots.
The "Introduction" chapter, in particular, does a great job of walking the audience through Crawlee and data scraping step by step.
8. Identified Errors/Inconsistencies
Please see the explanations and examples in the sections below:
- 2. Clarity and Comprehensiveness
- 3. Accuracy and Relevance
- Sections that Need Expansion, Rephrasing, or Additional Content
9. Best Practices Compliance
- Standards:
The documentation could apply standards more consistently. Please see "Relevance" under 3. Accuracy and Relevance for an example.
10. Overall Assessment
For background, I had no prior knowledge of web scraping. Reading the documentation and following the instructions helped me understand web scraping and what Crawlee does. I also learned more about web scraping, crawling (which I had heard a lot about but wasn't sure what it was), and so on.
I rate the document 3 on a scale of 1 to 5. Why 3? I'm one of those readers who read line by line when I want to understand something better. Sometimes, I had to read a paragraph a couple of times to understand the information because some sentences were too wordy and not straightforward. Additionally, the back-to-back admonitions distracted me because I felt like I had to read them all thoroughly.
11. Additional Comments
It would be great to add information about the legal side of web scraping at the beginning of the documentation. This would clarify what kind of data users may keep, use, and share, and what they may not. Knowing the limitations would benefit both users and Crawlee as a tool for scraping data from the web.