Facebook crawler



UNMAINTAINED

For an undefined period I will be unable to review issues, fix bugs and merge pull requests. As I have been the sole contributor to the project, it's likely that the code will remain frozen at the current stage.


Anybody who is skilled enough and willing to participate may open a dedicated issue or contact me at my gmail address: rugantio AT gmail DOT com

I will be back, but in the meantime I'd appreciate it if this became a community project.

DONATIONS

Fbcrawl is free software. It is not "free as in beer" nor "free as in speech", it is "free as in toilet": it is always available and working, but someone has to keep it clean and tidy, and I am the only one doing so at the moment; it is not a community project. Please consider making a donation: it will keep this project alive, and if I see actual interest from people I will get on with the TODO list. One of my long-term goals is to refactor the framework with a GUI, connections to databases and graph visualizations. These tasks would take at least a couple of months of work, and I will be able to afford them only with your support! Thank you :)


DISCLAIMER

This software is not authorized by Facebook and does not follow Facebook's robots.txt. Scraping without Facebook's explicit written consent is a violation of its terms and conditions on scraping and can potentially result in a lawsuit.

This software is provided as is, for educational purposes, to show how a crawler can be made to recursively parse a Facebook page. Use at your own risk.

Introduction


Fbcrawl is built on the Scrapy framework. The project is divided into several files that serve different purposes:

fbcrawl
    README.md       -- this file
    scrapy.cfg      -- ini-style file that defines the project
    fbcrawl
        __init__.py
        items.py        -- defines the fields that we want to export
        middlewares.py
        pipelines.py    -- defines how we handle each item (the set of fields)
        settings.py     -- all the parameter settings of the project
        spiders
            __init__.py
            fbcrawl.py   -- implements the spider for posts
            comments.py  -- implements the spider for comments

How to crawl a page (fbcrawl.py)

The core of the crawler is the fbcrawl spider class. On init, it navigates to mbasic.facebook.com and logs into Facebook with the provided credentials, passed as parameters at execution time (see "How to use"). Then the parse_page method is called with the page name given at runtime, and the crawling process begins, recursively retrieving all the posts found on every page. For each post it retrieves all the features, using the callback parse_post, and all the reactions, using parse_reactions.
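Below is a minimal sketch of that flow, assuming an mbasic login form with email and pass fields; the selectors and helper names are illustrative, not the exact ones in fbcrawl.py:

```python
import scrapy
from scrapy.http import FormRequest


class FacebookSpider(scrapy.Spider):
    # Simplified outline of the fbcrawl spider's control flow.
    name = "fb"

    def __init__(self, email="", password="", page="", **kwargs):
        super().__init__(**kwargs)
        self.email = email
        self.password = password
        self.page = page
        self.start_urls = ["https://mbasic.facebook.com"]

    def parse(self, response):
        # Fill in the login form with the credentials given at runtime.
        return FormRequest.from_response(
            response,
            formdata={"email": self.email, "pass": self.password},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Jump to the page passed on the command line and start crawling.
        return response.follow("/" + self.page, callback=self.parse_page)

    def parse_page(self, response):
        # Follow every post found on the page (illustrative selector).
        for href in response.xpath("//div[@data-ft]//a/@href").getall():
            yield response.follow(href, callback=self.parse_post)
        # Recurse on the "show more" link to reach older posts.
        more = response.xpath('//a[contains(@href, "show_more")]/@href').get()
        if more:
            yield response.follow(more, callback=self.parse_page)
```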

The webpages are parsed and the fields are extracted using XPath selectors. These selectors are implemented on top of the python lib lxml, so they are very fast.

Thanks to XPath, scrapy can navigate the webpage as a DOM model, much as one would navigate a filesystem, with several pattern-matching features. If you know nothing about XPath, this guide and this cheatsheet can be helpful. Other resources are the original W3C docs and XPath functions.
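For instance, here is a quick way to experiment with such selectors using scrapy's Selector class directly (toy HTML, made up for illustration):

```python
from scrapy.selector import Selector

html = "<div><span><strong><a>Some Publisher</a></strong></span></div>"
sel = Selector(text=html)

# Walk the DOM like a filesystem path and pattern-match on tags:
print(sel.xpath("//span/strong/a/text()").get())  # -> "Some Publisher"
```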


The XPaths are easy to obtain using Firefox's or Chromium's dev tools, but sometimes the field relative to a property changes location, which is something to keep in mind. For example, notice how I had to handle the source field using the pipe |, which is the OR operator: new.add_xpath("source", "//span/strong/a/text() | //div/a/strong/text() | //td/div/h3/strong/a/text()"). This kind of juggling is helpful for maintaining the consistency of the data in our table. The control over the data and the policy to use are often implemented in the Item Pipeline.

So the parse methods populate the Item fields (explained in the next section) and pass control over to the Item Loader.
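As a sketch of that hand-off, a parse method may look roughly like this (the text selector is illustrative; the source selector is the one quoted above):

```python
from scrapy.loader import ItemLoader
from fbcrawl.items import FbcrawlItem


def parse_post(self, response):
    # An Item Loader collects the fields and builds the Item.
    new = ItemLoader(item=FbcrawlItem(), response=response)
    # The pipe (|) lets one field fall back on alternative DOM locations.
    new.add_xpath("source", "//span/strong/a/text() | //div/a/strong/text() | //td/div/h3/strong/a/text()")
    new.add_xpath("text", "//div[@data-ft]//p//text()")
    new.add_value("url", response.url)
    yield new.load_item()
```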

Refer to Scrapy's Spider documentation for more info.

Items (items.py)

This file defines an Item class, so that the fields we have extracted can be grouped in Items and organized in a more concise manner. Item objects are simple containers used to collect the scraped data. They provide a dictionary-like API with a convenient syntax for declaring their available fields.


I have extracted every field present in the post elements and added a few local ones. Namely, for each article we have:


source - name of the post publisher; if the post is shared, it's the original one
shared_from - if the post is shared, the profile name of the original post creator
date - timestamp in datetime.date() format
text - full text of the post; if empty, it's a pic or a video
reactions - total number of reactions
likes - number of likes
ahah - number of ahah
love - number of love
wow - number of wow
sigh - number of sigh
grrr - number of grrr
comments - number of comments
url - relative link to the post
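For reference, a minimal version of such a class might be declared as follows (only a subset of the fields above; the real items.py declares them all, possibly with input/output processors):

```python
import scrapy


class FbcrawlItem(scrapy.Item):
    # Each field listed above becomes a scrapy.Field(); a few shown here.
    source = scrapy.Field()
    date = scrapy.Field()
    text = scrapy.Field()
    reactions = scrapy.Field()
    comments = scrapy.Field()
    url = scrapy.Field()
```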

The Item Pipeline that processes (and can drop) each item is enabled in settings.py:

ITEM_PIPELINES = {
    "fbcrawl.pipelines.FbcrawlPipeline": 300,
}
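To give an idea of what such a pipeline looks like, here is a minimal sketch that drops posts older than an arbitrary cutoff date (the cutoff and the exact policy are assumptions, not the real FbcrawlPipeline code):

```python
from datetime import date

from scrapy.exceptions import DropItem


class FbcrawlPipeline:
    # Hypothetical cutoff: drop anything published before 2018.
    cutoff = date(2018, 1, 1)

    def process_item(self, item, spider):
        # 'date' is a datetime.date, as noted in the field list above.
        if item["date"] < self.cutoff:
            raise DropItem("post older than cutoff: %s" % item["url"])
        return item
```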
Besides dropping our items according to the timestamp, we can also export them locally to a CSV or a JSON file. If we choose to create a CSV file, we need to specify the order of the columns by explicitly setting:


FEED_EXPORT_FIELDS = ["source", "date", "text", "reactions", "likes", "ahah", "love", "wow", "sigh", "grrr", "comments", "url"]
Scrapy's default behavior is to follow robots.txt guidelines, so we need to disable this by setting ROBOTSTXT_OBEY = False.

How to use

Make sure that scrapy is installed, and clone this repository. Navigate to the project's top-level directory and launch scrapy with:


scrapy crawl fb -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a page="NAMEOFTHEPAGETOCRAWL" -o DUMPFILE.csv


A second spider, comments (comments.py), crawls the comments of a single post. You can try it out with:


scrapy crawl comments -a email="EMAILTOLOGIN" -a password="PASSWORDTOLOGIN" -a page="LINKOFTHEPOSTTOCRAWL" -o DUMPFILE.csv

For example:

rm trump_comments.csv; scrapy crawl comments -a email="obama@gmail.com" -a password="cm380jixke" -a page="https://mbasic.facebook.com/story.php?story_fbid=10162169751605725&id=153080620724" -o trump_comments.csv
(!) Some comments are duplicated. This is because facebook chooses to display a comment both on one page and on the next. There are several ways of handling this unwanted (although interesting in its own right) behavior. It is not possible to leave scrapy's duplicate filter on, because this would make the crawler quit when it encounters duplicates, leaving out many comments. The best way of handling duplicates is to clean the CSV afterwards, using pandas or the csv python module. For example, with pandas:


import pandas as pd

df = pd.read_csv("./trump.csv")
df2 = df.drop_duplicates()  # keep only the first occurrence of each comment
df2.to_csv("./trump.csv", index=False)
