Instagram scraper in Python: a working setup with instagrapi
What “scraping Instagram” actually means
“Scraper” is a loose word. In Instagram’s universe it usually maps to one of five concrete data shapes:
- profile metadata (username, bio, follower counts, profile picture)
- posts (captions, like and comment counts, media URLs, timestamps)
- stories (24-hour ephemeral media, plus their viewers if you own them)
- hashtag and location feeds (lists of posts that match a tag or geo-pin)
- the social graph (followers, following, likers, commenters)
What is not scrapable, no matter how clever your code: private profiles you do not follow, other users’ direct messages, view counts on stories you do not own, story viewers for accounts that are not yours, and any analytics surface that lives behind Instagram’s Business dashboard. Logged-out scraping technically works for a handful of public endpoints, but Instagram throttles it to near-uselessness within a few minutes — every serious scraper authenticates first, even when the data itself is fully public.
The rest of this guide assumes you want a small, working scraper running on your own machine, talking to the private API through instagrapi.
Setting up the client
Install the library and log in. The pattern below loads a saved session if one exists and creates one on first run, which avoids triggering a fresh device fingerprint on every script invocation:
from instagrapi import Client

cl = Client()
try:
    # Reuse the saved session so every run looks like the same device to Instagram
    cl.load_settings("session.json")
    cl.login("USERNAME", "PASSWORD")
except FileNotFoundError:
    # First run: log in fresh and persist the session for next time
    cl.login("USERNAME", "PASSWORD")
    cl.dump_settings("session.json")
Two notes. First, session.json authenticates as your account — add it to .gitignore immediately. Second, this try/except only handles the missing-file case; in real life you also want to catch LoginRequired (cookies expired) and ChallengeRequired (Instagram wants to verify) and re-login through the proper flow. For a single throwaway scraping account on a stable IP, the snippet above is enough to get going.
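If you want to handle the expired-session case properly, the sketch below follows instagrapi's session-reuse pattern; the exception class lives in instagrapi.exceptions, and keeping the old device uuids avoids presenting a brand-new phone on re-login (verify the exact method names against your version):

from instagrapi.exceptions import LoginRequired

try:
    cl.get_timeline_feed()  # cheap request that fails fast if the cookies have expired
except LoginRequired:
    # Keep the device uuids so Instagram still sees the same "phone",
    # then do a fresh password login and persist the new session.
    old = cl.get_settings()
    cl.set_settings({})
    cl.set_uuids(old["uuids"])
    cl.login("USERNAME", "PASSWORD")
    cl.dump_settings("session.json")

ChallengeRequired is more involved: you typically assign a challenge_code_handler callback on the client that returns the emailed or texted verification code.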
For anything multi-account or running in containers, store the settings dict in Redis or your database keyed by username instead of writing files. The session persistence guide covers that pattern end-to-end.
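A rough sketch of the Redis variant (the key scheme and helper names here are mine, not part of instagrapi):

import json

import redis
from instagrapi import Client

r = redis.Redis()  # assumes a local Redis; point this at your own instance

def save_session(cl: Client, username: str) -> None:
    # get_settings() returns the same JSON-serializable dict that dump_settings writes to disk
    r.set(f"ig_session:{username}", json.dumps(cl.get_settings()))

def restore_session(cl: Client, username: str) -> bool:
    raw = r.get(f"ig_session:{username}")
    if raw is None:
        return False
    cl.set_settings(json.loads(raw))
    return True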
Scraping a user profile
user_info_by_username is the canonical entry point. It returns a pydantic model — fully typed, autocompletes in any editor, and serializes to JSON with one call. The only catch is datetime fields: json.dumps cannot encode them by default, so pass default=str:
import json

user = cl.user_info_by_username("instagram")
with open("instagram.json", "w") as f:
    json.dump(user.dict(), f, indent=2, default=str)
The model exposes the fields you would expect: pk, username, full_name, biography, follower_count, following_count, media_count, is_private, is_verified, profile_pic_url, external_url. pk is the numeric user ID and is what every other endpoint takes as input — store it, do not re-resolve usernames in a loop.
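For example, resolving each username once and caching the pk; user_id_from_username is the lookup helper assumed here, and the cache itself is plain illustration:

# Resolve once, cache, and key every later call off the numeric id
pk_cache = {}

def resolve(username: str) -> int:
    if username not in pk_cache:
        pk_cache[username] = int(cl.user_id_from_username(username))
    return pk_cache[username]

for name in ["instagram", "python"]:
    medias = cl.user_medias(resolve(name), amount=10)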
If you only need the lightweight summary (username, full_name, profile pic), use user_short_gql_by_username — it is cheaper and counts less against your rate budget. Reserve user_info_by_username for when you actually need follower counts or biography text.
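Usage is a one-liner; the field names below follow the UserShort model, so double-check them against your instagrapi version:

short = cl.user_short_gql_by_username("instagram")
print(short.pk, short.username, short.full_name, short.profile_pic_url)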
Scraping posts (with caption + media URLs)
Once you have a pk, user_medias walks the user’s feed. Pagination is handled internally; pass amount=N to cap the result, amount=0 to fetch everything:
medias = cl.user_medias(user.pk, amount=50)
for m in medias:
    print(m.code, m.like_count, m.caption_text[:60] if m.caption_text else "")
    cl.media_download(m.pk, folder="./media")
Each Media object carries the fields you typically care about: code (the slug for https://instagram.com/p/<code>/), taken_at (timezone-aware datetime), like_count, comment_count, caption_text, media_type (1 photo, 2 video, 8 album), view_count for videos, and resources for carousel children. media_download writes the file to disk and picks the right extension based on media_type — JPEG for photos, MP4 for videos, individual files for each carousel slide.
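For carousels specifically, the children live in resources; a small sketch of collecting their direct URLs (attribute names per the Resource model, worth double-checking on your version):

for m in medias:
    if m.media_type == 8:  # album / carousel
        for slide in m.resources:
            # each child carries its own pk, media_type and URLs
            url = slide.video_url if slide.media_type == 2 else slide.thumbnail_url
            print(m.code, slide.pk, url)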
Two performance gotchas. like_count and comment_count are point-in-time snapshots — if you need engagement curves, re-fetch on a schedule. And user_medias(pk, amount=0) on a large account is hundreds of paginated requests; throw a time.sleep(2) between iterations or be ready for please_wait_a_few_minutes. The *_v1_chunk variants give you the cursor back so you can checkpoint a long crawl and resume after a throttle.
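A checkpointed crawl along those lines could look like the sketch below; the chunk method's exact parameter names vary across instagrapi versions, so treat the signature as an assumption to verify:

import time

all_medias, cursor = [], ""
while True:
    # Assumption: the chunk variant returns (page, next_cursor) and accepts the
    # previous cursor back in; check the argument names on your instagrapi version.
    page, cursor = cl.user_medias_v1_chunk(user.pk, max_amount=33, end_cursor=cursor)
    all_medias.extend(page)
    if not cursor or len(all_medias) >= 500:
        break
    time.sleep(2)  # breathe between pages; persist `cursor` here to resume after a throttle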
Scraping by hashtag
Hashtag feeds come in two shapes. hashtag_medias_recent returns posts in reverse-chronological order — useful for monitoring fresh content. hashtag_medias_top returns Instagram’s algorithmic “best of” ranking — useful for finding popular posts on a tag:
posts = cl.hashtag_medias_recent("python", amount=30)
for p in posts:
    print(p.user.username, p.caption_text[:60])
Both endpoints return the same Media model as user_medias, so the downstream code is identical. The difference matters only for what you are trying to study: trending posts versus a chronological firehose.
Heads up on volume: hashtag endpoints are some of the most rate-limited in the private API. A naive script that pulls amount=1000 in a tight loop will hit please_wait_a_few_minutes within minutes, sometimes seconds, and the cool-down can stretch from a few minutes to several hours if you keep retrying through it. Plan for time.sleep between pages, set a daily request budget per account, and split large crawls across multiple sessions if the volume justifies it.
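One lightweight way to enforce a budget (the cap and the wrapper are illustrative, not something instagrapi provides):

import random
import time

DAILY_BUDGET = 500  # illustrative cap; tune to the account's age and your risk tolerance
calls_made = 0

def polite(fn, *args, **kwargs):
    global calls_made
    if calls_made >= DAILY_BUDGET:
        raise RuntimeError("daily request budget exhausted")
    calls_made += 1
    result = fn(*args, **kwargs)
    time.sleep(random.uniform(2, 5))  # jittered pause so requests are not metronomic
    return result

recent = polite(cl.hashtag_medias_recent, "python", amount=30)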
Storing results
For a one-off crawl, JSONL is the simplest format that scales: one JSON object per line, append-only, trivially streamable, and every command-line tool already speaks it. Same default=str trick for datetimes:
with open("posts.jsonl", "a") as f:
for p in posts:
f.write(json.dumps(p.dict(), default=str) + "\n")
You can cat posts.jsonl | jq it, load it into pandas with pd.read_json(lines=True), or pipe it into DuckDB without an import step. For anything that needs deduplication, joins, or concurrent writers, graduate to Postgres — media.pk is the natural primary key, and JSONB columns let you keep the raw payload alongside extracted fields. Object storage (S3, Backblaze) is the right home for the actual media files; keep the database to metadata.
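For example, a quick dedup-and-inspect pass in pandas before reaching for a database (the columns are the serialized Media fields from above):

import pandas as pd

df = pd.read_json("posts.jsonl", lines=True)
df = df.drop_duplicates(subset="pk")             # media pk is the natural key
df["taken_at"] = pd.to_datetime(df["taken_at"])  # default=str wrote these as strings
print(df[["code", "like_count", "comment_count"]].sort_values("like_count", ascending=False).head())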
Wrapping up
A working scraper is three pieces: a persisted login, a fetch loop with sensible pagination caps, and a durable storage format. Everything past that is pagination math, error handling, and proxies. For the social graph specifically, see scraping followers in Python. When throttling becomes the bottleneck, the proxy setup guide is the next stop.
Related guides
- Get Instagram followers in Python with instagrapi. Fetch the full follower list of any public Instagram account from Python using instagrapi: pagination, rate limits, and exporting to CSV.
- Configuring proxies in instagrapi: HTTP, SOCKS5, and residential setups. Configure HTTP and SOCKS5 proxies in instagrapi (Python). Residential vs datacenter, per-account pinning, and rotating without breaking sessions.
- instagrapi vs instaloader: which Python Instagram library should you use? instagrapi vs instaloader compared: API surface, login, posting, downloading, async support, and the right tool for each use case.
Frequently asked
Is scraping Instagram allowed?
Instagram's Terms of Service forbid automated collection. Public-data scraping is a legal gray zone in most jurisdictions; commercial scraping at scale carries clear legal and account risk. Use throwaway accounts, respect robots.txt, and keep your request rate low.
What is the difference between instagrapi and instaloader?
instaloader is a CLI-first scraper focused on downloading media from public profiles. instagrapi is a programmatic library exposing the full private API surface — login, posting, DMs, stories, two-factor — not just downloads.
Do I need to log in to scrape public data?
Yes. Instagram serves very limited data to logged-out clients and aggressively rate-limits anonymous requests. instagrapi requires a logged-in session for almost every endpoint.
Can instagrapi download videos and stories in bulk?
Yes. Use story_pk_from_url plus story_download for stories, and media_download for posts. Be aware of bandwidth and storage; large bulk runs without proxies will throttle quickly.