Data Scraping
What is Data Scraping?
Data scraping is automated data collection from the internet using a computer program.
IRB Review of Research that Involves Data Scraping
Human subjects research protocols that involve data scraping present a unique challenge for the IRB and for Stanford more broadly. Data scraping can pose risks to research participants that the IRB must assess in order to certify that the risks are reasonable in relation to the potential benefits, as required by the IRB Criteria for Approval. The IRB consults with other Stanford officials and specialty experts when necessary to understand the data scraping procedures and apply the IRB Criteria for Approval.
Each of the following elements should be described in the IRB application for data scraping studies:
Describe the source(s) of the data and how the study’s procedures for collecting and using the data fit with the source’s Terms of Use:
- Is the data public or private?
- Public – Data that is freely accessible to anyone with no access restrictions such as an application process or site login. Use of public data generally presents a low risk to participants.
- Private – Data that is not freely available to anyone or is only available via restricted access, such as an application process or site login. Even if a data holder grants access to anyone who asks, any access restriction makes data private.
- Do the website's Terms of Use permit data scraping and use of the data for the study's purposes? The Terms of Use may also be called “Terms and Conditions”, “User Agreement”, “Terms”, or a similar name.
- If data scraping is not permitted, is it possible to access the data in a different way without a violation of the Terms of Use?
- For example, by using an Application Programming Interface (API) provided by the site to collect data, accessing data through an agreement with the site, or accessing a pre-existing dataset collected from the site.
- This question applies regardless of whether the data is public or private.
Describe the data to be collected and indicate whether the study team will have access to identifiable data, either through viewing on the source’s website, or through data collected and entered into the research record.
IRB protocols should describe the data being accessed and explain what variables the researcher will obtain from the site.
- Identifiable data – Datasets that include direct identifiers, or datasets in which individuals’ identities can be readily ascertained by using a code with a link to identifiers.
- Using identifiable data for research purposes requires an active IRB protocol.
- De-identified data – Datasets that do not include direct identifiers AND the researcher cannot readily ascertain the identities of individuals using a link back to identifiers.
- Secondary use of de-identified data may not require IRB review. Consult with IRB staff for more information.
- Readily ascertained – An individual’s identity can be easily determined or discovered from the available data, which implies identifiability with minimal effort.
Scraped data can be used for many research purposes, such as secondary analysis or as study stimuli (e.g., images or social media posts). IRB protocols should explain how the scraped data will be used for human subjects research.
Each of the following should be addressed:
- The largest potential risks in using scraped data generally involve data privacy and confidentiality.
- Data privacy refers to a person’s right to control how their personal data is used or disclosed. Researchers must consider whether individuals have any expectation of data privacy when deciding whether to scrape data from a particular website. If the data is “private” as defined above, there likely is some expectation of data privacy. In addition, many countries have data privacy laws that regulate the use of personal data. For example:
- European Union: General Data Protection Regulation (GDPR)
- United Kingdom: UK GDPR
- China: Personal Information Protection Law (PIPL)
- See more information on the Stanford Privacy Office website
- Confidentiality refers to an individual’s right to be free from unauthorized release of information. It is the researchers’ responsibility to protect research participant confidentiality by ensuring identifiable data is securely stored and managed (see below).
- Researchers must consider if their data collection policy is fair. Ensure all relevant populations have an equal chance of being included in the dataset, rather than choosing a particular population out of convenience.
- The IRB considers risks in relation to the size of the proposed dataset to be scraped. IRB protocols should include a size estimate of the proposed dataset (e.g., the number of individuals whose personal information will be collected, or the number of other records, such as photos or other content, that will be obtained). Researchers should always collect the minimum amount of data necessary to achieve their research objectives, including the minimum necessary amount of personally identifiable information.
- Would the individuals expect or approve of the proposed use of their data?
- Does the proposed use of the data introduce additional risks to the participants? If so, are the risks reasonable in relation to the potential benefits of the research?
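As a rough illustration of the data-minimization and size-estimate points above, the sketch below keeps only the fields a study needs and counts how many records and distinct individuals remain. The field names (`user_id`, `post_text`, `posted_at`, `email`), the `NEEDED_FIELDS` set, and both helper functions are hypothetical and are not part of any Stanford tool or requirement; a real protocol would substitute the variables it actually describes.

```python
# Hypothetical field names; replace with the variables listed in the protocol.
NEEDED_FIELDS = {"user_id", "post_text", "posted_at"}


def minimize(record, needed=NEEDED_FIELDS):
    """Return a copy of the record holding only the fields the study needs."""
    return {key: value for key, value in record.items() if key in needed}


def estimate_size(records, person_key="user_id"):
    """Count total records and distinct individuals for the size estimate."""
    individuals = {r[person_key] for r in records if person_key in r}
    return {"records": len(records), "individuals": len(individuals)}


# Made-up example records standing in for collected data.
raw = [
    {"user_id": "a1", "post_text": "hello", "posted_at": "2025-01-02", "email": "a@example.org"},
    {"user_id": "a1", "post_text": "again", "posted_at": "2025-01-03", "email": "a@example.org"},
    {"user_id": "b2", "post_text": "hi", "posted_at": "2025-01-03", "email": "b@example.org"},
]

kept = [minimize(r) for r in raw]   # the email field is dropped before storage
summary = estimate_size(kept)
print(summary)
```

Dropping unneeded fields at collection time, rather than after storage, means the extra personal information never enters the research record. The counts are only a starting point for the size estimate the protocol should report.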
If datasets will be merged, address the following:
- Does the resulting dataset include more sensitive information than the un-merged datasets individually? For example, combining a dataset including individuals’ street addresses with a dataset including the same individuals’ credit card transaction information or medical information.
- Does the merged dataset change the identifiability of the dataset? Sometimes, merged datasets may be considered identifiable even if the individual datasets are considered de-identified on their own.
- All data should be stored using Stanford Approved Services such as Stanford Google Drive or Google Cloud.
- All devices storing or accessing the data must meet Stanford Minimum Security Standards.
- Projects that seek to use high-risk data should seek a Data Risk Assessment from the University Privacy Office prior to submission for IRB review.
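To make the earlier merged-dataset questions concrete, the sketch below joins two small made-up datasets and counts how many rows are unique on a chosen combination of fields, before and after the merge. The field names (`record_id`, `zip`, `birth_year`, `store`) and both helper functions are hypothetical. A unique combination is only one possible signal of identifiability under these assumptions; it is not an IRB determination.

```python
from collections import Counter


def merge_on(left, right, key):
    """Inner-join two lists of dicts on a shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left if row[key] in right_by_key]


def unique_rows(rows, fields):
    """Count rows whose combination of values in `fields` appears exactly once."""
    counts = Counter(tuple(row[f] for f in fields) for row in rows)
    return sum(1 for row in rows if counts[tuple(row[f] for f in fields)] == 1)


# Made-up example data; neither dataset contains direct identifiers.
demographics = [
    {"record_id": 1, "zip": "94305", "birth_year": 1990},
    {"record_id": 2, "zip": "94305", "birth_year": 1990},
    {"record_id": 3, "zip": "94301", "birth_year": 1985},
]
purchases = [
    {"record_id": 1, "store": "A"},
    {"record_id": 2, "store": "B"},
    {"record_id": 3, "store": "A"},
]

merged = merge_on(demographics, purchases, "record_id")
before = unique_rows(demographics, ["zip", "birth_year"])
after = unique_rows(merged, ["zip", "birth_year", "store"])
print(f"unique rows before merge: {before}, after merge: {after}")
```

In this toy data, two people share the same zip code and birth year before the merge, but adding the store field makes every row distinct. That is the kind of change in identifiability the protocol should address when datasets are combined.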
Institutional Risk Considerations
Many websites’ Terms of Use specifically prohibit data scraping. Because Terms of Use agreements are contractual obligations, studies that involve potential Terms of Use violations (including but not limited to data scraping) must be evaluated for potential institutional risk by the Office of the Vice Provost and Dean of Research (VPDoR) and the Office of General Counsel. The IRB will facilitate this assessment, which will be completed before IRB approval of the project.
This assessment focuses on institutional risk considerations, including legal risk to Stanford:
- Breach of contract - Terms of Use are a contract
- Deception or accessing data under false pretenses (e.g., signing up for a site with a fake user account)
- Data scraping from adversarial entities
Page updated March, 2025