Do you want to have some fun scraping COVID-19 data from the web, working with R on an Apache Spark cluster with Delta-Lake support? Then read this short article.

The main objective of this article is to demonstrate the integration of R with Apache Spark and Delta-Lake operations for web scraping, data refinery, and transactional storage.

At the time of writing, the technology stack used consisted of:

  • R 4.0,
  • Jupyter Notebooks 6.0.3,
  • Apache Spark 3.0, and
  • Delta-Lake API 0.7.0
Application Stack
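Before the scraping itself, this stack can be wired together from R via sparklyr. The snippet below is a minimal, hypothetical setup rather than the article's exact configuration: it assumes a local Spark 3.0 installation, and relies on sparklyr's built-in Delta Lake support (`packages = "delta"`, `spark_write_delta()`, `spark_read_delta()`) to demonstrate transactional storage; the `/tmp/delta/mtcars` path is illustrative.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark 3.0 instance; packages = "delta" asks sparklyr
# to pull in the Delta Lake artifacts matching this Spark version.
sc <- spark_connect(master = "local", version = "3.0.0", packages = "delta")

# Copy a sample data frame into Spark and persist it as a Delta table.
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
spark_write_delta(mtcars_tbl, path = "/tmp/delta/mtcars")

# Read the Delta table back: Delta Lake layers ACID transactions
# on top of plain Parquet files on disk.
spark_read_delta(sc, path = "/tmp/delta/mtcars") %>% count()

spark_disconnect(sc)
```

Running this requires Java and a Spark 3.0 distribution on the machine (e.g. installed via `sparklyr::spark_install("3.0.0")`).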

Git Repo with the Jupyter Notebook.

The public COVID-19 data used in this article was scraped from Worldometer. In particular, we used…
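The scraping step can be sketched in R with rvest. This is an illustrative sketch, not the article's own code: the CSS id `main_table_countries_today` and the column names `Country,Other` and `TotalCases` are assumptions about the Worldometer page's structure, which changes over time.

```r
library(rvest)
library(dplyr)

# Fetch the Worldometer coronavirus page.
page <- read_html("https://www.worldometers.info/coronavirus/")

# Extract the main per-country table.
# NOTE: the CSS id below is an assumption about the page's markup.
covid_raw <- page %>%
  html_node("#main_table_countries_today") %>%
  html_table(fill = TRUE)

# Basic refinery: rename the country column and strip thousands
# separators before converting case counts to numeric.
# NOTE: these column names are assumptions about the scraped table.
covid <- covid_raw %>%
  rename(country = `Country,Other`) %>%
  mutate(total_cases = as.numeric(gsub(",", "", TotalCases)))
```

A data frame like `covid` could then be copied into Spark with `sparklyr::copy_to()` and written out as a Delta table for transactional storage.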

Eric Tsibertzopoulos

Data Architect, Data Engineer, R user, Developer
