Learning PySpark Locally Before Moving to Multi-node Cluster Databricks Environment


I have come across several frustrating tutorials promising to teach me PySpark in under five minutes. They are clickbait and lack the depth needed to get me started and keep me going. So I decided to write an article in the hope of helping others like me with a project-driven tutorial, as opposed to showing you code snippets and how-tos. I will primarily focus on a list of questions and use PySpark to answer them. You may follow along by grabbing the dataset and code here. At the end of this article, I have also included excellent…


A few days ago I wrote an article, published here, that strongly criticized political intolerance and suggested peaceful coexistence as an alternative. Right after I published it, I came across a tweet from Dr. Sommers that almost made me take the article down for fear of sounding like a hypocrite. At face value, Dr. Sommers seems genuinely concerned about college enrollment rates for 18- to 24-year-old white men. However, the tweet ran counter to the preconceived notion of the white man at the top of the patriarchal society. The tweet also didn’t bother pointing out that white women were enrolled at much…

The positive effects of Artificial Intelligence (AI) on our everyday lives are no longer disputable, given our ever-increasing reliance on its applications. In the early days of the internet, the cost of infrastructure limited who could have access to it. In contrast to those days, open-source platforms have significantly simplified entry into AI, so practically anyone with an internet connection and a decent laptop can leverage these tools. Even though most people who hear about AI are enamored by fancy machine learning (ML) algorithms, the infrastructures that provide these algorithms the playground to flex their muscles are mostly…

The infamous data science workflow, with its interconnected circles of data acquisition, wrangling, analysis, and reporting, understates the multi-connectivity and non-linearity of these components. The same is true for machine learning and deep learning workflows. I understand that oversimplification is expedient in presentations and executive summaries. However, it may paint unrealistic pictures, hide the intricacies of ML development, and conceal the realities of the mess. This brings me to the tools of the trade, more commonly referred to as the infrastructure of artificial intelligence, which is the vehicle in which all libraries, experiments, designs, and creative minds meet. These…

Allegro Trains will take you out of the village!

Source: Carrying out a description of the radiographs of a patient with COPD.

The current global pandemic caused by COVID-19 has threatened the sanctity of our humanity and the well-being of our societies at large. Similar to times of war, the pandemic has also given us the opportunity to appreciate those we take for granted, such as health workers, food suppliers, drivers, grocery store clerks, and many others who are on the front lines keeping us safe at this difficult time. Salute! While they are out fighting the good fight on our behalf, how about you and I get some work done?

Unfortunately, the pandemic has also brought us domain deficient…

Allegro Trains, a project management hub designed to integrate seamlessly with ML workflows.

Allegro Trains, an AI infrastructure for ML project management.

The resurrection of AI due to the drastic increase in computing power has allowed its loyal enthusiasts, casual spectators, and experts alike to experiment with ideas that were pure fantasies a mere two decades ago. The biggest beneficiary of this explosion in computing power and ungodly amounts of data (thank you, internet!) is none other than deep learning, the sub-field of machine learning (ML) tasked with extracting underlying features and patterns, and identifying cat images. On a serious note, the advancement led by this technology is ubiquitous, with news media darlings such as Tesla’s Autopilot, Google’s AlphaGo, and brain-machine interface technologies. Although…

Henok Yemam

Hi, I am a data scientist at Microsoft via Pactera Edge. I have a PhD in applied chemistry. Every comment or article I write on Medium is my personal opinion.

