Pinned · João Pedro in Towards Data Science · My First Billion (of Rows) in DuckDB · First Impressions of DuckDB handling 450Gb in a real project · May 111
João Pedro in Towards Data Science · Anatomy of Windows Functions · Theory and practice of an underappreciated SQL operation · Jun 111
João Pedro in Towards Data Science · Automatically Detecting Label Errors in Datasets with CleanLab · A Tale of AI and wrongly-classified Brazilian Federal Laws · Jul 22, 2023
João Pedro in Towards Data Science · Automatically Managing Data Pipeline Infrastructures With Terraform · I know the manual work you did last summer · May 2, 2023
João Pedro in Towards Data Science · Data Pipeline with Airflow and AWS Tools (S3, Lambda & Glue) · Learning a little about these tools and how to integrate them · Apr 6, 2023
João Pedro in Towards Data Science · Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query · On-premise and cloud working together to deliver a data product · Mar 6, 2023
João Pedro in Towards Data Science · Hands-On Introduction to Delta Lake with (py)Spark · Concepts, theory, and functionalities of this modern data storage framework · Feb 16, 2023
João Pedro · Temporal and Geo-referenced Traffic Management with Python+Streamlit · Applying modern tools to visualize time and spatial data in a dashboard · Jan 29, 2023
João Pedro in Towards Data Science · First Steps in Machine Learning with Apache Spark · Basic concepts and topics of Spark MLlib package · Jan 4, 2023
João Pedro in Towards Data Science · A Fast Look at Spark Structured Streaming + Kafka · Learning the basics of how to use this powerful duo for stream-processing tasks · Nov 5, 2022