Pinned · Published in Towards Data Science · My First Billion (of Rows) in DuckDB · First Impressions of DuckDB handling 450Gb in a real project · May 111
Published in Towards Data Science · Anatomy of Windows Functions · Theory and practice of an underappreciated SQL operation · Jun 111
Published in Towards Data Science · Automatically Detecting Label Errors in Datasets with CleanLab · A Tale of AI and wrongly-classified Brazilian Federal Laws · Jul 22, 2023
Published in Towards Data Science · Automatically Managing Data Pipeline Infrastructures With Terraform · I know the manual work you did last summer · May 2, 2023
Published in Towards Data Science · Data Pipeline with Airflow and AWS Tools (S3, Lambda & Glue) · Learning a little about these tools and how to integrate them · Apr 6, 2023
Published in Towards Data Science · Creating a Data Pipeline with Spark, Google Cloud Storage and Big Query · On-premise and cloud working together to deliver a data product · Mar 6, 2023
Published in Towards Data Science · Hands-On Introduction to Delta Lake with (py)Spark · Concepts, theory, and functionalities of this modern data storage framework · Feb 16, 2023
Temporal and Geo-referenced Traffic Management with Python+Streamlit · Applying modern tools to visualize time and spatial data in a dashboard · Jan 29, 2023
Published in Towards Data Science · First Steps in Machine Learning with Apache Spark · Basic concepts and topics of Spark MLlib package · Jan 4, 2023
Published in Towards Data Science · A Fast Look at Spark Structured Streaming + Kafka · Learning the basics of how to use this powerful duo for stream-processing tasks · Nov 5, 2022