Data Impostor

Blog Talks Reads About
Follow me on Mastodon Go to my GitHub repo Go to my RSS feed
  • Jan 18, 2025


    Lessons Learned Implementing Metric Trees by Ergest Xheblati
  • Jan 12, 2025


    Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques | Amazon Web Services by
    Apache Iceberg Won the Future — What’s Next for 2025? by Yingjun Wu
    Building effective agents by
    How I run LLMs locally by
    Next.js 15.1 by
  • Jan 11, 2025


    Hugging Face Smolagents is a Simple Library to Build LLM-Powered Agents by
  • Jan 9, 2025


    A Pixel Parable by Facundo Olano
  • Jan 8, 2025


    Hitting OKRs vs Doing Your Job by jessitron
  • Jan 7, 2025


    Goodhart’s Law Isn’t as Useful as You Might Think by
  • Jan 6, 2025


    Incremental Jobs and Data Quality Are On a Collision Course - Part 2 - The Way Forward — Jack Vanlightly by Jack Vanlightly
  • Jan 5, 2025


    How AI is unlocking ancient texts — and could rewrite history by Marchant, Jo
    Glue work considered harmful by
    Goodbye Github Pages, Hello Coolify · Quakkels.com by
    Tools Worth Changing To in 2025 by Matthew Sanabria
    Databases in 2024: A Year in Review by
    Dismantling ELT: The Case for Graphs, Not Silos — Jack Vanlightly by Jack Vanlightly
  • Jan 4, 2025


    Things we learned about LLMs in 2024 by
  • Jan 3, 2025


    Change query support in Apache Iceberg v2 — Jack Vanlightly by Jack Vanlightly
  • Jan 2, 2025


    Collection of insane and fun facts about SQLite - blag by
  • Jan 1, 2025


    Scalar and binary quantization for pgvector vector search and storage by Jonathan Katz
    Turbocharge Efficiency & Slash Costs: Mastering Spark & Iceberg Joins with Storage Partitioned Join by Samy Gharsouli
    Build Write-Audit-Publish pattern with Apache Iceberg branching and AWS Glue Data Quality | Amazon Web Services by
    Designing data products by
    On writing and getting from zero to done — Jack Vanlightly by Jack Vanlightly
    Introducing AWS Glue Data Catalog automation for table statistics collection for improved query performance on Amazon Redshift and Amazon Athena | Amazon Web Services by
    React v19 – React by
    Tech predictions for 2025 and beyond by Dr Werner Vogels - https://www.allthingsdistributed.com/
    First impressions of the new Amazon Nova LLMs (via a new llm-bedrock plugin) by
    Use open table format libraries on AWS Glue 5.0 for Apache Spark | Amazon Web Services by
    Migrating AWS Glue for Spark jobs to AWS Glue version 5.0 - AWS Glue by
  • Dec 31, 2024


    Getting to Two Million Users as a One Woman Dev Team by
  • Dec 28, 2024


    Why AI language models choke on too much text by Timothy B. Lee
    AI-generated tools can make programming more fun by
  • Dec 22, 2024


    The 150x pgvector speedup: a year-in-review by Jonathan Katz
    How to generate unit tests with GitHub Copilot: Tips and examples by Greg Larkin
    Introducing AWS Glue 5.0 for Apache Spark | Amazon Web Services by
  • Dec 21, 2024


    Building Confidence: A Case Study in How to Create Confidence Scores for GenAI Applications - Spotify Engineering by alexandrawei
    Storing times for human events by
    DataFrames at Scale Comparison: TPC-H by
    Enabling compaction optimizer - AWS Glue by
    Top Python Web Development Frameworks in 2025 · Reflex Blog by
    Building Python tools with a one-shot prompt using uv run and Claude Projects by
  • Dec 18, 2024


    How to Speed Up Spark Jobs on Small Test Datasets by luminousmen
  • Dec 15, 2024


    A high-velocity style of software development by
    Building end-to-end data lineage for one-time and complex queries using Amazon Athena, Amazon Redshift, Amazon Neptune and dbt | Amazon Web Services by
    A First Look at S3 (Iceberg) Tables by Nikhil Benesch
  • Dec 11, 2024


    Amazon Bedrock Knowledge Bases now supports RAG evaluation (Preview) - AWS by
  • Dec 8, 2024


    Amazon Aurora now available as a quick create vector store in Amazon Bedrock Knowledge Bases - AWS by
    Why it took a long time to build that tiny link preview on Wikipedia by Jon Robson (David Lyall)
  • Dec 5, 2024


    AWS Glue Data catalog now automates generating statistics for new tables - AWS by
    Introducing AWS Glue 5.0 - AWS by
  • Dec 3, 2024


    Table format comparisons - Change queries and CDC — Jack Vanlightly by Jack Vanlightly
  • Nov 28, 2024


    Amazon S3 adds new functionality for conditional writes - AWS by
    AWS Amplify introduces passwordless authentication with Amazon Cognito - AWS by
  • Nov 26, 2024


    The Part of PostgreSQL We Hate the Most // Blog // Andy Pavlo - Carnegie Mellon University

    What goes into bronze, silver, and gold layers of a medallion data architecture? | by Lak Lakshmanan | Sep, 2024 | Medium by Lak Lakshmanan

    Using DuckDB-WASM for in-browser Data Engineering by Tobias Müller

    Go talk to the LLM

    The CDC MERGE Pattern. by Ryan Blue | by Tabular | Medium by Tabular

    FireDucks : Pandas but 100x faster
    #!/usr/bin/env -S uv run by
    Amazon Data Firehose supports continuous replication of database changes to Apache Iceberg Tables in Amazon S3 - AWS by
  • Nov 25, 2024


    Use data that looks like data - by Thorsten Ball by Thorsten Ball

    Unit Tests As Documentation - by Teiva Harsanyi by Teiva Harsanyi

  • Nov 14, 2024


    Anthropic’s upgraded Claude 3.5 Sonnet model and computer use now in Amazon Bedrock - AWS
    • It will be fun to see how the computer use api will evolve
    It’s Not Easy Being Green: On the Energy Efficiency of Programming Languages

  • Nov 8, 2024


    Embeddings are underrated

  • Oct 28, 2024


    Technology Radar | Guide to technology landscape | Thoughtworks

  • Oct 23, 2024


    AWS announces a seamless link experience for the AWS Console Mobile App - AWS

    init.py files are optional. Here’s why you should still use them | Arie Bovenberg

  • Oct 9, 2024


    Talks - Reuven M. Lerner: Times and dates in Pandas by PyCon US

    Talks - Bruce Eckel: Functional Error Handling by PyCon US

    What’s New In Python 3.13 — Python 3.13.0 documentation

  • Oct 7, 2024


    Visual Studio Code September 2024 by Microsoft

  • Oct 4, 2024


    NotebookLM | Note Taking & Research Assistant Powered by AI
    • Very interesting experiment from google which provides a summary of any article, video or file. Additionaly can generate a podcast talking about it!
  • Oct 1, 2024


    Talks - Juliana Ferreira Alves: Improve Your ML Projects: Embrace Reproducibility and Production… by PyCon US
    • These truly looks like a good improvement to an ml project 👀
  • Sep 30, 2024


    Talks - Krishi Sharma: Trust Fall: Three Hidden Gems in MLFlow by PyCon US
    Table format comparisons - Streaming ingest of row-level operations — Jack Vanlightly by Jack Vanlightly
    Embeddings · Malaikannan
  • Sep 29, 2024


    Copy-on-Write (CoW) — pandas 2.2.3 documentation
    • Ptty big change to pandas 3.0 but one I think will bring a lot of clarity to data transformations
    Announcing DuckDB 1.1.0 – DuckDB by The DuckDB team

    Introducing Contextual Retrieval \ Anthropic

    What is a Vector Index? An Introduction to Vector Indexing by Alejandro CantareroField CTO of AI, DataStax

    Chunking · Malaikannan

    Chronon - Airbnb’s End-to-End Feature Platform - InfoQ by Nikhil Simha
    • This is a very high level overview of the feature platform but allowed me to get a better sense on why use this platform. But I’d love to test it to understand if for those without streaming I’m this isn’t a bit overkill
  • Sep 28, 2024


    Deno 2.0 Release Candidate
    Time spent programming is often time well spent - Stan Bright
    What I tell people new to on-call | nicole@web
    What is io_uring?
    Memory Management in DuckDB – DuckDB by Mark Raasveldt
    Yuno | How Apache Hudi transformed Yuno’s data lake
    Google Proposes Adding Pipe Syntax to SQL - InfoQ by Renato Losio
  • Sep 26, 2024


    PostgreSQL: PostgreSQL 17 Released!
    The sorry state of Java deserialization @ marginalia.nu
    • Interesting to see that duckdb is serving as the benchmark to process data in the order of GB’s
    Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
  • Sep 24, 2024


    @daily_cache implementation in Python • Max Halford
    Good programmers worry about data structures and their relationships by Engineer’s Codex
  • Sep 23, 2024


    I Like Makefiles by Sebastian Witowski
    • I tend to avoid make files because I hadn’t looked at how they worked and always associated either Java projects. I might want to take another look at them
    Rethinking Analyst Roles in the Age of Generative AI by Ben Lorica 罗瑞卡
    No, Data Engineers Don’t NEED dbt. | by Leo Godin | Jul, 2024 | Data Engineer Things by Leo Godin
  • Sep 20, 2024


    Gotten my reading from +100 articles to 58! 🎉

    Predicting the Future of Distributed Systems by Colin Breck
    Continuous reinvention: A brief history of block storage at AWS | All Things Distributed by Werner Vogels
    Iceberg vs Hudi — Benchmarking TableFormats | by Mudit Sharma | Aug, 2024 | Flipkart Tech Blog by Mudit Sharma
    Splicing Duck and Elephant DNA by Jordan Tigani, Brett Griffin
    How we sped up Notion in the browser with WASM SQLite by Carlo Francisco
    Table format comparisons - How do the table formats represent the canonical set of files? — Jack Vanlightly by Jack Vanlightly
  • Sep 14, 2024


    Should you be migrating to an Iceberg Lakehouse? | Hightouch by Hugo Lu
    How data departments have evolved and spread across English football clubs - The Athletic by Mark Carey
    How to use AI coding tools to learn a new programming language - The GitHub Blog by Sara Verdi
    How to choose the best rendering strategy for your app – Vercel by Alice Alexandra MooreSr. Content Engineer, Vercel
    Amazon’s Exabyte-Scale Migration from Apache Spark to Ray on Amazon EC2 | AWS Open Source Blog
  • Sep 14, 2024


    Don’t Use JS for That: Moving Features to CSS and HTML by Kilian Valkhof by JSConf
    AWS Chatbot now allows you to interact with Amazon Bedrock agents from Microsoft Teams and Slack - AWS
    Astro 5.0 Beta Release | Astro by Erika
    Making progress on side projects with content-driven development | nicole@web
    AWS Glue Data Catalog now supports storage optimization of Apache Iceberg tables - AWS
  • Sep 13, 2024


    Introducing OpenAI o1 | OpenAI
    Rewrite Bigdata in Rust by Xuanwo
    Recommending for Long-Term Member Satisfaction at Netflix | by Netflix Technology Blog | Aug, 2024 | Netflix TechBlog by Netflix Technology Blog
    • Til: reward engineering. Measure proxy features and optimize for them to get to your actual end goal
    We need to talk about ENUMs | boringSQL by Radim Marek
    I spent 8 hours learning Parquet. Here’s what I discovered | by Vu Trinh | Aug, 2024 | Data Engineer Things by Vu Trinh
  • Sep 12, 2024


    What I Gave Up To Become An Engineering Manager by Suresh Choudhary
    Unpacking the Buzz around ClickHouse - by Chris Riccomini by Chris Riccomini
    Introducing job queuing to scale your AWS Glue workloads | AWS Big Data Blog
    ”SRE” doesn’t seem to mean anything useful any more
    Microsoft Launches Open-Source Phi-3.5 Models for Advanced AI Development - InfoQ by Robert Krzaczyński
  • Sep 3, 2024


    Production-ready Docker Containers with uv by Hynek Schlawack
    Many of us can save a child’s life, if we rely on the best data - Our World in Data by By: Max Roser
    • Another article that I find eye opening when we look at data to improve our decisions
    Why I Still Use Python Virtual Environments in Docker by Hynek Schlawack
    • Suddenly this way of working, very similar to node modules, is being talked everywhere.
    Python Developers Survey 2023 Results
    • Was looking for this for a long time. Got to learn a bit more on twine, mlflow and sqlmodel. In terms of getting on track with python world I think I’m good with my current rss feed and podcast
    Elasticsearch is Open Source, Again | Elastic Blog by ByShay Banon29 August 2024
    Monitor data quality in your data lake using PyDeequ and AWS Glue | AWS Big Data Blog
    • Data drift detection 👀
    • Makes sense. Pydeequ is just a wrapper
    How top data teams are structured by Mikkel Dengsøe
    • As I work on a team that doesn’t have this kind of distribution I need to reflect a bit on the impact of being the sole data engineer on an ML team
    Talks - Amitosh Swain: Testing Data Pipelines by PyCon US
  • Aug 26, 2024


    uv: Unified Python packaging

    Wow, need to test this out on my side projects.

    CSS finally adds vertical centering in 2024 | Blog | build-your-own.org by James Smith
    Timeless Skills For Data Engineers And Analysts by SeattleDataGuy

    Skimmed the article but although high level I find the main points very true. Understanding the system, being on top of the state of the art. That and the tips for head of data

    CSS 4, 5, and 6! With Google’s Una and Adam by Syntax

    Great episode! Nice to see that css is getting better and better

  • Aug 23, 2024


    I’ve Built My First Successful Side Project, and I Hate It by Sebastian Witowski

    Wow, I’d love this one man saas but knowing how it can burn you out…

    NodeJS Evolves by Syntax

    Nice episode on the long sought features for node that have been introduced on bun and demo (single file, typescript support and top await async are the big ones for me)

    Talks - Brandt Bucher: Building a JIT compiler for CPython by PyCon US
    Python Insider: Python 3.13.0 release candidate 1 released

    Great to see a new version coming along! Is pdb worth using with vs code? 🤔

    Google kills Chromecast, replaces it with Apple TV and Roku Ultra competitor | Ars Technica by Samuel Axon

    As a google to owner would be good to know that my working hardware wouldn’t brick just because I don’t want to upgrade

  • Aug 6, 2024


    Can Reading Make You Happier? | The New Yorker by Ceridwen Dovey
    Beyond Hypermodern: Python is easy now - Chris Arderne

    Although I had my eyes already on rye I certainly didn’t know it was so full feature. Gotta try and change some of my projects to it ,

    Visual Studio Code July 2024 by Microsoft

    Python support slowly getting good

    Introducing GitHub Models: A new generation of AI engineers building on GitHub - The GitHub Blog by Thomas Dohmke
    Creativity Fundamentally Comes From Memorization
    Gen AI Increases Workloads and Decreases Productivity Upwork Study Finds - InfoQ by Sergio De Simone

    The report raises a good point about an increase in workload. With a big productivity boost bosses might start loading even bigger workloads than what we gained from the productivity boost

    Ofc this isn’t good as people will feel overloaded

    tea-tasting: a Python package for the statistical analysis of A/B tests | Evgeny Ivanov by Evgeny Ivanov

    Super interesting IMO. Might give it a try on my team when we start deploying solutions to our clients

    Data Science Spotlight: Cracking the SQL Interview at Instacart (LLM Edition) | by Monta Shen | Jul 2024 | tech-at-instacart by Monta Shen

    Would be interesting to test this out. Have an example of a dataset that can be queried using duckdb. Given a question understand if a query is correct how to fix it and improve it’s performance. One in Sal and another in pyspark (or ibis/pandas)

  • Jul 31, 2024


    Are you a workaholic? Here’s how to spot the signs | Ars Technica by Chris Woolston, Knowable Magazine
    At the Olympics, AI is watching you | Ars Technica by Morgan Meaker, WIRED.comCrazy to see that now with ai it’s actually possible to survey an entire city either cameras
    Use Apache Spark™ from Anywhere: Remote Connectivity with Spark Connect by Databricks

    English sdk looks awesome but requires an OpenAI key. Could it be replaced with ollama?

  • Jul 26, 2024


    Let’s Consign CAP to the Cabinet of Curiosities - Marc’s Blog by Marc Brooker

    Interesting topics to research a bit more on CAP alternatives

    Why Your Generative AI Projects Are Failing by Ben Lorica 罗瑞卡

    In summary, to have a good AI product we need to have data with quality which requires good data governance. With this data we need to define useful products that we can measure it’s value using data driven metrics. We must also ensure the product has good practices avoiding security or bias issues

    Engage your audience by getting to the point, using story structure, and forcing specificity – Ian Daniel Stewart

    Resumed by:

    storyline talk

    Slack Conquers Deployment Fears with Z-score Monitoring - InfoQ by Matt Saunders

    This is something I would love to implement. Allowing to define the metrics on which to evaluate a new feature, the expected hypothesis and revert the feature (i.e feature flags) automatically with a report on the experiment

    DuckDB + dbt : Accelerating the developer experience with local power - YouTube

    Could I replace Athena with this? I think the main blocker for me is I want to work with S3. And need to check how it runs for a really large dataset…

    How Unit Tests Really Help Preventing Bugs | Amazing CTO

    Good tip. For any project define the metric of code coverage goals and start increasing on the project

    Mocking is an Anti-Pattern | Amazing CTO
    Building an open data pipeline in 2024 - by Dan Goldin by Dan Goldin
  • Jul 25, 2024


    Spark-Connect: I’m starting to love it! | Sem Sinchenko by Sem Sinchenko

    This article wasn’t properly parsed by omnivore but the big takeaways:

    • We can add plugins for extended functionality to our spark server
    • Using spark connect we can implement any library in any language we want and send grpc requests to the server (spark connect server needs to be running)
    • spark connect works on >3.5. Should be much better on v4.0
    • If I truly want to be good in spark I eventually need to relearn scala/java
    • Glue is cool but it’s still on version 3.3. All these goodies will take too long to be implemented in glue
    So you got a null result. Will anyone publish it? by Kozlov, Max

    After reading the statistics books I can see much more clear the value of proving a null hypophesis, this is the feeling I am getting out of the academia. We are seeing more research without any any added value. Goodhart’s law.

    Maestro: Netflix’s Workflow Orchestrator | by Netflix Technology Blog | Jul, 2024 | Netflix TechBlog by Netflix Technology Blog

    Sounds just like an airflow contender. with the plus of being able to run notebooks 🤔

    How to Create CI/CD Pipelines for dbt Core | by Paul Fry | Medium by Paul Fry
    Simplify PySpark testing with DataFrame equality functions | Databricks Blog by Haejoon Lee, Allison Wang and Amanda Liu

    Good theme for a blog post on the changes of spark 4, this is really useful for human errors (been there multiple times)

    How to build a Semantic Layer in pieces: step-by-step for busy analytics engineers | dbt Developer Blog by Gwen Windflower

    So this will be generated on the fly as views by the semantic layer?, This looked neat until the moment I understood that the semantic layer requires dbt cloud

    Meta’s approach to machine learning prediction robustness #### Engineering at Meta by Yijia Liu, Fei Tian, Yi Meng, Habiya Beg, Kun Jiang, Ritu Singh, Wenlin Chen, Peng Sun

    Meta seems to be a couple of years ahead of the industry. The article doesn’t provide a lot of insights but gives a feeling of their model evaluation being mostly automated + having a good AI debug tool

    Free-threaded CPython is ready to experiment with! | Labs

    First step in a long way before we can get run python wihtout GIL. Interested on seeing if libraries like pandas will be able to leverage multithreading eventually with this

    Amazon DataZone introduces OpenLineage-compatible data lineage visualization in preview | AWS Big Data Blog

    Mixed feelings here. Great to see Open Lineage implemented at AWS. However it feels again that AWS just created the integration and won’t be driving the development of open lineage

    Unlocking Efficiency and Performance: Navigating the Spark 3 and EMR 6 Upgrade Journey at Slack - Slack Engineering by Nilanjana Mukherjee

    What could be improved to help this kind of migrations be done in a matter of days?, Livy might be deprecated in favor of spark connect. With their migration to Spark 3 and eventually 3.5 (not clear on this article) they could be interested in moving new jobs to connect , Basically solved issues by using the old behaviours. These will need to be migrated eventually. Would need to better understand these features , This looks like an important detail. With no explicit order spark can have random order of rows?, Cool to see these migrations and teams using open source solutions. EMR although expensive with a good engineering team can prove to be quite cost effective

    DuckDB Community Extensions – DuckDB by The DuckDB team
    Visual Studio Code June 2024 by Microsoft
    The Rise of the Data Platform Engineer - by Pedram Navid by Pedram Navid

    The need to define a data platform is something I see everywhere. It really looks like we are missing a piece here. Netflix maestro for example seems like a good contender for to solve the issue of (yet another data custom platform)

    Lambda, Kappa, Delta Architectures for Data | The Pipeline by Venkatesh Subramanian
    Write-Audit-Publish (WAP) Pattern - by Julien Hurault by Julien Hurault

    This articles brings me the question. Can we improve dbt by using WAP? How does the rollback mechanism work when a process fails?

    AWS Batch Introduces Multi-Container Jobs for Large-Scale Simulations - InfoQ by Renato Losio
    Data Council 2024: The future data stack is composable, and other hot takes | by Chase Roberts | Vertex Ventures US | Apr, 2024 | Medium by Chase Roberts
    Amazon DataZone now integrates with AWS Glue Data Quality and external data quality solutions | AWS Big Data Blog

    Super interesting to see how we can enable data quality visibility

    Pyspark 2023: New Features and Performance Improvement | Databricks Blog by in Industries
    Cost Optimization Strategies for scalable Data Lakehouse by Suresh Hasundi

    Good to use open data lakes showing the big cost and speed improvements

    Prompt injection and jailbreaking are not the same thing
    Flink SQL and the Joy of JARs by ByRobin MoffattShare this post
    Catalogs in Flink SQL—Hands On by ByRobin MoffattShare this post
    Catalogs in Flink SQL—A Primer by ByRobin MoffattShare this post

    Why is this so hard? 😭

Follow me on Mastodon Go to GitHub repo Go to my RSS feed