Welcome to the Google Cloud Platform 2014 Year in Review blog series. Each day for the next two weeks, we will be featuring a different Googler sharing their highlight from the past year in Cloud Platform.

We worked hard on finishing our features and demos around containers, mobile, developer tools and big data technology leading up to Google Cloud Platform Live. The product and marketing teams, with the help of engineering, put on an inspiring display of where Cloud Platform is going in 2015. Having joined the Cloud Platform team only this summer, after 7 years of working on Google’s Display Ads products, it was the highlight of my year to see the reaction to our story: the nodding and excitement from customers and analysts alike. It motivates us all to work even harder to deliver on the next wave of innovation in the cloud.

- Posted by Joerg Heilig, Vice President, Engineering


Most developers spend more than 50% of their time tracking down and fixing issues in production. This year, Google Cloud Platform took a big step to get you some of that time back with Google Cloud Logging, which allows you to aggregate logs from your compute instances and other services in a unified Logs Viewer interface with search capabilities. We were stoked that we could just click on the logs (e.g. showing an error or high latency) and get to the exact version, file and line of code causing the problem.

We're looking forward to 2015 where we will save developers even more time!

- Posted by Deepak Tiwari, Product Manager, and Cody Bratt, Product Manager


Fun fact: around 170 million taxi journeys occur across New York City every year, each one capturing a wealth of information as someone steps into and out of one of those bright yellow cabs. How much information exactly? Being a not-so-secret maps enthusiast, I made it my challenge to visualize a NYC taxi dataset on Google Maps.

Anyone who’s tried to put a large number of data points on a map knows the difficulties of working with big geolocation data. That's why I want to share how I used Cloud Dataflow to spatially aggregate every single pick-up and drop-off location, with the objective of painting the whole picture on a map. For background, Google Cloud Dataflow is now in alpha and can help you gain insight into large geolocation datasets. You can experiment with it by applying for the alpha program, or learn more from yesterday's update.

When I first sat down to think through this data visualization, I knew I needed to create a thematic map. To do that, I built a simple pipeline that geofenced all 340 million pick-up and drop-off locations against 342 polygons, the result of converting the NYC neighbourhood tabulation areas into single-part polygons. You can find the processed data in this public BigQuery table. (To access BigQuery, you need at least one project listed in your Google Developers Console. After creating a project you can access the table by following this link.)
Thematic map showing the distribution of taxi pick-up locations in NYC in 2013. Midtown South is New Yorkers’ favourite area to get a cab with almost 28 million trips starting there, which is roughly 1 trip per second. You can find an interactive map here.
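Geofencing each location against the neighbourhood polygons ultimately reduces to a point-in-polygon test. Purely as an illustration (this is not the pipeline's actual code), here is a minimal ray-casting version in Python; the `midtown` square is a made-up toy polygon, not a real tabulation area:

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting point-in-polygon test.

    `polygon` is a list of (lon, lat) vertices; the edge from the last
    vertex back to the first is implied.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Toggle `inside` for every polygon edge that a horizontal ray
        # from the point crosses.
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A made-up square "neighbourhood" roughly around Midtown (toy data).
midtown = [(-74.00, 40.74), (-73.97, 40.74), (-73.97, 40.76), (-74.00, 40.76)]
print(point_in_polygon(-73.985, 40.75, midtown))  # point inside -> True
```

Real production geofencing uses robust geometry libraries rather than hand-rolled tests, but the core idea is the same.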

This open data, released by the NYC Taxi & Limo Commission, has been the foundation for some beautiful visualizations. By utilizing the power of Google Cloud Platform's tools, I’ve been able to spatially aggregate the data using Cloud Dataflow, and then do ad hoc querying on the results using BigQuery, to gain fast and comprehensive insight into this immense dataset.

With the Google Cloud Dataflow SDK, which parallelizes the data transformations across multiple Cloud Platform instances, I was able to build, test and run the whole processing pipeline in a couple of days. The actual processing, distributed across five workers, took slightly less than two hours.

The pipeline’s architecture is extremely simple. Since Cloud Dataflow offers a BigQuery reader and writer, most of the heavy lifting is already taken care of. The only thing I had to provide was the geofencing function, which could then be parallelised across multiple instances. For a detailed description of how to do complex geofencing using open source libraries, see this post on the Google Developers Blog.
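As a rough sketch of that shape, in plain Python rather than the Dataflow Java SDK and with invented record and fence names, the pipeline is simply read → geofence → count:

```python
from collections import Counter

def geofence(record, fences):
    """Return the name of the first fence containing the pick-up point.

    `fences` maps area name -> containment predicate (a stand-in for a
    real polygon test).
    """
    for name, contains in fences.items():
        if contains(record["pickup_lon"], record["pickup_lat"]):
            return name
    return "unknown"

def run_pipeline(records, fences):
    # Read -> transform (geofence each record) -> aggregate (count per area).
    # In Cloud Dataflow the reads/writes would be BigQuery transforms and the
    # middle step a parallelised function; plain Python stands in for the
    # same three-stage shape here.
    return Counter(geofence(r, fences) for r in records)
```

Because the geofencing step is a pure per-record function, Dataflow can fan it out across as many workers as you give it.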

When executing the pipeline, Cloud Dataflow automatically optimizes your data-centric pipeline code by collapsing multiple logical passes into a single execution pass and deploys the result to multiple Google Compute Engine instances. When deploying the pipeline, you can read in files from Google Cloud Storage that contain data you need for your transformations, e.g., shapefiles or GeoJSON. Alternatively, you can call an external API to load the geofences you want to test against.

I utilized an API I had built on App Engine which exposes a list of geofences stored in Datastore. Using the Java Topology Suite, I created a spatial index, maintained in a class variable in the memory of each instance, for fast querying.
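The JTS spatial index is the real workhorse here. To convey the idea only, here is a much-simplified grid-bucket index in Python (all names hypothetical): it narrows a point lookup to the fences whose bounding boxes overlap the point's grid cell, so each location needs exact polygon tests against only a handful of candidates:

```python
from collections import defaultdict

class GridIndex:
    """Toy spatial index: fences are bucketed by coarse grid cell, so a
    point only needs testing against fences overlapping its own cell."""

    def __init__(self, cell=0.01):
        self.cell = cell
        self.buckets = defaultdict(list)

    def _cells(self, min_lon, min_lat, max_lon, max_lat):
        # All grid cells touched by a bounding box.
        c = self.cell
        for i in range(int(min_lon // c), int(max_lon // c) + 1):
            for j in range(int(min_lat // c), int(max_lat // c) + 1):
                yield (i, j)

    def insert(self, name, bbox):
        for cell in self._cells(*bbox):
            self.buckets[cell].append(name)

    def candidates(self, lon, lat):
        # Fences whose bounding box shares this point's grid cell; each
        # candidate would still need an exact point-in-polygon test.
        return self.buckets.get((int(lon // self.cell),
                                 int(lat // self.cell)), [])
```

JTS's STRtree plays the same role far more efficiently; the point is that the index lives in memory on each worker, so lookups never leave the instance.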

Distributed across five workers, Cloud Dataflow was able to process an average of 25,000 records per second, each record holding two locations, ploughing through more than 170 million table rows in just under two hours. The number of workers can be flexibly assigned at deployment time: the more workers you use, the more records can be processed in parallel, and the faster your pipeline executes.
The interactive Cloud Dataflow graph of your pipeline, helping you monitor and debug it in the Google Developers Console in the browser.
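A quick back-of-the-envelope check confirms those numbers line up: 25,000 records per second works out to just under two hours for 170 million rows.

```python
records = 170_000_000          # table rows processed
rate_per_second = 25_000       # aggregate throughput across five workers
hours = records / rate_per_second / 3600
print(round(hours, 1))  # -> 1.9, i.e. just under two hours
```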
With the data preprocessed and written back into BigQuery, we were then able to run very fast queries over the whole table, answering questions like, “Where do the best-paid trips start from?”

Unsurprisingly, they start from JFK airport, with an average fare of $46 and an average tip of 20.7%*. Okay, this is probably not a secret, but did you know that, even though the average fare from LGA airport is $15 less, roughly 800,000 more trips start from LGA? And at 22.2%*, passengers from LGA airport actually tip best. *As cash tips aren’t reported, only 52% of trips have a tip noted, so the tip figures could be inaccurate.
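The aggregation behind these numbers is an average fare and an average tip percentage per pick-up area. In practice this was a BigQuery query over the geofenced table, but the logic can be sketched in a few lines of Python over invented sample rows:

```python
from collections import defaultdict

def avg_fare_and_tip(trips):
    """Average fare and average tip percentage per pick-up area."""
    totals = defaultdict(lambda: [0.0, 0.0, 0])  # fare sum, tip-% sum, count
    for t in trips:
        agg = totals[t["area"]]
        agg[0] += t["fare"]
        agg[1] += 100.0 * t["tip"] / t["fare"]
        agg[2] += 1
    return {area: (fares / n, pct / n)
            for area, (fares, pct, n) in totals.items()}

# Two invented JFK trips, for illustration only.
sample = [{"area": "JFK", "fare": 40.0, "tip": 8.0},
          {"area": "JFK", "fare": 52.0, "tip": 13.0}]
print(avg_fare_and_tip(sample))  # {'JFK': (46.0, 22.5)}
```

Note that averaging per-trip tip percentages (as here) is not the same as dividing total tips by total fares; which one you want depends on the question you're asking.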

Most of the taxi trips start in Midtown-South (28 million) with an average fare of $11. Carnegie Hill in the Upper East Side comes fourth with 12 million pick-ups; however, these trips are fairly short. Journeys that start there mostly stay in the Upper East Side and therefore generate an average fare of only $9.80. Here's an interactive map visualizing where people went, what they paid on average and how they tipped, along with some other visualizations of tipping patterns by origin:

The processed data is publicly available in this BigQuery table. You can find some interesting queries to run against this data in this gist.

Though NYC taxi cab journeys may not seem to amount to much, they actually conceal a ton of information, which Google Cloud Dataflow, as a powerful big data tool, helped reveal by making big data processing easy and affordable. Maybe I'll try London's black cabs next.

- Posted by Thorsten Schaeff, Sales Engineer Intern

The value of data lies in analysis -- and the intelligence one generates from it. Turning data into intelligence can be very challenging as data sets become large and distributed across disparate storage systems. Add to that the increasing demand for real-time analytics, and the barriers to extracting value from data sets become a huge challenge for developers.

In June 2014, we announced a significant step toward a managed service model for data processing. Aimed at relieving operational burden and enabling developers to focus on development, Google Cloud Dataflow was unveiled. We created Cloud Dataflow, currently in alpha, as a platform to democratize large-scale data processing by enabling easier and more scalable access to data for data scientists, data analysts and data-centric developers. Regardless of role or goal, users can discover meaningful results from their data via simple and intuitive programming concepts, without the extra noise of managing distributed systems.

Today, we are announcing the availability of the Cloud Dataflow SDK as open source. This will make it easier for developers to integrate with our managed service while also forming the basis for porting Cloud Dataflow to other languages and execution environments.

We’ve learned a lot about how to turn data into intelligence as the original FlumeJava programming models (the basis for Cloud Dataflow) have continued to evolve internally at Google. Why share this via open source? So that the developer community can:
  • Spur future innovation in combining stream- and batch-based processing models: Reusable programming patterns are a key enabler of developer efficiency. The Cloud Dataflow SDK introduces a unified model for batch and stream data processing. Our approach to temporal aggregation provides a rich set of windowing primitives, allowing the same computations to be used with batch or stream data sources. We will continue to innovate on new programming primitives and welcome the community to participate in this process.
  • Adapt the Dataflow programming model to other languages: As the proliferation of data grows, so do programming languages and patterns. We are currently building a Python 3 version of the SDK, to give developers even more choice and to make Dataflow accessible to more applications.
  • Execute Dataflow on other service environments: Modern development, especially in the cloud, is about heterogeneous service composition. Although we are building a massively scalable, highly reliable, strongly consistent managed service for Dataflow execution, we also embrace portability. As Storm, Spark, and the greater Hadoop family continue to mature, developers are challenged with bifurcated programming models. We hope to relieve developer fatigue and enable choice in deployment platforms by supporting execution and service portability.
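To make the unified-model idea concrete, here is a toy Python sketch (not the SDK's actual API) of fixed windowing: events are bucketed by timestamp, and the identical per-window aggregation applies whether the events arrive as a bounded batch or an unbounded stream consumed incrementally:

```python
def fixed_windows(events, size):
    """Sum (timestamp, value) events into fixed windows of `size` seconds.

    The same aggregation applies whether `events` is a bounded batch or
    an unbounded stream consumed one element at a time.
    """
    out = {}
    for ts, value in events:
        window_start = ts - (ts % size)      # window this event falls into
        out[window_start] = out.get(window_start, 0) + value
    return out

print(fixed_windows([(1, 2), (61, 3), (65, 5)], 60))  # {0: 2, 60: 8}
```

The SDK's real windowing primitives also handle out-of-order data, triggers and sessions; this sketch shows only the core bucketing step that lets one computation serve both modes.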

We look forward to collaboratively building a system that enables distributed data processing for users from all backgrounds. We encourage developers to check out the Dataflow SDK for Java on GitHub and contribute to the community.

Interested in adding to the Cloud Dataflow conversation? Here’s how:

- Posted by Sam McVeety, Software Engineer


What do you get when you combine a group of engineers obsessed with cutting-edge technology and add a hint of geek? A bunch of tech enthusiasts who make up the Developer Advocate Team at Google. You may have already seen some of our work or seen us speak.

We love helping make all of you as successful as possible as you build apps that take full advantage of everything that Google Cloud Platform has to offer. We like talking to you, but even more than that, we like to listen to your feedback. We want to be your voice to the Google Cloud Platform product and engineering teams and use what we hear to help create the best possible developer experience.

You’ll often meet us at technology events (conferences, meetups, user groups, etc.), where we talk about the many products and technologies that get us excited about coming to work every day. If you do see us, don’t be shy--come say hi!

Ask us anything and everything regarding Google Cloud Platform on Twitter and learn more through our videos on the Google Developers and Google Cloud Platform channels.

Without further ado, please meet your friendly neighborhood Cloud Developer Advocates!

Aja Hammerly
Aja just joined Google as a Developer Advocate. Before Google she spent 10 years working as an engineer building websites at a variety of web companies. She came to Google in order to help people use Google's amazing cloud resources effectively on their own projects.

Fun Fact
Aja learned to solve a Rubik's Cube by racing the build at her first dev job.

Brian Dorsey
Brian Dorsey aims to help you build cool stuff with our APIs and focuses on Kubernetes and Containers. He loves Python and taught it at the University of Washington. He’s spoken at both PyCon & PyCon Japan. Brian is currently learning Go and enjoying it.

Fun Fact
Brian speaks Japanese.

David East
David is passionate about creating resources and speaking about them to help educate developers. A military brat, David has moved over a dozen times in his life.

Fun Fact
David once broke his leg in the middle of the wilderness and had to crawl back to civilization.

Felipe Hoffa
@felipehoffa, +FelipeHoffa
Felipe Hoffa is originally from Chile and joined Google as a Software Engineer. Since 2013 he's been a Developer Advocate on big data - to inspire developers around the world to leverage Google Cloud Platform tools to analyze and understand their data in ways they could never before. You can find him in several YouTube videos, blog posts, and conferences around the world.

Fun Fact
He once went to the New York Film Academy to produce his own 16mm short films.

Francesc Campoy Flores
@francesc, +FrancescCampoyFlores, Site
Francesc Campoy Flores focuses on Go for Google Cloud Platform. Since joining the Go team in 2014, he has written several didactic resources and traveled the world attending conferences, organizing live courses, and meeting fellow Go-phers. He joined Google in 2011 as a backend software engineer working mostly in C++ and Python, but it was with Go and Cloud Platform that he re-discovered how fun programming can be.

Fun Fact
Francesc celebrated his 30th birthday riding a bike wearing a red tutu from San Francisco to Los Angeles.

Greg Wilson
@gregsramblings, +GregWilsonDev, Site
Greg Wilson leads the Google Cloud Platform Developer Advocacy team and has over 25 years of software development experience spanning multiple platforms, including cloud, mobile, web, gaming, and various large-scale systems.

Fun Fact
Greg is a part-time pro-photographer and a struggling jazz piano player.

Jenny Tong
@baconatedgeek, +JennyMurphy, Site
Jenny comes from the Firebase family at Google and helps developers build realtime stuff on all sorts of platforms. If she's away from her laptop, she's probably skating around a roller derby track, or hanging from aerial silk.

Fun Fact
Jenny once ate discount fugu puffer fish from a supermarket. It was priced less than $0.10 per piece. Somehow, she survived.

Julia Ferraioli
@juliaferraioli, +JuliaFerraioli, Site
Julia helps developers harness the power of Google’s infrastructure to tackle their computationally intensive processes and jobs. She comes from an industrial background in software engineering and an academic background in machine learning and assistive technology.

Fun Fact
Julia once deleted her entire thesis with a malformed regular expression, which she blames on lack of sleep and bad coffee. One good night's sleep outside the sysadmin's door restored it from the tape backup, and luckily only a couple of paragraphs were lost!

Kazunori Sato
Kazunori Sato recently joined the team after working as a Cloud Platform Solutions Architect for 2.5 years. During that time, he has produced over 10 solutions and has been hosting the largest Google Cloud Platform community event in Japan for the past 5 years, as well as hosting Docker Meetup in Tokyo. He will be one of our resident experts in Japan on BigQuery, BigData, Docker, Kubernetes, mBaaS and IoT.

Fun Fact
Kaz’s hobby is playing with littleBits, RasPi, Arduino and FPGA and having fun connecting them to BigQuery.

Mandy Waite
Mandy is working to make the world a better place for developers building applications for Cloud Platform. She came to Google from Sun Microsystems, where she worked with partners on performance and optimisation of large scale applications and services before moving on to building an ecosystem of Open Source applications for OpenSolaris. In her spare time she is learning Japanese and playing the guitar.

Fun Fact
Mandy has been studying Japanese for some time now, in the hope of one day working in Japan and travelling the country in search of cicadas.

Ossama Alami
Ossama is focused on Firebase, making sure developers have a great experience building realtime apps on Google Cloud Platform. He has worked as a software engineer, consultant, developer advocate and engineering manager at a variety of small and big companies. Prior to Firebase he was Head of Developer Relations for Glass at Google[x]. In the winter he can be found snowboarding in the Sierras.

Fun Fact
Ossama has worked on 8 different Google developer products: Ads APIs, Geo APIs, Android, Commerce APIs, Google TV, Chromecast, Glass and now Firebase.

Paul Newson
Paul currently focuses on helping developers harness the power of Google Cloud Platform to solve their big data problems. Previously, he was an engineer on Google Cloud Storage. Before joining Google, Paul founded a startup which was acquired by Microsoft, where he worked on DirectX, Xbox, Xbox Live, and Forza Motorsport, before spending time working on machine learning problems at Microsoft Research.

Fun Fact
Paul is a private pilot.

Ray Tsang
@saturnism, +RayTsang
During his time at Accenture, Ray gained extensive hands-on experience delivering and managing cross-industry enterprise systems integration, full-stack application development, DevOps, and ITOps. At Red Hat, he specialized in middleware, big data, and PaaS products while contributing to open source projects such as Infinispan. Aside from technology, Ray enjoys traveling and adventures.

Fun Fact
Ray has been posting at least one picture a day on Flickr since 2010.

Sara Robinson
Sara joins Google from the Firebase family. She previously worked as an analyst at Sandbox Industries, a venture firm and startup foundry. She's passionate about learning to code, running, and finding the best ice cream in town.

Fun Fact
Sara wrote her senior thesis on Harry Potter, and enjoys finding ways to relate Harry Potter to almost anything.

Terrence Ryan
Terrence (Terry) Ryan is a Developer Advocate for the Cloud Platform team. He has a passion for web standards and 15 years of experience working on both front- and back-end applications in industry and academia.

Fun Fact
Before doubling down on technology in the early aughts, Terry was a semi-professional improv comic. 

- Posted by Greg Wilson, Head of Developer Advocacy

If you’ve hunted for new office space for your company in recent years, you know what a nightmare it can be: dealing with quickly outdated spreadsheets and flyers, finding inaccurate data on listings, or even missing out on a great spot because it wasn’t listed properly. The commercial real estate industry today is technologically behind, and RealMassive aims to fix that.

RealMassive uses Google App Engine, Google Compute Engine, Google Cloud Storage, and Google Maps to bring transparency to the commercial real estate industry. The company gives its customers accurate, up-to-the-minute digital real estate listings and eliminates conventional operating models. With more than 1 billion square feet of properties in their database, they’re well on their way to transforming an old industry.

Read our new case study on RealMassive here to learn more about how Cloud Platform helped the company achieve 1,360% growth in data in just three months.

- Posted by Chris Palmisano, Senior Key Account Manager

Can you change the world for the better in 24 hours? That was the challenge 39 teams tackled at the Bayes Hack data-science challenge in November.

Bayes Impact is a Y Combinator-backed nonprofit that runs programs bringing data-science solutions to high-impact social problems. In addition to a 12-month full-time fellowship that pairs leading data scientists with civic and nonprofit organizations such as the Gates Foundation, Johns Hopkins and the White House, the organization runs an annual 24-hour hackathon bringing together data scientists and engineers to tackle social problems.

Starting from a set of 20 challenge problems proposed by government and non-profit organizations, teams drawn from Silicon Valley’s top data-science talent applied their skills to finding impactful ways to use already available data to solve pressing social problems.

Google Cloud Platform sponsored the event with a $500 Google Cloud Starter Pack credit for each team and a prize of $100K in Google Cloud Platform credits for the winning team.

With only 24 hours and large quantities of data to process, teams were able to leverage tools such as Google Compute Engine and BigQuery to quickly chew through terabytes of information, looking for ways to make a meaningful impact on people’s lives.

The winning team, made up of five Bay Area data scientists, used their data savvy and their Cloud Platform credits to identify prostitution rings by analyzing patterns in phone numbers and text from postings to adult escort websites. Using a cluster of Compute Engine nodes, the team processed a dataset provided by the non-profit group Thorn. They indexed 38,600 phone numbers and combined that with a heuristic phrase-matching strategy to detect 143 separate networks, or cells, operating in the US.
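One plausible core of this kind of network detection -- an illustrative guess, not the team's actual code -- is to link postings that share a phone number using a union-find structure, so connected postings fall into the same cluster:

```python
def group_by_shared_numbers(postings):
    """Cluster postings that share any phone number (union-find).

    `postings` maps a posting id to the list of phone numbers it contains;
    all ids and numbers here are invented for illustration.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        parent[find(a)] = find(b)

    # Link every posting to each phone number it contains; postings that
    # share a number end up in the same set.
    for post_id, numbers in postings.items():
        for num in numbers:
            union(("post", post_id), ("num", num))

    clusters = {}
    for post_id in postings:
        clusters.setdefault(find(("post", post_id)), []).append(post_id)
    return list(clusters.values())
```

At the scale of tens of thousands of phone numbers, this kind of linking is cheap; the expensive part the team parallelized on Compute Engine would be the text processing that extracts numbers and matching phrases in the first place.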

“Realizing that it was going to take 76 days to process the data on a local laptop, we saw this as a place to use our Cloud Platform credits,” notes Peter Reinhardt, the lead for the winning team. “We found it really straightforward to get SSH access to our first compute instance right from the console. Once that was running, we were able to use that image to quickly bring up 10 machines, and went from nothing to a high powered compute cluster in just over half an hour.”

Paul Duan, President of Bayes Impact, observed that Cloud Platform “enabled the participants to get going quickly and focus on their application without having to spend too much time setting up infrastructure.”

It is estimated that 100,000 to 300,000 children are at risk of commercial sexual exploitation in the United States and one million children are exploited by the global commercial sex trade each year.* As the winning entry, the team’s work will be adopted and expanded as a resident Bayes Impact project.

Companies use data science and Google’s big data tools to quickly answer tough data-intensive questions. Bayes Impact and Google worked together to show what is possible when human and technology resources are brought to bear on social problems.

- Posted by Preston Holmes, Google Cloud Platform Solutions Architect

*U.S. Department of State, The Facts About Child Sex Tourism: 2005.