Top Five Takeaways from the Strata Data Conference

This was only my second Strata conference, but it was interesting to see an emphasis on Data Science and Machine Learning at a conference that historically had more of a bent toward big data platforms and software engineering. This is not a criticism; the conference was a fantastic mix of talks on both the platform side and the data science side. I was impressed with the quality of the talks, which included industry giants such as Jeff Dean from Google Brain. I was also impressed with the conference organizers and the program committee for providing a breadth of topics and a generally well-run conference. Here are my top five takeaways from the March 2018 Strata in San Jose:

1. Machine Learning vendors and developers are focusing on tools that make model deployment easier

Machine Learning developers for the past few years have focused on developing tools that make it easier for data scientists to build machine learning models. This has included support for a wide variety of models and architectures, as well as APIs that make working with these models tractable. Frameworks such as TensorFlow, open sourced by Google, have accelerated the use of neural networks to solve practical business problems. In addition, new data science platforms from companies such as Anaconda, Cloudera, and Domino have appeared to help accelerate data science work, including machine learning model development. Features vary, but these platforms generally combine a computational notebook development environment with self-service data management tools that all work from a web browser.
At Strata, representatives from these Machine Learning frameworks and data science platforms predominantly focused their talks on how to practically deploy models into production. For example, Google’s Rajat Monga gave a fantastic TensorFlow talk that focused mainly on the feature road map for making TensorFlow models easier to use and deploy, including full interoperability with the Keras API. Microsoft gave a talk on practical considerations when deploying finicky recurrent neural network models that rely on high-velocity streaming data. Companies in the field gave similar presentations. Although the specific details vary, a recommended production deployment workflow has emerged (a minimal code sketch follows the list below), which consists of the following steps:
  • Model Creation
     This is where you train your model, test your model, and export your finalized model.
  • Model Containerization with Docker
     In the big data world, making sure that the right software and package dependencies are installed in your cluster is a DevOps nightmare. Model containerization simply means you pack all the dependencies necessary with the model in a ‘container’ before you ‘ship it’ to production. Docker is the most popular container engine that makes this process convenient.
  • Model Hosting in a production cluster
     Once you’re ready to ‘ship’ your model, you have to make sure your production cluster can ‘receive’ the model, run the model, and provide remote access to it. Different machine learning products accomplish this differently; TensorFlow, for example, has a service called ‘TensorFlow Serving’ for this purpose.
  • Production deployment with Kubernetes
     Once you have your model ‘packaged’, and the production cluster is ready to ‘receive’ your model, you have to ‘ship it.’ This is where Kubernetes comes in. Kubernetes is the FedEx/UPS of the deployment process: it ‘orchestrates’ Docker containers and makes sure they find the appropriate place in the production environment.
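To make the ‘Model Creation’ step concrete, here is a minimal sketch in Python: train a toy Keras model and export it in the SavedModel layout that TensorFlow Serving watches for. The architecture, paths, and version directory are my own illustrative assumptions, not anything a specific speaker presented.

    # Minimal sketch: train a toy Keras model and export it for TensorFlow Serving.
    # The architecture, paths, and version directory ("1") are illustrative assumptions.
    import numpy as np
    import tensorflow as tf

    # --- Model creation: train and test a small model on synthetic data ---
    x_train = np.random.rand(1000, 20).astype("float32")
    y_train = (x_train.sum(axis=1) > 10).astype("float32")

    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(x_train, y_train, epochs=5, verbose=0)

    # --- Export: TensorFlow Serving loads numbered subdirectories as model versions ---
    tf.saved_model.save(model, "models/my_model/1")

From there, the remaining steps operate on the exported models/my_model directory rather than on Python code: the directory is baked into (or mounted by) a serving container, TensorFlow Serving exposes the model over gRPC or REST, and Kubernetes schedules and scales those containers.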
The big tech players are not the only ones jumping on this workflow – the Data Science platform companies are incorporating new features to make this workflow easier. Kubernetes deployment is one of the main new features in version 5 of the Anaconda Data Science platform, for example.
  2. Hadoop is on life support

The above model deployment workflow should somewhat startle you. Using Kubernetes for container orchestration implies that YARN, the Hadoop resource management and job scheduling tool, becomes significantly less important for data science workflows, to the point that it’s worth asking whether YARN specifically, or Hadoop generally, is still relevant. The increased popularity of cloud object storage (such as Amazon S3) is also rendering Hadoop obsolete. Common big data stacks increasingly use a Spark cluster in conjunction with Amazon S3, bypassing Hadoop altogether. It is abundantly clear the community is moving away from Hadoop. Even Cloudera is aware of this, and to their credit their platform roadmap is deemphasizing the Hadoop ecosystem.
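As a rough illustration of that Spark-plus-object-storage pattern (the bucket, paths, and credential handling below are my own hypothetical choices), a job can read and write S3 directly through the s3a connector with no HDFS layer involved:

    # Rough sketch of the Spark + S3 pattern: read and write object storage
    # directly via the s3a connector, with no HDFS in the picture. The bucket
    # and paths are hypothetical; credentials are assumed to come from the
    # environment or an instance profile, and the hadoop-aws jar must be on
    # the classpath for s3a to work.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-instead-of-hdfs").getOrCreate()

    events = spark.read.parquet("s3a://my-hypothetical-bucket/events/2018/03/")
    daily_counts = events.groupBy("event_date").count()
    daily_counts.write.parquet("s3a://my-hypothetical-bucket/aggregates/daily_counts/")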
Although Hadoop as a distributed file system and YARN for resource management are on life support, I’m not fully ready to call Hadoop dead. For one, options for on-premises object storage aren’t fully mature, and standards haven’t been fully set. There are still enterprise applications where the Hadoop Distributed File System (HDFS) works well, and the database solutions built on HDFS are well understood and work quite well for certain domains.
  3. Data Science is maturing as a field and data-driven companies are able to offer valuable lessons learned

Netflix, Gap, Google, Slack, Cloudera, and Blizzard Entertainment are among the companies with mature data science organizations that offered invaluable lessons learned from building real-life data science products. Michelle Casbon from Google gave a great ’10 lessons learned from using Kubernetes for streaming NLP applications’ talk. Google has yet again provided thought leadership on a disruptive new tool, Kubernetes in this case, that is only beginning to be adopted by the community. Slack offered insights on enriching search results with machine learning. Ted Malaska from Blizzard offered lessons learned from successful data science projects from the managerial and team-building perspective. We are entering an era where data science is a mature discipline being used by data-driven companies to make business decisions. These companies offered valuable insight into approaches to try, pitfalls to avoid, and how to successfully transition a company into a data-driven organization. It’s a testament to how much Strata (formerly ‘Hadoop World’) has changed as a conference.
  4. Niche Machine Learning applications such as AutoML are becoming mainstream

I was surprised to see that what used to be considered niche areas in Machine Learning research are becoming mainstream features in data products and in the overall data strategy of certain organizations. AutoML used to be a niche research area focused on automating machine learning model selection and hyperparameter optimization, evangelized by machine learning researchers such as Randy Olson, formerly at the University of Pennsylvania. DataRobot was one of the first out of the gate to productize AutoML, and now AutoML is everywhere. Google published a blog post back in November (https://research.googleblog.com/2017/11/automl-for-large-scale-image.html) regarding AutoML, and it’s now a Google Cloud product as well as a feature being integrated within TensorFlow. H2O.ai partnered with Nvidia on their version of AutoML, called ‘Driverless AI’, which runs on Nvidia GPUs.
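For a sense of what AutoML looks like in code, here is a minimal sketch using TPOT, the open source AutoML library that grew out of Randy Olson’s work; the dataset and search budget are arbitrary choices for illustration, not an example from any Strata talk.

    # Minimal AutoML sketch with TPOT: automated model selection and
    # hyperparameter search over scikit-learn pipelines. The dataset and
    # search budget (generations, population_size) are arbitrary
    # illustrative choices.
    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from tpot import TPOTClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

    automl = TPOTClassifier(generations=5, population_size=20,
                            random_state=42, verbosity=2)
    automl.fit(X_train, y_train)          # searches models and hyperparameters
    print(automl.score(X_test, y_test))   # holdout accuracy of the best pipeline

    automl.export("best_pipeline.py")     # emits the winning pipeline as plain scikit-learn code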
I was also surprised to see other niche approaches becoming more mainstream, such as Active Learning and other semi-supervised Machine Learning approaches. Paco Nathan from O’Reilly Media gave a fantastic talk on actual business applications of semi-supervised learning and on using domain expertise to fill in label gaps.
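As a rough sketch of the active learning idea (my own toy illustration, not the workflow from the talk): train on the small labeled pool, score the unlabeled pool, and route the examples the model is least sure about to a domain expert for labeling.

    # Rough sketch of one round of active learning via uncertainty sampling.
    # Toy illustration only: train on the small labeled pool, then surface the
    # unlabeled examples the model is least confident about so a domain expert
    # can label those next.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
    labeled = np.zeros(len(X), dtype=bool)
    labeled[:50] = True  # pretend only 50 examples are labeled so far

    model = LogisticRegression(max_iter=1000)
    model.fit(X[labeled], y[labeled])

    # Uncertainty = how far the top predicted probability is from certainty.
    probs = model.predict_proba(X[~labeled])
    uncertainty = 1.0 - probs.max(axis=1)
    ask_expert = np.argsort(uncertainty)[-10:]  # ten least confident examples
    print("Rows to send to a domain expert:", np.where(~labeled)[0][ask_expert])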


5. Ethics in big data and data science are more broadly addressed

It’s refreshing to see ethics in data science and big data being addressed more broadly than in a single talk at Strata. Natalie Evans Harris, former senior data policy advisor to Obama’s Chief Technology Officer, offered case studies on using open source cloud data integration services focused on social programs, including a use case involving a homeless intervention and prevention program in Indiana. Her take-home message regarding ethics in data science was powerful. She quoted Congresswoman Barbara Jordan to challenge and wake up the data science and big data community: “There is no executive order; there is no law that can require the American people to form a national community. This we must do as individuals; there is no President of the United States who can veto that decision. We must define the common good and begin again to shape a common future.”

More broadly, talks brought up fascinating ethical concerns. Ryan Boyd from Neo4j did a graph analysis of Russian trolls on Twitter and open sourced his work. What is the ethical responsibility of the tech community as a whole to combat fake news? What is the ethical responsibility of the US Government to continue this work for our defense? To what degree do the social media companies share this responsibility?

Seth Stephens-Davidowitz from the New York Times gave a talk called “Everybody Lies”, based on his book of the same name. In his talk, he touched on several ways big data and the Internet can reveal important facts about ourselves. Of particular interest to me was that Google search histories can actually be more accurate than Gallup surveys for certain questions, such as suicide prevention and quantifying racism. This brings up an interesting ethical dilemma: we are all concerned about data privacy, so where do we draw the line in making decisions based on data that was not originally collected as part of a formal survey?

