Category Archives: Big Data

You Are Not Worth Much

I’ve said again and again that we undervalue how far we have become unmoored from traditional and historic notions of privacy.

The undervaluing is clear when one looks at some data that the FT has collected on how much it costs to buy info about you and me.

Cost of buying a list per thousand people:

$260 – people with cancer

$85 – identifying new parents

$85 – new house owners

$3 – movie choices

$2.11 – people looking to buy a car

$1.85 – TV viewing data

$1.35 – past purchases

$0.50 – age or location
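To make those per-thousand rates concrete, here is a quick sketch of the per-person arithmetic. The prices come from the FT list above; the code itself is just my illustration:

```python
# Per-thousand list prices from the FT data above (USD).
PRICE_PER_THOUSAND = {
    "people with cancer": 260.00,
    "new parents": 85.00,
    "new house owners": 85.00,
    "movie choices": 3.00,
    "car shoppers": 2.11,
    "TV viewing data": 1.85,
    "past purchases": 1.35,
    "age or location": 0.50,
}

def cost_per_person(category: str) -> float:
    """Cost to buy one person's record in a given category."""
    return PRICE_PER_THOUSAND[category] / 1000

# Even the most sensitive category costs about a quarter per person.
print(f"{cost_per_person('people with cancer'):.4f}")  # 0.2600
print(f"{cost_per_person('age or location'):.4f}")     # 0.0005
```

Twenty-six cents to learn someone has cancer; a twentieth of a cent for age or location.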

More than a decade of data collection by hundreds of players has driven prices down this far, and yet governments around the world have barely started thinking about the question.

In other words, it now costs very little to buy information that violates your privacy.  Where market prices don’t reflect social costs, the government often steps in to correct the externality.

Basic economics; basic policy.


Blunt Tools

While data is a game-changer, it’s also important to remember that half-assed data tools are not better than no data tools.  A good example is the horrible crop of software algorithms used to sort through resumes over the last decade:

Algorithms and big data are powerful tools. Wisely used, they can help match the right people with the right jobs. But they must be designed and used by humans, so they can go horribly wrong. Peter Cappelli of the University of Pennsylvania’s Wharton School of Business recalls a case where the software rejected every one of many good applicants for a job because the firm in question had specified that they must have held a particular job title—one that existed at no other company.
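The failure mode Cappelli describes is easy to reproduce. Here is a hypothetical sketch (the job title, field names, and applicants are all invented for illustration) of a screen that requires an exact title no other company uses:

```python
# A hypothetical over-strict screen: requiring an exact job title that
# exists at no other company guarantees every real applicant is rejected.
REQUIRED_TITLE = "Senior Synergy Architect III"  # title unique to the hiring firm

applicants = [
    {"name": "A", "title": "Software Engineer", "years": 8},
    {"name": "B", "title": "Senior Software Engineer", "years": 12},
    {"name": "C", "title": "Staff Engineer", "years": 15},
]

def passes_screen(applicant: dict) -> bool:
    # Exact string match on title -- the "blunt tool".
    return applicant["title"] == REQUIRED_TITLE

survivors = [a for a in applicants if passes_screen(a)]
print(len(survivors))  # 0 -- every qualified candidate is screened out
```

The software did exactly what it was told; the humans who specified the rule are the problem, which is Cappelli’s point.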

Freeing Drugs Data

On Thursday, GSK announced an effort to open up its clinical trial data sets to outside researchers, perhaps leading the pharma industry to join the Kaggle trend of freeing data so that a larger community of smart people can mine innovative truths from it.

GSK announced:

GSK is fully committed to sharing information about its clinical trials. It posts summary information about each trial it begins and shares the summary results of all of its clinical trials – whether positive or negative – on a website accessible to all. Today this website includes almost 4,500 clinical trial result summaries and receives an average of almost 10,000 visitors each month. The company has also committed to seek publication of the results of all of its clinical trials that evaluate its medicines – regardless of what the results say – to peer-reviewed scientific journals.

Expanding further on its commitments to openness and transparency, GSK also announced today that the company will create a system that will enable researchers to access the detailed anonymised patient-level data that sit behind the results of clinical trials of its approved medicines and discontinued investigational medicines. To ensure that this information will be used for valid scientific endeavour, researchers will submit requests which will be reviewed for scientific merit by an independent panel of experts and, where approved, access will be granted via a secure web site. This will enable researchers to examine the data more closely or to combine data from different studies in order to conduct further research, to learn more about how medicines work in different patient populations and to help optimise the use of medicines with the aim of improving patient care.

This initiative is a step towards the ultimate aim of the clinical research community developing a broader system where researchers will be able to access data from clinical trials conducted by different sponsors. GSK hopes the experience gained through this initiative will be of value in developing and catalysing this wider approach.

Enabling More Shots on Goal

One of the metaphors I have used in connection with finding talent is Antonio Gates, the San Diego Chargers’ monster TE who never played college football.  With too narrow a view of talent, Gates might have spent his life doing something else, and we football fans would never have gotten to appreciate his pass-catching ability.  As I commented then, we need to break down the limiting beliefs about the narrowness of talent that are embedded in talent-recognition processes:

Despite this, the recruiting process often works in a very conventional way: it takes a narrow view of things and can whittle unconventional candidates out of the talent pool.  My thesis is that this causes a huge economic and human deadweight loss to our economy and society, as talented folks don’t get to move around, or it takes them too long to make a cross-functional move.

With Antonio Gates at the back of my mind, it was interesting this weekend to tie the idea to global football, or soccer as we Americans know it.  Manchester City, the reigning Premier League champions, are working with a private provider of soccer match data to release that data to fans (tying into another favorite topic of this blog, freeing data sets).  The expectation is that, in Kaggle-like fashion, a crowd of smart people and fans will attack that data and mine new truths out of it.  As a club official says, capturing the power of empowering the crowd through open processes:

I want our industry to find a Bill James. Bill James needs data, and whoever the Bill James of football is, he doesn’t have the data because it costs money.

Liking “Like”

The textbook instance of what I have dubbed the routine rite is Facebook’s “Like” button.  The passage below, quoting early Facebook engineer Justin Rosenstein, describes the deceptively simple, and consequently powerful, nature of this gesture:

Following the lead of early Internet sharing services, Zuckerberg then created a button that would allow users to signal an endorsement on Facebook, and eventually on other websites, of a video, picture, article, or even a brand. Other engineers wanted to call it the “awesome” button. Zuckerberg decided to name it the “like” button. “It sounded bland and generic,” said Justin Rosenstein, an early Facebook engineer who went on to found Asana, an online collaboration tool, with Facebook co-founder Dustin Moskovitz. “I feel foolish in hindsight to have missed the genius: Facebook has managed to take concepts as basic as ‘friend,’ ‘event,’ and ‘like’ and co-opt them.”

Context and Presentation

Sometimes, especially where data is concerned, we immediately think complication – algorithms, data science, and equations.  There is a place for that, but we can outsmart ourselves.  Often, simplicity wins the marketplace.  A good reminder about data plays came this week:

Here’s the thing. Data, big, medium or small, has no value in and of itself. The value of data is unlocked through context and presentation.

Context and presentation, not only the data itself, make the difference.  Without them, the data is unlikely to be seen as useful to the user, even when it could be; and if it is not useful and does not change behavior, the opportunity to create value is missed.

Mark Suster on the Rich Waters of the Twitter Ecosystem

In his post today explaining his investment in DataSift, Mark Suster explores in insightful detail the value of Twitter for content creation and of tools like DataSift for helping users extract from that content what is useful to them. His analogy in the excerpt below is transforming an overwhelming fire-hose blast into a manageable tap of running water.

Our goal is to make the enormous volume of real-time information more manageable for the 99% of companies that lack the infrastructure to process these volumes in real time. Think of DataSift as turning the fire-hose into a cost-effective and manageable tap of running water.
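At its simplest, the fire-hose-to-tap idea is stream filtering: lazily pass through only the messages a given consumer can use, so the full volume never has to be held at once. A minimal sketch (the message format and keyword filter are my own invention; DataSift’s real platform is far richer):

```python
from typing import Iterable, Iterator

def tap(firehose: Iterable[dict], keywords: set) -> Iterator[dict]:
    """Lazily yield only the messages that mention a keyword,
    so the consumer never holds the full stream in memory."""
    for message in firehose:
        text = message.get("text", "").lower()
        if any(k in text for k in keywords):
            yield message

# Hypothetical stream of messages standing in for the fire-hose.
stream = [
    {"text": "Just watched the game"},
    {"text": "Our brand launch is live!"},
    {"text": "lunch photos"},
    {"text": "Loving the new brand identity"},
]

relevant = list(tap(stream, {"brand"}))
print(len(relevant))  # 2
```

The generator is the “tap”: the caller pulls messages at its own pace, and the irrelevant 99% of the fire-hose is never materialized.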

To draft in the airstream of his post, I wanted to refer back to two recent posts on this blog addressing both subjects: here is a post on the value of Twitter in enabling users to create content and here is a post on the opportunity that lies in providing the filters required to make manageable the flood of content from Twitter and other content-creation sources.

Content Dhobi Ghats: Easier Self-Expression and Better Filters

In Mumbai, there are massive open-air laundromats called dhobi ghats. Somehow, clothing is picked up from a client’s home, placed in close proximity to, if not outright intermingled with, the clothes of thousands of other households, and yet it makes it back to the home of the correct owner. This ability to match the right clothing, picked out of a mountain of the wrong clothing, to the right household is critical to the value created by the dhobi ghats; returning the wrong items would have no value and indeed would drain value through the cascading frustration it would cause clients.

So what does this have to do with anything?

In my most recent post, which discussed “enablement” as an Internet business model, I linked back to a post from roughly two months ago about Internet-enabled expression as a foundation of successful business models on the Internet and as an unprecedented historic enabler of human expression. I noted that one could trace the history of the Internet through sites that made self-expression easier and easier:

I can chart a chain of tools from when I first started using the Internet: interest-based UseNet groups, listservs, GeoCities and other “create your own website” tools, blogging tools, YouTube, Facebook and social networking sites, Twitter, Tumblr and microblogging sites, photosharing tools etc.

Coincidentally, also yesterday, Fred Wilson posted on a similar topic. Using posts per day for WordPress, Tumblr, and Twitter, he noted that:

The frequency of posts in a service is inversely proportional to the size of the post. Said another way, the longer the post, the less frequently they will happen.


If you want to understand the power of Tumblr and Twitter, you need to look at how quick and how easy it is to post. There are of course many other factors at work, but brevity and ease is a big part of why these services work so well.

The point is simple: the easier you make it for people to express themselves by giving them a variety of simple tools to do so, the easier it is for the user to overcome inertia and “say” something, and the more total content is created.

This raises the obvious question: we already have too much content, so isn’t making it easier for users to create even more content sort of pointless?  Fred’s most illuminating point is in the comments to his post: “that’s why we need filters and curation. we want more posts and more filters”

Filters that sort through the ever-increasing content and present it on a silver platter to those for whom it is most relevant are now a critically important business model, the other side of the coin of the expression thesis. Some system, like the Mumbai dhobi ghats, has to get the content to the right place for it to have optimal value. Ultimately, if content is not read, it will not be created, nor will society benefit from users seeing the content most useful to them. For these reasons, critical to “building better soapboxes” is also creating “dhobi ghats,” to keep advancing the enablement of human expression as we have these last fifteen or so years.

Creating Value from Big, Messy Data

Information is power, so the saying goes.  Today, decades into the digital age, a transition that enabled the collection and creation of data on a previously unknown scale, we have enormous data sets.  These sets are growing at a monstrous pace, even relative to the digital age, fed by social networks, blogs and self-publishing, and the data collected by mobile phones. For the first time, we can also cheaply (“bootstrapped startup” levels of cheap) use technology to search and find patterns in these datasets, opening opportunities to create new markets and disrupt existing ones.  An article from the FT, excerpted below, provides an overview of these issues.  I am going to dive much deeper into this, as it has personal relevance for my current project.

If information is power, harnessing the increased information available provides opportunities for unprecedented value creation/disruption/redistribution.

Excerpt below:

“While “big data” has become the buzzword, a better description would be “messy data”, says Roger Ehrenberg of IA Ventures, an early-stage investor. Harvesting, cleaning up and organising raw data in a way that it can be processed is a large part of the battle, he says.

This has been complicated further by the big growth in unstructured data – information, such as text, that is not organised in a way that a computer can easily process. With the volume of user-generated text and video growing rapidly, this has become one of the main focuses of technological development.

Chief among the new tools are natural language processing, which enables a computer to extract meaning from text, and machine learning, the feedback loops through which computers can test their conclusions on large amounts of data in order progressively to refine their results.

Subjecting large data sets to analysis has also been made easier by two of the forces that have reshaped information technology more widely: the spread of low-cost, standardised computer hardware and the emergence of open-source software.

This has created a cheap computing platform for new technologies such as Hadoop – a piece of software architecture that is designed to handle massive amounts of data. The idea was based on breakthroughs at Google, which needed to find ways to conduct large volumes of intensive web searches simultaneously. It has since been taken up by companies including Facebook and Yahoo.

The rise of cloud computing – which centralises storage and processing power in larger data centres – has also brought big data within the reach of more companies. By tapping into the cloud computing services offered by Amazon, say, a company such as Color can get instant access to all the analytical power it needs without needing to take on the fixed costs of buying its own servers, says D.J. Patil, chief product officer at the IT start-up.

For business leaders, “the big skill in future will be to ask the right question”, says Tim O’Reilly, a technology commentator and publisher.

Besides smartphones, new sources of data include social networks, blogs and other sources of user-generated content; sensors collecting everything from traffic patterns to a user’s heart rhythm; and click streams generated by people spending an increasing amount of their lives online.

Much of the information is in unstructured form. It has never been collated in a traditional relational database, where it could be queried at will. Without techniques to harvest, verify and analyse it – often in real time – valuable commercial signals are lost in the noise.

It sometimes takes the analysis of massive data sets to detect useful patterns, says Michael Olson. His California start-up, Cloudera, is commercialising the type of technology used by companies such as Facebook and Yahoo to crunch through vast bodies of information. Retailers, for instance, might learn far more from the 10 years’ worth of customer data they can now analyse in one go than from the more limited runs to which they were once restricted, he says.”
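The Hadoop approach the excerpt mentions (split a big job into independent chunks, process each chunk in parallel, then merge the results) can be sketched in miniature as a toy word count. This illustrates only the map-reduce pattern, not Hadoop’s actual API:

```python
from collections import Counter

def map_phase(chunk: str) -> Counter:
    """Count words in one chunk independently -- the parallelizable step
    that a cluster would run on many machines at once."""
    return Counter(chunk.lower().split())

def reduce_phase(partials: list) -> Counter:
    """Merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total += partial
    return total

# A toy "data set" split into chunks, as a cluster would split a file.
chunks = ["big data big tools", "messy data needs cleaning", "big wins"]
result = reduce_phase([map_phase(c) for c in chunks])
print(result["big"])   # 3
print(result["data"])  # 2
```

Because each `map_phase` call touches only its own chunk, the work scales out across cheap, standardized hardware, which is exactly the economics the article describes.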

Startup America Reboot Redux

A couple of links since my post on government data sets: an article by a University of Chicago professor and one from Aneesh Chopra, CTO of the United States.  Still much more to be done on this front.  It is way too early to pat ourselves on the back, especially from the perspective of federal government data.  Hopefully this means more focus on the issue, because it can be entrepreneurial fuel, as I have discussed.