Data scientists: Bring the narrative to the forefront

By 2025, 463 exabytes of data will be created each day, according to some estimates. (For perspective, one exabyte of storage could hold 50,000 years of DVD-quality video.) It’s now easier than ever to translate physical and digital actions into data, and businesses of all types have raced to amass as much data as possible in order to gain a competitive edge.
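That "50,000 years" comparison checks out as a back-of-envelope calculation, assuming DVD-quality video runs at roughly 2 GB per hour (an approximation, not an exact spec):

```python
# Rough sanity check: how many years of DVD-quality video fit in one exabyte?
EXABYTE_GB = 1e18 / 1e9   # one exabyte expressed in gigabytes
GB_PER_HOUR = 2           # assumed DVD-quality bitrate (~2 GB/hour)
HOURS_PER_YEAR = 24 * 365

hours = EXABYTE_GB / GB_PER_HOUR
years = hours / HOURS_PER_YEAR
print(round(years))  # ~57,000 years -- the same ballpark as the 50,000 figure
```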

However, in our collective infatuation with data (and obtaining more of it), what’s often overlooked is the role that storytelling plays in extracting real value from data.

The reality is that data by itself is insufficient to really influence human behavior. Whether the goal is to improve a business’ bottom line or convince people to stay home amid a pandemic, it’s the narrative that compels action, rather than the numbers alone. As more data is collected and analyzed, communication and storytelling will become even more integral in the data science discipline because of their role in separating the signal from the noise.

Data alone doesn’t spur innovation — rather, it’s data-driven storytelling that helps uncover hidden trends, powers personalization, and streamlines processes.

Yet this can be an area where data scientists struggle. In Anaconda’s 2020 State of Data Science survey of more than 2,300 data scientists, nearly a quarter of respondents said that their data science or machine learning (ML) teams lacked communication skills. This may be one reason why roughly 40% of respondents said they were able to effectively demonstrate business impact “only sometimes” or “almost never.”

The best data practitioners must be as skilled in storytelling as they are in coding and deploying models — and yes, this extends beyond creating visualizations to accompany reports. Here are some recommendations for how data scientists can situate their results within larger contextual narratives.

Make the abstract more tangible

Ever-growing datasets help machine learning models better understand the scope of a problem space, but more data does not necessarily help with human comprehension. Even for the most left-brain of thinkers, it’s not in our nature to understand large abstract numbers or things like marginal improvements in accuracy. This is why it’s important to include points of reference in your storytelling that make data tangible.

For example, throughout the pandemic, we’ve been bombarded with countless statistics around case counts, death rates, positivity rates, and more. While all of this data is important, tools like interactive maps and conversations around reproduction numbers are more effective than massive data dumps in terms of providing context, conveying risk, and, consequently, helping change behaviors as needed. In working with numbers, data practitioners have a responsibility to provide the necessary structure so that the data can be understood by the intended audience.

Enterprise security attackers are one password away from your worst day

If the definition of insanity is doing the same thing over and over and expecting a different outcome, then one might say the cybersecurity industry is insane.

Criminals continue to innovate with highly sophisticated attack methods, but many security organizations still use the same technological approaches they did 10 years ago. The world has changed, but cybersecurity hasn’t kept pace.

Distributed systems, with people and data everywhere, mean the perimeter has disappeared. And the hackers couldn’t be more excited. The same technology approaches, like correlation rules, manual processes and reviewing alerts in isolation, do little more than remedy symptoms while hardly addressing the underlying problem.

The current risks aren’t just technology problems; they’re also problems of people and processes.

Credentials are supposed to be the front gates of the castle, but as the security operations center (SOC) fails to change, it fails to detect. The cybersecurity industry must rethink its strategy to analyze how credentials are used and stop breaches before they become bigger problems.

It’s all about the credentials

Compromised credentials have long been a primary attack vector, but the problem has only grown worse in the mid-pandemic world. The acceleration of remote work has increased the attack footprint as organizations struggle to secure their networks while employees work from unsecured connections. In April 2020, the FBI said that cybersecurity attacks reported to the organization grew by 400% compared to before the pandemic. Just imagine where that number is now in early 2021.

It only takes one compromised account for an attacker to enter Active Directory and create their own credentials. In such an environment, all user accounts should be considered potentially compromised.

Nearly all of the hundreds of breach reports I’ve read have involved compromised credentials. More than 80% of hacking breaches are now enabled by brute force or the use of lost or stolen credentials, according to the 2020 Data Breach Investigations Report. Among the most effective and commonly used tactics is credential stuffing, in which attackers replay stolen username and password pairs at scale; once inside, they exploit the environment and move laterally to gain higher-level access.
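One telltale signal a SOC can look for is a single source trying to log in to many distinct accounts, rather than hammering one account. The sketch below is illustrative only, with a made-up log format, not any vendor's actual detection logic:

```python
# Minimal sketch: flag source IPs whose failed logins span many *distinct*
# accounts -- a classic credential-stuffing signature, as opposed to brute
# force, which hits one account repeatedly. Log format here is hypothetical.
failed_logins = [
    ("203.0.113.7", "alice"), ("203.0.113.7", "bob"),
    ("203.0.113.7", "carol"), ("203.0.113.7", "dave"),
    ("198.51.100.2", "alice"), ("198.51.100.2", "alice"),
]

def stuffing_suspects(events, distinct_account_threshold=3):
    accounts_per_ip = {}
    for ip, account in events:
        accounts_per_ip.setdefault(ip, set()).add(account)
    return [ip for ip, accts in accounts_per_ip.items()
            if len(accts) >= distinct_account_threshold]

print(stuffing_suspects(failed_logins))  # ['203.0.113.7']
```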

Should Dell have pursued a more aggressive debt-reduction move with VMware?

When Dell announced yesterday that it was spinning out VMware, the move itself wasn’t surprising: there had been public speculation for some time. But Dell could have structured this deal a number of ways, and it chose to spin VMware out as a separate company with a special dividend rather than sell it outright.

The dividend route, which involves a payment to shareholders between $11.5 and $12 billion, has the advantage of being tax-free (or at least that’s what Dell hopes as it petitions the IRS). For Dell, which owns 81% of VMware, the dividend translates to somewhere between $9.3 and $9.7 billion in cash, which the company plans to use to pay down a portion of the huge debt it still holds from its $58 billion EMC purchase in 2016.

VMware was the crown jewel in that transaction, giving Dell an inroad to the cloud it had lacked prior to the deal. For context, VMware popularized the notion of the virtual machine, a concept that led to the development of cloud computing as we know it today. It has since expanded much more broadly beyond that, giving Dell a solid foothold in cloud native computing.

Dell hopes to have its cake and eat it too with this deal: it generates a large slug of cash to pay down its own debt while securing a five-year commercial agreement that should keep the two companies closely aligned. Dell CEO Michael Dell will remain chairman of the VMware board, which should help smooth the post-spinout relationship.

But could Dell have extracted more cash out of the deal?

Doing what’s best for everyone

Patrick Moorhead, principal analyst at Moor Insights & Strategy, says that beyond the cash transaction, the deal provides a way for the companies to continue working closely together with the least amount of disruption.

“In the end, this move is more about maximizing the Dell and VMware stock price [in a way that] doesn’t impact customers, ISVs or the channel. Wall Street wasn’t valuing the two companies together nearly as [strongly] as I believe it will as separate entities,” Moorhead said.

IBM acquires Italy’s MyInvenio to integrate process mining directly into its suite of automation tools

Automation has become a big theme in enterprise IT, with organizations using RPA, no-code and low-code tools, and other technology to speed up work and bring more insights and analytics into their day-to-day operations. Today IBM announced an acquisition that it hopes will give it a bigger role in providing those automation services: the IT giant has acquired myInvenio, an Italian startup that builds and operates process mining software.

Process mining is the part of the automation stack that tracks data produced by a company’s software, as well as how the software works, in order to provide guidance on what a company could and should do to improve it. In the case of myInvenio, the company’s approach involves making a “digital twin” of an organization to help track and optimize processes. IBM is interested in how myInvenio’s tools are able to monitor data in areas like sales, procurement, production and accounting to help organizations identify what might be better served with more automation, which it can in turn run using RPA or other tools as needed.

Terms of the deal are not being disclosed. It is not clear if myInvenio had any outside investors (we’ve asked and are awaiting a response). This is the second acquisition IBM has made out of Italy. (The first was in 2014, a company called CrossIdeas that now forms part of the company’s security business.)

IBM and myInvenio are not exactly strangers: the two inked a deal as recently as November 2020 to integrate the Italian startup’s technology into IBM’s bigger automation services business globally.

Dinesh Nirmal, GM of IBM Automation, said in an interview that the reason IBM acquired the company was two-fold. First, it lets IBM integrate the technology more closely into the company’s Cloud Pak for Business Automation, which sits on and is powered by Red Hat OpenShift and has other automation capabilities already embedded within it, specifically robotic process automation (RPA), document processing, workflows and decisions.

Second and perhaps more importantly, it will mean that IBM will not have to tussle for priority for its customers in competition with other solution partners that myInvenio already had. IBM will be the sole provider.

“Partnerships are great, but in a partnership you also have the option to partner with others, and when it comes to priority, who decides?” he said. “From the customer perspective, will they work just on our deal, or others first? Now, our customers will get the end result of this… We can bring a single solution to an end user or an enterprise, saying, ‘Look, you have document processing, RPA, workflow, mining.’ That is the beauty of this and what customers will see.”

He said that IBM currently serves customers across a range of verticals including financial, insurance, healthcare and manufacturing with its automation products.

Notably, this is not the first acquisition that IBM has made to build out this stack. Last year, it acquired WDG to expand into robotic process automation.

And interestingly, it’s not even the only partnership that IBM has had in process mining. Just earlier this month, it announced a deal with one of the bigger names in the field, Celonis, a German startup valued at $2.5 billion in 2019.

Ironically, at the time, my colleague Ron wondered aloud why IBM wasn’t just buying Celonis outright. Price may well have been one reason. Remember: we don’t know the terms of this acquisition, but given that myInvenio was off the fundraising radar, chances are it went for rather less than Celonis’s price tag.

We’ve asked and IBM has confirmed that it will continue to work with Celonis alongside now offering its own native process mining tools.

“In keeping with IBM’s open approach and $1 billion investment in ecosystem, [Global Business Services, IBM’s enterprise services division] works with a broad range of technologies based on client and market demand, including IBM AI and Automation software,” a spokesperson said in a statement. “Celonis focuses on execution management which supports GBS’ transformation of clients’ business processes through intelligent workflows across industries and domains. Specifically, Celonis has deep connectivity into enterprise systems such as Salesforce, SAP, Workday or ServiceNow, so the Celonis EMS platform helps GBS accelerate clients’ transformations and BPO engagements with these ERP platforms.”

Indeed, at the end of the day, companies that offer services, especially suites of services, are working in environments where they have to be open to customers using their own technology, or bringing in something else.

There may have been another force pushing IBM to bring more of this technology in-house, and that’s the wider competitive climate. Earlier this year, SAP acquired another European startup in the process mining space, Signavio, in a deal reportedly worth about $1.2 billion. As more of these companies get snapped up by would-be IBM rivals, and those left standing are working with a plethora of other parties, maybe it was high time for IBM to make sure it had its own horse in the race.

“Through IBM’s planned acquisition of myInvenio, we are revolutionizing the way companies manage their process operations,” said Massimiliano Delsante, CEO, myInvenio, who will be staying on with the deal. “myInvenio’s unique capability to automatically analyze processes and create simulations — what we call a ‘Digital Twin of an Organization’ —  is joining with IBM’s AI-powered automation capabilities to better manage process execution. Together we will offer a comprehensive solution for digital process transformation and automation to help enterprises continuously transform insights into action.”

Cado Security locks in $10M for its cloud-native digital forensics platform

As computing systems become bigger and more complex, forensics has become an increasingly important part of how organizations secure them. As the recent SolarWinds breach has shown, it’s not always just a matter of identifying data loss or preventing hackers from getting in in the first place. In cases where a network has already been breached, running a thorough investigation is often the only way to identify what happened, whether a breach is still active, and whether a malicious hacker can strike again.

As a sign of this growing priority, a startup called Cado Security, which has built forensics technology native to the cloud to run those investigations, is announcing $10 million in funding to expand its business.

Cado’s tools today are used directly by organizations, but also security companies like Redacted — a somewhat under-the-radar security startup in San Francisco co-founded by Facebook’s former chief security officer Max Kelly and John Hering, the co-founder of Lookout. It uses Cado to carry out the forensics part of its work.

The funding for London-based Cado is being led by Blossom Capital, with existing investors Ten Eleven Ventures also participating, among others. As another signal of demand, this Series A is coming only six months after Cado raised its seed round.

The task of securing data on digital networks has grown increasingly complex over the years: not only are there more devices, more data and a wider range of configurations and uses around it, but malicious hackers have become increasingly sophisticated in their approaches to needling inside networks and doing their dirty work.

The move to the cloud has also been a major factor. While it has helped a wave of organizations expand and run much bigger computing processes as part of their business operations, it has also increased the so-called attack surface and made investigations much more complicated, not least because a lot of organizations run elastic processes, scaling their capacity up and down: when something is scaled down, logs of previous activity essentially disappear.

Cado’s Response product — which works proactively on a network and all of its activity after it’s installed — is built to work across cloud, on-premise and hybrid environments. Currently it’s available for AWS EC2 deployments and Docker, Kubernetes, OpenShift and AWS Fargate container systems, and the plan is to expand to Azure very soon. (Google Cloud Platform is less of a priority at the moment, CEO James Campbell said, since it rarely comes up with current and potential customers.)

Campbell co-founded Cado with Christopher Doman (the CTO) last April, with the concept for the company coming out of their experience working together on security services at PwC, and separately for government organizations (Campbell, in Australia) and for AlienVault (the security firm acquired by AT&T). In all of those roles, one persistent issue the two kept encountering was inadequate forensics data, which is essential for tracking the most complex breaches.

A lot of legacy forensics tools, in particular those tackling the trove of data in the cloud, were based on “processing data with open source and pulling together analysis in spreadsheets,” Campbell said. “There is a need to modernize this space for the cloud era.”

In a typical breach, it can take up to a month to run a thorough investigation to figure out what is going on, since, as Doman describes it, forensics looks at “every part of the disk, the files in a binary system. You just can’t find what you need without going to that level, those logs. We would look at the whole thing.”

However, that posed a major problem. “Having a month with a hacker running around before you can do something about it is just not acceptable,” Campbell added. The result, typically, is that other forensics tools investigate only about 5% of an organization’s data.

The solution — for which Cado has filed patents, the pair said — has essentially involved building big data tools that can automate and speed up the very labor intensive process of looking through activity logs to figure out what looks unusual and to find patterns within all the ones and zeros.
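To make the idea concrete, one simple way to automate "what looks unusual" in a large activity log is rarity scoring: events whose pattern almost never occurs get surfaced for human review first. This is illustrative only, not Cado's actual (patented) approach, and the log lines are invented:

```python
from collections import Counter

# Synthetic activity log: a web server mostly runs nginx, a backup user runs
# rsync, and -- rarely -- the web user spawns a shell, which is suspicious.
log = (
    ["user=web exec=/usr/bin/nginx"] * 500
    + ["user=web exec=/bin/sh"] * 2
    + ["user=backup exec=/usr/bin/rsync"] * 100
)

counts = Counter(log)
total = len(log)
# Flag events accounting for less than 1% of all activity.
unusual = [event for event, n in counts.items() if n / total < 0.01]
print(unusual)  # ['user=web exec=/bin/sh']
```

Real forensics platforms score far richer features than raw frequency, but the triage principle (let machines rank, let humans investigate) is the same one Campbell describes.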

“That gives security teams more room to focus on what the hacker is getting up to, the remediation aspect,” Campbell explained.

Arguably, if there were better, faster tracking and investigation technology in place, something like SolarWinds could have been better mitigated.

The plan for the company is to bring in more integrations to cover more kinds of systems, and go beyond deployments that you’d generally classify as “infrastructure as a service.”

“Over the past year, enterprises have compressed their cloud adoption timelines while protecting the applications that enable their remote workforces,” said Imran Ghory, partner at Blossom Capital, in a statement. “Yet as high-profile breaches like SolarWinds illustrate, the complexity of cloud environments makes rapid investigation and response extremely difficult since security analysts typically are not trained as cloud experts. Cado Security solves for this with an elegant solution that automates time-consuming tasks like capturing forensically sound cloud data so security teams can move faster and more efficiently. The opportunity to help Cado Security scale rapidly is a terrific one for Blossom Capital.”

Dell is spinning out VMware in a deal expected to generate over $9B for the company

Dell announced this afternoon that it’s spinning out VMware, a move that has been suspected for some time. Dell, which acquired VMware as part of the massive $58 billion EMC acquisition (announced as $67 billion) in 2015, owns approximately 81% of the stock and the company is expected to receive between $9.3 and $9.7 billion when the deal closes later this year.

Even when it was part of EMC, VMware had special status: it operated as a separate entity with its own executive team and board of directors, and its stock traded separately as well.

“Both companies will remain important partners, providing Dell Technologies with a differentiated advantage in how we bring solutions to customers. At the same time, Dell Technologies will continue to modernize its core infrastructure and PC businesses and embrace new opportunities through an open ecosystem to grow in hybrid and private cloud, edge and telecom,” Dell CEO Michael Dell said in a statement.

While there is a lot of CEO-speak in that statement, it appears to mean that the move is mostly administrative, as the companies will continue to work closely together even after the spin-off is official. Michael Dell will remain chairman of both companies. What’s more, Dell plans to use the cash proceeds from the deal to help pay down the massive debt it still carries from the EMC deal.

The deal is expected to close at the end of this year, but it has to clear a number of regulatory hurdles first. That includes garnering a favorable ruling from the IRS that the deal qualifies as a tax-free spin-off, which seems to be a considerable hurdle for a deal like this.

This is a breaking story. We will have more soon.

Upstack raises $50M for its platform and advisory to help businesses plan and buy for digital transformation

Digital transformation has been one of the biggest catchphrases of the past year, with many an organization forced to reckon with aging IT, a lack of digital strategy, or simply the challenges of growth after being faced with newly-remote workforces, customers doing everything online and other tech demands.

Now, a startup called Upstack that has built a platform to help those businesses evaluate how to grapple with those next steps — including planning and costing out different options and scenarios, and then ultimately buying solutions — is announcing financing to do some growth of its own.

The New York startup has picked up funding of $50 million, money that it will be using to continue building out its platform and expanding its services business.

The funding is coming from Berkshire Partners, and it’s being described as an “initial investment.” The firm, which makes private equity and late-stage growth investments, typically puts between $100 million and $1 billion into its portfolio companies, so this could end up as a bigger number, especially when you consider the size of the market Upstack is tackling: the cloud and internet infrastructure brokerage industry generates annual revenues “in excess of $70 billion,” the company estimates.

We’re asking about the valuation, but PitchBook notes that the median valuation in its deals is around $211 million. Upstack had previously raised around $35 million.

Upstack today already provides tools to large enterprises, government organizations, and smaller businesses to compare offerings and plan out pricing for different scenarios covering a range of IT areas, including private, public and hybrid cloud deployments; data center investments; network connectivity; business continuity and mobile services, and the plan is to bring in more categories to the mix, including unified communications and security.

Notably, Upstack itself is profitable, and many of its named customers are themselves tech companies — they include Cisco, Accenture, cloud storage company Backblaze, Riverbed and Lumen — a mark that digital transformation, and planning for it, is not necessarily a core competency even of digital businesses, let alone of companies outside the technology industry. It says it has helped complete over 3,700 IT projects across 1,000 engagements to date.

“Upstack was founded to bring enterprise-grade advisory services to businesses of all sizes,” said Christopher Trapp, founder and CEO, in a statement. “Berkshire’s expertise in the data center, connectivity and managed services sectors aligns well with our commitment to enabling and empowering a world-class ecosystem of technology solutions advisors with a platform that delivers higher value to their customers.”

The core of Upstack’s proposition is a platform that systems integrators, or advisors, plus end users themselves, can use to design and compare pricing for different services and solutions. This is an unsung but critical aspect of the ecosystem: we love to hear and write about all the interesting enterprise technology being developed, but the truth of the matter is that buying and using that tech is never just a simple click on a “buy” button.

Even for smaller organizations, buying tech can be a hugely time-consuming task. It involves evaluating different companies and what they have to offer — which can differ widely in the same category, and gets more complex when you start to compare different technological approaches to the same problem.

It also includes the task of designing solutions to fit one’s particular network. And finally, there are the calculations that need to be made to determine the real cost of services once implemented in an organization. The platform also gives users the ability to present their work, which forms a critical part of the evaluation and decision-making process. When you think about all of this, it’s no wonder that so many organizations have opted to follow the “if it ain’t broke, don’t fix it” school of digital strategy.
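The cost calculation at the heart of that comparison is worth spelling out: the sticker price of a service rarely equals its real cost once recurring fees and one-time migration work are factored in. All figures in this sketch are invented, and the scenario names are hypothetical, not Upstack's categories:

```python
# Hypothetical total-cost-of-ownership comparison over a three-year horizon.
scenarios = {
    "public_cloud": {"monthly_base": 4000, "egress": 900, "support": 300,
                     "one_time_migration": 12000},
    "colocation":   {"monthly_base": 2500, "egress": 0, "support": 800,
                     "one_time_migration": 30000},
}

def three_year_cost(s):
    # Recurring costs over 36 months plus the one-time migration cost.
    monthly = s["monthly_base"] + s["egress"] + s["support"]
    return monthly * 36 + s["one_time_migration"]

for name, s in scenarios.items():
    print(name, three_year_cost(s))
# public_cloud 199200
# colocation 148800
```

The cheaper option flips depending on the horizon chosen, which is exactly why modeling several scenarios before buying matters.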

As technology has evolved, the concept of digital transformation itself has become more complicated, making tools like Upstack’s more in demand both by companies and the people they hire to do this work for them. Upstack also employs a group of about 15 advisors — consultants — who also provide insight and guidance in the procurement process, and it seems some of the funding will also be used to invest in expanding that team.

(Incidentally, the model of balancing technology with human experts is one used by other enterprise startups that are built around the premise of helping businesses procure technology: BlueVoyant, a security startup that has built a platform to help businesses manage and use different security services, also retains advisors who are experts in that field.)

The advisors are part of the business model: Upstack’s customers can either pay Upstack a consulting fee to work with its advisors, or Upstack receives a commission from suppliers that a company ends up using, having evaluated and selected them via the Upstack platform.

The company competes with traditional systems integrators and consultants, but it seems that the fact that it has built a tech platform that some of its competitors also use is one reason why it’s caught the eye of investors, and also seen strong growth.

Indeed, when you consider the breadth of services that a company might use within its infrastructure — whether it’s software to run sales or marketing, or AI to run product recommendations on a site, or business intelligence or RPA — it will be interesting to see how and if Upstack considers deeper moves into these areas.

“Upstack has quickly become a leader in a large, rapidly growing and highly fragmented market,” said Josh Johnson, principal at Berkshire Partners, in a statement. “Our experience has reinforced the importance of the agent channel to enterprises designing and procuring digital infrastructure. Upstack’s platform accelerates this digital transformation by helping its advisors better serve their enterprise customers. We look forward to supporting Upstack’s continued growth through M&A and further investment in the platform.”

Zoho launches new low code workflow automation product

Workflow automation has been one of the key trends this year so far, and Zoho, a company known for its suite of affordable business tools, has joined the parade with a new low code workflow product called Qntrl (pronounced “control”).

Zoho’s Rodrigo Vaca, who is in charge of Qntrl’s marketing, says that most of the solutions we’ve been seeing are built for larger enterprise customers. Zoho is aiming for the mid-market with a product that requires less technical expertise than traditional business process management tools.

“We enable customers to design their workflows visually without the need for any particular kind of prior knowledge of business process management notation or any kind of that esoteric modeling or discipline,” Vaca told me.

While Vaca says Qntrl could require some technical help to connect a workflow to more complex backend systems like CRM or ERP, it allows a less technical end user to drag and drop the components and then get help to finish the rest.

“We certainly expect that when you need to connect to NetSuite or SAP you’re going to need a developer. If nothing else, the IT guys are going to ask questions, and they will need to provide access,” Vaca said.

He believes this product is putting this kind of tooling in reach of companies that may have been left out of workflow automation for the most part, or which have been using spreadsheets or other tools to create crude workflows. With Qntrl, you drag and drop components, and then select each component and configure what happens before, during and after each step.
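The "configure what happens before, during and after each step" idea maps naturally onto a simple data model. This is a hypothetical sketch of what a visual builder might produce under the hood, not Qntrl's actual API:

```python
# Hypothetical workflow model: an ordered list of steps, each with optional
# hooks that fire before and after the step runs.
def log(msg):
    # Returns a hook that appends an audit entry to the shared context.
    return lambda ctx: ctx.setdefault("audit", []).append(msg)

workflow = [
    {"name": "submit_request",   "before": log("validating"),
                                 "after":  log("submitted")},
    {"name": "manager_approval", "before": log("notifying manager"),
                                 "after":  log("approved")},
]

def run(workflow):
    ctx = {}
    for step in workflow:
        for hook in ("before", "after"):
            if step.get(hook):
                step[hook](ctx)
    return ctx

print(run(workflow)["audit"])
# ['validating', 'submitted', 'notifying manager', 'approved']
```

The shared context is also what makes the central status view described below possible: every hook leaves a trace of where the workflow stands and who touched it.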

What’s more, Qntrl provides a central place for processing and understanding what’s happening within each workflow at any given time, and who is responsible for completing it.

We’ve seen bigger companies like Microsoft, SAP, ServiceNow and others offering this type of functionality over the last year as low code workflow automation has taken center stage in business.

This has become a more pronounced need during the pandemic when so many workers could not be in the office. It made moving work in a more automated workflow more imperative, and we have seen companies moving to add more of this kind of functionality as a result.

Brent Leary, principal analyst at CRM Essentials, says that Zoho is attempting to remove some of the complexity from this kind of tool.

“It handles the security pieces to make sure the right people have access to the data and processes used in the workflows in the background, so regular users can drag and drop to build their flows and processes without having to worry about that stuff,” Leary told me.

Zoho Qntrl is available starting today at just $7 per user per month.

Zoho launches new low code workflow automation product

Workflow automation has been one of the key trends this year so far, and Zoho, a company known for its suite of affordable business tools has joined the parade with a new low code workflow product called Qntrl (pronounced control).

Zoho’s Rodrigo Vaca, who is in charge of Qntrl’s marketing says that most of the solutions we’ve been seeing are built for larger enterprise customers. Zoho is aiming for the mid-market with a product that requires less technical expertise than traditional business process management tools.

“We enable customers to design their workflows visually without the need for any particular kind of prior knowledge of business process management notation or any kind of that esoteric modeling or discipline,” Vaca told me.

While Vaca says, Qntrl could require some technical help to connect a workflow to more complex backend systems like CRM or ERP, it allows a less technical end user to drag and drop the components and then get help to finish the rest.

“We certainly expect that when you need to connect to NetSuite or SAP you’re going to need a developer. If nothing else, the IT guys are going to ask questions, and they will need to provide access,” Vaca said.

He believes this product is putting this kind of tooling in reach of companies that may have been left out of workflow automation for the most part, or which have been using spreadsheets or other tools to create crude workflows. With Qntrl, you drag and drop components, and then select each component and configure what happens before, during and after each step.

What’s more, Qntrl provides a central place for processing and understanding what’s happening within each workflow at any given time, and who is responsible for completing it.

We’ve seen bigger companies like Microsoft, SAP, ServiceNow and others offering this type of functionality over the last year as low code workflow automation has taken center stage in business.

This has become a more pronounced need during the pandemic when so many workers could not be in the office. It made moving work in a more automated workflow more imperative, and we have seen companies moving to add more of this kind of functionality as a result.

Brent Leary, principal analyst at CRM Essentials, says that Zoho is attempting to remove some the complexity from this kind of tool.

“It handles the security pieces to make sure the right people have access to the data and processes used in the workflows in the background, so regular users can drag and drop to build their flows and processes without having to worry about that stuff,” Leary told me.

Zoho Qntrl is available starting today at $7 per user per month.

Docugami’s new model for understanding documents cuts its teeth on NASA archives

You hear so much about data these days that you might forget that a huge amount of the world runs on documents: a veritable menagerie of heterogeneous files and formats holding enormous value yet incompatible with the new era of clean, structured databases. Docugami plans to change that with a system that intuitively understands any set of documents and intelligently indexes their contents — and NASA is already on board.

If Docugami’s product works as planned, anyone will be able to take piles of documents accumulated over the years and near-instantly convert them to the kind of data that’s actually useful to people.

Because it turns out that running just about any business ends up producing a ton of documents. Contracts and briefs in legal work, leases and agreements in real estate, proposals and releases in marketing, medical charts, and so on. Not to mention the various formats: Word docs, PDFs, scans of paper printouts of PDFs exported from Word docs, and more.

Over the last decade there’s been an effort to corral this problem, but movement has largely been on the organizational side: put all your documents in one place, share and edit them collaboratively. Understanding the document itself has pretty much been left to the people who handle them, and for good reason — understanding documents is hard!

Think of a rental contract. We humans understand when the renter is named as Jill Jackson, that later on, “the renter” also refers to that person. Furthermore, in any of a hundred other contracts, we understand that the renters in those documents are the same type of person or concept in the context of the document, but not the same actual person. These are surprisingly difficult concepts for machine learning and natural language understanding systems to grasp and apply. Yet if they could be mastered, an enormous amount of useful information could be extracted from the millions of documents squirreled away around the world.

What’s up, .docx?

Docugami founder Jean Paoli says they’ve cracked the problem wide open, and while it’s a major claim, he’s one of few people who could credibly make it. Paoli was a major figure at Microsoft for decades, and among other things helped create the XML format — you know all those files that end in x, like .docx and .xlsx? Paoli is at least partly to thank for them.

“Data and documents aren’t the same thing,” he told me. “There’s a thing you understand, called documents, and there’s something that computers understand, called data. Why are they not the same thing? So my first job [at Microsoft] was to create a format that can represent documents as data. I created XML with friends in the industry, and Bill accepted it.” (Yes, that Bill.)

The formats became ubiquitous, yet 20 years later the same problem persists, having grown in scale with the digitization of industry after industry. But for Paoli the solution is the same. At the core of XML was the idea that a document should be structured almost like a webpage: boxes within boxes, each clearly defined by metadata — a hierarchical model more easily understood by computers.
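That boxes-within-boxes idea can be sketched in a few lines with Python’s standard library. The element names below are hypothetical, invented for illustration rather than taken from any real Docugami or Office schema, but they show how a hierarchical document becomes data a program can address directly:

```python
import xml.etree.ElementTree as ET

# A toy lease as nested, named boxes (hypothetical schema).
doc = """
<lease>
  <parties>
    <renter>Jill Jackson</renter>
    <landlord>Acme Properties</landlord>
  </parties>
  <terms>
    <rent currency="USD">1500</rent>
    <duration unit="months">12</duration>
  </terms>
</lease>
"""

root = ET.fromstring(doc)
# Each piece is addressable by its place in the hierarchy,
# not by where it happens to sit on the page.
renter = root.findtext("parties/renter")
rent = root.findtext("terms/rent")
print(renter, rent)  # Jill Jackson 1500
```

The same path, `parties/renter`, would pull the renter out of any document that follows the structure, which is the property that makes hierarchy so much friendlier to machines than free-flowing text.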

Illustration showing a document corresponding to pieces of another document.

Image Credits: Docugami

“A few years ago I drank the AI kool-aid, got the idea to transform documents into data. I needed an algorithm that navigates the hierarchical model, and they told me that the algorithm you want does not exist,” he explained. “The XML model, where every piece is inside another, and each has a different name to represent the data it contains — that has not been married to the AI model we have today. That’s just a fact. I hoped the AI people would go and jump on it, but it didn’t happen.” (“I was busy doing something else,” he added, to excuse himself.)

The lack of compatibility with this new model of computing shouldn’t come as a surprise — every emerging technology carries with it certain assumptions and limitations, and AI has focused on a few other, equally crucial areas like speech understanding and computer vision. The approach taken there doesn’t match the needs of systematically understanding a document.

“Many people think that documents are like cats. You train the AI to look for their eyes, for their tails … documents are not like cats,” he said.

It sounds obvious, but it’s a real limitation. Advanced AI methods like segmentation, scene understanding, multimodal context, and such are all a sort of hyperadvanced cat detection that has moved beyond cats to detect dogs, car types, facial expressions, locations, etc. Documents are too different from one another, or in other ways too similar, for these approaches to do much more than roughly categorize them.

As for language understanding, it’s good in some ways but not in the ways Paoli needed. “They’re working sort of at the English language level,” he said. “They look at the text but they disconnect it from the document where they found it. I love NLP people, half my team is NLP people — but NLP people don’t think about business processes. You need to mix them with XML people, people who understand computer vision, then you start looking at the document at a different level.”

Docugami in action

Illustration showing a person interacting with a digital document.

Image Credits: Docugami

Paoli’s goal couldn’t be reached by adapting existing tools (beyond mature primitives like optical character recognition), so he assembled his own private AI lab, where a multidisciplinary team has been tinkering away for about two years.

“We did core science, self-funded, in stealth mode, and we sent a bunch of patents to the patent office,” he said. “Then we went to see the VCs, and SignalFire basically volunteered to lead the seed round at $10 million.”

Coverage of the round didn’t really get into the actual experience of using Docugami, but Paoli walked me through the platform with some live documents. I wasn’t given access myself and the company wouldn’t provide screenshots or video, saying it is still working on the integrations and UI, so you’ll have to use your imagination … but if you picture pretty much any enterprise SaaS service, you’re 90% of the way there.

As the user, you upload any number of documents to Docugami, from a couple dozen to hundreds or thousands. These enter a machine understanding workflow that parses the documents, whether they’re scanned PDFs, Word files, or something else, into an XML-esque hierarchical organization unique to the contents.

“Say you’ve got 500 documents, we try to categorize it in document sets, these 30 look the same, those 20 look the same, those five together. We group them with a mix of hints coming from how the document looked, what it’s talking about, what we think people are using it for, etc.,” said Paoli. Other services might be able to tell the difference between a lease and an NDA, but documents are too diverse to slot into pre-trained ideas of categories and expect it to work out. Every set of documents is potentially unique, and so Docugami trains itself anew every time, even for a set of one. “Once we group them, we understand the overall structure and hierarchy of that particular set of documents, because that’s how documents become useful: together.”
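The grouping step can be imagined as a toy clustering pass. This sketch uses a crude character-level similarity from Python’s standard library in place of Docugami’s actual (unpublished) mix of layout, content and usage signals; the threshold and sample documents are made up for illustration:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude text similarity in [0, 1]; a stand-in for
    richer layout- and content-based signals."""
    return SequenceMatcher(None, a, b).ratio()

def group_documents(docs, threshold=0.7):
    """Greedily place each document into the first group whose
    representative it resembles; otherwise start a new group."""
    groups = []  # each group is a list of documents
    for doc in docs:
        for group in groups:
            if similarity(doc, group[0]) >= threshold:
                group.append(doc)
                break
        else:
            groups.append([doc])
    return groups

docs = [
    "LEASE AGREEMENT between Jill Jackson and Acme Properties...",
    "LEASE AGREEMENT between John Smith and Acme Properties...",
    "NON-DISCLOSURE AGREEMENT between parties...",
]
# The two leases fall into one group, the NDA into another.
print(len(group_documents(docs)))
```

The point of the sketch is the shape of the problem, not the method: once “these 30 look the same” has been decided, each set can be modeled on its own terms.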

Illustration showing a document being turned into a report and a spreadsheet.

Image Credits: Docugami

That doesn’t just mean it picks up on header text and creates an index, or lets you search for words. The data that is in the document, for example who is paying whom, how much and when, and under what conditions, all that becomes structured and editable within the context of similar documents. (It asks for a little input to double check what it has deduced.)

It can be a little hard to picture, but now just imagine that you want to put together a report on your company’s active loans. All you need to do is highlight the information that’s important to you in an example document — literally, you just click “Jane Roe” and “$20,000” and “five years” anywhere they occur — and then select the other documents you want to pull corresponding information from. A few seconds later you have an ordered spreadsheet with names, amounts, dates, anything you wanted out of that set of documents.
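A stripped-down version of that highlight-and-extract flow might look like the following. The field patterns here are hand-written regular expressions standing in for what Docugami learns from the user’s clicks, and the loan documents (reusing the article’s “Jane Roe” example) are invented for illustration:

```python
import csv
import io
import re

# Hypothetical loan summaries sharing a common layout.
documents = [
    "Borrower: Jane Roe\nAmount: $20,000\nTerm: five years",
    "Borrower: John Doe\nAmount: $35,000\nTerm: three years",
]

# Patterns standing in for the fields a user would highlight.
FIELDS = {
    "borrower": re.compile(r"Borrower: (.+)"),
    "amount": re.compile(r"Amount: (\$[\d,]+)"),
    "term": re.compile(r"Term: (.+)"),
}

def extract(doc: str) -> dict:
    """Pull each highlighted field out of one document."""
    return {name: pat.search(doc).group(1) for name, pat in FIELDS.items()}

# Collect the corresponding values from every document into one table.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(FIELDS))
writer.writeheader()
for doc in documents:
    writer.writerow(extract(doc))
print(buf.getvalue())
```

The real system generalizes from examples instead of relying on fixed patterns, which is what lets it survive documents that don’t share an identical layout; the output, though, is the same ordered table of names, amounts and dates.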

All this data is meant to be portable too, of course — there are integrations planned with various other common pipes and services in business, allowing for automatic reports, alerts if certain conditions are reached, automated creation of templates and standard documents (no more keeping an old one around with underscores where the principals go).

Remember, this is all half an hour after you uploaded them in the first place, no labeling or pre-processing or cleaning required. And the AI isn’t working from some preconceived notion or format of what a lease document looks like. It’s learned all it needs to know from the actual docs you uploaded — how they’re structured, where things like names and dates figure relative to one another, and so on. And it works across verticals and uses an interface anyone can figure out in a few minutes. Whether you’re in healthcare data entry or construction contract management, the tool should make sense.

The web interface where you ingest and create new documents is one of the main tools, while the other lives inside Word. There Docugami acts as a sort of assistant that’s fully aware of every other document of whatever type you’re in, so you can create new ones, fill in standard information, comply with regulations and so on.

Okay, so processing legal documents isn’t exactly the most exciting application of machine learning in the world. But I wouldn’t be writing this (at all, let alone at this length) if I didn’t think this was a big deal. This sort of deep understanding of document types can be found here and there among established industries with standard document types (such as police or medical reports), but have fun waiting until someone trains a bespoke model for your kayak rental service. But small businesses have just as much value locked up in documents as large enterprises — and they can’t afford to hire a team of data scientists. And even the big organizations can’t do it all manually.

NASA’s treasure trove

Image Credits: NASA

The problem is extremely difficult, yet to humans seems almost trivial. You or I could glance through 20 similar documents and pull out a list of names and amounts easily, perhaps even in less time than it takes for Docugami to crawl them and train itself.

But AI, after all, is meant to imitate and transcend human capacity, and it’s one thing for an account manager to do monthly reports on 20 contracts — quite another to do a daily report on a thousand. Yet Docugami accomplishes the latter as easily as the former — which is where it fits into both the enterprise system, where scaling this kind of operation is crucial, and NASA, which is buried under a backlog of documentation from which it hopes to glean clean data and insights.

If there’s one thing NASA’s got a lot of, it’s documents. Its reasonably well-maintained archives go back to its founding, and many important ones are available by various means — I’ve spent many a pleasant hour perusing its cache of historical documents.

But NASA isn’t looking for new insights into Apollo 11. Through its many past and present programs, solicitations, grant programs, budgets, and of course engineering projects, it generates a huge amount of documents — being, after all, very much a part of the federal bureaucracy. And as with any large organization with its paperwork spread over decades, NASA’s document stash represents untapped potential.

Expert opinions, research precursors, engineering solutions, and a dozen more categories of important information are sitting in files searchable perhaps by basic word matching but otherwise unstructured. Wouldn’t it be nice for someone at JPL to get it in their head to look at the evolution of nozzle design, and within a few minutes have a complete and current list of documents on that topic, organized by type, date, author and status? What about the patent advisor who needs to provide a NIAC grant recipient information on prior art — shouldn’t they be able to pull those old patents and applications up with more specificity than a keyword search allows?

The NASA SBIR grant, awarded last summer, isn’t for any specific work, like collecting all the documents of such and such a type from Johnson Space Center or something. It’s an exploratory or investigative agreement, as many of these grants are, and Docugami is working with NASA scientists on the best ways to apply the technology to their archives. (One of the best applications may be to the SBIR and other small business funding programs themselves.)

Another SBIR grant with the NSF differs in that, while at NASA the team is looking into better organizing tons of disparate types of documents with some overlapping information, at NSF they’re aiming to better identify “small data.” “We are looking at the tiny things, the tiny details,” said Paoli. “For instance, if you have a name, is it the lender or the borrower? The doctor or the patient name? When you read a patient record, penicillin is mentioned, is it prescribed or prohibited? If there’s a section called allergies and another called prescriptions, we can make that connection.”

“Maybe it’s because I’m French”

When I pointed out the rather small budgets involved with SBIR grants and how his company couldn’t possibly survive on these, he laughed.

“Oh, we’re not running on grants! This isn’t our business. For me, this is a way to work with scientists, with the best labs in the world,” he said, while noting many more grant projects were in the offing. “Science for me is a fuel. The business model is very simple — a service that you subscribe to, like Docusign or Dropbox.”

The company is only just now beginning its real business operations, having made a few connections with integration partners and testers. But over the next year it will expand its private beta and eventually open it up — though there’s no timeline on that just yet.

“We’re very young. A year ago we were like five, six people, now we went and got this $10 million seed round and boom,” said Paoli. But he’s certain that this is a business that will be not just lucrative but will represent an important change in how companies work.

“People love documents. Maybe it’s because I’m French,” he said, “but I think text and books and writing are critical — that’s just how humans work. We really think people can help machines think better, and machines can help people think better.”