How to Extract Numbers From a String in Python

Including Decimals, Negatives, Numbers With Commas, and European Format.

This post covers how to extract numbers from text in Python. I have also written about doing it in JavaScript.

You can use the extract-numbers library from PyPI to extract numbers from text in your Python code.

For JavaScript code, you can use extract-numbers from npm.

Valid Numbers

First, let’s define what could be considered a valid number in a text:

  • Decimal: such as a price.

  • Negative: e.g. temperature, stock market indexes.

  • Comma-separated: usually found in large values like bank statements, invoices, or astronomical data.

  • European format: In many European countries, the dot is the thousands separator and the comma is the decimal separator, the reverse of the US convention. For example, 100,000,000.97 is written as 100.000.000,97.

The Regexes

In my JavaScript library, extract-numbers, I used a much looser regex:

text.match(/(-\d+|\d+)(,\d+)*(\.\d+)*/g)

It matches negatives, decimals, and comma-separated numbers, and it captures a number even if its commas are not placed in the standard positions.
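To see what "loose" means in practice, here is a minimal sketch that ports the JavaScript pattern to Python. The name LOOSE_REGEX is my own label, not something from either library. The groups are made non-capturing because Python's re.findall returns tuples of groups, rather than whole matches, when a pattern contains capturing groups.

```python
import re

# Hypothetical name: a Python port of the loose JavaScript pattern.
# Groups are non-capturing so re.findall returns whole matches.
LOOSE_REGEX = r'(?:-\d+|\d+)(?:,\d+)*(?:\.\d+)*'

# The commas in "1,23,4567" are not in standard positions,
# but the loose pattern still captures it as a single number.
loose_matches = re.findall(LOOSE_REGEX, "Got 1,23,4567 and -7.5")
print(loose_matches)  # ['1,23,4567', '-7.5']
```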

For Python, though, I have used a much stricter regex that treats commas as part of a number only when they separate groups of exactly three digits:

r'-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?'

This regex handles negatives, decimals, comma-separated numbers, and any combination of them. It doesn't work with the European format, though. Instead of making this regex more complex, we can handle that format with a separate regex:

r'-?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?'

It’s the same regex, with dot and comma swapped.
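A quick way to check both patterns is to run them through re.findall. The sketch below uses sample sentences of my own. It also shows what happens when the US pattern is applied to European-formatted text, which is why the two are kept separate.

```python
import re

US_REGEX = r'-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?'
EU_REGEX = r'-?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?'

# Sample inputs made up for illustration.
us_text = "Paid -1,234.50 for 3 items, total 1000000.25"
eu_text = "Total: 100.000.000,97 EUR at -5,5 degrees"

us_matches = re.findall(US_REGEX, us_text)
print(us_matches)  # ['-1,234.50', '3', '1000000.25']

eu_matches = re.findall(EU_REGEX, eu_text)
print(eu_matches)  # ['100.000.000,97', '-5,5']

# The US pattern splits European numbers into fragments.
mismatch = re.findall(US_REGEX, eu_text)
print(mismatch)  # ['100.000', '000', '97', '-5', '5']
```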

The Code

Here’s the code with options, as published on PyPI.

from typing import Optional, TypedDict, List, Union
import re

class Options(TypedDict, total=False):
    as_string: bool
    remove_commas: bool
    european_format: bool

Number = Union[str, int, float]

EU_REGEX = r'-?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?'
US_REGEX = r'-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?'

class ExtractNumbers:
    def __init__(self, options: Optional[Options] = None) -> None:
        self.options: Options = {
            "as_string": True,
            "remove_commas": False,
            "european_format": False
        }

        if isinstance(options, dict):
            for key, value in options.items():
                if key not in self.options:
                    raise ValueError(f"Invalid option '{key}'. Expected one of {list(self.options.keys())}.")
                if not isinstance(value, bool):
                    raise TypeError(f"Option '{key}' must be a boolean.")
            self.options.update(options)

    def _sanitize_number(self, number: str) -> Number:
        if self.options["european_format"]:
            number = number.replace(".", "").replace(",", ".")
        else:
            number = number.replace(",", "")
        return float(number) if "." in number else int(number)

    def extractNumbers(self, text: str) -> List[Number]:
        if not isinstance(text, str):
            raise ValueError(f"Invalid argument: Expected 'text' to be of type str, but got {type(text).__name__}.")

        as_string = self.options.get("as_string", False)
        remove_commas = self.options.get("remove_commas", False)
        numbers = re.findall(US_REGEX, text)
        comma_type = ","

        if self.options.get("european_format"):
            numbers = re.findall(EU_REGEX, text)
            comma_type = "."

        if as_string and remove_commas:
            return [number.replace(comma_type, "") for number in numbers]

        if not as_string:
            return [self._sanitize_number(n) for n in numbers]

        return numbers

Example Usage:

extractor = ExtractNumbers({}) # add options if needed

extractor.extractNumbers("20 results out of 100,000")
# => ["20", "100,000"]

Here are some notes about the code:

  • ExtractNumbers: Why make it a class? A pure function could do the same job, but a class stores the options once at instance creation and makes it easy to keep multiple instances with different settings.

  • Regex patterns: Two regexes, EU_REGEX and US_REGEX, are defined, and the european_format option selects between them.

  • All options are optional and we set their defaults in __init__.

  • Unknown option keys raise ValueError and non-boolean option values raise TypeError, so invalid options are rejected when the instance is created.

  • In JavaScript, we could use Number to parse either integer or float values, but in Python, we need to distinguish between floats and integers and parse them accordingly. We do that in the _sanitize_number private helper along with European format processing.

  • Source code of the libraries: Python, JavaScript
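For comparison, the pure-function alternative mentioned in the first note could look roughly like the sketch below. The function extract_numbers and its keyword arguments are hypothetical and are not part of the published library. They mirror the three options and the same int/float distinction, under the assumption that the behaviour should match the class.

```python
import re
from typing import List, Union

Number = Union[str, int, float]

US_REGEX = r'-?(?:\d{1,3}(?:,\d{3})+|\d+)(?:\.\d+)?'
EU_REGEX = r'-?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?'


def extract_numbers(
    text: str,
    as_string: bool = True,
    remove_commas: bool = False,
    european_format: bool = False,
) -> List[Number]:
    """Hypothetical functional counterpart of the class, not the library API."""
    pattern = EU_REGEX if european_format else US_REGEX
    # Pick the separators that belong to the chosen format.
    thousands, decimal = (".", ",") if european_format else (",", ".")
    matches = re.findall(pattern, text)

    if not as_string:
        parsed: List[Number] = []
        for match in matches:
            # Drop thousands separators, then normalize the decimal mark to ".".
            normalized = match.replace(thousands, "").replace(decimal, ".")
            # Python has no single Number(): choose float or int explicitly.
            parsed.append(float(normalized) if "." in normalized else int(normalized))
        return parsed

    if remove_commas:
        return [match.replace(thousands, "") for match in matches]
    return matches


print(extract_numbers("20 results out of 100,000"))
# ['20', '100,000']
print(extract_numbers("20 results out of 100,000", as_string=False))
# [20, 100000]
print(extract_numbers("Balance: -1.234,56", as_string=False, european_format=True))
# [-1234.56]
```

The trade-off is that every call has to repeat the options, which is the convenience the class version provides by storing them once.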



